Ethical Risks of Exclusionary, Biased, and Erroneous Financial Data
Machine Learning in Finance – Cautionary Tales I
A Post on Finance and Society, Quantitative Finance, and Financial Economics
Ethical risks in financial data, as they pertain to machine learning, fall mostly into three categories: (1) the exclusion of certain sub-groups from the training dataset, (2) bias in the training data, and, more generally, (3) erroneous training data.
Any recent change in the true underlying state of a business process that is not incorporated into a model’s training dataset will lead to biased predictions, owing to a fundamental misunderstanding of the underlying process or condition. A group that has always been excluded will never be included within a machine learning feedback loop, because the model is simply unaware of the group’s existence. As an example, if the women's suffrage movement and women’s increasing role in the global labour force had occurred in the age of AI, a machine learning model might still be reluctant to lend money to white-collar female professionals, even after they earn a comparable level of money to their male counterparts; the model has simply never seen a past example of a labelled lending occurrence.
The model preserves old stereotypes because it has never been made aware of these entities and individuals. The model itself is not pernicious, but simply perpetuates the discrimination of a bygone era. A venture capital investment model that has never seen a South African-Canadian founder under the age of thirty might wrongfully withhold from investing in Elon Musk. A further example would be a wealth management recommender that is unaware of the progressively rising retirement age and for that reason misallocates an investor’s personal funds. This exclusionary limitation has vast implications for all financial modelling tasks. That does not mean there are no solutions. Modellers can seek to develop synthetic data for unrepresented groups, actions, and activities. They can also change the cost function or adjust the model’s parameters to penalise certain predictions or inject randomness into the prediction process.
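One common way to change the cost function along these lines is to reweight training samples so that an underrepresented group contributes as much to the loss as the majority. The sketch below is a minimal, hypothetical helper using inverse-frequency weights; any real implementation would pass these weights to the training routine of the chosen library.

```python
from collections import Counter

def group_weights(groups):
    """Inverse-frequency sample weights (a hypothetical helper):
    with these weights, each group contributes equally to the
    total training loss regardless of its sample count."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# 90 majority-group rows versus 10 minority-group rows.
groups = ["majority"] * 90 + ["minority"] * 10
weights = group_weights(groups)
# Each minority row now carries 9x the weight of a majority row,
# so both groups account for half of the total weight.
```

The same idea can be expressed as a class- or group-weight argument in most learning libraries; the point is that the optimiser is no longer free to ignore the small group in exchange for a tiny gain on the large one.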
Data that has not historically been collected, whether because it did not exist or because of record-keeping limitations, simply would not appear in a machine learning system. For that reason, we would say that machine learning has an ‘exclusionary’ nature to it. When data has in fact been recorded, we must come to grips with three forms of data bias that historical data can exhibit. First, a subgroup included in the training data might not represent the overall demographic profile or proportion of the target market, in either quantity or type, which can lead to positive or negative discriminative effects.
Second, even if the sample is demographically representative, it could simply lack quantity, leading to a model that treats different subpopulations differently. If a specific group of individuals appears more frequently than others in the training data, the model will optimise for those individuals to boost overall accuracy. For instance, models often show higher error rates for Asian and African Americans because they are not represented equally in the datasets. A US dataset would, because of demographic proportions, include only around 1.7% Native Americans; consequently, the model would perform worse for them than for other ethnicities due to this lack of equal representation. This is sometimes referred to as algorithmic bias.
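A first diagnostic for this kind of bias is simply to break accuracy down by subgroup rather than report a single aggregate number. The toy sketch below (the data and function are illustrative) shows how a model can be perfect on a well-represented group while being no better than a coin flip on a small one, with the overall accuracy hiding the gap.

```python
def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy broken down by subgroup: a basic diagnostic
    for representation bias in the training data."""
    acc = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        acc[g] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
    return acc

# Toy labels: the model is perfect on the well-represented
# group A but only right half the time on the small group B.
groups = ["A"] * 8 + ["B"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
acc = per_group_accuracy(y_true, y_pred, groups)
```

Aggregate accuracy here is 90%, which looks respectable until the per-group breakdown reveals who is paying for it.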
Lastly, even if you have included the same quantity of data points for all demographic profiles, historical and current discrimination could lead to feature correlations that suggest unequal outcomes on a ceteris paribus basis. If a minority group shows more defaults, all else equal, in a credit lending database, it could be a cause for concern and have serious knock-on effects.
For example, it has been noted that employees working in STEM fields can obtain a lower interest rate, all else equal. It has also been noted that STEM-based online advertisements are disproportionately targeted at the male demographic. Men receiving a disproportionate share of STEM-based advertisements are therefore more likely to enter a STEM field and obtain a lower interest rate that would enable them to enter the housing market. In some cultures, homeownership is an essential component of building a family and living a comfortable life. The way in which advertisements are presented to a demographic can therefore have large downstream effects. Thinking through these downstream effects is daunting, and exhausting for that matter, which makes these problems harder and harder to address.
There are also non-discriminative errors in data that need to be critically assessed. These errors could, of course, still have implications for discrimination. Errors can be planted in the data using poisoning attacks, or they can be less intentional. Two Twitter attacks come to mind. The first is the 2013 Associated Press (AP) Twitter hack, where the market intermittently lost more than $100 billion on news that the “White House” was under attack and that “Barack Obama is injured”. The other notable event was the 2020 Twitter hack, in which more than 100 high-profile accounts were compromised to promote a bitcoin scheme; the hackers in this scenario were in one sense more sophisticated than the AP hackers, having compromised so many accounts, but largely lacked the financial expertise to have “benefited” from it fully.
Alternative financial data to some extent rests on the public copy of the internet, and because of mutability concerns alone, this data should be treated with caution. Successful investment strategies can be formed around employee review websites like Glassdoor, or business review websites like Yelp. But businesses have an incentive to produce fraudulent reviews to attract customers. Research by academics at Harvard Business School shows that a one-star increase in rating can lead to a 5 to 9 percent increase in revenue. However, research also shows that at least 16% of reviews are fraudulent.
We know this data is useful: MaryJo Fitzgerald, Corporate Affairs Manager at Glassdoor, said as early as 2016 that they “hear from and talk to investors who use the data on our site all the time”. BlackRock, the world’s largest asset manager, has also been reported to use Glassdoor data for investment decisions (Crowe, 2016; Rose, 2016). Making investment decisions on mutable data could lead to more fragile and error-prone financial systems.
Stakeholder incentives for positive, and potentially fraudulent, reviews might soon span further than the operating entity. Yelp data has, for example, shown promise in predicting local economic outlooks and has been shown to be useful for policy analysis. Fraudulent reviews could help curry favour for regional councils if used as part of a nowcasting policy agenda. Moreover, we suspect that the predictability of facility closures and changes in business outlook is already used within algorithmic trading firms, meaning that an adversarial fight for reviewer control has already taken hold, and that strategies could be implemented to inject fraudulent reviews into hedge funds’ favourite social media and review websites like Twitter, Glassdoor, Yelp, Trends, LinkedIn, and Amazon.
For the most part, the public-facing data on these websites can easily be scraped, i.e., downloaded. These websites are understandably frustrated when their valuable public data are used by third parties for pecuniary benefit. We suspect that more anti-scraper techniques will be developed to throw algorithmic traders off. This can be as simple as small value perturbations that are not noticeable to the human eye but throw machines off, e.g., adding or removing decimals, changing the layout of the website, or serving a static image instead of text.
Another method is the homograph attack, also known as script spoofing, where a Latin character is replaced with a similar-looking Cyrillic or Greek character, e.g., the Latin character "a" is replaced with the Cyrillic character "а", or the digits 425 are switched out for the fullwidth symbols ４２５. Many of these companies have been reasonably successful at getting prohibited web scrapers off their websites, but it is a delicate dance: measures aggressive enough to stop scrapers might also trip up Google’s own scraper, the search-engine crawler, and the resulting SEO penalty would be to their disadvantage.
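The trick is easy to demonstrate. The sketch below maps a handful of Latin letters to visually near-identical Cyrillic code points and ASCII digits to their fullwidth forms; the substitution table is a small illustrative sample, not an exhaustive confusables list.

```python
# A few Cyrillic lookalikes for Latin letters, plus fullwidth
# digits for ASCII digits. The output renders almost identically
# to a human reader but defeats naive string matching.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}
FULLWIDTH = {d: chr(0xFF10 + int(d)) for d in "0123456789"}

def spoof(text):
    """Substitute lookalike characters one-for-one, preserving
    length and visual appearance."""
    return "".join(FULLWIDTH.get(ch, HOMOGLYPHS.get(ch, ch))
                   for ch in text)

spoofed = spoof("price 425")
```

A scraper comparing `spoofed` against the literal string `"price 425"` (or parsing `425` as a number) will fail, while a human reading the page notices nothing.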
Realising the value of their data, some platforms have started to form profitable partnerships by curating and selling the data, either in bulk or through an API. For example, even though all of Glassdoor’s data is public, they have “…APIs that are not provided publicly…”. This is nothing new for more established companies like Visa and others, which have long-term partnerships with hedge funds.
So, what about scenarios where the error is not introduced on purpose but by mistake? It is well known that data has been driving much of credit scoring and decision making in recent years. This is in spite of the fact that, already in 2004, the National Association of State Public Interest Research Groups (SPIRG) found that a whopping 80% of credit reports contained errors. Moreover, 25% of the reports contained errors so significant that they would lead to a denial of credit.
Historical record keeping has been notoriously bad in many sectors, the simple reason being that until recently we never really came to appreciate the value of accurate data. Digging further into the SPIRG report, we see that 54% of applicants had inaccuracies around personal information, 30% around account status, and 8% around the existence of credit accounts. Seen from this perspective, it is no wonder that there is a call for transparency at a time when practising professionals enthusiastically proclaim that “all data is credit data”.
This is nothing new. Historically, credit bureaus were the ultimate black-box organisations. They assessed creditworthiness by collecting data without disclosing what exactly they were assessing. The records were poorly kept and included attributes measuring “poorly kept yards” and “effeminate gestures”, as suggested by a few circulating reports in the 1960s. This drew public attention and led the US Congress to pass the Fair Credit Reporting Act, which imposes some strict requirements on data accuracy and relevancy. However, from a modern vantage point, it does not seem like 50 years has been enough to get unbiased, accurate data.
The problem with historical data errors is that they echo far into the future, because decision outputs inevitably become future model inputs. If an applicant’s home address is somehow entered as St. Charles Place as opposed to Park Place, they might not be extended a loan due to negative neighbourhood connotations. Even if the mistake is later realised and the address corrected to reflect reality, the label would in all likelihood still say ‘denied’. That hiccup of slow or erroneous data propagation can then lead to an inadequate model denying all Park Place residents’ loan applications; this could be reinforced by future loans if the problem is not identified and intercepted by a human agent. Park Place has now become a tainted address in perpetuity.
This is just a small example of one of the numerous ways in which data mistakes can undermine a machine learning system used in production. The errors can become so entrenched that they are impossible to correct: ancillary features such as location and residential attributes would have interacted with the error, causing errors even for applicants who reside in neither Park Place nor St. Charles Place. What is even worse, out of privacy concerns consumers might alter their online social media profiles on purpose, so it need not even be a mistake so much as a record altered for honourable reasons, but with ultimately bad consequences.
The challenge is that even though there are regulations to promote the accuracy and timeliness of data, such as the Fair Credit Reporting Act (FCRA) in the US, that does not mean the data is subjected to strenuous audits. Clients should be made aware of what information is collected on them and used as part of the decision-making process, as this gives them time to alert the modeller to necessary changes. Clients should also be able to ask for an updated assessment based on authentic changes.
It is essential to follow standard regulatory practice to ensure that data is complete, accurate, and timely. Data hygiene matters not just at the data collection stage, but also at the processing stage. Data representation engineering can introduce large changes in the underlying data that do not accurately reflect the true underlying circumstances. Data processing validation should therefore form part of the software performance testing pipeline.
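A validation step in such a pipeline can be very simple: check each batch of records for completeness, plausible ranges, and timestamps before it reaches the model. The sketch below is a toy fail-fast check; the field names (`income`, `as_of`) are hypothetical.

```python
def validate_rows(rows):
    """Toy fail-fast validation for a data pipeline: flag
    incomplete, out-of-range, or undated records before they
    reach the model. Field names are hypothetical."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("income") is None:
            errors.append((i, "missing income"))
        elif row["income"] < 0:
            errors.append((i, "negative income"))
        if row.get("as_of") is None:
            errors.append((i, "missing timestamp"))
    return errors

bad_batch = [
    {"income": 52000, "as_of": "2021-01-31"},
    {"income": -10, "as_of": "2021-01-31"},
    {"income": None, "as_of": None},
]
problems = validate_rows(bad_batch)
```

In practice a pipeline would reject or quarantine the offending rows and alert a human, rather than letting silent errors flow into training data.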
A postprocessing error can be as simple as a misaligned target variable, where the label appears on the same row as the features observed at that time step. As an example, you want to predict tomorrow’s stock returns, but you forget to shift the return column one step ahead, so you are in fact predicting today’s return as opposed to tomorrow’s, which is of no use to anyone. This is sometimes referred to as an off-by-one error.
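The fix is a one-step shift of the return column; the sketch below shows both the mistaken and the corrected alignment on a toy return series.

```python
# Daily returns; features observed on day t should predict the
# return realised on day t+1.
returns = [0.01, -0.02, 0.03, 0.00]

# Wrong: the label for day t is day t's own return,
# i.e. the off-by-one error described above.
wrong_target = list(returns)

# Right: shift the return column one step ahead. The last day
# is dropped because its next-day return is not yet known.
right_target = returns[1:]
usable_days = len(returns) - 1
```

The same shift is a one-liner in most dataframe libraries; the important part is remembering to drop the final row whose label lies in the future.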
A closely related problem arises when new features are generated using the label column, for example, a moving average of returns that spans multiple rows. This contaminates the feature set, which now derives some of its value from the label column. The errors above can be spotted via too-good-to-be-true performance, or by looking for outsized feature importance values.
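The contamination is easiest to see side by side. In the sketch below (toy data), the leaky moving average includes the current row's return, i.e. the very label it is supposed to help predict, while the safe version averages only strictly past values.

```python
def trailing_mean(xs, window, i):
    """Mean of up to `window` values strictly BEFORE index i,
    so the current label never leaks into its own feature row."""
    past = xs[max(0, i - window):i]
    return sum(past) / len(past) if past else None

returns = [0.01, -0.02, 0.03, 0.00]

# Contaminated: the window includes returns[i], the label itself.
leaky = [sum(returns[max(0, i - 1):i + 1])
         / len(returns[max(0, i - 1):i + 1])
         for i in range(len(returns))]

# Safe: only past values enter the feature.
safe = [trailing_mean(returns, 2, i) for i in range(len(returns))]
```

Note that the first leaky feature value equals the first label outright, which is exactly the kind of relationship that produces too-good-to-be-true backtests.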
Another postprocessing error occurs when you back-fill data. If you fill outlier or null values with the next true future value, then you are leaking information from the future into the past. Backfilling can also be more sinister, in that it need not be the result of any postprocessing on your part; it can simply be a matter of how the dataset was constructed by the data provider. It is well known that some data providers backward-adjust financial values as a result of future reporting adjustments; the problem is that this data would not have been available to an investor at the time, risking the development of a biased algorithm. This should not be allowed, because no data point should be able to ‘time travel’ from the future.
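The contrast between forward-filling and back-filling makes the leak concrete. In the sketch below (toy price series), forward-filling only lets information flow from past to future, while back-filling copies a future value into an earlier row.

```python
def forward_fill(xs):
    """Fill gaps with the LAST known value: information only
    flows from the past forward."""
    out, last = [], None
    for x in xs:
        if x is not None:
            last = x
        out.append(last)
    return out

def back_fill(xs):
    """Fill gaps with the NEXT known value: future information
    leaks into the past. Shown only to illustrate the error."""
    return forward_fill(xs[::-1])[::-1]

prices = [100.0, None, 103.0]
# forward_fill(prices) keeps the missing day at 100.0 (safe);
# back_fill(prices) stamps it with 103.0, a value from the future.
```

For time-series features, forward-filling (or leaving the gap and handling it explicitly) is the only option that respects the information available at each point in time.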
To some extent, machine learning models are popular exactly because they perform so well even with erroneous data. Unlike statistics, where you seek to recover the underlying truth from which the data is generated, machine learning is largely agnostic to the underlying truth and instead marches to the tune of ‘more data is better’. This shift from small-data to big-data methods causes a rift in incentives that might have to be smoothed out with policies such as data audit requirements. Data validation forms an important part of model validation and should be taken seriously.
The methods above address only a small sliver of the data issues that one frequently has to deal with; moreover, we have only addressed erroneous data problems, whereas solutions to exclusionary and biased data problems have so far been left aside. None of these problems is easy to solve, so we should not expect them to disappear anytime soon, nor should we expect quick solutions to appear. There is as yet no convergence on what in particular can be done to solve each of them. For one, domain knowledge is essential to know which groups have historically been excluded from the dataset. That knowledge by itself is not enough; the next question is how to find data for these previously excluded individuals, if it is available at all. If not, one needs to know how to develop synthetic data or how to find a valid proxy group.
Once you have ticked that box, you have to consider whether the subgroups, as represented, could be biased. In fact, you can assume as a default that there will always be bias in a dataset, the reason being that you can always cut finer and finer intersectional groups to expose it. Three families of methods have come out of fairness research: preprocessing, algorithmic, and post-processing solutions. The easiest to use is post-processing, but it has its own set of problems. Erroneous data can hide from even the best domain experts, so there is a need to perform extensive validation tests, as well as to speak to industry experts and working groups and to consult the latest academic literature. This is just a teaser of some thoughts; more will be said on this topic in the future.
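To give a flavour of the post-processing family: one simple (and contested) approach is to pick a separate score threshold per group so that acceptance rates are equalised after the model has scored everyone. The sketch below is illustrative only; the function name and interface are not from any particular fairness library, and equalising acceptance rates is just one of several competing fairness criteria.

```python
def per_group_thresholds(scores, groups, rate):
    """Post-processing sketch: choose a score cut-off per group
    so each group is accepted at (roughly) the same rate.
    Illustrative only; not from a fairness library."""
    thresholds = {}
    for g in set(groups):
        g_scores = sorted(s for s, gg in zip(scores, groups) if gg == g)
        cut = int(len(g_scores) * (1 - rate))
        thresholds[g] = g_scores[min(cut, len(g_scores) - 1)]
    return thresholds

# Group A scores systematically higher than group B, so one
# global cut-off at, say, 0.55 would accept every A and no B.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
th = per_group_thresholds(scores, groups, 0.5)
# With per-group thresholds, both groups see a 50% acceptance rate.
```

The well-known problem with this family is that it treats the symptom, not the cause: the underlying scores remain biased, and group-specific thresholds can themselves be legally and ethically contentious.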
 MIT has published a nice review on this subject: https://perma.cc/HGS3-VNQM
 Test it out here: https://www.irongeek.com/homoglyph-attack-generator.php
 Glassdoor API documentation - https://conifer.rhizome.org/snowde/the-finance-parlour/20201104094900/https://www.glassdoor.co.uk/developer/index.htm