Structural Limitations of Machine Learning
Machine Learning in Finance – Cautionary Tales II
A Post on Finance and Society, Quantitative Finance, and Financial Economics
Structural Limitations for Finance
There is a global movement in socio-technological disciplines to give more agency to intelligent self-operating systems, not just for cost advantages, but also for improvements in the quality of decisions made. This movement is riding on the back of prolonged and sustained successes achieved by a few trailblazing companies: Google for information retrieval, Amazon for computational resources, Netflix for entertainment, and Renaissance Technologies for quantitative investment. Machine learning nevertheless has some deep structural limitations that are worth surveying, and we will do this with financial applications in mind. We will look at model awareness, data availability, atheoretical decision making, anomalies and corner cases, and changing distributions.
Machine learning predictions are not guided by a cultural or moral framework. Machines make reductive assumptions that give the appearance of good performance within each applied abstraction. A model with a hard objective function to eliminate any possibility of credit default when selecting customers could exclude an entire ethnicity from the selection process due to a single incidence of bankruptcy in a small training set. A model with a loose reward function could likewise lead you down a reward-hacking rabbit hole. Here is a favourite fearmongering example that makes the wickedness more explicit: a machine prompted to eradicate cancer could, as a shortcut, eliminate humans altogether. For algorithmic finance, this accentuates the importance of using machine learning as a tool, as opposed to end-to-end agents with unlimited freedom, especially if unleashed into the wild without first testing a complete set of exploration mistakes in some calibrated environment, like an agent-based simulator.
At first blush, machine learning tools allow one to replace human agents with computer agents; for example, machine agents can listen in on multiple earnings calls at the same time and update the probability of certain events occurring, like mergers, acquisitions, bankruptcies, and earnings surprises. The predictions still have to be aided by context, culture, sentiment, and jargon, for which little to no data exists, and where data does exist, too little is available for any good predictive power. These models would have to understand topics pertaining to the deal negotiations; they would need to understand certain legal and regulatory requisites and intuit the conversations that are happening behind closed doors. Many of these effects are hard to measure and almost “extrastatistical”. These contextual issues are not captured and for that reason do not appear in the loss landscape of the prediction model. This difficulty is even empirically backed up by researchers who are in the business of selling you their product: S&P Global researchers show that by analysing these calls you can predict tomorrow's return direction with 52.45 percent accuracy. Ignoring potential modelling and statistical errors, that is a mere 2.45 percentage points above random. Analysts with deep expertise are not going to be relegated to janitorial services anytime soon. If machines replace analysts, they will only use what is measurable to the exclusion of everything else; this has historically been very problematic in other domains.
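To put that number in perspective, here is a minimal sanity check one could run, with hypothetical sample sizes (the study's actual number of predictions is not reproduced here): how large would the test set need to be before a 52.45 percent hit rate is even statistically distinguishable from a coin flip?

```python
# Sanity check with assumed (illustrative) sample sizes, not figures from the
# S&P Global study: is a 52.45% hit rate distinguishable from a 50% coin flip?
from scipy.stats import binomtest

hit_rate = 0.5245
for n_days in (250, 1000, 5000):          # roughly 1, 4, and 20 years of trading days
    k = round(hit_rate * n_days)          # number of correct direction calls
    p = binomtest(k, n_days, p=0.5, alternative="greater").pvalue
    print(f"n={n_days:>5}  correct={k:>4}  p-value vs. coin flip = {p:.3f}")
```

Even before asking whether a 2.45-point edge survives trading costs, a year of daily calls is nowhere near enough to rule out luck.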
There will always be elements of information that cannot be incorporated into a machine learning model, and for that reason, a machine learning prediction should, in some respects, only be one component in a system of Bayesian decision-making. In so many domains, data must be understood within a holistic framework of one-off events; e.g., when Minneapolis Fed President Kashkari warns about a ‘greater catastrophe’, one must understand the context and intermeshed layers of incentives that are not always readily programmable. There is a fundamental limitation in that we are forced to present the world as a data matrix of mostly independent rows — and yet we know the world is more complex than that, especially viewed through the prism of financial graph networks that illustrate more interconnectedness than previously imagined.
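As a rough, made-up illustration of what treating the model as one component of a Bayesian decision could look like, the sketch below folds a hypothetical model signal and a hypothetical contextual judgement into a prior via the odds form of Bayes' rule:

```python
# Minimal sketch: a model score is one likelihood ratio among several, not the decision itself.
# All numbers below are invented for illustration.
def posterior_prob(prior_prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

prior = 0.05        # base rate of, say, an acquisition announcement this quarter
lr_model = 3.0      # how much more likely the model's signal is if a deal is coming
lr_context = 0.5    # an analyst's contextual read (regulatory climate, closed-door talks)
print(f"posterior probability: {posterior_prob(prior, lr_model * lr_context):.3f}")  # ~0.07
```

The point is not the arithmetic but the framing: the model contributes evidence, and the hard-to-measure context can legitimately pull the final probability the other way.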
Also, because the machine learning model has yet to learn from the vast array of mistakes humans have made in finance, it is immediately at a severe disadvantage; the following might highlight the instability. There are reasons intelligent agents cannot be left to their own devices. For example, suppose an automated agent uncovered a martingale strategy by itself, where, after every loss, more volume is added to the trade until the losses are regained. The agent has never failed with the strategy and hence undertakes it each time, not knowing that under some unforeseen circumstances, like in a downturn, it could be constantly hit with unfavourable trades regardless of the amount of capital it has access to. The very first thing a machine learning model will try to do is minimise the error function; when it is running a reinforcement learning strategy, it can continue for years, but it will never be primed for anomalies. For that reason, it is better to place the model in a simulated environment where it can face off against these anomalies. Lastly, all machine learning solutions are bound by history: if you optimise your generative model on historical text from movie scripts or lyrics, it might set a limit on innovation, leading to cultural staleness. In a similar vein, you will never see a trading model developed on theoretical, never-before-seen future eventualities.
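A toy simulation (illustrative parameters, not a backtest of any real strategy) makes the martingale point concrete: the strategy wins small amounts for long stretches and then hits a losing streak that the available capital cannot cover.

```python
# Toy martingale ("double the stake after every loss") with a slight house edge.
import random

def run_martingale(capital=1_000, base_stake=1, win_prob=0.49, n_rounds=10_000, seed=0):
    random.seed(seed)
    stake = base_stake
    for t in range(n_rounds):
        if stake > capital:            # cannot fund the next doubled trade: ruin
            return t
        if random.random() < win_prob: # win: losses recovered plus the base stake
            capital += stake
            stake = base_stake
        else:                          # loss: double the next stake
            capital -= stake
            stake *= 2
    return None                        # survived the whole sample

blow_ups = [run_martingale(seed=s) for s in range(100)]
print(f"{sum(b is not None for b in blow_ups)} of 100 simulated paths went bust")
```

An agent trained only on the surviving stretches of such a strategy would see nothing but success.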
There could also be hidden expectations to use machine learning models when they would not be appropriate, for example, when not enough records are available. You might enter muddy waters if, for example, you try to predict defaults on sovereign debt. There simply is not enough data; this has been the downfall of many a machine learning project. One would be hard-pressed to develop a prediction model of substance when we only have the Argentinian, Russian, and Pakistani defaults as modern-day events. This data would not be enough to accurately predict whether South Africa will default in the next 5 years; other quantitative and qualitative methods would have to be used to substantiate that claim. This has not stopped researchers from trying, and the studies all seem to frame it as a win for gradient boosting and neural network machine learning when they barely outperform vanilla logistic regressions on out-of-sample data.
It should be noted that no “amount of complex mathematical/statistical analysis can possibly squeeze more information from a data set than it contains initially”. Another example would be the use of social media data to assess credit scores; this sort of data has not been put through the test of time, and it is very likely that we would see a large mismatch in prediction quality in the next regime shift. Notwithstanding the 2008 financial crisis and the Covid-19 pandemic, financial instability is rare, and for that reason, the variables that are useful in predicting losses are not useful in normal times. A recent example is the airline industry moving away from their algorithmic pricing models back to the standard microeconomic ones used in a bygone era.
In another example, you might have a big data problem, like lending and credit, but you have just started out and therefore do not have a database of delinquent and successful loans. In this case, you have to build up a labelled customer database, either by first using traditional credit models and recording the delinquencies or by relying on third-party datasets for a ‘warm start’. You can also find yourself in the vicinity of a big data domain and be egged on to use it by upper management's data-first approach. For example, equities have a lot of exchange data, but the same cannot be said for fixed income; accurate and granular data collection is more difficult because fixed income generally trades OTC without centralised registries or databases. The fact that your company copies Google's data-first approach does not make your machine learning models automatically more useful.
Similarly, you might be within the same area but find very different levels of predictability depending on the task. For example, in trade execution, it is true that liquidity and trading costs are very forecastable using volatility, spread, and trading volume as inputs. This stands in stark contrast to the predictability of returns, which act more like some strange quantum phenomenon: returns are dynamic and prone to disappear as soon as you look for them, whereas liquidity and trading costs are driven by more fundamental diffusion processes that cannot be arbitraged away and as a result remain largely forecastable.
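A stylised sketch on purely synthetic data illustrates the contrast: a target that is a noisy function of observable state (a stand-in for trading cost) is easy to fit out of sample, while a target with no exploitable structure (a stand-in for next-day returns) is not.

```python
# Synthetic illustration only: forecastable "trading cost" vs. unforecastable "return".
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2_000
vol, spread, volume = rng.lognormal(size=(3, n))
X = np.column_stack([vol, spread, np.log(volume)])

cost = 0.4 * vol + 0.5 * spread - 0.1 * np.log(volume) + rng.normal(0, 0.2, n)
ret = rng.normal(0, 1, n)              # no relationship to the inputs at all

for name, y in [("trading cost", cost), ("next-day return", ret)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"{name:>16}: out-of-sample R^2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```

Real returns are not literally pure noise, but the gap in out-of-sample explanatory power between the two tasks has the same flavour.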
A further concern is that embedded within machine learning is the need to carve out patterns by simply paying attention to correlations. In some respects, regulations, law, and constructs like financial contracts, securities, derivatives, and exchanges can be reimagined as atoms connected to form molecules of purpose, like companies and institutions. And as is true in physics, as much as life is built upon these devices, it is truly made from the cycles of cause and effect that loop between these constructs. These interactions reveal themselves as correlations and connections, evidence of the dynamism of life; however, the correlations are not the essence of the process.
The universe of finance is as much an agglomeration of events between these contracts, set in motion since the dawn of trade, as it is a function of the existence of these institutions and constructs. It is hard to believe that intelligent decision-making agents would be able to pierce the veil of isolated institutions by somehow being made aware of the financial system in all its complexity through mere correlations. In fact, we do not yet know how to create a realistic reinforcement learning environment that could simulate the constraints and parameters that would allow us to train such an agent. A real-life financial ecosystem involves orders of magnitude more parameters than any Atari game, and, worse, it is an adaptive ecosystem that does not automatically benefit from an additional billion rounds of in silico trials that would ordinarily improve the performance of, say, Space Invaders or Pac-Man.
In large part, a machine learning model in isolation is atheoretical and data-centric. To some extent, domain expertise is used as a first filter to decide on the list of features that could help the model find optimal solutions. It is also true that architectures, especially in neural network research, can be guided by domain expertise. However, by poking and prodding at machine learning models, theories can be conjured up and shared with peers. Investment managers have described machine learning as a great enabler of ideas and strategies. This is true because some grey-box models allow investors to identify important non-linear interactions that would previously have gone unnoticed — both as a result of traditional statistical methods not being able to parameterise large feature sets and their failure to model non-linear effects — in a domain that researchers have found to be highly non-linear. It is true that this inferential use of machine learning is still in an early growth stage, but a range of methods, like feature importance, feature interactions, PDPs, ICEs, and ALEs, have been identified that could help one draw enhanced theoretical conclusions, including the importance of threshold effects.
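As a minimal sketch of that grey-box workflow, on synthetic data and using scikit-learn's stock inspection tools, one can fit a non-linear learner and then interrogate it for variable importance and threshold effects:

```python
# Fit a boosted-tree model on synthetic data with a deliberate threshold effect,
# then inspect it with permutation importance and partial dependence plots.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 3))
# Step (threshold) effect in feature 0, linear effect in feature 1, feature 2 unused.
y = (X[:, 0] > 0.5).astype(float) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1_000)

model = GradientBoostingRegressor().fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=1)
print("permutation importances:", imp.importances_mean.round(2))

# Partial dependence on features 0 and 1 (plotting requires matplotlib).
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
```

The unused third feature duly shows negligible importance, and the partial dependence plot for feature 0 recovers the step at 0.5, exactly the kind of threshold effect mentioned above.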
In a similar vein, machine learning is also amoral, and it does not cast value judgements beyond the data it has trained on. The fear is that by matching old data, known to contain biases, with glossy machine learning models, one inadvertently perpetuates certain discriminatory practices into the future. However, if one can establish a good set of data, then the use of machine learning significantly limits the direct effects of human evaluators, who are irrevocably tainted by bias and bounded rationality, both in cognitive ability and in the time at their disposal. In a research example, I have shown how the inclusion of additional data can significantly de-bias analysts' earnings expectations.
The problem with an atheoretical model is that users might be confused about the extent to which the relationships are of causal significance, because the models are effectively the best-known method for finding patterns in correlation networks. As a result, one might be homing in on the wrong problem, leading to the treatment of a symptom as opposed to the eradication of the cause. It is therefore important that any user of these models understands that the predictions are not driven by an unshakable theoretical foundation, especially if you are operating in a highly stochastic and dynamic domain without fixed rules.
To some extent, where existing lower-dimensional pattern-matching models (or humans) are switched out for machine learning models, large benefits can be obtained. For example, in medicine, radiology led some cancer specialists in the 20th century away from traditional, more bodily ways of identifying cancer towards the X-ray; these cancer specialists morphed into radiologists — the quintessential pattern-spotting aficionados. Around the same time, fundamental financial researchers started to realise that they had to pay special attention to the interaction of financial variables to predict bankruptcies, and technical researchers realised that the stock market exhibits recurring patterns.
What these early researchers might not have been aware of is that in the future, not only could machines be used to spot pre-determined patterns, but they could also learn new patterns associated with the same phenomena. In fields where a better pattern matcher replaces an old pattern matcher, a limited number of things can go wrong. The larger concern is where machine learning is used to replace more formal models whose existence is justified by the need for explainability and theoretical discovery. Pattern matching could probably replace fraud detection without much downside, but when it comes to consumer services, rating agencies, or regulatory agencies, more explainable models might be preferred.
Anomalies & Corner Cases
Statistics and machine learning are both driven by the search for an underlying central tendency, be it the mean or median for regression problems or the modal class for classification problems. This central tendency invariably punishes outliers, and it has caused many a failure: for example, the US Air Force of the 1940s modelled the design of its cockpit around the average pilot, leading to a seat that never fit any one pilot quite right, which resulted in multiple deaths until the development of an adjustable seat. Todd Rose in The End of Average would like to remind us that “no one is average” because averaging distorts the relationships between features. As a consequence, it is preferable that any predicted value carry some uncertainty bound; we should immediately assume that there is some Bayesian-like probability distribution of possible values around a point estimate.
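One simple way of attaching such a bound, sketched here on synthetic data, is to fit quantile models alongside the point forecast so that every prediction comes with an interval rather than a single number:

```python
# Synthetic example: a point forecast plus a 90% prediction band from quantile models.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(1_000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2 + 0.1 * X[:, 0])   # noise grows with X

point = GradientBoostingRegressor(loss="squared_error").fit(X, y)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

x_new = [[4.5]]
print(f"point {point.predict(x_new)[0]:+.2f}, "
      f"90% band [{lower.predict(x_new)[0]:+.2f}, {upper.predict(x_new)[0]:+.2f}]")
```

Quantile losses are only one option; bootstrapping, Bayesian models, or conformal prediction serve the same purpose of refusing to report a naked point estimate.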
Good feature engineering could alleviate this problem to some degree by introducing skewness, variation, and some higher-order moments as features. Anomalies and corner cases can also be accommodated reasonably well within the training stage of the model. For example, it is known that decision-tree-type models perform well with outliers; in fact, they largely ignore them. As for neural-network-type models, there is a need to remove outliers so that you can successfully perform the required normalisation procedure. However, if it is the outlying data point itself that must be predicted, this becomes problematic, because the outlier will either be grouped with normal-looking data or simply removed.
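The normalisation issue is easy to demonstrate: a single extreme observation dominates a mean/standard-deviation scaler and squashes the informative variation of the ordinary points, whereas a median/IQR-based scaler leaves them intact. A small sketch:

```python
# One outlier distorts standard scaling far more than robust (median/IQR) scaling.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

x = np.array([[0.9], [1.0], [1.05], [1.1], [1.2], [250.0]])   # one extreme value

print("standard-scaled:", StandardScaler().fit_transform(x).ravel().round(2))
print("robust-scaled:  ", RobustScaler().fit_transform(x).ravel().round(2))
```

Either way, the tension remains: clip or drop the outlier and the model never learns what an extreme case looks like; keep it and the scaling of every other input suffers.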
A lot of innovation is dedicated to the functioning of machine learning models during normal times. Models are improved on clean benchmark datasets, none of which exhibit the distributional shifts that machine learning models are forced to contend with in practice. As a result, the performance of these models is a pure abstraction and bears little resemblance to what one might expect in real-world scenarios. The US supplement to the Basel standards requires stress testing to ensure that banks can remain solvent during severe recessions, which requires a projection of losses using very little data over turbulent times. This is a problem for both traditional and advanced statistical techniques. There is some evidence that machine learning could help, provided it does not overfit on the ‘good times’.
Regime shifts highlight the fact that offline supervised learning does not offer you much robustness, whereas a reinforcement learning model, or even an online supervised learning model, can dynamically recalibrate its strategy within a known set of parameters. There is also an argument that offline supervised learning models can be used, à la 1950s RAND, to perform programmatic ‘scenario planning’ with a simple regime-shift model that can be hard-coded using exogenous data. Machine learning responds badly when the input data differs from what it was trained on, and very few companies have been monitoring their systems to alert them to changing behaviour and distributional shifts. This has become especially clear in recent years. In the airline industry, it was quickly realised that the standard machine learning pricing models that study flight patterns, fuel costs, and user behaviour had become useless during the pandemic, as evidenced by airlines falling back on traditional macroeconomic modelling.
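A minimal sketch of the online alternative, using a crude simulated regime shift: a model updated batch by batch gradually unlearns the old relationship instead of staying frozen at its original fit.

```python
# Online recalibration: the coefficient tracks a relationship that flips sign mid-sample.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(3)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for t in range(200):                        # 200 daily batches of 50 observations
    X = rng.normal(size=(50, 1))
    beta = 1.0 if t < 100 else -1.0         # regime shift halfway through
    y = beta * X[:, 0] + rng.normal(0, 0.1, 50)
    model.partial_fit(X, y)
    if t in (99, 199):
        print(f"batch {t}: estimated coefficient = {model.coef_[0]:+.2f}")
```

An offline model fitted once on the first hundred batches would keep reporting a coefficient near +1 long after the relationship had reversed.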
In some scenarios, even having a reinforcement learning model or a regime-shifting model is not all that helpful. For example, the 2020 pandemic led to a good number of attempts to use machine learning for virus transmission forecasts, and the models were found to perform poorly for several reasons. The first is that these models require data; sure, ex-post predictions do well, as presented by researchers predicting the spread of the 2015 Zika virus, but by the time you have accurate historical data, the virus has already spread. In the period over which the prediction is performed, no real dataset exists, and the exercise should rather be seen as a form of retrodiction that tries to understand the patterns of the past. By the time a well-performing machine learning model reaches the institutions, it is too late to be helpful.
Furthermore, it is not just the existence of data that constrains modellers during turbulent times, but also the type and quality of the data being used and relied upon. The problems with using big data to forecast infectious disease are nothing new; they have been widely acknowledged since the notorious Google Flu forecast failure that employed Google search trends. In that example, when the infection spreads, the media attention around it accumulates and disturbs any meaningful trends that could otherwise be discerned from Google search results. This particular concern would make it hard for any big-data or nowcasting system that relies primarily on social media or other fungible data to function during new and unprecedented times. Some have referred to the pandemic as kryptonite for machine learning models and have alluded to the fact that technology companies would, in the short run, be pulling humans back into the loop.
 In the Vietnam wartime effort, the US Secretary of Defense relied almost exclusively on enemy body counts because they were the easiest thing to measure, the consequences of which are widely known, leading to this being dubbed the “McNamara fallacy”.
 Someone should redo all these studies and pit a logistic regression with all the bells and whistles (discretised features, outlier removal, multicollinearity pruning, normality transforms such as Box-Cox, linearising log transforms, feature scaling, and regularisation) against the machine learning models.
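For what such a benchmark pipeline might look like, here is a rough scikit-learn sketch; the particular steps and hyper-parameters are illustrative choices, not a prescription:

```python
# Illustrative "logistic regression with bells and whistles" baseline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression

baseline = make_pipeline(
    PowerTransformer(method="yeo-johnson"),   # Box-Cox-style transform towards normality
    StandardScaler(),                         # feature scaling
    LogisticRegression(penalty="l2", C=1.0, max_iter=1_000),  # regularised fit
)
# baseline.fit(X_train, y_train); baseline.score(X_test, y_test)
# Discretisation, outlier clipping, and multicollinearity pruning would be added
# as further transformers before comparing against the boosted-tree or neural models.
```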
 An oft-forgotten component of prediction research is that various prediction models can be combined to drive a final prediction that is closer to the objective function you want to maximise. The predicted liquidity and trading cost patterns can be used to optimise the trading system and achieve improved returns. When combining predictions in such a fashion, one could potentially be better off with a reinforcement learning framework that by design considers all these parameters.
 Partial Dependence Plots (PDPs), Individual Conditional Expectation plots (ICEs), and Accumulated Local Effects (ALEs).
 Furthermore, although machine learning models are great at interpolation, most cannot extrapolate well, as they do not learn functional forms that extend beyond the training range and instead partition the feature space, as methods like decision trees do.
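A tiny demonstration of this point, assuming a tree-based learner and a simple linear trend: outside the training range, the tree predicts a constant, while a linear model keeps extending the trend.

```python
# Trees interpolate by partitioning; they cannot extend a trend beyond the training range.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()                 # simple linear trend

tree = DecisionTreeRegressor().fit(X_train, y_train)
line = LinearRegression().fit(X_train, y_train)

X_new = np.array([[15.0], [50.0]])              # far outside the training range
print("tree:  ", tree.predict(X_new))           # flat, stuck near the training maximum
print("linear:", line.predict(X_new))           # keeps extrapolating: ~[30, 100]
```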
 Fortunately, the models used to track the spread of the virus did not rely on machine learning and instead made use of epidemiological SIR models (not that they were perfect).