To Place a Value on Alternative Data
Data Cartels: Artificial Intelligence Information Heists XI
It should be noted that no "amount of complex mathematical/statistical analysis can possibly squeeze more information from a data set than it contains initially." The combination of uncorrelated datasets does, however, have attractive synergistic qualities. Mining a single dataset has its limits, but adding further datasets (or column attributes) to the operation improves predictive quality along an S-curve; the same holds for row-wise additions to the data. Every new dataset should therefore be assessed for its marginal contribution to your chosen measure of predictive quality. In the experience of Etienne Vincent, head of quant research at BNP Paribas Asset Management, alternative data sources are correlated with the core five Fama-French factors "100% of the time", which limits their value to the investment process[1]. What you are really looking for is information that is not already contained in the price history.
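To make that concrete, here is a minimal sketch of the kind of marginal-contribution check I have in mind, on synthetic data: fit the same model with and without the candidate dataset's columns and compare a chosen quality measure. The AUC metric, the logistic regression, and the data here are all placeholder choices of mine, not a prescribed workflow.

```python
# Marginal contribution of a candidate alternative dataset, measured as the
# change in cross-validated AUC when its columns are added to the base features.
# All inputs are synthetic stand-ins for your own data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
base_X = rng.normal(size=(n, 5))   # features you already own (e.g. price-derived)
alt_X = rng.normal(size=(n, 3))    # the candidate alternative dataset
y = (base_X[:, 0] + 0.3 * alt_X[:, 0] + rng.normal(size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
auc_base = cross_val_score(model, base_X, y, cv=5, scoring="roc_auc").mean()
auc_both = cross_val_score(model, np.hstack([base_X, alt_X]), y, cv=5, scoring="roc_auc").mean()

print(f"AUC base only:     {auc_base:.3f}")
print(f"AUC base + alt:    {auc_both:.3f}")
print(f"Marginal AUC lift: {auc_both - auc_base:.3f}")  # pay for the lift, not the dataset
```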
Most alternative data companies fall prey to a form of post-hoc bias: they wait for an event to occur, like the collapse of a dam in Brazil that might affect a local mining operation, and only then study their 'proprietary' data to see whether they could have predicted it. MSCI, as a seller of ESG data, used satellite data to do just that. It's one thing to show you could have predicted something; it's another to predict it in advance and fire a warning shot[2].
Financial data is notoriously dirty, especially in academia, though this is slowly changing. For example, when using IBES earnings estimates from Thomson Reuters, there are at least ten additional cleansing steps, some 30-40 lines of code, needed to get the data into a shape where it is accurate, unbiased, and internally consistent. This is not a new dataset; it has been around for some time. The same can be said, despite improvements, for datasets like Compustat and CRSP, which are today almost half a century old. Most importantly, these issues are invisible to those who don't practice in the field, especially if you are not acquainted with the literature that discusses them.
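To give a flavour of what those cleansing steps look like, here is a sketch in pandas; the column names (`ticker`, `analyst_id`, `fiscal_period`, `estimate`, `announce_date`, `revision_date`) are hypothetical placeholders rather than the actual IBES layout, and only a handful of the typical steps are shown.

```python
# A few illustrative cleansing steps for an analyst-estimate file.
# Column names are hypothetical; the real IBES layout differs.
import pandas as pd

def clean_estimates(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 1. Parse dates and drop rows with unparseable timestamps or missing values.
    for col in ["announce_date", "revision_date"]:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    df = df.dropna(subset=["announce_date", "revision_date", "estimate"])
    # 2. Remove exact duplicates created by re-delivered files.
    df = df.drop_duplicates()
    # 3. Keep only the latest revision per analyst, ticker and fiscal period.
    df = (df.sort_values("revision_date")
            .groupby(["ticker", "analyst_id", "fiscal_period"])
            .tail(1))
    # 4. Drop stale estimates revised long before the announcement (look-ahead hygiene).
    stale = (df["announce_date"] - df["revision_date"]).dt.days > 180
    df = df[~stale]
    # 5. Winsorise obvious data-entry errors in the estimate column.
    lo, hi = df["estimate"].quantile([0.005, 0.995])
    df["estimate"] = df["estimate"].clip(lo, hi)
    return df
```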
The newly founded alternative data companies supply data to traders and decision-makers, and their data can sometimes be more accurate than the academic datasets because there is more at stake. However, some of these companies charge exorbitant fees, literally 50-100x what it would cost you to get someone on UpWork or Freelancer to source the data for you. One company, Vertical Knowledge, for example, sells simple Glassdoor data for $30k a year[3]. That is probably still better than the $300k-3mn Glassdoor might charge you directly when you go through their Contact Us form. If you buy either of these, you are doubly duped. Here I have written scripts for ten scrapers; you can amend these, or find some new ones, buy some online proxies, and scrape the Glassdoor data yourself for under $1k[4]. And good luck finding a Neudata consultant who will tell you that.
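For what it's worth, the mechanics are mundane; below is a minimal sketch of the proxy-rotation part only, with placeholder proxies and no site-specific parsing (the linked repo handles the page structure, and checking the site's terms is your own problem).

```python
# Minimal proxy-rotation fetch loop; the URL passed in, the proxy list and the
# headers are placeholders, not a recommendation of any particular provider.
import random
import time
import requests

PROXIES = ["http://user:pass@proxy1:8080", "http://user:pass@proxy2:8080"]  # bought proxies
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch(url, retries=3):
    for _ in range(retries):
        proxy = random.choice(PROXIES)
        try:
            resp = requests.get(url, headers=HEADERS,
                                proxies={"http": proxy, "https": proxy}, timeout=15)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass
        time.sleep(random.uniform(2, 6))  # back off between attempts
    return None
```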
Sometimes it is impossible even to run backtests that approach statistical significance for alternative data, simply because the collection period has been so short. The rise of Twitter about ten years ago was one of the first modern drivers of alternative data. It is reported to have some use cases around sentiment analysis on the stock market; however, this is still an open question, see for example this snarky blog. In such a relatively new field, the data does not always allow for rigorous backtesting, and the quality of the signals is often harder to assess. There is also the concept of data half-life: I might give you data from 2010-2015 that seems to provide additional predictive power over that period, but you rely on a stream of data over time, and the signal might have a short half-life, whereas cross-sectional data on user demographics remains valuable as long as the customer is alive and well.
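One rough way to check for this kind of decay is to track the information coefficient (the rank correlation between the signal and forward returns) year by year and see whether it drifts towards zero. A sketch on synthetic data, where `signal` and `fwd_ret` stand in for a vendor signal and subsequent returns, and the fade is built in on purpose:

```python
# Year-by-year information coefficient (Spearman) as a crude half-life check.
# `signal` and `fwd_ret` are synthetic stand-ins; the decay is baked in.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
dates = pd.date_range("2010-01-01", "2015-12-31", freq="B")
decay = np.linspace(1.0, 0.0, len(dates))            # pretend the signal fades over time
signal = pd.Series(rng.normal(size=len(dates)), index=dates)
fwd_ret = pd.Series(0.5 * decay * signal.values + rng.normal(size=len(dates)), index=dates)

df = pd.DataFrame({"signal": signal, "fwd_ret": fwd_ret})
yearly_ic = df.groupby(df.index.year).apply(
    lambda g: spearmanr(g["signal"], g["fwd_ret"]).correlation
)
print(yearly_ic.round(3))  # an IC sliding towards zero hints at a short half-life
```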
An important lesson buyers of alternative data have to understand is that the price of the data does not determine its level of "exclusivity". These companies benefit from the fact that it is so hard to measure the marginal benefit the data contains. Moreover, there is a growing industry consensus that pre-cleaned data should not be trusted, as it makes it too easy to extract a signal, and even then investors are deceiving themselves: it is not just your MIT PhD who knows how to parse data and extract a signal, any one of Kaggle's 5mn participants can help you with that. Data vendors could also taint their datasets to make the backtests look more promising; for this and many other reasons, you need real-time access to the data as part of a trial in order to test proper out-of-sample performance.
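Such a trial boils down to an honest walk-forward test: fit only on data delivered before each point in time and score on what arrives afterwards. A minimal sketch, assuming a simple tabular setup with time-ordered rows (the model and data are arbitrary stand-ins):

```python
# Walk-forward (expanding window) out-of-sample evaluation on time-ordered data.
# X and y are synthetic; in practice they would come from the vendor trial feed.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(size=1000)

oos_errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    oos_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("Out-of-sample MSE per fold:", np.round(oos_errors, 3))
# Only scores on data received *after* fitting say anything about live value.
```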
Scrolling through Amazon's AWS Data Exchange is a truly eye-opening experience; I am not sure what type of pinstriped suit ends up paying these exorbitant prices. What I do know is that this level of corporate data-acquisition information asymmetry won't last a decade, so get in while the gold is rushing if you are a data peddler. It is a fundamentally deflationary business where no one respects anyone's licensing rights: Vertical Knowledge takes Glassdoor's data (which it may or may not have scraped without Glassdoor's permission) and sells it at $50k, and nothing stops another peddler from buying that data and selling it at $10k to ten people, then at $100 to hundreds of people, and so on. Of course, you could patch this deflationary phenomenon with usage agreements and strict contracts, but come on, it's finance; it's all happening behind closed doors in any case.
Companies should be wary of "peak data" issues, where the cost of extracting data exceeds its value. In recent years, some have been willing to establish risky 'shale' operations, inviting future litigation over privacy and other regulatory concerns. Oftentimes, companies blindly pursue data-acquisition policies in the belief that more data will improve their corporate value. The acquisition of new data or the development of new algorithms should not be made in isolation from the overall business objectives.
Instead of indiscriminately improving your optimisation models, find a business metric to optimise. Do you really need that perfectly calibrated model? Do the additional model improvements driven by your data-acquisition policies really improve your bottom line? For example, if an online retailer uses a differential pricing strategy for an additional X% of revenue, does this cover the additional regulatory costs, the fairness complaints, and the hit to public perception?
To assess the true benefit of your data and modelling, you have to look at the holistic gains. A system that automates parts of a customer service operation, for example, can appear more efficient by a narrow set of metrics, yet still be jarring to customers and disempowering to employees, leading to long-term losses. Business is a social science; human behaviour makes it adaptive and exploitative, so you should keep an eye on whichever metric you use and make sure it still makes sense. It is only a matter of time until the metric of importance gets gerrymandered, creating powerfully perverse incentives. A business KPI must align with the ultimate business goals, not short-term vanity numbers, and it needs to be both simple and interpretable. This requirement often leaves data scientists with the important responsibility of translating a fuzzy business question into a data science question that can be answered with rigorous analysis, and then translating the answer back into business terms.
Companies should insist on actual business-value measures when assessing the quality of model improvements from adding new datasets. In marketing, for example, one should care about business metrics like conversion and retention instead of clicks, and the effect has to be measured through controlled A/B testing. To emphasise the importance of this, Booking.com even found a -10% correlation between offline model improvements and the effect size on business metrics. Let that sink in: an offline improvement (measured by AUC or RMSE) in predicting, say, clicks is not the same as an online improvement in sales, and can even coincide with the opposite, more clicks, lower sales! In the same way, a classification model that predicts tomorrow's stock price direction might be improved with additional data, yet lead to worse trading profits once backtested or put into production.
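Measuring that lift is a standard two-sample comparison; a minimal sketch with invented conversion counts for a control and a treatment arm (the two-proportion z-test is one reasonable choice among several):

```python
# Conversion lift in an A/B test, checked with a two-proportion z-test.
# The counts are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 584]     # control, treatment
visitors = [10_000, 10_000]

stat, p_value = proportions_ztest(conversions, visitors)
lift = conversions[1] / visitors[1] - conversions[0] / visitors[0]
print(f"Absolute conversion lift: {lift:.4f}  (z = {stat:.2f}, p = {p_value:.4f})")
# Ship the model because the *business* metric moved, not because the AUC did.
```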
Where the offline metric is almost a business metric, Booking.com did observe some correlation. However, it is hard to construct offline business metrics, as doing so usually requires some form of interaction, which in turn generally requires a more extensive reinforcement learning framework[5]. The reasons are relatively obvious. First, business value can be non-linear, and you cannot indefinitely drive business value by improving models. Second, you might not be driving business value at all, because you could be over-optimising a proxy metric, for example using click-through rate as a proxy when you actually care about conversion.
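Here is a toy illustration of that gap, in the spirit of the stock-direction example above: two hypothetical sets of direction predictions are scored on accuracy and on the PnL of the naive strategy they imply, on an artificial return series rigged so that a few large moves dominate the PnL. Nothing here reflects a real model or market.

```python
# Toy example: a prediction set built to have a *higher* hit rate on small moves
# but to miss every large move, compared against one that catches the large moves.
# All numbers are artificial.
import numpy as np

rng = np.random.default_rng(3)
n = 500
rets = rng.normal(scale=0.01, size=n)
big_idx = np.arange(0, n, 50)                        # inject a handful of large moves
rets[big_idx] = 0.10 * np.sign(rng.normal(size=len(big_idx)))

true_dir = np.sign(rets)
big = np.abs(rets) >= 0.05

# Model A: modest 52% hit rate on small moves, but catches every big move.
pred_a = np.where(rng.random(n) < 0.52, true_dir, -true_dir)
pred_a[big] = true_dir[big]
# Model B: better 65% hit rate on small moves, but misses every big move.
pred_b = np.where(rng.random(n) < 0.65, true_dir, -true_dir)
pred_b[big] = -true_dir[big]

for name, pred in [("A", pred_a), ("B", pred_b)]:
    acc = (pred == true_dir).mean()
    pnl = (pred * rets).sum()                        # naive long/short on each prediction
    print(f"Model {name}: direction accuracy = {acc:.1%}, total PnL = {pnl:+.3f}")
```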
Companies often believe that by being "data first" and hiring a team of data scientists they can achieve FAANG success. Unfortunately, companies like Netflix, Google, Amazon, and Facebook know exactly what they want to do with their data. Beyond a customer management system, most companies do not naturally have problems that require "big data" and "machine learning". In some scenarios, it might even be better to have no prediction systems at all, where prediction would only expose you to unnecessary risk.
Corporations fearful of creative destruction will often make big decisions based on predictions instead of waiting for the actual event to occur. If all companies base their decisions on the same mistaken prediction, it can have disastrous systemic consequences. For example, in activities like asset-liability management, actuarial modelling, capital modelling, and interest rate setting, it could be beneficial to forecast interest rate changes using some form of sentiment analysis on central bank communications. However, central banks like the Bank of England (BoE) don't always make it easy; they sometimes strike a tone that sets the wrong expectation. Their weak signal is often misinterpreted, and that can have considerable negative repercussions.
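The crudest version of such an exercise is a word-count tone score over a policy statement; a sketch below, where the hawkish and dovish word lists are tiny made-up examples rather than any official lexicon, and real work would need to handle negation, context, and far richer vocabularies.

```python
# Crude dictionary-based tone score for a central bank statement.
# The word lists are illustrative placeholders, not an established lexicon.
import re

HAWKISH = {"inflation", "tighten", "tightening", "overheating", "hike", "restrictive"}
DOVISH = {"accommodative", "easing", "stimulus", "slack", "downside", "cut"}

def tone_score(text: str) -> float:
    words = re.findall(r"[a-z]+", text.lower())
    hawk = sum(w in HAWKISH for w in words)
    dove = sum(w in DOVISH for w in words)
    total = hawk + dove
    return 0.0 if total == 0 else (hawk - dove) / total  # +1 hawkish, -1 dovish

statement = ("The Committee judges that further tightening may be warranted "
             "should inflation prove persistent, while noting downside risks.")
print(f"Tone score: {tone_score(statement):+.2f}")
```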
It is not that predictions are bad, period, but predictions interpreted as the truth, rather than as one version of the truth, can prove detrimental. When predictions are not questioned, as in the spectacular failure of LTCM, where even Nobel prize-winning academics showed their fallibility by trusting their models, it is sometimes better to make no predictions at all. The same holds for financial hedging: having hedging instruments at your disposal can entice you to become speculative; sometimes it is best to just lock the hedging instrument in the cupboard.
Data scientists find it hard to work with business metrics because they are naturally slower to propagate and don't allow modellers to iterate quickly, or automatically, on their models. There is not always enough data for long-termism in machine learning, which is why you see analysts stick with what they know. Once you have an appropriate metric, it is essential to compare it against benchmarks, simple and advanced. In a complex environment, it is especially important to test your prediction model against simple baselines. It is well known from the decades-old M series of time series forecasting competitions, most recently M4, that yesterday's price or some mechanical time series model almost always outperforms "advanced" machine learning prediction models.
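The comparison itself is cheap to run; a sketch on a synthetic price path, pitting the naive "yesterday's price" forecast against a random forest on lagged values (both the series and the model choice are arbitrary):

```python
# Naive last-value baseline versus an ML regressor on a synthetic price path.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(4)
prices = 100 + np.cumsum(rng.normal(size=1200))      # synthetic random-walk-ish prices

# Lagged features: the previous 5 observations predict the next one.
lags = 5
X = np.column_stack([prices[i:len(prices) - lags + i] for i in range(lags)])
y = prices[lags:]
split = 1000
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

naive_pred = X_test[:, -1]                           # "yesterday's price"
ml_pred = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    X_train, y_train).predict(X_test)

print(f"MAE naive:         {mean_absolute_error(y_test, naive_pred):.3f}")
print(f"MAE random forest: {mean_absolute_error(y_test, ml_pred):.3f}")
# On a near-random-walk the naive forecast is hard to beat; make the model prove otherwise.
```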
Even data platforms with thousands of data points that seek to match couples similarly struggle to outperform random baselines. The Match Group owns OkCupid, Tinder, and Match.com and boasts 59 million active users per month. One subsidiary, OkCupid, uses an algorithm that calculates a "match percentage" based on the many questions it asks to assess your religion, politics, lifestyle, and other attributes. Researchers found that when they artificially inflated the "match percentage" for random pairs, these individuals were just as likely to exchange four messages as those who had genuinely achieved a high match percentage; the algorithm is therefore not so much predictive as suggestive[6]. When assessing the value of alternative data, you might similarly find that something like sentiment simply captures the momentum of previous stock movements and is no more predictive than a simple moving-average indicator.
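One simple diagnostic for that last point: regress forward returns on the sentiment signal alone, then alongside a momentum indicator, and see whether the sentiment coefficient survives. A sketch on synthetic data in which "sentiment" is, by construction, mostly recycled momentum:

```python
# Does a "sentiment" signal add anything beyond a simple momentum indicator?
# Fully synthetic: sentiment is built to be mostly recycled momentum.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1500
momentum = rng.normal(size=n)                          # stand-in for a moving-average/momentum signal
sentiment = 0.9 * momentum + 0.4 * rng.normal(size=n)  # "alternative" signal recycling momentum
fwd_ret = 0.3 * momentum + rng.normal(size=n)          # forward returns driven by momentum only

alone = sm.OLS(fwd_ret, sm.add_constant(sentiment)).fit()
joint = sm.OLS(fwd_ret, sm.add_constant(np.column_stack([sentiment, momentum]))).fit()

print(f"Sentiment alone:      R^2 = {alone.rsquared:.3f}")
print(f"Sentiment + momentum: R^2 = {joint.rsquared:.3f}, "
      f"sentiment coef = {joint.params[1]:+.3f} (p = {joint.pvalues[1]:.2f})")
# If the sentiment coefficient collapses once momentum is controlled for,
# the "alternative" signal is momentum in disguise.
```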
[1] https://conifer.rhizome.org/snowde/the-finance-parlour/20201201124311/https://www.risk.net/asset-management/6605516/funds-hunt-for-the-real-mccoy-in-alternative-data-jungle
[2] https://www.risk.net/regulation/6972571/using-alternative-data-to-spot-esg-risks
[3] https://conifer.rhizome.org/snowde/the-finance-parlour/20201201091517/https://aws.amazon.com/marketplace/pp/prodview-vkdjmbklepq4k?ref_=srh_res_product_title#offers
[4] https://github.com/firmai/scrapers/tree/master/glassdoor
[5] https://sci-hub.st/https://doi.org/10.1145/3292500.3330744
[6] https://web.archive.org/web/20140731091409/http://blog.okcupid.com/index.php/we-experiment-on-human-beings/