Machine Learning & Quant Finance

Flirting Dangerously with Alternative Data

blog.ml-quant.com

Data Cartels: Artificial Intelligence Information Heists II

DS
Jan 5, 2021

In this essay, I opine that alternative data practices need closer regulatory and legal scrutiny; a lot of dubious deals are happening behind closed doors. We see a rise of ‘data cartels’, for lack of a better word, whose single mission is to acquire and assemble data through acquisitions, data exchange agreements, or contracts with unvetted third-party data providers.

The Rise of Alternative Data

Decades of prediction folly have harmed analysts, pundits, and policymakers alike. There is an especially palpable pursuit to heal battered reputations, pummelled by the recent pandemic. This search for a solution has, more than ever, been transfigured into a quest for data. And not just data about inflation expectations, interest rates, and consumer spending, but an entirely new flavour of data known as alternative data, which has seemingly been dislodged from privacy policies and lost any semblance of official sanction.

Alternative data had its start as a tool to guide investors in selecting the appropriate assets and stocks to buy, but it has since morphed into an industry that supplies data to companies, municipalities, policy groups, regulators, and really anyone willing to pay the right price. The data being collected does not simply end up in Excel pivot tables; instead, it becomes the crucial ingredient of modern machine learning models used to predict events, behaviours, and outcomes. These models benefit from a greater quantity and a larger variety of data. The extent to which machine learning models rely on data puts severe pressure on data privacy, as consumer data is increasingly shared without informed consent. Competitive and cost pressures lure entities into using private data without consumer consent, and there is a sense that the anonymisation and encryption procedures needed to obscure otherwise personally identifiable information are simply disregarded. The last few years have shown us that these transgressions are lightly fined, but some countries are starting to find the right footholds.

No industry has been left untouched by the quest to unravel, process, and refine alternative data, but this search for unique and potentially sensitive data is most evident in economic domains. Financial asset managers have made many foot faults over the years, with lacklustre performance becoming the status quo. The average asset manager has historically performed no better than a chimp throwing darts. Before the deluge of data, managers sought out strategies to obscure the true performance of their funds. In the age of data, many have come to believe that they might in fact be able to outperform a random benchmark, and need not rely only on their sales and marketing teams, when they have access to granular real-time spending patterns via a consumer email corpus. Hinesh Kalian, the director of data science at the Man Group, said that since the pandemic “the demand has skyrocketed”.

The capacity to process data has led to an arms race for data. This move is captured by an excerpt from The Man Who Solved the Market, detailing the rise of the world’s most successful investment fund, Renaissance Technologies. “Soon, researchers were tracking newspaper and newswire stories, internet posts, and more obscure data, such as offshore insurance claims, racing to get their hands on pretty much any information that could be quantified and scrutinised for its predictive value. [Renaissance’s] Medallion fund became something of a data sponge, soaking up a terabyte, or one trillion bytes, of information annually, buying expensive disk drives and processors to digest, store, and analyse it all, looking for reliable patterns… [the profits] were piling up as Renaissance began digesting new kinds of information. The team collected every trade order, including those that hadn’t been completed, along with annual and quarterly earnings reports, records of stock trades by corporate executives, government reports, and economic predictions.” [1] At Renaissance, little distinction was drawn between data and profit. More data entails more trading signals, which entail more profit, and this is fast becoming true for all corporations.

Alternative data is agnostic to the source and driven by the amount of signal, i.e., the benefit or money-making potential, embedded in the data. The sources that have become popular include web traffic, search trends, social media posts, app data, transaction records, news feeds, emails, location logs, satellite imagery, and logistics data. Alternative data can be intrusive and simultaneously legal. Data provider Return Path specialises in ‘volunteered’ personal email data (normally obtained by users agreeing to unread terms and conditions), covering approximately 70% of email accounts worldwide. By collating this with purchase email receipt data for around 5000 retailers, it offers analytics on purchase behaviour and consumer preferences to anyone willing to fork out the requisite fee. Return Path is not unique; firms like Slice Intelligence and Superfly Insights also readily sell consumer emails to the highest bidder.

Alternative data can also be obtained from smartphone applications, such as apps created to help consumers manage their finances by tracking spending patterns and offering canned services and advice to their clients. These apps typically gain access to bank, investment, and retirement accounts, as well as loan and insurance details, including bills, rewards data, and even payment transactions. This data is then anonymised, aggregated, and sold to third parties. Firms in this category prefer to remain out of the media spotlight. Envestnet Yodlee is a prime example: it has partnered with 12 of the 20 largest U.S. banks and tracks around 6 million users. It sells credit- and debit-card transaction data to investors and research firms, which apply the latest data-mining techniques to scour the data for patterns of interest.

Smartphone data only scratches the surface; the point is that any dataset can qualify as alternative data, even job listings or executive jet records. This is not science fiction. In 2018, the shares of a small cancer drug company called Geron Corporation spiked 25% after its partner Johnson & Johnson posted a job listing suggesting that a key regulatory decision was imminent. In 2017, the flight details of a Gulfstream V were used to predict a $10bn investment by Warren Buffett. Companies are recognising that they can sell the ‘exhaust data’ derived from their primary activities as a secondary source of income. If one day Domino’s Pizza starts selling its transaction data, as credit card companies are doing now, researchers might come to realise that late-night pizza orders from the Pentagon are 35% correlated with foreign military intervention, or that late-night pizza or takeaway orders placed by audit partners at Deloitte or EY could signal botched accounts or suspicious activity at a publicly traded corporation.

Hedge funds and asset managers are buying and merging terabytes of data and throwing it at the latest, probably open-source, machine learning algorithms, hoping that something sticks. Finance, being an adaptive market, rarely allows these models to stick, and when they do, they don’t stick for long. So inevitably the whole enterprise comes down to this: some fail and some succeed; those that fail blame a lack of data, and those that succeed take the praise and build a hedge fund behemoth that acquires and processes even more. The only real loser is the individual whose data is being sold, re-sold, and scrutinised for, at the very least, their consumption habits.

To put it into perspective, asset managers have trillions of dollars under management and an open cheque book when it comes to new alternative data sources. These institutions have a perverse incentive to know exactly what you are buying and when you will be buying it. The trouble is that even if the privacy aspect is dealt with using the appropriate anonymisation and de-identification methods, a great many other problems still persist. As important as consumer privacy is, it is only the very tip of the iceberg. The bulk of the problem sits within competition law, insider trading law, discrimination law, and the ethics of data mining.

The importance of data in finance has become especially prominent since the turn of the century. Renaissance Technologies’ former CEO, and Cambridge Analytica investor, Robert Mercer was quoted as saying that “There’s no data like more data”. Brokers of financial data are increasingly competing with many other data providers to offer a larger quantity and variety of data, and previously expensive datasets are depreciating at a staggering rate. In 2012, ZestFinance’s CEO Douglas Merrill proclaimed that “all data is credit data.” These comments led to a great furore in the industry and incentivised the push towards more granular and more sensitive data. As providers exchange and aggregate data, these datasets soon become commodities and are trafficked and peddled not unlike salt. These data providers compete the price down quickly: if a company is selling web-scraped employee data for $50k, a larger aggregator can buy it, sell it to ten smaller companies for $10k each, and make a 100% profit, contract permitting[2].
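The reseller arithmetic in the example above can be checked in a few lines of Python. This is only a minimal sketch using the figures quoted in the text ($50k purchase, ten buyers at $10k each); the function name `resale_economics` is hypothetical, and the sketch assumes no other costs and no contractual limits on resale.

```python
def resale_economics(purchase_cost, resale_price, n_buyers):
    """Return (revenue, profit, return_on_cost, price_cut) for a data reseller.

    Assumes the only cost is the one-off purchase of the dataset and that
    every buyer pays the same resale price (illustrative assumptions).
    """
    revenue = resale_price * n_buyers
    profit = revenue - purchase_cost
    return_on_cost = profit / purchase_cost
    # How much cheaper the dataset becomes for each downstream buyer.
    price_cut = 1 - resale_price / purchase_cost
    return revenue, profit, return_on_cost, price_cut


# Figures taken from the example in the text.
revenue, profit, return_on_cost, price_cut = resale_economics(50_000, 10_000, 10)

print(f"Revenue: ${revenue:,}")
print(f"Profit: ${profit:,}")
print(f"Return on cost: {return_on_cost:.0%}")
print(f"Price cut per buyer: {price_cut:.0%}")
```

Under these assumptions the aggregator doubles its money while each downstream buyer pays a fifth of the original price, which is the mechanism by which the dataset is commoditised.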

In the end, resting in disused servers all over the world, are data that pose an immense risk to the users embedded in every byte. Soon enough the economics of intangibles takes over: the marginal cost of sharing sensitive user data drops, and the data loses its value on the open market. This cheapness could be fundamentally destabilising. The data can be picked up by foreign governments and used for disinformation campaigns. It is no coincidence that the CEO of one of the world’s most successful data mining companies, Renaissance Technologies, was also instrumental in the creation of Cambridge Analytica, one of the most memorable modern psyop campaigns.

You don’t need a lot of capital to get your hands on consumer data. It is somewhat effortless to, for example, get access to leaked data: you can freely access thousands of databases on Raid Forums, and all you need to do is get through a registration wall. Among others, you can download LinkedIn, Dropbox, Patreon, Twitter, Adobe, Experian T-Mobile, and Apple data. On the government side, a Turkish citizenship database and an Alabama voter database can be accessed[3]. You can also buy data from companies like SnusBase, which openly states that it considers what it is doing perfectly legal: “once a site has been hacked and the database is in the hands of a number of individuals not related to the hack it is considered public information”.

This cheapness factor is concerning. By analogy, if all it took were a microwave to develop enriched plutonium, the world would be out of luck, and we would enter an age of significant destabilisation[4]. We don’t know what happens in a society where the panopticon is all-knowing. If this trend continues, we will soon enter a realm where something as sensitive as your DNA can be picked up for a mere pittance and used in highly targeted advertising, or to set insurance premiums, or maybe to find your perfect romantic match, or even to screen candidates for a job, all without your explicit knowledge[5]. It is known that 23andMe has already shared customer data with third parties, and who knows what will happen to Ancestry.com, which was recently acquired by Blackstone Group.

It is not just investors that are at fault; the push for access to sensitive data has also come from various policy initiatives and ministries hoping to ‘nowcast’ the economy and predict human behaviour. What we once reprimanded governments for has become an openly shared secret among private companies and policy institutes. Policy institutes are promoting “national safety”, not unlike the 2000s, when governments were acting under the guise of national security. The data being ‘acquired’ by government bodies could make them complicit with “companies [that] profit from the exploitation of the personal data.” In 2020, almost universally, governments pushed for timely statistics in the form of alternative data. Eurostat has, for example, signed agreements with Airbnb, Booking, Expedia, and TripAdvisor to access their data on short-term accommodation[6]. And the Federal Reserve, for example, received 3-day lagged credit card data[7]. This air of formality doesn’t take away the risk that the data could entail, based not just on how it was collected but on how it is distributed. If governments are also normalising this form of data sharing, we could be dealing with a harder-to-solve, more entrenched problem.

The abstraction that is ‘alternative data’ has been commonplace over the ages, both in its philosophical and its modern sense. In 4000 BCE, ancient Babylonians identified and recorded correlations between commodity prices and the condition of the Euphrates river on clay tablets[8]. A little later, around 3500 BCE, we have the first recorded use of debt by ancient Sumerians. Credit systems back then, like now, relied on a large body of evidence to assess risks. The historical equivalent of the internal revenue service, Puzrish-Dagan, recorded in writing revenues from livestock like oxen, sheep, and goats. This written credit record, or citizens’ dataset, would have been precious to the Mesopotamian banking families extending credit at the time. The tax authority could make money not just from its primary activity, the collection of taxes, but also by selling this information to banks, which could use it to extend loans or set interest rates.

In modern times, most jurisdictions have decided that taxes are personal information that should be kept private. When the data is confidential, it has a lot of value on the black market, where it could be scooped up by large hedge funds with unlimited capital. When the data is public, it has no value on the black market, but it could lock citizens into their past, because private banks might be more sceptical about having you as a client due to your tainted record. Today we live somewhere in between, where our data is neither fully visible nor fully safeguarded. This in-between status fosters an asymmetric relationship between citizens and corporations, where those with enough money can use legal backchannels to collect and process data about individuals to, among other things, accurately predict their life attributes and outcomes.

Customer-level data sources are not the only type of alternative data; financial institutions also benefit from secondary data that do not directly infringe on an individual’s privacy rights. As an example, Venetian traders in the 15th century would use telescopes to inspect incoming trade ships to derive clues on what commodities to buy or sell. This form of data doesn’t necessarily hurt any one individual except the unaware investor buying or selling assets at a disadvantage to the investor with more information. Although privacy is not undermined, this type of data could, however, undermine the stability of the market, especially if used as an indicator by numerous institutions. As an example, if you sold your telescope readings to many brokers, the price of, say, corn might increase manyfold due to a wrong reading of the telescope. This bears a resemblance to the algorithmic world of high-frequency trading, where misreadings have destroyed billions of dollars in a matter of seconds, only to be recovered a few minutes later.

In short, alternative data is by no means new, but it has been thrown into the spotlight by the intrusiveness of data collection strategies and the security risks posed by the increased centralisation of data. This sentiment is being echoed by regulators globally. All investors and corporations are in search of stable long-term returns. The acquisition and innovative use of data can allow them to squeeze out additional prediction efficiencies, and we can assume that these practices will proliferate. However, when new practices are not driven by a moral compass, they may unconsciously stray into misuse. Alternative data has become a problem because data providers are moving to finer granularity and are starting to expand into tail offerings: more expensive varieties of alternative data with consumer profiles at their centre. At this stage, regulators globally have not settled on a unified view of how to approach the various issues surrounding the collection and use of alternative data. While a standardised approach is unlikely, the burgeoning focus by regulators worldwide on personal data use and abuse indicates heightened concern for market risks and ethics.

[1] The Man Who Solved the Market, Chapter 12

[2] https://conifer.rhizome.org/snowde/the-finance-parlour/20201201091517/https://aws.amazon.com/marketplace/pp/prodview-vkdjmbklepq4k?ref_=srh_res_product_title#offers

[3] https://raidforums.com/Forum-Official?sortby=views

[4] Even regulation to make the data more expensive can be useful.

[5] https://www.scientificamerican.com/article/23andme-is-terrifying-but-not-for-the-reasons-the-fda-thinks/

[6] https://ec.europa.eu/eurostat/web/products-eurostat-news/-/CN-20200305-1

[7] https://www.ft.com/content/9e0e2038-6131-11e9-a27a-fdd51850994c

[8] Lo and Hasanhodzic 2010

© 2023 Derek Snow