Data Cartels and Monopolistic Pressures
Data Cartels: Artificial Intelligence Information Heists IX
The ‘democratisation’ of AI models and competitive data pooling create invisible monopolies. Larger companies are better positioned to take advantage of the supposed ‘democratisation’ of AI because they generally act as data gatekeepers. Data alliances are not always made public. Firms can, as a result, conspire to form data cartels. As a consequence of the monopolistic forces of data, the number of acquisitions will increase significantly, and multi-faceted collaboration will grow as acquisitions become suspect in times of rising anti-trust sentiment. The alternative data market is much larger than we are led to believe. Data has become a sin word in some circles, and corporations no longer report their use of it openly.
Data competition will force smaller firms, which do not have access to data repositories, to compete by conceiving new algorithms and products, leading to idea-stage acquisitions, while traditional players will leverage their data, capital, and existing customer base for growth. Quite indicative of this move is the buy-out of, or investment in, alternative finance companies. Coatue Management invested in Domino Data Lab, Point72 invested in Quantopian, and WorldQuant invested in Estimize. Naturally, the intelligence community supports this move too: the CIA’s venture capital arm also invested in Domino. Most companies are taking a safe, legal strategy and splitting their data acquisition and processing business from their core business. The latest examples this year are Winton and Nielsen. Nielsen will split into two publicly traded entities, one tracking spending trends and granular information about consumer habits and the other tracking the media industry.
AI democratisation is a marketing stunt, a business strategy that helps tech giants attract top talent whilst having outside developers improve their open-source code. Don’t be fooled by the ruse: those who own the data own the models, nothing more, nothing less. PayPal, for example, built its machine learning fraud detection software on open-source tools[1]. But its key competitive advantage in the development of machine learning models is factual exclusivity and expertise (i.e., data)[2]. Researchers at Cambridge and Stanford found in 2015 that a model needs a mere 10 ‘likes’ on a social platform to predict your personality better than a colleague, 70 to beat a roommate, 150 to beat a parent or sibling, and 300 to beat a spouse[3]. Whether you use a proprietary neural network, random forest, or gradient boosting model doesn’t matter much. It is the cross-correlation of your digital life across different platforms that gives the technological giants their competitive advantage. Without data, you are not democratising AI; you are democratising machine learning algorithms. This raises the question of whether AI and data should even be democratised. The widespread use of algorithms has also raised concerns about possible anti-competitive behaviour, as it can make it easier for firms to achieve and sustain collusion without any formal agreement or human interaction.
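A toy illustration of that point, assuming scikit-learn and a synthetic dataset standing in for cross-platform behavioural data (none of this reflects any vendor’s actual pipeline): with the same features, different model families score similarly, while adding features moves the needle far more.

```python
# Toy comparison: model choice vs. feature (data) richness on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cross-platform behavioural data.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=30,
                           shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_feats in (5, 40):                          # a thin profile vs. the full profile
    for model in (RandomForestClassifier(random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        model.fit(X_train[:, :n_feats], y_train)
        acc = accuracy_score(y_test, model.predict(X_test[:, :n_feats]))
        print(f"{n_feats:2d} features | {type(model).__name__:26s} acc={acc:.3f}")
```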
Systemic issues precipitate and grow due to the monopolistic propensity of data. Data wants to be shared, to the point where all firms take the same information into account when assessing customer suitability, which could permanently exclude certain groups from market participation. AI and data sharing pose new sources of systemic risk. As algorithms and datasets are increasingly shared among multiple institutions, errors resulting from model miscalibration, such as the miscalculation of credit risk, can quickly spread across all participating institutions. Like algorithmically driven flash crashes, self-reinforcing models can lead to precipitous contagion with immediate consequences.
Data alliances help companies reduce costs through data economies of scale. However, the emergence of data alliances makes it hard to hold individual parties liable for privacy breaches. Data tends to accumulate at third-party providers, who are less incentivised to keep it safe. These entities might be incentivised to sell data to hedge funds, telemarketers, and political campaigns seeking to investigate and change subjects’ behaviour. They have an incentive to trade and merge new datasets into larger and larger relational databases. The quest for alternative data sources (e.g. social media, email, and point-of-sale data) allows aggregators to develop enhanced personalisation, which can lead to individual financial institutions pursuing extractive and manipulative data practices.
Furthermore, depending on the quality of anonymisation practices, anonymised data may easily be de-anonymised. Brokers combine, swap, and recombine the data they acquire into new profiles, which they can then sell back to the original collectors or other firms. These profiles can proliferate through swapping agreements or be leaked and appear on sites like RaidForums. All you need to do is sign up through a free registration wall. Among others, you can download LinkedIn, Dropbox, Patreon, Twitter, Adobe, Experian, T-Mobile, and Apple data.
Data cartels engage in real monopolistic behaviour. They have the incentive to accumulate all types of data, be it competitive, collaborative, or toxic. For example, one data broker sold the names of 500,000 gamblers over 55 years old for 8.5 cents apiece to criminals, who then bilked money from vulnerable seekers of “luck.”[4] Others offered lists of patients with cancer or Alzheimer’s disease. The push is also coming from companies that never really planned to sell their data. “The biggest growing group of alternative data is from companies that collect data as part of their business, like UPS, a shipping company, and NCR from point-of-sale credit-card transactions … [t]here are companies that never planned on selling their data, but there are obviously huge opportunities to expand their channels of distribution and revenue stream by selling their data.”
There is a drive towards entity and consumer mapping. Merging is performed where two datasets share some overlapping attributes. I can have a lending and a transactional dataset, but I really need age, income, and industry attributes to make better use of the combined data. The idea is to establish a sizeable relational table. Sometimes this involves expert guidance, consumer profiling, and even some level of unsupervised learning to form sensible merging operations. The real power comes from these large, merged databases, which allow corporations and governments to ‘understand the world’. And once they have been established, they never disappear. Paul Ohm wrote about a concept he called the “Database of Ruin”: “Once we have created this database, it is unlikely we will ever be able to tear it apart.”[5] I like thinking of data as a gravitational force with a Kelly-style inevitability: to be tracked, accessed, screened, filtered, remixed, and shared[6].
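As a minimal sketch of the merging step described above (all column names and values are hypothetical), two datasets that share an overlapping key can be joined into one relational table and then enriched with the demographic attributes that make it useful:

```python
import pandas as pd

# Hypothetical lending and transactional datasets sharing a customer_id key.
lending = pd.DataFrame({"customer_id": [1, 2, 3],
                        "loan_balance": [12_000, 5_500, 30_000]})
transactions = pd.DataFrame({"customer_id": [1, 2, 3],
                             "monthly_spend": [900, 450, 2_100]})
demographics = pd.DataFrame({"customer_id": [1, 2, 3],
                             "age": [34, 51, 28],
                             "income": [48_000, 72_000, 39_000],
                             "industry": ["retail", "health", "tech"]})

# Merge on the overlapping attribute to build one relational table,
# then enrich it with the attributes that make the combined data useful.
profile = (lending
           .merge(transactions, on="customer_id", how="inner")
           .merge(demographics, on="customer_id", how="left"))
print(profile)
```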
The use and sharing of these datasets raise obvious questions: should a credit card company be entitled to raise a couple’s interest rate if they seek marriage counselling? If so, should cardholders know this? Should the hundreds of thousands of American citizens placed on secret “watch lists” be so informed, and should they be given a chance to clear their names? One estimate puts spending on the alternative data industry this year at $1.7 billion, a sevenfold surge from just five years ago. Investment banks and hedge funds have made billions of dollars by courting sellers who did not understand the value of what they were holding and buyers who did not understand the problems with what they were purchasing.
In 2014, the money that Visa was making by selling data was masked in “other revenues”. In the first quarter of 2014, that amount increased 22% to $341 million, according to Reuters, outpacing the 14% growth of total revenue, which is dominated by payments, to $2.177 billion. By the first quarter of 2018, other revenues had jumped 119% since the first quarter of 2014 to $748 million, according to an SEC filing, again outpacing total revenue, which was up 64% over the same period to $3.58 billion. If the market for alternative data is just $1.7 billion, then what are we looking at here? Clearly, the use of alternative data is massively underreported. I predict that the market for alternative data is more likely to be in the tens of billions of dollars, as opposed to the meagre $1.7 billion being reported.
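As a quick check of the arithmetic, using only the figures quoted above:

```python
# Growth implied by the Visa figures cited above (Q1 2014 vs Q1 2018).
other_2014, other_2018 = 341e6, 748e6        # "other revenues", USD
total_2014, total_2018 = 2.177e9, 3.58e9     # total revenue, USD

print(f"Other revenues: +{(other_2018 / other_2014 - 1) * 100:.0f}%")  # ~119%
print(f"Total revenue:  +{(total_2018 / total_2014 - 1) * 100:.0f}%")  # ~64%
```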
Hedge funds could be regarded as a closed form of platform monopoly. The network effects of past money lead to more money flowing in and more resources as a percentage of assets under management. The ultimate hedge fund in that regard would be a sovereign wealth fund that can act with impunity, like ADIA and others. Data cartels can improve their competitive position by developing or buying a platform or a product that collects data, by providing a third-party data service, by entering into data-sharing arrangements, or by simply buying data from a data brokerage house. Hedge funds are doing all of the above.
In this sense, new technology also has tremendous potential to centralise and concentrate decision-making power in private market actors. Algorithms can also adopt monopolistic behaviour. Recent research shows that algorithms can collude without communicating with each other. After a number of iterations, these algorithms set prices between the Nash price and the monopoly price. They observe the actions of the other algorithms and, without any concerted effort, increase their prices to extract value from customers. A recent natural experiment followed German gas stations’ adoption of new algorithmic pricing[7].
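A minimal sketch of that mechanism: a toy repeated-pricing duopoly with two independent Q-learning sellers, loosely in the spirit of the simulations referenced in [7]. The demand curve, price grid, and learning parameters are all illustrative assumptions; each agent observes only past prices and its own profit, and whether the learned prices settle above the one-shot Nash level is sensitive to these parameters, which is exactly what the literature cited in [7] debates.

```python
# Toy repeated-pricing duopoly with two independent Q-learning sellers.
# Demand curve, price grid, and learning parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
prices = np.array([1.0, 1.5, 2.0, 2.5, 3.0])   # discrete price grid
n = len(prices)

def profits(i, j):
    """Linear demand: a seller's own price lowers its demand, the rival's raises it."""
    d_i = max(0.0, 4.0 - 2.0 * prices[i] + prices[j])
    d_j = max(0.0, 4.0 - 2.0 * prices[j] + prices[i])
    return prices[i] * d_i, prices[j] * d_j

# State = last period's pair of price indices; one Q-table per seller.
Q = [np.zeros((n * n, n)) for _ in range(2)]
alpha, gamma, eps = 0.1, 0.9, 0.1              # learning rate, discount, exploration
state, actions = 0, [0, 0]

for t in range(100_000):
    for k in range(2):                          # epsilon-greedy price choice
        actions[k] = (rng.integers(n) if rng.random() < eps
                      else int(np.argmax(Q[k][state])))
    rewards = profits(actions[0], actions[1])
    next_state = actions[0] * n + actions[1]
    for k in range(2):                          # standard Q-learning update
        Q[k][state, actions[k]] += alpha * (rewards[k]
                                            + gamma * Q[k][next_state].max()
                                            - Q[k][state, actions[k]])
    state = next_state

print("Prices in the final period:", prices[actions[0]], prices[actions[1]])
```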
In terms of financial stability, the FSB is correct in stressing that “network effects and the scalability of new technologies may in the future give rise to additional third-party dependencies” and that this “could, in turn, lead to the emergence of new systemically important players”. Many of these new third-party participants are unregulated and unsupervised, even though they would likely become our future monopolies and oligopolies. These third-party dependencies and interconnections could have systemic effects, many of which would go unreported due to interpretability or “auditability” concerns. It is particularly challenging to audit machine learning models, and AI-related expertise beyond those developing the AI is limited, both in the private sector and among regulators. These dynamics potentially raise the familiar problem of the governability (or rather un-governability) of the modern financial system to a new order of magnitude[8].
Any AI monopoly will be able to execute mass-scale behavioural analysis, leading to magical customisation and ease of use, which in turn leads to a better-quality product or service, more customers, and the ability to charge a higher price; the positive feedback loop continues until excessive data ownership unwittingly traps consumers like a moth in a bath. The democratisation of AI as a means to remove the monopolistic forces of AI is a futile attempt and will in fact do just the opposite (i.e., advance monopolies).
Data is the fuel, and ML models are the furnace. You can democratise the furnace, but without the fuel it is useless. The recent advances in open data have allowed some smaller firms to compete against the larger powers; however, open data protocols have still mostly benefited large strategic players. The reason is that open data initiatives are primarily sourced from public or non-profit entities rather than being extracted from the large rent-seeking companies that generate and store a torrent of proprietary data.
Big companies are competitively positioned to swiftly incorporate additional data into their established models and to use it to capitalise on existing clients and networks. Large players are those that have the privileged position of standing around the fire while it is being stoked by small and large firms alike. Data has become the invisible rent-seeking tool hiding in the clouds, behind heaven’s pearly gates, strictly guarded by Saint Don’t Be Evil and friends.
In the context of massive Internet firms, competition is unlikely. Today, most startups aim to be bought by a company like Google or Facebook, not to displace them. Data is the fuel of the information economy, and the more data a company already has, the better it can monetise it. Rather than merely hoping for competition that may never come, we need to ensure that the natural monopolisation now at play in fields like search and social networking doesn’t come at too high a cost to the rest of the economy.
The same “rich get richer” dynamics of Google and Facebook afflict finance, where the largest entities tend to attract more capital simply because they are viewed as “too big to fail” and “too big to jail.” There is wealth inequality even in perpetrating illegal activities: only the largest banks would be able to perpetrate financial crimes, owing to their enhanced ability to bypass the automated agents spotting fraud. Adversarial machine learning tactics would have to be used to bypass monitoring systems. In this race, regulators will become more ‘applied’, and ‘applied’ companies will become more like regulators. And the largest players will come out on top once more.
The ability to use ‘big data’ accrues to firms that are large (economies of scale) and firms that have existed for a long time, which benefit from bigger and more granular datasets for identifying customer-level preferences and behaviour. Therefore, the real question regulators have to ask is whether the policies they introduce help to fairly distribute this customer network advantage to new entrants without undermining the incumbents and, more generally, the economy. There would be no economically competitive benefit from widespread model and data democratisation without more radical network democratisation, such as mass corporate open data schemes. Customer networks become entrenched as companies learn your preferences, current status, and general behaviour patterns. This deepens further with the breadth of products and services on offer.
There are scenarios where competitive data becomes collaborative: when it is used in academia for research and innovation, when it is reported to a regulatory body, or when smaller firms use it to compete against larger firms. Some suggest that one way of reducing the advantage of larger firms is for the smaller firms to pool their data voluntarily, as is sometimes done by smaller insurance firms. Such pooling could reduce the competitive advantage of larger firms but could also raise privacy considerations similar to those of forced data sharing. Another alternative would be for the legal system to take the position that customers own their data and can share it as they choose. The European Union, in its Payment Services Directive 2 (PSD2), has recently taken this approach.
Some hedge funds and government entities are also seeking and obtaining exclusive access to certain datasets to improve their trading. It is reported that the People’s Bank of China has ordered online payment groups to funnel their payments through a centralised clearinghouse. Privacy is not only relevant to individuals but also to corporations: a corporation’s competitive position could be significantly weakened if other firms could observe its financial transactions.
[1] https://www.americanbanker.com/news/how-paypal-is-taking-a-chance-on-ai-to-fight-fraud
[2] https://www.law.ox.ac.uk/business-law-blog/blog/2020/06/intellectual-property-justification-artificial-intelligence
[3] https://www.pnas.org/content/112/4/1036.abstract?sid=fefde0d1-d260-40e6-84d3-a1992208031a
[4] https://www.nytimes.com/2007/05/21/world/americas/21iht-data.1.5803543.html
[5] https://hbr.org/2012/08/dont-build-a-database-of-ruin
[6] https://www.goodreads.com/book/show/27209431-the-inevitable
[7] This is a mathematical model; algorithmic collusion has so far not been identified in the real world. See, for instance: https://voxeu.org/article/artificial-intelligence-algorithmic-pricing-and-collusion
[8] This new form of market power explains why the problem of tech (or fintech) platform regulation transcends the rigid boundaries of antitrust law as it is currently applied in the US. See sources cited at n 85; Lina Khan, ‘The Separation of Platforms and Commerce’ (2019) 119 Columbia Law Review 973; Frank Pasquale, ‘Privacy, Antitrust and Power’ (2013) 20 George Mason Law Review 1009.