Automate the Software, the Engineer, and the Data Scientist
AutoML already had its AlphaGo Moment
It isn’t just end-to-end reinforcement learning models that are becoming automated; there is a movement towards automating all prediction problems. A new effort in automated machine learning called AutoML is taking hold. AutoML automates feature engineering, model selection, and hyperparameter optimisation for supervised learning. AutoML had its “AlphaGo” moment when a Google AutoML engine was pitted against hundreds of the world’s best data scientists and, with the mere click of a button, came second in the competition.
Edit: best comment after posing the question to Reddit, “Has AutoML already had its AlphaGo moment?”
The AutoGluon paper would have you believe that Amazon’s solution is even better!
TBH I believe that yes, AutoML has proved itself. They approach the problem exactly how a DS would – testing several state of the art algos, estimating their performance by cross-validation, and combining them in an optimized ensemble, subject to a loss function of choice. Feature selection, treatment of categorical variables, semi-supervised pre-training, distributed computing etc. are all easily automated.
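The workflow the comment describes can be sketched in a few lines: estimate several algorithms by cross-validation, then combine them in an ensemble. This is a minimal illustration using scikit-learn with a synthetic dataset, not the internals of any particular AutoML product.

```python
# Cross-validate several candidate algorithms and combine them in a
# soft-voting ensemble, the basic loop a data scientist (or AutoML) runs.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Estimate each candidate's performance by cross-validation.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

# Combine the candidates into an ensemble and score it the same way.
ensemble = VotingClassifier(list(candidates.items()), voting="soft")
ensemble_score = cross_val_score(ensemble, X, y, cv=5).mean()
print(scores, ensemble_score)
```

An AutoML system essentially wraps this loop in a search over many more algorithms, preprocessing choices, and ensemble weights.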
One area where these algorithms could still be improved is in the creation of derived features. It is hard for the computer to know that, say, Feature 1 and Feature 2 are related and that the product of those features may be meaningful.
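A toy example makes the derived-feature problem concrete: below, the label depends on the product of two features, so neither raw feature correlates with the label, but their interaction does. The column names are invented for the illustration.

```python
# Neither feature_1 nor feature_2 predicts the label on its own, but their
# product does -- the kind of derived feature that is hard to find blindly.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature_1": rng.normal(size=1000),
                   "feature_2": rng.normal(size=1000)})
label = (df["feature_1"] * df["feature_2"] > 0).astype(int)

raw_corr = abs(np.corrcoef(df["feature_1"], label)[0, 1])

df["feature_1_x_2"] = df["feature_1"] * df["feature_2"]
interaction_corr = abs(np.corrcoef(df["feature_1_x_2"], label)[0, 1])

print(raw_corr, interaction_corr)  # the interaction is far more informative
```

Searching over all pairwise products is quadratic in the number of features, which is one reason automated derived-feature generation remains expensive.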
For now, it is computationally expensive to run an AutoML system, but Google is betting on an AutoML model that will become even smarter and cheaper. Kaggle has hosted challenges for financial institutions like Two Sigma, Winton, and Jane Street, all of whom have readily programmable and automatable prediction problems to which an AutoML model can easily be applied.
The automation of machine learning happens in three verticals: software, engineering, and data science. Automation starts with the automation of software. A big challenge for companies is the trade-off between system sunk cost and the fear of creative destruction, both instilled by the steady progress of production-ready machine learning.
Legacy systems that took countless hours of manpower can now be replaced with a few lines of code. Google’s translation software is a case in point: their decades-old system was replaced with one based on deep learning, and the resulting system cut around 500,000 lines of code down to 500 lines of TensorFlow code.
A lot of automation is happening at high cost with little benefit to the user. Salespeople have taken over RPA companies, and customers are sold RPA software at exorbitant prices for tasks that can be completed with microprograms like the “if-then” recipes standard to IFTTT (which stands for “if this, then that”). Unlike these RPA companies, the infrastructure of AutoML is more widely shared and has not yet yielded to consulting.
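To see how small such a microprogram really is, here is a deliberately tiny “if this, then that” recipe of the kind the paragraph argues is often resold at RPA prices. The trigger, threshold, and action are hypothetical stand-ins.

```python
# An entire "RPA workflow" in one function: if this (a large invoice
# arrives), then that (flag it for review).
def recipe(event):
    if event["type"] == "invoice" and event["amount"] > 1000:
        return {"action": "flag_for_review", "invoice_id": event["id"]}
    return {"action": "ignore", "invoice_id": event["id"]}

print(recipe({"type": "invoice", "amount": 2500, "id": "inv-42"}))
print(recipe({"type": "invoice", "amount": 10, "id": "inv-7"}))
```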
Within the decade, we will see extensive adoption of algorithmically driven companies. The expectation is that more of the decision-rights within an organisation will be moved to models, that machines will interact more with customers, and that they will take charge of repetitive back-office functions. Over the years, we have seen a transition from process-driven companies to data-driven companies, which in turn inspires model-driven companies. All companies are an entanglement of these three approaches. Model-driven businesses are agile, iterative, creative, and experimental in nature. Due to these characteristics, they intuit the importance of communication, production systems, documentation, knowledge retention, and random luck.
The next level of automation concerns the automation of the engineer. A semi-established model-driven company has to emphasise the importance of a production environment for creating in-house machine learning systems. The production environment is established to support and automate the machine learning workflow, including procedures for data acquisition, data processing, model development, deployment, and monitoring.
This process is costly, so it shouldn’t be your first step; it should follow from successful AI investigative experiments inside your business. In the beginning, this environment is glued together by human capital, but eventually it will naturally mature into a production environment for faster experimentation. You want to build a production environment when there is an expectation of developing multiple models to help drive decision making. Take Airbnb: they started with only a few models in 2016, each typically taking ten weeks to build, at which point they realised they needed to increase the velocity of production. They wanted to develop models for search ranking, dynamic pricing, and fraud prevention, so they decided to build infrastructure to eliminate the incidental complexities of developing models: infrastructure to help the team access data, develop features, set up servers, and scale up models.
For example, an internal product, Zipline, sits in front of the data warehouse and provides a feature repository for vetted, crowdsourced features, efficient backfills, data versioning, feature quality monitoring, and feature exploration tools. Closely related, they have a transformation library, called Bighead, that catalogues hundreds of canned transformations ready to be applied to data, encouraging users not to reinvent the wheel. They also have a workflow engine that runs behind the scenes to periodically retrain and evaluate models, providing visual outputs and alerts based on the evaluation scores; they call this the ML Automator.
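The core idea of a feature repository can be sketched in a few lines: register a vetted feature once, with documentation, and let any model reuse it. This is a hypothetical, minimal interface in the spirit of tools like Zipline, not Airbnb’s actual API; all names are illustrative.

```python
# A toy feature repository: named, documented, reusable feature
# transformations with a catalogue for feature exploration.
class FeatureStore:
    def __init__(self):
        self._features = {}

    def register(self, name, fn, description=""):
        """Register a vetted feature transformation under a name."""
        self._features[name] = {"fn": fn, "description": description}

    def compute(self, name, row):
        """Apply a registered feature to a raw data row."""
        return self._features[name]["fn"](row)

    def catalogue(self):
        """Feature exploration: what exists and what it means."""
        return {n: f["description"] for n, f in self._features.items()}

store = FeatureStore()
store.register("nights_per_booking",
               lambda row: row["nights"] / row["bookings"],
               "Average nights per booking for a listing")
print(store.compute("nights_per_booking", {"nights": 12, "bookings": 4}))
```

A production version adds the pieces the paragraph lists on top of this core: backfills, versioning, and quality monitoring.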
These models have to be served in a scalable environment, preferably in a container that provides consistent development and training environments, runtime isolation, and scalability; their environment is called Deep Thought and allows the final models to be exposed via APIs. This production environment is configuration-driven, so data scientists do not have to involve engineers to deploy models. You do, however, need engineers to build and maintain this environment, which involves a considerable upfront cost. I am sure that in the coming years we will see good production-environment tools developed by third parties, removing the need for each company to develop its own in-house infrastructure; in fact, a few options have already been open-sourced. Your machine learning infrastructure team doesn’t have to be huge: about a year ago, Facebook’s team consisted of 11 engineers.
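“Configuration-driven” deployment means a data scientist supplies only a config, and generic serving code turns it into an endpoint. The sketch below illustrates the idea; the config keys, model registry, and handler are all hypothetical.

```python
# Generic serving code: turn a deployment config into a prediction endpoint,
# without the data scientist writing any serving logic.
import json

# Hypothetical registry of trained models, keyed by name.
MODEL_REGISTRY = {
    "price_model_v2": lambda f: 0.5 * f["beds"] + 0.1 * f["rating"],
}

config = json.loads("""{
    "model": "price_model_v2",
    "route": "/predict/price",
    "replicas": 3
}""")

def make_endpoint(cfg):
    model = MODEL_REGISTRY[cfg["model"]]
    def endpoint(request_body):
        features = json.loads(request_body)
        return json.dumps({"prediction": model(features)})
    return cfg["route"], endpoint

route, endpoint = make_endpoint(config)
print(route, endpoint(json.dumps({"beds": 2, "rating": 4.0})))
```

In a real system the endpoint would be mounted on an HTTP server inside a container, with the `replicas` field controlling scale-out.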
Facebook has a very similar setup to Airbnb: they have a Feature Store that allows developers to discover and use prebuilt features in their models. They have a library of predefined pipelines called Flow that can be used to apply commonly used models like deep neural networks, gradient-boosted trees, and logistic regressions. After training a model with Flow, they use a set of tools called Predictor to deploy it to production, offering a scalable, low-latency model-serving environment for online predictions. Users can then run multi-tenant experiments to compare multiple live versions of the production models. All large model-first companies follow a similar design. LinkedIn also has a feature marketplace, feature monitoring tools that track statistical moments, off-the-shelf machine learning models, and a scalable serving environment. Feature repositories or stores seem to have become central to model-driven enterprises; in 2017, Uber reported having more than 10,000 features in its feature store.
It goes without saying, but easy access to centralised data is essential for model-driven companies, and it should come before you entertain the idea of feature repositories. Furthermore, there is no benefit to feature repositories if you can’t recreate the features, in which case you have to preserve the data pipelines used to create them. There will be a need to create a feature again in the future, which involves a sequence of steps that process raw data through a series of transformations before it can be used. In the end, we want to establish what is called feature provenance: the idea that we can trace a model inference back to the data used in the model, a truly transparent end-to-end solution. Standardised, automated workflows are an important element of ensuring the efficiency and repeatability of training. All of this is made much easier by taking users out of the loop; whether that is a good thing on net is another question.
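One minimal way to think about feature provenance is that every computed feature carries a record of the dataset version and pipeline steps that produced it, so an inference can be traced back to raw data. The field and step names below are invented for the sketch.

```python
# A feature that carries its own lineage: which raw data it came from,
# which version, and which pipeline steps produced it.
from dataclasses import dataclass, field

@dataclass
class ProvenancedFeature:
    name: str
    value: float
    source_dataset: str               # raw data the feature was built from
    source_version: str               # version of that data
    pipeline_steps: list = field(default_factory=list)

feat = ProvenancedFeature(
    name="avg_session_length",
    value=7.2,
    source_dataset="clickstream",
    source_version="2019-04-01",
    pipeline_steps=["filter_bots", "sessionize", "mean_per_user"],
)

# Tracing an inference back: the feature itself tells the whole story.
print(feat.source_dataset, feat.source_version, feat.pipeline_steps)
```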
Finally, even data scientists are slowly being automated. Central to machine learning is the labelling of data; new techniques allow for semi-supervised learning, where only some of the instances are labelled, and there are various non-human services online to help with labelling if you want to externalise this effort. Features can also be generated using automated methods like MIT CSAIL’s Deep Feature Synthesis. Machine learning is a very iterative exercise, typically involving many experiments run in parallel, most of which can be offloaded to the platform. Whereas hyperparameters were once manually tracked by data scientists, modern machine learning platforms limit this need by offering automated hyperparameter optimisation and feature management.
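The automated hyperparameter search that platforms offer can be as simple as a random search with cross-validation, one basic strategy among many. This sketch uses scikit-learn on a synthetic dataset; the parameter grid is illustrative.

```python
# Random search over hyperparameters with cross-validation: the platform
# tracks every trial, so no manual bookkeeping by the data scientist.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [3, 5, None]},
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)  # every trial's parameters and scores are recorded

print(search.best_params_, round(search.best_score_, 3))
```

More sophisticated platforms replace the random sampler with Bayesian optimisation or bandit-style early stopping, but the interface looks much the same.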
Data is the lifeblood of automation. There seems to be a fundamental misunderstanding of the automatability of tasks, and a belief that only low-skilled jobs are in jeopardy. The truth of the matter is that the requisite skill for a job is just a tiny contributing factor; what matters much more is whether a machine learning construct already exists for the specific task and whether enough data is available. For example, although radiology is a highly skilled profession, machine learning models outperform radiologists in spotting cancerous growths and other anomalies. The reason is that image analysis and medical object identification is a lively area with a lot of money and data thrown at it.
Skilled staff at JPMorgan Chase have suffered a similar fate; a new contract-intelligence programme was established to interpret commercial-loan agreements that previously required 360,000 hours of legal work by lawyers and loan officers (Son, 2017). Other examples include post-allocation requests at UBS and policy pay-outs at Fukoku Mutual Life Insurance. Again, natural language processing for legal documents is a lively field with a lot of money and data thrown at it. What should also be noted is that machine learning is very task-driven, and jobs that are task-specific are at risk of dissolving. Norbert Wiener, the founder of cybernetics, left us with a forewarning in his book: “…remember that the automatic machine is the precise economic equivalent of slave labor. Any labor which competes with slave labor must accept the economic consequences of slave labor.” There is a good economic argument to be made that machine learning is coming for the most expensive human-capital tasks first, so hold on to your hat.