AutoML - what do the clouds have to offer?

Automated Machine Learning (AutoML) has gained much popularity. All cloud providers have done research in this area and offer AutoML services.

Author: Floarea Serban

The various AutoML services evaluate numerous ML algorithms with automatic model building based on previously defined data. The generated models can be deployed on the public cloud or in a container and later integrated into applications via API.

What is AutoML?

AutoML is a process that automates the repetitive and reusable tasks of data science processes. This process allows data engineers, data scientists, analysts and developers to develop models with higher scalability, speed, efficiency and productivity as well as good model quality.

According to Gartner, AutoML has attracted great interest in recent years, with typical use cases including "sales lead scoring, risk assessment and next-best-action recommendation". In addition, one of Gartner's strategic hypotheses is that by 2022 the share of applications using AutoML will increase from 1% to 25%.

The traditional development of ML models requires considerable resources and different employee profiles. On the one hand, data engineers must obtain and provide large amounts of data from different sources. On the other hand, it is up to the data scientists and business analysts to understand, aggregate and transform the data. At the same time, ML researchers are constantly developing and optimizing new algorithms and model structures. These are integrated by software developers into reusable libraries, which finally end up in the applications. Only then does the circle close and the optimized models can (at best) generate business value. 

AutoML aims to automate the entire data science process - from data cleansing to parameter optimization. This process goes through the following steps:

  1. Data cleansing
  2. Feature pre-processing, selection and construction
  3. Model selection
  4. Parameter optimization

Until now, most AutoML tools have focused on model selection and parameter optimization.
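
The two steps most tools focus on, model selection and parameter optimization, can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the toy dataset, the two "model families" (`fit_threshold`, `fit_knn`) and the parameter grids are all invented for this example and do not reflect how any particular AutoML service works internally.

```python
import random

# Toy one-dimensional dataset: the label is 1 when x > 0.6 (assumption for illustration).
random.seed(0)
xs = [random.random() for _ in range(200)]
data = [(x, int(x > 0.6)) for x in xs]
train, valid = data[:150], data[150:]


def fit_threshold(train_rows, threshold):
    """Model family 1: predict 1 when x exceeds a fixed threshold."""
    return lambda x: int(x > threshold)


def fit_knn(train_rows, k):
    """Model family 2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(2 * votes > k)
    return predict


def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)


# Search space: every model family comes with its own hyperparameter grid.
search_space = {
    "threshold": (fit_threshold, [0.3, 0.5, 0.7]),
    "knn": (fit_knn, [1, 5, 15]),
}

# Model selection and parameter optimization in one loop:
# fit every (family, parameter) pair and score it on held-out data.
results = []
for name, (fit, grid) in search_space.items():
    for param in grid:
        model = fit(train, param)
        results.append((accuracy(model, valid), name, param))

best_score, best_name, best_param = max(results)
print(f"best: {best_name}({best_param}), validation accuracy {best_score:.2f}")
```

Real AutoML tools replace the exhaustive grid with smarter search strategies, but the principle of scoring candidates on held-out data and keeping the best one is the same.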

Figure 1: Data science process

What do the cloud platforms offer for AutoML?


AutoML is nothing new. Researchers have been working in this area for years. It started with the development of hyperparameter optimization methods for individual models and now extends to methods such as automated stacking, neural architecture search, pipeline optimization and feature engineering. The popularity of this field has increased significantly in recent years due to the growing use of the cloud. This raises the question: What are the latest developments of the cloud providers and what solutions do they offer?

«AutoML is nothing new. However, its popularity has risen sharply due to the cloud.»
Floarea Serban IT-Architect

There are several cloud-based AutoML platforms, such as Databricks, DataRobot, IBM, RapidMiner and H2O.ai, as well as open-source tools such as TPOT.

The top three cloud providers in Gartner's Magic Quadrant for Cloud AI Developer Services (see Figure 2) are analyzed below. Each of them provides its own platform.

 

Figure 2: Gartner Magic Quadrant Source: https://www.gartner.com/document/3981253?ref=solrAll&refval=247503224

The analysis is based on the suppliers' product catalogues as well as on experience from various sources. A detailed comparison of the remaining products can be found here.

Amazon, Google and Microsoft (in alphabetical order, no ranking) launched their AutoML services at around the same time. AWS SageMaker Automatic Model Tuning, Google Cloud AutoML and Microsoft Azure AutoML have been available on the respective cloud platforms since 2018.

1.    Amazon SageMaker Autopilot

SageMaker Autopilot is part of SageMaker, the AWS framework for machine learning. SageMaker provides a platform and tools to support the entire life cycle of an ML project. You can read more about it in our blog post «Why Machine Learning in the Cloud - the example of Amazon SageMaker».

In summary, the SageMaker Autopilot works as follows:

  1. Data analysis
  2. Feature engineering and model tuning
  3. A leaderboard lists the models with their scores. In addition, the trade-offs for each model can be checked (accuracy vs. performance, etc.). More than 250 models are generated and tested per experiment.
  4. The best model is used to create an inference pipeline. The pipeline can be used as a single endpoint or for batch processing. All steps use the fully managed infrastructure.
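
The leaderboard idea from step 3 can be illustrated with a small, self-contained sketch. It does not call any AWS API; the `Candidate` class, the model names and all numbers are hypothetical and only show how ranking by the objective metric differs from picking a model under a performance constraint.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float      # validation accuracy, higher is better
    latency_ms: float    # inference latency, lower is better


# Hypothetical candidates; names and numbers are made up for illustration.
candidates = [
    Candidate("candidate-a", accuracy=0.931, latency_ms=42.0),
    Candidate("candidate-b", accuracy=0.902, latency_ms=3.5),
    Candidate("candidate-c", accuracy=0.934, latency_ms=120.0),
]

# Leaderboard: rank purely by the objective metric.
leaderboard = sorted(candidates, key=lambda c: c.accuracy, reverse=True)


def best_within_budget(models, max_latency_ms):
    """Return the most accurate model that still meets the latency budget."""
    eligible = [m for m in models if m.latency_ms <= max_latency_ms]
    return max(eligible, key=lambda m: m.accuracy) if eligible else None


for rank, c in enumerate(leaderboard, start=1):
    print(f"{rank}. {c.name}: accuracy={c.accuracy:.3f}, latency={c.latency_ms} ms")

print("best under 50 ms:", best_within_budget(candidates, 50.0).name)
```

In this made-up example the top of the leaderboard is not the model one would deploy under a 50 ms latency budget, which is the kind of trade-off the leaderboard is meant to expose.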

2.    Google Cloud AutoML

Based on the research results of the Google Research Labs, Google Cloud AutoML offers specialized solutions for various areas such as Natural Language Processing (NLP), Computer Vision and Tables. The products are based on transfer learning and neural architecture search. Model development and model selection rely on proprietary Google tools whose inner workings are not disclosed. The NLP product can train custom models for four different tasks:

  • Classification of documents with a single label
  • Classification of documents with multiple labels
  • Entity extraction, which recognizes entities in documents
  • Sentiment analysis, which determines the subjective feelings expressed in documents

Training a model can take several hours, depending on the size of the data. Once the model has been trained successfully, various metrics can be checked to see, for example, how accurate it is and how well it performed.

AutoML Vision simplifies the entire ML process for the user. All that is required is to provide the images with the appropriate labels. When the model is fully trained, an overview of its performance is provided. This shows how good the model is by means of various metrics (precision, recall, confusion matrix, etc.). The evaluation is shown as a diagram.
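
For readers unfamiliar with these evaluation metrics, the following minimal sketch shows how a confusion matrix, precision and recall are derived from predictions. The labels and predictions are hypothetical, and the code is independent of AutoML Vision.

```python
from collections import Counter

# Hypothetical true labels and predictions of an image classifier.
y_true = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "dog", "cat", "dog"]

# Confusion matrix as counts of (true label, predicted label) pairs.
confusion = Counter(zip(y_true, y_pred))


def precision(label):
    """Share of predictions of `label` that were correct."""
    predicted = sum(n for (_, p), n in confusion.items() if p == label)
    return confusion[(label, label)] / predicted if predicted else 0.0


def recall(label):
    """Share of actual `label` items that were found."""
    actual = sum(n for (t, _), n in confusion.items() if t == label)
    return confusion[(label, label)] / actual if actual else 0.0


for label in ("cat", "dog"):
    print(f"{label}: precision={precision(label):.2f}, recall={recall(label):.2f}")
```

Here every "cat" prediction is correct (precision 1.0), but one of the four cats is missed (recall 0.75), which is why both metrics are usually reported together.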

AutoML Tables can be used for structured data. Before training, AutoML Tables performs the following feature engineering tasks:

  • Normalize and bucketize numerical features
  • Create one-hot encodings and embeddings for categorical features
  • Perform basic processing of text features
  • Extract date- and time-related features from timestamp columns
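
To make these transformations concrete, here is a minimal sketch of four of them in plain Python. It is not Google's implementation, which is not disclosed; the helper names, the sample columns and the bucket boundaries are assumptions chosen for illustration, and embeddings and text processing are left out.

```python
from datetime import datetime
from statistics import mean, pstdev


def normalize(values):
    """Z-score normalization of a numeric column."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def bucketize(values, boundaries):
    """Map each value to the index of the bucket it falls into."""
    return [sum(v >= b for b in boundaries) for v in values]


def one_hot(values):
    """One-hot encode a categorical column (categories in sorted order)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def timestamp_features(iso_strings):
    """Derive simple calendar features from ISO-8601 timestamps."""
    parsed = [datetime.fromisoformat(s) for s in iso_strings]
    return [
        {"year": t.year, "month": t.month, "weekday": t.weekday(), "hour": t.hour}
        for t in parsed
    ]


# Hypothetical sample columns.
ages = [22, 35, 58]
plans = ["basic", "pro", "basic"]
signups = ["2020-05-04T09:30:00", "2019-12-31T23:00:00", "2020-01-15T14:45:00"]

print(normalize(ages))
print(bucketize(ages, boundaries=[30, 50]))
print(one_hot(plans))
print(timestamp_features(signups)[0])
```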

AutoML Tables trains several model architectures in parallel. This approach makes it possible to find an appropriate model architecture within a short time. The following model architectures are supported:

  • Linear
  • Feedforward deep neural network
  • Gradient boosted decision tree
  • AdaNet
  • Ensembles of different model architectures

3.    Microsoft Azure AutoML

Microsoft's AutoML services are also a result of Microsoft's research in recent years. Microsoft uses probabilistic methods to derive automated decisions, and meta-learning to reduce the complexity of high-dimensional optimization problems and to transfer knowledge across datasets and problems. Microsoft recommends using AutoML for the following three problems: classification, regression and time series forecasting.

Microsoft Azure AutoML can train a model and work towards a defined target metric. The focus is on the following steps of the ML process:

  • Pre-processing: The data is automatically scaled or normalized so that the algorithms work well. Advanced features such as data guardrails, encoding and transformations, imputation of missing values, etc. are also available.
  • Feature Selection
  • Model selection
  • Hyperparameter tuning

The service iterates through ML algorithms paired with feature selections. Each iteration produces a model with an associated training score. The higher the score, the better the model. During training, Azure ML creates many parallel pipelines to test different algorithms and parameters. The experiment is considered complete once its exit criterion is met, for example when the target score is reached.
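
The iteration scheme described above can be sketched as a simple loop over algorithm and feature-set pairs with an exit criterion. This is a sequential toy version in plain Python, not the Azure SDK: the dataset, the two algorithms (`majority_baseline`, `decision_stump`), the feature sets and the exit values are all assumptions, and for brevity the models are scored on the training data rather than on a validation set.

```python
from itertools import product

# Toy dataset: the label depends only on feature "a" (assumption for illustration).
rows = [{"a": a, "b": b, "label": int(a > 5)} for a in range(10) for b in range(3)]


def majority_baseline(train, features):
    """Always predict the most common label; ignores the features."""
    ones = sum(r["label"] for r in train)
    guess = int(2 * ones >= len(train))
    return lambda row: guess


def decision_stump(train, features):
    """Best single-feature threshold rule among the selected features."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in train}):
            acc = sum(int(r[f] > t) == r["label"] for r in train) / len(train)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    _, f, t = best
    return lambda row: int(row[f] > t)


def score(model, data):
    return sum(model(r) == r["label"] for r in data) / len(data)


algorithms = {"majority": majority_baseline, "stump": decision_stump}
feature_sets = [("b",), ("a",), ("a", "b")]
target_score, max_iterations = 0.95, 10   # exit criteria (hypothetical values)

history = []
for iteration, ((name, fit), features) in enumerate(
        product(algorithms.items(), feature_sets), start=1):
    model = fit(rows, features)
    history.append((score(model, rows), name, features))
    if history[-1][0] >= target_score or iteration >= max_iterations:
        break   # exit criterion met: stop the experiment

best = max(history)
print(f"stopped after {len(history)} iterations, best: {best}")
```

A managed service would run such iterations in parallel and with far more candidates, but the stopping logic follows the same pattern.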

Azure AutoML shows how many models were tested, what score they achieved and how long the training took. The best model can be deployed directly as a Web Service. AutoML is also integrated and available in other Microsoft services/products such as ML.NET, HDInsight, Power BI and SQL Server.

Reservations regarding AutoML

Besides the advantages of AutoML, there are also some deficits which must not be ignored. I list the most important ones here:

  • All services focus more on feature selection and parameter optimization within the data science process. Pre-processing is not yet supported satisfactorily.
  • AutoML is limited to problems like classification and regression. Recommendation and ranking models cannot yet be created.
  • In general there is a lack of transparency. Most AutoML services are like a "black box" with no information about which algorithms are used in each step (data preparation, model selection, parameter tuning, etc.).
  • AI Explainability is the keyword for those clients who want increased transparency of ML platforms.
  • A wave of ML democratization is imminent. The ability to develop AI solutions will be transferred from highly specialized data scientists to the entire IT (software developers and citizen developers). According to Gartner, democratization is one of the top 10 Strategic Technology Trends of 2020 in AI. AutoML will play an important role in this process.
  • AutoML is an excellent tool for exploring models and ranking them. However, AutoML cannot take over the work of a data scientist. The data scientist must still understand and define the business problem, and create and select the important features. The difficulty lies in identifying the problem, asking the right questions and defining a sustainable strategy. It is also necessary to interpret and question the results correctly and to define adequate metrics for decisions. Accuracy is only one of the factors in model evaluation. The most accurate model is not necessarily the best. AutoML cannot automatically make such trade-offs.
  • It is not clear whether AutoML produces better, worse or comparable model quality. KDnuggets has analyzed several scenarios and compared the results of AutoML with those of data scientists. The results are mixed, but it is clear that a data scientist can use AutoML as an aid.

«Accuracy is a factor, but the most accurate ML model is not always the best.»
Floarea Serban IT-Architect
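
The point that the most accurate model is not necessarily the best can be illustrated with a small calculation. All numbers below are hypothetical: two models are evaluated on 1,000 cases, and the business costs assigned to each error type are assumptions that a data scientist would have to define together with the business.

```python
# Hypothetical validation results for two models on 1,000 cases each:
# true positives, false positives, false negatives, true negatives.
models = {
    "model_a": {"tp": 30, "fp": 5, "fn": 20, "tn": 945},
    "model_b": {"tp": 45, "fp": 40, "fn": 5, "tn": 910},
}

# Assumed business costs: a missed case (false negative) is far more
# expensive than a false alarm (false positive).
COST_FN, COST_FP = 500, 20


def accuracy(m):
    return (m["tp"] + m["tn"]) / sum(m.values())


def business_cost(m):
    return m["fn"] * COST_FN + m["fp"] * COST_FP


most_accurate = max(models, key=lambda name: accuracy(models[name]))
cheapest = min(models, key=lambda name: business_cost(models[name]))

for name, m in models.items():
    print(f"{name}: accuracy={accuracy(m):.3f}, cost={business_cost(m)}")
print("most accurate:", most_accurate, "| lowest cost:", cheapest)
```

Under these assumed costs, the less accurate model is the better business choice, and deciding on the cost figures is exactly the part an AutoML service cannot do on its own.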

Conclusion

In summary, AutoML allows production-ready models to be developed much faster. Data scientists and data analysts as well as software developers in various industries can use AutoML to:

  • Implement ML solutions without having much programming knowledge: AutoML provides a UI to map the process. However, the user needs a certain business and problem understanding.
  • Save time and resources: AutoML is also a help for experienced data scientists. They can use it to arrive at a valid model for a particular problem much more efficiently.
  • Reduce costs: With AutoML, less time is needed for parameter selection and optimization. Many combinations are tried out automatically until the best solution within the search space is found.
  • Apply data science best practices easily

Sources

  1. KDnuggets
  2. Alibaba Cloud
  3. Towards Data Science
  4. Microsoft
  5. Google Cloud
  6. AWS Sagemaker 1
  7. AWS Sagemaker 2
  8. AutoML GitHub
  9. Gartner