QCon London 2020: Three trends in machine learning & data streaming

The QCon conference in London was a good opportunity to see in which direction the machine learning and data world is moving.

Author: Valentin Trifonov

This year, ipt once again attended the QCon conference in London. Machine Learning (ML) and Streaming Data were also represented at this large, industry-oriented software engineering and architecture conference, and even had their own tracks. It was a good opportunity to see in which direction the ML and data world is moving.

ML-as-a-Service

As cloud offerings grow, ML is easier to use than ever.

The large cloud providers offer services for well-researched standard cases, such as recommendation systems, speech recognition, image recognition or sentiment analysis. These services are only one API call away, so they can be used without building your own models, without data science expertise, and without infrastructure for training and operation. This takes work off the developers' hands. Especially in combination with Serverless, prototypes can be set up very quickly with such technologies.
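To make «one API call away» concrete, here is a minimal sketch of what such a call could look like from application code. The endpoint URL, the payload shape and the bearer-token header are assumptions made up for illustration; every real provider defines its own URL, authentication and schema. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape, for illustration only;
# real providers each define their own URL, authentication and schema.
ENDPOINT = "https://ml.example.com/v1/sentiment"


def build_sentiment_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking the service to score `text`."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_sentiment_request("The talks were great.", api_key="YOUR_KEY")
# Sending it would be a single call: urllib.request.urlopen(request)
```

The point of the sketch is what is missing: there is no model, no training data and no serving infrastructure on the caller's side.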

Figure 1: Implementation and operation of machine learning models (here a Convolutional Neural Network for image recognition) can quickly cause a lot of work. Managed services take care of that for us. Source: ch.mathworks.com

But this approach does not solve all our problems. The generality that makes these services so convenient can also be a disadvantage: a tailor-made model, developed on the basis of our own data and with the help of domain knowledge, can achieve a higher degree of accuracy.

Machine Learning at the Edge

Edge computing is computing that takes place «close» to the end user rather than centrally in the cloud. The term is particularly relevant for IoT technologies. IoT devices that process data on site can offer a higher quality of service: the calculations do not depend on the availability of the network or of backend services in the cloud, and latency can be greatly reduced. More recently, data protection and security have also moved into the limelight, and they are strong arguments for independence from the cloud.

Machine Learning is a particularly good candidate for edge computing. Training models requires a lot of computing power. Inference, on the other hand, i.e. making predictions with an existing model, is much less computationally intensive and can in many cases be moved to the «edge». This approach is predicted to become more widespread, and the market for corresponding hardware is growing. The motto is:

«Take the data. Act on the data. Throw the data away.»
Alasdair Allan, «The Internet of Things might have less Internet than we thought?», Medium
Figure 2: Various machine learning accelerator hardware, including devices from Google, Intel and NVIDIA. Source: medium.com/@aallan

Gartner made a similar prediction in the Hype Cycle last fall. Could this mean that in the future the cloud will lose relevance for machine learning?
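Why inference fits on small devices can be illustrated with a minimal sketch. It assumes a logistic regression model whose weights were trained elsewhere; the weights, the threshold and the function names are hypothetical and not taken from any talk. Prediction is then just a weighted sum and a sigmoid, and the device can follow the motto by keeping only the decision.

```python
import math

# Hypothetical weights of an already-trained logistic regression model;
# in practice these would come from training done elsewhere.
WEIGHTS = [0.8, -0.5]
BIAS = -0.2
THRESHOLD = 0.5


def predict(features):
    """Inference is one weighted sum and a sigmoid: cheap enough for a small device."""
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def handle_reading(features):
    """Take the data, act on it, throw it away: only the decision is returned."""
    return "alert" if predict(features) >= THRESHOLD else "ok"


# Two made-up sensor readings; the raw values are not stored anywhere.
decisions = [handle_reading(reading) for reading in [(2.0, 0.5), (0.1, 1.5)]]
```

Real edge models are usually larger than this, which is where the accelerator hardware mentioned above comes in, but the split stays the same: train centrally, predict locally.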

State of Data Engineering

Machine learning is only as good as the data it is given. To prepare data in real time, one turns to streaming data technologies such as Kafka, Spark, or Flink. These platforms are highly available, scalable and fast, but also very complex. This year, too, we see that many data engineering challenges remain to be solved:

  • Multi-Tenancy Hell: As the number of jobs on a cluster grows, so does the risk that a faulty job will take down the entire cluster or block resources for others. Pioneers like Lyft tell us how they cleverly solve the problem by using a separate cluster for each job instead. To keep the operating costs from going through the roof, a Kubernetes operator automatically provisions and manages the clusters.
  • Resource allocation: Which job requires how much computing power? Resource consumption can fluctuate greatly, for example when historical data is read in to start a job.
  • The old and the new world unite: Streaming data does not replace batch processing. Instead, both architectures complement each other. Data is often archived or processed in batches, and not only for technical reasons. In case of errors, calculations must be repeatable at any time in the future. Both architectures must therefore coexist, a combination also known as the Lambda architecture.

Interesting developments in this area include autoscaling (for example Google Cloud Dataflow), or a common abstraction for batch and streaming (for example Apache Beam).
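The idea of a common abstraction can be sketched in plain Python. This is not the Apache Beam API, only an illustration of the principle under simple assumptions: the processing logic is written once against a generic sequence of events and then runs unchanged on an archived, bounded data set and on a live source. The event format and all names are hypothetical.

```python
from typing import Iterable, Iterator


def running_totals(events: Iterable[tuple[str, int]]) -> Iterator[tuple[str, int]]:
    """Shared logic: emit the updated total per key after every event."""
    totals: dict[str, int] = {}
    for key, value in events:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]


def live_events() -> Iterator[tuple[str, int]]:
    """Stand-in for an unbounded source such as a message queue consumer."""
    yield ("sensor-a", 3)
    yield ("sensor-b", 1)
    yield ("sensor-a", 2)


# Batch: replay an archived, bounded data set and keep only the final state.
archive = [("sensor-a", 3), ("sensor-b", 1), ("sensor-a", 2)]
batch_result = dict(running_totals(archive))

# Streaming: consume results one by one as events arrive.
stream_result = list(running_totals(live_events()))
```

Because both paths share `running_totals`, a recalculation from the archive after an error yields the same totals the streaming path produced, which is the repeatability requirement described above.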

Future

Where are Machine Learning and Streaming Data heading now, into the cloud or away from the cloud? This depends strongly on the use case. It is certainly possible that we will continue to see trends in both directions in the future. We are curious to see how these topics develop with maturing tooling.