
Adam Szendrei
IT Architect
A data and analytics platform is a complex structure that is operated by experts. This blog post explores whether such a platform can be built cloud-natively.
To answer the question up front: yes. Even a complex tool landscape, such as a data and analytics platform, can be set up cloud-natively. The two major advantages of such an approach are:
Such questions must be answered whenever a data and analytics platform is built, whether on premises or in the cloud.
Figure 1 shows the typical processing stages of an analytics project:
As mentioned above, any data and analytics platform should cover these topics and areas and answer the questions formulated above satisfactorily. The next section shows how Google addresses this challenge.
Google's answer to the data and analytics platform is Cloud Data Fusion. It is a collection of tools that supports traditional data processing (e.g. ETL with Spark) and can also process large amounts of data with low latency. Cloud Data Fusion is a Google Cloud service on which code-free ETL pipelines can be developed via a drag-and-drop interface. It is built on the open-source project CDAP.
Cloud Data Fusion translates the visually created pipeline into an Apache Spark or MapReduce program, which performs the transformations in parallel on a short-lived Dataproc cluster. In this way, complex transformations over large amounts of data can be implemented in a scalable and reliable manner without having to manage the infrastructure.
Cloud Data Fusion is comparable to Google Cloud Dataflow, which is also a parallel data processing service for both batch and stream processing. However, Dataflow uses Apache Beam instead of CDAP and can switch from a batch to a stream pipeline with few code modifications. Figure 2 shows the structure of the Cloud Data Fusion platform.
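To make the idea of a visually designed pipeline more concrete, here is a minimal conceptual sketch in plain Python. It is not the Cloud Data Fusion or CDAP API: the stage names, the sample records, and the `run_pipeline` helper are all hypothetical and chosen only for illustration. The sketch shows how a set of named stages plus the connections drawn between them can be turned into an execution order. A real platform would hand this plan to Spark or MapReduce for parallel execution, whereas this sketch simply runs the stages one after another.

```python
from graphlib import TopologicalSorter


def read_orders(_rows):
    """Source stage (hypothetical): ignores its input and emits sample records."""
    return [
        {"id": 1, "amount": "19.90", "country": "de"},
        {"id": 2, "amount": "5.00", "country": "ch"},
        {"id": 3, "amount": "n/a", "country": "de"},
    ]


def drop_invalid(rows):
    """Transform stage: keep only rows whose amount looks numeric."""
    return [r for r in rows if r["amount"].replace(".", "", 1).isdigit()]


def normalize(rows):
    """Transform stage: cast amount to float and upper-case the country code."""
    return [
        {**r, "amount": float(r["amount"]), "country": r["country"].upper()}
        for r in rows
    ]


def run_pipeline(stages, connections):
    """Run all stages in dependency order and return each stage's output."""
    # Map every stage to the set of stages it depends on.
    graph = {name: set() for name in stages}
    for source, target in connections:
        graph[target].add(source)

    results = {}
    for name in TopologicalSorter(graph).static_order():
        # The input of a stage is the concatenated output of its upstream stages.
        upstream = [row for dep in sorted(graph[name]) for row in results[dep]]
        results[name] = stages[name](upstream)
    return results


# The "drawing": boxes (stages) and arrows (connections) between them.
STAGES = {
    "read_orders": read_orders,
    "drop_invalid": drop_invalid,
    "normalize": normalize,
}
CONNECTIONS = [("read_orders", "drop_invalid"), ("drop_invalid", "normalize")]

if __name__ == "__main__":
    output = run_pipeline(STAGES, CONNECTIONS)
    for row in output["normalize"]:
        print(row)
```

Because the pipeline is only data (stages and connections), the same description could in principle be handed to a different execution engine without changing the stage logic, which is the property that makes a drag-and-drop editor on top of Spark feasible.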
Data analysts, data scientists, and business analysts who are used to working directly in databases on the company's legacy DWH infrastructure will say: "We have much more flexibility and agility when we set it up and run it ourselves." Is there any truth to that? We say yes and no. In any case, the following reasons speak strongly in favor of a ready-to-use platform:
In summary, using ETL technologies in cloud-native environments has many advantages over a self-maintained on-premises variant. These advantages far outweigh the small loss of flexibility. The degree of portability with Cloud Data Fusion is also sufficiently high that one does not fall victim to a strong vendor lock-in.