Can a data and analytics platform be built cloud-native?
A data and analytics platform is a complex structure that is operated by experts. This blog post explores whether such a platform can be built cloud-native.
Author: Adam Szendrei
What does a data and analytics platform in the cloud need?
To answer the question up front: yes. Even a complex tool landscape such as a data and analytics platform can be set up cloud-natively. This approach has two major advantages:
- The many different tools are managed by the cloud provider, so the effort of setting up and operating the platform is much lower.
- Development is much more accessible to different user groups thanks to standardized processes and interfaces.
The details, however, are complex:
- How is the platform scaled?
- Is the data always up-to-date?
- How do you keep an overview of the data and keep costs under control?
- Is the platform secure enough?
- How are the applications deployed?
- How is clean staging of data implemented in the cloud?
These questions must be answered whenever a data and analytics platform is built, whether on premises or in the cloud.
Figure 1 shows the typical processing stages of an analytics project:
- Ingest: Integrate different data sources and data types
- Prepare: Prepare the data for downstream processing (e.g. cleansing, filtering)
- Transform: Aggregate and combine the data to create new data
- Analyze: Analyze the data and gain insights
- Visualize: Visualize data and insights with a business analysis tool
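The five stages listed above can be sketched as a chain of small functions. This is only an illustration in plain Python, not how any particular platform implements them; the record layout, the source names and all values are made up for the example.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; all names and values are made up.
CRM_ROWS = [
    {"customer": "acme", "revenue": "120.50"},
    {"customer": "globex", "revenue": ""},  # missing value, dropped in prepare()
]
SHOP_ROWS = [
    {"customer": "acme", "revenue": "79.50"},
    {"customer": "initech", "revenue": "40"},
]


def ingest(*sources):
    """Ingest: merge records from several sources into one list."""
    return [row for source in sources for row in source]


def prepare(rows):
    """Prepare: drop rows without revenue and convert revenue to float."""
    return [
        {"customer": r["customer"], "revenue": float(r["revenue"])}
        for r in rows
        if r["revenue"]
    ]


def transform(rows):
    """Transform: aggregate revenue per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["revenue"]
    return dict(totals)


def analyze(totals):
    """Analyze: find the customer with the highest total revenue."""
    return max(totals, key=totals.get)


def visualize(totals):
    """Visualize: render a crude text bar chart, one '#' per 10 units."""
    return "\n".join(
        f"{name:<8} {'#' * int(total // 10)}"
        for name, total in sorted(totals.items())
    )


totals = transform(prepare(ingest(CRM_ROWS, SHOP_ROWS)))
top_customer = analyze(totals)
chart = visualize(totals)
```

Each stage only depends on the output of the previous one, which is what makes it possible to hand individual stages over to managed services later.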
Any data and analytics platform should cover these stages and answer the questions above satisfactorily. The next section shows how Google addresses this challenge.
Why should the ETL process (Extract, Transform, Load) be done in the cloud?
Google's answer to the data and analytics platform is Cloud Data Fusion. It is a collection of tools that supports traditional data processing (e.g. ETL with Spark) as well as low-latency processing of large amounts of data. Cloud Data Fusion is a managed Google Cloud service on which code-free ETL pipelines can be developed via a drag-and-drop interface. It translates the visually designed pipeline into an Apache Spark or MapReduce program, which performs the transformations in parallel on a short-lived Dataproc cluster. In this way, complex transformations over large amounts of data can be implemented in a scalable and reliable manner without having to manage the infrastructure.
Cloud Data Fusion is comparable to Google Dataflow, which is also a parallel data processing service for both batch and stream processing. However, Dataflow uses Apache Beam instead of CDAP, the open-source project on which Cloud Data Fusion is built, and can switch from a batch to a stream pipeline with few code modifications. Figure 2 shows the structure of the Cloud Data Fusion platform.
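A visually designed pipeline is essentially a directed graph of stages connected by edges. The following minimal sketch shows how such a graph could be turned into a valid execution order before it is handed to an engine like Spark. It is not Cloud Data Fusion's actual export format or translation logic; the `stages`/`connections` layout, the stage names and the function `execution_order` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline description: stage names and types are made up, and the
# layout is only an assumption about what a visual design might be stored as.
pipeline = {
    "stages": [
        {"name": "read_orders", "type": "source"},
        {"name": "read_customers", "type": "source"},
        {"name": "clean_orders", "type": "transform"},
        {"name": "join", "type": "joiner"},
        {"name": "write_bigquery", "type": "sink"},
    ],
    "connections": [
        {"from": "read_orders", "to": "clean_orders"},
        {"from": "clean_orders", "to": "join"},
        {"from": "read_customers", "to": "join"},
        {"from": "join", "to": "write_bigquery"},
    ],
}


def execution_order(pipeline):
    """Return stage names in an order that respects every connection."""
    # Register every stage first so that unconnected stages are not lost.
    sorter = TopologicalSorter({s["name"]: set() for s in pipeline["stages"]})
    for edge in pipeline["connections"]:
        # add(node, *predecessors): "to" can only run after "from" has finished.
        sorter.add(edge["to"], edge["from"])
    # static_order() raises graphlib.CycleError if the design contains a loop.
    return list(sorter.static_order())


order = execution_order(pipeline)
```

Stages without a path between them, such as the two sources here, have no fixed relative order, which is exactly the freedom a parallel engine can exploit.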
What are the advantages of purchasing such a platform as-a-Service?
Data analysts, data scientists and business analysts who are used to working directly in databases on the company's legacy DWH infrastructure will say: "We have much more flexibility and agility when we set it up and run it ourselves." Is there any truth to that? We say yes and no. In any case, the following reasons speak strongly in favor of a ready-to-use platform:
- Cloud Data Fusion provides a unified view of the data pipelines and data projects in Google Cloud. Developers, data engineers and business analysts can each focus on their own work. Various interfaces are available for this purpose, from code to drag-and-drop, GUI-based design tools.
- The platform is open, and open source is a top priority. The CDAP technology makes pipelines portable to hybrid and multi-cloud environments. Dataproc clusters are used in the background to execute the code.
- The Google infrastructure, with its petabyte-scale network and storage capacity, offers almost unlimited scalability. Cloud Data Fusion does not reinvent the wheel but combines many well-known technologies such as Google Kubernetes Engine (GKE), Cloud SQL, Cloud Storage, Persistent Disk and Cloud Key Management Service.
- With BigQuery and TensorFlow, high-performance analysis tools are available.
- The future is serverless. Why should I pay for something I don't use? BigQuery, for example, is based on a simple and transparent pricing model that allows you to estimate the costs of queries and the associated storage up front.
- Real-time processing is becoming more and more important. With Cloud Data Fusion, batch and real-time ETL/ELT pipelines can be combined transparently. MapReduce, Spark or Spark Streaming pipelines run in the background.
- Cloud Data Fusion offers end-to-end metadata handling. Data lineage can be built seamlessly on the integrated platform, and individual attributes can be traced back to their source.
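The pricing point in the list above can be made concrete with a little arithmetic. In an on-demand model that bills by the amount of data a query scans, the cost of a query can be estimated before it runs once the scanned bytes are known. The sketch below assumes such a model; the function name, the price per TiB and the free allowance are placeholders, not actual list prices, and should be replaced with values from the provider's current price list.

```python
TIB = 2 ** 40  # bytes per tebibyte


def estimate_query_cost(bytes_scanned, price_per_tib, free_bytes=0):
    """Estimate the on-demand cost of one query from the bytes it would scan.

    price_per_tib and free_bytes are placeholders; take real values from the
    provider's current price list.
    """
    if bytes_scanned < 0 or price_per_tib < 0 or free_bytes < 0:
        raise ValueError("inputs must be non-negative")
    # Only the bytes above the free allowance are billed.
    billable = max(0, bytes_scanned - free_bytes)
    return billable / TIB * price_per_tib


# Hypothetical example: a query scanning 250 GiB at an assumed 5.00 per TiB.
cost = estimate_query_cost(bytes_scanned=250 * 2 ** 30, price_per_tib=5.00)
```

With these assumed numbers the estimate comes to roughly 1.22 currency units, since 250 GiB is about a quarter of a TiB.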
In summary, using ETL technologies in cloud-native environments has many advantages over a self-maintained on-premises variant. The small loss of flexibility is more than offset by these advantages. The degree of portability is sufficiently high with Cloud Data Fusion that one does not fall victim to strong vendor lock-in.