Apache Kafka and what the hype is about

Apache Kafka is on everyone's lips, increasingly also in Swiss companies from all sectors. Interest is coming from both IT departments and management. The message streaming platform originally developed by LinkedIn is promoted with keywords such as high availability, scalability, fault tolerance and high performance. But what is behind it? Is the interest justified? Is there actually a business benefit? This blog post sheds light on how Kafka works, how it differs from other messaging systems and where its use cases lie.

Kafka in a Nutshell

Author: David Konatschnig

Apache Kafka is a distributed stream processing and messaging platform suitable for processing real-time data streams (so-called stream processing). The platform was originally developed by LinkedIn because the classic messaging systems of the time did not meet the company's high requirements, primarily in terms of fault tolerance and scalability. This in-house development was important pioneering work. Later, Kafka was continued by the Apache Software Foundation as an open source project. Kafka is a publish-subscribe system designed for high message throughput and low latency. It offers the following functionalities, among others:

  • Read and write message streams, similar to message queues or enterprise messaging systems.
  • Store data streams in a fault-tolerant and persistent manner.
  • Low latency for real-time systems
  • Kafka Connect: Adapter framework for connecting data via different interfaces and big-data technologies such as JDBC, HDFS, Splunk, Elasticsearch, NoSQL, Spark, etc.
  • Kafka Streams: Framework for stream processing with powerful functions for processing data streams

The term «distributed» means that Kafka can be operated as a cluster consisting of several servers (so-called brokers). A great advantage of Kafka is that it runs very flexibly on inexpensive hardware (so-called commodity hardware) or in containers and does not require an expensive, dedicated setup.

Data transported by Apache Kafka is stored in so-called «Topics». A helpful analogy comes from the world of relational databases: you can think of a topic as a table. Topics are written by so-called producers and read by so-called consumers. Each topic is divided into partitions, which are distributed among several brokers. This allows the data of a topic to be read in parallel.
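To make the idea of partitions more tangible, here is a minimal sketch in plain Python. It does not use a Kafka client. The names `partition_for` and `produce`, the partition count and the use of CRC32 are assumptions made for this illustration only; Kafka's own default partitioner uses a different hash function. The sketch shows how a message key can decide which partition a message lands in.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition with a stable hash.

    CRC32 is only a stand-in here; it is not the hash Kafka itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(topic: dict, key: str, value: str) -> int:
    """Append a message to the partition chosen by its key."""
    partition = partition_for(key)
    topic[partition].append((key, value))
    return partition


# A topic modelled as: partition number -> ordered list of messages.
topic = defaultdict(list)
for key, value in [("alice", "login"), ("bob", "login"), ("alice", "purchase")]:
    produce(topic, key, value)

# Messages with the same key share a partition, so their relative order is kept.
alice_partition = partition_for("alice")
alice_events = [v for k, v in topic[alice_partition] if k == "alice"]
```

Because each partition is an independent list in this model, different consumers could read different partitions at the same time, which is the parallelism described above.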

This is also an important difference from classic messaging systems: there, messages are typically deleted from the queue as soon as they have been read by all consumers. With Kafka, on the other hand, each topic has a configurable retention time. Once a message is older than this retention time, it is deleted from disk, regardless of whether it has been read. As long as the message is in the topic, however, it can be read as often as required and by as many consumers as desired. The whole topic or a part of it can be read starting from a given offset. Kafka thus hands over the responsibility for tracking which data has already been consumed entirely to the consumer. In concrete terms, this means that a topic can be read simultaneously by a real-time application, a batch process and a machine learning algorithm.
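The interplay of retention time and offsets can be sketched with a small in-memory model. This is a toy single-partition log in plain Python, not the Kafka client API; the class `TopicLog`, its methods and the consumer names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy single-partition log; not the Kafka client API."""
    retention_seconds: float
    base_offset: int = 0                         # offset of the oldest retained message
    records: list = field(default_factory=list)  # (timestamp, value) pairs

    def append(self, timestamp: float, value: str) -> int:
        """Store a message and return the offset it was assigned."""
        self.records.append((timestamp, value))
        return self.base_offset + len(self.records) - 1

    def expire(self, now: float) -> None:
        """Drop messages older than the retention time, read or not."""
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def read_from(self, offset: int) -> list:
        """Return all retained values starting at the given offset."""
        start = max(offset, self.base_offset) - self.base_offset
        return [value for _, value in self.records[start:]]


log = TopicLog(retention_seconds=60)
for timestamp, value in [(0, "a"), (10, "b"), (50, "c")]:
    log.append(timestamp, value)

# Each consumer keeps its own offset; reading does not remove anything.
offsets = {"realtime-app": 2, "batch-job": 0}
realtime_view = log.read_from(offsets["realtime-app"])  # ["c"]
batch_view = log.read_from(offsets["batch-job"])        # ["a", "b", "c"]

log.expire(now=65)               # "a" is now older than 60 seconds
after_expiry = log.read_from(0)  # ["b", "c"]
```

Note that the log itself knows nothing about its consumers. The `offsets` dictionary lives outside of `TopicLog`, which mirrors the point that the consumer is responsible for remembering what it has already read.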

Apache Kafka has the ambition to become the central nervous system within a company, linking all data and systems together. In doing so, the data should be moved into the center of attention, towards an event-driven architecture. This should work for start-ups as well as for large corporations.

Business Value through Apache Kafka

Two things can be achieved with Apache Kafka: On the one hand, silos can be broken open by tapping into old, cumbersome back-end applications and publishing data changes across the company within seconds. On the other hand, it is possible to implement completely new use cases that were previously impossible, or only possible with restrictions, for technical reasons, for example the real-time analysis of clickstream or transaction data.

In today's fast-moving times, you want to be able to react immediately to an event. Considering the many influences, such as clicks and likes from social media or measurement data from IoT devices, a classic ETL process that only processes the data every 24 hours is always one step behind the competition. With Kafka as the central nervous system, it is possible to react promptly to all events, such as API requests or database updates, and to trigger a process or control a microservice.

If you look at the financial sector, for example, there are countless use cases that can provide real added value for the customer, such as fraud detection based on real-time evaluation of transactions or real-time currency conversions for purchases made. In the automotive industry, sensor data in the vehicle can be used to alert customers when a component needs to be replaced and where the nearest service provider is located.

Figure: Apache Kafka as the central nervous system within a company.

How Kafka differs from other messaging systems

Kafka is not just another messaging system that blends in with the competition on the market. It offers a number of key differences that have a huge impact on infrastructure operating costs and user experience, and that put Apache Kafka in a class of its own. The following is an overview of the most important unique selling points:

  • Strict ordering of messages within a partition
  • No administration of the registered consumers. In Kafka, it is the responsibility of the consumer to manage the so-called offset (the position up to which it has read).
  • Distribution of the data over several brokers. This leads to high availability and fault tolerance. The replication factor can be configured individually for each topic.
  • Scalability and optimization for very large data streams (e.g. gigabytes per second)
  • Suitable for operation in containers
  • Runs on standard hardware
  • No error handling. Errors at message level must be handled by the consumer.
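
Two of the points above, offset management and error handling on the consumer side, can be illustrated together. The following is a minimal sketch in plain Python without a Kafka client. The function names `consume` and `parse_amount` are hypothetical, and setting a faulty message aside and continuing is only one possible policy, chosen here as an assumption for the example.

```python
def consume(messages, start_offset, handler):
    """Process messages from start_offset; return (next_offset, failed).

    The consumer, not the broker, decides what happens to a message
    that cannot be processed. Here it is set aside and reading continues.
    """
    failed = []
    next_offset = start_offset
    for offset in range(start_offset, len(messages)):
        message = messages[offset]
        try:
            handler(message)
        except ValueError as error:
            # Keep enough context to inspect or retry the message later.
            failed.append((offset, message, str(error)))
        next_offset = offset + 1  # position of the next message to read
    return next_offset, failed


processed = []


def parse_amount(message):
    """Example handler: raises ValueError for input that is not a number."""
    processed.append(float(message))


messages = ["10.5", "oops", "7"]
next_offset, failed = consume(messages, start_offset=0, handler=parse_amount)
```

The consumer would persist `next_offset` itself and pass it as `start_offset` on the next run. What to do with the entries in `failed` (retry, log, forward elsewhere) remains a decision for the consuming application.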

Conclusion

Apache Kafka is definitely more than just hype. As with any new technology, you have to manage expectations every now and then. But more and more companies are realizing that they can offer digital services that are innovative and disruptive if the right data is provided and integrated. This is where Apache Kafka can differentiate itself from common messaging systems by offering higher data throughput, higher availability and better scalability. This is directly reflected in the user experience, for example, by offering new services that use real-time data.