Introduction to Apache Kafka on waitingforcode.com

This article starts a short series of posts about Apache Kafka, often considered one of the solutions for data ingestion.

The first part of the article presents the context and main features of Apache Kafka. In the second part, we try some basic operations with this system.

Apache Kafka basics

At first glance, Apache Kafka is similar to RabbitMQ, presented in more detail in the articles about RabbitMQ. However, the similarity is only superficial. To see that, let's begin by describing in broad strokes what happens in Apache Kafka.

The comparison with RabbitMQ naturally suggests that Apache Kafka takes and dispatches messages. That's almost right. Apache Kafka is in fact an append-only log system, characterized by:

speed - initially used at LinkedIn and released as an open source project in 2011, Apache Kafka is able to handle a lot of data ("hundred megabytes per second" according to the official documentation) in a very short time.
scalability - like almost every component in Big Data, Apache Kafka is scalable. It means, among other things, that it can handle more and more data as the cluster grows.
durability - Kafka messages are stored persistently, according to a previously defined configuration. Unlike in RabbitMQ, durability isn't dictated by consumer acknowledgments.
distributed design - adapted to the fault-tolerance requirements imposed by Big Data architectures such as lambda or kappa.

Topics

But these characteristics are not the only things needed to understand the basics. We also need to discover how Kafka works. Kafka stores published data in containers called topics. Each topic contains a partitioned log. This log looks like a FIFO queue whose items are the published, immutable messages. That's the reason why Kafka is very often called an append-only log system. Messages can remain stored even after a consumer has read them. The retention time is defined in the configuration and doesn't depend on whether the messages were consumed or not.

Since messages aren't removed after consumption, how do Kafka consumers know which messages to take? They know it thanks to the offset. It's a simple numeric value indicating the position in the log that was most recently read. For example, when a consumer has consumed the first 100 messages of a given topic partition, its offset will be 100. If new messages are published, this consumer won't start reading from the first position (0) but from the first unread one (100).
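
The offset mechanism described above can be illustrated with a minimal in-memory sketch. It is not Kafka's implementation or client API: the `PartitionLog` and `Consumer` classes and their methods are hypothetical names invented for this example, and a plain Python list stands in for the persisted log.

```python
class PartitionLog:
    """Hypothetical in-memory stand-in for one partition: an append-only list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset; nothing is deleted."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Hypothetical consumer that only remembers its position in the log."""

    def __init__(self, log):
        self._log = log
        self.offset = 0  # offset of the next message to read

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)  # advance instead of acknowledging
        return batch


log = PartitionLog()
for i in range(100):
    log.append(f"message-{i}")

consumer = Consumer(log)
consumer.poll(max_messages=100)   # reads offsets 0..99, offset becomes 100
log.append("message-100")
next_batch = consumer.poll()      # resumes at offset 100, not at 0
```

A second `Consumer` created on the same log would start again from offset 0, since reading never removes anything from the log.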

Because each topic is partitioned, it's composed of one or more partitions. First, partitions help to achieve fault tolerance because they're replicated across different servers, called brokers, in the cluster. In Kafka terminology, the server holding the main copy of a partition is called the leader and the ones holding its replicas are called followers. Second, partitions support parallelism.
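
To make partitioning and replication more concrete, here is a small sketch under simplifying assumptions. The function names are hypothetical, `zlib.crc32` is only a stand-in for whatever hash a real partitioner uses, and the round-robin replica placement is a simplification chosen for this example rather than Kafka's actual placement logic.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition; the same key always lands in the same partition.
    crc32 is an assumption used for illustration only."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_replicas(num_partitions, brokers, replication_factor):
    """Spread each partition's replicas over distinct brokers, round-robin.
    The first broker of each list plays the leader, the others are followers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        replicas = [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        layout[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return layout


layout = assign_replicas(
    num_partitions=3,
    brokers=["broker-0", "broker-1", "broker-2"],
    replication_factor=2,
)
chosen = partition_for("user-42", 3)  # stable for a given key
```

With this layout every broker leads one partition and follows another, so losing a single broker leaves a copy of each partition available, and three consumers can read the three partitions in parallel.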

Producers and consumers

Messages don't arrive in topics by magic. They are explicitly published by producers. Data can be sent individually, one message at a time, or in batches, similar to batch statements in SQL. In the second case, the producer accumulates several messages in memory and sends them once a configured threshold is reached (either a number of messages or a latency bound).
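
The batching behavior can be sketched as follows. This is an illustration of the idea only, not the Kafka producer: `BatchingProducer`, `max_batch_size`, `max_linger_seconds` and `tick` are hypothetical names, and the clock is injectable so that the example does not need to sleep.

```python
import time


class BatchingProducer:
    """Hypothetical producer that buffers messages and sends them in batches."""

    def __init__(self, send, max_batch_size=3, max_linger_seconds=0.5,
                 clock=time.monotonic):
        self._send = send  # callable receiving a list of messages
        self._max_batch_size = max_batch_size
        self._max_linger_seconds = max_linger_seconds
        self._clock = clock
        self._buffer = []
        self._first_buffered_at = None

    def publish(self, message):
        if not self._buffer:
            self._first_buffered_at = self._clock()
        self._buffer.append(message)
        self._flush_if_due()

    def tick(self):
        """Call periodically so a partially filled batch is not held forever."""
        self._flush_if_due()

    def _flush_if_due(self):
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_batch_size
        waited = self._clock() - self._first_buffered_at >= self._max_linger_seconds
        if full or waited:
            self._send(list(self._buffer))
            self._buffer.clear()


sent_batches = []
fake_now = [0.0]  # fake clock value, advanced manually below
producer = BatchingProducer(sent_batches.append, max_batch_size=3,
                            max_linger_seconds=0.5, clock=lambda: fake_now[0])
for message in ["a", "b", "c", "d"]:
    producer.publish(message)  # "a", "b", "c" leave together at the size threshold
fake_now[0] = 1.0              # simulate time passing beyond the latency bound
producer.tick()                # "d" is sent alone because it waited long enough
```

Larger batches generally mean fewer network round trips at the price of a little extra latency, which is why both thresholds are configurable.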

Consumers are a little more complex. They are based on a pull approach, meaning that consumers request new messages when they are able to process more. As already mentioned, consumers don't acknowledge a message when it's consumed. Instead, they advance the offset pointing to the next message to consume.

A powerful abstraction offered by Kafka and related to consumers is the consumer group. Depending on the configuration, this abstraction imitates two approaches to reading messages: queueing and publish-subscribe. When we have one consumer group with 4 consumers, Kafka distributes messages among the available consumers. So it's possible that each of the first 4 messages will be consumed by a different consumer. This illustrates the queue approach.

On the other hand, if we have 4 different consumer groups, each with 1 consumer, messages will be dispatched in a publish-subscribe way: each message goes to every consumer.
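
Both scenarios can be simulated with a short sketch. It assumes a topic with 4 partitions filled round-robin and a simple modulo rule to decide which group member reads which partition. The `deliver` function and the group and consumer names are hypothetical, and the real assignment strategies are more elaborate.

```python
from collections import defaultdict


def deliver(messages, groups, num_partitions=4):
    """Return {"group/consumer": [messages]} for the given consumer groups.

    messages: list of payloads, spread over partitions round-robin.
    groups: {group_name: [consumer_name, ...]}.
    """
    partitions = defaultdict(list)
    for i, message in enumerate(messages):
        partitions[i % num_partitions].append(message)

    received = defaultdict(list)
    for group, members in groups.items():
        # Within a group, each partition is read by exactly one member.
        for partition in range(num_partitions):
            owner = members[partition % len(members)]
            received[f"{group}/{owner}"].extend(partitions[partition])
    return dict(received)


messages = ["m0", "m1", "m2", "m3"]

# One group with 4 consumers: every message is handled once (queue).
queue_like = deliver(messages, {"g1": ["c1", "c2", "c3", "c4"]})

# 4 groups with 1 consumer each: every consumer sees every message (pub-sub).
pubsub_like = deliver(messages, {"g1": ["c1"], "g2": ["c1"], "g3": ["c1"], "g4": ["c1"]})
```

In the first call each consumer ends up with a single message, while in the second call each of the four consumers receives all four messages.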

And Zookeeper in all this?

Apache Kafka has a dependency on Apache Zookeeper. The easiest way of thinking about this tool is to compare it to a filesystem. This system is composed of a root (/) and subdirectories, which Zookeeper calls zNodes (Zookeeper Nodes). A zNode can contain other zNodes and can also hold data, like a file. The difference from a normal filesystem is that Zookeeper's one is distributed. Since it's distributed, it also requires synchronization between cluster nodes. This synchronization is, of course, managed by Zookeeper.
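
The filesystem analogy can be pictured with a tiny in-memory tree. This sketch is neither distributed nor related to the Zookeeper client API: `ZNode` and `ZNodeTree` are hypothetical classes, and the paths and payloads are made up for illustration.

```python
class ZNode:
    """A node that can hold data and child nodes at the same time."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ZNodeTree:
    """Hypothetical, single-process picture of a hierarchical namespace."""

    def __init__(self):
        self._root = ZNode()

    def _walk(self, path):
        node = self._root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # KeyError if a parent is missing
        return node

    def create(self, path, data=b""):
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise ValueError(f"{path} already exists")
        parent.children[name] = ZNode(data)

    def get_data(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


tree = ZNodeTree()
tree.create("/brokers")
tree.create("/brokers/ids")
tree.create("/brokers/ids/0", b"host-a:9092")  # illustrative payloads
tree.create("/brokers/ids/1", b"host-b:9092")

live_brokers = tree.get_children("/brokers/ids")
```

Listing the children of a node such as `/brokers/ids` is the kind of lookup that lets a client find out which brokers are currently registered.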

Generally, Kafka uses Zookeeper to manage brokers in the cluster. It helps to check broker statuses (down, up) and to detect nodes joining or leaving. In addition, Zookeeper is used to choose the server acting as the "leader" for partitions, including the situation when the leader goes down and must be replaced by another broker, also elected by Zookeeper. It also manages the notifications. When a new message is produced, the producer is notified that it was processed correctly once the write is fully committed to all concerned brokers.

Zookeeper also stores some information about consumers. It contains parameters about consumer group membership, subscribed topics and consumed messages (offsets). Consumers are also rebalanced when a broker joins or leaves the cluster. Rebalancing consists of choosing the partitions from which messages will be consumed.

Apache Kafka appears to be a complex tool. It introduces some new concepts to messaging, such as consumer groups, which handle two messaging approaches within one common abstraction. Aside from that, it uses some ideas similar to other messaging systems, such as messages or topics (known elsewhere as channels or queues). To guarantee fault tolerance, Apache Kafka works in a distributed way and replicates topics across the cluster. The distributed nature is managed by Apache Zookeeper, which is a prerequisite for running Apache Kafka.
