Introduction to Apache Kafka Java API

Recently we discovered some theoretical concepts about Apache Kafka. So it's a good moment to discover Java API.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

This article covers main objects used in Kafka development with its Java API. The first part focuses on actors used to produce messages. The second part describes message consumers. The last part describes common parts for producer and consumer.

Kafka producers

Message producing features are defined in org.apache.kafka.clients.producer package. Let's begin with the most interesting object for messages producing - KafkaProducer. It's very important to underline that the instances of this class are thread-safe, so they can be shared. It should make producing faster than in the case of one instance per thread.

Messages are sent through send(ProducerRecord, Callback) method which can be either blocking or not blocking. In this second case, it returns a java.util.concurrent.Future typed to org.apache.kafka.clients.producer.RecordMetadata. Returned object represents message sent and acknowledged by broker. It contains, among others, the information about offset, topic or partition.

KafkaProducer doesn't necesarilly send messages immediately to the server. Instead, it appends them to an instance of org.apache.kafka.clients.producer.internals.RecordAccumulator. RecordAccumulator is a kind of queue accumulating records before sending them in batch.

Another way to send messages to cluster is flush() method. However, it's a blocking operation because it waits that requests related to sent records complete.

What happens if the delivery of one message fails ? The producer can retry or abort the sending. The behavior depends on configuration entry retries which defines the number of tries after sending failure. In configuration we can find some other interesting entries, such as:

Kafka consumers

We retrieve similar concepts to producing in consuming, located in org.apache.kafka.clients.consumer package. Its main component is KafkaConsumer. It contains different methods to read data, either seeking it from particular position (seek* methods) or by polling them through poll(long) method. Polling consists on getting all new messages since the last retrieval. These messages are recognized thanks to last committed offset for given list of partitions.

But consumer can't take any message if it doesn't subscribe to to topics. This actions is made through subscribe(List<String>) method. Subscription is not incremental. It means that each time this method is invoked, previously defined subscriptions are overridden. Consumer can also subscribe to particular topic partitions, represented by org.apache.kafka.common.TopicPartition objects. Unsubscribing is made by a call of unsubscribe() method. Because it doesn't accept any parameter, given consumer unsubscribes from all of previously subscribed topics.

Record retrieved by consumers are represented by instances of org.apache.kafka.clients.consumer.ConsumerRecord class, typed to message key and value.

As you can deduce, offset is an important part of messages reading. It can be handled automatically or manually. If we chose to manage it manually, we can use either synchronous (commitSync()) or asynchronous (commitAsync()) methods. The first operation blocks until the commit is done. The commits concern all subscribed list of topics and partitions.

If you remember from the first article, consumers can be grouped. At least for Java's KafkaConsumer, it's done in configuration step by specifying parameter. In the list of configuration entires we retrieve similar concepts to producer: bootstrap.servers or key/value deserializers (not serializers). Besides that, another entries, specific to consuming, exist:

Kafka common objects

org.apache.kafka.common package contains objects which can be used by producer and consumer. Mostly, we can find there objects representing cluster participants, starting with Cluster class to represent a subset of the nodes, topics, and partitions in the Kafka cluster.

Going deeper in architecture, we can get some information about cluster nodes. They are represented by Node class. Further, we can retrieve indicators about given topic and its partition. It's expressed by the instances of already discovered TopicPartition class. If we want to get more details, we can use PartitionInfo. It contains more details, mostly about topic organisation in the cluster. So, we can retrieve: leader node, followers (both alive and not used anymore), topic name and partition id.

Kafka Java API is quite small and contains only the most evident objects representing messaging concepts: producers, consumers, cluster components and serializers. But the real discovery of them will be the subject of next articles which, as usually until now, will mix theoretical and practical approaches.

If you liked it, you should read:

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!