Open Source tools helped me a lot in my switch to the cloud world. Managed cloud services often share the same fundamentals as their Open Source alternatives, but there is always something different. Today I'll focus on these differences between the Amazon Kinesis service and the Apache Kafka ecosystem.
Difference 1: delivery guarantees for producers
The first big difference is the delivery semantics for producers. Amazon Kinesis Data Streams provides 2 APIs for producers. The first one, PutRecord, sends only 1 record per request. It won't work well at scale because, as you may already know, one of the best things to do when you have a lot of data to deliver is to use batch requests. A batch request groups multiple records in a single call, so the job spends less time managing network connections.
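To give an idea of what batching looks like in practice, here is a minimal sketch of splitting a record list into PutRecords-sized batches. It assumes the documented limit of 500 records per PutRecords call and, for brevity, ignores the additional payload-size limit:

```python
def chunk_records(records, max_batch_size=500):
    """Split records into batches that respect the PutRecords limit of
    500 records per call (the per-call payload size limit is ignored
    here for brevity)."""
    return [records[i:i + max_batch_size]
            for i in range(0, len(records), max_batch_size)]

# 1200 records end up in 3 batches of 500, 500, and 200 records.
batches = chunk_records([{"Data": b"x", "PartitionKey": "k"}] * 1200)
```

Each batch could then be passed to a single PutRecords call instead of issuing 1200 separate PutRecord requests.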
Kinesis also provides a batch version called PutRecords. It's great, but if you care about the order of the written records, you won't be that happy:
The response Records array includes both successfully and unsuccessfully processed records. Kinesis Data Streams attempts to process all records in each PutRecords request. A single record failure does not stop the processing of subsequent records. As a result, PutRecords doesn't guarantee the ordering of records. If you need to read records in the same order they are written to the stream, use PutRecord instead of PutRecords, and write to the same shard.
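This partial-failure behavior is visible in the response shape. The sketch below shows, on a hand-written response (no real AWS call is made), how you'd pair request records with their response entries to find the ones to retry; note that retrying them necessarily reorders records relative to the original request:

```python
def failed_records(request_records, response):
    """Pair each request record with its PutRecords response entry
    (they are positionally aligned) and keep only the failed ones,
    so they can be retried in a follow-up call."""
    return [rec for rec, res in zip(request_records, response["Records"])
            if "ErrorCode" in res]

request = [{"Data": b"a", "PartitionKey": "1"},
           {"Data": b"b", "PartitionKey": "1"},
           {"Data": b"c", "PartitionKey": "1"}]

# Simulated PutRecords response: the middle record failed.
response = {
    "FailedRecordCount": 1,
    "Records": [
        {"SequenceNumber": "seq-1", "ShardId": "shardId-000000000000"},
        {"ErrorCode": "ProvisionedThroughputExceededException",
         "ErrorMessage": "Rate exceeded"},
        {"SequenceNumber": "seq-2", "ShardId": "shardId-000000000000"},
    ],
}

to_retry = failed_records(request, response)
```

Record b will only land in the shard after a and c, even though it was written between them in the request.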
Apache Kafka does better here. A producer batch is appended to the partition's log atomically: it either fully succeeds or fails as a whole. You won't end up with only the records from the middle of a batch added to the log segments.
Difference 2: data synchronization
This point is more related to the whole ecosystem. Amazon Kinesis comes with the Firehose component for data synchronization. It's a serverless delivery service that moves streaming data from a Kinesis Data Stream to other AWS services, such as S3, Redshift, or OpenSearch. Even though you can extend it to other places by leveraging the API Gateway destination, it's much less flexible than the Apache Kafka alternative.
In Kafka, you can use the Kafka Connect layer to write streaming data to other data stores, including AWS databases, but not only: the connector ecosystem covers dozens of sources and sinks.
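As an illustration, a Kafka Connect sink is typically just a JSON configuration submitted to the Connect REST API. The sketch below shows what a Confluent S3 sink connector configuration could look like; the topic name, bucket, and region are placeholders, and the exact property set depends on the connector version you deploy:

```json
{
  "name": "s3-sink-example",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-example-bucket",
    "s3.region": "eu-west-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```

Swapping the destination for another data store is mostly a matter of changing the connector class and its properties, which is exactly the flexibility Firehose lacks.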
Difference 3: scaling
Amazon Kinesis Data Streams supports both scaling up and scaling down. The operation is called resharding and consists of splitting or merging shards (= partitions). Even though it may take time depending on the number of shards, it's possible to adapt the stream to the current business needs in both directions.
On the other hand, Apache Kafka doesn't come with that flexibility. There is the KIP-694 proposal to implement scaling in both directions, but it's still under discussion. For now, you can only add new partitions to a topic.
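To make resharding more concrete, here is a sketch of the key-range arithmetic behind a shard split. Each Kinesis shard owns a contiguous range of the 128-bit MD5 partition-key hash space, and splitting means picking a new starting hash key inside that range (an even split uses the midpoint); merging is the inverse operation on two adjacent shards:

```python
# The partition-key hash space covered by all shards of a stream.
MAX_HASH_KEY = 2**128 - 1

def split_point(starting_hash_key: int, ending_hash_key: int) -> int:
    """Midpoint of a shard's hash key range, i.e. the new starting
    hash key you'd pass to an even SplitShard operation."""
    return (starting_hash_key + ending_hash_key) // 2

# Splitting the single shard of a fresh stream in half:
mid = split_point(0, MAX_HASH_KEY)
```

After the split, records whose hashed partition key falls below the midpoint go to one child shard and the rest to the other, which is how the stream's throughput doubles.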
Difference 4: official SDK
I've only used the Java and Python libraries for Apache Kafka from Confluent, but I can say they covered all my needs, including data writing, data reading, buffering, and automatic retries. I can't say the same about the Amazon Kinesis Data Streams SDK, which leaves a lot of those things for you to implement on your own.
If you're a happy JVM user, you can use an alternative for data generation, the Kinesis Producer Library, which provides buffering, retries, and synchronous or asynchronous delivery. However, it's limited to JVM users, and if you need a similar tool in another language, you may have to rely on a community project. That's a valid choice too, but in the end there is always a risk that the maintainers retire or don't release as often as the main company behind a given product.
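To show what you end up writing yourself outside the JVM, here is a toy sketch of KPL-style buffering with retries. The `send_batch` callable is a hypothetical stand-in for the real batch API (it returns the number of failed records), so the sketch stays runnable without AWS:

```python
class BufferedProducer:
    """Minimal sketch of KPL-style buffering: collect records, flush
    them as one batch, and retry the batch on failure. `send_batch`
    is a hypothetical callable standing in for a real PutRecords-like
    call; it returns the number of records that failed."""

    def __init__(self, send_batch, max_buffer=500, max_retries=3):
        self.send_batch = send_batch
        self.max_buffer = max_buffer
        self.max_retries = max_retries
        self.buffer = []

    def put(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_buffer:
            self.flush()

    def flush(self):
        batch, self.buffer = self.buffer, []
        for _ in range(self.max_retries):
            if self.send_batch(batch) == 0:
                return  # whole batch delivered
            # a real implementation would back off and resend only
            # the failed records here
        raise RuntimeError("batch failed after retries")

# Demo with a fake transport that fails once, then succeeds.
attempts = []
def flaky_send(batch):
    attempts.append(len(batch))
    return 1 if len(attempts) == 1 else 0

producer = BufferedProducer(flaky_send, max_buffer=2)
producer.put("a")
producer.put("b")  # buffer full: flush, one retry, success
```

Backoff, per-record retries, and size-based limits are all left out here, which is precisely the kind of plumbing the KPL handles for JVM users out of the box.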
Difference 5: exactly-once
Apache Kafka offers exactly-once delivery semantics if you use transactional producers and consumers with the read_committed transaction isolation level.
This feature is not available in Amazon Kinesis Data Streams, where a record can only be delivered at least once. You may try to implement a kind of exactly-once by writing data to a DynamoDB table and streaming the inserts. Compared to the native Apache Kafka solution, it comes with some extra cost, though.
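The idea behind the DynamoDB workaround can be sketched with a toy in-memory stand-in for a conditional write (DynamoDB's attribute_not_exists condition): the first insert of a record id wins and replays are rejected, so only the inserts get streamed downstream, which makes at-least-once delivery behave like exactly-once processing:

```python
class DedupStore:
    """Toy stand-in for a DynamoDB table written with a conditional
    put: inserting an already-present record id is rejected, so
    duplicate deliveries are silently dropped."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, record_id, payload):
        if record_id in self.items:
            return False  # duplicate delivery, not re-processed
        self.items[record_id] = payload
        return True

store = DedupStore()
# "r1" is delivered twice by the at-least-once stream.
processed = [rid for rid in ["r1", "r2", "r1", "r3"]
             if store.put_if_absent(rid, "payload")]
```

The extra cost mentioned above is visible even in this toy: every record now requires a conditional write to a second storage system before it can be processed.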
Difference 6: compaction
Apache Kafka has a great compaction feature. In short, compaction deletes old records for the same key and keeps only the most recent - not-yet-compacted - ones available to the consumers. There is no such flexibility in Amazon Kinesis Data Streams, where the service removes all records once they pass the configured retention period (TTL).
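The effect of compaction can be modeled in a few lines. This simplified sketch keeps only the latest value per key and ignores the details of real Kafka compaction (tombstones, segment boundaries, the active segment being exempt):

```python
def compact(log):
    """Keep only the most recent record for each key, preserving the
    log order of the surviving records (a simplified model of Kafka
    log compaction)."""
    latest = {key: i for i, (key, _) in enumerate(log)}
    return [(k, v) for i, (k, v) in enumerate(log) if latest[k] == i]

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2")]
compacted = compact(log)  # user1's first value is gone
```

A consumer reading the compacted topic still sees at least the latest state of every key, which is what makes compacted topics usable as changelogs; a TTL-based stream like Kinesis simply drops everything, latest state included.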
Is one better than the other? No, it all depends on your context. If you're working with a small team without many ops skills, you can choose Amazon Kinesis. Despite the shortcomings presented in this blog post, you will be able to focus only on reading and writing data; AWS will take care of the rest. On the other hand, if you do care about flexibility and have a team with a strong ops culture, Apache Kafka will probably be a better pick.