Open source tools helped me a lot when I switched to the cloud world. Managed cloud services often share the same fundamentals as their open source alternatives. However, there is always something different. Today I'll focus on these differences between the Amazon Kinesis service and the Apache Kafka ecosystem.
Difference 1: delivery guarantee for producers
The first big difference is the delivery semantics for producers. Amazon Kinesis Data Streams provides 2 APIs for producers. The first is PutRecord, which sends only 1 record per request. It won't work well at scale because, as you may already know, one of the best things to do when you have a lot of data to deliver is to use batch requests. A batch request groups multiple records in a single call, so the job spends less time on managing network connections.
Kinesis also provides a batch version called PutRecords. It's great, but if you care about the order of the written records, you won't be that happy:
The response Records array includes both successfully and unsuccessfully processed records. Kinesis Data Streams attempts to process all records in each PutRecords request. A single record failure does not stop the processing of subsequent records. As a result, PutRecords doesn't guarantee the ordering of records. If you need to read records in the same order they are written to the stream, use PutRecord instead of PutRecords, and write to the same shard.
Source: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
Apache Kafka does better here. A batch written to a partition either fully succeeds or fails as a whole. You won't end up with only some records from the middle of the batch added to the log segments.
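To make the PutRecords behavior more concrete, here is a minimal sketch of the retry loop you typically have to write yourself. It assumes a boto3-style Kinesis client passed in as `client`; the function name, the backoff values, and the number of attempts are my own illustrative choices, not anything prescribed by AWS. The response fields (`FailedRecordCount`, `Records`, `ErrorCode`) are the ones described in the API reference quoted above.

```python
import time


def put_records_with_retry(client, stream_name, records,
                           max_attempts=3, backoff_seconds=0.1):
    """Send records with PutRecords and resend only the rejected entries.

    `client` is assumed to expose a boto3-style put_records method.
    Returns the records that were still rejected after the last attempt.
    """
    pending = list(records)
    for attempt in range(max_attempts):
        if not pending:
            break
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response.get("FailedRecordCount", 0) == 0:
            return []
        # Result entries come back in the same order as the request entries,
        # and a rejected entry carries an ErrorCode instead of a SequenceNumber.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if pending and attempt < max_attempts - 1:
            # Simple exponential backoff before resending (hypothetical values).
            time.sleep(backoff_seconds * (2 ** attempt))
    return pending
```

Note that a rejected record gets written after the records that followed it in the original request, which is exactly why the batch API can't promise ordering.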
Difference 2: data synchronization
This point is more related to the whole ecosystem. Amazon Kinesis comes with a Firehose component for data synchronization. It's a serverless synchronization service that delivers the streaming data from Kinesis Data Streams to other AWS services, such as S3, Redshift, or OpenSearch. Even though you can extend it to other places by leveraging the API Gateway destination, it's much less flexible than the Apache Kafka alternative.
In Kafka you can use the Kafka Connect layer to write streaming data to other data stores, including, but not limited to, the AWS databases.
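As a rough illustration of how little glue Kafka Connect needs, the sketch below builds the REST request that registers a sink connector on a Connect worker. It uses the simple file sink connector that ships with Apache Kafka (depending on your version it may have to be added to the worker's plugin path); the worker URL, connector name, topic, and output file are assumptions made up for the example. Swapping the connector class and its properties is what lets you target another data store.

```python
import json
import urllib.request


def build_connector_request(connect_url, name, topic, output_file):
    """Build (but do not send) the HTTP request registering a sink connector."""
    payload = {
        "name": name,
        "config": {
            # File sink bundled with Apache Kafka; other sinks use their own class.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "file": output_file,
        },
    }
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical worker address, connector name, topic, and file path.
request = build_connector_request(
    "http://localhost:8083", "visits-file-sink", "visits", "/tmp/visits.txt"
)
# urllib.request.urlopen(request) would submit it to a running Connect worker.
```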
Difference 3: scaling
Amazon Kinesis Data Streams supports both scaling up and scaling down. The operation is called resharding and consists of splitting or merging the shards (= partitions). Even though it may take time depending on the number of shards, it lets you adapt the stream to the current business needs in both directions.
On the other hand, Apache Kafka doesn't come with that flexibility. There is a proposal, KIP-694, to support scaling in both directions, but it's still under discussion. For now you can only add new partitions to a topic.
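For the Kinesis side, a shard owns a contiguous range of 128-bit hash keys, and splitting it means choosing where the second child shard starts. The sketch below computes an even split point; the helper names are mine, and the commented boto3 call uses a hypothetical stream name and shard id.

```python
MAX_HASH_KEY = 2**128 - 1  # Kinesis hash keys are 128-bit unsigned integers


def split_point(starting_hash_key, ending_hash_key):
    """Return the first hash key of the second child shard for an even split."""
    if starting_hash_key >= ending_hash_key:
        raise ValueError("a shard covering a single hash key cannot be split")
    # Rounding up guarantees that the first child keeps at least one hash key.
    return (starting_hash_key + ending_hash_key + 1) // 2


def child_ranges(starting_hash_key, ending_hash_key):
    """Return the (start, end) hash key ranges of the two child shards."""
    middle = split_point(starting_hash_key, ending_hash_key)
    return (starting_hash_key, middle - 1), (middle, ending_hash_key)


new_starting_hash_key = split_point(0, MAX_HASH_KEY)
# With a boto3 Kinesis client (assumed to exist as `kinesis`), the value is
# passed as a string:
# kinesis.split_shard(StreamName="visits",
#                     ShardToSplit="shardId-000000000000",
#                     NewStartingHashKey=str(new_starting_hash_key))
```

Merging works the other way round: two adjacent shards are combined into one that covers both hash key ranges.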
Difference 4: official SDK
I've only used the Java and Python libraries for Apache Kafka from Confluent, but I can say they covered all my needs, including data writing, data reading, buffering, and automatic retries. I can't say the same for the Amazon Kinesis Data Streams SDK, which leaves a lot of those things for you to implement on your own.
If you're a happy JVM user, you can use an alternative for data generation, the Kinesis Producer Library, which provides buffering, retries, and synchronous or asynchronous delivery. However, it's limited to JVM users, and if you need a similar tool in other languages, you may need to rely on a community project. It's a valid choice too, but there will always be a risk that the maintainers retire or don't release as often as the main company behind a given product.
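To give an idea of the plumbing the plain SDK leaves to you outside the JVM, here is a minimal buffering sketch that groups records into batches small enough for one PutRecords call. The 500 records per request limit comes from the AWS documentation; the helper name is my own, and the sketch deliberately ignores the payload size limits that a real implementation would also have to respect.

```python
from itertools import islice

MAX_RECORDS_PER_REQUEST = 500  # documented PutRecords limit per call


def batches(records, batch_size=MAX_RECORDS_PER_REQUEST):
    """Yield lists of at most batch_size records, preserving the input order."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be handed to a PutRecords call, with the rejected entries resent separately.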
Difference 5: exactly-once
Apache Kafka offers exactly-once delivery semantics if you use transactional producers and consumers with the read_committed isolation level.
This feature is not available for Amazon Kinesis Data Streams, where a record can only be delivered at least once. You may try to implement a kind of exactly-once by writing data to a DynamoDB table and streaming the inserts. Compared to the native Apache Kafka solution, it comes with some extra cost, though.
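The Kafka side of this comparison can be sketched as follows. The function expects a producer exposing the transactional methods of the confluent-kafka Python client, created with a `transactional.id` and already initialized with `init_transactions()`; the function name, the broker address, and the group id are assumptions for illustration. Aborting on any exception is a simplification, since a real job would distinguish retriable errors from fatal ones.

```python
def send_atomically(producer, topic, records):
    """Write all (key, value) pairs in a single transaction, or none of them.

    `producer` is assumed to be a transactional producer on which
    init_transactions() has already been called once.
    """
    producer.begin_transaction()
    try:
        for key, value in records:
            producer.produce(topic, key=key, value=value)
        producer.commit_transaction()
    except Exception:
        # Simplification: every failure aborts the whole transaction.
        producer.abort_transaction()
        raise


# Consumer settings (assumed values) that hide the records of aborted or
# still-open transactions from the reader.
READ_COMMITTED_CONSUMER_CONFIG = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "visits-reader",
    "isolation.level": "read_committed",
    "enable.auto.commit": False,
}
```

A consumer created with this configuration only sees the records once the transaction that wrote them has been committed.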
Difference 6: compaction
Apache Kafka has this great compaction feature. In short, compaction deletes old records for the same key and keeps only the most recent record per key, along with the records that haven't been compacted yet, available for the consumers. There is no such flexibility in Amazon Kinesis Data Streams, where the service removes all the records older than the configured retention period.
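The difference between both cleanup strategies can be illustrated with a toy, in-memory model. This is only a simplified simulation with made-up function names, not how the broker implements it: real Kafka compacts in the background, never touches the active segment, and keeps delete markers (records with a null value) for a configurable time before dropping them.

```python
def compact(log):
    """Keep only the latest record per key from a list of (key, value) pairs.

    Offsets are the positions in the input list. A key whose latest value
    is None (a delete marker) disappears from the result.
    """
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


def expire(log, now, retention_seconds):
    """Time-based cleanup: drop (timestamp, key, value) records that are too old."""
    return [record for record in log if now - record[0] < retention_seconds]
```

With compaction, the last known value of each key survives regardless of its age, while the retention-based cleanup removes a record as soon as it's too old, even if it's the only one for its key.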
Is one better than the other? No, it all depends on your context. If you're working with a small team without many ops skills, you can choose Amazon Kinesis. Despite the shortcomings presented in this blog post, you will be able to focus only on data reading and writing. AWS will take care of the rest. On the other hand, if you care about flexibility and have a team with a strong ops culture, Apache Kafka will probably be a better pick.