Infoshare 2024: Stream processing fallacies, part 2

This blog post shares the last fallacies from my 7-year stream processing journey.

I'm proud and a bit ashamed at the same time. Proud because this time there are only 3 fallacies. Ashamed because they might not have happened if I had worked on those projects a bit later in my journey.

Fallacy 6: by the book

Context: at the time of this project, I was still under the impression that the Kappa architecture was winning the battle with the Lambda architecture. One of Kappa's promises is to base everything on the streaming broker. While that works great for real-time processing, it turns out not to be that great for less frequent tasks.

Fallacy: a newly released job version was silently ignoring a small part of the processed events. As we needed all of them for the underlying analytics stack, the loss was not acceptable. Unfortunately, the issue was only spotted after two weeks and was fixed by restarting the pipeline from the past. But past 18 days (don't ask me why this number...), the replay would no longer have been possible.

Reason: the Kappa architecture recommends running any backfilling job as a concurrent application executed on top of the same streaming data source. There are some issues with that approach. In short, retaining data on streaming brokers is more expensive than on the usual low-cost stores, such as object stores. Besides, data on the streaming broker cannot be optimized for reading in any way. On the other hand, when you process the same data from a data lake or lakehouse storage, you can leverage the columnar storage to optimize the read part.

Tiered storage

Modern streaming brokers, including Apache Kafka (KAFKA-7739), Apache Pulsar, and Redpanda, natively support storing a part of the log segments in an object store. However, the format may still not be suited to direct querying, as you cannot apply custom partitioning, for example. Besides, querying may still require going through the broker's API.

Solution: an approach that didn't disrupt the main execution flow was to run the reprocessing pipeline from the data lake, targeting the processing time window between the release of the failing job version and the release of the fixed one. Although it introduced some duplicates at the time boundaries, that was not an issue for the downstream consumers (occasional duplicates were already part of the contract with them).
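To give you an idea, below is a minimal sketch of such a backfill with Apache Spark, assuming the events land in the data lake as Parquet with a processing_time column; the paths, timestamps, and column names are hypothetical and not the ones from the original project:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object BackfillFromDataLake {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("backfill-failing-version-window")
      .getOrCreate()

    // Hypothetical boundaries: release of the failing version and of the fix
    val failingVersionReleasedAt = "2024-05-02 09:00:00"
    val fixedVersionReleasedAt = "2024-05-16 11:00:00"

    // Read the impacted window from the columnar data lake copy instead of the
    // broker; the filter limits the scan to the affected processing time range
    val impactedEvents = spark.read
      .parquet("s3://data-lake/events/") // hypothetical location
      .where(col("processing_time").between(failingVersionReleasedAt, fixedVersionReleasedAt))

    // Apply the fixed transformation logic here (job-specific), then write to
    // the analytics sink; duplicates at the window boundaries are tolerated downstream
    impactedEvents.write
      .mode("append")
      .parquet("s3://analytics/reprocessed-events/")
  }
}
```

The key point is that the columnar format and the time-based filter keep the reprocessing cheap, while the regular streaming job keeps running untouched on the broker.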

Fallacy 7: gotta catch 'em all

Context: an Apache Spark Structured Streaming job. It had been working great for several weeks in a staging environment until a data provider issue stopped it for the weekend. After the restart, the job's performance degraded 5x.

Fallacy: as a consequence of the failure, the streaming consumer was continuously failing because of memory issues.

Reason: the consumer didn't use any throughput control mechanism, i.e. it was trying to read all the data available at a given moment with the same cluster capacity.

Solution: setting a throughput limitation (maxOffsetsPerTrigger in that case). A drawback was the static character of this value, which needed to be updated manually after each scaling action. Besides, to catch up on the backlog, per-partition parallelism (minPartitions) was added alongside doubling the cluster capacity.
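In practice, both settings are options of the Kafka source in Spark Structured Streaming. Here is a minimal sketch; the broker addresses, topic name, sink, and the exact values are hypothetical and depend on your cluster capacity:

```scala
import org.apache.spark.sql.SparkSession

object ThrottledKafkaConsumer {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("throttled-kafka-consumer")
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092") // hypothetical brokers
      .option("subscribe", "events")                      // hypothetical topic
      // Cap how many records each micro-batch reads so that a restart after a
      // long outage doesn't try to swallow the whole backlog at once
      .option("maxOffsetsPerTrigger", "500000")
      // Split Kafka partitions into more Spark input partitions to increase
      // read parallelism while catching up
      .option("minPartitions", "48")
      .load()

    val query = events
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console") // placeholder sink; the real job writes elsewhere
      .option("checkpointLocation", "/tmp/checkpoints/throttled-kafka-consumer")
      .start()

    query.awaitTermination()
  }
}
```

maxOffsetsPerTrigger bounds the volume of each micro-batch, while minPartitions adds parallelism for the catch-up phase; neither adapts automatically to scaling actions, hence the manual updates mentioned above.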

Fallacy 8: where is my data?!

Context: a streaming job failed early on Friday night. After restarting it manually, it failed with a data consistency exception saying that the next data planned for processing didn't exist anymore.

Fallacy: a part of the data to process had disappeared. As a result, we lost it.

Reason: the retention period was set to only 3 days. Although that was enough to survive a weekend failure, this time the weekend was extended by a public holiday on Monday. Besides, we didn't have any pipeline continuously synchronizing the streaming data to a lake-like storage.

Solution: either increase the retention period, at least enough to cover longer out-of-office periods, or use a lake-like storage for backfilling purposes.
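For the second option, a continuous synchronization pipeline can be as simple as a streaming job that archives the raw records to partitioned columnar files. Below is a minimal sketch with Spark Structured Streaming; the brokers, topic, and storage paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object KafkaToLakeSync {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-lake-sync")
      .getOrCreate()

    val rawEvents = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092") // hypothetical brokers
      .option("subscribe", "events")                      // hypothetical topic
      .option("startingOffsets", "earliest")
      .load()

    // Keep the raw payload plus the broker timestamp so that later replays can
    // target a precise time window, independently of the broker's retention
    val archived = rawEvents
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")
      .withColumn("event_date", to_date(col("timestamp")))

    archived.writeStream
      .format("parquet")
      .partitionBy("event_date") // daily partitions keep backfill scans cheap
      .option("path", "s3://data-lake/events-archive/")                    // hypothetical path
      .option("checkpointLocation", "s3://data-lake/_checkpoints/events-archive/")
      .start()
      .awaitTermination()
  }
}
```

With such an archive in place, a broker outage longer than the retention period no longer means data loss, since the missing window can be backfilled from the lake, as in Fallacy 6.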

Pretty happy to share that that's all for my fallacies. I'm not superhuman and there will probably be new ones. Hopefully, I won't write a follow-up blog post sooner than in another 7 years ;-)

