Infoshare 2024: Stream processing fallacies, part 1

Last week I was speaking in Gdansk on the DataMass track at Infoshare. As it often happens, the talk time slot impacted what I wanted to share, but maybe it's for the better. Otherwise, you wouldn't be reading about stream processing fallacies!


This is a short blog post series about the stream processing fallacies I've faced so far. The word "fallacy" is there on purpose: it describes painful moments but sounds less scary than a problem or a failure.

Fallacy 1: logs

Context: one of my first data projects ever. The main challenge was purely related to poor input data quality, such as format inconsistencies and different schemas to manage in a single job. I didn't expect to deal with other kinds of issues for this pipeline.

Fallacy: However, coming back to the office after the weekend, I found the job in a failed state. The weird thing was that the logs didn't show any data quality runtime exceptions. Instead, I found one unhealthy node in the cluster dashboard and, by connecting the dots with other, more detailed dashboards, discovered that its disk was full.

Reason: It was weird because the job didn't write anything to disk besides... the logs! A quick look at the job's status on the Hadoop cluster (it was running on YARN) showed that the streaming job, which had been running perfectly for weeks, had generated a log file that was simply too big.

Solution: Store fewer logs: rotate them and, if possible, offload them to a more flexible storage. Also take care of everything else that accumulates over time, such as caches or downloaded packages.
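
If the job runs on Apache Spark, one way to keep the logs under control is the built-in executor log rolling. Below is a minimal sketch, assuming a Spark job with the default logging setup; the property names come from the Spark configuration reference, and offloading the rolled files to an external storage (for example via YARN log aggregation) remains a separate step:

```python
from pyspark.sql import SparkSession

# A sketch only, not the original job's configuration: cap executor log growth
# by rolling the log files and keeping only a few of them on disk.
spark = (SparkSession.builder
    .appName("streaming-job-with-log-rotation")
    .config("spark.executor.logs.rolling.strategy", "size")                  # roll by file size
    .config("spark.executor.logs.rolling.maxSize", str(128 * 1024 * 1024))   # 128 MB per file
    .config("spark.executor.logs.rolling.maxRetainedFiles", "5")             # older files get deleted
    .getOrCreate())
```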

Fallacy 2: metadata operation

Context: an Apache Spark Structured Streaming job processing files stored on an object store. It was performing great in the first months but, all of a sudden, the alerting system notified me about I/O peaks in the database and increased latency.

Fallacy: the job was running perfectly fine in terms of compute resources. It was 100% busy and the data ingestion process remained performant as well. The problem was elsewhere.

Reason: after digging into the job dashboard, I found a step that was listing the objects present in the input location of the job. Unfortunately, the input location was a bucket with a long retention period and many small objects. Overall, the listing operation, which by the way was necessary to discover new files to process, was taking up to 1.5 minutes instead of the 20 seconds from the first months.

Solution: listing files is a common operation in batch systems but it doesn't belong to the streaming world, where changes are streamed. To mitigate the issue, the new job streamed the file names from a queue and later processed them with the files API.
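
A minimal sketch of that approach, assuming the producers publish the new file names to a Kafka topic; the topic, broker, bucket names, and formats below are made up, and the Kafka source needs the spark-sql-kafka package on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-names-from-queue").getOrCreate()

# Stream only the announcements of new files instead of listing the bucket.
file_names = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "new-files")
    .load()
    .selectExpr("CAST(value AS STRING) AS file_path"))

def process_new_files(batch_df, batch_id):
    # The list of announced paths per micro-batch is small, so collecting is fine here.
    paths = [row.file_path for row in batch_df.collect()]
    if not paths:
        return
    # Read only the announced objects and append them to the output dataset.
    new_data = spark.read.json(paths)
    new_data.write.mode("append").parquet("s3://output-bucket/processed/")

query = (file_names.writeStream
    .foreachBatch(process_new_files)
    .option("checkpointLocation", "s3://output-bucket/checkpoints/file-names/")
    .start())
```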

Fallacy 3: Serverless is fancy

Context: a project with a serverless solution for a stateful job, meant to be replicated for other providers. Nothing difficult at first glance, and pretty exciting, as it was my first stateful serverless pipeline.

Fallacy: But before implementing the pipeline elsewhere, I was working on some runtime issues that exposed serious limitations of the solution, such as difficult debugging and deployment, complex monitoring, and costs that weren't as low as expected.

Reason: the components were too decoupled. Moreover, the other data providers didn't share the same near real-time latency requirement. It turned out they were fine with a one-hour latency.

Solution: use a simpler batch pipeline that processes data on top of an object store used as the data lake backend.
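
For illustration only, a minimal sketch of such a batch pipeline; the bucket names, partition values, and the deduplication key are hypothetical and would come from the scheduler and the actual data model:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-batch-pipeline").getOrCreate()

# Read one hourly partition of raw events from the object store...
events = spark.read.json("s3://raw-bucket/events/dt=2024-05-27/hour=10/")

# ...apply the (simplified) cleaning logic...
cleaned = (events
    .dropDuplicates(["event_id"])                      # hypothetical idempotency key
    .withColumn("loaded_at", F.current_timestamp()))

# ...and write it back to the data lake layout.
(cleaned.write
    .mode("overwrite")
    .parquet("s3://lake-bucket/events/dt=2024-05-27/hour=10/"))
```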

Fallacy 4: X = Z

Context: I used to say that one tool (Amazon Kinesis Data Streams in that case) is the same as its Open Source alternative (Apache Kafka).

Fallacy: even though they look similar on the surface, they don't behave the same way. The difference may be crucial if you have specific project requirements.

Reason: there are many. Kinesis doesn't have the same semantics for bulk requests (Kafka is all-or-nothing, Kinesis can have partial failures). Kinesis sequence numbers are not Apache Kafka offsets, as they keep increasing over time even when no data is written. Finally, the synchronous mode of the Kinesis Producer Library, which was a recommended tool for writing to Kinesis from the JVM, uses a sleep(...) call to block. It can have a strong latency impact.
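
To illustrate the partial-failure semantics, here is a minimal sketch using boto3's put_records with a hypothetical stream name and records: the API call itself can succeed while some records in the batch fail, so the response has to be inspected and the failed records retried.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

records = [
    {"Data": b'{"id": 1}', "PartitionKey": "provider-a"},
    {"Data": b'{"id": 2}', "PartitionKey": "provider-b"},
]

response = kinesis.put_records(StreamName="events-stream", Records=records)

if response["FailedRecordCount"] > 0:
    # The response entries keep the position of the input records; the failed
    # ones carry an ErrorCode instead of a SequenceNumber.
    to_retry = [records[i] for i, result in enumerate(response["Records"])
                if "ErrorCode" in result]
    # A real producer would retry with a backoff; this sketch only shows the check.
    kinesis.put_records(StreamName="events-stream", Records=to_retry)
```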

Solution: be more careful with superficial comparisons. Always evaluate a tool in the context of the end use case by considering its pros and cons.

Fallacy 5: Late data killed the stream star

With reference to The Buggles' "Video Killed the Radio Star".

Context: an Apache Spark Structured Streaming job doing just fine. Unfortunately, one day data producers had some connectivity issues and some of them sent their events several hours late, all at once.

Fallacy: as the job was I/O-bound, all that data coming from the same provider led to throttled, hence slower, writes. As a result, all the providers were impacted by this latency, despite a control configuration on the maximum number of incoming records.

Reason: Apache Spark Structured Streaming processes data in blocking micro-batches. As a result, new data cannot be processed as long as the previous micro-batch hasn't completed. Due to the increased latency for several providers, their slow tasks delayed the whole pipeline.

Solution: define a threshold for the volume of events that can be processed in a single run and store all the extra ones in a dedicated storage space (a form of backpressure).
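
A minimal sketch of that idea for a Structured Streaming file source; the paths, schema, and limits are hypothetical. Each micro-batch keeps at most a fixed number of events per provider and parks the overflow in a dedicated location for a later catch-up job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

MAX_EVENTS_PER_PROVIDER = 100_000  # made-up threshold per provider and micro-batch

spark = SparkSession.builder.appName("late-data-backpressure").getOrCreate()

events = (spark.readStream
    .format("json")
    .schema("provider STRING, event_id STRING, payload STRING, event_time TIMESTAMP")
    .option("maxFilesPerTrigger", 50)   # also cap the number of input files per micro-batch
    .load("s3://input-bucket/events/"))

def apply_threshold(batch_df, batch_id):
    # Rank the events of each provider and split the batch at the threshold.
    ranked = batch_df.withColumn(
        "rank",
        F.row_number().over(Window.partitionBy("provider").orderBy("event_time")))
    on_time = ranked.filter(F.col("rank") <= MAX_EVENTS_PER_PROVIDER).drop("rank")
    overflow = ranked.filter(F.col("rank") > MAX_EVENTS_PER_PROVIDER).drop("rank")
    on_time.write.mode("append").parquet("s3://output-bucket/processed/")
    overflow.write.mode("append").parquet("s3://output-bucket/overflow/")

query = (events.writeStream
    .foreachBatch(apply_threshold)
    .option("checkpointLocation", "s3://output-bucket/checkpoints/backpressure/")
    .start())
```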

That's all for the first fallacies. In the next post I'll share the next ones!

