Apache Spark Structured Streaming articles

Spark Declarative Pipelines 101

One of the biggest changes to the Apache Spark Structured Streaming API over the past few years is undoubtedly the introduction of the declarative API, AKA Spark Declarative Pipelines. This post kicks off a three-part series dedicated to this new functionality. By the end of these articles, you will be able to effectively leverage declarative programming in your workflows and gain a deeper understanding of what happens under the hood when you do.

Continue Reading →

Listener or checkpoints for external progress tracking?

Among the many ways to track the progress of Apache Spark Structured Streaming jobs, you'll find custom offset extraction and ingestion into your observability tool of choice. Although this approach sounds simple, it can be implemented in several ways. In this blog post, we're going to discuss two of them: a custom batch job and a custom listener.
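To make the "custom batch job" option a bit more concrete, here is a minimal sketch that extracts Kafka offsets from the text of a checkpoint offset file. The file layout used below (a version line, a batch metadata line, then one JSON line per source) and the `visits` topic are assumptions made for illustration, so check them against your own checkpoint before relying on this.

```python
import json

# Hypothetical content of a single file from the checkpoint's offsets directory.
# Assumed layout: version marker, batch metadata, then one offsets line per source.
SAMPLE_OFFSET_FILE = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1700000000000}
{"visits":{"0":120,"1":98}}
"""


def extract_kafka_offsets(file_content):
    """Return {(topic, partition): offset} for every source line of the file."""
    lines = [line for line in file_content.splitlines() if line.strip()]
    source_lines = lines[2:]  # skip the version marker and the batch metadata
    offsets = {}
    for line in source_lines:
        for topic, partitions in json.loads(line).items():
            for partition, offset in partitions.items():
                offsets[(topic, int(partition))] = offset
    return offsets


offsets = extract_kafka_offsets(SAMPLE_OFFSET_FILE)
print(offsets)
```

A batch job built on this idea would read the most recent offset file and push the extracted values to the observability tool.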

Continue Reading →

State schema evolution in arbitrary stateful processing

Schema evolution is a widespread topic in data engineering. You'll face it in batch whenever you need to modify an output table. You'll face it in streaming whenever you need to change the structure of the records written to your Apache Kafka topic. Ultimately, you'll also face it in stateful processing whenever you need to change the schema of your state. This last aspect will be the topic of our blog post.

Continue Reading →

Apache Spark Structured Streaming UI patterns

When you start a Structured Streaming job, your Spark UI gets a new tab in the menu where you can follow the progress of the running jobs. At first, this part may appear a bit complex, but there are some visual detection patterns that can help you understand what's going on.

Continue Reading →

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - batch

To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming, let's focus on its... batch counterpart!

Continue Reading →

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - internals

Last week we discovered the new way to write arbitrary stateful transformations in Apache Spark 4 with the transformWithState API. Today it's time to delve into the implementation details and try to understand the internal logic a bit better.

Continue Reading →

What's new in Apache Spark 4.0 - Arbitrary state API v2 - introduction

Arbitrary stateful processing has been evolving a lot in Apache Spark. The initial version with updateStateByKey evolved to mapWithState in Apache Spark 1.6. When Structured Streaming was released, the framework got mapGroupsWithState and flatMapGroupsWithState. Now, Apache Spark 4 introduces a completely new way to interact with the arbitrary stateful processing logic, the Arbitrary state API v2!

Continue Reading →

Dealing with quotas and limits - Apache Spark Structured Streaming for Amazon Kinesis Data Streams

Using cloud managed services is often a love-hate story. On the one hand, they abstract a lot of tedious administrative work to let you focus on the essentials. On the other, they often have quotas and limits that you, as a data engineer, have to take into account in your daily work. These limits become even more serious when they operate in a latency-sensitive context, such as stream processing.

Continue Reading →

Event time skew and global watermark in Apache Spark Structured Streaming

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment. Now the watermark topic is back in my learning backlog, and it's a good opportunity to return to the event skew topic and see the dangers it brings for Structured Streaming stateful jobs.
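As a quick illustration of the danger, here is a simplified sketch of how a single watermark shared by the whole query interacts with a lagging partition. It assumes the watermark is the maximum event time seen so far minus the allowed delay, applied to the next micro-batch. The function names and timestamps are hypothetical, and the real late-data handling in Structured Streaming has more nuances than this model.

```python
from datetime import datetime, timedelta


def next_watermark(event_times, delay, current_watermark=None):
    """Max event time of the batch minus the delay; never moves backwards."""
    candidate = max(event_times) - delay
    if current_watermark is None:
        return candidate
    return max(current_watermark, candidate)


def split_late(event_times, watermark):
    """Separate records at or after the watermark from the ones behind it."""
    on_time = [t for t in event_times if t >= watermark]
    late = [t for t in event_times if t < watermark]
    return on_time, late


delay = timedelta(minutes=10)

# Micro-batch 1: only the fast partition delivers data.
batch_1 = [datetime(2024, 1, 1, 10, 30)]
watermark = next_watermark(batch_1, delay)  # 10:20

# Micro-batch 2: the slow (skewed) partition finally delivers older events.
batch_2 = [
    datetime(2024, 1, 1, 10, 5),   # slow partition
    datetime(2024, 1, 1, 10, 15),  # slow partition
    datetime(2024, 1, 1, 10, 35),  # fast partition
]
on_time, late = split_late(batch_2, watermark)
print("watermark:", watermark)
print("late records:", late)
```

Both records from the slow partition fall behind the watermark set by the fast one, which is exactly the situation a stateful job has to be prepared for.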

Continue Reading →

DAIS 2024: Unit tests - configuration and declaration

Code organization and assertion flow are both important, but even they can't guarantee your colleagues' adherence to the unit tests. There are other user-facing attributes to consider as well.

Continue Reading →

DAIS 2024: Orchestrating and scoping assertions in Apache Spark Structured Streaming

Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests on the job logic are not. After all, streaming jobs often iteratively build the final outcome while the batch ones generate it in a single pass.

Continue Reading →

DAIS 2024: Testing framework from the Dataflow model for Apache Spark Structured Streaming

With this blog post I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot, as the previous DAIS with me as a speaker was 4 years ago! As before, I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of time constraints.

Continue Reading →

OutputModes in Apache Spark Structured Streaming - complementary notes

I wrote a blog post about OutputModes 6 (yes!) years ago and after reading it a few times, I realized it was not good enough to be a quick refresher. For that reason you can read about OutputModes for the second time here. Hopefully, this one will be a good try!

Continue Reading →

Stopping a Structured Streaming query

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. After all, sometimes you may need to release a new job package with upgraded dependencies or improved business logic. What happens then?

Continue Reading →

StreamingQueryListener, from states to questions

Apache Spark leverages the observer design pattern for framework-to-code communication. One of the consumer implementations is StreamingQueryListener.

Continue Reading →

Processing time trigger, to be or not to be?

That's the question. Omitting the processing time trigger means more reactive micro-batch triggering, but it cannot be considered the one true best practice. Let's see why.

Continue Reading →

Anatomy of a Structured Streaming job

Apache Spark Structured Streaming relies on the micro-batch pattern, which evaluates the same query in each execution. That's only a high-level view, though. Under the hood, many other interesting things happen.

Continue Reading →

Min rate limits for Apache Kafka

I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Did you know that you can also control the lower limit, at least for Apache Kafka?
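To give a rough idea of how a lower limit can work, here is a simplified sketch. The option names `minOffsetsPerTrigger`, `maxOffsetsPerTrigger`, and `maxTriggerDelay` come from the Kafka source, but the values and the `should_trigger` function are hypothetical. The function is only my approximation of the decision: start a micro-batch when enough new offsets are available, or when the maximum delay has elapsed.

```python
# Options you could pass to the Kafka reader; the values are only examples.
kafka_rate_options = {
    "minOffsetsPerTrigger": "1000",
    "maxOffsetsPerTrigger": "10000",
    "maxTriggerDelay": "5m",
}


def should_trigger(available_offsets, min_offsets_per_trigger,
                   seconds_since_last_batch, max_trigger_delay_seconds):
    """Simplified model of the lower rate limit, not the actual Spark code."""
    if available_offsets >= min_offsets_per_trigger:
        return True  # enough data accumulated
    # Not enough data: wait, unless we have already waited too long.
    return seconds_since_last_batch >= max_trigger_delay_seconds


# 400 new offsets after 30 seconds: keep waiting.
print(should_trigger(400, 1000, 30, 300))
# 400 new offsets after 5 minutes: process them anyway.
print(should_trigger(400, 1000, 300, 300))
# 1500 new offsets after 30 seconds: the minimum is reached.
print(should_trigger(1500, 1000, 30, 300))
```

The delay acts as a safety valve: without it, a quiet topic could keep the already available records waiting indefinitely.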

Continue Reading →

Static enrichment dataset with Delta Lake

Data enrichment is one of the most common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

Continue Reading →

Accumulators and reliability

In March I wrote a blog post showing how to use accumulators to track the application of each filter statement. It turns out the solution may not be perfect, as Aravind mentioned in one of the comments. I bet you already have an idea, but if not, keep reading. Everything will be clear in the end!

Continue Reading →