It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0, you can even write arbitrary stateful processing jobs! But since the API is a little bit different from the one available on the Scala side, I wanted to take a deeper look.
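To give you a first taste, here is a minimal sketch of the new applyInPandasWithState API; the column names and the counting logic are made up for the illustration:

```python
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Count the events seen so far for each user; the state is a 1-field tuple.
def count_events(key: Tuple[str], rows: Iterator[pd.DataFrame],
                 state: GroupState) -> Iterator[pd.DataFrame]:
    count = state.get[0] if state.exists else 0
    for pdf in rows:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"user_id": [key[0]], "events": [count]})

# input_stream: a streaming DataFrame with a user_id column (assumption)
output = (input_stream
    .groupBy("user_id")
    .applyInPandasWithState(count_events,
        outputStructType="user_id STRING, events LONG",
        stateStructType="events LONG",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout))
```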
There are probably fewer people working with flat files in Structured Streaming today than 5 years ago, thanks to the table file formats. However, if you are in this group and are still generating CSVs or JSONs with the streaming file sink, brace yourself: the memory problems are coming if you don't take action!
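If that's your case, even an innocent-looking pipeline like the sketch below is exposed; the paths are placeholders. Each micro-batch adds an entry to the sink's _spark_metadata log stored under the output path, and that log keeps the whole history:

```python
# events: any streaming DataFrame, e.g. read from Kafka (assumption)
query = (events.writeStream
    .format("json")
    .option("path", "/tmp/output")  # _spark_metadata will live here
    .option("checkpointLocation", "/tmp/checkpoint")
    .start())
```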
When you wrote your first arbitrary stateful processing pipelines, state expiration was maybe the first tricky point you had to deal with. Why is that? After all, it's just about setting a timeout, isn't it? Most of the time, yes, but there is an exception.
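The happy path looks deceptively simple; here is a hedged PySpark sketch using a processing time timeout (the group and state columns are assumptions):

```python
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState

# To be passed to applyInPandasWithState with
# timeoutConf=GroupStateTimeout.ProcessingTimeTimeout
def count_or_expire(key: Tuple[str], rows: Iterator[pd.DataFrame],
                    state: GroupState) -> Iterator[pd.DataFrame]:
    if state.hasTimedOut:
        # reached for a group whose timeout fired before this micro-batch
        state.remove()
        yield pd.DataFrame({"user_id": [key[0]], "expired": [True]})
    else:
        count = (state.get[0] if state.exists else 0) + sum(len(pdf) for pdf in rows)
        state.update((count,))
        state.setTimeoutDuration(30 * 1000)  # expire after 30s of inactivity
        yield pd.DataFrame({"user_id": [key[0]], "expired": [False]})
```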
You certainly know that the watermark (aka the GC watermark) is responsible for cleaning the state store in Apache Spark Structured Streaming. But you may not know that it's not the only time-based condition. There is a different one involved in the stream-to-stream joins.
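To set the scene, here is a classical stream-to-stream join; the datasets and columns are illustrative. The extra condition hinted at above is derived from the time-range constraint of the join:

```python
from pyspark.sql.functions import expr

# clicks, views: streaming DataFrames with timestamp columns (assumption)
clicks_w = clicks.withWatermark("click_time", "10 minutes")
views_w = views.withWatermark("view_time", "20 minutes")

# Besides the GC watermark, the time-range condition below also bounds
# how long each side's state is kept.
joined = views_w.join(
    clicks_w,
    expr("""
      view_ad_id = click_ad_id AND
      click_time BETWEEN view_time AND view_time + interval 1 hour
    """))
```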
Starting from Apache Spark 3.2.0, it is now possible to load an initial state into the arbitrary stateful pipelines. Even though the feature is easy to implement, it hides some interesting implementation details!
That's often a dilemma: should we put multiple sinks working on the same data source in the same Apache Spark Structured Streaming application or in different ones? Both solutions may be valid depending on your use case, but let's focus here on the former one, with multiple sinks living together.
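One common way to keep the sinks together is foreachBatch; a minimal sketch, with placeholder formats and paths:

```python
def write_to_both(batch_df, batch_id: int):
    # Cache to avoid recomputing the source once per sink.
    batch_df.persist()
    batch_df.write.format("parquet").mode("append").save("/tmp/sink1")
    batch_df.write.format("json").mode("append").save("/tmp/sink2")
    batch_df.unpersist()

# events: the shared streaming data source (assumption)
query = (events.writeStream
    .foreachBatch(write_to_both)
    .option("checkpointLocation", "/tmp/multi-sink-checkpoint")
    .start())
```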
The asynchronous progress tracking and the correctness issue fixes presented in the previous blog posts are not the only new features in Apache Spark Structured Streaming 3.4.0. There are many others, but to keep the blog post readable, I'll focus here on only 3 of them.
Apache Spark is infamous for its correctness issues with chained stateful operations. Fortunately, things improve with each release. The most recent one, 3.4.0, also brought some important changes in that field!
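As an example of such a chain, the 3.4.0 documentation shows windowed aggregations stacked on top of each other, with the new window_time function bridging them; a sketch with assumed columns:

```python
from pyspark.sql.functions import window, window_time

# events: streaming DataFrame with an event_time timestamp column (assumption)
chained = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"))
    .count()
    # a second stateful operator downstream of the first one
    .groupBy(window(window_time("window"), "10 minutes"))
    .count())
```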
Finally, the time has come to start the analysis of the new features in Apache Spark 3.4.0. The first one that grabbed my attention was the async progress tracking in Structured Streaming.
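In a nutshell, it's an opt-in pair of sink options; a sketch assuming a Kafka sink, which is where the feature lands first:

```python
# events: streaming DataFrame with a value column, as the Kafka sink expects
query = (events.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("topic", "output")
    .option("checkpointLocation", "/tmp/async-checkpoint")
    # offsets and commits are written asynchronously, off the critical path
    .option("asyncProgressTrackingEnabled", "true")
    # and only every 10 seconds instead of at every micro-batch
    .option("asyncProgressTrackingCheckpointIntervalMs", "10000")
    .start())
```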
Even though Project Lightspeed is not there yet, Apache Spark Structured Streaming 3.3.0 has several interesting features that should make your daily life easier.
Unit tests are the backbone of modern software, but they only verify a particular unit of the application. What can we do if we want to check the interaction between all these units? One of the solutions is automated integration tests. While they are relatively easy to implement against data at rest, they are more challenging for streaming scenarios.
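One possible PySpark harness combines a file source, the memory sink and processAllAvailable(); a sketch, with an assumed schema and a trivial pipeline under test:

```python
import json, os, tempfile

# spark: an existing SparkSession, e.g. coming from a pytest fixture
input_dir = tempfile.mkdtemp()

# 1. Arrange: drop a file into the directory watched by the stream.
with open(os.path.join(input_dir, "batch1.json"), "w") as f:
    f.write(json.dumps({"user_id": "a", "amount": 3}))

# 2. Act: run the pipeline under test against a memory sink.
stream = spark.readStream.schema("user_id STRING, amount INT").json(input_dir)
query = (stream.groupBy("user_id").sum("amount")
    .writeStream.format("memory").queryName("test_output")
    .outputMode("complete").start())
query.processAllAvailable()

# 3. Assert.
assert spark.sql("SELECT * FROM test_output").count() == 1
query.stop()
```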
The Structured Streaming micro-batch mode inherits a lot of features from the batch part. Apart from the retry mechanism presented previously, it also has the same auto-scaling logic, relying on the Dynamic Resource Allocation.
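The switches are the same ones you would set for a batch job; a sketch (the executor bounds are arbitrary):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("streaming-with-dra")
    # the same knobs as for a batch job
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    # removed executors must be able to hand their shuffle files over
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())
```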
Last year I wrote a blog post about broadcasting in Structured Streaming and I got an interesting question under one of the demo videos: what happens if the static dataset joined in broadcast mode gets new data? Let's check this out!
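The setup behind the question looks roughly like this; the dataset names and the join key are assumptions:

```python
from pyspark.sql.functions import broadcast

users = spark.read.parquet("/tmp/users")  # the static, broadcast side

# events: the streaming side (assumption)
enriched = events.join(broadcast(users), "user_id")

query = (enriched.writeStream.format("console")
    .option("checkpointLocation", "/tmp/broadcast-checkpoint")
    .start())
```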
Unexpected things happen and, sooner or later, any pipeline can fail. Fortunately, the errors are sometimes temporary and can be automatically recovered from after some retries. But what if the job is a streaming one? Let's see here how Apache Spark Structured Streaming handles task retries in the micro-batch and continuous modes!
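On the micro-batch side, the property at play is the classical batch one; a sketch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("retryable-streaming-job")
    # a single task can fail up to 4 times (3 retries) before its stage,
    # and hence the whole streaming query, is aborted
    .config("spark.task.maxFailures", "4")
    .getOrCreate())
```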
After previous blog posts focusing on 2 specific Structured Streaming features, it's time to complete the picture with a list of other changes made in the 3.2.0 version!
Initially I wanted to include the session windows in the blog post about Structured Streaming changes. But I changed my mind when I saw how many things they involve!
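For the declarative part, a single function does the job; a sketch with assumed columns:

```python
from pyspark.sql.functions import session_window

# events: streaming DataFrame with user_id and event_time columns (assumption)
sessions = (events
    .withWatermark("event_time", "10 minutes")
    # a session closes after 5 minutes of inactivity per user
    .groupBy("user_id", session_window("event_time", "5 minutes"))
    .count())
```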
It's big news for Apache Spark Structured Streaming users: RocksDB is now available as a state store backend in vanilla Spark!
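Enabling it boils down to one configuration entry; a sketch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # swap the default in-memory HDFS-backed provider for RocksDB
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .getOrCreate())
```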
The topic of this post was brought to me by Luan Carvalho, who shared with me an Open Source project connecting Apache Spark to the Apache Kafka Schema Registry. Initially I wanted to focus exclusively on the project, but on my way I discovered some other interesting points.
At first glance, the update operation in an arbitrary stateful application looks just like another map's put function. However, it has an impact on what happens later with the state store. In this blog post, you will see an example that can help you reduce the I/O pressure of the updates.
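To anticipate a bit, the trick boils down to not calling update blindly; a hedged sketch with the PySpark API (the counting logic is made up):

```python
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState

def update_if_changed(key: Tuple[str], rows: Iterator[pd.DataFrame],
                      state: GroupState) -> Iterator[pd.DataFrame]:
    new_events = sum(len(pdf) for pdf in rows)
    count = (state.get[0] if state.exists else 0) + new_events
    # update() marks the state as modified, so the state store has to
    # rewrite it; calling it only on real changes spares those writes
    if new_events > 0:
        state.update((count,))
    yield pd.DataFrame({"user_id": [key[0]], "events": [count]})
```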
If you've used the Apache Kafka source in Structured Streaming, you have undoubtedly noticed a property called maxOffsetsPerTrigger. According to the documentation, it sets a "limit on maximum number of offsets processed per trigger interval". My initial reaction to this property was: "Cool! We can enforce idempotent processing". I was not wrong, but the blog post will show you that I wasn't entirely right either!
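For the record, here is the property in action; the broker address and topic are placeholders:

```python
kafka_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    # each micro-batch reads at most 1000 records across all partitions
    .option("maxOffsetsPerTrigger", "1000")
    .load())
```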