Articles about Apache Spark Structured Streaming on waitingforcode.com - articles for the pleasure of learning and discovery

April 18, 2024 • Apache Spark Structured Streaming

Stopping a Structured Streaming query

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. After all, sometimes you may need to release a new job package with upgraded dependencies or improved business logic. What happens then?

March 22, 2024 • Apache Spark Structured Streaming

StreamingQueryListener, from states to questions

Apache Spark relies on the observer design pattern for framework-to-code communication. One of the observer interfaces you can implement is StreamingQueryListener.
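To make the observer pattern a bit more concrete, here is a minimal pure-Python sketch. It does not use Spark at all: `QueryListener`, `ListenerBus` and `RecordingListener` are hypothetical names invented for this illustration, and the three callbacks only loosely mirror the started, progress and terminated callbacks that StreamingQueryListener exposes.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class QueryListener(ABC):
    """Hypothetical observer interface (not a Spark class)."""

    @abstractmethod
    def on_query_started(self, query_id: str) -> None:
        ...

    @abstractmethod
    def on_query_progress(self, query_id: str, batch_id: int) -> None:
        ...

    @abstractmethod
    def on_query_terminated(self, query_id: str) -> None:
        ...


class ListenerBus:
    """Hypothetical subject: the framework side that notifies user code."""

    def __init__(self) -> None:
        self._listeners: List[QueryListener] = []

    def add_listener(self, listener: QueryListener) -> None:
        self._listeners.append(listener)

    def post(self, callback_name: str, *args) -> None:
        # Dispatch the event to every registered observer, in registration order.
        for listener in self._listeners:
            getattr(listener, callback_name)(*args)


class RecordingListener(QueryListener):
    """User-side observer that simply records what it was told."""

    def __init__(self) -> None:
        self.events: List[Tuple] = []

    def on_query_started(self, query_id: str) -> None:
        self.events.append(("started", query_id))

    def on_query_progress(self, query_id: str, batch_id: int) -> None:
        self.events.append(("progress", query_id, batch_id))

    def on_query_terminated(self, query_id: str) -> None:
        self.events.append(("terminated", query_id))
```

The framework owns the bus and decides when to post events; the user code only registers a listener and reacts, which is the framework-to-code direction the teaser describes.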

March 13, 2024 • Apache Spark Structured Streaming

Processing time trigger, to be or not to be?

That's the question. Without a processing time trigger, micro-batches are triggered more reactively, but that cannot be considered the one true best practice. Let's see why.

February 28, 2024 • Apache Spark Structured Streaming

Anatomy of a Structured Streaming job

Apache Spark Structured Streaming relies on the micro-batch pattern, which evaluates the same query in each execution. That's only the high-level view, though. Under the hood, many other interesting things happen.

February 21, 2024 • Apache Spark Structured Streaming

Min rate limits for Apache Kafka

I bet you know it already. You can limit the max throughput of Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. But did you know that you can also control the lower limit, at least for Apache Kafka?

January 24, 2024 • Apache Spark Structured Streaming

Static enrichment dataset with Delta Lake

Data enrichment is one of the common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

November 29, 2023 • Apache Spark Structured Streaming

Accumulators and reliability

In March I wrote a blog post showing how to use accumulators to track the application of each filter statement. It turns out the solution may not be perfect, as Aravind mentioned in one of the comments. I bet you already have an idea why, but if not, keep reading. Everything will be clear in the end!

November 1, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.5.0 - watermark propagation

Watermark management, or rather the management of multiple watermarks, has been a thorn in the side of Apache Spark Structured Streaming. It improved in the previous release (3.4.0) but still left some room for improvement. Or rather, it did, because the 3.5.0 release brought a serious fix for the multiple watermarks scenario.

October 25, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.5.0 - Structured Streaming

It's time to start the series covering Apache Spark 3.5.0 features. As the first topic, I'm going to cover Structured Streaming, which got a lot of RocksDB improvements and some major API changes.

October 18, 2023 • Apache Spark Structured Streaming

Watermark and input data filtering in Apache Spark Structured Streaming

I've already written about watermarks in a few places on the blog but, despite that, I still find things to revisit. One of them is the watermark used to filter out late data, which will be the topic of this blog post.
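As a rough mental model of watermark-based filtering, here is a small pure-Python simulation. It is only a sketch under several assumptions: event times are plain integers, the watermark starts at 0, it is recomputed after each micro-batch as the max event time seen minus the allowed delay, and a record counts as late when its event time is strictly lower than the current watermark. The function name `filter_late_records` is hypothetical, and the real engine may differ in details such as the exact boundary comparison and the operators for which late rows are dropped.

```python
from typing import List, Tuple


def filter_late_records(
    batches: List[List[int]], delay: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Simulate per-micro-batch late data filtering (hypothetical helper).

    Returns (kept, dropped), each holding one list per input batch.
    """
    watermark = 0  # assumption: the watermark starts at 0
    kept: List[List[int]] = []
    dropped: List[List[int]] = []

    for batch in batches:
        # The watermark computed after the previous batch applies to this one.
        kept.append([t for t in batch if t >= watermark])
        dropped.append([t for t in batch if t < watermark])

        # Advance the watermark for the next batch; it never moves backwards.
        if batch:
            watermark = max(watermark, max(batch) - delay)

    return kept, dropped


kept, dropped = filter_late_records([[10, 12, 15], [4, 14, 20], [9, 11]], delay=5)
print(kept)     # [[10, 12, 15], [14, 20], []]
print(dropped)  # [[], [4], [9, 11]]
```

Note how the record with event time 4 is only dropped in the second batch: the first batch moved the watermark to 10, and that value takes effect from the next micro-batch onwards.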

October 4, 2023 • Apache Spark Structured Streaming

Making applyInPandasWithState less painful

Do not get the title wrong! Having applyInPandasWithState in the PySpark API is huge! However, due to Python's duck typing, some operations are more difficult and riskier to express in the code than in the strongly typed Scala API.

September 27, 2023 • Apache Spark Structured Streaming

Arbitrary stateful processing in PySpark with applyInPandasWithState

It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0 you can even write arbitrary stateful processing jobs! But since the API is a little bit different from the one available on the Scala side, I wanted to take a deeper look.

August 10, 2023 • Apache Spark Structured Streaming

_spark_metadata in Apache Spark Structured Streaming issue is no more!

Thanks to table file formats, there are probably fewer people working with flat files in Structured Streaming today than 5 years ago. However, if you are in this group and are still generating CSVs or JSONs with the streaming sink, brace yourself: memory problems are coming if you don't take action!

August 2, 2023 • Apache Spark Structured Streaming

The first state in Apache Spark Structured Streaming arbitrary stateful processing

When you wrote your first arbitrary stateful processing pipelines, state expiration was probably the first tricky point you had to deal with. Why is that? After all, it's just about setting the timeout, isn't it? Most of the time, yes, but there is an exception.

July 25, 2023 • Apache Spark Structured Streaming

State expiration in stream-to-stream joins with event time range condition

You certainly know that the watermark (aka GC Watermark) is responsible for cleaning the state store in Apache Spark Structured Streaming. But you may not know that it's not the only time-based condition. A different one is involved in stream-to-stream joins.

July 21, 2023 • Apache Spark Structured Streaming

How to initialize state in Apache Spark Structured Streaming stateful jobs?

Starting from Apache Spark 3.2.0, it is possible to load an initial state into arbitrary stateful pipelines. Even though the feature is easy to use, it hides some interesting implementation details!

July 7, 2023 • Apache Spark Structured Streaming

Multiple queries running in Apache Spark Structured Streaming

It's often a dilemma: should we put multiple sinks working on the same data source in the same Apache Spark Structured Streaming application or in different ones? Both solutions may be valid depending on your use case, but let's focus here on the former, with multiple sinks running together.

May 31, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.4.0 - Structured Streaming

The asynchronous progress tracking and the correctness issue fixes presented in the previous blog posts are not the only new features in Apache Spark Structured Streaming 3.4.0. There are many others, but to keep the blog post readable, I'll focus here on only 3 of them.

May 25, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.4.0 - Structured Streaming and correctness issue

Apache Spark is infamous for its correctness issue with chained stateful operations. Fortunately, things improve with each release. The most recent one, 3.4.0, also brought some important changes in that area!

May 17, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.4.0 - Async progress tracking for Structured Streaming

Finally, the time has come to start the analysis of the new features in Apache Spark. The first one that grabbed my attention was the async progress tracking in Structured Streaming.
