It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0, you can even write arbitrary stateful processing jobs! But since the API is a little bit different from the one available on the Scala side, I wanted to take a deeper look.
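To give you a first taste, here is a minimal sketch of the new applyInPandasWithState API; the column names and the counting logic are made up for the illustration:

```python
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Count the events seen so far for each user; the state is a 1-field tuple.
def count_events(key: Tuple[str], rows: Iterator[pd.DataFrame],
                 state: GroupState) -> Iterator[pd.DataFrame]:
    count = state.get[0] if state.exists else 0
    for pdf in rows:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"user_id": [key[0]], "events": [count]})

# input_stream: a streaming DataFrame with a user_id column (assumption)
output = (input_stream
    .groupBy("user_id")
    .applyInPandasWithState(count_events,
        outputStructType="user_id STRING, events LONG",
        stateStructType="events LONG",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout))
```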
There are probably fewer people working with flat files in Structured Streaming today than 5 years ago, thanks to the table file formats. However, if you are in this group and are still generating CSVs or JSONs with the streaming file sink, brace yourself: the memory problems are coming if you don't take action!
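If that's your case, even an innocent-looking pipeline like the sketch below is exposed; the paths are placeholders. Each micro-batch adds an entry to the sink's _spark_metadata log stored under the output path, and that log keeps the whole history:

```python
# events: any streaming DataFrame, e.g. read from Kafka (assumption)
query = (events.writeStream
    .format("json")
    .option("path", "/tmp/output")  # _spark_metadata will live here
    .option("checkpointLocation", "/tmp/checkpoint")
    .start())
```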
When you wrote your first arbitrary stateful processing pipelines, state expiration was maybe the first tricky point you had to deal with. Why is that? After all, it's just about setting a timeout, isn't it? Most of the time, yes, but there is an exception.
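The happy path looks deceptively simple; here is a hedged PySpark sketch using a processing time timeout (the group and state columns are assumptions):

```python
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState

# To be passed to applyInPandasWithState with
# timeoutConf=GroupStateTimeout.ProcessingTimeTimeout
def count_or_expire(key: Tuple[str], rows: Iterator[pd.DataFrame],
                    state: GroupState) -> Iterator[pd.DataFrame]:
    if state.hasTimedOut:
        # reached for a group whose timeout fired before this micro-batch
        state.remove()
        yield pd.DataFrame({"user_id": [key[0]], "expired": [True]})
    else:
        count = (state.get[0] if state.exists else 0) + sum(len(pdf) for pdf in rows)
        state.update((count,))
        state.setTimeoutDuration(30 * 1000)  # expire after 30s of inactivity
        yield pd.DataFrame({"user_id": [key[0]], "expired": [False]})
```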
You certainly know that the watermark (aka the GC watermark) is responsible for cleaning the state store in Apache Spark Structured Streaming. But you may not know that it's not the only time-based condition. There is a different one involved in the stream-to-stream joins.
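To set the scene, here is a classical stream-to-stream join; the datasets and columns are illustrative. The extra condition hinted at above is derived from the time-range constraint of the join:

```python
from pyspark.sql.functions import expr

# clicks, views: streaming DataFrames with timestamp columns (assumption)
clicks_w = clicks.withWatermark("click_time", "10 minutes")
views_w = views.withWatermark("view_time", "20 minutes")

# Besides the GC watermark, the time-range condition below also bounds
# how long each side's state is kept.
joined = views_w.join(
    clicks_w,
    expr("""
      view_ad_id = click_ad_id AND
      click_time BETWEEN view_time AND view_time + interval 1 hour
    """))
```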
Starting from Apache Spark 3.2.0, it is now possible to load an initial state into the arbitrary stateful pipelines. Even though the feature is easy to implement, it hides some interesting implementation details!
That's often a dilemma: should we put multiple sinks working on the same data source in the same Apache Spark Structured Streaming application or in different ones? Both solutions may be valid depending on your use case, but let's focus here on the former one, with multiple sinks living together.
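One common way to keep the sinks together is foreachBatch; a minimal sketch, with placeholder formats and paths:

```python
def write_to_both(batch_df, batch_id: int):
    # Cache to avoid recomputing the source once per sink.
    batch_df.persist()
    batch_df.write.format("parquet").mode("append").save("/tmp/sink1")
    batch_df.write.format("json").mode("append").save("/tmp/sink2")
    batch_df.unpersist()

# events: the shared streaming data source (assumption)
query = (events.writeStream
    .foreachBatch(write_to_both)
    .option("checkpointLocation", "/tmp/multi-sink-checkpoint")
    .start())
```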
The asynchronous progress tracking and the correctness issue fixes presented in the previous blog posts are not the only new features in Apache Spark Structured Streaming 3.4.0. There are many others, but to keep the blog post readable, I'll focus here on only 3 of them.
Apache Spark is infamous for its correctness issues with chained stateful operations. Fortunately, things improve with each release. The most recent one, 3.4.0, also brought some important changes in that field!
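As an example of such a chain, the 3.4.0 documentation shows windowed aggregations stacked on top of each other, with the new window_time function bridging them; a sketch with assumed columns:

```python
from pyspark.sql.functions import window, window_time

# events: streaming DataFrame with an event_time timestamp column (assumption)
chained = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"))
    .count()
    # a second stateful operator downstream of the first one
    .groupBy(window(window_time("window"), "10 minutes"))
    .count())
```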
Finally, the time has come to start the analysis of the new features in Apache Spark 3.4.0. The first one that grabbed my attention was the async progress tracking in Structured Streaming.
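In a nutshell, it's an opt-in pair of sink options; a sketch assuming a Kafka sink, which is where the feature lands first:

```python
# events: streaming DataFrame with a value column, as the Kafka sink expects
query = (events.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("topic", "output")
    .option("checkpointLocation", "/tmp/async-checkpoint")
    # offsets and commits are written asynchronously, off the critical path
    .option("asyncProgressTrackingEnabled", "true")
    # and only every 10 seconds instead of at every micro-batch
    .option("asyncProgressTrackingCheckpointIntervalMs", "10000")
    .start())
```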
Even though Project Lightspeed is not there yet, Apache Spark Structured Streaming 3.3.0 has several interesting features that should make your daily life easier.
Unit tests are the backbone of modern software, but they only verify a particular unit of the application. What can we do if we want to check the interaction between all these units? One of the solutions is automated integration tests. While they are relatively easy to implement against data at rest, they are more challenging for streaming scenarios.
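One possible PySpark harness combines a file source, the memory sink and processAllAvailable(); a sketch, with an assumed schema and a trivial pipeline under test:

```python
import json, os, tempfile

# spark: an existing SparkSession, e.g. coming from a pytest fixture
input_dir = tempfile.mkdtemp()

# 1. Arrange: drop a file into the directory watched by the stream.
with open(os.path.join(input_dir, "batch1.json"), "w") as f:
    f.write(json.dumps({"user_id": "a", "amount": 3}))

# 2. Act: run the pipeline under test against a memory sink.
stream = spark.readStream.schema("user_id STRING, amount INT").json(input_dir)
query = (stream.groupBy("user_id").sum("amount")
    .writeStream.format("memory").queryName("test_output")
    .outputMode("complete").start())
query.processAllAvailable()

# 3. Assert.
assert spark.sql("SELECT * FROM test_output").count() == 1
query.stop()
```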
The Structured Streaming micro-batch mode inherits a lot of features from the batch part. Apart from the retry mechanism presented previously, it also has the same auto-scaling logic, relying on the Dynamic Resource Allocation.
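The switches are the same ones you would set for a batch job; a sketch (the executor bounds are arbitrary):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("streaming-with-dra")
    # the same knobs as for a batch job
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    # removed executors must be able to hand their shuffle files over
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())
```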
Last year I wrote a blog post about broadcasting in Structured Streaming and I got an interesting question under one of the demo videos: what happens if the static dataset joined in broadcast mode gets new data? Let's check this out!
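The setup behind the question looks roughly like this; the dataset names and the join key are assumptions:

```python
from pyspark.sql.functions import broadcast

users = spark.read.parquet("/tmp/users")  # the static, broadcast side

# events: the streaming side (assumption)
enriched = events.join(broadcast(users), "user_id")

query = (enriched.writeStream.format("console")
    .option("checkpointLocation", "/tmp/broadcast-checkpoint")
    .start())
```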
Unexpected things happen and, sooner or later, any pipeline can fail. Fortunately, the errors are sometimes temporary and can be automatically recovered from after some retries. But what if the job is a streaming one? Let's see here how Apache Spark Structured Streaming handles task retries in the micro-batch and continuous modes!
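On the micro-batch side, the property at play is the classical batch one; a sketch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("retryable-streaming-job")
    # a single task can fail up to 4 times (3 retries) before its stage,
    # and hence the whole streaming query, is aborted
    .config("spark.task.maxFailures", "4")
    .getOrCreate())
```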
After previous blog posts focusing on 2 specific Structured Streaming features, it's time to complete the picture with a list of other changes made in the 3.2.0 version!
Initially I wanted to include the session windows in the blog post about Structured Streaming changes. But I changed my mind when I saw how many things they involve!
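For the declarative part, a single function does the job; a sketch with assumed columns:

```python
from pyspark.sql.functions import session_window

# events: streaming DataFrame with user_id and event_time columns (assumption)
sessions = (events
    .withWatermark("event_time", "10 minutes")
    # a session closes after 5 minutes of inactivity per user
    .groupBy("user_id", session_window("event_time", "5 minutes"))
    .count())
```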
It's big news for Apache Spark Structured Streaming users: RocksDB is now available as a state store backend in vanilla Spark!
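Enabling it boils down to one configuration entry; a sketch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # swap the default in-memory HDFS-backed provider for RocksDB
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .getOrCreate())
```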
The topic of this post was brought to me by Luan Carvalho, who shared with me an Open Source project connecting Apache Spark to the Apache Kafka Schema Registry. Initially I wanted to focus exclusively on the project, but on my way I discovered some other interesting points.
At first glance, the update operation in an arbitrary stateful application looks just like another map's put function. However, it has an impact on what happens later with the state store. In this blog post, you will see an example that can help you reduce the I/O pressure of the updates.
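To anticipate a bit, the trick boils down to not calling update blindly; a hedged sketch with the PySpark API (the counting logic is made up):

```python
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState

def update_if_changed(key: Tuple[str], rows: Iterator[pd.DataFrame],
                      state: GroupState) -> Iterator[pd.DataFrame]:
    new_events = sum(len(pdf) for pdf in rows)
    count = (state.get[0] if state.exists else 0) + new_events
    # update() marks the state as modified, so the state store has to
    # rewrite it; calling it only on real changes spares those writes
    if new_events > 0:
        state.update((count,))
    yield pd.DataFrame({"user_id": [key[0]], "events": [count]})
```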
If you've used the Apache Kafka source in Structured Streaming, you have undoubtedly noticed a property called maxOffsetsPerTrigger. According to the documentation, it sets a "limit on maximum number of offsets processed per trigger interval". My initial reaction to this property was: "Cool! We can enforce idempotent processing". I was not wrong, but the blog post will show you that I wasn't entirely right either!
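For the record, here is the property in action; the broker address and topic are placeholders:

```python
kafka_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    # each micro-batch reads at most 1000 records across all partitions
    .option("maxOffsetsPerTrigger", "1000")
    .load())
```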