What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - batch

Versions: Apache Spark 4.0.0 https://github.com/bartosz25/spark-...main/scala/com/waitingforcode/batch

To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming, let's focus on its... batch counterpart!

I know, it may sound surprising that the API added primarily for Structured Streaming can be used in batch pipelines, but that's how it is! For that reason, it's good to see some potential use cases of this batch-based transformWithState:

Differences

The crucial difference between the streaming and batch versions is the isStreaming flag present on the TransformWithStateExec physical operator. It's set to true by default but overwritten to false when Apache Spark plans the query with transformWithState for the batch API:

Besides this flag that conditions the internal behavior, the batch operator has other specificities:

However, the biggest difference comes from this isStreaming flag that conditions the execution inside the physical operator. This conditioning means:

Despite these differences, the batch version shares a few features with its streaming counterpart, such as timer processing and state initialization.
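To make the batch flavor more concrete, here is a minimal sketch of a stateful processor applied to a static Dataset. It is an illustration under assumptions and not the code from the repository: the Visit and VisitCount types, the VisitCounter class, and the sample rows are hypothetical. It relies on the Scala transformWithState signature as I understand it in Apache Spark 4.0.0.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StatefulProcessor, TimeMode, TimerValues, TTLConfig, ValueState}

// Hypothetical types, used only for this illustration
case class Visit(userId: String, page: String)
case class VisitCount(userId: String, visits: Long)

// Counts the visits of each user with a value state
class VisitCounter extends StatefulProcessor[String, Visit, VisitCount] {
  @transient private var counter: ValueState[Long] = _

  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
    counter = getHandle.getValueState[Long]("counter", Encoders.scalaLong, TTLConfig.NONE)
  }

  override def handleInputRows(key: String, inputRows: Iterator[Visit],
      timerValues: TimerValues): Iterator[VisitCount] = {
    val previous = if (counter.exists()) counter.get() else 0L
    val updated = previous + inputRows.size
    counter.update(updated)
    Iterator.single(VisitCount(key, updated))
  }
}

object BatchTransformWithStateSketch {
  // The input is a batch Dataset, so the query is planned for the batch API
  def countVisits(visits: Dataset[Visit]): Dataset[VisitCount] = {
    val spark = visits.sparkSession
    import spark.implicits._
    visits.groupByKey(_.userId)
      .transformWithState(new VisitCounter(), TimeMode.None(), OutputMode.Append())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName("batch-tws-sketch").getOrCreate()
    import spark.implicits._
    val visits = Seq(Visit("u1", "/home"), Visit("u1", "/cart"), Visit("u2", "/home")).toDS()
    countVisits(visits).show(truncate = false)
    spark.stop()
  }
}
```

Nothing in the processor is batch-specific. The only thing that changes compared to a streaming job is the kind of Dataset passed to countVisits, and the state only lives for the duration of the query.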

Incremental processing

Even though the batch version doesn't support state persistence, there is a way to simulate it by saving the state to a dedicated storage location. It follows the principles of the Incremental Sessionizer pattern I explained in Chapter 5 of my Data Engineering Design Patterns book. Let me adapt it to incremental processing with the batch version of transformWithState:


In case you didn't get the point, the idea of incremental processing with the batch transformWithState consists of:
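Here is a minimal sketch of that loop, assuming the state written by the previous run is stored as Parquet files and handed back to the next run through the initial state argument of transformWithState. The Visit and VisitCount types, the ResumableVisitCounter class, the runOnce helper, and the run-1 and run-2 directories are hypothetical names for this illustration, not the code from the demo.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StatefulProcessorWithInitialState, TimeMode, TimerValues, TTLConfig, ValueState}

// Hypothetical types; VisitCount is both the job output and the saved state
case class Visit(userId: String, page: String)
case class VisitCount(userId: String, visits: Long)

class ResumableVisitCounter
    extends StatefulProcessorWithInitialState[String, Visit, VisitCount, VisitCount] {
  @transient private var counter: ValueState[Long] = _

  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
    counter = getHandle.getValueState[Long]("counter", Encoders.scalaLong, TTLConfig.NONE)
  }

  // Restores the counter saved by the previous run for this key
  override def handleInitialState(key: String, initialState: VisitCount,
      timerValues: TimerValues): Unit = {
    counter.update(initialState.visits)
  }

  override def handleInputRows(key: String, inputRows: Iterator[Visit],
      timerValues: TimerValues): Iterator[VisitCount] = {
    val updated = (if (counter.exists()) counter.get() else 0L) + inputRows.size
    counter.update(updated)
    Iterator.single(VisitCount(key, updated))
  }
}

object IncrementalBatchSketch {
  // One run: previous state + new input => new state written to a new location
  def runOnce(input: Dataset[Visit], previousState: Dataset[VisitCount],
      newStatePath: String): Unit = {
    val spark = input.sparkSession
    import spark.implicits._
    val updated = input.groupByKey(_.userId).transformWithState(
      new ResumableVisitCounter(), TimeMode.None(), OutputMode.Append(),
      previousState.groupByKey(_.userId))
    // Carry over the keys of the previous state that are absent from the new output
    val untouched = previousState.join(updated, Seq("userId"), "left_anti").as[VisitCount]
    updated.unionByName(untouched).write.mode("overwrite").parquet(newStatePath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName("incremental-batch-sketch").getOrCreate()
    import spark.implicits._
    val stateDir = Files.createTempDirectory("tws-state").toString

    // First run starts from an empty state
    runOnce(Seq(Visit("u1", "/home"), Visit("u2", "/cart")).toDS(),
      spark.emptyDataset[VisitCount], s"$stateDir/run-1")
    // Second run resumes from the state written by the first one
    val stateAfterRun1 = spark.read.parquet(s"$stateDir/run-1").as[VisitCount]
    runOnce(Seq(Visit("u1", "/checkout")).toDS(), stateAfterRun1, s"$stateDir/run-2")

    spark.read.parquet(s"$stateDir/run-2").show(truncate = false)
    spark.stop()
  }
}
```

Writing each run to its own directory keeps the previous state intact, so a failed run can be replayed from the same starting point. The left anti join is a defensive choice: it keeps the keys that received no new rows in the state, whether or not the processor emits a row for them.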

Demo

I recorded a short demo to show batch code that uses transformWithState:

With this demo I'm closing the arbitrary stateful API part of the Apache Spark 4 series. Now, it's time to move on and discover other new features from the most recent release!
