What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - batch

To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming let's focus on its...batch counterpart!

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

I know, it may sound surprising that the API added primarily for Structured Streaming can be used in batch pipelines, but that's how it is! For that reason, it's good to see some potential use cases of this batch-based transformWithState:

Differences

The crucial difference between the streaming and batch versions is the isStreaming flag present on the TransformWithStateExec physical operator. It's set to true by default but overwritten to false when Apache Spark plans the query with transformWithState for the batch API:

Besides this flag that conditions the internal behavior, the batch operator has other specificities:

However, the biggest difference comes from this isStreaming flag that conditions the execution inside the physical operator. This conditioning means:

Despite these differences, the batch version shares a few features with its streaming counterpart, such as timers processing and state initialization.

Incremental processing

Even though the batch version doesn't support state persistence, there is a way to simulate it by saving the state to dedicate storage location. It follows the principles of the Incremental Sessionizer pattern I explained in Chapter 5 of my Data Engineering Design Patterns book. Let me adapt it to the incremental processing with the batch version of the transformWithState:


In case you didn't get the point, the idea of the incremental processing with the batched transformWithState consist of:

Demo

I recorded a short demo to show a batch code for the transformWithState:

With this demo I'm closing the new arbitrary stateful API part in Apache Spark 4 series. Now, it's time to move on and discover other new features from the most recent release!

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©