Output modes in Apache Spark Structured Streaming

Versions: Spark 2.2.1

Structured Streaming introduced a lot of new concepts regarding to the DStream-based streaming. One of them is the output mode.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in 👉 Early Release on the O'Reilly platform

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

This post presents the output modes introduced in Spark 2.0.0 to deal with streaming data output. The first part shows them through a short theoretical part. The second section presents their API. The last part shows how they work in some learning tests.

Output modes definition

The output mode specifies the way of writing the data to the result table. Among the available output modes we can distinguish:

Output modes API

The output mode definition occurs in DataStreamWriter#outputMode(outputMode: String) method. So passed name is later translated to the corresponding case object from InternalOutputModes object.

The resolved instance is used mainly in the DataStreamWriter class. It's passed from there to the StreamingQueryManager#startQuery(userSpecifiedName: Option[String], userSpecifiedCheckpointLocation: Option[String], df: DataFrame, sink: Sink, outputMode: OutputMode, useTempCheckpointLocation: Boolean = false, recoverFromCheckpointLocation: Boolean = true, trigger: Trigger = ProcessingTime(0), triggerClock: Clock = new SystemClock()). As the name of this method indicates, it starts the streaming query execution.

In the physical execution side we can find the tracks of the output modes in StateStoreSaveExec. It's there where the intermediate stateful results are stored. By the way we can find there a lot of references to the watermarking that helps to remove too old results. If you want to learn more about it, please go to the post about StateStore in Apache Spark Structured Streaming.

Output modes examples

Below list summarizes which modes can be used for given types of processing. After each of them some tests are written in order to show the use and not-use cases:

The output modes in Apache Spark determines how the output is generated. Among 3 different strategies, one of them returns always the complete result while 2 others either appends the results that are not supposed to receive the data anymore or update already computed results. All of these main behaviors were shown in the tests defined in the 3rd section.

Consulting

With nearly 16 years of experience, including 8 as data engineer, I offer expert consulting to design and optimize scalable data solutions. As an O’Reilly author, Data+AI Summit speaker, and blogger, I bring cutting-edge insights to modernize infrastructure, build robust pipelines, and drive data-driven decision-making. Let's transform your data challenges into opportunities—reach out to elevate your data engineering game today!

👉 contact@waitingforcode.com
đź”— past projects