Output modes in Apache Spark Structured Streaming

Versions: Spark 2.2.1

Structured Streaming introduced a lot of new concepts compared to the DStream-based streaming. One of them is the output mode.

This post presents the output modes introduced in Spark 2.0.0 to deal with streaming data output. The first part describes them from the theoretical point of view. The second part presents their API. The last part shows how they behave in some learning tests.

Output modes definition

The output mode specifies what is written from the result table to the output sink. Among the available output modes we can distinguish (illustrated in the snippet just after this list):

- append - only the rows added to the result table since the last trigger are written out; an emitted row is never rewritten, which is why streaming aggregations need a watermark to use this mode
- complete - the whole result table is rewritten to the sink at every trigger; it's reserved for queries containing aggregations
- update - only the rows of the result table updated since the last trigger are written out
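To give a more concrete intuition, the snippet below sketches a classical streaming word count. It's only an illustration written for this post, not code taken from the Spark documentation; the socket host and port are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("output-modes-overview")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// words read from a local socket source (assumed to run on localhost:9999)
val words = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]

// the result table of this query is the set of (word, count) pairs
val counts = words.groupBy("value").count()

// complete: every trigger rewrites all (word, count) rows
// update:   every trigger writes only the rows whose count changed
// append:   would only write rows that can't change anymore, so for this
//           aggregation it's rejected unless a watermark is defined
val query = counts.writeStream
  .outputMode("complete") // or "update"
  .format("console")
  .start()
```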

Output modes API

The output mode is defined in the DataStreamWriter#outputMode(outputMode: String) method. The passed name is later translated to the corresponding case object from the InternalOutputModes object.
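As a quick illustration, and assuming the counts aggregation from the previous snippet, the mode can be passed either as a string or through the OutputMode objects of the public API:

```scala
import org.apache.spark.sql.streaming.OutputMode

// string version, resolved internally to a case object from InternalOutputModes
counts.writeStream.outputMode("update").format("console").start()

// equivalent version using the public OutputMode factory methods
counts.writeStream.outputMode(OutputMode.Update()).format("console").start()
```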

The resolved instance is used mainly in the DataStreamWriter class. From there it's passed to StreamingQueryManager#startQuery(userSpecifiedName: Option[String], userSpecifiedCheckpointLocation: Option[String], df: DataFrame, sink: Sink, outputMode: OutputMode, useTempCheckpointLocation: Boolean = false, recoverFromCheckpointLocation: Boolean = true, trigger: Trigger = ProcessingTime(0), triggerClock: Clock = new SystemClock()). As its name indicates, this method starts the streaming query execution.
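From the user's point of view this whole chain is hidden behind DataStreamWriter#start(). The sketch below, built on the previous snippets, only shows that the StreamingQueryManager receiving the output mode is also the one exposed as spark.streams:

```scala
// start() is the public entry point that ends up calling startQuery(...)
val startedQuery = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("word_counts")
  .start()

// the StreamingQueryManager keeps track of the queries it started
spark.streams.active.foreach(q => println(s"${q.name}: ${q.status}"))
```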

On the physical execution side, the traces of the output modes can be found in StateStoreSaveExec. It's there that the intermediate stateful results are stored. By the way, this operator also contains a lot of references to watermarking, which helps to remove too old results. If you want to learn more about it, please read the post about StateStore in Apache Spark Structured Streaming.
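To see the state eviction at work, here is a hedged sketch of a watermarked aggregation, assuming the SparkSession and implicits import from the first snippet. It uses the rate source available since Spark 2.2 and its timestamp column; the window and watermark durations are arbitrary:

```scala
import org.apache.spark.sql.functions.window

// the rate source generates rows with a `timestamp` and a `value` column
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

val windowed = events
  // state of windows older than the watermark can be dropped by StateStoreSaveExec
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "5 seconds"))
  .count()

// update mode writes only the windows modified in the current trigger
windowed.writeStream.outputMode("update").format("console").start()
```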

Output modes examples

The below list summarizes which modes can be used for the given types of processing. After each of them some tests were written in order to show the supported and unsupported cases (a condensed sketch follows the list):

- queries without aggregations (simple projections or filters) - append and update modes are allowed; complete mode is rejected because there is no aggregated result table to fully recompute
- queries with aggregations defined on a watermarked event-time column - append, update and complete modes are all allowed
- queries with aggregations but without a watermark - update and complete modes are allowed; append mode is rejected since the aggregated rows could still change
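Here is a condensed version of such a learning test, assuming the words and counts datasets from the first snippet. Unsupported combinations fail at analysis time with an AnalysisException, before any data is processed (the quoted message is indicative):

```scala
import org.apache.spark.sql.AnalysisException

// aggregation + complete mode: supported, the whole result table is rewritten
counts.writeStream.outputMode("complete")
  .format("memory").queryName("complete_counts").start().stop()

// aggregation without watermark + append mode: rejected
try {
  counts.writeStream.outputMode("append")
    .format("memory").queryName("appended_counts").start()
} catch {
  case e: AnalysisException =>
    // "Append output mode not supported when there are streaming aggregations (...) without watermark"
    println(e.getMessage)
}

// no aggregation + complete mode: rejected, there is nothing to recompute entirely
try {
  words.writeStream.outputMode("complete")
    .format("memory").queryName("complete_words").start()
} catch {
  case e: AnalysisException => println(e.getMessage)
}
```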

The output modes in Apache Spark Structured Streaming determine how the results are generated. Among the 3 available strategies, one always returns the complete result while the 2 others either append the rows that are not supposed to change anymore or update the already computed results. All of these behaviors were shown in the tests defined in the 3rd section.
