Output modes in Apache Spark Structured Streaming

Versions: Spark 2.2.1

Structured Streaming introduced a lot of new concepts regarding to the DStream-based streaming. One of them is the output mode.

This post presents the output modes introduced in Spark 2.0.0 to deal with streaming data output. The first part shows them through a short theoretical part. The second section presents their API. The last part shows how they work in some learning tests.

Output modes definition

The output mode specifies the way of writing the data to the result table. Among the available output modes we can distinguish:

Output modes API

The output mode definition occurs in DataStreamWriter#outputMode(outputMode: String) method. So passed name is later translated to the corresponding case object from InternalOutputModes object.

The resolved instance is used mainly in the DataStreamWriter class. It's passed from there to the StreamingQueryManager#startQuery(userSpecifiedName: Option[String], userSpecifiedCheckpointLocation: Option[String], df: DataFrame, sink: Sink, outputMode: OutputMode, useTempCheckpointLocation: Boolean = false, recoverFromCheckpointLocation: Boolean = true, trigger: Trigger = ProcessingTime(0), triggerClock: Clock = new SystemClock()). As the name of this method indicates, it starts the streaming query execution.

In the physical execution side we can find the tracks of the output modes in StateStoreSaveExec. It's there where the intermediate stateful results are stored. By the way we can find there a lot of references to the watermarking that helps to remove too old results. If you want to learn more about it, please go to the post about StateStore in Apache Spark Structured Streaming.

Output modes examples

Below list summarizes which modes can be used for given types of processing. After each of them some tests are written in order to show the use and not-use cases:

The output modes in Apache Spark determines how the output is generated. Among 3 different strategies, one of them returns always the complete result while 2 others either appends the results that are not supposed to receive the data anymore or update already computed results. All of these main behaviors were shown in the tests defined in the 3rd section.