Apache Spark Structured Streaming articles

Continuous execution in Apache Spark Structured Streaming

For years, Apache Spark's streaming was perceived as working only with micro-batches. However, release 2.3.0 tries to change this by introducing a new execution model called continuous. Even though it's still experimental, it's worth learning more about it.

Stateful transformations with mapGroupsWithState

Stateful stream processing in Apache Spark has evolved a lot since the first versions of the framework. In the beginning there was updateStateByKey, but some time later it was judged inefficient and replaced by mapWithState. With the arrival of Structured Streaming, that method was in turn replaced by mapGroupsWithState.

Stateful aggregations in Apache Spark Structured Streaming

Recently we discovered the concept of state stores, which are used to deal with stateful aggregations in Structured Streaming. At that time, however, we didn't spend much time on the aggregations themselves. As promised, they are described now.

Output modes in Apache Spark Structured Streaming

Structured Streaming introduced a lot of new concepts compared to DStream-based streaming. One of them is the output mode.

StateStore in Apache Spark Structured Streaming

During my last Spark exploration, which covered the RPC implementation, one class caught my attention. It was StateStoreCoordinator, used by the state store, which plays an important role in Structured Streaming pipelines.

Triggers in Apache Spark Structured Streaming

For the last few weeks I have been focused on the Apache Beam project. After some reading, I discovered a lot of similar concepts between Beam and Spark Structured Streaming (or the other way around?). One of these similarities is triggers.

Apache Spark Structured Streaming and watermarks

The idea of the watermark was first presented while discovering the Apache Beam project. However, it's also implemented in Apache Spark to address the same problem: late data.
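
As a rough illustration of the late-data problem, here is a minimal pure-Python sketch of a watermark-based filter. It is a toy model and not Spark's implementation: the names `WatermarkFilter`, `allowed_lateness` and `split_late` are hypothetical, event times are plain integers (seconds), and the watermark is assumed to advance after every single event rather than per micro-batch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WatermarkFilter:
    """Toy late-data filter (illustrative only, not Spark's implementation)."""
    allowed_lateness: int      # hypothetical parameter: tolerated delay in seconds
    max_event_time: int = 0    # highest event time observed so far

    @property
    def watermark(self) -> int:
        # Events older than this threshold are considered too late.
        return self.max_event_time - self.allowed_lateness

    def accept(self, event_time: int) -> bool:
        """Return True if the event is kept, False if it is dropped as late."""
        if event_time < self.watermark:
            return False
        self.max_event_time = max(self.max_event_time, event_time)
        return True


def split_late(event_times: List[int], allowed_lateness: int) -> Tuple[List[int], List[int]]:
    """Split a sequence of event times into (kept, dropped) lists."""
    late_filter = WatermarkFilter(allowed_lateness)
    kept: List[int] = []
    dropped: List[int] = []
    for event_time in event_times:
        (kept if late_filter.accept(event_time) else dropped).append(event_time)
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = split_late([100, 105, 98, 120, 109, 111], allowed_lateness=10)
    print("kept:", kept)
    print("dropped:", dropped)
```

With an allowed lateness of 10 seconds, the event at time 98 is still accepted because the watermark is only 95 when it arrives, while the event at time 109 is dropped because the event at time 120 has already moved the watermark to 110.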

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start() explained

The error quoted in the title of this post is quite common when you try to carry the design logic of Spark DStream/RDD code over to Spark Structured Streaming. This post gives some insight into it.

Analyzing Structured Streaming Kafka integration - Kafka source

Spark 2.2.0 changed the status of Structured Streaming. Between 2.0 and 2.2.0 it was marked as "alpha", but the latest version changed this status to General Availability. So it's a good moment to start playing with this new feature, even if some basics have already been covered in the post about structured streaming. This time we'll go deeper and analyze the integration with Apache Kafka that will be helpful to

Structured streaming

Project Tungsten, explained in one of the previous posts, brought a lot of optimizations, especially in terms of memory use. Until now it was essentially used by the Spark SQL and Spark MLlib projects. However, since 2.0.0, some work has been done to integrate DataFrame/Dataset into stream processing (Spark Streaming).
