Apache Spark Structured Streaming articles



Bartosz Konieczny

December 7, 2019 • Apache Spark Structured Streaming

Extending data reprocessing period for arbitrary stateful processing applications

After my Summit talk I got an interesting "off" question about data reprocessing in the sessionization streaming pipeline. In this post I will try to develop the answer I gave.

Continue Reading →

December 1, 2019 • Apache Spark Structured Streaming

Custom checkpoint file manager in Structured Streaming

In this post I will start the customization part of the topics covered during my talk. The first customized class will be the class responsible for the checkpoint management.

Continue Reading →

November 23, 2019 • Apache Spark Structured Streaming

Sessionization pipeline - from Kafka to Kinesis version

I'm slowly getting closer to the end of the Spark+AI Summit follow-up series. But before I finish, I owe you an explanation of how to run the pipeline from my GitHub on Kinesis.

Continue Reading →

November 17, 2019 • Apache Spark Structured Streaming

Kafka timestamp as the watermark

In the first version of my demo application I used Kafka's timestamp field as the watermark. At that moment I was exploring the internals of arbitrary stateful processing, so it wasn't a big deal. But in case you're wondering why I didn't keep it for the official demo version, I wrote this article.

Continue Reading →

November 16, 2019 • Apache Spark Structured Streaming

Reprocessing stateful data pipelines in Structured Streaming

During my talk, I insisted a lot on the reprocessing part, maybe because it's the least pleasant part to work with. After all, we all prefer testing new pipelines to reprocessing data because of regressions in the code or other errors. Despite that, it's important to know how Structured Streaming integrates with this data engineering task.

Continue Reading →

November 10, 2019 • Apache Spark Structured Streaming

Output modes in Structured Streaming

The series of notes I took during my Apache Spark Summit preparation continues. Today it's time to cover output modes, which I also used in the presented solution for the sessionization problem.

Continue Reading →

November 9, 2019 • Apache Spark Structured Streaming

State lifecycle management in Structured Streaming

In this post about the state store in Structured Streaming I will focus on state lifecycle management. The goal is to see what happens when the state expires, why removing it from the state store is so important, and to answer some other interesting questions!

Continue Reading →

November 2, 2019 • Apache Spark Structured Streaming

Delta and snapshot state store formats

The state store uses the checkpoint location to persist state, which is also cached locally in memory for faster access during processing. The checkpoint location is used at the recovery stage. An important thing to know here is that the checkpointed state comes in 2 file formats: delta files and snapshot files.
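To make the difference between the two formats more concrete, here is a minimal sketch in plain Python. It is a simplified model and not Spark's implementation: the `restore_state` function, the dictionaries standing in for checkpoint files, and the sample keys are all hypothetical. It assumes that a delta holds only the changes of one version, that a snapshot holds the full state at a version, and that recovery starts from the latest available snapshot and replays the later deltas.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for checkpoint files, indexed by version.
# A delta stores only the changes made in one version:
# key -> new value, or None when the key was removed.
Delta = Dict[str, Optional[int]]
# A snapshot stores the complete state as of one version.
Snapshot = Dict[str, int]


def restore_state(version: int,
                  snapshots: Dict[int, Snapshot],
                  deltas: Dict[int, Delta]) -> Snapshot:
    """Rebuild the state for `version` from the latest snapshot at or
    before it, then replay every later delta in order."""
    candidates = [v for v in snapshots if v <= version]
    base = max(candidates) if candidates else 0
    state = dict(snapshots.get(base, {}))
    for v in range(base + 1, version + 1):
        for key, value in deltas[v].items():
            if value is None:
                state.pop(key, None)  # removal recorded in the delta
            else:
                state[key] = value    # insert or update
    return state


# Assumed sample data: 4 versions of deltas and one snapshot taken at version 2.
deltas = {
    1: {"user_a": 1},
    2: {"user_b": 1, "user_a": 2},
    3: {"user_a": None},
    4: {"user_c": 5},
}
snapshots = {2: {"user_a": 2, "user_b": 1}}

restored = restore_state(4, snapshots, deltas)
print(restored)
```

In this model the snapshot only shortens the recovery: replaying all 4 deltas from an empty state gives the same result as starting from the version 2 snapshot and replaying deltas 3 and 4.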

Continue Reading →

October 27, 2019 • Apache Spark Structured Streaming

Watermark in Structured Streaming

I have already talked about watermarks on my blog, but this time I will focus more on their use in the context of stateful processing.

Continue Reading →

October 26, 2019 • Apache Spark Structured Streaming

State store 101

After checkpointing, it's time to start a new chapter of Spark Summit AI 2019 preparation posts. And in this new chapter I will describe the state store. It's the first of 3 articles about this important part of the stateful processing.

Continue Reading →

October 19, 2019 • Apache Spark Structured Streaming

Checkpoint storage in Structured Streaming

At the moment of writing this post I'm preparing the content for my first Spark Summit talk, about solving the sessionization problem in batch or streaming. Since I'm almost sure that I will be unable to say everything I prepared, I decided to take notes and transform them into blog posts. You're currently reading the first post from this series (#Spark Summit 2019 talk notes).

Continue Reading →

July 18, 2019 • Apache Spark Structured Streaming

Apache Spark Structured Streaming and Apache Kafka offsets management

Some time ago I got 3 interesting questions about the implementation of the Apache Kafka connector in Apache Spark Structured Streaming. I will answer them in this post.

Continue Reading →

February 27, 2019 • Apache Spark Structured Streaming

Initializing state in Structured Streaming

Some time ago Sunil asked me whether it was possible to load the initial state in Apache Spark Structured Streaming, as in the DStream-based API. Since the answer was not obvious, I decided to investigate and share the findings in this post.

Continue Reading →

February 6, 2019 • Apache Spark Structured Streaming

Apache Spark 2.4.0 features - foreachBatch

When I first heard about the foreachBatch feature, I thought that it was the implementation of foreachPartition in the Structured Streaming module. However, after some analysis I saw how wrong I was, because this new feature addresses other, equally important problems. You will find more in this post.

Continue Reading →

January 9, 2019 • Apache Spark Structured Streaming

Apache Spark 2.4.0 features - watermark configuration

The series about Apache Spark 2.4.0 features continues. After last week's discovery of bucket pruning, it's time to switch to the Structured Streaming module and see its major evolution.

Continue Reading →

September 16, 2018 • Apache Spark Structured Streaming

Stream-to-stream joins internals

In 3 recent posts about Apache Spark Structured Streaming we discovered streaming joins: inner joins, outer joins and state management strategies. Discovering what happens under the hood of all these operations is a good way to sum up the series.

Continue Reading →

September 9, 2018 • Apache Spark Structured Streaming

Stream-to-stream state management

In recent weeks we discovered 2 stream-to-stream join types in Apache Spark Structured Streaming. As mentioned in these posts, the state management logic may sometimes be omitted (for inner joins), but it's generally advised to keep it in order to reduce the memory pressure. Apache Spark proposes 3 different state management strategies that will be detailed in the following sections.

Continue Reading →

September 2, 2018 • Apache Spark Structured Streaming

Outer joins in Apache Spark Structured Streaming

Previously we discovered inner stream-to-stream joins in Apache Spark, but they aren't the only supported type. Outer joins are another one; they let us combine streams even when some rows have no match on the other side.

Continue Reading →

August 26, 2018 • Apache Spark Structured Streaming

Inner joins between streams in Apache Spark Structured Streaming

Apache Kafka Streams supports joins between streams and the community expected the same for Apache Spark. This feature was implemented and released in the recent 2.3.0 version and, some months later, it's a good moment to talk a little about it.

Continue Reading →

April 8, 2018 • Apache Spark Structured Streaming

Query metrics in Apache Spark Structured Streaming

One of the important points for long-running queries is tracking. It's always important to know how the query performs. In Structured Streaming we can follow this execution thanks to a special object called ProgressReporter.

Continue Reading →