Articles about Apache Spark Structured Streaming on waitingforcode.com - articles for the pleasure of learning and discovery

May 23, 2020 • Apache Spark Structured Streaming

Idempotent logic for stateful processing and late data

Sometimes I come back to topics I have already covered, often because I accidentally discover something new that can improve them. That's the case for today's article about idempotence in stateful processing.

April 18, 2020 • Apache Spark Structured Streaming

Watermarks and not grouped query - why they don't work

Several weeks ago I played with watermarks, just to recall some concepts. I wrote a query and... the watermark didn't work! Of course, my query was wrong, but this intrigued me enough to write this short article.

April 12, 2020 • Apache Spark Structured Streaming

Nested fields, dropDuplicates and watermark in Apache Spark Structured Streaming

When I was playing with my data-generator and Apache Spark Structured Streaming, I was surprised by one behavior that I would like to share and explain in this post. Without delving into the details right now, the story is about the use of nested structures in several operations.

April 5, 2020 • Apache Spark Structured Streaming

Two topics, two schemas, one subscription in Apache Spark Structured Streaming

After my January talk about Apache Kafka integration in Structured Streaming, I got an interesting question offline. The question was: how to process 2 topics simultaneously with Structured Streaming? The "small" problem was that the two topics had different schemas.
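
To make the problem concrete, here is a minimal pure-Python sketch of one possible approach, not necessarily the one the article takes: pick a schema-specific parser based on each record's topic (the Kafka source in Structured Streaming exposes a `topic` column that can play this role). The topic names, field names and helper functions below are hypothetical and exist only for illustration.

```python
import json


# Hypothetical parser for records coming from an "orders" topic.
def parse_order(value: str) -> dict:
    data = json.loads(value)
    return {"kind": "order", "order_id": int(data["order_id"]), "amount": float(data["amount"])}


# Hypothetical parser for records coming from a "users" topic.
def parse_user(value: str) -> dict:
    data = json.loads(value)
    return {"kind": "user", "user_id": int(data["user_id"]), "login": str(data["login"])}


# One subscription, two schemas: the topic name selects the parser.
PARSERS_BY_TOPIC = {"orders": parse_order, "users": parse_user}


def parse_records(records):
    """Apply the parser matching each record's topic; records are (topic, value) pairs."""
    return [PARSERS_BY_TOPIC[topic](value) for topic, value in records]
```

In a real streaming query the same idea would probably be expressed as a filter on the topic followed by a schema-specific parsing step for each branch.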

March 14, 2020 • Apache Spark Structured Streaming

Corrupted records aka poison pill records in Apache Spark Structured Streaming

Some time ago I watched an interesting Devoxx France 2019 talk about poison pills in streaming systems presented by Loïc Divad. I learned a few interesting patterns, like the sentinel value, that may help deal with corrupted data, but the talk focused on Kafka Streams. And since I didn't find a corresponding resource for Apache Spark Structured Streaming [and also because I'm simply an Apache Spark enthusiast ;)], I decided to write one, trying to implement Loïc's ideas in the Structured Streaming world.
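
As a rough illustration of the sentinel value idea, here is a small pure-Python sketch. It is neither the code from the talk nor from the article: the `ParsedRecord` type and the helper names are hypothetical. The point is that the parser never raises on a corrupted record; it returns a marker that downstream code can route to a dead-letter output.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedRecord:
    raw: str
    payload: Optional[dict]  # None is the sentinel: the record could not be parsed

    @property
    def is_corrupted(self) -> bool:
        return self.payload is None


def parse_or_sentinel(raw: str) -> ParsedRecord:
    """Parse a JSON object; return a sentinel record instead of raising."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return ParsedRecord(raw=raw, payload=None)
    # Assumption of this sketch: anything that is not a JSON object is corrupted too.
    if not isinstance(value, dict):
        return ParsedRecord(raw=raw, payload=None)
    return ParsedRecord(raw=raw, payload=value)


def split_valid_and_dead_letters(raw_records):
    """Separate parsed payloads from the raw text of corrupted records."""
    parsed = [parse_or_sentinel(raw) for raw in raw_records]
    valid = [record.payload for record in parsed if not record.is_corrupted]
    dead_letters = [record.raw for record in parsed if record.is_corrupted]
    return valid, dead_letters
```

Keeping the raw text next to the sentinel makes it possible to inspect or replay the poison pills later without stopping the pipeline.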

February 9, 2020 • Apache Spark Structured Streaming

Apache Kafka source in Structured Streaming - "beyond the offsets"

Even though I've already written a few posts about Apache Kafka as a data source in Apache Spark Structured Streaming, I still had some questions in my head. In this post I will try to answer them and leave the rest of the Kafka integration in Spark topic for later investigation.

January 25, 2020 • Apache Spark Structured Streaming

Apache Kafka sink in Structured Streaming

I've written a lot about data sources, including Apache Kafka. However, Apache Spark is not only about sources but also about targets, called sinks. In this post I will focus on the Apache Kafka sink integration and try to answer some questions in FAQ mode.

December 21, 2019 • Apache Spark Structured Streaming

Extending state store in Structured Streaming - reprocessing and limits

In my previous post I showed you the writing and reading parts of my custom state store implementation. Today it's time to cover data reprocessing and the limits of the solution.

December 15, 2019 • Apache Spark Structured Streaming

Extending state store in Structured Streaming - reading and writing state

In my previous post I introduced the classes involved in the interactions with the state store, and also showed the big picture of the implementation. Today it's time to write some code :)

December 14, 2019 • Apache Spark Structured Streaming

Why UnsafeRow.copy() for state persistence in the state store?

In my latest Spark+AI Summit 2019 follow-up posts I'm implementing a custom state store. The extension is inspired by the default state store. While analyzing its code, one of the places that intrigued me was the put(key: UnsafeRow, value: UnsafeRow) method. Keep reading if you're curious why.

December 8, 2019 • Apache Spark Structured Streaming

Extending state store in Structured Streaming - introduction

When I started to think about implementing my own state store, I had an idea to load the state on demand for a given key from a distributed store with single-digit millisecond latency, like AWS DynamoDB. However, after analyzing the StateStore API and how it's used in different places, I saw it wouldn't be easy.

December 7, 2019 • Apache Spark Structured Streaming

Extending data reprocessing period for arbitrary stateful processing applications

After my Summit talk I got an interesting question offline about data reprocessing in the sessionization streaming pipeline. In this post I will try to develop the answer I gave.

December 1, 2019 • Apache Spark Structured Streaming

Custom checkpoint file manager in Structured Streaming

In this post I will start the customization part of the topics covered during my talk. The first customized class will be the one responsible for checkpoint management.

November 23, 2019 • Apache Spark Structured Streaming

Sessionization pipeline - from Kafka to Kinesis version

I'm slowly getting closer to the end of the Spark+AI Summit follow-up post series. But before I finish, I owe you an explanation of how to run the pipeline from my GitHub on Kinesis.

November 17, 2019 • Apache Spark Structured Streaming

Kafka timestamp as the watermark

In the first version of my demo application I used Kafka's timestamp field as the watermark. At that moment I was exploring the internals of arbitrary stateful processing, so it wasn't a big deal. But just in case you're wondering why I didn't keep that for the official demo version, I wrote this article.

November 16, 2019 • Apache Spark Structured Streaming

Reprocessing stateful data pipelines in Structured Streaming

During my talk, I insisted a lot on the reprocessing part, maybe because it's the least pleasant part to work with. After all, we all want to test new pipelines rather than reprocess the data because of regressions in the code or other errors. Despite that, it's important to know how Structured Streaming integrates with this data engineering task.

November 10, 2019 • Apache Spark Structured Streaming

Output modes in Structured Streaming

The series of notes I took during my Apache Spark Summit preparation continues. Today it's time to cover output modes, which I also used in the presented solution for the sessionization problem.

November 9, 2019 • Apache Spark Structured Streaming

State lifecycle management in Structured Streaming

In this post about the state store in Structured Streaming I will focus on state lifecycle management. The goal is to see what happens when the state expires, why removing it from the state store is so important, and to answer some other interesting questions!

November 2, 2019 • Apache Spark Structured Streaming

Delta and snapshot state store formats

The state store uses the checkpoint location to persist state, which is also cached locally in memory for faster access during processing. The checkpoint location is used at the recovery stage. An important thing to know here is that the checkpointed state comes in 2 file formats: delta and snapshot files.
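
To illustrate how the two formats can cooperate at the recovery stage, here is a toy pure-Python model. It is a sketch under simplifying assumptions, not Spark's actual file layout or code: a delta holds only the changes made in one version (with `None` marking a removed key), a snapshot holds the full state at one version, and recovery starts from the latest usable snapshot and replays the deltas written after it. All names are hypothetical.

```python
from typing import Dict, Optional

# A delta maps key -> new value; None marks a removed key (an assumption of this sketch).
Delta = Dict[str, Optional[int]]
State = Dict[str, int]


def recover_state(snapshots: Dict[int, State], deltas: Dict[int, Delta], version: int) -> State:
    """Rebuild the state at `version` from the latest snapshot plus the later deltas."""
    usable = [v for v in snapshots if v <= version]
    base_version = max(usable) if usable else 0
    # Start from a copy of the snapshot, or from an empty state if none exists yet.
    state = dict(snapshots[base_version]) if usable else {}
    for v in range(base_version + 1, version + 1):
        for key, value in deltas[v].items():
            if value is None:
                state.pop(key, None)  # the key was removed in this version
            else:
                state[key] = value    # the key was added or updated in this version
    return state
```

The trade-off this models: deltas are cheap to write for every version, while an occasional snapshot bounds the number of files that have to be replayed during recovery.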

October 27, 2019 • Apache Spark Structured Streaming

Watermark in Structured Streaming

I have already talked about watermarks on my blog, but this time I will focus more on their use in the context of stateful processing.
