Apache Spark Structured Streaming articles

Watermark and window-based processing

One of the not obvious things about the watermark is how it applies on the windows. At first glance, you could think that it will filter out the records produced before the watermark value. But it's not how it works for windows.

Continue Reading β†’

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Continue Reading β†’

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.

Continue Reading β†’

Data+AI Summit follow-up: global limit and state management

It's the second follow-up Data+AI Summit post but the first one focusing on the stateful operations and their interaction with the state store.

Continue Reading β†’

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance but also triggers state-based computation.

Continue Reading β†’

Broadcasting in Structured Streaming

Some time ago @ArunJijo36 mentioned me on Twitter with a question about broadcasting in Structured Streaming. If, like me at this time, you don't know what happens, I think that this article will be good for you 👊

Continue Reading β†’

File source and its internals

Few months ago, before the Apache Spark 3.0 features series, you probably noticed a short series about files processing in Structured Streaming. If you enjoyed it, here is a complementary note presenting the file data source :)

Continue Reading β†’

What's new in Apache Spark 3 - Structured Streaming

Apache Kafka changes in Apache Spark 3.0 was one of the first topics covered in the "what's new" series. Even though there were a lot of changes related to the Kafka source and sink, they're not the single ones in Structured Streaming.

Continue Reading β†’

File sink and Out-Of-Memory risk

A few weeks ago I wrote 3 posts about file sink in Structured Streaming. At this time I wasn't aware of one potential issue, namely an Out-Of-Memory problem that at some point will happen.

Continue Reading β†’

What's new in Apache Spark 3.0 - Apache Kafka integration improvements

After previous presentations of the new date time and functions features in Apache Spark 3.0 it's time to see what's new on the streaming side in Structured Streaming module, and more precisely, on its Apache Kafka integration.

Continue Reading β†’

Structured Streaming file sink and reprocessing

I presented in my previous posts how to use a file sink in Structured Streaming. I focused there on the internal execution and its use in the context of data reprocessing. In this post I will address a few of the previously described points.

Continue Reading β†’

File sink and manifest compaction

In my previous post I introduced the file sink in Apache Spark Structured Streaming. Today it's time to focus on an important concept of this output format which is the manifest file lifecycle.

Continue Reading β†’

File sink in Apache Spark Structured Streaming

One of the homework tasks of my Become a Data Engineer course is about synchronizing streaming data with a file system storage. When I was trying to implement this part, I found a manifest-based file stream that I will explore in this and next blog posts.

Continue Reading β†’

Idempotent logic for stateful processing and late data

Sometimes I come back to the topics I already covered, often because by mistake I discover something new that can improve them. And that's the case for my today's article about idempotence in stateful processing.

Continue Reading β†’

Watermarks and not grouped query - why they don't work

Several weeks ago I played with watermark, just to recall some concepts. I wrote a query and...the watermark didn't work! Of course, my query was wrong but this intrigued me enough to write this short article.

Continue Reading β†’

Nested fields, dropDuplicates and watermark in Apache Spark Structured Streaming

When I was playing with my data-generator and Apache Spark Structured Streaming, I was surprised by one behavior that I would like to share and explain in this post. To not deep delve into the details right now, the story will be about the use of nested structures in several operations.

Continue Reading β†’

Two topics, two schemas, one subscription in Apache Spark Structured Streaming

After my January's talk about Apache Kafka integration in Structured Streaming I got an interesting question on off. The question was, how to process 2 topics simultaneously with Structured Streaming? The "small" problem was the fact that both had different schemas.

Continue Reading β†’

Corrupted records aka poison pill records in Apache Spark Structured Streaming

Some time ago I watched an interesting Devoxx France 2019 talk about poison pills in streaming systems presented by LoΓ―c Divad. I learned a few interesting patterns like sentinel value that may help to deal with corrupted data but the talk was oriented on Kafka Streams. And since I didn't find a corresponding resource for Apache Spark Structured Streaming [and also because I'm simply an Apache Spark enthusiast ;)], I decided to write one trying to implement LoΓ―c's ideas in the Structured Streaming world.

Continue Reading β†’

Apache Kafka source in Structured Streaming - "beyond the offsets"

Even though I've already written a few posts about Apache Kafka as a data source in Apache Spark Structured Streaming, I still had some questions in my head. In this post I will try to answer them and let this Kafka integration in Spark topic for investigation later.

Continue Reading β†’

Apache Kafka sink in Structured Streaming

I've written a lot about data sources, including Apache Kafka. However, Apache Spark is not only about sources but also about targets called sinks. In this post I will focus on Apache Kafka sink integration and try to answer some question in FAQ mode.

Continue Reading β†’