Articles about Apache Spark Structured Streaming on waitingforcode.com - articles for the pleasure of learning and discovery

March 27, 2021 • Apache Spark Structured Streaming

What's new in Apache Spark 3.1 - Structured Streaming

Aside from the joins presented in the previous blog post, Structured Streaming also got a few other interesting new features that I will present here.

Continue Reading →

March 20, 2021 • Apache Spark Structured Streaming

What's new in Apache Spark 3.1 - streaming joins

In the previous blog post, you discovered what changed for joins in Apache Spark 3.1. If you remember the summary sentence, it was not the single join changes in this new release. Apart from them, you can also do a bit more with Structured Streaming joins!

Continue Reading →

January 30, 2021 • Apache Spark Structured Streaming

Data+AI Summit: custom state store integration feedback

After the introductory part, it's time to share what I learned from the custom state store implementation.

Continue Reading →

January 23, 2021 • Apache Spark Structured Streaming

Data+AI Summit: Custom state store - API

After previous introductory posts, it's time to deep delve into the state store API and implement our own custom state store.

Continue Reading →

January 10, 2021 • Apache Spark Structured Streaming

Structured Streaming and temporary views

I don't know you, but me, when I first saw the code with createTempView method, I thought it created a temporary table in the metastore. But it's not true and in this blog post, you will see why.

Continue Reading →

January 9, 2021 • Apache Spark Structured Streaming

Data+AI Summit follow-up: arbitrary stateful processing and state management

After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.

Continue Reading →

January 2, 2021 • Apache Spark Structured Streaming

Data+AI Summit follow-up: joins and state management

Streaming joins are an interesting feature that heavily uses state store. Even though I already blogged about it in the past (2018), some changes were made and also - I hope so - my explanation capacity improved.

Continue Reading →

December 26, 2020 • Apache Spark Structured Streaming

Watermark and window-based processing

One of the not obvious things about the watermark is how it applies on the windows. At first glance, you could think that it will filter out the records produced before the watermark value. But it's not how it works for windows.

Continue Reading →

December 12, 2020 • Apache Spark Structured Streaming

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Continue Reading →

November 28, 2020 • Apache Spark Structured Streaming

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.

Continue Reading →

November 14, 2020 • Apache Spark Structured Streaming

Data+AI Summit follow-up: global limit and state management

It's the second follow-up Data+AI Summit post but the first one focusing on the stateful operations and their interaction with the state store.

Continue Reading →

November 7, 2020 • Apache Spark Structured Streaming

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance but also triggers state-based computation.

Continue Reading →

October 25, 2020 • Apache Spark Structured Streaming

Broadcasting in Structured Streaming

Some time ago @ArunJijo36 mentioned me on Twitter with a question about broadcasting in Structured Streaming. If, like me at this time, you don't know what happens, I think that this article will be good for you 👊

Continue Reading →

October 18, 2020 • Apache Spark Structured Streaming

File source and its internals

Few months ago, before the Apache Spark 3.0 features series, you probably noticed a short series about files processing in Structured Streaming. If you enjoyed it, here is a complementary note presenting the file data source :)

Continue Reading →

October 17, 2020 • Apache Spark Structured Streaming

What's new in Apache Spark 3 - Structured Streaming

Apache Kafka changes in Apache Spark 3.0 was one of the first topics covered in the "what's new" series. Even though there were a lot of changes related to the Kafka source and sink, they're not the single ones in Structured Streaming.

Continue Reading →

July 12, 2020 • Apache Spark Structured Streaming

File sink and Out-Of-Memory risk

A few weeks ago I wrote 3 posts about file sink in Structured Streaming. At this time I wasn't aware of one potential issue, namely an Out-Of-Memory problem that at some point will happen.

Continue Reading →

July 11, 2020 • Apache Spark Structured Streaming

What's new in Apache Spark 3.0 - Apache Kafka integration improvements

After previous presentations of the new date time and functions features in Apache Spark 3.0 it's time to see what's new on the streaming side in Structured Streaming module, and more precisely, on its Apache Kafka integration.

Continue Reading →

June 14, 2020 • Apache Spark Structured Streaming

Structured Streaming file sink and reprocessing

I presented in my previous posts how to use a file sink in Structured Streaming. I focused there on the internal execution and its use in the context of data reprocessing. In this post I will address a few of the previously described points.

Continue Reading →

June 6, 2020 • Apache Spark Structured Streaming

File sink and manifest compaction

In my previous post I introduced the file sink in Apache Spark Structured Streaming. Today it's time to focus on an important concept of this output format which is the manifest file lifecycle.

Continue Reading →

May 30, 2020 • Apache Spark Structured Streaming

File sink in Apache Spark Structured Streaming

One of the homework tasks of my Become a Data Engineer course is about synchronizing streaming data with a file system storage. When I was trying to implement this part, I found a manifest-based file stream that I will explore in this and next blog posts.

Continue Reading →

Apache Spark Structured Streaming articles