Apache Spark Structured Streaming articles

Checkpoint file manager - FileSystem and FileContext

If you read my blog posts, you have certainly noticed that I often get lost on the internet. Fortunately, it often helps me write blog posts. But the internet is not the only place where I can get lost. It also happens to me in the Apache Spark code, and one of my most recent confusions was about the FileSystem and FileContext classes.

Continue Reading →

What's new in Apache Spark 3.1 - Structured Streaming

Aside from the joins presented in the previous blog post, Structured Streaming also got a few other interesting new features that I will present here.

Continue Reading →

What's new in Apache Spark 3.1 - streaming joins

In the previous blog post, you discovered what changed for joins in Apache Spark 3.1. If you remember the summary sentence, they were not the only join changes in this new release. Apart from them, you can also do a bit more with Structured Streaming joins!

Continue Reading →

Data+AI Summit: custom state store integration feedback

After the introductory part, it's time to share what I learned from the custom state store implementation.

Continue Reading →

Data+AI Summit: Custom state store - API

After the previous introductory posts, it's time to delve into the state store API and implement our own custom state store.

Continue Reading →

Structured Streaming and temporary views

I don't know about you, but when I first saw code calling the createTempView method, I thought it created a temporary table in the metastore. That's not true, and in this blog post you will see why.

Continue Reading →

Data+AI Summit follow-up: arbitrary stateful processing and state management

After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.

Continue Reading →

Data+AI Summit follow-up: joins and state management

Streaming joins are an interesting feature that heavily uses the state store. Even though I already blogged about it in the past (2018), some changes have been made since then and, I hope, my ability to explain has improved too.

Continue Reading →

Watermark and window-based processing

One of the less obvious things about the watermark is how it applies to windows. At first glance, you could think that it filters out the records produced before the watermark value. But that's not how it works for windows.
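To make the difference concrete, here is a minimal sketch in plain Python. It is not Apache Spark code. It assumes tumbling windows, timestamps expressed in seconds, and my understanding that a record is considered late only when the end of its window is not later than the watermark. The names `WINDOW_SECONDS`, `window_end`, `is_dropped_naive` and `is_dropped_windowed` are hypothetical and exist only for this illustration.

```python
WINDOW_SECONDS = 600  # hypothetical 10-minute tumbling window


def window_end(event_time: int, window_seconds: int = WINDOW_SECONDS) -> int:
    """Return the (exclusive) end of the tumbling window containing event_time."""
    window_start = event_time - (event_time % window_seconds)
    return window_start + window_seconds


def is_dropped_naive(event_time: int, watermark: int) -> bool:
    """First-glance intuition: drop every record older than the watermark."""
    return event_time < watermark


def is_dropped_windowed(event_time: int, watermark: int) -> bool:
    """Assumed window rule: drop only if the record's window ended at or before the watermark."""
    return window_end(event_time) <= watermark


watermark = 1500
# Event at 1300 is older than the watermark, but its window [1200, 1800) is still open.
late_but_kept = (is_dropped_naive(1300, watermark), is_dropped_windowed(1300, watermark))
# Event at 1100 belongs to the window [600, 1200), which already ended before the watermark.
late_and_dropped = is_dropped_windowed(1100, watermark)
```

With a watermark of 1500, the record at 1300 would be rejected by the naive rule but accepted by the window-based one, whereas the record at 1100 is rejected by both.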

Continue Reading →

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Continue Reading →

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is dropDuplicates. You can use it to deduplicate your streaming data before pushing it to the sink.
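The reason why deduplication needs a state can be sketched in a few lines of plain Python. This is a simplified model and not Spark's implementation: the `DeduplicationState` class is a hypothetical stand-in for the state store, the `id` field is an assumed deduplication key, and nothing is ever expired from the state, which a real pipeline would handle with a watermark.

```python
from typing import Any, Dict, List, Set


class DeduplicationState:
    """Hypothetical stand-in for a state store: remembers the keys already seen."""

    def __init__(self) -> None:
        self._seen_keys: Set[Any] = set()

    def process_batch(self, records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Return only the records whose key was not seen in this or any earlier batch."""
        output = []
        for record in records:
            key = record["id"]  # assumed deduplication key
            if key not in self._seen_keys:
                self._seen_keys.add(key)
                output.append(record)
        return output


state = DeduplicationState()
# Micro-batch 1: the second record with id=1 is a duplicate inside the same batch.
first_batch = state.process_batch([
    {"id": 1, "value": "a"},
    {"id": 1, "value": "b"},
    {"id": 2, "value": "c"},
])
# Micro-batch 2: id=2 was already emitted in the previous batch, so only id=3 passes.
second_batch = state.process_batch([
    {"id": 2, "value": "d"},
    {"id": 3, "value": "e"},
])
```

Without the state kept between the two calls, the record with `id` 2 would be emitted twice, once per micro-batch.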

Continue Reading →

Data+AI Summit follow-up: global limit and state management

This is the second Data+AI Summit follow-up post, but the first one focusing on stateful operations and their interaction with the state store.

Continue Reading →

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance and also triggers the state-based computation.

Continue Reading →

Broadcasting in Structured Streaming

Some time ago @ArunJijo36 mentioned me on Twitter with a question about broadcasting in Structured Streaming. If, like me at that time, you don't know what happens, I think this article will be useful for you 👊

Continue Reading →

File source and its internals

A few months ago, before the Apache Spark 3.0 features series, you probably noticed a short series about file processing in Structured Streaming. If you enjoyed it, here is a complementary note presenting the file data source :)

Continue Reading →

What's new in Apache Spark 3 - Structured Streaming

The Apache Kafka changes in Apache Spark 3.0 were one of the first topics covered in the "what's new" series. Even though there were a lot of changes related to the Kafka source and sink, they're not the only ones in Structured Streaming.

Continue Reading →

File sink and Out-Of-Memory risk

A few weeks ago I wrote 3 posts about the file sink in Structured Streaming. At that time I wasn't aware of one potential issue, namely an Out-Of-Memory problem that will happen at some point.

Continue Reading →

What's new in Apache Spark 3.0 - Apache Kafka integration improvements

After the previous presentations of the new date time and functions features in Apache Spark 3.0, it's time to see what's new on the streaming side, in the Structured Streaming module, and more precisely, in its Apache Kafka integration.

Continue Reading →

Structured Streaming file sink and reprocessing

I presented in my previous posts how to use a file sink in Structured Streaming. I focused there on the internal execution and its use in the context of data reprocessing. In this post I will address a few of the previously described points.

Continue Reading →

File sink and manifest compaction

In my previous post I introduced the file sink in Apache Spark Structured Streaming. Today it's time to focus on an important concept of this output format, namely the manifest file lifecycle.

Continue Reading →