Spark AI Summit Europe 2019 articles articles

Extending state store in Structured Streaming - reprocessing and limits

In my previous post I have shown you the writing and reading parts of my custom state store implementation. Today it's time to cover the data reprocessing and also the limits of the solution.

Continue Reading →

Extending state store in Structured Streaming - reading and writing state

In my previous post I introduced the classes involved in the interactions with the state store, and also shown the big picture of the implementation. Today it's time to write some code :)

Continue Reading →

Why UnsafeRow.copy() for state persistence in the state store?

In my last Spark+AI Summit 2019 follow-up posts I'm implementing a custom state store. The extension is inspired by the default state store. At the moment of code analysis, one of the places that intrigued me was the put(key: UnsafeRow, value: UnsafeRow) method. Keep reading if you're curious why.

Continue Reading →

Extending state store in Structured Streaming - introduction

When I started to think about implementing my own state store, I had an idea to load the state on demand for given key from a distributed and single-digit milliseconds latency store like AWS DynamoDB. However, after analyzing StateStore API and how it's used in different places, I saw it won't be easy.

Continue Reading →

Extending data reprocessing period for arbitrary stateful processing applications

After my Summit's talk I got an interesting question on "off" for the data reprocessing of sessionization streaming pipeline. I will try to develop the answer I gave in this post.

Continue Reading →

Custom checkpoint file manager in Structured Streaming

In this post I will start the customization part of the topics covered during my talk. The first customized class will be the class responsible for the checkpoint management.

Continue Reading →

Local deduplication or dropDuplicates?

One of the points I wanted to cover during my talk but for which I haven't enough time, was the dilemma about using a local deduplication or Apache Spark's dropDuplicates method to not integrate duplicated logs. That will be the topic of this post.

Continue Reading →

Output invalidation pattern

My last slides of Spark Summit 2019 were dedicated to an output invalidation pattern that is very useful to build maintainable data pipelines. In this post I will deep delve into it.

Continue Reading →

Sessionization pipeline - from Kafka to Kinesis version

I'm slowly going closer to the end of Spark+AI Summit follow-up posts series. But before I terminated, I owe you an explanation for how to run the pipeline from my Github on Kinesis.

Continue Reading →

Kafka timestamp as the watermark

In the first version of my demo application I used Kafka's timestamp field as the watermark. At that moment I was exploring the internals of arbitrary stateful processing so it wasn't a big deal. But just in case if you're wondering what I didn't keep that for the official demo version, I wrote this article.

Continue Reading →

Reprocessing stateful data pipelines in Structured Streaming

During my talk, I insisted a lot on the reprocessing part. Maybe because it's the less pleasant part to work with. After all, we all want to test new pipelines rather than reprocess the data because of some regressions in the code or any other errors. Despite that, it's important to know how Structured Streaming integrates with this data engineering task.

Continue Reading →

Output modes in Structured Streaming

The series of notes I took during my Apache Spark Summit preparation continues. Today it's time to cover output modes that I also used in the presented solution for sessionization problem.

Continue Reading →

State lifecycle management in Structured Streaming

In this post about state store in Structured Streaming I will focus on the state lifecycle management. The goal is to see what happens when the state expires, why removing it from the state store is so important and some other interesting questions!

Continue Reading →

DataFrame or Dataset to solve sessionization problem?

When I was preparing the demo code for my talk about sessionization at Spark AI Summit 2019 in Amsterdam, I wrote my first version of code with DataFrame abstraction. I hadn't type safety but the data manipulation was quite clear thanks to the mapping. Later, I tried to rewrite the code with Dataset and I got type safety but sacrificed a little bit of clarity. Let me deep delve into that in this post.

Continue Reading →

Delta and snapshot state store formats

State store uses checkpoint location to persist state which is locally cached in memory for faster access during the processing. The checkpoint location is used at the recovery stage. An important thing to know here is that there are 2 file formats with checkpointed state, delta and snapshot files.

Continue Reading →

Watermark in Structured Streaming

I was already taking about watermark on my blog but this time I will focus more on its use in the context of a stateful processing.

Continue Reading →

State store 101

After checkpointing, it's time to start a new chapter of Spark Summit AI 2019 preparation posts. And in this new chapter I will describe the state store. It's the first of 3 articles about this important part of the stateful processing.

Continue Reading →

Checkpoint storage in Structured Streaming

At the moment of writing this post I'm preparing the content for my first Spark Summit talk about solving sessionization problem in batch or streaming. Since I'm almost sure that I will be unable to say everything I prepared, I decided to take notes and transform them into blog posts. You're currently reading the first post from this series (#Spark Summit 2019 talk notes).

Continue Reading →