DAIS 2024: Testing framework from the Dataflow model for Apache Spark Structured Streaming

With this blog post I'm starting a follow-up series for my Data+AI Summit 2024 talk. I've missed this family of blog posts a lot, as my previous DAIS as a speaker was 4 years ago! As before, I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of time constraints.

Continue Reading →

Data+AI Summit 2024 - Retrospective - Streaming

Welcome to the first Data+AI Summit 2024 retrospective blog post. I'm opening the series with a topic close to my heart at the moment: stream processing!

Continue Reading →

Infoshare 2024 - Retrospective

Last May I gave a talk about stream processing fallacies at Infoshare in Gdansk. Besides speaking, I was also an attendee who enjoyed several talks in the software and data engineering areas. I'm writing this blog post to remember them and, why not, share the knowledge with you!

Continue Reading →

Delta Lake table as a changelog

One of the big challenges in streaming Delta Lake is the inability to handle in-place changes, like updates, deletes, or merges. There is good news, though. With a little bit of effort on your data provider's side, you can process a Delta Lake table as you would process Apache Kafka topics, hence without in-place changes.

Continue Reading →

Infoshare 2024: Stream processing fallacies, part 2

This blog post shares the remaining fallacies from my 7-year stream processing journey.

Continue Reading →

Infoshare 2024: Stream processing fallacies, part 1

Last week I was speaking in Gdansk on the DataMass track at Infoshare. As often happens, the talk's time slot limited what I wanted to share, but maybe that's for the best. Otherwise, you wouldn't be reading about stream processing fallacies!

Continue Reading →

mapGroupsWithState and...batch?

That's one of my recent surprises. While exploring arbitrary stateful processing, hence mapGroupsWithState among others, I mistakenly created a batch DataFrame and applied the mapping function on top of it. Turns out, it worked! Well, not really, but I'll let you discover why in this blog post.

Continue Reading →

OutputModes in Apache Spark Structured Streaming - complementary notes

I wrote a blog post about OutputModes 6 (yes!) years ago and after reading it a few times, I realized it was not good enough to be a quick refresher. For that reason you can read about OutputModes for the second time here. Hopefully, this one will be a good try!

Continue Reading →

Event time skew in stream processing

As a data engineer you're certainly familiar with data skew. Yes, this bad phenomenon where one task takes considerably more input than the others and often causes unexpected latency or failures. Turns out, stream processing also has its own skew, but one more related to time.

Continue Reading →

Stopping a Structured Streaming query

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. Sometimes you need to stop the job anyway, for example to release a new job package with upgraded dependencies or improved business logic. What happens then?

Continue Reading →

Data enrichment strategies in Apache Flink

Data enrichment is a crucial step in making data more usable for business users. Doing it in batch is relatively easy due to the static nature of the dataset. When it comes to streaming, the task is more challenging.

Continue Reading →

Rolling history logs in Spark History UI

Stream processing is great but it brings some gotchas that are not obvious. Logs are one of them.

Continue Reading →

Schema tracking in Delta Lake

Streaming Delta tables is slightly different from streaming native streaming sources, such as Apache Kafka topics. One of the significant differences is schema enforcement. It causes the job to fail when the schema of the streamed table changes.

Continue Reading →

StreamingQueryListener, from states to questions

Apache Spark leverages the observer design pattern for framework-to-code communication. One of the observer implementations is StreamingQueryListener.

Continue Reading →

Processing time trigger, to be or not to be?

That's the question. Omitting the processing time trigger makes micro-batch triggering more reactive, but it cannot be considered the one true best practice. Let's see why.

Continue Reading →