Overwriting partitioned tables in Apache Spark SQL

After publishing my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode, while insertInto does. True, but is there an alternative that doesn't require using this position-based function?

The insertInto trap in Apache Spark SQL

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. That's the case with the insertInto operation, which can even lead to data quality issues. Why? Let's try to understand in this short article.

Event time skew and global watermark in Apache Spark Structured Streaming

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment. Now the watermark topic is back in my learning backlog, and it's a good opportunity to return to the event skew topic and see the dangers it brings for Structured Streaming stateful jobs.

Delta Lake and restore - traveling in time differently

Time travel is quite a popular Delta Lake feature. But did you know it's not the only one you can use to interact with past versions? An alternative is the RESTORE command, and it'll be the topic of this blog post.

2024 retrospective on waitingforcode.com

Even though I was blogging less in the second half of the previous year, the retrospective is still the blog post I'm waiting for each year. Every year I summarize what happened in the past 12 months and share with you my future plans. It's time for the 2024 Edition!

DAIS 2024: Unit tests - configuration and declaration

Code organization and assertions flow are both important, but even they can't guarantee your colleagues' adherence to the unit tests. There are other user-facing attributes to consider as well.

DAIS 2024: Orchestrating and scoping assertions in Apache Spark Structured Streaming

Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests on the job logic are not. After all, streaming jobs often iteratively build the final outcome while the batch ones generate it in a single pass.

Data+AI Summit 2024 - Retrospective - Apache Spark

Welcome to the second blog post dedicated to the previous Data+AI Summit. This time I'm going to share with you a summary of Apache Spark talks.

DAIS 2024: Testing framework from the Dataflow model for Apache Spark Structured Streaming

With this blog post I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot, as the previous DAIS with me as a speaker was 4 years ago! As before, I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of time constraints.

Data+AI Summit 2024 - Retrospective - Streaming

Welcome to the first Data+AI Summit 2024 retrospective blog post. I'm opening the series with the topic close to my heart at the moment, stream processing!

Infoshare 2024 - Retrospective

Last May I gave a talk about stream processing fallacies at Infoshare in Gdansk. Besides this speaking experience, I was also - and maybe among others - an attendee who enjoyed several talks in software and data engineering areas. I'm writing this blog post to remember them and, why not, to share the knowledge with you!

Delta Lake table as a changelog

One of the big challenges in streaming from Delta Lake tables is the inability to handle in-place changes, like updates, deletes, or merges. There is good news, though. With a little bit of effort on your data provider's side, you can process a Delta Lake table as you would process Apache Kafka topics, hence without in-place changes.

Infoshare 2024: Stream processing fallacies, part 2

This blog post shares the last fallacies from my 7-year stream processing journey.

Infoshare 2024: Stream processing fallacies, part 1

Last week I was speaking in Gdansk on the DataMass track at Infoshare. As often happens, the talk time slot impacted what I wanted to share, but maybe it's for the best. Otherwise, you wouldn't be reading about stream processing fallacies!

mapGroupsWithState and...batch?

That's one of my recent surprises. While I was exploring arbitrary stateful processing, hence mapGroupsWithState among others, I mistakenly created a batch DataFrame and applied the mapping function on top of it. Turns out, it worked! Well, not really, but I'll let you discover why in this blog post.
