Data processing articles


Looking for something else? Check the categories of Data processing:

Apache Beam, Apache Flink, Apache Spark, Apache Spark GraphFrames, Apache Spark GraphX, Apache Spark SQL, Apache Spark Streaming, Apache Spark Structured Streaming, PySpark

If not, below you can find all articles belonging to Data processing.

November 4, 2025 • Apache Spark SQL

Apache Spark and the show command

Some time ago when I was analyzing the execution of my Apache Spark job on Spark UI, I noticed a limit(...) action. It was weird as I actually was running only the show(...) command to display the DataFrame locally. At the time I understood why but hadn't found time to write a blog post. Recently Antoni reminded me on LinkedIn that I should have blogged about show(...) back then to better answer his question :)

Continue Reading →

September 25, 2025 • Apache Spark Structured Streaming

Apache Spark Structured Streaming UI patterns

When you start a Structured Streaming job, your Spark UI gets a new tab in the menu where you can follow the progress of the running jobs. At first this part may look a bit complex, but there are some visual detection patterns that can help you understand what's going on.

Continue Reading →

August 20, 2025 • Apache Spark Structured Streaming

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - batch

To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming let's focus on its...batch counterpart!

Continue Reading →

August 13, 2025 • Apache Spark Structured Streaming

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - internals

Last week we discovered the new way to write arbitrary stateful transformations in Apache Spark 4 with the transformWithState API. Today it's time to delve into the implementation details and try to understand the internal logic a bit better.

Continue Reading →

August 6, 2025 • Apache Spark Structured Streaming

What's new in Apache Spark 4.0 - Arbitrary state API v2 - introduction

Arbitrary stateful processing has been evolving a lot in Apache Spark. The initial version with updateStateByKey evolved to mapWithState in Apache Spark 2. When Structured Streaming was released, the framework got mapGroupsWithState and flatMapGroupsWithState. Now, Apache Spark 4 introduces a completely new way to interact with the arbitrary stateful processing logic, the Arbitrary state API v2!

Continue Reading →

June 13, 2025 • Apache Spark SQL

Lateral column aliases in Apache Spark SQL

It's the second blog post about laterals in Apache Spark SQL. Previously you discovered how to combine queries with lateral subquery and lateral views. Now it's time to see a more local feature, lateral column aliases.

Continue Reading →

June 4, 2025 • Apache Spark SQL

Lateral subquery, aka lateral join, and lateral views in Apache Spark SQL

Seven (!) years have passed since my blog post about Join types in Apache Spark SQL (2017). Coming from a software engineering background, I was so amazed that the world of joins doesn't stop at LEFT/RIGHT/FULL joins that I couldn't not blog about it ;) Time has passed but, lucky me, each new project teaches me something.

Continue Reading →

May 8, 2025 • PySpark

Abstracting column access in PySpark with Proxy design pattern

One of the biggest changes for PySpark has been the DataFrame API. It greatly reduces the JVM-to-PVM communication overhead and improves the performance. However, it also complicates the code. Probably, some of you have already seen, written, or worked with code like this...

Continue Reading →

February 26, 2025 • Apache Spark SQL

The saveAsTable in Apache Spark SQL, alternative to insertInto

Is there an easier way to address the insertInto position-based data writing in Apache Spark SQL? Totally, if you use a column-based method such as saveAsTable with append mode.
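To make the difference between position-based and column-based writing concrete, here is a minimal plain-Python sketch. It is not Spark code: the helpers `insert_by_position` and `insert_by_name`, the table schema, and the sample row are all hypothetical and only model the two resolution strategies.

```python
# Plain-Python model of two ways to map incoming columns onto a table schema.
# Nothing here is a Spark API; the helper names are hypothetical.

TABLE_COLUMNS = ["id", "first_name", "last_name"]


def insert_by_position(table_columns, input_columns, row):
    """Ignore the input column names and match values by their position."""
    return dict(zip(table_columns, row))


def insert_by_name(table_columns, input_columns, row):
    """Match values by column name, whatever their order in the input."""
    named = dict(zip(input_columns, row))
    return {column: named[column] for column in table_columns}


# The input has the same columns as the table but in a different order.
input_columns = ["id", "last_name", "first_name"]
row = (1, "Doe", "Jane")

by_position = insert_by_position(TABLE_COLUMNS, input_columns, row)
by_name = insert_by_name(TABLE_COLUMNS, input_columns, row)

print(by_position)  # first_name and last_name are silently swapped
print(by_name)      # values land in the columns they were named after
```

In this model the position-based write succeeds without any error, which is what makes the mistake hard to spot: both swapped columns are strings, so nothing fails at write time.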

Continue Reading →

February 19, 2025 • Apache Spark Structured Streaming

Dealing with quotas and limits - Apache Spark Structured Streaming for Amazon Kinesis Data Streams

Using cloud managed services is often a love-hate story. On the one hand, they abstract away a lot of tedious administrative work to let you focus on the essentials. On the other hand, they often have quotas and limits that you, as a data engineer, have to take into account in your daily work. These limits become even more serious when they apply in a latency-sensitive context such as stream processing.

Continue Reading →

February 12, 2025 • Apache Spark SQL

Overwriting partitioned tables in Apache Spark SQL

After publishing my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode while insertInto does. True, but is there an alternative that doesn't require using this position-based function?

Continue Reading →

January 23, 2025 • Apache Spark SQL

The insertInto trap in Apache Spark SQL

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. That's the case with the insertInto operation, which can even lead to data quality issues. Why? Let's try to understand in this short article.

Continue Reading →

January 15, 2025 • Apache Spark Structured Streaming

Event time skew and global watermark in Apache Spark Structured Streaming

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment. Now the watermark topic is back in my learning backlog and it's a good opportunity to return to the event skew topic and see the dangers it brings for Structured Streaming stateful jobs.
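As a rough illustration of why skew matters, the sketch below models a single global watermark in plain Python. It assumes the watermark is derived from the highest event time seen so far minus an allowed lateness, and that it only takes effect for the next micro-batch. The function names, the 10-minute lateness, and the sample timestamps are made up for the example; this is not Structured Streaming code.

```python
# Plain-Python model of a single global watermark; not Spark code.
# Event times are minutes since an arbitrary start; all values are invented.

ALLOWED_LATENESS = 10


def next_watermark(current_watermark, batch_event_times):
    """Advance the watermark from the max event time seen in the batch."""
    candidate = max(batch_event_times) - ALLOWED_LATENESS
    return max(current_watermark, candidate)  # a watermark never moves back


def split_late(watermark, batch_event_times):
    """Separate events that are still accepted from the ones considered late."""
    on_time = [t for t in batch_event_times if t >= watermark]
    late = [t for t in batch_event_times if t < watermark]
    return on_time, late


watermark = 0

# Batch 1: one partition is far ahead (event time 60), the others are near 5.
batch_1 = [4, 5, 6, 60]
on_time_1, late_1 = split_late(watermark, batch_1)
watermark = next_watermark(watermark, batch_1)

# Batch 2: the slower partitions keep delivering events around 7-9.
batch_2 = [7, 8, 9, 61]
on_time_2, late_2 = split_late(watermark, batch_2)

print(watermark)  # 50
print(late_2)     # [7, 8, 9]
```

A single fast partition pushes the watermark to 50, so in the second batch every event from the slower partitions falls behind it and would be treated as late, even though those partitions are delivering their data in order.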

Continue Reading →

August 22, 2024 • Apache Spark Structured Streaming

DAIS 2024: Unit tests - configuration and declaration

Code organization and assertion flow are both important, but even they can't guarantee your colleagues' adherence to the unit tests. There are other user-facing attributes to consider as well.

Continue Reading →

August 9, 2024 • Apache Spark Structured Streaming

DAIS 2024: Orchestrating and scoping assertions in Apache Spark Structured Streaming

Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests on the job logic are not. After all, streaming jobs often iteratively build the final outcome while the batch ones generate it in a single pass.

Continue Reading →

July 16, 2024 • Apache Spark Structured Streaming

DAIS 2024: Testing framework from the Dataflow model for Apache Spark Structured Streaming

With this blog post I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot as the previous DAIS with me as a speaker was 4 years ago! As before, I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of time constraints.

Continue Reading →

May 10, 2024 • Apache Spark SQL

mapGroupsWithState and...batch?

That's one of my recent surprises. While I was exploring arbitrary stateful processing, hence mapGroupsWithState among others, I mistakenly created a batch DataFrame and applied the mapping function on top of it. Turns out, it worked! Well, not really, but I'll let you discover why in this blog post.

Continue Reading →

May 8, 2024 • Apache Spark Structured Streaming

OutputModes in Apache Spark Structured Streaming - complementary notes

I wrote a blog post about OutputModes 6 (yes!) years ago and after reading it a few times, I realized it was not good enough to be a quick refresher. For that reason you can read about OutputModes for the second time here. Hopefully, this one will be a good try!

Continue Reading →

April 18, 2024 • Apache Spark Structured Streaming

Stopping a Structured Streaming query

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. After all, sometimes you may need to release a new job package with upgraded dependencies or improved business logic. What happens then?

Continue Reading →

April 11, 2024 • Apache Flink

Data enrichment strategies in Apache Flink

Data enrichment is a crucial step in making data more usable by business users. Doing that in a batch job is relatively easy due to the static nature of the dataset. When it comes to streaming, the task is more challenging.

Continue Reading →
