Articles about Data processing on waitingforcode.com - articles for the pleasure of learning and discovery


January 26, 2026 • Apache Spark SQL

Clean architecture for PySpark

This year, I'll be exploring various software engineering principles and their application to data engineering. Clean Architecture is a prime example, and it serves as the focus of this post.


January 20, 2026 • Apache Spark SQL

ASOF Join in Apache Spark SQL

Even though eight years have passed since my blog post about various join types in Apache Spark SQL, I'm still learning something new about the apparently simple data operation that is the join. Recently, in "Building Machine Learning Systems with a Feature Store: Batch, Real-Time, and LLM Systems" by Jim Dowling, I read about ASOF Joins and decided to dedicate some space to them on the blog.
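
If the term is new to you: an ASOF join pairs each row from the left side with the most recent row from the right side whose timestamp is not later than its own. Below is a minimal plain-Python sketch of that matching rule only. It is not how Apache Spark implements it, and the names asof_join, trades, and quotes are hypothetical, made up for the illustration.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Pair each (ts, value) of left with the latest (ts, value) of right
    whose timestamp is <= the left timestamp; None when nothing qualifies."""
    right_sorted = sorted(right, key=lambda row: row[0])
    right_ts = [ts for ts, _ in right_sorted]
    joined = []
    for ts, value in left:
        # Number of right-side rows with a timestamp <= the left timestamp
        idx = bisect_right(right_ts, ts)
        match = right_sorted[idx - 1][1] if idx > 0 else None
        joined.append((ts, value, match))
    return joined


# Hypothetical sample data: (timestamp, label)
trades = [(1, "t1"), (5, "t2"), (10, "t3")]
quotes = [(2, "q1"), (4, "q2"), (10, "q3")]
result = asof_join(trades, quotes)
```

The first trade has no quote at or before its timestamp, so it gets None, while the two others pick the closest earlier (or equal) quote.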


January 16, 2026 • Apache Flink

Apache Flink vs. Apache Spark Structured Streaming - high-level comparison

If you follow me, you know I'm an Apache Spark enthusiast. Despite that, I'm doing my best to keep my mind open to other technologies. The one that has caught my attention in recent years is Apache Flink, and I found no better way to start than comparing it with Apache Spark Structured Streaming.


November 27, 2025 • Apache Spark SQL

Outer operations in Apache Spark, or why I consider NULLs as NullPointerException

Picture this. You get a list of values in a column and you need to combine each of them with another row. The simplest way to do that is to use the explode operation and create a dedicated row for each of the concatenated values. Unluckily for you, several rows in the input have nulls instead of the list.
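
To make the problem concrete, here is a small plain-Python sketch that mimics the documented difference between Spark's explode and explode_outer functions: the former drops rows whose list is null or empty, the latter keeps them with a null value. The two helper functions below are simplified stand-ins written for this illustration, not Spark code, and the sample rows are made up.

```python
def explode(rows):
    """Mimics explode: a row whose list is None or empty produces no output."""
    return [(key, item) for key, items in rows for item in (items or [])]


def explode_outer(rows):
    """Mimics explode_outer: a None or empty list yields one row with None."""
    out = []
    for key, items in rows:
        if items:
            out.extend((key, item) for item in items)
        else:
            out.append((key, None))
    return out


# Hypothetical input: (key, list of values or None)
rows = [("a", [1, 2]), ("b", None), ("c", [])]
inner = explode(rows)
outer = explode_outer(rows)
```

With the inner variant, keys "b" and "c" silently disappear from the result; with the outer variant they survive, each carrying a None.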


November 4, 2025 • Apache Spark SQL

Apache Spark and the show command

Some time ago when I was analyzing the execution of my Apache Spark job on Spark UI, I noticed a limit(...) action. It was weird as I actually was running only the show(...) command to display the DataFrame locally. At the time I understood why but hadn't found time to write a blog post. Recently Antoni reminded me on LinkedIn that I should have blogged about show(...) back then to better answer his question :)


September 25, 2025 • Apache Spark Structured Streaming

Apache Spark Structured Streaming UI patterns

When you start a Structured Streaming job, your Spark UI will get a new tab in the menu where you follow the progress of the running jobs. In the beginning this part may appear a bit complex but there are some visual detection patterns that can help you understand what's going on.


August 20, 2025 • Apache Spark Structured Streaming

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - batch

To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming let's focus on its...batch counterpart!


August 13, 2025 • Apache Spark Structured Streaming

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - internals

Last week we discovered the new way to write arbitrary stateful transformations in Apache Spark 4 with the transformWithState API. Today it's time to delve into the implementation details and try to understand the internal logic a bit better.


August 6, 2025 • Apache Spark Structured Streaming

What's new in Apache Spark 4.0 - Arbitrary state API v2 - introduction

Arbitrary stateful processing has been evolving a lot in Apache Spark. The initial version with updateStateByKey evolved to mapWithState in Apache Spark 2. When Structured Streaming was released, the framework got mapGroupsWithState and flatMapGroupsWithState. Now, Apache Spark 4 introduces a completely new way to interact with the arbitrary stateful processing logic, the Arbitrary state API v2!


June 13, 2025 • Apache Spark SQL

Lateral column aliases in Apache Spark SQL

It's the second blog post about laterals in Apache Spark SQL. Previously you discovered how to combine queries with lateral subquery and lateral views. Now it's time to see a more local feature, lateral column aliases.


June 4, 2025 • Apache Spark SQL

Lateral subquery, aka lateral join, and lateral views in Apache Spark SQL

Seven (!) years have passed since my blog post about Join types in Apache Spark SQL (2017). Coming from a software engineering background, I was so amazed that the world of joins doesn't stop at LEFT/RIGHT/FULL joins that I couldn't not blog about it ;) Time has passed but, lucky me, each new project teaches me something.


May 8, 2025 • PySpark

Abstracting column access in PySpark with Proxy design pattern

One of the biggest changes for PySpark has been the DataFrame API. It greatly reduces the JVM-to-PVM communication overhead and improves performance. However, it also complicates the code. Probably, some of you have already seen, written, or worked with code like this...


February 26, 2025 • Apache Spark SQL

The saveAsTable in Apache Spark SQL, alternative to insertInto

Is there an easier way to address the insertInto position-based data writing in Apache Spark SQL? Totally, if you use a column-based method such as saveAsTable with append mode.
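
The difference between position-based and column-based writing can be sketched in a few lines of plain Python. This is only an illustration of the two resolution rules, not Spark code: the helpers insert_by_position and insert_by_name, as well as the table and DataFrame columns, are hypothetical.

```python
def insert_by_position(table_columns, df_columns, row):
    """Position-based resolution: the incoming column names are ignored,
    the i-th value lands in the i-th table column."""
    return dict(zip(table_columns, row))


def insert_by_name(table_columns, df_columns, row):
    """Column-based resolution: each value is matched to the table column
    carrying the same name."""
    named = dict(zip(df_columns, row))
    return {column: named[column] for column in table_columns}


# Hypothetical table schema and an incoming row with a different column order
table_columns = ["id", "name", "city"]
df_columns = ["id", "city", "name"]
row = (1, "Paris", "Alice")

by_position = insert_by_position(table_columns, df_columns, row)
by_name = insert_by_name(table_columns, df_columns, row)
```

In the position-based variant "Paris" ends up in the name column without any error, which is exactly the kind of silent data quality issue a name-based write avoids.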


February 19, 2025 • Apache Spark Structured Streaming

Dealing with quotas and limits - Apache Spark Structured Streaming for Amazon Kinesis Data Streams

Using cloud managed services is often a love-hate story. On the one hand, they abstract away a lot of tedious administrative work to let you focus on the essentials. On the other, they often have quotas and limits that you, as a data engineer, have to take into account in your daily work. These limits become even more serious when they apply in a latency-sensitive context, such as stream processing.


February 12, 2025 • Apache Spark SQL

Overwriting partitioned tables in Apache Spark SQL

After publishing my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode, while insertInto does. True, but is there an alternative that doesn't require using this position-based function?


January 23, 2025 • Apache Spark SQL

The insertInto trap in Apache Spark SQL

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. That's the case with the insertInto operation, which can even lead to data quality issues. Why? Let's try to understand in this short article.


January 15, 2025 • Apache Spark Structured Streaming

Event time skew and global watermark in Apache Spark Structured Streaming

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming in depth at that moment. Now the watermark topic is back in my learning backlog, and it's a good opportunity to return to the event skew topic and see the dangers it brings to Structured Streaming stateful jobs.


August 22, 2024 • Apache Spark Structured Streaming

DAIS 2024: Unit tests - configuration and declaration

Code organization and assertion flow are both important, but even they can't guarantee your colleagues' adherence to the unit tests. There are other user-facing attributes to consider as well.


August 9, 2024 • Apache Spark Structured Streaming

DAIS 2024: Orchestrating and scoping assertions in Apache Spark Structured Streaming

Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests on the job logic are not. After all, streaming jobs often iteratively build the final outcome while the batch ones generate it in a single pass.


July 16, 2024 • Apache Spark Structured Streaming

DAIS 2024: Testing framework from the Dataflow model for Apache Spark Structured Streaming

With this blog post I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot, as my previous DAIS as a speaker was 4 years ago! As before, I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of time constraints.


