Apache Spark SQL articles

Bartosz Konieczny

May 21, 2026 • Apache Spark SQL

Combining DataFrames - beyond UNIONs

As a seasoned data engineer, you certainly know that UNIONs are a great way to combine many DataFrames into one data processing abstraction. But do you know that Apache Spark has other methods that operate on multiple DataFrames and go far beyond simply concatenating two datasets?

January 26, 2026 • Apache Spark SQL

Clean architecture for PySpark

This year, I'll be exploring various software engineering principles and their application to data engineering. Clean Architecture is a prime example, and it serves as the focus of this post.

January 20, 2026 • Apache Spark SQL

ASOF Join in Apache Spark SQL

Even though eight years have passed since my blog post about the various join types in Apache Spark SQL, I'm still learning something new about this apparently simple data operation, the join. Recently, in "Building Machine Learning Systems with a Feature Store: Batch, Real-Time, and LLM Systems" by Jim Dowling, I read about ASOF Joins and decided to dedicate some space to them on the blog.

November 27, 2025 • Apache Spark SQL

Outer operations in Apache Spark, or why I consider NULLs as NullPointerException

Picture this. You get a list of values in a column and you need to combine each of them with another row. The simplest way to do that is to use the explode operation and create a dedicated row for each of the concatenated values. Unluckily for you, several rows in the input have nulls instead of the list.
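To make the null problem concrete, here is a minimal plain-Python sketch of the behavior. It is not Spark code: `explode_model` and `explode_outer_model` are hypothetical helpers that only imitate how Spark's `explode` drops rows whose list is null or empty, while `explode_outer` keeps them with a null value. The sample rows are made up for illustration.

```python
def explode_model(rows, column):
    """Imitates explode: a row whose list is None or empty yields no output rows."""
    result = []
    for row in rows:
        values = row[column]
        if not values:
            # The whole input row silently disappears from the output.
            continue
        for value in values:
            result.append({**row, column: value})
    return result


def explode_outer_model(rows, column):
    """Imitates explode_outer: a row whose list is None or empty is kept with a None value."""
    result = []
    for row in rows:
        values = row[column]
        if not values:
            result.append({**row, column: None})
            continue
        for value in values:
            result.append({**row, column: value})
    return result


# Hypothetical input: the second row has a null instead of a list.
rows = [
    {"id": 1, "letters": ["a", "b"]},
    {"id": 2, "letters": None},
]

inner_result = explode_model(rows, "letters")
outer_result = explode_outer_model(rows, "letters")

print([r["id"] for r in inner_result])
print([r["id"] for r in outer_result])
```

The row with `id` 2 is missing from the first result and present, with a null letter, in the second one.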

November 4, 2025 • Apache Spark SQL

Apache Spark and the show command

Some time ago, when I was analyzing the execution of my Apache Spark job in the Spark UI, I noticed a limit(...) action. It was weird as I was only running the show(...) command to display the DataFrame locally. At the time I understood why but hadn't found time to write a blog post. Recently Antoni reminded me on LinkedIn that I should have blogged about show(...) back then to better answer his question :)

June 13, 2025 • Apache Spark SQL

Lateral column aliases in Apache Spark SQL

It's the second blog post about laterals in Apache Spark SQL. Previously you discovered how to combine queries with lateral subqueries and lateral views. Now it's time to see a more local feature, lateral column aliases.

June 4, 2025 • Apache Spark SQL

Lateral subquery, aka lateral join, and lateral views in Apache Spark SQL

Seven (!) years have passed since my blog post about Join types in Apache Spark SQL (2017). Coming from a software engineering background, I was so amazed that the world of joins doesn't stop at LEFT/RIGHT/FULL joins that I couldn't not blog about it ;) Time has passed but, lucky me, each new project teaches me something.

February 26, 2025 • Apache Spark SQL

The saveAsTable in Apache Spark SQL, alternative to insertInto

Is there an easier way to deal with the position-based data writing of insertInto in Apache Spark SQL? Absolutely, if you use a column-name-based method such as saveAsTable with append mode.
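The difference between position-based and name-based writing can be sketched in plain Python. This is not the Spark API: `write_by_position` and `write_by_name` are hypothetical helpers, and the table layout and row are made up. They only model the two resolution strategies, assuming the incoming columns arrive in a different order than the table's.

```python
def write_by_position(table_columns, row):
    """Position-based model: the i-th incoming value lands in the i-th table column, names are ignored."""
    return dict(zip(table_columns, row.values()))


def write_by_name(table_columns, row):
    """Name-based model: each table column takes the incoming value with the same name."""
    return {column: row[column] for column in table_columns}


# Hypothetical table schema and an incoming row whose columns are in the opposite order.
table_columns = ["id", "name"]
row = {"name": "abc", "id": "1"}

by_position = write_by_position(table_columns, row)
by_name = write_by_name(table_columns, row)

print(by_position)
print(by_name)
```

In the position-based model the name ends up in the `id` column without any error, which is the kind of silent data quality issue a name-based write avoids.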

February 12, 2025 • Apache Spark SQL

Overwriting partitioned tables in Apache Spark SQL

After publishing my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode, while insertInto does. True, but is there an alternative that doesn't require using this position-based function?

January 23, 2025 • Apache Spark SQL

The insertInto trap in Apache Spark SQL

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. This is the case with the insertInto operation, which can even lead to data quality issues. Why? Let's try to understand in this short article.

May 10, 2024 • Apache Spark SQL

mapGroupsWithState and...batch?

That's one of my recent surprises. While I was exploring arbitrary stateful processing, including mapGroupsWithState, I mistakenly created a batch DataFrame and applied the mapping function on top of it. Turns out, it worked! Well, not really, but I'll let you discover why in this blog post.

April 16, 2023 • Apache Spark SQL

Spark SQL checkpoints

In my long - but not long enough! - journey with Apache Spark I've met the word "checkpointing" mostly in the context of Structured Streaming. But this term also applies to other modules, including Apache Spark SQL, and therefore to batch processing!

March 2, 2023 • Apache Spark SQL

Filtering rules accumulator

Data can have various quality issues, from missing to badly formatted values. However, there is another issue fewer people talk about: erroneous filtering logic.

November 19, 2022 • Apache Spark SQL

Generated method too long to be JIT compiled

There are days like that. You inherit some code and it doesn't really work as expected. While digging into the issues you find the usual weird warnings but also several new things. For me, one of these things was the "Generated method too long to be JIT compiled..." info message.

November 5, 2022 • Apache Spark SQL

Wildcard path and partitions

Let's suppose you store partitioned data under the /data/mydir location. What is the difference between reading this directory with Apache Spark as /data/mydir/ and as /data/mydir/*? You should find the answer just below.

September 24, 2022 • Apache Spark SQL

Observable metrics

Observability is a hot topic nowadays, not only in the data industry but also in the software industry. Apache Spark innovates a lot in this field, including new metrics for Structured Streaming and an important update added in the 3.0.0 release that I missed at the time: observable metrics.

September 17, 2022 • Apache Spark SQL

Predicate pushdown, why it doesn't work every time?

Pushdowns in Apache Spark are great for delegating some operations to the data sources. It's a great way to reduce the data volume to be processed in the job. However, there is one important gotcha. Watch out for the definition of your predicate because from time to time, even though the pushdown predicate is supported by the data source, the predicate can still be executed by the Apache Spark job!

July 23, 2022 • Apache Spark SQL

What's new in Apache Spark 3.3.0 - Data Source V2

After a break for the Data+AI Summit retrospective, it's time to return to Apache Spark 3.3.0 and see what changed for the DataSource V2 API.

June 30, 2022 • Apache Spark SQL

What's new in Apache Spark 3.3 - new functions

New Apache Spark SQL functions are a regular item in my "What's new in Apache Spark..." series. Let's see what has changed in the most recent (3.3.0) release!

June 27, 2022 • Apache Spark SQL

What's new in Apache Spark 3.3 - joins

Joins are probably the most popular operation for combining datasets and Apache Spark already supports multiple types of them! In the new release, the framework got 2 new features in this area, the storage-partitioned join and row-level runtime filters.
