Spark SQL joins articles

4-day workshop Β· In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that β€” unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants Β· production-ready templates
See the full curriculum β†’ €7,000 flat fee Β· cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

Join types in Spark SQL

Spark SQL reflects the most of concepts related to relational databases as possible. One of them are joins that can be defined in one of 7 forms.

Continue Reading β†’

Broadcast join in Spark SQL

Joining DataFrames can be a performance-sensitive task. After all, it involves matching data from two data sources and keeping matched results in a single place. As you can deduce, the first thinking goes towards shuffle join operation. However, it's not the single strategy implemented in Spark SQL. For some specific use cases another type called broadcast join can be preferred.

Continue Reading β†’

Shuffle join in Spark SQL

Shuffle consists on moving data with the same key to the one executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations but it's not necessarily true.

Continue Reading β†’

Sort-merge join in Spark SQL

After discovering two methods used to join DataFrames, broadcast and hashing, it's time to talk about the third possibility - sort-merge join.

Continue Reading β†’

Regression tests with Apache Spark SQL joins

Regressions are one of the risks of our profession. Fortunately, we can limit the risk thanks to different testing strategies. One of them are regression tests that we can use to check whether the modified data processing logic didn't introduce the regressions simply by comparing two datasets.

Continue Reading β†’

Broadcast join - complementary notes for local shuffle reader

The Local shuffle reader presented in one of the previous posts might have introduced some doubt in the way how the broadcast join is working. If it's the case, this blog post should shed some light on it. If not, it can give you more in-depth details than the ones introducing this type of join a few years ago.

Continue Reading β†’

ASOF Join in Apache Spark SQL

Even though eight years have passed since my blog post about various join types in Apache Spark SQL, I'm still learning something new about this apparently simple data operation which is the join. Recently in "Building Machine Learning Systems with a Feature Store: Batch, Real-Time, and LLM Systems" by Jim Dowling I read about ASOF Joins and decided to dedicate some space for them in the blog.

Continue Reading β†’