Spark SQL joins articles

Broadcast join - complementary notes for local shuffle reader

The Local shuffle reader presented in one of the previous posts might have introduced some doubt in the way how the broadcast join is working. If it's the case, this blog post should shed some light on it. If not, it can give you more in-depth details than the ones introducing this type of join a few years ago.

Continue Reading β†’

Regression tests with Apache Spark SQL joins

Regressions are one of the risks of our profession. Fortunately, we can limit the risk thanks to different testing strategies. One of them are regression tests that we can use to check whether the modified data processing logic didn't introduce the regressions simply by comparing two datasets.

Continue Reading β†’

Sort-merge join in Spark SQL

After discovering two methods used to join DataFrames, broadcast and hashing, it's time to talk about the third possibility - sort-merge join.

Continue Reading β†’

Shuffle join in Spark SQL

Shuffle consists on moving data with the same key to the one executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations but it's not necessarily true.

Continue Reading β†’

Broadcast join in Spark SQL

Joining DataFrames can be a performance-sensitive task. After all, it involves matching data from two data sources and keeping matched results in a single place. As you can deduce, the first thinking goes towards shuffle join operation. However, it's not the single strategy implemented in Spark SQL. For some specific use cases another type called broadcast join can be preferred.

Continue Reading β†’

Join types in Spark SQL

Spark SQL reflects the most of concepts related to relational databases as possible. One of them are joins that can be defined in one of 7 forms.

Continue Reading β†’