The Local shuffle reader presented in one of the previous posts may have raised some doubts about how broadcast join works. If that's the case, this blog post should shed some light on it. If not, it will give you more in-depth details than the post introducing this type of join a few years ago.
Regressions are one of the risks of our profession. Fortunately, we can limit that risk with different testing strategies. One of them is regression testing, which we can use to check whether modified data processing logic introduced a regression, simply by comparing two datasets.
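As a rough illustration of that comparison (the paths and dataset layout below are hypothetical), such a test can boil down to diffing the outputs of the old and new pipeline versions with Dataset's exceptAll:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("regression-check").master("local[*]").getOrCreate()

// Outputs of the previous and the modified pipeline versions (hypothetical paths)
val before = spark.read.parquet("/tmp/pipeline_output_v1")
val after = spark.read.parquet("/tmp/pipeline_output_v2")

// Rows present on one side only reveal a regression; exceptAll (Spark 2.4+)
// keeps duplicates, so a changed row count on identical values is caught too
val missing = before.exceptAll(after)
val unexpected = after.exceptAll(before)

assert(missing.isEmpty && unexpected.isEmpty,
  s"Regression: ${missing.count()} missing and ${unexpected.count()} unexpected rows")
```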
After discovering two methods used to join DataFrames, broadcast and hashing, it's time to talk about the third possibility: sort-merge join.
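For a quick preview of the strategy (the data below is made up), disabling the auto-broadcast threshold pushes Spark toward sort-merge join for an equi-join on sortable keys:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sort-merge-join")
  .master("local[*]")
  // -1 disables auto-broadcast, so Spark falls back to sort-merge join
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
import spark.implicits._

val orders = Seq((1, 100.0), (2, 30.0)).toDF("user_id", "amount")
val users = Seq((1, "a@example.com"), (2, "b@example.com")).toDF("id", "email")

val joined = orders.join(users, orders("user_id") === users("id"))
// With broadcast disabled, the physical plan should contain a SortMergeJoin node
joined.explain()
```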
Shuffle consists of moving data with the same key to a single executor in order to execute some specific processing on it. We might think that it concerns only *ByKey operations, but that's not necessarily true.
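A small sketch with made-up data illustrates the point: distinct() contains no *ByKey call, yet its lineage shows a shuffle stage just like reduceByKey's:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-demo").master("local[*]").getOrCreate()
val letters = spark.sparkContext.parallelize(Seq("a", "b", "a", "c", "b"), numSlices = 3)

// A classic *ByKey shuffle...
val counted = letters.map(letter => (letter, 1)).reduceByKey(_ + _)
// ...but distinct() shuffles too, even though no *ByKey call appears
val unique = letters.distinct()

// Both lineages should include a shuffle (ShuffledRDD) stage
println(counted.toDebugString)
println(unique.toDebugString)
```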
Joining DataFrames can be a performance-sensitive task. After all, it involves matching data from two data sources and keeping the matched results in a single place. As you can deduce, the first thought goes to the shuffle join operation. However, it's not the only strategy implemented in Spark SQL. For some specific use cases, another type called broadcast join can be preferable.
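As a quick, illustrative sketch (the tables are invented), the broadcast() function hints Spark to ship the small side to every executor, so the physical plan should show a BroadcastHashJoin instead of a shuffle-based join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("broadcast-join").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq((1, "click"), (2, "view"), (1, "view")).toDF("user_id", "event")
val users = Seq((1, "premium"), (2, "free")).toDF("id", "plan")

// The hint ships the small side (users) to every executor, so the big side
// is joined locally, without being shuffled
val joined = events.join(broadcast(users), events("user_id") === users("id"))
// The physical plan should show a BroadcastHashJoin node
joined.explain()
```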
Spark SQL reflects as many concepts related to relational databases as possible. One of them is joins, which can be defined in one of 7 forms.
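For a concrete preview (the sample data is invented), the DataFrame API accepts each form as a join type string, with the cross join expressed separately since it takes no condition:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-types").master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "leftValue")
val right = Seq((2, "x"), (3, "y"), (4, "z")).toDF("id", "rightValue")

// 6 condition-based forms share the same API call
Seq("inner", "full_outer", "left_outer", "right_outer", "left_semi", "left_anti")
  .foreach { joinType =>
    println(s"=== $joinType ===")
    left.join(right, Seq("id"), joinType).show()
  }

// The 7th form, cross join, takes no condition: a pure cartesian product
println("=== cross ===")
left.crossJoin(right).show()
```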