Apache Spark SQL articles on waitingforcode.com

Spark SQL Cost-Based Optimizer

Before the Spark 2.2.0 release, data processing was driven by a set of heuristic rules that ignored the characteristics of the data. Release 2.2.0 brought a tool well known from the RDBMS world: a cost-based optimizer.

Shuffle join in Spark SQL

A shuffle consists of moving data with the same key to the same executor so that some specific processing can be applied to it. One might think that it concerns only *ByKey operations, but that is not necessarily true.

Broadcast join in Spark SQL

Joining DataFrames can be a performance-sensitive task. After all, it involves matching data from two data sources and keeping the matched results in a single place. Naturally, the first strategy that comes to mind is the shuffle join. However, it is not the only strategy implemented in Spark SQL. For some specific use cases another type, called broadcast join, can be preferable.

Loading data from RDBMS

Structured data processing plays an increasingly important role in the Apache Spark project, and Structured Streaming is one proof of that. But how does Spark SQL work and, in particular, how does it load data from structured data sources such as an RDBMS?