Spark SQL operator optimizations - part 1

Versions: Spark 2.2.0

Predicate pushdown is one of the most popular optimizations in Spark SQL. But it's not the only one: the main list of optimizations is defined in the org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.
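Before diving into the list, here is a quick, illustrative sketch (not taken from the original tests) of how such an optimization can be observed. The plans are exposed through queryExecution, so for predicate pushdown it's enough to compare the analyzed and the optimized logical plans of a query written with the filter after the projection; the exact plan shape may vary with the Spark version:

import org.apache.spark.sql.SparkSession

object PredicatePushdownExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Predicate pushdown example").master("local")
      .getOrCreate()
    import spark.implicits._

    val users = Seq((1, "user_1"), (2, "user_2")).toDF("id", "login")

    // the filter is written after the projection...
    val query = users.select($"id", $"login".as("user_login")).filter($"id" > 1)

    // ...but the optimizer pushes it below the Project node, closer to the data
    println(query.queryExecution.analyzed)       // Filter above Project
    println(query.queryExecution.optimizedPlan)  // Filter below Project

    spark.stop()
  }
}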


This post lists and describes a part of the operator optimizations defined in org.apache.spark.sql.catalyst.optimizer.Optimizer. Because there are so many of them, this is only the first part, covering operators from A to L. Moreover, its format is unusual: unlike previous posts, this one is not divided into sections. Instead, the described optimizations are presented as a list where each point contains an explanation and code proofs.

All tests defined in this and the second part use the following snippet:

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}

// The class name and the ScalaTest traits are added here only to make the
// snippet compile on its own; the fields and the afterAll hook come from the tests.
class LogicalPlanOptimizerTest extends FlatSpec with Matchers with BeforeAndAfterAll {

  private val sparkSession: SparkSession = SparkSession.builder()
    .appName("Logical Plan Optimizer test").master("local")
    .getOrCreate()

  import sparkSession.implicits._

  // a single already existing user
  private val Users = Seq(
    (1, "user_1")
  ).toDF("id", "login")

  // two additional users
  private val NewUsers = Seq(
    (2, "user_2"), (3, "user_3")
  ).toDF("id", "login")

  // orders associated with the users through user_id
  private val UserOrders = Seq(
    (2, "order_1"), (1, "order_2")
  ).toDF("user_id", "order_name")

  // arrays of order amounts per user
  private val AllOrdersPerUser = Seq(
    (1, Array(100, 200, 300)), (2, Array(100, 300, 100)), (3, Array(9, 8, 102))
  ).toDF("user_id", "orders_amounts")

  override def afterAll(): Unit = {
    sparkSession.stop()
  }

}
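As an example of the "code proofs" mentioned earlier, a test placed inside the suite above can build a query on these DataFrames and assert on the optimized logical plan. The test below is only an illustrative sketch and not one of the original tests; it checks that two consecutive filters end up combined into a single Filter node by the CombineFilters rule:

"two consecutive filters" should "be merged into a single Filter node" in {
  val query = UserOrders.filter("user_id > 0").filter("order_name != 'order_1'")

  val optimizedPlan = query.queryExecution.optimizedPlan.toString()

  // CombineFilters joins both predicates with an AND, so a single Filter
  // node should remain in the optimized logical plan
  optimizedPlan.sliding("Filter".length).count(_ == "Filter") shouldEqual 1
}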

This post described almost half of the logical plan operator optimizations. We could learn that some of them propagate values to subsequent parts of the query, others are simply eliminated because of their triviality, while the remaining ones are rewritten to gain better performance. But it's only a subset. The remaining optimizations will be presented in the second post, to be published next week.
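To give a concrete taste of the "eliminated because of their triviality" category, the following sketch (illustrative, not one of the described tests) filters on a constant true condition; after optimization the Filter node disappears entirely, which can be verified the same way as above:

import org.apache.spark.sql.functions.lit

// a condition that is always true brings nothing, so the optimizer
// (the PruneFilters rule) removes the whole Filter node from the plan
val trivialFilter = Users.filter(lit(true))

println(trivialFilter.queryExecution.optimizedPlan)
// expected: only the relation remains, without any Filter node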
