Spark SQL operator optimizations - part 1

Pushdown predicate is one of the most popular optimizations in Spark SQL. But it's not the single one and their main list is defined in org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

This post lists and describes a part of these operator optimizations from org.apache.spark.sql.catalyst.optimizer.Optimizer. Because of a big number of them, it's only the first part. It contains operators from A to L. Moreover, its format is unusual. Unlike previous posts, this one is not divided in different sections. Instead, the described operators are presented in a list where each point contains some explanation and code proofs.

All tests defined in this and the second part use the following snippet:

private val sparkSession: SparkSession = SparkSession.builder().appName("Logical Plan Optimizer test").master("local")
  .getOrCreate()

import sparkSession.implicits._
private val Users = Seq(
  (1, "user_1")
).toDF("id", "login")

val NewUsers = Seq(
  (2, "user_2"), (3, "user_3")
).toDF("id", "login")


private val UserOrders = Seq(
  (2, "order_1"), (1, "order_2")
).toDF("user_id", "order_name")

private val AllOrdersPerUser = Seq(
  (1, Array(100, 200, 300)), (2, Array(100, 300, 100)), (3, Array(9, 8, 102))
).toDF("user_id", "orders_amounts")

override def afterAll() {
  sparkSession.stop()
}

The post described almost a half of logical plan operator optimizations. We could learn that some of them are propagated to subsequent parts of the query, the others are simply eliminated because of their triviality while remaining are reformatted in order to gain better performances. But it's only a subset. The remaining optimizations will be presented in the second post that will be published next week.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©