Spark SQL operator optimizations - part 1

Versions: Spark 2.2.0

Predicate pushdown is one of the most popular optimizations in Spark SQL. But it's not the only one: the main list of optimizations is defined in the org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.
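
To see what such an optimization looks like, here is a minimal, self-contained sketch (the object name and dataset are illustrative, not taken from the post's test suite). It prints the parsed, analyzed and optimized logical plans plus the physical plan with explain(true); in the optimized logical plan the filter, although written after the projection, should appear pushed closer to the relation:

import org.apache.spark.sql.SparkSession

object PushdownPreview {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().appName("Pushdown preview").master("local")
      .getOrCreate()
    import session.implicits._

    val users = Seq((1, "user_1"), (2, "user_2")).toDF("id", "login")

    // The filter is written after the projection but the optimizer
    // moves it below the projection in the optimized logical plan
    val query = users.select("id", "login").where($"id" > 1)
    // Prints the parsed, analyzed and optimized logical plans plus the physical plan
    query.explain(true)

    session.stop()
  }
}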

This post lists and describes a part of these operator optimizations from org.apache.spark.sql.catalyst.optimizer.Optimizer. Because there are so many of them, this is only the first part, covering the operators from A to L. Its format is also unusual: unlike previous posts, this one is not divided into sections. Instead, the described operators are presented as a list where each point contains an explanation and code proofs.

All the tests defined in this part and in the second one use the following setup snippet:

import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll

// Shared fields of the test suite (a ScalaTest suite mixing in
// BeforeAndAfterAll, as the afterAll override below suggests)
private val sparkSession: SparkSession = SparkSession.builder().appName("Logical Plan Optimizer test").master("local")
  .getOrCreate()

import sparkSession.implicits._
private val Users = Seq(
  (1, "user_1")
).toDF("id", "login")

private val NewUsers = Seq(
  (2, "user_2"), (3, "user_3")
).toDF("id", "login")

private val UserOrders = Seq(
  (2, "order_1"), (1, "order_2")
).toDF("user_id", "order_name")

private val AllOrdersPerUser = Seq(
  (1, Array(100, 200, 300)), (2, Array(100, 300, 100)), (3, Array(9, 8, 102))
).toDF("user_id", "orders_amounts")

override def afterAll() {
  sparkSession.stop()
}
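
As an illustration of how these datasets are exercised, a test could check that two consecutive filters on Users are merged into a single Filter node by the CombineFilters rule. The example below is a hypothetical sketch in FlatSpec style, not copied from the post's test suite:

"two consecutive filters" should "be merged into a single Filter node" in {
  val filteredUsers = Users.filter("id > 0").filter("id < 100")

  val optimizedPlan = filteredUsers.queryExecution.optimizedPlan.toString()

  // after the CombineFilters rule both predicates end up in one Filter node,
  // so the word "Filter" should occur only once in the optimized plan
  assert("Filter".r.findAllIn(optimizedPlan).size == 1)
}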

This post described almost half of the logical plan operator optimizations. We could learn that some of them are propagated to subsequent parts of the query, others are simply eliminated because of their triviality, while the remaining ones are rewritten in order to get better performance. But it's only a subset. The remaining optimizations will be presented in the second post, published next week.

