Drop is a...select

Versions: Apache Spark 3.0.0

Have you ever wondered what is the relationship between drop and select operations in Apache Spark SQL? If not, I will shed some light on them in this short blog post.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in πŸ‘‰ Early Release on the O'Reilly platform

I also help solve your data engineering problems πŸ‘‰ contact@waitingforcode.com πŸ“©

Let's suppose that I want to reduce the size of the dataset and select only a subset of columns. My initial input looks that way:

  private val testSparkSession = SparkSession.builder()
    .appName("Select vs Drop").master("local[*]")
    .getOrCreate()
  import testSparkSession.implicits._

  val dataset = Seq(
    (0, "a", "A"), (1, "b", "B"), (2, "c", "C")
  ).toDF("nr", "lower_letter", "upper_letter")

To retrieve only lower_letter and upper_letter columns, we can select them explicitly:

dataset.select("lower_letter", "upper_letter").show()

But it's not the single approach. Another one uses drop(colName: String) function that removes the columns from the dataset:

dataset.drop("nr").show()

An interesting thing to notice is how the drop operation works under-the-hood. In fact, it's a select statement that is executed on the columns not matching the columns from the dropped list:

  def drop(colNames: String*): DataFrame = {
    val resolver = sparkSession.sessionState.analyzer.resolver
    val allColumns = queryExecution.analyzed.output
    val remainingCols = allColumns.filter { attribute =>
      colNames.forall(n => !resolver(attribute.name, n))
    }.map(attribute => Column(attribute))
    if (remainingCols.size == allColumns.size) {
      toDF()
    } else {
      this.select(remainingCols: _*)
    }
  }

It's even more visible when you compare the logical plans. The first one is for the select-based reduction and the second one for the drop-based one:

# select("lower_letter", "upper_letter")
== Analyzed Logical Plan ==
lower_letter: string, upper_letter: string
Project [lower_letter#11, upper_letter#12]
+- Project [_1#3 AS nr#10, _2#4 AS lower_letter#11, _3#5 AS upper_letter#12]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [lower_letter#11, upper_letter#12]

# drop("nr")
== Analyzed Logical Plan ==
lower_letter: string, upper_letter: string
Project [lower_letter#11, upper_letter#12]
+- Project [_1#3 AS nr#10, _2#4 AS lower_letter#11, _3#5 AS upper_letter#12]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [lower_letter#11, upper_letter#12]

And a short video to prove you that I'm not liying ;-):

Drop is then a select :) No need to add it then if you explicitly select some columns and want to remove the not used ones.


If you liked it, you should read:

πŸ“š Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!