Drop is a...select

Versions: Apache Spark 3.0.0

Have you ever wondered what is the relationship between drop and select operations in Apache Spark SQL? If not, I will shed some light on them in this short blog post.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

Let's suppose that I want to reduce the size of the dataset and select only a subset of columns. My initial input looks that way:

  private val testSparkSession = SparkSession.builder()
    .appName("Select vs Drop").master("local[*]")
    .getOrCreate()
  import testSparkSession.implicits._

  val dataset = Seq(
    (0, "a", "A"), (1, "b", "B"), (2, "c", "C")
  ).toDF("nr", "lower_letter", "upper_letter")

To retrieve only lower_letter and upper_letter columns, we can select them explicitly:

dataset.select("lower_letter", "upper_letter").show()

But it's not the single approach. Another one uses drop(colName: String) function that removes the columns from the dataset:

dataset.drop("nr").show()

An interesting thing to notice is how the drop operation works under-the-hood. In fact, it's a select statement that is executed on the columns not matching the columns from the dropped list:

  def drop(colNames: String*): DataFrame = {
    val resolver = sparkSession.sessionState.analyzer.resolver
    val allColumns = queryExecution.analyzed.output
    val remainingCols = allColumns.filter { attribute =>
      colNames.forall(n => !resolver(attribute.name, n))
    }.map(attribute => Column(attribute))
    if (remainingCols.size == allColumns.size) {
      toDF()
    } else {
      this.select(remainingCols: _*)
    }
  }

It's even more visible when you compare the logical plans. The first one is for the select-based reduction and the second one for the drop-based one:

# select("lower_letter", "upper_letter")
== Analyzed Logical Plan ==
lower_letter: string, upper_letter: string
Project [lower_letter#11, upper_letter#12]
+- Project [_1#3 AS nr#10, _2#4 AS lower_letter#11, _3#5 AS upper_letter#12]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [lower_letter#11, upper_letter#12]

# drop("nr")
== Analyzed Logical Plan ==
lower_letter: string, upper_letter: string
Project [lower_letter#11, upper_letter#12]
+- Project [_1#3 AS nr#10, _2#4 AS lower_letter#11, _3#5 AS upper_letter#12]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [lower_letter#11, upper_letter#12]

And a short video to prove you that I'm not liying ;-):

Drop is then a select :) No need to add it then if you explicitly select some columns and want to remove the not used ones.


If you liked it, you should read:

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!