After date time management, it's time to see another important feature of Apache Spark 3.0: the new SQL functions.
ALTER DATABASE ... SET LOCATION
That's not really a new SQL function but rather a Hive one. It's a metadata operation that sets a new location for a given database. If it's not a SQL function, why am I mentioning it here? When I first saw this command, I thought it could be a good way to implement an aliasing strategy for Big Data immutability (learn more in Big Data immutability approaches - aliasing), but I was wrong.
This command only changes the location for new tables and partitions; the existing ones aren't modified at all. If you thought like me, I hope this explanation helps you keep that in mind.
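To make it concrete, here is a minimal sketch of the command, assuming a Hive-backed metastore; the database name and the paths are hypothetical placeholders, not taken from a real project:

// test_db and both warehouse paths are hypothetical placeholders
sparkSession.sql("CREATE DATABASE IF NOT EXISTS test_db LOCATION '/warehouse/old_path'")
// changes the default location only for tables and partitions created from now on;
// tables created before this statement keep their original location
sparkSession.sql("ALTER DATABASE test_db SET LOCATION '/warehouse/new_path'")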
bit_and, bit_or, bit_xor
BIT_AND, BIT_OR and BIT_XOR are 3 new functions dealing with bitwise integer aggregates. You can see them in action in the following snippet:
"bitwise functions" should "apply to the column values" in { Seq( (1), (2), (3) ).toDF("number").createTempView("bitwise_test") val functionsString = sparkSession .sql(""" |SELECT |CONCAT( | BIT_AND(number), ",", BIT_OR(number), ",", BIT_XOR(number) |) AS bitwise_concat |FROM bitwise_test""".stripMargin) .map(row => row.getAs[String]("bitwise_concat")) .collect() functionsString should have size 1 functionsString(0) shouldEqual ("0,3,0") } "bitwise functions" should "apply to the column values" in { Seq( (1), (2), (3) ).toDF("number").createTempView("bitwise_test") val functionsString = sparkSession .sql(""" |SELECT |CONCAT( | BIT_AND(number), ",", BIT_OR(number), ",", BIT_XOR(number) |) AS bitwise_concat |FROM bitwise_test""".stripMargin) .map(row => row.getAs[String]("bitwise_concat")) .collect() functionsString should have size 1 functionsString(0) shouldEqual ("0,3,0") }
date_part
From my previous article you learned that Apache Spark 3.0 brings a lot of changes in datetime management. Part of these changes are new temporal functions, and DATE_PART is one of them. As its name indicates, you can use it to retrieve a field from a timestamp or interval column. In the following snippet I'm using it to extract the day:
"date_part" should "extract day from a timestamp column" in { Seq( (new Timestamp(0L)), (new Timestamp(TimeUnit.DAYS.toMillis(2L))) ).toDF("dt").createTempView("date_part_test") val extractedDays = sparkSession .sql("SELECT DATE_PART('day', dt) AS extracted_day FROM date_part_test") .map(row => row.getAs[Integer]("extracted_day")) .collect() extractedDays should have size 2 extractedDays should contain allOf(1, 3) }
Hyperbolic functions
To be honest, I didn't know this name when I saw these functions in the list. After some quick research I learned that hyperbolic functions are like trigonometric functions, but defined on a hyperbola instead of a circle. They can be used, for instance, to describe the shape of the curve formed by a high-voltage line suspended between two towers. The following snippet compares the circular and hyperbolic versions:
"hyperbolic functions" should "return different results than not hyperbolic ones" in { Seq( (1.3) ).toDF("number").createTempView("hyperbolic_funcs_test") val functionsString = sparkSession .sql(""" |SELECT |CONCAT( | acosh(number), ",", acos(number), ",", | asinh(number), ",", asin(number), ",", | atanh(number), ",", atan(number) |) AS funcs_concat |FROM hyperbolic_funcs_test""".stripMargin) .map(row => row.getAs[String]("funcs_concat")) .collect() functionsString should have size 1 functionsString(0) shouldEqual "0.7564329108569596,NaN,1.078451058954897,NaN,NaN,0.9151007005533605" }
make_date
As the name indicates, you can create a date-typed field from a year, month and day:
"make_date" should "create a date from 3 separated fields" in { Seq( (2020, 4, 1), (2019, 1, 20), (2020, 2, 31) // error on purpose, let's see what happens ).toDF("y", "m", "d").createTempView("make_date_test") val mappedDates = sparkSession .sql("SELECT CAST(make_date(y, m, d) AS STRING) AS date_from_ymd FROM make_date_test") .map(row => row.getAs[String]("date_from_ymd")) .collect() mappedDates should have size 3 mappedDates should contain allOf(null, "2020-04-01", "2019-01-20") }
make_timestamp
Apart from the date, you can also create a timestamp-typed field:
"make_timestamp" should "create a timestamp from 4 separated fields" in { Seq( (2020, 4, 1, 10, 20), (2019, 1, 20, 15, 59), (2020, 2, 2, 25, 10) // error on purpose, let's see what happens ).toDF("y", "m", "d", "h", "min").createTempView("make_timestamp_test") val mappedTs = sparkSession .sql("SELECT CAST(make_timestamp(y, m, d, h, min, 0, 'UTC') AS STRING) " + "AS ts_from_ymd FROM make_timestamp_test") .map(row => row.getAs[String]("ts_from_ymd")) .collect() mappedTs should have size 3 mappedTs should contain allOf(null, "2020-04-01 12:20:00", "2019-01-20 16:59:00") }
max_by, min_by
These 2 functions are quite interesting since they came to the 3.0 release from Presto. MAX_BY(field1, field2) returns the value of field1 associated with the maximum value of field2 over all input rows. MIN_BY does the opposite. You can see that in the following test case:
"min_by and max_by" should "retrieve a column for min and max values" in { Seq( (2020, 4, 1, "year 2020"), (2019, 1, 20, "year 2019"), (2018, 2, 2, "year 2018") ).toDF("y", "m", "d", "label").createTempView("min_by_max_by_test") val minAndMaxLabels = sparkSession .sql("SELECT min_by(label, y) AS minLabel, max_by(label, y) AS maxLabel FROM min_by_max_by_test") .map(row => (row.getAs[String]("minLabel"), row.getAs[String]("maxLabel"))) .collect() minAndMaxLabels should have size 1 minAndMaxLabels(0) shouldEqual ("year 2018", "year 2020") }
overlay
The next added function, OVERLAY, replaces a substring with another one:
"overlay" should "replace the first 3 letters by xxx" in { Seq( ("abcdef"), ("gh"), ("ijk"), ("lmn") ).toDF("label").createTempView("overlay_test") val minAndMaxLabels = sparkSession .sql("SELECT OVERLAY(label PLACING 'xxx' FROM 1) AS formattedLabel FROM overlay_test") .map(row => row.getAs[String]("formattedLabel")) .collect() minAndMaxLabels should have size 4 // As you can see, OVERLAY adds new characters for the columns shorter than 3 characters minAndMaxLabels should contain allElementsOf (Seq("xxxdef", "xxx", "xxx", "xxx")) }
New aliases
Apart from genuinely new functions, Apache Spark 3.0 also brings a few new aliases for already existing ones. Thanks to them you can use BOOL_AND instead of EVERY, BOOL_OR or SOME instead of ANY, and RANDOM instead of RAND:
"new aliases" should "do the same thing like existing methods" in { Seq( (true, false), (true, true), (true, false) ).toDF("bool1", "bool2").createTempView("aliases_test") val boolAliases = sparkSession .sql( """ |SELECT |BOOL_AND(bool1) AS bool_and_version, EVERY(bool1) AS every_version, |BOOL_OR(bool2) AS bool_or_version, ANY(bool2) AS any_version |FROM aliases_test""".stripMargin) .map(row => (row.getAs[Boolean]("bool_and_version"), row.getAs[Boolean]("every_version"), row.getAs[Boolean]("bool_or_version"), row.getAs[Boolean]("any_version"))) .collect() boolAliases should have size 1 boolAliases(0) shouldEqual (true, true, true, true) } "new aliases for random" should "generate different results every time" in { Seq( (true, false), (true, true), (true, false) ).toDF("bool1", "bool2").createTempView("random_aliases_test") val randomAliases = sparkSession .sql( """ |SELECT RANDOM() AS random_v, RAND() AS rand_v |FROM random_aliases_test""".stripMargin) .map(row => (row.getAs[Double]("random_v"), row.getAs[Double]("rand_v"))) .collect() randomAliases should have size 3 }
Update February 5, 2021: I missed this change, and thank you, German, very much for sharing it with me! Starting from Spark 3 you can also pass a custom comparator function to array_sort.
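Here is a small sketch of what that could look like, following the pattern of the other snippets; the test name and the descending-order comparator are my own illustration:

"array_sort" should "accept a custom comparator function" in {
  // the comparator returns a negative, zero or positive integer;
  // this one inverts the natural ordering, so the array is sorted in descending order
  val sortedNumbers = sparkSession
    .sql("""
      |SELECT array_sort(
      |  array(2, 1, 3),
      |  (l, r) -> CASE WHEN l < r THEN 1 WHEN l > r THEN -1 ELSE 0 END
      |) AS sorted_numbers""".stripMargin)
    .map(row => row.getAs[Seq[Int]]("sorted_numbers"))
    .collect()

  sortedNumbers should have size 1
  sortedNumbers(0) shouldEqual Seq(3, 2, 1)
}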
Reverted changes
When I was writing this blog post, I was so happy to discover the new functions that I almost forgot that some of them were reverted. The first of them, MAKE_INTERVAL, was initially added but finally removed because it was considered too PostgreSQL-specific. Another example, JUSTIFY_DAYS, JUSTIFY_HOURS and JUSTIFY_INTERVAL, were removed after being merged to master, for the same reason.
Maybe the functions added in this release are less revolutionary than the higher-order functions introduced in 2.4, but this appearance can be misleading. After all, they make it possible to cover new use cases, especially if you need to migrate your processing code from a standard SQL-based solution to a more distributed environment on top of Apache Spark SQL.