What's new in Apache Spark 3.3 - new functions

Versions: Apache Spark 3.3.0 https://github.com/bartosz25/spark-playground/tree/master/spark-3.3.0-features/sql_functions

New Apache Spark SQL functions are a regular topic in my "What's new in Apache Spark..." series. Let's see what has changed in the most recent (3.3.0) release!


Date time functions

For starters, let's see the changes for the date time functions. The release adds the TIMESTAMPADD and TIMESTAMPDIFF functions, together with the DATEADD and DATEDIFF variants taking a time unit, all demonstrated in the test cases below.

Credits for the changes above go to Max Gekk.

Additionally, AngersZhuuuu added support for interval types in PERCENTILE_APPROX, PERCENTILE, and APPROX_COUNT_DISTINCT_FOR_INTERVALS.
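Here is a minimal sketch of that interval support, assuming the same syntax as for numeric columns; the dataset and the median alias are mine, not taken from the original contribution:

    sparkSession.sql(
      """
        |SELECT
        | PERCENTILE(processing_lag, 0.5) AS median_lag
        |FROM VALUES
        | (INTERVAL '10' MINUTE), (INTERVAL '20' MINUTE), (INTERVAL '30' MINUTE) AS tab(processing_lag)
        |""".stripMargin).show(false)
    // Should print the middle value, i.e. an interval of 20 minutes,
    // rendered as a DAY TO SECOND interval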

Going back to the date time functions, you can find some test cases below:

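    // Note: the DATE literal below keeps only the date part, so both
    // DATEADD calls start from 2022-06-01 00:00:00, as the results show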
    sparkSession.sql(
      """
        |SELECT
        | TIMESTAMPADD(HOUR, 2, TIMESTAMP '2022-06-01 20:00:00') AS eight_pm_in_2_hours,
        | DATEADD(HOUR, 45, DATE '2022-06-01 10:00:00') AS date_in_2_days,
        | DATEADD(HOUR, 48, DATE '2022-06-01 10:00:00') AS date_in_3_days
        |""".stripMargin)
      .show(false)

    // Prints:
    /*
    +-------------------+-------------------+-------------------+
    |eight_pm_in_2_hours|date_in_2_days     |date_in_3_days     |
    +-------------------+-------------------+-------------------+
    |2022-06-01 22:00:00|2022-06-02 21:00:00|2022-06-03 00:00:00|
    +-------------------+-------------------+-------------------+
     */


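    // The three-argument DATEDIFF behaves like TIMESTAMPDIFF:
    // one day apart is 1440 minutes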
    sparkSession.sql(
      """
        |SELECT
        | TIMESTAMPDIFF(MINUTE, TIMESTAMP '2022-06-01 20:00:00',
        |     TIMESTAMP '2022-06-01 20:40:00') AS tms_diff,
        | DATEDIFF(MINUTE, DATE '2022-06-01', DATE '2022-06-02') AS date_diff
        |""".stripMargin)
      .show(false)
    // Prints:
    /*
    +--------+---------+
    |tms_diff|date_diff|
    +--------+---------+
    |40      |1440     |
    +--------+---------+
     */

Error-handling functions

Moreover, Apache Spark 3.3.0 extended the list of error-handling functions that return null in case of an error instead of failing the query. The new functions are TRY_SUBTRACT, TRY_MULTIPLY, TRY_SUM, and TRY_AVG, complementing the TRY_ADD and TRY_DIVIDE from the previous release.

Credits for them go to Gengliang Wang. You will find examples below:

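    // Without ANSI mode, the raw multiplication silently overflows and wraps
    // around to -2; TRY_MULTIPLY detects the overflow and returns null instead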
    sparkSession.sql(
      s"""
         |SELECT
         |${Long.MaxValue} * 2 AS not_safe_multiply,
         |TRY_MULTIPLY(${Long.MaxValue}, 2) AS long_safe_multiply
         |""".stripMargin)
      .show(false)
    // Prints
    /*
    +-----------------+------------------+
    |not_safe_multiply|long_safe_multiply|
    +-----------------+------------------+
    |-2               |null              |
    +-----------------+------------------+
     */

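    // The same applies to aggregations: the SUM of two Long.MaxValue values
    // wraps around to -2, while TRY_SUM returns null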
    sparkSession.sql(
      s"""
         |SELECT
         |TRY_SUM(data),
         |SUM(data)
         |FROM VALUES (${Long.MaxValue}), (${Long.MaxValue}) AS test_table(data)
         |""".stripMargin)
      .show(false)
    // Prints
    /*
    +-------------+---------+
    |try_sum(data)|sum(data)|
    +-------------+---------+
    |null         |-2       |
    +-------------+---------+
     */

Complex types

In the most recent release, Apache Spark also got two new functions to handle complex types. The ARRAY_SIZE, added by Xinrong Meng, returns the length of an array. Another function, the MAP_CONTAINS_KEY implemented by Gengliang Wang, returns true if the map contains a specific key:

    sparkSession.sql(
      """
        |SELECT
        | ARRAY_SIZE(ARRAY(1, 2, 3, 4, 5)) AS array_length,
        | MAP_CONTAINS_KEY(MAP('a', 'AA', 'b', 'BB'), 'b') AS b_key_in_map
        |""".stripMargin).show(false)
    // Prints
    /*
    +------------+------------+
    |array_length|b_key_in_map|
    +------------+------------+
    |5           |true        |
    +------------+------------+
     */

String functions

Several interesting functions are also available for strings. Max Gekk added CONTAINS(), STARTSWITH(), ENDSWITH(), and ILIKE() to facilitate finding and matching substrings. The most mysterious of them is ILIKE, a case-insensitive LIKE that you can see in action just below:

    sparkSession.sql(
      """
        |SELECT
        | CONTAINS('defabcxyz', 'abc') AS contains_abc,
        | STARTSWITH('abcdefxyz', 'abc') AS startswith_abc,
        | ENDSWITH('defxyzabc', 'abc') AS endswith_abc,
        | 'defxyzABC' LIKE '%abc' AS case_sensitive_like,
        | 'defxyzABC' ILIKE '%abc' AS case_insensitive_like
        |""".stripMargin).show(false)
    // Prints
    /*
    +------------+--------------+------------+-------------------+---------------------+
    |contains_abc|startswith_abc|endswith_abc|case_sensitive_like|case_insensitive_like|
    +------------+--------------+------------+-------------------+---------------------+
    |true        |true          |true        |false              |true                 |
    +------------+--------------+------------+-------------------+---------------------+
     */

ANSI aggregations

Three new ANSI aggregation functions are also available thanks to Jiaan Geng's contribution.
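As an illustration, here is a sketch of one of these ANSI aggregates; I'm assuming ARRAY_AGG, the standard-compliant alias for COLLECT_LIST added in 3.3.0, is part of this group:

    sparkSession.sql(
      """
        |SELECT
        | ARRAY_AGG(letter) AS letters
        |FROM VALUES ('a'), ('b'), ('a') AS tab(letter)
        |""".stripMargin).show(false)
    // Expected to print [a, b, a] since ARRAY_AGG collects
    // all the values of the group, duplicates included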

In addition to this contribution, Jiaan Geng added functions for regression analysis. In "official" terms, it's a statistical method to calculate the relationship between an independent variable and a dependent variable with a linear regression equation. In simpler terms, these functions are helpful for finding trends in the data. For example, REGR_AVGY calculates the average of the dependent variable as sum(Y)/N. There are also less simple functions in the group, like REGR_COUNT(X, Y), which returns the number of rows where neither X nor Y is null:

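    // Only the (20, 200) and (40, 400) rows have both columns non-null, hence
    // the count of 2; REGR_AVGY averages its first argument over those rows:
    // (20 + 40) / 2 = 30.0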
    sparkSession.sql(
      """
        |SELECT
        | REGR_COUNT(x, y),
        | REGR_AVGY(x, y)
        | FROM VALUES (10, NULL), (20, 200), (NULL, 300), (40, 400) AS tab(x, y)
        |""".stripMargin).show(false)
    // Prints
    /*
    +----------------+---------------+
    |regr_count(x, y)|regr_avgy(x, y)|
    +----------------+---------------+
    |2               |30.0           |
    +----------------+---------------+
     */

Misc

Let me finish with the remaining function changes.

Thanks to this recent effort, Apache Spark extended its support for SQL-based processing and its compatibility with the SQL standard. Next week we'll see what changed in the DataSource V2 API!
