What's new in Apache Spark 3.3 - new functions

Versions: Apache Spark 3.3.0 https://github.com/bartosz25/spark-playground/tree/master/spark-3.3.0-features/sql_functions

New Apache Spark SQL functions are a regular topic in my "What's new in Apache Spark..." series. Let's see what has changed in the most recent (3.3.0) release!

Date time functions

For starters, let's see the changes for the date time functions. The release adds TIMESTAMPADD and TIMESTAMPDIFF for date-time arithmetic, along with the DATEADD and DATEDIFF variants you'll see in the test cases below.

Credits for the changes above go to Max Gekk.

Additionally, AngersZhuuuu added support for interval types in PERCENTILE_APPROX, PERCENTILE, and APPROX_COUNT_DISTINCT_FOR_INTERVALS; a sketch of this support follows the date arithmetic examples.

You can find some test cases below:

    sparkSession.sql(
      """
        |SELECT
        | TIMESTAMPADD(HOUR, 2, TIMESTAMP '2022-06-01 20:00:00') AS eight_pm_in_2_hours,
        | DATEADD(HOUR, 45, DATE '2022-06-01 10:00:00') AS date_in_2_days,
        | DATEADD(HOUR, 48, DATE '2022-06-01 10:00:00') AS date_in_3_days
        |""".stripMargin)
      .show(false)

    // Prints:
    /*
    +-------------------+-------------------+-------------------+
    |eight_pm_in_2_hours|date_in_2_days     |date_in_3_days     |
    +-------------------+-------------------+-------------------+
    |2022-06-01 22:00:00|2022-06-02 21:00:00|2022-06-03 00:00:00|
    +-------------------+-------------------+-------------------+
     */
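
Note how date_in_2_days lands at 21:00 rather than at midnight: the DATE literal keeps only the date part of '2022-06-01 10:00:00', so DATEADD starts counting from 2022-06-01 00:00:00. Adding 45 hours gives 2022-06-02 21:00:00, and adding 48 hours gives exactly 2022-06-03 00:00:00.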


    sparkSession.sql(
      """
        |SELECT
        | TIMESTAMPDIFF(MINUTE, TIMESTAMP '2022-06-01 20:00:00',
        |     TIMESTAMP '2022-06-01 20:40:00') AS tms_diff,
        | DATEDIFF(MINUTE, DATE '2022-06-01', DATE '2022-06-02') AS date_diff
        |""".stripMargin)
      .show(false)
    // Prints:
    /*
    +--------+---------+
    |tms_diff|date_diff|
    +--------+---------+
    |40      |1440     |
    +--------+---------+
     */
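
Speaking of the interval support in the percentile functions, here is a minimal sketch of my own (the sample intervals are an assumption, not taken from the release notes):

    sparkSession.sql(
      """
        |SELECT
        | PERCENTILE(duration, 0.5) AS median_duration
        |FROM VALUES (INTERVAL '1' MONTH), (INTERVAL '2' MONTH), (INTERVAL '3' MONTH) AS tab(duration)
        |""".stripMargin).show(false)
    // Should print the median interval, i.e. 2 months; the exact textual
    // rendering of the year-month interval may vary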

Error-handling functions

Moreover, Apache Spark 3.3.0 extended the list of error-handling functions that return null instead of failing the query. Among them you will find TRY_MULTIPLY and TRY_SUM, which join the TRY_ADD and TRY_DIVIDE functions from the previous release.

Credits for them go to Gengliang Wang and you will find their examples below:

    sparkSession.sql(
      s"""
         |SELECT
         |${Long.MaxValue} * 2 AS not_safe_multiply,
         |TRY_MULTIPLY(${Long.MaxValue}, 2) AS long_safe_multiply
         |""".stripMargin)
      .show(false)
    // Prints
    /*
    +-----------------+------------------+
    |not_safe_multiply|long_safe_multiply|
    +-----------------+------------------+
    |-2               |null              |
    +-----------------+------------------+
     */
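
The -2 in not_safe_multiply is the two's complement wrap-around: 9223372036854775807 * 2 doesn't fit in 64 bits and wraps to -2, while TRY_MULTIPLY detects the overflow and returns null instead.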

    sparkSession.sql(
      s"""
         |SELECT
         |TRY_SUM(data),
         |SUM(data)
         |FROM VALUES (${Long.MaxValue}), (${Long.MaxValue}) AS test_table(data)
         |""".stripMargin)
      .show(false)
    // Prints
    /*
    +-------------+---------+
    |try_sum(data)|sum(data)|
    +-------------+---------+
    |null         |-2       |
    +-------------+---------+
     */
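
TRY_MULTIPLY and TRY_SUM are not alone; the release also ships TRY_SUBTRACT and TRY_AVG with the same contract. Below is a hedged sketch of my own illustrating them (the overflowing inputs are assumptions):

    sparkSession.sql(
      s"""
         |SELECT
         |TRY_SUBTRACT(-${Long.MaxValue}, 2) AS long_safe_subtract,
         |TRY_AVG(data) AS interval_safe_avg
         |FROM VALUES (INTERVAL '2147483647' MONTH), (INTERVAL '1' MONTH) AS test_table(data)
         |""".stripMargin)
      .show(false)
    // Both columns should be null: the subtraction underflows the long range
    // and the intervals' sum overflows while computing the average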

Complex types

In the most recent release, Apache Spark also got two new functions to handle complex types. ARRAY_SIZE, added by Xinrong Meng, returns the length of an array. Another function, MAP_CONTAINS_KEY, implemented by Gengliang Wang, returns true if the map contains a specific key:

    sparkSession.sql(
      """
        |SELECT
        | ARRAY_SIZE(ARRAY(1, 2, 3, 4, 5)) AS array_length,
        | MAP_CONTAINS_KEY(MAP('a', 'AA', 'b', 'BB'), 'b') AS b_key_in_map
        |""".stripMargin).show(false)
    // Prints
    /*
    +------------+------------+
    |array_length|b_key_in_map|
    +------------+------------+
    |5           |true        |
    +------------+------------+
     */
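
Both functions behave predictably at the edges, as this small sketch of my own should show:

    sparkSession.sql(
      """
        |SELECT
        | ARRAY_SIZE(ARRAY()) AS empty_array_length,
        | MAP_CONTAINS_KEY(MAP('a', 'AA'), 'z') AS z_key_in_map
        |""".stripMargin).show(false)
    // Should print 0 for the empty array and false for the missing key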

String functions

Several interesting functions are also available for strings. Max Gekk added CONTAINS(), STARTSWITH(), ENDSWITH(), and ILIKE() to facilitate finding and matching substrings. The most mysterious of them is ILIKE, a case-insensitive LIKE, which you can see in action just below:

    sparkSession.sql(
      """
        |SELECT
        | CONTAINS('defabcxyz', 'abc') AS contains_abc,
        | STARTSWITH('abcdefxyz', 'abc') AS startswith_abc,
        | ENDSWITH('defxyzabc', 'abc') AS endswith_abc,
        | 'defxyzABC' LIKE '%abc' AS case_sensitive_like,
        | 'defxyzABC' ILIKE '%abc' AS case_insensitive_like
        |""".stripMargin).show(false)
    // Prints
    /*
    +------------+--------------+------------+-------------------+---------------------+
    |contains_abc|startswith_abc|endswith_abc|case_sensitive_like|case_insensitive_like|
    +------------+--------------+------------+-------------------+---------------------+
    |true        |true          |true        |false              |true                 |
    +------------+--------------+------------+-------------------+---------------------+
     */
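
Like most SQL functions, the new string helpers propagate nulls. A quick sketch of my own to verify that, along with a non-matching prefix:

    sparkSession.sql(
      """
        |SELECT
        | CONTAINS('defabcxyz', NULL) AS contains_null,
        | STARTSWITH('abcdefxyz', 'xyz') AS startswith_xyz
        |""".stripMargin).show(false)
    // Should print null for the null search string and false for the
    // non-matching prefix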

ANSI aggregations

Three new aggregation functions are also available thanks to Jiaan Geng's contribution.

In addition to this contribution, Jiaan Geng added functions for regression analysis. In "official" terms, it's a statistical method to model the relationship between an independent variable and a dependent variable with a linear regression equation. In simpler terms, these functions are helpful to find trends in the data. For example, REGR_AVGY calculates the average of the dependent variable as sum(Y)/N. There are also less straightforward functions in the group, like REGR_COUNT(X, Y), which returns the number of rows where neither X nor Y is null:

    sparkSession.sql(
      """
        |SELECT
        | REGR_COUNT(x, y),
        | REGR_AVGY(x, y)
        | FROM VALUES (10, NULL), (20, 200), (NULL, 300), (40, 400) AS tab(x, y)
        |""".stripMargin).show(false)
    // Prints
    /*
    +----------------+---------------+
    |regr_count(x, y)|regr_avgy(x, y)|
    +----------------+---------------+
    |2               |30.0           |
    +----------------+---------------+
     */
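
To read the output: only the (20, 200) and (40, 400) rows have both columns set, hence the count of 2. And since the dependent variable is the first argument, REGR_AVGY(x, y) averages x over these 2 rows: (20 + 40) / 2 = 30. The release also includes the symmetric REGR_AVGX for the independent variable; a short sketch of my own:

    sparkSession.sql(
      """
        |SELECT
        | REGR_AVGX(x, y)
        | FROM VALUES (10, NULL), (20, 200), (NULL, 300), (40, 400) AS tab(x, y)
        |""".stripMargin).show(false)
    // Should print 300.0, i.e. the average of y over the 2 fully defined
    // rows: (200 + 400) / 2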

Misc

Let me finish with the other function changes.

Thanks to this recent effort, Apache Spark extends its support for SQL-based processing and its compatibility with the SQL standard. Next week we'll see what changed in the DataSource V2 API!

