What's new in Apache Spark 3.2.0 - miscellaneous changes

Versions: Apache Spark 3.2.0 https://github.com/bartosz25/spark-...3/spark-3.2.0-features/misc_changes

My Apache Spark 3.2.0 exploration comes to its end. Today I'll focus on the miscellaneous changes, i.e. all the improvements I couldn't categorize in the previous blog posts.

Exception messages

If you wrote an Apache Spark 3.2.0 pipeline and it failed, you probably saw the first change, the new error messages. Allison Wang and Xinyi Yu proposed a SPIP where they listed the current pain points of the exceptions in Apache Spark, and submitted an improvement strategy. Among the pain points you will mostly find:

To overcome these problems, they proposed to group the error messages into dedicated error classes, such as QueryExecutionErrors, in each module (SQL, Catalyst, ...). Jiaan Geng, Karen Feng, dgd-contributor and Gengliang Wang also helped implement this change.
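
To picture the approach, here is a minimal sketch of the idea (the object and method names are mine, not the real Spark sources): the exception construction is centralized in one place per module instead of being scattered across the operators:

// Illustrative sketch only, not the actual Spark classes: centralizing the
// construction keeps the messages, error classes and SQLSTATEs consistent
object MyQueryExecutionErrors {

  def divideByZeroError(): ArithmeticException =
    new ArithmeticException("Division by zero. To return NULL instead, turn off ANSI mode.")

  def unsupportedDataTypeError(typeName: String): UnsupportedOperationException =
    new UnsupportedOperationException(s"Unsupported data type: $typeName")
}

The calling code then only references these factory methods, which makes the error catalog much easier to audit and evolve.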

In addition to these standardization efforts, the error messages were also extended with the SQLSTATE. Karen Feng added this information for some ANSI/ISO standard errors, such as division by zero (SQLSTATE = 22012) or missing columns in the input (SQLSTATE = 42000). The snippet below shows how to read both the SQLSTATE and the error class:

// Imports assumed for this snippet (it runs inside a ScalaTest spec mixing in Matchers)
import org.apache.spark.SparkThrowable
import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder()
      .appName("SPARK-34920 : errors with extra fields").master("local[*]")
      .config("spark.sql.ansi.enabled", true) // Otherwise a null is returned instead of the exception
      .getOrCreate()

    // In ANSI mode the division by zero fails and the thrown error exposes
    // both the SQLSTATE and the error class
    val exception = intercept[SparkThrowable] {
      sparkSession.sql("SELECT 3/0").show()
    }
    exception.getSqlState() shouldEqual "22012"
    exception.getErrorClass() shouldEqual "DIVIDE_BY_ZERO"

User-Defined Type

The second "misc" feature is not something new in the code base but it's if we consider the public API. The User-Defined Types have been there for a while, since Apache Spark 1! I even blogged about them when I was still learning Scala and hence, was using Java for the code snippets (User Defined Type). And it was Apache Spark 1.6.

However, the API was considered at that time as something to improve, and the community decided to make it private in Hide UserDefinedType in Spark 2.0. At the same time, the JIRA ticket aiming to make it public (SPARK-7768) remained open, and some tricky workarounds existed, such as creating an org.apache.spark.sql package in your project and putting the UDTs there. Unfortunately, these workarounds only worked with Java 8.
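
To illustrate the trick, here is a hypothetical sketch (the package, class and field names are mine, and it assumes the pre-3.2.0 private[spark] visibility of UserDefinedType): by living under an org.apache.spark.sql sub-package, the class could see the hidden API:

// Hypothetical sketch of the pre-3.2.0 workaround; the package placement is the
// only important part, the real serialization logic is left out on purpose
package org.apache.spark.sql.waitingforcode

import org.apache.spark.sql.types.{DataType, UserDefinedType}

case class LegacyAccount(nr: String, priority: Int)

class LegacyAccountUdt extends UserDefinedType[LegacyAccount] {
  override def sqlType: DataType = ???
  override def serialize(account: LegacyAccount): Any = ???
  override def deserialize(datum: Any): LegacyAccount = ???
  override def userClass: Class[LegacyAccount] = ???
}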

To overcome these issues, and because of the lack of further improvement discussions about the UDT API, Sean Owen brought it back to the public API! However, it's still annotated as a @DeveloperApi, so maybe it won't be the first thing you will do with Apache Spark.

But if you do and are looking for some demo code, it's just below:

// Imports assumed for the snippet; the classes are also expected to live in the
// com.waitingforcode package, as referenced in the UDTRegistration call below
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class UserDefinedTypeTest extends AnyFlatSpec with Matchers {

  // You can run the code with Spark 3.1.0 to see the registering problem
  it should "register the UDT in Apache Spark 3.2.0" in {
    val sparkSession = SparkSession.builder()
      .appName("SPARK-7768 : promote UDT to the public API").master("local[*]")
      .getOrCreate()

    UDTRegistration.register("com.waitingforcode.Account",
      "com.waitingforcode.AccountUserDefinedType")
    import sparkSession.implicits._

    val schema = Seq(Account("abc", 1), Account("def", 2)).toDF("account").schema

    schema.simpleString shouldEqual "struct<account:accountuserdefinedtype_typename_override>"
  }

}

case class Account(nr: String, priority: Int)

// Maps the Account case class to and from Spark's internal row representation
class AccountUserDefinedType extends UserDefinedType[Account] {
  override def sqlType: DataType = StructType(Seq(
    StructField("nr", DataTypes.StringType, nullable = false),
    StructField("priority", DataTypes.IntegerType, nullable = false)
  ))

  override def serialize(accountClass: Account): InternalRow = {
    InternalRow(UTF8String.fromString(accountClass.nr), accountClass.priority)
  }

  override def deserialize(datum: Any): Account = datum match {
    case row: InternalRow => Account(row.getString(0), row.getInt(1))
  }

  override def userClass: Class[Account] = classOf[Account]

  override def typeName: String = "accountuserdefinedtype_typename_override"
}
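
If you want to go one step further than the schema assertion, here is a minimal sketch (assuming the same Account and AccountUserDefinedType classes, the registration, and the implicits import from the test above) showing the round trip through Spark's internal format:

    // Minimal sketch, assuming the registration and the classes defined above;
    // collect() converts the internal rows back to Account instances
    val accounts = Seq(Account("abc", 1), Account("def", 2)).toDF("account")
    accounts.printSchema()

    val firstAccount = accounts.collect().head.getAs[Account]("account")
    firstAccount shouldEqual Account("abc", 1)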

Monitoring

Let's end this blog post with a section dedicated to monitoring and metrics. This time, for ease of presentation, here is the list of changes:

That's all for my exploration of Apache Spark 3.2.0. Apache Spark 3.3.0 is on the way, but I hope to have some extra time to cover other Spark-related topics from my backlog!