What's new in Apache Spark 3.0 - Proleptic Calendar and date time management

Versions: Apache Spark 3.0.0

When I was writing my blog post about datetime conversion in Apache Spark 2.4, I wanted to check something on Apache Spark's GitHub. To my surprise, the code had nothing in common with the code I was analyzing locally. And that's how I discovered the first change in Apache Spark 3.0, the first among a few others that I will cover in a new series, "What's new in Apache Spark 3.0".


To be honest with you, I thought that changing date formatting was a quite simple operation. However, after looking at the history of this evolution in the backlog, I found out how wrong I was 😱 And I will try to share this with you in the next sections. In the first, very short one, I will explain why the change was required. After that, I will present some technical details, like backward compatibility and rebasing.

Proleptic Gregorian calendar

Before Apache Spark 3.0, the framework used the hybrid Gregorian+Julian calendar and there were multiple reasons for changing it to the Proleptic Gregorian calendar. First and foremost, this brand new release of Apache Spark moves closer to the SQL standard, and the Gregorian+Julian calendar was not a standardized way to represent temporal information. The Gregorian calendar is the one referenced in the ISO 8601 specification, which is the norm for the exchange of date- and time-related data.

Apart from this, the use of a non-standardized calendar led to some interoperability issues with PySpark and Pandas, some of which should be fixed by this release. Also, the previously used Java API (the java.util package) tends to be slower than the new one (the java.time package). That being said, I didn't find any direct benchmarks for Apache Spark, but a few other, more general ones quoted in the "Read more" section confirm this.

Those are the main reasons behind switching to a more standardized calendar. However, and that was quite a big surprise for me, the switch was quite painful.

Challenges

Why was changing the calendar so hard? After all, the difference between the Proleptic Gregorian and Julian+Gregorian calendars mostly concerns dates in October 1582 and before. SPARK-30951 gives some quite interesting use cases. Working on historical data is nothing strange, but using very old dates like "1200-01-01" to indicate a null value, in my opinion, really is. A less strange, but still interesting, use case for the old dates is a library that transforms sensitive dates by converting them to dates in the distant past, like "0602-04-04".

Moreover, the format supported by the new API has some important differences from the old one. One of them is that DateTimeFormatter, adopted in 3.0, uses the letter u for the year, whereas in the old formatter (SimpleDateFormat) this letter means the day number of the week.
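To see the difference outside Spark, here is a minimal sketch using only the JDK; the date 2020-07-01, a Wednesday, serves as the example:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PatternLetterDifference extends App {
  // Old API: in SimpleDateFormat, 'u' means the day number of the week (1 = Monday)
  val legacyDayFormatter = new SimpleDateFormat("u")
  val wednesday = new SimpleDateFormat("yyyy-MM-dd").parse("2020-07-01")
  println(legacyDayFormatter.format(wednesday)) // "3", i.e. Wednesday

  // New API: in DateTimeFormatter, 'u' means the year
  val newFormatter = DateTimeFormatter.ofPattern("uuuu-MM-dd")
  println(LocalDate.parse("2020-07-01", newFormatter)) // 2020-07-01
}
```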

To address these two issues, Apache Spark contributors implemented two different solutions that I describe in the next sections.

Internal format

The first problem encountered with this change was related to the different characters used in datetime patterns (old SimpleDateFormat vs new DateTimeFormatter). To handle it, Apache Spark 3.0 provides its own layer for datetime formatting which is backward compatible, i.e. you still define the year as a "y" but under the hood, the framework does the job of translating it and adapting it to the new datetime API. The magic happens in DateTimeFormatterHelper#convertIncompatiblePattern(pattern: String):

// ...
  final val weekBasedLetters = Set('Y', 'W', 'w', 'u', 'e', 'c')
  final val unsupportedLetters = Set('A', 'n', 'N', 'p')

          for (c <- patternPart if weekBasedLetters.contains(c)) {
            throw new IllegalArgumentException(s"All week-based patterns are unsupported since" +
              s" Spark 3.0, detected: $c, Please use the SQL function EXTRACT instead")
          }
          for (c <- patternPart if unsupportedLetters.contains(c) ||
            (isParsing && unsupportedLettersForParsing.contains(c))) {
            throw new IllegalArgumentException(s"Illegal pattern character: $c")
          }
// ...
          // In DateTimeFormatter, 'u' supports negative years. We substitute 'y' to 'u' here for
          // keeping the support in Spark 3.0. If parse failed in Spark 3.0, fall back to 'y'.
          // We only do this substitution when there is no era designator found in the pattern.
          if (!eraDesignatorContained) {
            patternPart.replace("y", "u")
          } else {
            patternPart
          }

From the snippet you can also notice that not all characters are supported in the pattern. Among the forbidden ones you will find the week-based letters (Set('Y', 'W', 'w', 'u', 'e', 'c')) and the unsupported ones (Set('A', 'n', 'N', 'p')). If you check the documentation for DateTimeFormatter and SimpleDateFormat, you will see that there is no real match between them, so there is no possibility to adapt the old API to the new one. And if you use one of them in your datetime pattern, you will get this error:

  "not allowed time format characters" should "make the processing fail" in {
    val exception = intercept[IllegalArgumentException] {
      sparkSession.read.option("timestampFormat", "Y A").json(datasetPath).show()
    }

    exception.getMessage should startWith("All week-based patterns are unsupported since Spark 3.0, detected: Y, Please use the SQL function EXTRACT instead")
  }
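Coming back to the y-to-u substitution, the idea can be sketched with a heavily simplified, hypothetical helper; Spark's real convertIncompatiblePattern additionally handles quoted literals and validates every pattern letter:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object YearSubstitutionSketch extends App {
  // Hypothetical helper, not Spark's code: substitute 'y' with 'u'
  // unless the pattern contains the era designator 'G'
  def toNewApiPattern(pattern: String): String =
    if (pattern.contains("G")) pattern else pattern.replace("y", "u")

  val translatedPattern = toNewApiPattern("yyyy-MM-dd")
  println(translatedPattern) // uuuu-MM-dd
  // The translated pattern works with the new java.time API
  println(LocalDate.parse("2020-01-05", DateTimeFormatter.ofPattern(translatedPattern))) // 2020-01-05
}
```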

With this brand new formatter you should also be aware of one new breaking change: it won't fall back to any other formatter as was the case in Spark 2. In the post I quoted in the introduction, Implicit datetime conversion in Apache Spark SQL, you can learn that in the previous versions the framework first applied the user's format and, if the conversion failed, fell back to the DateTimeUtils conversion class. In Apache Spark 3.0, a single formatter, an instance of DateTimeFormatter, is created. If the parsing fails, the framework uses an instance of the legacy formatters to check whether the operation would succeed in the previous release and, if that's the case, depending on spark.sql.legacy.timeParserPolicy, it will either return the parsed value or fail.

The new formatter class is Iso8601TimestampFormatter. The fallback mechanism looks like this:

  override def parse(s: String): Long = {
    val specialDate = convertSpecialTimestamp(s.trim, zoneId)
    specialDate.getOrElse {
      try {
        val parsed = formatter.parse(s)
// ...
        Math.addExact(SECONDS.toMicros(epochSeconds), microsOfSecond)
      } catch checkParsedDiff(s, legacyFormatter.parse)
    }
  }
// ...
  protected def checkDiffResult[T](
      s: String, legacyParseFunc: String => T): PartialFunction[Throwable, T] = {
    case e: DateTimeParseException if SQLConf.get.legacyTimeParserPolicy == EXCEPTION =>
      val res = try {
        Some(legacyParseFunc(s))
      } catch {
        case _: Throwable => None
      }
      if (res.nonEmpty) {
        throw new SparkUpgradeException("3.0", s"Fail to parse '$s' in the new parser. You can " +
          s"set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to restore the behavior " +
          s"before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.", e)
      } else {
        throw e
      }
  }

Please notice that I'm writing here about timestamp fields but the same rule applies to date fields. Let's see now how this fallback logic behaves, depending on the timeParserPolicy configuration.

Backward compatibility

To implement compatibility with previous versions of Apache Spark, version 3 of the framework has a new property called spark.sql.legacy.timeParserPolicy which defines the parser policy used to deal with dates and timestamps. It can be one of:

- EXCEPTION (default) - when a value can be parsed by the legacy formatter but not by the new one, Spark fails with the SparkUpgradeException built in checkDiffResult above
- LEGACY - Spark keeps using the legacy (SimpleDateFormat-based) formatters, so it behaves like the 2.4 release
- CORRECTED - Spark uses only the new formatter and treats such values as invalid datetime strings
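For instance, to restore the pre-3.0 behavior at session level, you can set the policy like this (just a sketch, assuming an already created SparkSession named sparkSession, as in the previous snippets):

```scala
// Switch the parser policy back to the Spark 2.4 behavior
sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```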

Rebase

So far, for me, rebase was a word reserved for Git. However, it's also used in Spark 3.0 to describe the behavior for the files written in Parquet, Avro, or ORC formats. These formats store temporal values as a number of days or seconds since the Unix epoch. The problem is that files written with Spark 2.4, so with the Gregorian+Julian calendar, may return different results when read with Spark 3.0, especially for old dates. Below you can find an example of a Parquet file written with Spark 2.4 but read with Spark 3.0:

// Writer's part, Spark 2.4.5
object ParquetTimestampDataWriter extends App {
  private val outputPath = "/home/bartosz/workspace/spark-playground/spark-3.0-features/src/test/resources/parquet-spark-2.4.0"

  private val sparkSession: SparkSession = SparkSession.builder()
    .appName("Spark SQL Parquet Writer")
    .master("local[*]")
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
  import sparkSession.implicits._

  val dataset = Seq(
    (Timestamp.valueOf("1000-01-01 10:00:00"))
  ).toDF("datetime_col")

  dataset.write.mode(SaveMode.Overwrite).parquet(outputPath)
}

// Reader's part, Spark 3.0.0
  "file created by Spark 2.4 read with CORRECTED mode" should "return invalid date" in {
    // FYI, in `CORRECTED` mode the dates are read as is, so without rebasing
    val testSparkSession = createSparkSession(LegacyBehaviorPolicy.CORRECTED)
    import testSparkSession.implicits._

    val dateFromSpark240 = testSparkSession.read.parquet(s"${inputPath}").as[String].collect()

    dateFromSpark240 should have size 1
    // 1000-01-06 09:09:21 => but the initial date was 1000-01-01 10:00:00
    dateFromSpark240(0) shouldEqual "1000-01-06 09:09:21"
  }
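The 5-day shift in the test is not random. It's the gap between the Julian and the Proleptic Gregorian calendars for the year 1000 and, as a sanity check, it can be reproduced with the JDK alone, without Spark:

```scala
import java.time.{Instant, ZoneOffset}
import java.util.{GregorianCalendar, TimeZone}

object CalendarGapDemo extends App {
  // Hybrid Julian+Gregorian calendar, the one used by Spark 2.4 through java.util
  val hybridCalendar = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  hybridCalendar.clear()
  hybridCalendar.set(1000, 0, 1) // 1000-01-01; the month is 0-based

  // The same instant read with java.time's Proleptic Gregorian calendar
  val prolepticDate = Instant.ofEpochMilli(hybridCalendar.getTimeInMillis)
    .atOffset(ZoneOffset.UTC).toLocalDate
  println(prolepticDate) // 1000-01-06, 5 days later
}
```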

To handle that, Apache Spark 3.0 comes with "rebase" configurations, like the following two for the Parquet format: spark.sql.legacy.parquet.datetimeRebaseModeInRead and spark.sql.legacy.parquet.datetimeRebaseModeInWrite. Both accept the same EXCEPTION (default), LEGACY and CORRECTED values as the time parser policy.
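Applied to the example above, rebasing the dates at read time could look like this (a sketch; the config name comes from Spark 3.0.0 and may evolve in later releases):

```scala
// Read the Spark 2.4 file again, this time rebasing the values from the
// hybrid calendar to the Proleptic Gregorian one
sparkSession.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
sparkSession.read.parquet(inputPath).show()
```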

I hope you now see why changing the time formatting wasn't an obvious change in the codebase, and why you should take all of this into account when upgrading to Apache Spark 3.0.