Managing different date time formats with DateTimeFormatterBuilder

Versions: Java 8

In one of the homework assignments of my Become a Data Engineer course, I ask students to normalize a dataset. In that dataset, a date time field comes in several supported formats. While analyzing the possible solutions, I found a class I had never come across before, DateTimeFormatterBuilder. It is the topic of this post.

If you have already used DateTimeFormatter's constants like ISO_LOCAL_DATE or ISO_LOCAL_DATE_TIME, you have already met DateTimeFormatterBuilder, because all these constants are constructed with the builder. Below you can find how ISO_LOCAL_DATE_TIME is created from 2 other constant formatters:

        ISO_LOCAL_DATE_TIME = new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .append(ISO_LOCAL_DATE)
                .appendLiteral('T')
                .append(ISO_LOCAL_TIME)
                .toFormatter(ResolverStyle.STRICT, IsoChronology.INSTANCE);

That's the first thing to notice. Thanks to the builder's append(DateTimeFormatter formatter) method we can easily create a new formatter from existing ones, without repeating the pattern strings. In the above snippet you can also see a second building method, appendLiteral, which adds a character (the char overload used here) or a string (the String overload) to the built formatter.
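To see the composition outside of the JDK source, here is a minimal sketch that reuses the same two constants but separates them with a space instead of 'T'. The class name ComposedFormatterSketch and the SPACE_SEPARATED constant are made up for the illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;

class ComposedFormatterSketch {
    // Hypothetical formatter: ISO date and ISO time joined by a space.
    // No pattern string is repeated, both parts come from existing constants.
    static final DateTimeFormatter SPACE_SEPARATED = new DateTimeFormatterBuilder()
            .append(DateTimeFormatter.ISO_LOCAL_DATE)
            .appendLiteral(' ')
            .append(DateTimeFormatter.ISO_LOCAL_TIME)
            .toFormatter();

    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, SPACE_SEPARATED);
    }

    public static void main(String[] args) {
        System.out.println(parse("2020-12-01 10:15:30"));
    }
}
```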

Fortunately, composition is not the only advantage of the builder. Another one is explicitness. I don't know about you, but for me it was (and still is!) painful to look at a formatter pattern and figure out whether m stands for month or minute. With the builder you can write the expressions in a much simpler and more readable way:

  val builderFromValuesAndLiterals = new DateTimeFormatterBuilder()
    .appendLiteral("The time is ")
    .appendValue(ChronoField.HOUR_OF_DAY, 2)
    .appendLiteral(":")
    .appendValue(ChronoField.MINUTE_OF_HOUR, 2)
    .toFormatter
  LocalTime.parse("The time is 03:30", builderFromValuesAndLiterals)

The above formatter will create a LocalTime instance for 03:30. To say whether the hour or the minute should have 1 or 2 digits, you simply call the appendValue method and specify the width of the field. From there we can go even further and deal with optional formats! Let's imagine a case where our date field can be a year, a year with a month, or a full date. Dealing with that with a builder is quite easy:
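The width argument can be illustrated with a small sketch. It assumes two hypothetical formatters: one with a fixed 2-digit hour, and one using the single-argument appendValue(field) overload, which accepts a variable number of digits. The class and helper names are invented for the example.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

class FieldWidthSketch {
    // The hour must be written with exactly 2 digits.
    static final DateTimeFormatter FIXED_HOUR = new DateTimeFormatterBuilder()
            .appendValue(ChronoField.HOUR_OF_DAY, 2)
            .appendLiteral(':')
            .appendValue(ChronoField.MINUTE_OF_HOUR, 2)
            .toFormatter();

    // The hour can be written with a variable number of digits.
    static final DateTimeFormatter FLEXIBLE_HOUR = new DateTimeFormatterBuilder()
            .appendValue(ChronoField.HOUR_OF_DAY)
            .appendLiteral(':')
            .appendValue(ChronoField.MINUTE_OF_HOUR, 2)
            .toFormatter();

    // Hypothetical helper: true when the text can be parsed as a LocalTime.
    static boolean parses(DateTimeFormatter formatter, String text) {
        try {
            LocalTime.parse(text, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses(FIXED_HOUR, "03:30"));
        System.out.println(parses(FIXED_HOUR, "3:30"));
        System.out.println(parses(FLEXIBLE_HOUR, "3:30"));
    }
}
```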

  val parserOptionalMonthDays = new DateTimeFormatterBuilder()
    .appendValue(ChronoField.YEAR, 4)
    .optionalStart()
      .appendPattern("MM[dd]")
    .optionalEnd()
    .parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
    .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
    .toFormatter()

  assert(LocalDate.parse("2020", parserOptionalMonthDays).toString == "2020-01-01")
  assert(LocalDate.parse("202005", parserOptionalMonthDays).toString == "2020-05-01")
  assert(LocalDate.parse("20200505", parserOptionalMonthDays).toString == "2020-05-05")

It works as long as all the values follow variations of the same format. But what if the field stores the date either as text (e.g. 3 Dec 2020) or as numbers (e.g. 03122020)? Optionals, and more exactly the appendOptional(DateTimeFormatter formatter) method, are the solution:

  val parserOptionalFormats = new DateTimeFormatterBuilder()
    .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)
    .appendOptional(DateTimeFormatter.ofPattern("d MMM uuuu"))
    .appendOptional(DateTimeFormatter.ofPattern("yyyyMMdd"))
    .toFormatter()
  assert(LocalDate.parse("2020-12-01", parserOptionalFormats).toString == "2020-12-01")
  assert(LocalDate.parse("1 Dec 2020", parserOptionalFormats).toString == "2020-12-01")
  assert(LocalDate.parse("20201201", parserOptionalFormats).toString == "2020-12-01")

Of course, there is no magic and if your date does not match any of the optionals, the parsing will fail. With the above example, a date like "1 December 2020" would fail. Still, it sounds great! Thanks to that we can process datasets that don't have good data governance. The problem is that we'll probably process big amounts of data, so if this operation is expensive, our processing will take time ⌛
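The failure case can be sketched as follows. The example wraps the parsing in a hypothetical tryParse helper that returns an empty Optional instead of throwing, which may be handy when a few bad records shouldn't stop the whole processing. I pass Locale.ENGLISH explicitly as an assumption, so that "Dec" does not depend on the default locale of the machine.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

class LenientDateParserSketch {
    // Same 3 optional formats as in the Scala snippet above.
    static final DateTimeFormatter SUPPORTED_FORMATS = new DateTimeFormatterBuilder()
            .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)
            .appendOptional(DateTimeFormatter.ofPattern("d MMM uuuu"))
            .appendOptional(DateTimeFormatter.ofPattern("yyyyMMdd"))
            .toFormatter(Locale.ENGLISH);

    // Hypothetical helper: empty result instead of an exception for unsupported formats.
    static Optional<LocalDate> tryParse(String text) {
        try {
            return Optional.of(LocalDate.parse(text, SUPPORTED_FORMATS));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("1 Dec 2020"));
        System.out.println(tryParse("1 December 2020"));
    }
}
```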

To be aware of what happens, let's check the internals. What happens when we want to parse a date having multiple possible formats? Everything happens inside the CompositePrinterParser class, which stores all printer parsers in its private final DateTimePrinterParser[] printerParsers array. Every printer parser is called in this loop of the parse method:

                for (DateTimePrinterParser pp : printerParsers) {
                    position = pp.parse(context, text, position);
                    if (position < 0) {
                        break;
                    }
                }
Snippet 1

As you can see, the loop itself calls a parse method, and that nested method looks very similar to the one hosting the loop:

                for (DateTimePrinterParser pp : printerParsers) {
                    position = pp.parse(context, text, position);
                    if (position < 0) {
                        break;
                    }
                }
Snippet 2

Why so? The first level of printer parsers holds our global patterns like "yyyyMMdd" or "d MMM uuuu". And each of these patterns is composed of other printer parsers that are used to match the string against the pattern. The following image shows this dependency:
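The same nesting can also be inspected in text form by printing the formatter. As far as I can tell, toString() renders the tree of printer parsers and wraps the optional groups in square brackets, but the exact output is an implementation detail, so take the sketch below as a debugging aid rather than a contract. The class and method names are made up for the example.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;

class FormatterStructureSketch {
    // Builds the formatter with 3 optional formats and returns its textual description.
    static String describe() {
        DateTimeFormatter formatter = new DateTimeFormatterBuilder()
                .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)
                .appendOptional(DateTimeFormatter.ofPattern("d MMM uuuu"))
                .appendOptional(DateTimeFormatter.ofPattern("yyyyMMdd"))
                .toFormatter();
        return formatter.toString();
    }

    public static void main(String[] args) {
        // Prints one bracketed group per optional formatter.
        System.out.println(describe());
    }
}
```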

You can also notice that some of these parsers are marked as optional. In the parse method mentioned before, the optional parsers are actually the ones that really parse the date:

            if (optional) {
                context.startOptional();
                int pos = position;
                for (DateTimePrinterParser pp : printerParsers) {
                    pos = pp.parse(context, text, pos);
                    if (pos < 0) {
                        context.endOptional(false);
                        return position;  // return original position
                    }
                }
                context.endOptional(true);
                return pos;
Snippet 3

But let's go back to the higher level. In fact, all defined parsers will be executed every time, but for completely different dates, not fully. The line pos = pp.parse(context, text, pos); from the above snippet will return a negative position and the given parser will simply quit after its first non-matching element. Nonetheless, iterating over all possible formats is a small drawback of this approach, even though each attempt stops as soon as possible. To overcome this issue I first thought about using append to define all supported formats, but it didn't work since the whole expression is considered as a single pattern that the input has to match entirely:

  val parserOptionalFormats = new DateTimeFormatterBuilder()
    .append(DateTimeFormatter.ofPattern("d MMM uuuu"))
    .appendLiteral(" ")
    .append(DateTimeFormatter.ofPattern("yyyyMMdd"))
    .toFormatter()
  assert(LocalDate.parse("1 Dec 2020 20201201", parserOptionalFormats).toString == "2020-12-01")
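To make the "single pattern" behavior more concrete, here is a small sketch of the same chain in Java. It assumes Locale.ENGLISH for the month text and uses a hypothetical parses helper: the input is accepted only when it contains both representations, and a value written in just one of the formats is rejected.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;

class AppendChainSketch {
    // With append, both formats become mandatory parts of one bigger pattern.
    static final DateTimeFormatter CHAINED_FORMATS = new DateTimeFormatterBuilder()
            .append(DateTimeFormatter.ofPattern("d MMM uuuu"))
            .appendLiteral(' ')
            .append(DateTimeFormatter.ofPattern("yyyyMMdd"))
            .toFormatter(Locale.ENGLISH);

    // Hypothetical helper: true when the text can be parsed as a LocalDate.
    static boolean parses(String text) {
        try {
            LocalDate.parse(text, CHAINED_FORMATS);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("1 Dec 2020 20201201"));
        System.out.println(parses("1 Dec 2020"));
        System.out.println(parses("20201201"));
    }
}
```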

But to be honest, I didn't spend a lot of time on understanding the formatter builder, so maybe I got things wrong somewhere. If it's not the right thing to do, I will be glad to discover a more efficient way from the comments. Thank you 🙏

