Schema versions in Parquet

When I started to play with Apache Parquet, I was surprised to find 2 versions of writers. Before approaching the rest of the planned topics, it's a good moment to explain these versions better.

Bartosz Konieczny

This post talks about the 2 writer versions in Parquet. The first section describes the differences between them. The second one shows how the files generated with each version differ.

Writer differences

The first important question: why 2 different write modes? The change was introduced in December 2013 and was driven by the addition of new encoding formats. To preserve backward compatibility, a small enumeration with the 2 supported writer versions was added to ParquetProperties:

public enum WriterVersion {
  PARQUET_1_0 ("v1"),
  PARQUET_2_0 ("v2");
  // remaining members of the enum omitted
}

Prior to Parquet 2.0, most values were written with plain encoding. The next version brought new, more efficient encodings, such as RLE/bit-packing (for booleans), delta encodings (for binary and fixed-length byte array types) and delta encoding with binary packing (for integers).
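To get an intuition of why these encodings can be more compact than plain encoding, here is a minimal sketch of the two underlying ideas: run-length encoding for booleans and delta encoding for integers. It is only an illustration. The EncodingSketch class and its methods are hypothetical names of mine, and the code does not reproduce Parquet's real on-disk layout (no blocks, headers or actual bit-packing).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only: this is NOT Parquet's on-disk format.
final class EncodingSketch {

  private EncodingSketch() {}

  // Run-length encoding: each int[] is {value (0 or 1), runLength}.
  static List<int[]> runLengthEncode(boolean[] values) {
    List<int[]> runs = new ArrayList<>();
    for (boolean value : values) {
      int bit = value ? 1 : 0;
      if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == bit) {
        // Same value as the previous one: extend the current run
        runs.get(runs.size() - 1)[1]++;
      } else {
        // Value changed (or first value): start a new run
        runs.add(new int[] {bit, 1});
      }
    }
    return runs;
  }

  // Delta encoding: keep the first value, then store only the differences.
  static long[] deltaEncode(long[] values) {
    long[] encoded = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      encoded[i] = (i == 0) ? values[0] : values[i] - values[i - 1];
    }
    return encoded;
  }

  // Inverse of deltaEncode: rebuild the values by accumulating the deltas.
  static long[] deltaDecode(long[] encoded) {
    long[] values = new long[encoded.length];
    for (int i = 0; i < encoded.length; i++) {
      values[i] = (i == 0) ? encoded[0] : values[i - 1] + encoded[i];
    }
    return values;
  }

  // Number of bits needed to store a non-negative value (at least 1).
  static int bitsNeeded(long value) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
  }
}
```

With this sketch, the sequence 1000, 1001, 1003, 1006 is delta-encoded as 1000, 1, 2, 3. The largest delta fits in 2 bits, whereas the largest plain value needs 10 bits, which is the kind of gain that binary packing can exploit on slowly changing integers.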

Writer version examples

To see the differences between the writers, let's analyze the encodings applied to some columns of our example WorkingCitizen class:

@Test
public void should_compare_files_written_with_both_available_versions() throws IOException {
  Path filePathV1 = new Path(TEST_FILE_V1);
  writeCitizens(filePathV1, ParquetProperties.WriterVersion.PARQUET_1_0);
  Path filePathV2 = new Path(TEST_FILE_V2);
  writeCitizens(filePathV2, ParquetProperties.WriterVersion.PARQUET_2_0);

  ParquetFileReader fileReaderV1 = ParquetFileReader.open(new Configuration(), filePathV1);
  ParquetFileReader fileReaderV2 = ParquetFileReader.open(new Configuration(), filePathV2);

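  // Each file is expected to hold a single row group, so we inspect the first one
  List<BlockMetaData> rowGroupsV1 = fileReaderV1.getRowGroups();
  BlockMetaData rowGroupV1 = rowGroupsV1.get(0);
  List<BlockMetaData> rowGroupsV2 = fileReaderV2.getRowGroups();
  BlockMetaData rowGroupV2 = rowGroupsV2.get(0);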
  // Check double value
  ColumnChunkMetaData creditRatingV1 = getMetadataForColumn(rowGroupV1, "creditRating");
  ColumnChunkMetaData creditRatingV2 = getMetadataForColumn(rowGroupV2, "creditRating");
  assertThat(creditRatingV1.getEncodings()).isNotEqualTo(creditRatingV2.getEncodings());
  assertThat(creditRatingV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  assertThat(creditRatingV2.getEncodings()).contains(Encoding.PLAIN);
  // Check nested type
  ColumnChunkMetaData professionalSkillsV1 = getMetadataForColumn(rowGroupV1, "professionalSkills");
  ColumnChunkMetaData professionalSkillsV2 = getMetadataForColumn(rowGroupV2, "professionalSkills");
  assertThat(professionalSkillsV1.getEncodings()).isNotEqualTo(professionalSkillsV2.getEncodings());
  assertThat(professionalSkillsV1.getEncodings()).contains(Encoding.PLAIN_DICTIONARY, Encoding.RLE);
  assertThat(professionalSkillsV2.getEncodings()).contains(Encoding.RLE_DICTIONARY, Encoding.PLAIN);
  // Check enum type
  ColumnChunkMetaData civilityV1 = getMetadataForColumn(rowGroupV1, "civility");
  ColumnChunkMetaData civilityV2 = getMetadataForColumn(rowGroupV2, "civility");
  assertThat(civilityV1.getEncodings()).isNotEqualTo(civilityV2.getEncodings());
  assertThat(civilityV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  assertThat(civilityV2.getEncodings()).contains(Encoding.DELTA_BYTE_ARRAY);
  // Release the file handles held by both readers
  fileReaderV1.close();
  fileReaderV2.close();
}

Different writer versions are simply a result of Parquet's evolution. Version 2.0 greatly improved the encoding capabilities, and this change needed to stay backward compatible. Because of that, the writers were split into 2 versions, each applying different encoding methods when writing values. The second section of this post showed that through learning tests comparing the same data written with the 2 writers.
