Schema versions in Parquet

Versions: Parquet 1.9.0

When I started to play with Apache Parquet, I was surprised to find 2 versions of writers. Before moving on to the rest of the planned topics, it's a good moment to explain these versions better.

This post talks about schema versions in Parquet. The first section describes the differences between the 2 writer versions. The second one shows how the files generated with each version differ.

Writers differences

The first important question: why 2 different write modes? The change was introduced in December 2013 and was dictated by the addition of new encoding formats. In order to keep backward compatibility, a small enumeration with the 2 supported writer versions was added to ParquetProperties:

public enum WriterVersion {
  PARQUET_1_0 ("v1"),
  PARQUET_2_0 ("v2");
  // ... rest of the enum omitted
}

Prior to Parquet 2.0, most values were written with plain encoding. Only the next version brought new, more efficient encodings, such as RLE/bit-packing (for booleans), delta encodings (for binary and fixed-length byte array types) and delta encoding with binary packing (for integers).
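To give an intuition of why these encodings can be more compact than plain encoding, here is a minimal sketch of the ideas behind delta and run-length encoding. It is a simplified illustration and not Parquet's actual implementation or byte layout: the class EncodingSketch and its methods are hypothetical names invented for this example, and the real delta binary packing additionally groups values in blocks and bit-packs the deltas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the Parquet API.
final class EncodingSketch {

    // Delta encoding: the first slot holds the first value itself,
    // every following slot holds the difference with the previous value.
    static long[] deltaEncode(long[] values) {
        long[] deltas = new long[values.length];
        long previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Reverses deltaEncode by accumulating the differences.
    static long[] deltaDecode(long[] deltas) {
        long[] values = new long[deltas.length];
        long current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            values[i] = current;
        }
        return values;
    }

    // Run-length encoding: consecutive equal booleans collapse into
    // {bit, runLength} pairs, where bit is 1 for true and 0 for false.
    static List<int[]> runLengthEncode(boolean[] values) {
        List<int[]> runs = new ArrayList<>();
        for (boolean value : values) {
            int bit = value ? 1 : 0;
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == bit) {
                runs.get(runs.size() - 1)[1]++;
            } else {
                runs.add(new int[] {bit, 1});
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        long[] ids = {1000, 1001, 1003, 1006, 1010};
        System.out.println(Arrays.toString(deltaEncode(ids)));

        boolean[] flags = {true, true, true, false, false, true};
        for (int[] run : runLengthEncode(flags)) {
            System.out.println("bit " + run[0] + " repeated " + run[1] + " time(s)");
        }
    }
}
```

In this toy version, slowly growing integers turn into small deltas that need fewer bits to store, and long runs of identical booleans shrink to a handful of pairs. Plain encoding would store every value in full.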

Writer versions examples

In order to see the differences between the writers, let's analyze the encodings applied to some of the columns of our example WorkingCitizen class:

public void should_compare_files_written_with_both_available_versions() throws IOException {
  Path filePathV1 = new Path(TEST_FILE_V1);
  writeCitizens(filePathV1, ParquetProperties.WriterVersion.PARQUET_1_0);
  Path filePathV2 = new Path(TEST_FILE_V2);
  writeCitizens(filePathV2, ParquetProperties.WriterVersion.PARQUET_2_0);

  ParquetFileReader fileReaderV1 = ParquetFileReader.open(new Configuration(), filePathV1);
  ParquetFileReader fileReaderV2 = ParquetFileReader.open(new Configuration(), filePathV2);

  List<BlockMetaData> rowGroupsV1 = fileReaderV1.getRowGroups();
  BlockMetaData rowGroupV1 = rowGroupsV1.get(0);
  List<BlockMetaData> rowGroupsV2 = fileReaderV2.getRowGroups();
  BlockMetaData rowGroupV2 = rowGroupsV2.get(0);
  // Check double value
  ColumnChunkMetaData creditRatingV1 = getMetadataForColumn(rowGroupV1, "creditRating");
  ColumnChunkMetaData creditRatingV2 = getMetadataForColumn(rowGroupV2, "creditRating");
  assertThat(creditRatingV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  // Check nested type
  ColumnChunkMetaData professionalSkillsV1 = getMetadataForColumn(rowGroupV1, "professionalSkills");
  ColumnChunkMetaData professionalSkillsV2 = getMetadataForColumn(rowGroupV2, "professionalSkills");
  assertThat(professionalSkillsV1.getEncodings()).contains(Encoding.PLAIN_DICTIONARY, Encoding.RLE);
  assertThat(professionalSkillsV2.getEncodings()).contains(Encoding.RLE_DICTIONARY, Encoding.PLAIN);
  // Check enum type
  ColumnChunkMetaData civilityV1 = getMetadataForColumn(rowGroupV1, "civility");
  ColumnChunkMetaData civilityV2 = getMetadataForColumn(rowGroupV2, "civility");
  assertThat(civilityV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
}

The different writer versions are simply a result of Parquet's evolution. The 2.0 version greatly improved the encoding capabilities, and this change needed to be backward compatible. Because of that, the writers were divided into 2 versions, each applying different encoding methods when writing values. The second section of this post showed that through learning tests comparing the same data written with the 2 writers.