Schema versions in Parquet on waitingforcode.com

Versions: Parquet 1.9.0

When I started to play with Apache Parquet, I was surprised to find 2 versions of writers. Before moving on to the rest of the planned topics, it's a good moment to explain these versions better.

This post talks about schema versions in Parquet. The first section describes the differences between the 2 formats. The second one shows how the files generated with each version differ.

Writers differences

The first important question: why 2 different write modes? The change was introduced in December 2013 and was dictated by the addition of new encoding formats. To keep backward compatibility, a small enumeration with the 2 supported writer versions was added to ParquetProperties:

public enum WriterVersion {
  PARQUET_1_0 ("v1"),
  PARQUET_2_0 ("v2");

  // remaining members of the enum are omitted here
}

Prior to Parquet 2.0, most values were written with plain encoding. The next version brought new, more efficient encodings, such as RLE/bit-packing (for booleans), delta encodings (for binary and fixed-length byte array types) and delta encoding with binary packing (for integers).
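
To give an intuition of why these encodings can be more compact than plain encoding, here is a minimal sketch of the two underlying ideas: delta encoding for integers and run-length encoding for booleans. It is not the Parquet implementation. The class EncodingSketch and its methods are hypothetical names invented for this illustration, and the real encodings add block headers, bit packing and variable-length integers on top of these principles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical class, not part of the Parquet API.
class EncodingSketch {

  // Delta encoding: keep the first value, then store only the differences
  // between consecutive values. Close values produce small numbers that
  // need fewer bits once they are bit-packed.
  static int[] deltaEncode(int[] values) {
    int[] encoded = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      encoded[i] = (i == 0) ? values[i] : values[i] - values[i - 1];
    }
    return encoded;
  }

  // Reverses deltaEncode by accumulating the differences.
  static int[] deltaDecode(int[] encoded) {
    int[] values = new int[encoded.length];
    for (int i = 0; i < encoded.length; i++) {
      values[i] = (i == 0) ? encoded[i] : values[i - 1] + encoded[i];
    }
    return values;
  }

  // Run-length encoding: every run of identical booleans is stored as a
  // pair {value as 1 or 0, run length}.
  static List<int[]> runLengthEncode(boolean[] values) {
    List<int[]> runs = new ArrayList<>();
    for (boolean value : values) {
      int asInt = value ? 1 : 0;
      if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == asInt) {
        runs.get(runs.size() - 1)[1]++;
      } else {
        runs.add(new int[] {asInt, 1});
      }
    }
    return runs;
  }

  public static void main(String[] args) {
    int[] ids = {1000, 1001, 1002, 1005, 1006};
    int[] deltas = deltaEncode(ids);
    System.out.println(Arrays.toString(deltas)); // [1000, 1, 1, 3, 1]
    System.out.println(Arrays.equals(deltaDecode(deltas), ids)); // true

    boolean[] flags = {true, true, true, false, false};
    for (int[] run : runLengthEncode(flags)) {
      System.out.println(Arrays.toString(run)); // [1, 3] then [0, 2]
    }
  }
}
```

In this sketch, 5 integers around 1000 become one large value followed by 4 small differences, and 5 booleans become 2 runs. The gain depends on the data: values that jump around or alternate constantly would not compress well with these techniques.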

Writer versions examples

To see the differences between the writers, let's analyze the encodings applied to some columns of our example WorkingCitizen class:

@Test
public void should_compare_files_written_with_both_available_versions() throws IOException {
  Path filePathV1 = new Path(TEST_FILE_V1);
  writeCitizens(filePathV1, ParquetProperties.WriterVersion.PARQUET_1_0);
  Path filePathV2 = new Path(TEST_FILE_V2);
  writeCitizens(filePathV2, ParquetProperties.WriterVersion.PARQUET_2_0);

  ParquetFileReader fileReaderV1 = ParquetFileReader.open(new Configuration(), filePathV1);
  ParquetFileReader fileReaderV2 = ParquetFileReader.open(new Configuration(), filePathV2);

  List<BlockMetaData> rowGroupsV1 = fileReaderV1.getRowGroups();
  BlockMetaData rowGroupV1 = rowGroupsV1.get(0);
  List<BlockMetaData> rowGroupsV2 = fileReaderV2.getRowGroups();
  BlockMetaData rowGroupV2 = rowGroupsV2.get(0);
  // Check double value
  ColumnChunkMetaData creditRatingV1 = getMetadataForColumn(rowGroupV1, "creditRating");
  ColumnChunkMetaData creditRatingV2 = getMetadataForColumn(rowGroupV2, "creditRating");
  assertThat(creditRatingV1.getEncodings()).isNotEqualTo(creditRatingV2.getEncodings());
  assertThat(creditRatingV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  assertThat(creditRatingV2.getEncodings()).contains(Encoding.PLAIN);
  // Check nested type
  ColumnChunkMetaData professionalSkillsV1 = getMetadataForColumn(rowGroupV1, "professionalSkills");
  ColumnChunkMetaData professionalSkillsV2 = getMetadataForColumn(rowGroupV2, "professionalSkills");
  assertThat(professionalSkillsV1.getEncodings()).isNotEqualTo(professionalSkillsV2.getEncodings());
  assertThat(professionalSkillsV1.getEncodings()).contains(Encoding.PLAIN_DICTIONARY, Encoding.RLE);
  assertThat(professionalSkillsV2.getEncodings()).contains(Encoding.RLE_DICTIONARY, Encoding.PLAIN);
  // Check enum type
  ColumnChunkMetaData civilityV1 = getMetadataForColumn(rowGroupV1, "civility");
  ColumnChunkMetaData civilityV2 = getMetadataForColumn(rowGroupV2, "civility");
  assertThat(civilityV1.getEncodings()).isNotEqualTo(civilityV2.getEncodings());
  assertThat(civilityV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  assertThat(civilityV2.getEncodings()).contains(Encoding.DELTA_BYTE_ARRAY);
}

Different writer versions are simply a result of Parquet's evolution. Version 2.0 greatly improved encoding capabilities, and this change needed to be backward compatible. Because of that, the writers were split into 2 versions, each applying different encodings when writing values. The second section of this post showed that through learning tests comparing the same data written with the 2 writers.

TAGS: #Parquet encoding #Parquet versions