When I started to play with Apache Parquet, I was surprised to discover 2 versions of writers. Before approaching the rest of the planned topics, it's a good moment to explain these different versions better.
This post talks about writer versions in Parquet. The first section describes the differences between the 2 formats. The second one shows how files generated with both versions differ.
Writers differences
The first important question: why 2 different write modes? The change was introduced in December 2013 and was dictated by the addition of new encoding formats. In order to keep backward compatibility, a small enumeration listing the 2 supported writing versions was added to ParquetProperties:
public enum WriterVersion {
  PARQUET_1_0 ("v1"),
  PARQUET_2_0 ("v2");
  // shortName field, constructor and helper methods omitted
}
Prior to Parquet 2.0, most values were written with plain encoding. Only the arrival of the next version brought new, more efficient encodings, such as: the RLE/bit-packing hybrid (for booleans), delta encoding (for binary and fixed-length byte array types) and delta encoding with binary packing (for integers).
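To give an idea of how the writer version is picked in practice, here is a minimal sketch based on parquet-avro. The CITIZEN_SCHEMA_JSON constant and the buildCitizenRecord helper are hypothetical placeholders; the important part is the withWriterVersion(...) call on the builder:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;

// Hypothetical Avro schema and record - placeholders for the real WorkingCitizen data
Schema schema = new Schema.Parser().parse(CITIZEN_SCHEMA_JSON);
GenericRecord citizenRecord = buildCitizenRecord(schema);

// The writer version is selected on the builder; PARQUET_2_0 enables the newer encodings
try (ParquetWriter<GenericRecord> writer =
       AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/citizens_v2.parquet"))
         .withSchema(schema)
         .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
         .build()) {
  writer.write(citizenRecord);
}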
Writer versions examples
In order to see the differences between the writers, let's analyze the encodings applied to some columns of our example WorkingCitizen class:
@Test
public void should_compare_files_written_with_both_available_versions() throws IOException {
  Path filePathV1 = new Path(TEST_FILE_V1);
  writeCitizens(filePathV1, ParquetProperties.WriterVersion.PARQUET_1_0);
  Path filePathV2 = new Path(TEST_FILE_V2);
  writeCitizens(filePathV2, ParquetProperties.WriterVersion.PARQUET_2_0);

  ParquetFileReader fileReaderV1 = ParquetFileReader.open(new Configuration(), filePathV1);
  ParquetFileReader fileReaderV2 = ParquetFileReader.open(new Configuration(), filePathV2);
  List<BlockMetaData> rowGroupsV1 = fileReaderV1.getRowGroups();
  BlockMetaData rowGroupV1 = rowGroupsV1.get(0);
  List<BlockMetaData> rowGroupsV2 = fileReaderV2.getRowGroups();
  BlockMetaData rowGroupV2 = rowGroupsV2.get(0);

  // Check double value
  ColumnChunkMetaData creditRatingV1 = getMetadataForColumn(rowGroupV1, "creditRating");
  ColumnChunkMetaData creditRatingV2 = getMetadataForColumn(rowGroupV2, "creditRating");
  assertThat(creditRatingV1.getEncodings()).isNotEqualTo(creditRatingV2.getEncodings());
  assertThat(creditRatingV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  assertThat(creditRatingV2.getEncodings()).contains(Encoding.PLAIN);

  // Check nested type
  ColumnChunkMetaData professionalSkillsV1 = getMetadataForColumn(rowGroupV1, "professionalSkills");
  ColumnChunkMetaData professionalSkillsV2 = getMetadataForColumn(rowGroupV2, "professionalSkills");
  assertThat(professionalSkillsV1.getEncodings()).isNotEqualTo(professionalSkillsV2.getEncodings());
  assertThat(professionalSkillsV1.getEncodings()).contains(Encoding.PLAIN_DICTIONARY, Encoding.RLE);
  assertThat(professionalSkillsV2.getEncodings()).contains(Encoding.RLE_DICTIONARY, Encoding.PLAIN);

  // Check enum type
  ColumnChunkMetaData civilityV1 = getMetadataForColumn(rowGroupV1, "civility");
  ColumnChunkMetaData civilityV2 = getMetadataForColumn(rowGroupV2, "civility");
  assertThat(civilityV1.getEncodings()).isNotEqualTo(civilityV2.getEncodings());
  assertThat(civilityV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
  assertThat(civilityV2.getEncodings()).contains(Encoding.DELTA_BYTE_ARRAY);
}
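The getMetadataForColumn method used above is not shown in the snippet. A minimal sketch of what such a helper could look like, assuming the lookup is done on the dotted path of each column chunk:

private ColumnChunkMetaData getMetadataForColumn(BlockMetaData rowGroup, String columnName) {
  // Finds the column chunk whose dotted path starts with the given field name
  return rowGroup.getColumns().stream()
    .filter(columnMetadata -> columnMetadata.getPath().toDotString().startsWith(columnName))
    .findFirst()
    .orElseThrow(() -> new IllegalArgumentException("Column not found: " + columnName));
}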
The different writer versions are simply a result of Parquet's evolution. The 2.0 version greatly improved the encoding capabilities, and this change needed to remain backward compatible. Because of that, the writers were split into 2 different versions, each applying different encoding methods when writing values. The second section of this post proved that through learning tests comparing the same data written with the 2 different writers.