When I started to play with Apache Parquet, I was surprised to find 2 versions of writers. Before approaching the rest of the planned topics, it's a good moment to explain these different versions better.
Bartosz Konieczny
This post talks about schema versions in Parquet. The first section describes the differences between the 2 formats. The second one shows how the files generated with both versions differ.
Writer differences
The first important question: why 2 different write modes? The change was introduced in December 2013 and was dictated by the addition of new encoding formats. In order to keep backward compatibility, a small enumeration with 2 supported writing versions was added to ParquetProperties:
public enum WriterVersion {
    PARQUET_1_0("v1"),
    PARQUET_2_0("v2");
    // ... (constructor and helper methods omitted)
}
Prior to Parquet 2.0, most values were written with plain encoding. The next version brought new, more efficient encodings, such as: RLE/bit-packing (for booleans), delta encodings (for binary and fixed-length byte array types), and delta encoding with binary packing (for integers).
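To build some intuition for why the delta encodings help, here is a minimal, simplified sketch of the idea behind delta encoding with binary packing. This is not Parquet's actual implementation (which adds block structure, miniblocks and zigzag encoding); it only shows that storing small deltas needs far fewer bits than storing full values:

```java
public class DeltaSketch {
    // Compute deltas relative to the previous value; sorted or slowly
    // changing columns (timestamps, ids) produce small deltas.
    static int[] deltas(int[] values) {
        int[] out = new int[values.length - 1];
        for (int i = 1; i < values.length; i++) {
            out[i - 1] = values[i] - values[i - 1];
        }
        return out;
    }

    // Bits needed to represent the largest delta (assuming non-negative deltas
    // for simplicity; Parquet handles negatives with zigzag encoding).
    static int bitWidth(int[] deltas) {
        int max = 0;
        for (int d : deltas) {
            max = Math.max(max, d);
        }
        return 32 - Integer.numberOfLeadingZeros(Math.max(max, 1));
    }

    public static void main(String[] args) {
        int[] timestamps = {1_000_000, 1_000_003, 1_000_004, 1_000_009};
        int[] d = deltas(timestamps); // {3, 1, 5}
        // Plain encoding needs 32 bits per value; the delta idea needs the
        // first value plus only bitWidth(d) bits per subsequent value.
        System.out.println(bitWidth(d)); // prints 3
    }
}
```

With 3 bits per delta instead of 32 bits per value, the gain grows with the number of rows, which is why Parquet 2.0 applies this family of encodings to integer columns.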
Writer version examples
In order to see the differences between the writers, let's analyze the encodings applied to some columns of our WorkingCitizen example class:
@Test
public void should_compare_files_written_with_both_available_versions() throws IOException {
    Path filePathV1 = new Path(TEST_FILE_V1);
    writeCitizens(filePathV1, ParquetProperties.WriterVersion.PARQUET_1_0);
    Path filePathV2 = new Path(TEST_FILE_V2);
    writeCitizens(filePathV2, ParquetProperties.WriterVersion.PARQUET_2_0);

    ParquetFileReader fileReaderV1 = ParquetFileReader.open(new Configuration(), filePathV1);
    ParquetFileReader fileReaderV2 = ParquetFileReader.open(new Configuration(), filePathV2);
    List<BlockMetaData> rowGroupsV1 = fileReaderV1.getRowGroups();
    BlockMetaData rowGroupV1 = rowGroupsV1.get(0);
    List<BlockMetaData> rowGroupsV2 = fileReaderV2.getRowGroups();
    BlockMetaData rowGroupV2 = rowGroupsV2.get(0);

    // Check double value
    ColumnChunkMetaData creditRatingV1 = getMetadataForColumn(rowGroupV1, "creditRating");
    ColumnChunkMetaData creditRatingV2 = getMetadataForColumn(rowGroupV2, "creditRating");
    assertThat(creditRatingV1.getEncodings()).isNotEqualTo(creditRatingV2.getEncodings());
    assertThat(creditRatingV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
    assertThat(creditRatingV2.getEncodings()).contains(Encoding.PLAIN);

    // Check nested type
    ColumnChunkMetaData professionalSkillsV1 = getMetadataForColumn(rowGroupV1, "professionalSkills");
    ColumnChunkMetaData professionalSkillsV2 = getMetadataForColumn(rowGroupV2, "professionalSkills");
    assertThat(professionalSkillsV1.getEncodings()).isNotEqualTo(professionalSkillsV2.getEncodings());
    assertThat(professionalSkillsV1.getEncodings()).contains(Encoding.PLAIN_DICTIONARY, Encoding.RLE);
    assertThat(professionalSkillsV2.getEncodings()).contains(Encoding.RLE_DICTIONARY, Encoding.PLAIN);

    // Check enum type
    ColumnChunkMetaData civilityV1 = getMetadataForColumn(rowGroupV1, "civility");
    ColumnChunkMetaData civilityV2 = getMetadataForColumn(rowGroupV2, "civility");
    assertThat(civilityV1.getEncodings()).isNotEqualTo(civilityV2.getEncodings());
    assertThat(civilityV1.getEncodings()).contains(Encoding.BIT_PACKED, Encoding.PLAIN);
    assertThat(civilityV2.getEncodings()).contains(Encoding.DELTA_BYTE_ARRAY);
}
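The writeCitizens helper is not shown in the test above. A possible sketch of the relevant part, assuming the ExampleParquetWriter from parquet-hadoop and a deliberately simplified one-field schema (the real WorkingCitizen schema has more fields), is a single withWriterVersion(...) call on the writer builder:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class CitizenWriterSketch {
    // Placeholder schema, not the real WorkingCitizen one
    private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
        "message citizen { required double creditRating; }");

    static void writeCitizens(Path filePath, ParquetProperties.WriterVersion version)
            throws IOException {
        // The writer version flag is the only difference between the 2 generated files
        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(filePath)
                .withType(SCHEMA)
                .withWriterVersion(version)
                .build()) {
            writer.write(new SimpleGroupFactory(SCHEMA).newGroup()
                .append("creditRating", 700.5d));
        }
    }
}
```

The same withWriterVersion(...) builder method is available on the other ParquetWriter subclasses (e.g. AvroParquetWriter), so the version choice is independent of the write support used.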
The different writer versions are simply a result of Parquet's evolution. The 2.0 version greatly improved encoding capabilities, and this change needed to remain backward compatible. Because of that, the writers were divided into 2 versions, each applying different encoding methods when writing values. The second section of this post demonstrated that through learning tests comparing the same data written with the 2 different writers.
Read also about schema versions in Parquet here:
- Add writer version flag to parquet and make initial changes for supported parquet 2.0 encodings
- turn on parquet 2.0 flags
