Introduction to Apache Parquet

Versions: Parquet 1.9.0

Very often an appropriate storage is as important as the data processing pipeline. Among the different possibilities, we can still store the data in files. Thanks to different formats, such as column-oriented ones, some of the actions in the reading path can be optimized.

This post is the first one devoted to the Apache Parquet format. As an introductory post, it stays mostly theoretical. It tries to give some global ideas about Parquet's ecosystem (1st section) and the storage format (2nd section) without going really deep.

What is Apache Parquet?

Apache Parquet is a columnar storage format using some of the concepts defined in the "Dremel: Interactive Analysis of Web-Scale Datasets" article. This inspiration concerns especially nested structures, which Parquet is able to store efficiently. Thanks to its optimized data organization, Parquet fits analytics queries pretty well.

Columnar storage

In column-oriented storage, the data from a specific column is stored together, in one or several logically separated chunks. The benefits of this type of storage are:

  • I/O reduction - if our query targets only 1 or 2 columns, the storage engine doesn't need to scan all rows, decompress them and ignore the unimportant columns. Instead, it decompresses the needed column chunks directly and applies the filter on them.
  • better compression - each column can be compressed differently, and values of the same type stored together tend to compress well
  • encoding optimization - if the data cardinality in the column is low, we can optimize the storage with a specific encoding. For instance, if the column stores only values for 2 countries identified by the "US" and "DE" ISO codes, they can be optimized through an integer mapping. It could lead to storing 0 for "US" and 1 for "DE" instead of the "US" and "DE" strings each time.
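The benefits listed above can be illustrated with a minimal sketch in plain Python. It is not Parquet's actual implementation: the `to_columns` and `dictionary_encode` helpers and the sample rows are hypothetical names made up for this example. It only shows how rows can be pivoted into columns, how a single column can then be read alone, and how a low-cardinality column can be replaced by integer ids.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace each value by an integer id, assigned in order of first appearance."""
    dictionary = []  # id -> value
    ids = {}         # value -> id
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids[value])
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Rebuild the original values from the dictionary and the ids."""
    return [dictionary[i] for i in encoded]


# Hypothetical sample data, stored row by row.
rows = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "DE"},
    {"user": "c", "country": "US"},
    {"user": "d", "country": "US"},
]

columns = to_columns(rows)

# A query on "country" only touches this single list, not the whole rows.
countries = columns["country"]

dictionary, encoded = dictionary_encode(countries)
print(dictionary)  # ['US', 'DE']
print(encoded)     # [0, 1, 0, 0]
```

The dictionary is stored once while the column itself only keeps small integers, which is where the storage gain for low-cardinality columns comes from.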

The Apache Parquet project is composed of 3 main modules whose meaning is not always obvious at first glance:

Storage in Apache Parquet

Before introducing some storage details, let's clarify that Parquet files are immutable. So in order to change an already created file, we need to read it, write the modified data to a new file and remove the original one at the end. This limitation is caused by the storage internals that contribute to Parquet's efficiency.

The files written by Parquet are stored in a special format composed of the layers defined in the image below:

As you can see, a Parquet file is composed of row groups. Each row group stores a subset of the data in pages that, grouped together, compose a column chunk.
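The nesting of row groups, column chunks and pages can be sketched with a few Python dataclasses. This is only an in-memory model written for illustration, not the real binary format: the `Page`, `ColumnChunk`, `RowGroup` and `build_layout` names are hypothetical, and the sizes are expressed here in numbers of rows and values for simplicity, whereas real writers typically size row groups and pages in bytes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Page:
    values: List[Any]


@dataclass
class ColumnChunk:
    column: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    num_rows: int
    column_chunks: Dict[str, ColumnChunk] = field(default_factory=dict)


def build_layout(rows, rows_per_group, values_per_page):
    """Split rows into row groups, then each column of a group into pages."""
    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        group = RowGroup(num_rows=len(group_rows))
        for column in group_rows[0]:
            # All the values of one column inside this row group.
            values = [row[column] for row in group_rows]
            pages = [
                Page(values[i:i + values_per_page])
                for i in range(0, len(values), values_per_page)
            ]
            group.column_chunks[column] = ColumnChunk(column, pages)
        row_groups.append(group)
    return row_groups


# Hypothetical sample data: 5 rows with 2 columns.
rows = [{"id": i, "country": "US" if i % 2 == 0 else "DE"} for i in range(5)]
layout = build_layout(rows, rows_per_group=3, values_per_page=2)

for index, group in enumerate(layout):
    for chunk in group.column_chunks.values():
        print(index, chunk.column, [page.values for page in chunk.pages])
```

With 5 rows, 3 rows per group and 2 values per page, the sketch produces 2 row groups, and the "id" column chunk of the first one holds the pages [0, 1] and [2].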

A Parquet file also contains some metadata:

The metadata is very helpful to optimize storage (efficient compression) and querying (e.g. filtering that can be optimized with min/max values).
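To give an idea of how min/max values help filtering, here is a simplified sketch in plain Python. It is an assumption-based illustration and not the way any Parquet reader is actually implemented: `build_row_groups`, `scan_greater_than` and the sample ages are hypothetical. Each group of values keeps its minimum and maximum, and a filter can skip a whole group when these statistics prove that no value can match.

```python
def build_row_groups(values, rows_per_group):
    """Split one column into groups and keep min/max statistics for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"values": chunk, "min": min(chunk), "max": max(chunk)})
    return groups


def scan_greater_than(groups, threshold):
    """Return the values > threshold and the number of groups really scanned."""
    matches = []
    scanned = 0
    for group in groups:
        if group["max"] <= threshold:
            # The statistics prove that no value of this group can match.
            continue
        scanned += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, scanned


# Hypothetical sample column.
ages = [18, 21, 25, 30, 34, 39, 41, 52, 67]
groups = build_row_groups(ages, rows_per_group=3)

matches, scanned = scan_greater_than(groups, threshold=40)
print(matches)  # [41, 52, 67]
print(scanned)  # 1
```

Only 1 of the 3 groups is read in this example. The gain obviously depends on how the data is distributed: if every group contained at least one value above the threshold, nothing could be skipped.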

Apache Parquet is a storage format well suited for queries not involving all columns. As shown in the first benchmarks at Twitter, the scanning time was up to 5 times shorter than in the case of Thrift storage. This efficiency is achieved not only thanks to column-oriented storage but also thanks to other optimization techniques, such as statistics or column-level compression, discussed briefly in the 2nd part of this post. But all of that will be explained better in further posts.