Introduction to Apache Parquet

Versions: Parquet 1.9.0

Very often an appropriate storage is as important as the data processing pipeline itself. And among the different possibilities, we can still store the data in files. Thanks to different formats, such as the column-oriented ones, some of the actions in the reading path can be optimized.

This post is the first one devoted to the Apache Parquet format. As a purely introductory post, it won't go really deep. Instead, it tries to give some global ideas about Parquet's ecosystem (1st section) and the storage format (2nd section), illustrated only with a few short sketches.

What is Apache Parquet?

Apache Parquet is a columnar storage format using some of the concepts defined in the "Dremel: Interactive Analysis of Web-Scale Datasets" paper. This inspiration concerns especially the nested structures that Parquet is able to store efficiently. Thanks to its optimized data organization, Parquet fits analytics queries pretty well.
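
To give a feel for that nesting, here is what a small schema could look like in Parquet's schema definition language; the message and field names below are made up for illustration:

```
message User {
  required binary name (UTF8);
  repeated group addresses {
    required binary city (UTF8);
    optional binary postal_code (UTF8);
  }
}
```

The repeated group models a list of addresses nested inside every user record, and it's exactly this kind of structure that the Dremel encoding lets Parquet store column by column.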

Columnar storage

In column-oriented storage, the data of each column is stored together, in one chunk or in logically separated chunks. The benefits of this type of storage are:

  • I/O reduction - if our query targets only 1 or 2 columns, the storage engine doesn't need to scan all the rows, decompress them and ignore the irrelevant columns. Instead, it decompresses the needed column chunks directly and applies the filter on them.
  • better compression - each column can be compressed differently, with the settings best suited to its data
  • encoding optimization - if the data cardinality in the column is low, we can optimize the storage with a specific encoding. For instance, if the column stores values for only 2 countries identified by the "US" and "DE" ISO codes, they can be optimized through integer mapping: storing 0 for "US" and 1 for "DE" instead of repeating the "US" and "DE" strings each time (see the sketch after this list).
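
The following minimal Java sketch illustrates the dictionary idea from the last point. It's only a conceptual illustration with made-up values; Parquet builds such dictionaries internally during the write:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncodingSketch {

    public static void main(String[] args) {
        // raw column values as they would arrive from the rows
        List<String> countries = Arrays.asList("US", "DE", "US", "US", "DE");

        // the dictionary assigns a small integer id to each distinct value
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> encoded = new ArrayList<>();
        for (String country : countries) {
            encoded.add(dictionary.computeIfAbsent(country, key -> dictionary.size()));
        }

        System.out.println(dictionary); // {US=0, DE=1}
        System.out.println(encoded);    // [0, 1, 0, 0, 1]
    }
}
```

With only 2 distinct values, each entry shrinks from a 2-character string to a small integer, and the repetitive ids compress very well on top of that.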

The Apache Parquet project is composed of 3 main modules whose meaning is not always obvious at first glance:

  • parquet-format - the specification of the format, with the file structures and metadata defined in Thrift
  • parquet-mr - the Java implementation of the format, together with the modules integrating it with other tools (e.g. Hadoop)
  • parquet-compatibility - the compatibility tests verifying that files written by one implementation can be read by the others

Storage in Apache Parquet

Before introducing some storage details, let's clarify that Parquet files are immutable. So, in order to change an already created file, we need to open it, save the modified content in another place and remove the original at the end. This limitation comes from the storage internals that contribute to Parquet's efficiency.
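
As a minimal sketch of such a rewrite, assuming a hypothetical users.parquet file and using parquet-mr's example Group API, the "modification" boils down to reading the records, writing them to a new file and replacing the original one:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.schema.MessageType;

public class RewriteParquetFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path source = new Path("users.parquet");    // hypothetical input file
        Path target = new Path("users-v2.parquet"); // rewritten copy

        // the schema is read from the immutable source file's footer
        MessageType schema = ParquetFileReader.readFooter(conf, source)
                .getFileMetaData().getSchema();

        try (ParquetReader<Group> reader =
                     ParquetReader.builder(new GroupReadSupport(), source).build();
             ParquetWriter<Group> writer = ExampleParquetWriter.builder(target)
                     .withType(schema)
                     .build()) {
            Group record;
            while ((record = reader.read()) != null) {
                // apply the "modification" here before writing to the new file
                writer.write(record);
            }
        }

        // only now can the original file be removed and replaced
        FileSystem fs = source.getFileSystem(conf);
        fs.delete(source, false);
        fs.rename(target, source);
    }
}
```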

The files written by Parquet are stored in a special format composed of the layers defined in the image below:

As you can see, a Parquet file is composed of row groups. Each row group stores a subset of the data in pages that, grouped together per column, compose a column chunk.
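
To see these layers in a real file, we can read its footer with parquet-mr. A short sketch, again assuming a hypothetical users.parquet file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectParquetLayout {

    public static void main(String[] args) throws Exception {
        Path file = new Path("users.parquet");
        ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);

        int rowGroupIndex = 0;
        // one BlockMetaData entry per row group
        for (BlockMetaData rowGroup : footer.getBlocks()) {
            System.out.println("row group " + rowGroupIndex++
                    + ": " + rowGroup.getRowCount() + " rows");
            // one column chunk per column within the row group
            for (ColumnChunkMetaData columnChunk : rowGroup.getColumns()) {
                System.out.println("  column chunk " + columnChunk.getPath()
                        + ", codec=" + columnChunk.getCodec()
                        + ", encodings=" + columnChunk.getEncodings()
                        + ", size=" + columnChunk.getTotalSize() + " bytes");
            }
        }
    }
}
```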

A Parquet file also contains some metadata:

  • file metadata - stored in the footer; it contains, among others, the schema and the locations of the column chunks
  • column (chunk) metadata - the information about a given chunk, such as the encodings, the compression codec or the statistics (e.g. min/max values, number of nulls)
  • page header metadata - the information about each page, such as its size and the number of stored values

The metadata is very helpful to optimize both the storage (efficient compression) and the querying (e.g. filtering can be optimized with the min/max values).
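
A small sketch showing those statistics, once more on a hypothetical users.parquet file; a query engine compares the searched value with the [min, max] range recorded per column chunk and can skip whole row groups that cannot match:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class ShowColumnStatistics {

    public static void main(String[] args) throws Exception {
        Path file = new Path("users.parquet");

        for (BlockMetaData rowGroup :
                ParquetFileReader.readFooter(new Configuration(), file).getBlocks()) {
            for (ColumnChunkMetaData columnChunk : rowGroup.getColumns()) {
                Statistics<?> stats = columnChunk.getStatistics();
                // a reader can skip the whole row group when the queried value
                // falls outside the [min, max] range recorded here
                System.out.println(columnChunk.getPath()
                        + ": min=" + stats.genericGetMin()
                        + ", max=" + stats.genericGetMax()
                        + ", nulls=" + stats.getNumNulls());
            }
        }
    }
}
```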

Apache Parquet is a storage format well suited for queries not involving all columns. As shown in the first benchmarks at Twitter, the scanning time was up to 5 times smaller than with Thrift storage. This efficiency is achieved not only thanks to the column-oriented storage but also thanks to other optimization techniques, such as the statistics or the column-level compression discussed briefly in the 2nd part of this post. All of that will be explained in more detail in further posts.
