ACID file formats - file system layout

Versions: Apache Hudi 0.10.0, Apache Iceberg 0.13.1, Delta Lake 1.1.0 https://github.com/bartosz25/acid-file-formats/tree/main/001_storage_layout

Last week I presented the API of the 3 analyzed ACID file formats. Under-the-hood, they obviously generate data files but not only. And that's something we'll focus on in this blog post.

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my Early Access edition

The demo code consists of 3 applications manipulating a dataset of Letter(id: String, lowerCase: String):

  1. The first application creates a dataset composed of Letter("A", "a"), Letter("B", "b"), Letter("C", "c"), Letter("D", "d"), Letter("E", "e").
  2. The second application updates the "A", deletes the "C" and inserts a Letter("F", "f").
  3. The last job overwrites the current dataset by Letter("G", "g").

You'll find the code for each file format in the 001_storage_layout module on Github. Instead, I'll dig into the code for the first time to understand the relationship between the stored files and the classes.

Apache Hudi

The files left after running the 3 jobs in Apache Hudi looks like in the following schema:

The first interesting file is hoodie_partition.metadata, represented in the code by the HoodiePartitionMetadata class. Apache Hudi creates it for each partition and writes 2 related information:

The table metadata is stored under the .hoodie subdirectory, in the hoodie.properties file. The class generating this file is HoodieTableConfig and it defines all the attributes of the table, such as the type (copy on write in the example), storage format (Parquet), name (letters), or the partition fields (missing).

Other interesting files are the ones starting with a timestamp. They're the markers for the actions made on the table:

Why these 3 files? They're useful for performing the rollbacks and avoiding doing that with the rename() operation that is not atomic on the object stores. In the code they're defined in the HoodieTimeline and HoodieInstant classes.

The 3 subdirectories from the directory listing are:

And these are only the metadata files! Apache Hudi root directory also contains the data files. If you analyze their names, you'll find an interesting pattern. Let's take the example from the directory tree which is db342aa2-8901-41a0-8b66-6c79d9bac234-0_0-6-6_20220320163857541.parquet. The file name is composed as:

You'll find the information about them in the FSUtils makeWriteToken and makeDataFileName methods.

Apache Iceberg

As you have seen, Apache Hudi has a quite extended directory structure. What about Apache Iceberg? I found the organization a bit simpler because after running the first 2 jobs it looks like that:

What do we have here? Let me start with the metadata subdirectory where Iceberg stores:

Let me terminate by analyzing the data files and more specificaly, the 00003-3-b0870fb0-512c-4c71-a6fe-34524a20363a-00001.parquet. As for Apache Hudi, all Iceberg output files ve a standardized name composed of:

To see the method creating the name for the output files, you can go to OutputFileFactory#generateFilename.

Delta Lake

Delta Lake is the last format to analyse in this article. The output of the first 2 jobs looks like:

The separation is pretty clear. The _delta_log subdirectory is the metadata location storing the commits representing the incremental state of the table. This state lists the added and removed files, persist the protocol information, the schema, and the commit info.

The code's interaction with _delta_log commits mainly involves 2 classes. The first is the OptimisticTransaction that asks for the creation of the the file in the commit method. The second class is LogStore that takes the request and materializes the commit information in the file. Since the LogStore is an interface, the underlying implementation will depend on the file system (local, distributed, object store) used for Delta Lake tables.

When it comes to the data files, they use the same naming mechanism as Apache Spark, so: part- + task attempt id + random UUID + the number of files written so far in the process. The logic naming logic is defined in the DelayedCommitProtocol#getFileName method.

The storage layout for the 3 analyzed file formats is similar. They all consider the datasets in terms of metadata and data. Although the metadata part is mostly responsible for the transactional guarantee, it can also contain some performance tips, such as the partition boundaries in Apache Iceberg. The data files are usually generated in the output root directory (Delta Lake, Apache Hudi) or in a separate subdirectory (Apache Iceberg).