Table file formats - Z-Order compaction: Apache Iceberg

Versions: Apache Iceberg 1.1.0

Last time you discovered the Z-Order compaction in Delta Lake. But guess what? Apache Iceberg also has this feature!

The Z-Order compaction is a special ordering strategy for the compaction in Apache Iceberg. You control it like the other compaction options, with a builder method:

SparkActions
  .get()
  .rewriteDataFiles(lettersTable)
  .zOrder("col1", "col2")
  .execute()
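
If you want to run it end-to-end, here is a minimal sketch. It assumes a Spark session configured with an Iceberg catalog and an existing local.db.letters table; the catalog and table names, as well as the demo class, are hypothetical, so adapt them to your environment:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ZOrderCompactionDemo {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("z-order-compaction-demo")
        .getOrCreate();

    // Resolve the Iceberg table behind the (hypothetical) catalog identifier
    Table lettersTable = Spark3Util.loadIcebergTable(spark, "local.db.letters");

    RewriteDataFiles.Result result = SparkActions
        .get(spark)
        .rewriteDataFiles(lettersTable)
        .zOrder("col1", "col2")
        // the regular compaction options still apply to the Z-Order strategy
        .option("min-input-files", "2")
        .execute();

    System.out.println("Rewritten data files: " + result.rewrittenDataFilesCount());
    System.out.println("Added data files: " + result.addedDataFilesCount());
  }
}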

The underlying mechanism is also similar to the one presented in the Table file formats - compaction: Apache Iceberg blog post. Let's see!

SparkZOrderStrategy

When you call the zOrder(...) highlighted in the previous snippet, you ask Apache Iceberg to create an instance of SparkZOrderStrategy. Like the other compaction classes, it implements a rewriteFiles(List<FileScanTask> filesToRewrite) method to replace the data files. It's also the place where all the Z-Order magic I explained in the Table file formats - Z-Order compaction: Delta Lake blog post happens.

  1. First, the method verifies whether the partitioning schema has changed between the table and the compaction action. If so, the sort expression used for the rewrite is built to satisfy the new partitioning requirement.
  2. Next, the logic defines the number of shuffle partitions with this algorithm:
      long numOutputFiles =
          numOutputFiles((long) (inputFileSize(filesToRewrite) * sizeEstimateMultiple()));
      cloneSession.conf().set(SQLConf.SHUFFLE_PARTITIONS().key(), Math.max(1, numOutputFiles));
    
    Since the Adaptive Query Execution is disabled for the compaction, this value, based on the compression-factor property, must ensure evenly sized output files. You will find a back-of-the-envelope sketch of this arithmetic right after this list.
  3. It's time to prepare the Z-Order sorting now. The first step consists of creating a new column of the array type to store the values of the Z-Order columns sorted lexicographically:
      Column zvalueArray = functions.array(
         zOrderColumns.stream().map(colStruct ->
              zOrderUDF.sortedLexicographically(functions.col(colStruct.name()), colStruct.dataType())).toArray(Column[]::new));
    
    The sortedLexicographically method creates a User-Defined Function that orders the byte representations of the compared objects. Under the hood, Apache Iceberg uses the helper methods from ZOrderByteUtils to perform this action.
  4. After this preparation step, another Z-Order operation comes into play: the interleaving. The SparkZOrderStrategy adds a new column to the dataset with the rows to compact:
    Dataset<Row> zvalueDF = scanDF.withColumn(Z_COLUMN, zOrderUDF.interleaveBytes(zvalueArray));
    
    This column is used as a part of the sort expression generated in the first step:
      private static final org.apache.iceberg.SortOrder Z_SORT_ORDER =
      	org.apache.iceberg.SortOrder.builderFor(Z_SCHEMA)
          	.sortBy(Z_COLUMN, SortDirection.ASC, NullOrder.NULLS_LAST)
          	.build();
    
    I already presented the interleaving in the previously quoted blog post about Z-Order and Delta Lake, so I won't repeat all the details here and will keep at least this article short enough. You will find a naive illustration of the idea right after this list, though.
  5. In the end, a new Dataset with the Z-Order sort expression gets created:
    LogicalPlan sortPlan = sortPlan(distribution, ordering, zvalueDF.logicalPlan(), sqlConf);
    Dataset<Row> sortedDf = new Dataset<>(cloneSession, sortPlan, zvalueDF.encoder());
    
  6. Finally, the SparkZOrderStrategy triggers the compaction action by writing the sorted dataset:
      sortedDf
          .select(originalColumns)
          .write()
          .format("iceberg")
          .option(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID, groupID)
          .option(SparkWriteOptions.TARGET_FILE_SIZE_BYTES, writeMaxFileSize())
          .option(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING, "false")
          .mode("append")
          .save(groupID);
    
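
To make the shuffle partitions computation from step 2 more tangible, here is a back-of-the-envelope sketch. The ceiling division by the target file size is my reading of the strategy code rather than a verbatim copy of it, and all the sizes are made up:

public class ShufflePartitionsEstimate {
  public static void main(String[] args) {
    long inputFilesSize = 10L * 1024 * 1024 * 1024; // 10 GiB of input data files
    double compressionFactor = 1.5;                 // the "compression-factor" property
    long targetFileSizeBytes = 512L * 1024 * 1024;  // 512 MiB target output files

    // Scale the on-disk size by the compression factor to approximate the shuffled size...
    long estimatedSize = (long) (inputFilesSize * compressionFactor);
    // ...and derive the number of output files with a ceiling division
    long numOutputFiles = Math.max(1, (estimatedSize + targetFileSizeBytes - 1) / targetFileSizeBytes);

    // With the Adaptive Query Execution disabled, this value becomes
    // spark.sql.shuffle.partitions, so one shuffle partition = one output file
    System.out.println("spark.sql.shuffle.partitions = " + numOutputFiles); // 30
  }
}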
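
The interleaving announced in step 4 can also be illustrated with a naive sketch. It's not the ZOrderByteUtils implementation, only the idea behind it: alternate the bits of the ordered byte representations so that every column contributes equally to the most significant bits of the Z-address:

import java.nio.ByteBuffer;

public class InterleaveDemo {

  // Takes one bit from each input in turn: a0 b0 a1 b1 ...
  static byte[] interleaveBits(byte[] a, byte[] b) {
    byte[] out = new byte[a.length + b.length];
    int outBit = 0;
    for (int bit = 0; bit < a.length * 8; bit++) {
      int aBit = (a[bit / 8] >> (7 - bit % 8)) & 1;
      out[outBit / 8] |= aBit << (7 - outBit % 8);
      outBit++;
      int bBit = (b[bit / 8] >> (7 - bit % 8)) & 1;
      out[outBit / 8] |= bBit << (7 - outBit % 8);
      outBit++;
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] x = ByteBuffer.allocate(4).putInt(0b0101).array(); // 5
    byte[] y = ByteBuffer.allocate(4).putInt(0b0011).array(); // 3
    byte[] z = interleaveBits(x, y);
    // the last byte interleaves 0101 and 0011 into 00100111
    System.out.println(Integer.toBinaryString(z[z.length - 1] & 0xFF)); // 100111
  }
}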

Lexicographical order

If you've read my blog post about Z-Ordering in Delta Lake, the algorithm should look familiar. But there is one point that troubled me a bit: the sortedLexicographically method. Fortunately, some research helped me refresh my knowledge.

Your table may have signed numerical values such as integers or longs. These "signed" values are numbers with a "+" or "-" sign, which makes the whole Z-Order sorting more challenging. Natively, the binary representations of signed values don't follow their numerical ordering:

Order position | Signed number | Binary representation
1              | 1             | 0000 0001
2              | 2             | 0000 0010
3              | -2            | 1111 1110
4              | -1            | 1111 1111

It's weird, isn't it? We all know that the negative numbers go before the positive ones. Yes, we do, but the machines don't, and that's where the lexicographical ordering function helps. Let's take a look at the Javadoc of ZOrderByteUtils:

public class ZOrderByteUtils {

  public static final int PRIMITIVE_BUFFER_SIZE = 8;

  private ZOrderByteUtils() {}

  static ByteBuffer allocatePrimitiveBuffer() {
    return ByteBuffer.allocate(PRIMITIVE_BUFFER_SIZE);
  }

  /**
   * Signed ints do not have their bytes in magnitude order because of the sign bit. To fix this,
   * flip the sign bit so that all negatives are ordered before positives. This essentially shifts
   * the 0 value so that we don't break our ordering when we cross the new 0 value.
   */
  public static ByteBuffer intToOrderedBytes(int val, ByteBuffer reuse) {
    ByteBuffer bytes = ByteBuffers.reuse(reuse, PRIMITIVE_BUFFER_SIZE);
    bytes.putLong(((long) val) ^ 0x8000000000000000L);
    return bytes;
  }

As you can see, it flips the sign bit to correctly reorder the values. If we apply this to our table, we'll get the expected sorting:

Order position | Signed number | Binary representation after the flip | Flipped value as unsigned integer
1              | -2            | 0111 1110                            | 126
2              | -1            | 0111 1111                            | 127
3              | 1             | 1000 0001                            | 129
4              | 2             | 1000 0010                            | 130

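If you want to check the flip by yourself, here is a minimal sketch; the class name is made up, and the comparator mimics a lexicographical, i.e. unsigned byte-by-byte, comparison:

import java.nio.ByteBuffer;
import java.util.Comparator;

public class SignFlipDemo {

  // Raw two's complement bytes of the value, widened to a long
  static byte[] rawBytes(int val) {
    return ByteBuffer.allocate(8).putLong(val).array();
  }

  // Same trick as ZOrderByteUtils.intToOrderedBytes: flip the sign bit
  static byte[] orderedBytes(int val) {
    return ByteBuffer.allocate(8).putLong(((long) val) ^ 0x8000000000000000L).array();
  }

  public static void main(String[] args) {
    Comparator<byte[]> lexicographical = (a, b) -> {
      for (int i = 0; i < a.length; i++) {
        int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
        if (cmp != 0) return cmp;
      }
      return 0;
    };

    // Raw representation: -2 incorrectly sorts after 1
    System.out.println(lexicographical.compare(rawBytes(-2), rawBytes(1)) > 0);         // true
    // Flipped representation: -2 correctly sorts before 1
    System.out.println(lexicographical.compare(orderedBytes(-2), orderedBytes(1)) < 0); // true
  }
}
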
Now, all types can be sorted lexicographically, and so can the Z-Order address used in the sort expression!

It's the last table file format blog post before a short break. But there are more things to come, especially about schema evolution. Stay tuned!