Table file formats - Compaction: Delta Lake

Versions: Delta Lake 2.2.0 https://github.com/bartosz25/acid-file-formats/tree/main/005_compaction/delta_lake

Small files are a well-known problem in data systems. Fortunately, modern table file formats have built-in features to address it. In this series we'll see how.


Rewriting

Delta Lake implements the file compaction process with 2 features. The first of them is explicit file rewriting:

sparkSession.read.format("delta").load(outputDir)
  .repartition(numberOfFiles)
  // dataChange = false marks the written files as rearranged data only
  .write.option("dataChange", "false")
  .format("delta").mode("overwrite").save(outputDir)

It looks like any other writing process but, as the comment in the snippet points out, the writer uses an option("dataChange", "false"). This option informs the downstream consumers that the writing operation only rearranges the data. Thanks to that notification, the consumers can ignore this new event in the transaction log.
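To see the flag in action, you can inspect the commit created by the rewrite directly in the transaction log. A minimal sketch, assuming the compaction produced the commit version 1 (the actual version depends on the table's history) and reusing the outputDir from the snippet above:

// Read the JSON commit file written by the compaction and extract the
// dataChange attribute of its "add" and "remove" actions; for a
// rearrange-only rewrite both should be false.
sparkSession.read
  .json(s"$outputDir/_delta_log/00000000000000000001.json")
  .select("add.path", "add.dataChange", "remove.path", "remove.dataChange")
  .show(truncate = false)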

Under the hood, this option is read by the WriteIntoDelta#write method and propagated to the transaction log: the "add" and "remove" actions of the commit carry dataChange = false, so the consumers, such as streaming queries, can skip them.

Besides this flag, the writing operation works normally, i.e. it identifies the files to overwrite, marks them as "removed" and writes new "added" files instead. It's worth noting that the workflow supports overwriting particular partitions:

val partitionToRewrite = "partition_number = 0"
val numberOfFiles = 1
sparkSession.read
  .format("delta")
  .load(outputDir)
  .where(partitionToRewrite)
  .repartition(numberOfFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  // replaceWhere limits the overwrite to the rows matching the predicate
  .option("replaceWhere", partitionToRewrite)
  .save(outputDir)
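After the rewrite, you can check what was committed with the DeltaTable history API. A short sketch; the import comes from the io.delta:delta-core dependency and outputDir is the same table location as above:

import io.delta.tables.DeltaTable

// The most recent history entry should describe the overwrite, including
// the replaceWhere predicate in the operation parameters.
DeltaTable.forPath(sparkSession, outputDir)
  .history(1)
  .select("version", "operation", "operationParameters")
  .show(truncate = false)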

OPTIMIZE

If you don't want to use the API for compaction, you can use SQL and the OPTIMIZE command:

OPTIMIZE my_table
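The command also accepts a WHERE clause on partition columns to restrict the scope, e.g. OPTIMIZE my_table WHERE partition_number = 0. And if you prefer staying in the programmatic API, here is a short sketch with the DeltaTable builder available since Delta Lake 2.0, reusing the outputDir location from the previous snippets:

import io.delta.tables.DeltaTable

// Compact the whole table...
DeltaTable.forPath(sparkSession, outputDir)
  .optimize()
  .executeCompaction()

// ...or only the files of a single partition.
DeltaTable.forPath(sparkSession, outputDir)
  .optimize()
  .where("partition_number = 0")
  .executeCompaction()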

The underlying implementation is similar to the rewriting one. It has a few subtleties, though (a configuration sketch follows the list):

- it only selects the files smaller than a configurable threshold defined in spark.databricks.delta.optimize.minFileSize,
- it groups the selected files into bins whose size is capped by spark.databricks.delta.optimize.maxFileSize, and each bin becomes one compaction task,
- it can run several compaction jobs in parallel, controlled by spark.databricks.delta.optimize.maxThreads,
- like the explicit rewriting, it commits the new files with the dataChange flag set to false.
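A minimal sketch of tuning these thresholds before triggering the compaction; the 128 MB and 512 MB values are purely illustrative:

// Consider only the files smaller than 128 MB as compaction candidates...
sparkSession.conf.set("spark.databricks.delta.optimize.minFileSize", 128L * 1024 * 1024)
// ...and produce output files of about 512 MB.
sparkSession.conf.set("spark.databricks.delta.optimize.maxFileSize", 512L * 1024 * 1024)
sparkSession.sql("OPTIMIZE my_table")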

The list above doesn't include points for Z-Ordered tables compaction. Due to the logic of this storage layout, the OPTIMIZE command behaves differently in bins creation and stats generation, but it's a topic for another blog post.

Rewriting and OPTIMIZE are 2 ways to compact smaller files into bigger ones in Delta Lake and improve the read I/O. Even though they have subtle differences, they both mark the rewritten files as rearranged-only with the dataChange flag set to false. Good news: we don't stop here with the compaction, and next week we're going to see it in Apache Iceberg!
