Table file formats - Compaction: Delta Lake

Small files are a well-known problem in data systems. Fortunately, modern table file formats have built-in features to address it. In this series, we'll see how.

Bartosz Konieczny

Rewriting

Delta Lake implements file compaction with 2 features. The first one is explicit file rewriting:

sparkSession.read.format("delta").load(outputDir)
  .repartition(numberOfFiles)
  .write.option("dataChange", "false")
  .format("delta").mode("overwrite").save(outputDir)

It looks like any other write, but the writer sets option("dataChange", "false"). This option tells the downstream consumers that the operation only rearranges the data. Thanks to that notification, the consumers can ignore the new commit in the transaction log.
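To check the flag on a real table, one possibility is to read the JSON commit files of the _delta_log directory directly. The snippet below is only a minimal sketch of that idea, not how Delta Lake readers work internally. The DataChangeInspector object and its addedFiles method are hypothetical names. It assumes at least one commit contains an "add" action and it ignores Parquet checkpoints.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Delta Lake.
object DataChangeInspector {

  // Lists the files added by every JSON commit of the table, together with
  // the dataChange flag stored in the "add" action of the transaction log.
  def addedFiles(sparkSession: SparkSession, tablePath: String): DataFrame = {
    sparkSession.read
      .json(s"$tablePath/_delta_log/*.json")
      .where("add IS NOT NULL")
      .selectExpr(
        "input_file_name() AS commit_file",
        "add.path AS added_file",
        "add.dataChange AS data_change"
      )
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("dataChange inspection")
      .master("local[*]")
      .getOrCreate()
    // args(0) is the table location, e.g. the outputDir used in the rewriting job
    addedFiles(sparkSession, args(0)).orderBy("commit_file").show(truncate = false)
    sparkSession.stop()
  }
}
```

For the commit created by the compaction job, data_change should be false for every added file, whereas regular writes should have it set to true.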

Under the hood, this option is used by the WriteIntoDelta#write method and implies the following:

Besides these 2 aspects, the writing operation works normally, i.e. it identifies the files to overwrite, marks them as "removed" and writes new "added" files instead. It's worth noting that the workflow supports overwriting particular partitions:

val partitionToRewrite = "partition_number = 0"
val numberOfFiles = 1
sparkSession.read
  .format("delta")
  .load(outputDir)
  .where(partitionToRewrite)
  .repartition(numberOfFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", partitionToRewrite)
  .save(outputDir)
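A simple way to verify the effect of the rewrite is to compare the number of files in the table before and after the job. The sketch below relies on the DESCRIBE DETAIL command and its numFiles column. The FileCountCheck object is a hypothetical helper, and the session configuration assumes the open source Delta Lake library is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Delta Lake.
object FileCountCheck {

  // Returns the number of data files in the current snapshot of the table.
  // Files marked as "removed" by the rewrite are not counted anymore.
  def numberOfFiles(sparkSession: SparkSession, tablePath: String): Long = {
    sparkSession.sql(s"DESCRIBE DETAIL delta.`$tablePath`")
      .select("numFiles")
      .head()
      .getLong(0)
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("file count check")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    // args(0) is the table location, e.g. the outputDir from the snippet above
    println(s"Files in the table: ${numberOfFiles(sparkSession, args(0))}")
    sparkSession.stop()
  }
}
```

Note that the count covers the whole table and not only the rewritten partition, so the expected difference depends on how many files the partition had before the compaction.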

OPTIMIZE

If you don't want to use the API for compaction, you can use SQL and the OPTIMIZE command:

OPTIMIZE my_table
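As with the rewriting, the command can target selected partitions only. The sketch below shows 2 equivalent ways to do it, under the assumption that you run Delta Lake 2.0 or later: the SQL command with a WHERE clause on partition columns, and the optimize() builder of the DeltaTable class. The OptimizeExamples object and its method names are hypothetical.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper; both methods return the metrics DataFrame
// produced by the OPTIMIZE command.
object OptimizeExamples {

  // SQL flavor; the WHERE clause must reference partition columns only.
  def optimizeWithSql(sparkSession: SparkSession, tablePath: String,
                      partitionFilter: String): DataFrame = {
    sparkSession.sql(s"OPTIMIZE delta.`$tablePath` WHERE $partitionFilter")
  }

  // Programmatic flavor based on the DeltaTable API.
  def optimizeWithApi(sparkSession: SparkSession, tablePath: String,
                      partitionFilter: String): DataFrame = {
    DeltaTable.forPath(sparkSession, tablePath)
      .optimize()
      .where(partitionFilter)
      .executeCompaction()
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("optimize examples")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    // args(0) is the table location; the filter reuses the partition from the rewriting example
    optimizeWithApi(sparkSession, args(0), "partition_number = 0").show(truncate = false)
    sparkSession.stop()
  }
}
```

Both calls run the compaction eagerly, so there is no need to trigger an action on the returned DataFrame.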

The underlying implementation is similar to the rewriting one. It has a few subtleties, though:

The list above doesn't cover the compaction of Z-Ordered tables. Due to the logic of this storage layer, the OPTIMIZE command creates the bins and generates the stats differently for them, but that's a topic for another blog post.

Rewriting and OPTIMIZE are 2 ways to compact smaller files into bigger ones in Delta Lake and improve read I/O. Even though they have subtle differences, they both mark the rewritten files as rearranged-only with the dataChange flag set to false. Good news: the compaction topic doesn't stop here, and next week we're going to see it with Apache Iceberg!
