Table file formats - Compaction: Delta Lake

Versions: Delta Lake 2.2.0. Code: https://github.com/bartosz25/acid-file-formats/tree/main/005_compaction/delta_lake

Small files are a well-known problem in data systems. Fortunately, modern table file formats have built-in features to address it. In this new series we'll see how.

Rewriting

Delta Lake supports file compaction with 2 features. The first of them is an explicit rewrite of the files:

sparkSession.read.format("delta").load(outputDir)
  .repartition(numberOfFiles)
  .write.option("dataChange", "false")
  .format("delta").mode("overwrite").save(outputDir)

It looks like any other write, except that the writer sets option("dataChange", "false"). This option tells downstream consumers that the write only rearranges existing data. Thanks to that flag, the consumers can ignore the new commit in the transaction log.
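To check the flag after a rewrite, one option is to read the JSON commit files of the table directly. The snippet below is only a sketch: it assumes the commits are still present as JSON files under _delta_log, and the object name, the addedFiles helper and the /tmp path are hypothetical names chosen for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, input_file_name}

object DataChangeInspector {

  // Returns one row per "add" action found in the JSON commits of the table.
  // Helper name and output column names are hypothetical, not part of Delta Lake.
  def addedFiles(sparkSession: SparkSession, tablePath: String): DataFrame = {
    sparkSession.read.json(s"$tablePath/_delta_log/*.json")
      // Each line of a commit file holds a single action; keep only the "add" ones
      .where(col("add").isNotNull)
      .select(
        input_file_name().as("commit_file"),
        col("add.path").as("data_file"),
        col("add.dataChange").as("data_change"))
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local[*]").appName("data-change-inspector").getOrCreate()
    // Assumption: the table from the compaction example lives here
    val outputDir = "/tmp/delta_compaction/table"
    addedFiles(sparkSession, outputDir).orderBy("commit_file").show(truncate = false)
  }
}
```

For the commit created by the compaction, the data_change column should be false for every added file, whereas regular writes keep it at true.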

Under the hood, this option is used by the WriteIntoDelta#write method and implies the following:

Besides these 2 aspects, the writing operation works normally, i.e. it identifies the files to overwrite, marks them as "removed" and writes new "added" files instead. It's worth noting that the workflow supports overwriting particular partitions:

val partitionToRewrite = "partition_number = 0"
val numberOfFiles = 1
sparkSession.read
  .format("delta")
  .load(outputDir)
  .where(partitionToRewrite)
  .repartition(numberOfFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", partitionToRewrite)
  .save(outputDir)

OPTIMIZE

If you don't want to use the API for compaction, you can use SQL and the OPTIMIZE command:

OPTIMIZE my_table

The underlying implementation is similar to the rewriting one. It has a few subtleties, though:

The list above doesn't cover the compaction of Z-Ordered tables. Due to the logic of this storage layout, the OPTIMIZE command creates the bins and generates the stats differently, but it's a topic for another blog post.
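As with the rewriting approach, the bin-packing compaction can be limited to a partition, and it also has a programmatic counterpart in the DeltaTable Scala API. Here is a minimal sketch, assuming Delta Lake 2.x on the classpath and a table partitioned by partition_number as in the earlier example; the object name, the compact helper and the /tmp path are hypothetical.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

object OptimizeFromScala {

  // Runs the bin-packing compaction, optionally restricted to a partition filter.
  // The returned DataFrame carries the metrics reported by the OPTIMIZE command.
  def compact(sparkSession: SparkSession, tablePath: String,
              partitionFilter: Option[String] = None): DataFrame = {
    val builder = DeltaTable.forPath(sparkSession, tablePath).optimize()
    partitionFilter.map(filter => builder.where(filter)).getOrElse(builder)
      .executeCompaction()
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local[*]").appName("optimize-from-scala")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    // Assumption: the table from the compaction example lives here
    val outputDir = "/tmp/delta_compaction/table"

    // Compact a single partition through the Scala API...
    compact(sparkSession, outputDir, Some("partition_number = 0")).show(truncate = false)
    // ...or express the same operation in SQL against the table path
    sparkSession.sql(s"OPTIMIZE delta.`$outputDir` WHERE partition_number = 0")
      .show(truncate = false)
  }
}
```

The filter passed to where, or to the SQL WHERE clause, can only reference partition columns. The second call in main is expected to find little or nothing left to compact, since the first one already processed the partition.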

Rewriting and OPTIMIZE are 2 ways to compact smaller files into bigger ones in Delta Lake and to improve read I/O. Even though they have subtle differences, they both mark the rewritten files as rearranged-only by setting the dataChange flag to false. Good news: we don't stop here with compaction, and next week we're going to see it with Apache Iceberg!
