ACID file formats - API

Versions: Apache Hudi 0.10.0, Apache Iceberg 0.13.1, Delta Lake 1.1.0 https://github.com/bartosz25/acid-file-formats/tree/main/000_api

It's time to start a new series on the blog! I hope to catch up on the ACID file formats that are gaining more and more importance. It's also a good occasion to test a new learning method: instead of writing one blog post per feature and format, I'll try to compare Delta Lake, Apache Iceberg, and Apache Hudi concepts in the same article. Besides this personal challenge, I hope you'll enjoy the series and also learn something interesting!


I'll start the first blog post of the series with a high-level view of the API. Here, you'll see the code you can use to perform the basic operations, like writing the data, creating a table, or querying it.

Surprises

Before discussing the writing options, I'd like to share the problems I faced during the setup. Naturally, I had far fewer issues with Delta Lake since I had already written some code with this format. It was not so rosy for Apache Hudi and Apache Iceberg. But let me explain Hudi first:

After all these issues, I was ready to go with Apache Hudi. However, I hadn't expected to configure so many things. I hope to understand the reasons better in the following weeks. And what about Apache Iceberg?

Delta Lake

In the analysis I'll check the following aspects: writing a DataFrame, creating a table from a SQL expression, and querying the data with both the programmatic API and SQL.

Delta Lake can write a DataFrame and create a table from a SQL expression. Moreover, it supports the new V2 API.

    val inputData = Seq(
      Order(1, 33.99d, "Order#1"), Order(2, 14.59d, "Order#2"), Order(3, 122d, "Order#3")
    ).toDF
    inputData.write.format("delta").save(outputDir)
    sparkSession.sql(s"CREATE TABLE default.orders USING DELTA LOCATION '${outputDir}'")
    inputData.writeTo("orders_from_write_to").using("delta").createOrReplace()
    sparkSession.sql("DROP TABLE IF EXISTS orders_from_sql")
    sparkSession.sql(
      s"""
        |CREATE OR REPLACE TABLE orders_from_sql (
        | id LONG,
        | amount DOUBLE,
        | title STRING
        |) USING delta LOCATION "${outputDir}/orders_from_sql_${System.currentTimeMillis()}"
        |""".stripMargin)
    sparkSession.sql(
      """
        |INSERT INTO orders_from_sql (id, amount, title) VALUES
        |(1, 33.99, "Order#1"), (2, 14.59, "Order#2"), (3, 122, "Order#3")
        |""".stripMargin)

What about reading? Again, all cases checked! You can query Delta Lake tables with the programmatic API and SQL:

    sparkSession.read.format("delta").load(outputDir).where("amount > 40").show(false)
    sparkSession.sql("SELECT * FROM default.orders WHERE amount > 40").show(false)
    sparkSession.sql("SELECT * FROM orders_from_write_to WHERE amount > 40").show(false)
    sparkSession.sql("SELECT * FROM orders_from_sql WHERE amount > 40").show(false)

Can we write the same snippet for Apache Hudi? Let's see!

Apache Hudi

For Apache Hudi, some of the previously presented operations are not supported. When you try to run inputData.writeTo("orders_from_write_to").using("hudi").createOrReplace(), you'll get an unsupported writing mode error ("REPLACE TABLE AS SELECT is only supported with v2 tables"). A similar error pops up when you try to run the SQL commands to create the table. I didn't manage to solve them just by reading the documentation, so I fell back on the good old createOrReplaceTempView. In the end, the only working method to create a Hudi table was the programmatic API:

    val inputData = Seq(
      Order(1, 33.99d, "Order#1"), Order(2, 14.59d, "Order#2"), Order(3, 122d, "Order#3")
    ).toDF

    inputData.write.format("hudi").options(getQuickstartWriteConfigs)
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.operation", INSERT_OPERATION_OPT_VAL)
      .option("hoodie.datasource.write.recordkey.field", "id")
      .mode(SaveMode.Overwrite)
      .save(outputDir)
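
The createOrReplaceTempView workaround mentioned above can be sketched like this; it also shows where the orders_table name used in the SQL query below comes from (a sketch, assuming the write above already succeeded):

    // Read the Hudi table back and expose it to SQL through a temporary view,
    // since CREATE TABLE ... USING hudi didn't work in this setup
    sparkSession.read.format("hudi").load(outputDir)
      .createOrReplaceTempView("orders_table")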

I had fewer troubles with the reading part because Hudi supports both the DataFrame API and SQL:

    val allReadOrders = sparkSession.read.format("hudi").load(outputDir)
    sparkSession.read.format("hudi").load(outputDir).where("amount > 40").show(false)
    sparkSession.sql("SELECT * FROM orders_table WHERE amount > 40").show(false)

One thing surprised me, though. The SELECT statements return both the data and Hudi's metadata columns. Maybe I misconfigured something or missed the relevant information in the documentation, but overall, writing data was a bit more difficult than for Delta Lake. I'll be happy to hear if you have any suggestions!
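
If those extra columns bother you at query time, one workaround (a sketch, not an official Hudi option) relies on the fact that Hudi prefixes its metadata columns with _hoodie_, so you can drop them after reading:

    val allReadOrders = sparkSession.read.format("hudi").load(outputDir)
    // e.g. _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path...
    val metadataColumns = allReadOrders.columns.filter(_.startsWith("_hoodie_"))
    allReadOrders.drop(metadataColumns: _*).show(false)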

Apache Iceberg

For Apache Iceberg, the experience was smoother than for Hudi, except for one thing: the V1 DataFrame API. I didn't succeed in configuring the SparkSession to support both the V1 and V2 APIs. Later on, I found that the documentation doesn't recommend the V1 API, so I didn't consider it a blocker and moved on with the tests:

The v1 DataFrame write API is still supported, but is not recommended.
When writing with the v1 DataFrame API in Spark 3, use saveAsTable or insertInto to load tables with a catalog. Using format("iceberg") loads an isolated table reference that will not automatically refresh tables used by queries.
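
Following that recommendation, a v1 write going through the catalog could look like this (a sketch; the orders_from_v1 table name is illustrative and the local catalog is the one configured for these tests):

    // v1 DataFrame API with a catalog-backed table, as the docs recommend:
    // saveAsTable resolves the table through the catalog instead of creating
    // an isolated format("iceberg") path reference
    inputData.write.format("iceberg").mode(SaveMode.Append)
      .saveAsTable("local.db.orders_from_v1")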

The writing part is quite similar to Delta Lake's:

    inputData.writeTo("local.db.orders_from_write_to").using("iceberg").createOrReplace()
    sparkSession.sql("DROP TABLE IF EXISTS orders_from_sql")
    sparkSession.sql(
      s"""
         |CREATE OR REPLACE TABLE local.db.orders_from_sql (
         | id LONG,
         | amount DOUBLE,
         | title STRING
         |) USING iceberg
         |""".stripMargin)
    sparkSession.sql(
      """
        |INSERT INTO local.db.orders_from_sql (id, amount, title) VALUES
        |(1, 33.99, "Order#1"), (2, 14.59, "Order#2"), (3, 122, "Order#3")
        |""".stripMargin)

During the reading tests I discovered the table(...) method of SparkSession, which reads a table resolved through the catalog:

    sparkSession.table("local.db.orders_from_write_to").where("amount > 40").show(false)
    sparkSession.sql("SELECT * FROM local.db.orders_from_write_to WHERE amount > 40").show(false)
    sparkSession.sql("SELECT * FROM local.db.orders_from_sql WHERE amount > 40").show(false)

I also tested the table() method on Delta Lake and Apache Hudi. Both formats support it correctly.
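
For instance, with the names registered in the earlier snippets (a sketch, assuming those tables and the Hudi temporary view exist in the session):

    // table() resolves the name through the catalog, whatever the format
    sparkSession.table("default.orders").where("amount > 40").show(false) // Delta Lake
    sparkSession.table("orders_table").where("amount > 40").show(false)   // Hudi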

If you're still hungry after this introduction, no worries, it's only an introduction! In the coming weeks I'll publish more in-depth blog posts about these 3 ACID file formats. And if you already have some questions about their features, feel free to leave a comment. I'll try to include them in my planning!