Table file formats in the cloud

There is always a gap between a disruption in the data engineering industry and its integration in the cloud. It was no different for table file formats, which have recently started gaining interest on AWS, Azure, and GCP.

In this blog post we'll see how the major cloud providers integrate table file formats. Let's start with AWS.

AWS

AWS is probably the most active in the table file formats field. Athena is the first of four services with table file format capabilities: it natively supports Apache Iceberg tables, including DDL statements, DML operations, and time travel queries.
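To make it more concrete, here is a minimal sketch of creating an Apache Iceberg table from Athena with boto3; the database, table, and S3 locations are hypothetical placeholders.

```python
# A hedged sketch, assuming boto3 and an existing Athena workgroup; the
# database, table, and S3 locations below are hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Creating an Iceberg table only requires the table_type property; Athena
# manages the metadata and data files under the given LOCATION.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE demo_db.events (
            id bigint,
            event_time timestamp
        )
        LOCATION 's3://my-bucket/iceberg/events/'
        TBLPROPERTIES ('table_type' = 'ICEBERG')
    """,
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```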

At first, I also thought that table file formats were part of the governed tables on Lake Formation. Their features, including ACID transactions, data compaction, and time travel queries, fully match the features provided by open table file formats. However, creating a governed table on Lake Formation looks more like creating a table with a closed table format.

Besides Athena, Glue also supports open table file formats. The service groups the feature under the Data Lake frameworks umbrella. The integration supports reading and writing tables without installing additional connectors or performing extra setup steps. Simply define the table file format in the --datalake-formats parameter of your job and configure the Spark session with the right catalog and extensions, as in the sketch below. Some Glue features are not supported, though, including file grouping and job bookmarks.
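Here is a minimal sketch of the Spark session setup for a Glue job started with --datalake-formats set to iceberg; the catalog name (glue_catalog), bucket, and table are hypothetical, while the configuration keys come from the Iceberg-on-Glue documentation.

```python
# A minimal sketch of the Spark session configuration for an AWS Glue job
# launched with --datalake-formats=iceberg; the catalog name, warehouse
# path, and table below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-bucket/warehouse/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate())

# Tables registered in the Glue Data Catalog become queryable through
# the configured catalog name.
spark.sql("SELECT * FROM glue_catalog.demo_db.events LIMIT 10").show()
```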

Redshift also integrates with open table file formats. Redshift Spectrum supports Apache Hudi and Delta Lake as external table formats.
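As an illustration, here is a hedged sketch of exposing an Apache Hudi copy-on-write dataset to Redshift Spectrum as an external table; the cluster endpoint, credentials, schema, columns, and S3 path are hypothetical, and an external schema mapped to the data catalog is assumed to exist.

```python
# A hedged sketch, assuming the redshift_connector driver and an external
# schema (spectrum) already created in the cluster; endpoint, credentials,
# columns, and the S3 path are hypothetical placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="<password>",
)
cursor = conn.cursor()
# The Hudi copy-on-write dataset is exposed through its dedicated input
# format, as described in the Redshift Spectrum documentation.
cursor.execute("""
    CREATE EXTERNAL TABLE spectrum.hudi_events (id int, event_time timestamp)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 's3://my-bucket/hudi/events/'
""")
conn.commit()
```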

Finally, there is also native Delta Lake and Apache Iceberg support in EMR. If you need one of these formats, you can enable it by setting the delta.enabled or iceberg.enabled flag in the cluster configuration. When the flag is present, EMR takes care of installing all the required dependencies for you.
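Below is a minimal sketch, using boto3, of starting an EMR cluster with the native Iceberg integration turned on; the release label, instance types, and IAM roles are hypothetical placeholders.

```python
# A minimal sketch of launching an EMR cluster with native Iceberg support;
# release label, instance settings, and roles are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="iceberg-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    # The iceberg-defaults classification carries the iceberg.enabled flag;
    # delta-defaults with delta.enabled plays the same role for Delta Lake.
    Configurations=[{
        "Classification": "iceberg-defaults",
        "Properties": {"iceberg.enabled": "true"},
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```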

Azure

The support for open table file formats on Azure could be reduced to Databricks, the company behind Delta Lake. However, Databricks can also be deployed on AWS and GCP, which is why I omitted it from this blog post. But there is good news! Azure has other services interacting with Delta Lake (unfortunately, I haven't found any mention of Apache Iceberg or Apache Hudi).

The first of them is Synapse. The data warehouse offering can query Delta Lake files from a serverless SQL pool using the T-SQL syntax.
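Here is a hedged sketch of such a query issued from Python through pyodbc; the workspace endpoint, authentication mode, and storage path are hypothetical placeholders.

```python
# A hedged sketch, assuming pyodbc with the ODBC Driver 18 for SQL Server
# and Azure AD interactive authentication; the serverless SQL endpoint and
# the storage path are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=my-workspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)
# OPENROWSET with FORMAT='DELTA' reads the Delta Lake folder directly,
# including its transaction log.
rows = conn.execute("""
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://mystorage.dfs.core.windows.net/lake/delta/events/',
        FORMAT = 'DELTA'
    ) AS events
""").fetchall()
print(rows)
```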

When it comes to writing Delta Lake tables, you can use Stream Analytics, where Delta Lake is one of the available output sinks.

Finally, Azure Databricks Delta Lake is one of the connectors supported in Data Factory. It means you can use it directly in the pipelines, including in the Mapping Data Flow, Copy, or Lookup activities.

GCP

GCP also has fewer services integrating with open table file formats than AWS. The support is mostly limited to BigQuery. More exactly, the ability to read Apache Iceberg tables from the data warehouse layer belongs to a new component called BigLake.
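As a quick illustration, here is a hedged sketch of defining a BigLake table over Apache Iceberg metadata with the google-cloud-bigquery client; the project, the BigLake connection, the dataset, and the GCS metadata URI are hypothetical.

```python
# A hedged sketch, assuming the google-cloud-bigquery client and an existing
# BigLake connection; project, connection, dataset, and URI are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
# The BigLake table points at an Iceberg metadata file; BigQuery then reads
# the table data through the connection's service account.
client.query("""
    CREATE EXTERNAL TABLE analytics.iceberg_events
    WITH CONNECTION `my-project.us.my-biglake-connection`
    OPTIONS (
        format = 'ICEBERG',
        uris = ['gs://my-bucket/iceberg/events/metadata/v1.metadata.json']
    )
""").result()
```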

Additionally, Dataproc Metastore can be used as an Apache Iceberg catalog for Apache Spark, Hive, or Presto.
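For Apache Spark, that wiring boils down to an Iceberg catalog of type hive pointing at the metastore's Thrift endpoint, as in the sketch below; the catalog name, Thrift URI, and table are hypothetical, and the Iceberg Spark runtime jar is assumed to be on the classpath (e.g. via --packages).

```python
# A minimal sketch of using a Dataproc Metastore (Hive) endpoint as an
# Iceberg catalog for Spark; catalog name, URI, and table are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.dpms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dpms.type", "hive")
    .config("spark.sql.catalog.dpms.uri", "thrift://10.0.0.5:9083")
    .getOrCreate())

# The Iceberg table metadata lands in the Dataproc Metastore, making it
# visible to the other engines connected to the same metastore.
spark.sql("CREATE TABLE IF NOT EXISTS dpms.demo_db.events (id bigint) USING iceberg")
```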

I don't know how you feel after reading this blog post, but my feeling after writing it was more like "the cloud providers love Apache Iceberg". It's pretty well integrated with AWS services and it's the single format available for BigQuery. Only Azure is different; probably thanks to the native Databricks integration, it seems to work better with Delta Lake. As for Apache Hudi, it has some support on AWS but seems to get less attention than Apache Iceberg and Delta Lake.