There is always a gap between a disruption in the data engineering industry and its integration in the cloud. Table file formats were no different: they have only recently started gaining traction on AWS, Azure, and GCP.
In this blog post we'll see how the major cloud providers integrate table file formats. Let's start with AWS.
AWS
AWS is probably the most active provider in the table file formats field. Athena is the first of 4 services with table file format capabilities, which include:
- Pretty nice integration with Apache Iceberg, including table management (DDL), schema evolution, time travel, and write queries (DML).
- Limited Delta Lake support. Athena is mostly limited to reading Delta Lake tables, so it supports neither write operations nor time travel.
- Limited Apache Hudi support. As with Delta Lake, the support is limited to read-only queries.
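To make the Iceberg capabilities listed above more concrete, here is a minimal sketch of the kind of statements Athena accepts. The table name, columns, and S3 location are hypothetical placeholders, and the time travel keyword is an assumption tied to the engine version (older Athena engines used a `FOR SYSTEM_TIME AS OF` variant), so check the syntax against your workgroup before relying on it.

```python
# Hypothetical placeholders: replace with your own table name and S3 location.
TABLE = "events_iceberg"
LOCATION = "s3://example-bucket/events_iceberg/"

# DDL: Athena creates an Iceberg table when table_type is set to ICEBERG.
create_table = f"""
CREATE TABLE {TABLE} (
  id bigint,
  payload string
)
LOCATION '{LOCATION}'
TBLPROPERTIES ('table_type' = 'ICEBERG')
""".strip()

# DML: a plain INSERT writes a new snapshot to the table.
insert_row = f"INSERT INTO {TABLE} VALUES (1, 'first event')"

# Schema evolution: add a column without rewriting the existing data files.
add_column = f"ALTER TABLE {TABLE} ADD COLUMNS (source string)"

# Time travel: read the table as it was at a given point in time.
# Assumption: this is the keyword used by recent Athena engine versions.
time_travel = (
    f"SELECT * FROM {TABLE} "
    "FOR TIMESTAMP AS OF TIMESTAMP '2022-11-01 12:00:00 UTC'"
)

for statement in (create_table, insert_row, add_column, time_travel):
    print(statement, end="\n\n")
```

The statements are only built as strings here. You could paste them into the Athena console or submit them with the SDK of your choice.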
At first, I also thought that the table file formats were part of the governed tables on Lake Formation. Their features, including ACID transactions, data compaction, and time-travel queries, fully match the features provided by open table file formats. However, creating a governed table on Lake Formation looks more like creating a table with a closed table format.
Besides Athena, Glue also supports open table file formats. The service exposes the feature under the name of data lake frameworks. The integration supports reading and writing tables without installing additional connectors or performing extra setup steps. Simply define the table file format in the --datalake-formats parameter of your job and configure the Spark session with the right catalog and extensions. Some Glue features are not supported, though, including file grouping and job bookmarks.
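As an illustration of the Glue setup, the sketch below assembles the job parameters for Delta Lake and Apache Iceberg. The Spark settings are the ones I understand the Delta Lake and Iceberg projects to document. The `glue_catalog` catalog name, the warehouse path, and the helper function are assumptions made for the example, as is the convention of chaining several settings inside a single `--conf` value.

```python
# Spark session settings per table format. The warehouse location and the
# "glue_catalog" catalog name are hypothetical placeholders.
SPARK_CONF = {
    "delta": {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    },
    "iceberg": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://example-bucket/warehouse/",
    },
}


def glue_job_arguments(table_format):
    """Build the job parameters for one table format (hypothetical helper)."""
    pairs = [f"{key}={value}" for key, value in SPARK_CONF[table_format].items()]
    return {
        # Tells Glue which data lake framework libraries to load.
        "--datalake-formats": table_format,
        # Assumption: several Spark settings are chained inside one --conf value.
        "--conf": " --conf ".join(pairs),
    }


for fmt in ("delta", "iceberg"):
    print(fmt, glue_job_arguments(fmt))
```

The resulting dictionary has the shape of the default arguments you would attach to a Glue job definition, whether through the console or infrastructure as code.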
Redshift also integrates with open table file formats. Redshift Spectrum supports Apache Hudi and Delta Lake as external table formats.
Finally, EMR natively supports Delta Lake and Apache Iceberg. If you need any of these formats, you can enable them by setting the delta.enabled or iceberg.enabled flag in the cluster configuration file. When the flag is present, EMR takes care of installing all required dependencies for you.
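A minimal sketch of what such a cluster configuration could look like, generated from Python. The flags come from the text above. The `delta-defaults` and `iceberg-defaults` classification names are my assumption of how EMR groups these properties, so verify them against the release you deploy.

```python
import json


def emr_table_format_configurations(delta=False, iceberg=False):
    """Return the EMR configuration entries enabling the requested formats."""
    configurations = []
    if delta:
        configurations.append({
            # Assumed classification name for the Delta Lake flag.
            "Classification": "delta-defaults",
            "Properties": {"delta.enabled": "true"},
        })
    if iceberg:
        configurations.append({
            # Assumed classification name for the Apache Iceberg flag.
            "Classification": "iceberg-defaults",
            "Properties": {"iceberg.enabled": "true"},
        })
    return configurations


# Render the JSON document that would be passed at cluster creation time.
print(json.dumps(emr_table_format_configurations(delta=True, iceberg=True), indent=2))
```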
Azure
The support for open table file formats on Azure could be reduced to Databricks, the company behind Delta Lake. However, Databricks can also be deployed on AWS and GCP, which is why I omitted this service from the blog post. But there is good news! Azure has other services interacting with Delta Lake (unfortunately, I haven't found any mention of Apache Iceberg or Apache Hudi).
The first of them is Synapse. The data warehouse offering can query Delta Lake files from the serverless SQL pool using T-SQL syntax.
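For illustration, the query from the serverless SQL pool typically goes through `OPENROWSET` with the `DELTA` format. The sketch below only builds the T-SQL text. The storage URL is a hypothetical placeholder and the helper function is not part of any SDK.

```python
def delta_openrowset_query(delta_folder_url, limit=10):
    """Build a T-SQL query reading a Delta Lake folder (hypothetical helper).

    delta_folder_url must point at the table's root folder, i.e. the one
    containing the _delta_log directory.
    """
    return (
        f"SELECT TOP {int(limit)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{delta_folder_url}',\n"
        "    FORMAT = 'DELTA'\n"
        ") AS delta_rows"
    )


# Placeholder storage account, container, and folder.
print(delta_openrowset_query(
    "https://examplestorage.dfs.core.windows.net/lake/events_delta/"
))
```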
When it comes to writing Delta Lake tables, you can use Stream Analytics, where Delta Lake is one of the available sinks.
Finally, Azure Databricks Delta Lake is one of the connectors supported in Data Factory. It means you can use it directly in pipelines, including in Mapping Data Flow, Copy, or Lookup activities.
GCP
GCP, like Azure, has fewer services integrating with open table file formats than AWS. The support is mostly limited to BigQuery. More precisely, the ability to read Apache Iceberg tables from this data warehouse layer belongs to a new component called BigLake.
Additionally, Dataproc Metastore can be used as an Apache Iceberg catalog for Apache Spark, Hive, or Presto.
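As a sketch of the Apache Spark side, an Iceberg catalog backed by a Hive-compatible metastore is usually declared through a handful of Spark properties. The catalog name, Thrift endpoint, and warehouse bucket below are hypothetical placeholders, and I'm assuming the Dataproc Metastore service is reached through its Thrift endpoint URI.

```python
def iceberg_hive_catalog_conf(catalog_name, metastore_uri, warehouse):
    """Spark properties for an Iceberg catalog backed by a Hive metastore."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",          # use the Hive metastore as the catalog
        f"{prefix}.uri": metastore_uri,    # Thrift endpoint of the metastore
        f"{prefix}.warehouse": warehouse,  # default location for new tables
    }


def to_spark_submit_flags(conf):
    """Render the properties as --conf flags for spark-submit."""
    return [flag for key, value in conf.items() for flag in ("--conf", f"{key}={value}")]


# Placeholder values for the example.
conf = iceberg_hive_catalog_conf(
    "dpms", "thrift://10.0.0.5:9083", "gs://example-bucket/warehouse"
)
print(" ".join(to_spark_submit_flags(conf)))
```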
I don't know how you feel after reading the blog post, but my impression after writing it was "the cloud providers love Apache Iceberg". It's pretty well integrated with AWS services and is the only format available for BigQuery. Only Azure is different: probably thanks to the native Databricks integration, it seems to work better with Delta Lake. As for Apache Hudi, it has some support on AWS but seems to attract less interest than Apache Iceberg and Delta Lake.