One of the big announcements of the previous Data+AI Summit was Delta Sharing, a protocol to exchange live data with internal and external users. The question I asked myself at that moment was: "Do the other cloud providers have something similar?". Let's see.
From my research I found that there are 2 types of data sharing modes:
- live sharing, where the consumers access the live data whenever they need it
- synchronized sharing, where the producer pushes the data to the consumers at a regular interval
Live sharing
Let's first focus on the former category, which includes Databricks Delta Sharing, Redshift Data Sharing, and Azure Data Share. The first on the list, Delta Sharing, is there to share Delta Lake tables. Unsurprisingly, since Databricks has been cloud-native from its very first days, the feature leverages cloud resources. In a nutshell, the sharing protocol works in these steps:
- The data provider defines the shared tables with SQL commands:
- CREATE SHARE ... to create the shared container,
- ALTER SHARE ... ADD TABLE ... to add a new table to the container,
- CREATE RECIPIENT ... to create a new user to share the dataset with. The command returns an activation link URL that the user will use to connect to the shared tables.
- GRANT SELECT ON SHARE ... TO RECIPIENT ... to grant the read permission on the created share.
- The recipient connects to the dataset with the open source Delta Sharing library by referencing the credentials downloaded from the link generated in the CREATE RECIPIENT step.
- The special Delta Sharing client can then access the tables as if they were stored in its own cloud space.
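The end-to-end flow can be sketched with a toy in-memory model. Everything below is an illustration of the protocol's logic, not the real Databricks API: the method names only mimic the SQL commands, and the activation token stands in for the activation link URL.

```python
# Toy in-memory model of the Delta Sharing producer/consumer flow.
# Illustration only: the names mimic the SQL commands (CREATE SHARE,
# ALTER SHARE ... ADD TABLE, CREATE RECIPIENT, GRANT SELECT), but none
# of this is the real Databricks API.
import secrets

class SharingServer:
    def __init__(self):
        self.shares = {}       # share name -> set of table names
        self.recipients = {}   # activation token -> recipient name
        self.grants = set()    # (share name, recipient name) pairs

    def create_share(self, name):
        self.shares[name] = set()

    def alter_share_add_table(self, share, table):
        self.shares[share].add(table)

    def create_recipient(self, name):
        # Returns an activation token, standing in for the activation link URL.
        token = secrets.token_hex(8)
        self.recipients[token] = name
        return token

    def grant_select(self, share, recipient):
        self.grants.add((share, recipient))

    def revoke(self, share, recipient):
        # Stopping the share: the recipient immediately loses access.
        self.grants.discard((share, recipient))

    def list_tables(self, token, share):
        # What a recipient sees when it connects with its credentials.
        recipient = self.recipients[token]
        if (share, recipient) not in self.grants:
            raise PermissionError(f"{recipient} cannot read {share}")
        return sorted(self.shares[share])

server = SharingServer()
server.create_share("sales_share")
server.alter_share_add_table("sales_share", "orders")
token = server.create_recipient("partner")
server.grant_select("sales_share", "partner")
print(server.list_tables(token, "sales_share"))  # ['orders']
```

The revoke path also mirrors what happens when the producer stops the share: the grant disappears and any new read attempt fails.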
An interesting internal detail to share is the access mechanism. When the client accesses the shared table, Databricks generates a dedicated short-lived access link so that the client can directly read the data from the bucket/container exposing the table. There is no need to move the data first to the client's object storage! And how did AWS Redshift solve the problem?
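The short-lived link idea can be sketched with an HMAC-signed URL carrying an expiry timestamp. This is a hypothetical scheme for illustration; the real services rely on provider-specific mechanisms such as S3 pre-signed URLs, not this exact format.

```python
# Minimal sketch of a short-lived signed link, the mechanism that lets a
# client read table files directly from the provider's object storage.
# Hypothetical scheme, not what Databricks actually emits.
import hashlib
import hmac

SECRET = b"server-side-secret"  # assumption: held only by the data provider

def sign_url(path, expires_at):
    payload = f"{path}?expires={expires_at}".encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={signature}"

def is_valid(url, now):
    path, _, query = url.partition("?")
    params = dict(p.split("=") for p in query.split("&"))
    expires_at = int(params["expires"])
    expected = sign_url(path, expires_at)
    # Reject tampered paths/timestamps and expired links.
    return hmac.compare_digest(url, expected) and now < expires_at

url = sign_url("/bucket/delta-table/part-0.parquet", expires_at=1_000)
print(is_valid(url, now=500))    # True: the link is still fresh
print(is_valid(url, now=2_000))  # False: the link expired
```

Because the signature covers both the path and the expiry, the client can neither point the link at another file nor extend its lifetime.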
The AWS data warehouse solution uses RA3 managed storage, which decouples storage and compute (each can scale independently). Thanks to that, the recipient's workload shouldn't impact the cluster sharing the data. From there, the steps are very similar to Delta Sharing's. The producer starts by defining a DATASHARE and assigning access permissions to it. Later, the consumer must accept the share and expose it in its Redshift cluster to the internal users.
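The consumer-side handshake can be sketched as a two-step gate: the share is not queryable until it has been accepted and exposed as a local database. Again, this is a toy illustration of the workflow, not the AWS API.

```python
# Toy sketch of the Redshift datashare handshake: the producer grants the
# share, then the consumer must accept it and expose it in its own cluster
# before internal users can query it. Illustration only, not the AWS API.
class Datashare:
    def __init__(self, name, tables):
        self.name = name
        self.tables = tables

class ConsumerCluster:
    def __init__(self):
        self.pending = {}    # share name -> Datashare waiting for acceptance
        self.databases = {}  # local database name -> accepted Datashare

    def receive_grant(self, share):
        self.pending[share.name] = share

    def accept(self, share_name):
        # Step 1: the consumer account accepts the producer's grant.
        return self.pending.pop(share_name)

    def expose(self, share, database_name):
        # Step 2: expose the share as a local database for internal users.
        self.databases[database_name] = share

    def query(self, database_name, table):
        return table in self.databases[database_name].tables

consumer = ConsumerCluster()
consumer.receive_grant(Datashare("sales", {"orders", "customers"}))
share = consumer.accept("sales")
consumer.expose(share, "shared_sales")
print(consumer.query("shared_sales", "orders"))  # True
```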
When it comes to Azure Data Share, let me introduce the workflow in the next section and focus here only on one of the 2 supported data sharing modes, called in-place. The name is self-explanatory: in this mode, the service simply shares the data without performing any physical action on it. The opposite mode is called snapshot, and it's an example of the synchronized sharing covered just below.
Synchronized sharing
The snapshot mode relies on data synchronization between 2 different accounts. The workflow is similar to the in-place share, though. The data producer first creates a Data Share resource from an existing data source (Blob Storage, Data Lake Gen 2, SQL Database, Synapse, Data Explorer). Next, they select the shared objects, enter the recipients' emails, and configure the data share refresh period.
The flow from the consumer perspective looks similar to Delta Sharing's and Redshift Share's. The consumer must confirm the share invitation and configure where the data from the producer will land by creating/selecting the Data Share and the destination in their subscription. After that, the service automatically copies the dataset to the recipient's subscription. For the in-place share, the data doesn't move; instead, a symbolic reference is created between the source and the destination.
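The contrast between the two modes can be shown in a few lines: snapshot physically copies the producer's data on each refresh, so the consumer's copy goes stale between refreshes, while in-place only keeps a reference to the live source. A toy illustration, not the Azure API:

```python
# Toy contrast between Azure Data Share's two modes. Illustration only.
producer = {"orders": ["o1", "o2"]}  # the producer's live dataset

class SnapshotShare:
    def __init__(self, source):
        self.source = source
        self.copy = {}

    def refresh(self):
        # The service physically copies the dataset to the recipient.
        self.copy = {table: list(rows) for table, rows in self.source.items()}

class InPlaceShare:
    def __init__(self, source):
        # Only a symbolic reference; no data movement.
        self.source = source

    def read(self, table):
        return self.source[table]

snapshot = SnapshotShare(producer)
in_place = InPlaceShare(producer)
snapshot.refresh()
producer["orders"].append("o3")  # the producer writes after the refresh

print(snapshot.copy["orders"])   # ['o1', 'o2']: stale until the next refresh
print(in_place.read("orders"))   # ['o1', 'o2', 'o3']: always the live data
```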
One thing to notice regarding the Data Share modes, though: the in-place mode only works for the Data Explorer service. The other data stores quoted previously only support snapshot, the data synchronization-based mode.
All 3 of these services provide a centralized way to share the data. No more need to keep a document of what is shared with whom; the data management part is directly in these services. When you stop the share, the recipient loses access to your data. However, it doesn't prevent them from copying your data to their own cloud space before the cancellation.
What does it mean, sharing the data on the cloud, and how do the cloud providers implement it? The answer is in the new blog post: https://t.co/lGVin2rs2b
— Bartosz Konieczny (@waitingforcode) November 28, 2021
