Data sharing on the cloud

One of the big announcements of the previous Data+AI Summit was Delta Sharing, a protocol for exchanging live data with internal and external users. The question I asked myself at that moment was "Does something similar exist on the cloud?". Let's see.

From my research I found that there are 2 types of data sharing modes: live sharing and synchronized sharing.

Live sharing

Let's first focus on the former category, which includes Databricks Delta Sharing, Redshift Data Sharing and Azure Data Share. The first on the list, Delta Sharing, exists to share Delta Lake tables. Unsurprisingly, since Databricks has been cloud native from its very first days, the feature leverages cloud resources. In a nutshell, the sharing protocol works in these steps:

  1. Data provider defines the shared tables with SQL commands:
    • CREATE SHARE ... to create the shared container,
    • ALTER SHARE ... ADD TABLE ... to share a new table in the container,
    • CREATE RECIPIENT ... to create a new user to share the dataset with. The command returns an activation link that the user will use to connect to the shared tables,
    • GRANT SELECT ON SHARE ... TO RECIPIENT ... to grant the recipient read permission on the created share.
  2. The recipient connects to the dataset with the open source Delta Sharing library by referencing the credentials downloaded from the link returned in the CREATE RECIPIENT step.
  3. The Delta Sharing client can then access the tables as if they were stored in his own cloud space.
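
The steps above can be sketched in Python. This is only an illustration: the share, table and recipient names are made up, the helper functions are hypothetical, and the recipient part assumes the open source `delta-sharing` Python package is installed and that the credentials file was downloaded from the activation link.

```python
from typing import List


def provider_statements(share: str, table: str, recipient: str) -> List[str]:
    """SQL statements run by the data provider, in the order of step 1."""
    return [
        f"CREATE SHARE {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        f"CREATE RECIPIENT {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Address of a shared table: <profile file>#<share>.<schema>.<table>."""
    return f"{profile_path}#{share}.{schema}.{table}"


def read_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Steps 2 and 3: load the shared table as a pandas DataFrame."""
    # Imported lazily so the helpers above work without the package installed.
    import delta_sharing  # pip install delta-sharing

    return delta_sharing.load_as_pandas(
        table_url(profile_path, share, schema, table)
    )


if __name__ == "__main__":
    # Hypothetical names, only for illustration.
    for statement in provider_statements("sales_share", "sales.orders", "partner_a"):
        print(statement + ";")
    print(table_url("config.share", "sales_share", "sales", "orders"))
```

The profile file referenced by `profile_path` is the credentials file obtained through the activation link; the recipient never needs direct credentials to the provider's storage.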

An interesting internal detail is the sharing mechanism itself. When the client accesses the shared table, Databricks generates a dedicated short-lived access link so that the client can read the data directly from the bucket/container exposing the table. There is no need to move the data to the client's object storage first! And how did AWS Redshift solve the problem?
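
To give an intuition of what a short-lived access link is, here is a toy model built on an HMAC signature and an expiration timestamp. It is not the Databricks implementation nor any cloud provider's signing algorithm; the secret, the URL layout and the function names are assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical signing key known only to the data provider.
SECRET = b"provider-only-secret"


def sign_url(path: str, ttl_seconds: int, now: float) -> str:
    """Return a link to `path` that stops being valid after `ttl_seconds`."""
    expires = int(now) + ttl_seconds
    base = f"{path}?expires={expires}"
    signature = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return f"{base}&sig={signature}"


def is_valid(url: str, now: float) -> bool:
    """Accept the link only if it was not tampered with and has not expired."""
    base, _, signature = url.rpartition("&sig=")
    expires = int(base.rsplit("expires=", 1)[1])
    expected = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and now < expires


if __name__ == "__main__":
    link = sign_url("/bucket/orders/part-0.parquet", ttl_seconds=900, now=1_000)
    print(is_valid(link, now=1_500))   # still inside the 15-minute window
    print(is_valid(link, now=2_000))   # window elapsed
```

The point of the design is that the storage layer can verify the link on its own, so the recipient reads the files directly while the provider keeps control over how long each link lives.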

The AWS data warehouse solution uses RA3 managed storage, which decouples storage from compute (= each of them can scale independently). Thanks to that, the recipient shouldn't impact the cluster sharing the data. Apart from that, the steps are very similar to Delta Sharing's. The producer starts by defining a DATASHARE and assigning access permissions to it. Then, the consumer must accept the share and expose it in its Redshift cluster to the internal users.
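
As a rough sketch, the Redshift flow could look like the statements generated below. The datashare, schema, table, database and group names, as well as the namespace identifiers, are placeholders, and the helpers are hypothetical. The example covers sharing between namespaces of the same account; cross-account sharing involves additional authorization steps that are not shown here.

```python
from typing import List


def producer_statements(share: str, schema: str, table: str,
                        consumer_namespace: str) -> List[str]:
    """Statements run on the producer cluster to define and grant the share."""
    return [
        f"CREATE DATASHARE {share}",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema}",
        f"ALTER DATASHARE {share} ADD TABLE {schema}.{table}",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}'",
    ]


def consumer_statements(share: str, local_db: str, producer_namespace: str,
                        group: str) -> List[str]:
    """Statements run on the consumer cluster to expose the share internally."""
    return [
        f"CREATE DATABASE {local_db} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}'",
        f"GRANT USAGE ON DATABASE {local_db} TO GROUP {group}",
    ]


if __name__ == "__main__":
    # Placeholder identifiers, only for illustration.
    for s in producer_statements("orders_share", "public", "orders", "<consumer-namespace>"):
        print(s + ";")
    for s in consumer_statements("orders_share", "shared_orders", "<producer-namespace>", "analysts"):
        print(s + ";")
```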

When it comes to Azure Data Share, let me introduce the workflow in the next section and focus here only on one of the 2 supported data sharing modes, called in-place. The name is self-explanatory: in this mode, the service simply shares the data without performing any physical action on it. The opposite mode is called snapshot and it's an example of the synchronized sharing covered just below.

Synchronized sharing

The snapshot mode relies on data synchronization between 2 different accounts. The workflow is similar to the in-place share, though. The data producer first creates a Data Share resource from an existing data source (Blob Storage, Data Lake Gen 2, SQL Database, Synapse, Data Explorer). Next, he selects the shared objects, enters the recipients' emails, and configures the data share refresh period.

The flow from the consumer's perspective looks similar to Delta Sharing and Redshift Data Sharing. The consumer must accept the share invitation and configure where the data from the producer will land by creating or selecting the Data Share resource and the destination in his subscription. After that, the service automatically copies the dataset to the recipient's subscription. For the in-place share, the data doesn't move; instead, a symbolic reference is created between the source and the destination.
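
The difference between both modes can be illustrated with a small in-memory model. It has nothing to do with the real Azure Data Share API; the class and attribute names are invented for the example. The in-place share reads through a reference to the producer's data, while the snapshot share works on a local copy that only changes on refresh.

```python
import copy


class Producer:
    """Owner of the shared dataset."""
    def __init__(self, rows):
        self.rows = list(rows)


class InPlaceShare:
    """The consumer reads through a reference; nothing is copied."""
    def __init__(self, producer):
        self._producer = producer
        self.active = True

    def read(self):
        if not self.active:
            raise PermissionError("share revoked")
        return list(self._producer.rows)


class SnapshotShare:
    """The consumer owns a copy, refreshed on a schedule."""
    def __init__(self, producer):
        self._producer = producer
        self.active = True
        self.local_copy = []

    def refresh(self):
        if not self.active:
            raise PermissionError("share revoked")
        self.local_copy = copy.deepcopy(self._producer.rows)

    def read(self):
        # Reads never touch the producer, only the last synchronized copy.
        return list(self.local_copy)


if __name__ == "__main__":
    producer = Producer(["order-1", "order-2"])
    in_place, snapshot = InPlaceShare(producer), SnapshotShare(producer)
    snapshot.refresh()
    producer.rows.append("order-3")
    print(in_place.read())   # sees the new row immediately
    print(snapshot.read())   # sees it only after the next refresh
```

In this model the snapshot consumer trades freshness for independence: its reads don't depend on the producer, but the data is only as recent as the last refresh.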

One thing to notice regarding the Data Share modes, though: the in-place mode only works for the Data Explorer service. The other data sources quoted previously only support the snapshot mode, i.e. the one based on data synchronization.

All 3 of these services provide a centralized way to share the data. There is no more need to keep a document of what is shared with whom - the data management part is directly in these services. When you stop the share, the recipient loses access to your data. However, it doesn't prevent him from copying your data to his own cloud space before the cancellation.