Delta Sharing: An Implementation Guide for Multi-Cloud Data Sharing


In a recent blog, Delta Sharing: Multi-Cloud Data Sharing for Architects, I discussed the importance of sharing data across various environments and how to handle different scenarios at a high level. The primary focus from a technology perspective was Delta Sharing, an open-source data sharing protocol that offers unparalleled flexibility and freedom, though we also discussed alternative solutions for different technical requirements. In this blog we will walk through concrete examples of those solutions, allowing readers to gain a functional understanding of multi-cloud data sharing from an implementation standpoint.

We will cover the following examples:

  • Databricks to Databricks Delta Sharing
  • Databricks to Non-Databricks Delta Sharing
  • Databricks to Power BI Delta Sharing
  • Writing data from AWS Databricks to Azure Data Lake Gen2
    - This is a similar process between any of the three cloud storage solutions (S3, ADLS, and GCS).
  • Cloning Tables from AWS Databricks to Azure Data Lake Gen2
    - This is a similar process between any of the three cloud storage solutions (S3, ADLS, and GCS).

Delta Sharing is a transformative solution for accessing and sharing data. Its support for batch, incremental, and stream processing delivers unparalleled versatility in data handling. A key feature of Delta Sharing is its ability to abstract the data's location, enabling seamless access to data across different environments.

In Databricks, Delta Sharing integrates seamlessly with Unity Catalog to adhere to enterprise governance protocols, offering centralized management and auditing capabilities for shared data. An additional advantage of Delta Sharing is that recipients always have access to the most up-to-date data, whether it is consumed through streams or batches. This real-time accessibility empowers informed decision-making and improves operational efficiency.

There are three core components to Delta Sharing:

  1. Provider: the individual or organization that owns the share and controls access to the data objects. A provider can have many shares and many recipients.
  2. Recipient: the individual or organization that is gaining access to a share.
  3. Share: a read-only collection of data objects (tables, table partitions, views, etc.).

Before we dive into the examples, here is how to create a Delta Share in Databricks using SQL:

CREATE SHARE IF NOT EXISTS my_share COMMENT "This is a Delta share!";

Once the Delta Share is created, you can add data objects to it. The following command adds a single table to the share we previously created.

ALTER SHARE my_share ADD TABLE my_catalog.my_schema.my_table
COMMENT "This table belongs to a share!";

We now have a share on Databricks with a single table. Please note that you can share tables, views, table history, and table partitions.
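
For instance, here is a minimal sketch of sharing a table together with its history (which recipients need for streaming and time travel) and sharing a view; the object names are illustrative:

# Provider side: share a table together with its change history, and
# share a view. All object names here are illustrative.
spark.sql("""
  ALTER SHARE my_share
  ADD TABLE my_catalog.my_schema.my_events
  WITH HISTORY
""")

# View sharing is supported for Databricks-to-Databricks sharing
spark.sql("ALTER SHARE my_share ADD VIEW my_catalog.my_schema.my_view")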

Delta Sharing is open source, which means it does not require the provider or the recipient to be a Databricks customer; that said, sharing data between Databricks customers is extremely simple. To share data between Databricks workspaces, users can simply do the following.

Obtain the recipient’s metastore identifier by running the following command in the recipient’s Databricks environment:

SELECT CURRENT_METASTORE();

Using the output of the previous command, one can add the recipient by running the following in the provider’s environment:

CREATE RECIPIENT IF NOT EXISTS <recipient-name>
USING ID '<sharing-identifier>'
COMMENT "This is a new recipient!";

GRANT SELECT ON SHARE my_share TO RECIPIENT <recipient-name>;

At this point the recipient should be able to discover the share in their environment in the “Shared with me” section of the Catalog Explorer.

Delta Sharing: An Implementation Guide for Multi-Cloud Data Sharing (3)

The biggest benefit of Delta Sharing between two Databricks customers is that it simply works with practically zero effort. The data will automatically show up in the recipient’s catalog and they can begin querying data.
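
To query the shared data under a stable catalog name, the recipient can mount the share as a catalog; a minimal sketch, with illustrative provider and object names:

# Recipient side: mount the share as a local catalog, then query the
# shared table like any other table (names are illustrative).
spark.sql("CREATE CATALOG IF NOT EXISTS shared_catalog USING SHARE provider_name.my_share")
spark.sql("SELECT * FROM shared_catalog.my_schema.my_table LIMIT 10").show()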

To share data with a non-Databricks customer, the provider will need to first add the recipient (for open sharing, create the recipient without the USING ID clause, which generates token-based credentials) and then send the connection information to the recipient.

Simply run the following command to obtain the Delta Share activation link, which you can send directly to your recipient to provide access.

DESCRIBE RECIPIENT <recipient-name>;

Once the recipient has downloaded the credential file from the activation link, they can use the following code to read data from the provider's Delta Share using only open-source projects.

import delta_sharing
from pyspark.sql import SparkSession

# Path to the credential file downloaded from the activation link
profile_file = "/path/to/oss_share.share"
share_name = "share_name"
schema_name = "schema_name"
table_name = "table_name"

# The REST client can be used to discover the shares, schemas, and tables
# the provider has made available to this credential
client = delta_sharing.SharingClient(profile=profile_file)
print(client.list_all_tables())

# Initialize Spark with the Delta Lake and Delta Sharing packages
spark = (SparkSession
    .builder
    .config('spark.jars.packages',
            'org.apache.hadoop:hadoop-azure:3.3.1,'
            'io.delta:delta-core_2.12:2.2.0,'
            'io.delta:delta-sharing-spark_2.12:0.6.2')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

# A Delta Sharing table URL has the form <profile-file>#<share>.<schema>.<table>
table_url = f"{profile_file}#{share_name}.{schema_name}.{table_name}"

# Option 1: read through the delta_sharing helper
share_df = delta_sharing.load_as_spark(url=table_url)
share_df.show()

# Option 2: read through the native Spark reader
(spark.read
    .format("deltaSharing")
    .load(table_url)
    .show()
)

The benefit of this approach is that the consumer does not have to be a Databricks customer. This is key: because Delta Sharing avoids a vendor requirement for either party, it opens providers up to many more recipients, whether they are internal solutions or external organizations. While this example shows a batch read, it is also possible to perform incremental and stream processing off of a Delta Share, as sketched below.
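
Here is a minimal sketch of a streaming read, reusing the SparkSession and profile file from the example above; it assumes the provider shared the table with history and that delta-sharing-spark 0.6.0 or later is on the classpath:

# Incremental/stream processing off a Delta Share. Requires the table to
# be shared WITH HISTORY on the provider side.
table_url = f"{profile_file}#{share_name}.{schema_name}.{table_name}"

stream_df = (spark.readStream
    .format("deltaSharing")
    .load(table_url)
)

# Write the stream to a local Delta table (paths are illustrative)
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/shared_table")
    .start("/tmp/tables/shared_table_copy")
)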

Popular business intelligence tools, like Power BI, have Delta Sharing connectors that allow them to connect directly to the data without requiring data warehouse compute. This is a dramatic cost reduction for BI solutions that leverage this capability. To connect Power BI to a Delta Share, click "Get Data", search for "Delta Sharing", and in the screen that opens enter the required server URL from your credential file.

[Screenshot: the Power BI Delta Sharing connector dialog]

It is important to note that Delta Sharing has fewer capabilities to push down queries than a connection to a Databricks SQL warehouse, because Power BI is connecting directly to the data rather than to a warehouse. This solution is therefore not as scalable, since the Power BI server is responsible for the majority of the computation required to transform the data; however, it entirely eliminates the data warehouse cost and gives users more economical reporting capabilities. Essentially, when you use Delta Sharing as a data source, the data modeling and transformations are almost entirely owned by the reporting tool, and it may need to import the entire dataset prior to filtering, aggregating, or applying transformations, which can be a heavy load on the client server. I recommend using Delta Sharing as a data source for small to medium sized tables only. Engineers can also materialize table aggregations prior to ingesting into Power BI to reduce client server load, as sketched below.
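
A minimal sketch of that last pattern, with illustrative names: the provider maintains a compact aggregate table and shares it instead of the raw data.

# Provider side: materialize a small aggregate and add it to the share,
# instead of sharing the raw fact table (all names are illustrative).
spark.sql("""
  CREATE OR REPLACE TABLE my_catalog.my_schema.daily_sales_agg AS
  SELECT order_date, region, SUM(amount) AS total_amount, COUNT(*) AS orders
  FROM my_catalog.my_schema.sales
  GROUP BY order_date, region
""")

spark.sql("ALTER SHARE my_share ADD TABLE my_catalog.my_schema.daily_sales_agg")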

To connect from AWS Databricks to Azure Data Lake Storage Gen2 (ADLS Gen2), you can follow my previous blog, in which I discuss the cluster settings required to authenticate with ADLS Gen2. In short, you will need an Azure service principal with Storage Blob Data Contributor permissions on the container, plus your Azure tenant ID. Please note that the settings are similar for all three cloud storage solutions.

On the AWS Databricks cluster you will need to set the following settings:

fs.azure.account.auth.type.<STORAGE ACCOUNT NAME>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<STORAGE ACCOUNT NAME>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<STORAGE ACCOUNT NAME>.dfs.core.windows.net <SERVICE PRINCIPAL ID>
fs.azure.account.oauth2.client.secret.<STORAGE ACCOUNT NAME>.dfs.core.windows.net <SERVICE PRINCIPAL SECRET>
fs.azure.account.oauth2.client.endpoint.<STORAGE ACCOUNT NAME>.dfs.core.windows.net https://login.microsoftonline.com/<AZURE TENANT ID>/oauth2/token

Please see below for a screenshot of the cluster configurations on an AWS Databricks Cluster.

[Screenshot: Spark configuration on an AWS Databricks cluster]
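
If you prefer not to bake credentials into the cluster configuration, the same settings can be applied at the session level from a notebook; a sketch assuming a hypothetical Databricks secret scope named my_scope:

# Session-level alternative to cluster-wide Spark configs; credentials are
# read from a (hypothetical) secret scope rather than pasted in plain text.
storage_account = "<STORAGE ACCOUNT NAME>"
tenant_id = dbutils.secrets.get("my_scope", "azure-tenant-id")
client_id = dbutils.secrets.get("my_scope", "sp-client-id")
client_secret = dbutils.secrets.get("my_scope", "sp-client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")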

Now that you have set your cluster configuration, you can run the following command, which reads data from an AWS Databricks environment and streams it to an ADLS Gen2 storage location.

# Read a Unity Catalog table as a stream and write it to ADLS Gen2
(spark.readStream
    .table("my_catalog.my_schema.my_table")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/path/to/table")
)

One of the benefits of this process is that the data provider controls the cadence of data egress from one cloud environment to another; this is in contrast with Delta Sharing, where the data recipient controls the egress. Similar to Delta Sharing, users can transform data in flight and process it as streams, batches, or increments as needed.
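
To make that cadence explicit, one option is to run the same stream as a scheduled job with an availableNow trigger, which processes everything new since the last run and then stops:

# Provider-controlled cadence: run this as a scheduled job. The
# availableNow trigger processes all unprocessed data, then shuts down.
(spark.readStream
    .table("my_catalog.my_schema.my_table")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)
    .start("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/path/to/table")
)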

The previous section covered the cluster configurations required to authenticate AWS Databricks clusters to ADLS Gen2, which are also a requirement for table cloning. The key difference is that table clones make data replication much simpler than the native Spark read/write functions.

To clone a table from one Databricks environment to an ADLS Gen2 storage location please run the following command:

CREATE TABLE IF NOT EXISTS delta.`abfss://<container_name>@<storage_name>.dfs.core.windows.net/path/to/table`
CLONE my_catalog.my_schema.my_table;

Similar to the previous section, the benefit of this process is that the data provider controls the cadence of data egress from one cloud environment to another, in contrast with Delta Sharing, where the data recipient is in control. However, with table clones users are unable to perform in-flight transformations and cannot stream the data; the data can still be processed in batches or incrementally as needed. I recommend table clones for data replication scenarios because of their simplicity and performance.
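
Incremental processing here can be as simple as re-executing the clone on a schedule; my understanding is that re-running a deep clone only copies files that are new or changed since the previous run. A sketch:

# Scheduled job: re-running the deep clone synchronizes the target
# incrementally rather than copying the full table each time.
spark.sql("""
  CREATE OR REPLACE TABLE
    delta.`abfss://<container_name>@<storage_name>.dfs.core.windows.net/path/to/table`
  CLONE my_catalog.my_schema.my_table
""")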

Management and Automation of Delta Sharing Resources

In this section, we will discuss decisions around the automation and management of Delta Sharing resources on Databricks. Please note that this applies to scenarios where Delta Sharing is being used, not necessarily the other options in which data is written and/or cloned to a secondary location.

There are two primary options when architecting a data share for multiple consumers: single share with many recipients or many shares with isolated recipients.

  • A single share allows providers to write a single job that processes, transforms, and partitions the data, then shares it. The provider can then add many recipients to the share and isolate data based on partition (see the sketch below). However, while you can isolate partitions within a table, you cannot restrict access to whole data objects per recipient; therefore, more granular governance of the data assets is owned by the recipient.
  • Having many shares gives providers strong isolation between recipients' data and ownership of granular permissions on the shared assets. For example, you can share a table with recipient A and not recipient B. However, with multiple shares the provider will likely need multiple jobs to process, transform, and share the data, making this option more costly.

Both scenarios work extremely well at scale, but each has nuances around cost and access controls that should be considered during the planning phases of the project.
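
As a sketch of the single-share pattern, a partition specification can reference recipient properties so that each recipient only sees its own slice of the table; the region property name here is illustrative:

# Single share, many recipients: each recipient sees only the partitions
# whose `region` value matches their (illustrative) recipient property.
spark.sql("""
  ALTER SHARE my_share
  ADD TABLE my_catalog.my_schema.my_events
  PARTITION (region = CURRENT_RECIPIENT('region'))
""")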

As for automating the management of Delta Sharing resources, review the available REST APIs, which let you programmatically manage your resources. For example, let's assume a solution where consumers can request a "subscription" to a data share. Engineers can programmatically control permissions and objects to do the following (see the sketch after this list):

  • Create a share.
  • Update a share to add/remove data objects.
  • Update permissions on a share to add/remove recipients.
  • Delete a share entirely.
    - Please note that deleting a share does not delete the underlying data objects, which persist in the provider's Databricks environment. It simply removes the share object itself.
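
Here is a sketch of what that could look like against the Unity Catalog Shares REST endpoints; the workspace host, token, and all object names are illustrative:

import requests

HOST = "https://<workspace-host>"               # illustrative workspace URL
HEADERS = {"Authorization": "Bearer <token>"}   # PAT or OAuth token

# 1. Create a share.
requests.post(f"{HOST}/api/2.1/unity-catalog/shares",
              headers=HEADERS,
              json={"name": "my_share", "comment": "Subscription share"})

# 2. Update the share to add (or remove) data objects.
requests.patch(f"{HOST}/api/2.1/unity-catalog/shares/my_share",
               headers=HEADERS,
               json={"updates": [
                   {"action": "ADD",
                    "data_object": {"name": "my_catalog.my_schema.my_table",
                                    "data_object_type": "TABLE"}}]})

# 3. Update permissions to grant (or revoke) recipient access.
requests.patch(f"{HOST}/api/2.1/unity-catalog/shares/my_share/permissions",
               headers=HEADERS,
               json={"changes": [
                   {"principal": "my_recipient", "add": ["SELECT"]}]})

# 4. Delete the share entirely (the underlying tables are untouched).
requests.delete(f"{HOST}/api/2.1/unity-catalog/shares/my_share", headers=HEADERS)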

In this blog, we covered implementations of the approaches discussed in my previous blog. For more information, please check out the Databricks Documentation and the Open Source Documentation. One thing to note: I did not cover open-to-open Delta Sharing, which requires setting up your own Delta Sharing server; for a code example, please see my GitHub repository.

I also encourage you to read this Databricks blog for additional information and depictions of sharing data globally.

Go forth and share data!

Disclaimer: these are my own thoughts and opinions and not a reflection of my employer

