Delta Sharing: An Implementation Guide for Multi-Cloud Data Sharing


In a recent blog, Delta Sharing: Multi-Cloud Data Sharing for Architects, I discussed the importance of sharing data across various environments and how to handle different scenarios at a high level. The primary focus from a technology perspective was Delta Sharing, an open-source data sharing protocol that offers unparalleled flexibility and freedom, though we also discussed alternative solutions for different technical requirements. In this blog we will walk through concrete examples of those solutions, allowing readers to gain a functional understanding of multi-cloud data sharing from an implementation standpoint.

We will cover the following examples:

  • Databricks to Databricks Delta Sharing
  • Databricks to Non-Databricks Delta Sharing
  • Databricks to Power BI Delta Sharing
  • Writing data from AWS Databricks to Azure Data Lake Gen2
    - This is a similar process between any of the three cloud storage solutions (S3, ADLS, and GCS).
  • Cloning Tables from AWS Databricks to Azure Data Lake Gen2
    - This is a similar process between any of the three cloud storage solutions (S3, ADLS, and GCS).

Delta Sharing is a transformative solution for accessing and sharing data. Its support for batch, incremental, and stream processing delivers unparalleled versatility in data handling. A key feature of Delta Sharing is its ability to abstract the data's location, enabling seamless access to data across different environments.

In Databricks, Delta Sharing integrates seamlessly with Unity Catalog to adhere to enterprise governance protocols, offering centralized management and auditing capabilities for shared data. An additional advantage of Delta Sharing is that recipients always have access to the most up-to-date data, whether it is consumed through streams or batches. This real-time accessibility empowers informed decision-making and improves operational efficiency.

There are three core components to Delta Sharing:

  1. Provider: the individual or organization that owns the share and controls access to the data objects. A provider can have many shares and many recipients.
  2. Recipient: the individual or organization that is gaining access to a share.
  3. Share: a read-only collection of data objects (tables, table partitions, views, etc.).

Before we dive into the examples, here is how to create a Delta Share in Databricks using SQL:

CREATE SHARE IF NOT EXISTS my_share COMMENT "This is a Delta share!";

Once the Delta Share is created, you can add data objects to it. The following command adds a single table to the share we previously created.

ALTER SHARE my_share ADD TABLE my_catalog.my_schema.my_table
COMMENT "This table belongs to a share!";

We now have a share on Databricks with a single table. Please note that you can share tables, views, table history, and table partitions.
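
For instance, here is a minimal sketch of sharing a table together with its history (which recipients need for streaming and time travel) and sharing a view; the object names are illustrative:

# Provider side: share a table together with its change history, and
# share a view. All object names here are illustrative.
spark.sql("""
  ALTER SHARE my_share
  ADD TABLE my_catalog.my_schema.my_events
  WITH HISTORY
""")

# View sharing is supported for Databricks-to-Databricks sharing
spark.sql("ALTER SHARE my_share ADD VIEW my_catalog.my_schema.my_view")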

Delta Sharing is open source, which means it does not require the provider or the recipient to be a Databricks customer; that said, sharing data between Databricks customers is extremely simple. To share data between Databricks workspaces, users can simply do the following.

Obtain the recipient’s metastore identifier by running the following command in the recipient’s Databricks environment:

SELECT CURRENT_METASTORE();

Using the output of the previous command, one can add the recipient by running the following in the provider’s environment:

CREATE RECIPIENT IF NOT EXISTS <recipient-name>
USING ID '<sharing-identifier>'
COMMENT "This is a new recipient!";

GRANT SELECT ON SHARE my_share TO RECIPIENT <recipient-name>;

At this point the recipient should be able to discover the share in their environment in the “Shared with me” section of the Catalog Explorer.

Delta Sharing: An Implementation Guide for Multi-Cloud Data Sharing (3)

The biggest benefit of Delta Sharing between two Databricks customers is that it simply works with practically zero effort. The data will automatically show up in the recipient’s catalog and they can begin querying data.
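
To query the shared data under a stable catalog name, the recipient can mount the share as a catalog; a minimal sketch, with illustrative provider and object names:

# Recipient side: mount the share as a local catalog, then query the
# shared table like any other table (names are illustrative).
spark.sql("CREATE CATALOG IF NOT EXISTS shared_catalog USING SHARE provider_name.my_share")
spark.sql("SELECT * FROM shared_catalog.my_schema.my_table LIMIT 10").show()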

To share data with a non-Databricks customer, the provider will need to first add the recipient (for open sharing, create the recipient without the USING ID clause, which generates token-based credentials) and then send the connection information to the recipient.

Simply run the following command to obtain the Delta Share activation link, which you can send directly to your recipient to provide access.

DESCRIBE RECIPIENT <recipient-name>;

Once the recipient has downloaded the credential file from the activation link, they can use the following code to read data from the provider's Delta Share using only open-source projects.

import delta_sharing
from pyspark.sql import SparkSession

# Path to the credential file downloaded from the activation link
profile_file = "/path/to/oss_share.share"
share_name = "share_name"
schema_name = "schema_name"
table_name = "table_name"

# The REST client can be used to discover the shares, schemas, and tables
# the provider has made available to this credential
client = delta_sharing.SharingClient(profile=profile_file)
print(client.list_all_tables())

# Initialize Spark with the Delta Lake and Delta Sharing packages
spark = (SparkSession
    .builder
    .config('spark.jars.packages',
            'org.apache.hadoop:hadoop-azure:3.3.1,'
            'io.delta:delta-core_2.12:2.2.0,'
            'io.delta:delta-sharing-spark_2.12:0.6.2')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

# A Delta Sharing table URL has the form <profile-file>#<share>.<schema>.<table>
table_url = f"{profile_file}#{share_name}.{schema_name}.{table_name}"

# Option 1: read through the delta_sharing helper
share_df = delta_sharing.load_as_spark(url=table_url)
share_df.show()

# Option 2: read through the native Spark reader
(spark.read
    .format("deltaSharing")
    .load(table_url)
    .show()
)

The benefit of this approach is that the consumer does not have to be a Databricks customer. This is key: because Delta Sharing avoids a vendor requirement for either party, it opens providers up to many more recipients, whether they are internal solutions or external organizations. While this example shows a batch read, it is also possible to perform incremental and stream processing off of a Delta Share, as sketched below.
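
Here is a minimal sketch of a streaming read, reusing the SparkSession and profile file from the example above; it assumes the provider shared the table with history and that delta-sharing-spark 0.6.0 or later is on the classpath:

# Incremental/stream processing off a Delta Share. Requires the table to
# be shared WITH HISTORY on the provider side.
table_url = f"{profile_file}#{share_name}.{schema_name}.{table_name}"

stream_df = (spark.readStream
    .format("deltaSharing")
    .load(table_url)
)

# Write the stream to a local Delta table (paths are illustrative)
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/shared_table")
    .start("/tmp/tables/shared_table_copy")
)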

Popular business intelligence tools, like Power BI, have Delta Sharing connectors that allow them to connect directly to the data without requiring data warehouse compute. This is a dramatic cost reduction for BI solutions that leverage this capability. To connect Power BI to a Delta Share, click "Get Data", search for "Delta Sharing", and in the screen that opens enter the required server URL from your credential file.

[Screenshot: the Power BI Delta Sharing connector dialog]

It is important to note that Delta Sharing has fewer capabilities to push down queries than a connection to a Databricks SQL warehouse, because Power BI is connecting directly to the data rather than to a warehouse. This solution is therefore not as scalable, since the Power BI server is responsible for the majority of the computation required to transform the data; however, it entirely eliminates the data warehouse cost and gives users more economical reporting capabilities. Essentially, when you use Delta Sharing as a data source, the data modeling and transformations are almost entirely owned by the reporting tool, and it may need to import the entire dataset prior to filtering, aggregating, or applying transformations, which can be a heavy load on the client server. I recommend using Delta Sharing as a data source for small to medium sized tables only. Engineers can also materialize table aggregations prior to ingesting into Power BI to reduce client server load, as sketched below.
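
A minimal sketch of that last pattern, with illustrative names: the provider maintains a compact aggregate table and shares it instead of the raw data.

# Provider side: materialize a small aggregate and add it to the share,
# instead of sharing the raw fact table (all names are illustrative).
spark.sql("""
  CREATE OR REPLACE TABLE my_catalog.my_schema.daily_sales_agg AS
  SELECT order_date, region, SUM(amount) AS total_amount, COUNT(*) AS orders
  FROM my_catalog.my_schema.sales
  GROUP BY order_date, region
""")

spark.sql("ALTER SHARE my_share ADD TABLE my_catalog.my_schema.daily_sales_agg")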

To connect from AWS Databricks to Azure Data Lake Storage Gen2 (ADLS Gen2), you can follow my previous blog, in which I discuss the cluster settings required to authenticate with ADLS Gen2. In short, you will need an Azure service principal with Storage Blob Data Contributor permissions on the container, plus your Azure tenant ID. Please note that the settings are similar for all three cloud storage solutions.

On the AWS Databricks cluster you will need to set the following settings:

fs.azure.account.auth.type.<STORAGE ACCOUNT NAME>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<STORAGE ACCOUNT NAME>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<STORAGE ACCOUNT NAME>.dfs.core.windows.net <SERVICE PRINCIPAL ID>
fs.azure.account.oauth2.client.secret.<STORAGE ACCOUNT NAME>.dfs.core.windows.net <SERVICE PRINCIPAL SECRET>
fs.azure.account.oauth2.client.endpoint.<STORAGE ACCOUNT NAME>.dfs.core.windows.net https://login.microsoftonline.com/<AZURE TENANT ID>/oauth2/token

Please see below for a screenshot of the cluster configurations on an AWS Databricks Cluster.

[Screenshot: Spark configuration on an AWS Databricks cluster]
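
If you prefer not to bake credentials into the cluster configuration, the same settings can be applied at the session level from a notebook; a sketch assuming a hypothetical Databricks secret scope named my_scope:

# Session-level alternative to cluster-wide Spark configs; credentials are
# read from a (hypothetical) secret scope rather than pasted in plain text.
storage_account = "<STORAGE ACCOUNT NAME>"
tenant_id = dbutils.secrets.get("my_scope", "azure-tenant-id")
client_id = dbutils.secrets.get("my_scope", "sp-client-id")
client_secret = dbutils.secrets.get("my_scope", "sp-client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")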

Now that you have set your cluster configuration, you can run the following command, which reads data from an AWS Databricks environment and streams it to an ADLS Gen2 storage location.

# Read a Unity Catalog table as a stream and write it to ADLS Gen2
(spark.readStream
    .table("my_catalog.my_schema.my_table")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/path/to/table")
)

One of the benefits of this process is that the data provider controls the cadence of data egress from one cloud environment to another; this is in contrast with Delta Sharing, where the data recipient controls the egress. Similar to Delta Sharing, users can transform data in flight and process it as streams, batches, or increments as needed.
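
To make that cadence explicit, one option is to run the same stream as a scheduled job with an availableNow trigger, which processes everything new since the last run and then stops:

# Provider-controlled cadence: run this as a scheduled job. The
# availableNow trigger processes all unprocessed data, then shuts down.
(spark.readStream
    .table("my_catalog.my_schema.my_table")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)
    .start("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/path/to/table")
)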

The previous section covered the cluster configurations required to authenticate AWS Databricks clusters to ADLS Gen2, which are also a requirement for table cloning. The key difference is that table clones make data replication much simpler than the native Spark read/write functions.

To clone a table from one Databricks environment to an ADLS Gen2 storage location please run the following command:

CREATE TABLE IF NOT EXISTS delta.`abfss://<container_name>@<storage_name>.dfs.core.windows.net/path/to/table`
CLONE my_catalog.my_schema.my_table;

Similar to the previous section, the benefit of this process is that the data provider controls the cadence of data egress from one cloud environment to another, in contrast with Delta Sharing, where the data recipient is in control. However, with table clones users are unable to perform in-flight transformations and cannot stream the data; the data can still be processed in batches or incrementally as needed. I recommend table clones for data replication scenarios because of their simplicity and performance.
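
Incremental processing here can be as simple as re-executing the clone on a schedule; my understanding is that re-running a deep clone only copies files that are new or changed since the previous run. A sketch:

# Scheduled job: re-running the deep clone synchronizes the target
# incrementally rather than copying the full table each time.
spark.sql("""
  CREATE OR REPLACE TABLE
    delta.`abfss://<container_name>@<storage_name>.dfs.core.windows.net/path/to/table`
  CLONE my_catalog.my_schema.my_table
""")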

Management and Automation of Delta Sharing Resources

In this section, we will discuss decisions around the automation and management of Delta Sharing resources on Databricks. Please note that this applies to scenarios where Delta Sharing is being used, not necessarily the other options in which data is written and/or cloned to a secondary location.

There are two primary options when architecting a data share for multiple consumers: single share with many recipients or many shares with isolated recipients.

  • A single share allows providers to write a single job that processes, transforms, and partitions the data, then shares it. The provider can then add many recipients to the share and isolate data based on partition (see the sketch below). However, while you can isolate partitions within a table, you cannot restrict access to whole data objects per recipient; therefore, more granular governance of the data assets is owned by the recipient.
  • Having many shares gives providers strong isolation between recipients' data and ownership of granular permissions on the shared assets. For example, you can share a table with recipient A and not recipient B. However, with multiple shares the provider will likely need multiple jobs to process, transform, and share the data, making this option more costly.

Both scenarios work extremely well at scale, but each has nuances around cost and access controls that should be considered during the planning phases of the project.
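
As a sketch of the single-share pattern, a partition specification can reference recipient properties so that each recipient only sees its own slice of the table; the region property name here is illustrative:

# Single share, many recipients: each recipient sees only the partitions
# whose `region` value matches their (illustrative) recipient property.
spark.sql("""
  ALTER SHARE my_share
  ADD TABLE my_catalog.my_schema.my_events
  PARTITION (region = CURRENT_RECIPIENT('region'))
""")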

As for automating the management of Delta Sharing resources, review the available REST APIs, which let you programmatically manage your resources. For example, let's assume a solution where consumers can request a "subscription" to a data share. Engineers can programmatically control permissions and objects to do the following (see the sketch after this list):

  • Create a share.
  • Update a share to add/remove data objects.
  • Update permissions on a share to add/remove recipients.
  • Delete a share entirely.
    - Please note that deleting a share does not delete the underlying data objects, which persist in the provider's Databricks environment. It simply removes the share object itself.
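
Here is a sketch of what that could look like against the Unity Catalog Shares REST endpoints; the workspace host, token, and all object names are illustrative:

import requests

HOST = "https://<workspace-host>"               # illustrative workspace URL
HEADERS = {"Authorization": "Bearer <token>"}   # PAT or OAuth token

# 1. Create a share.
requests.post(f"{HOST}/api/2.1/unity-catalog/shares",
              headers=HEADERS,
              json={"name": "my_share", "comment": "Subscription share"})

# 2. Update the share to add (or remove) data objects.
requests.patch(f"{HOST}/api/2.1/unity-catalog/shares/my_share",
               headers=HEADERS,
               json={"updates": [
                   {"action": "ADD",
                    "data_object": {"name": "my_catalog.my_schema.my_table",
                                    "data_object_type": "TABLE"}}]})

# 3. Update permissions to grant (or revoke) recipient access.
requests.patch(f"{HOST}/api/2.1/unity-catalog/shares/my_share/permissions",
               headers=HEADERS,
               json={"changes": [
                   {"principal": "my_recipient", "add": ["SELECT"]}]})

# 4. Delete the share entirely (the underlying tables are untouched).
requests.delete(f"{HOST}/api/2.1/unity-catalog/shares/my_share", headers=HEADERS)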

In this blog, we covered implementations of the approaches discussed in my previous blog. For more information, please check out the Databricks Documentation and the Open Source Documentation. One thing to note: I did not cover open-to-open Delta Sharing, which requires setting up your own Delta Sharing server; for a code example, please see my GitHub repository.

I also encourage you to read this Databricks blog for additional information and depictions of sharing data globally.

Go forth and share data!

Disclaimer: these are my own thoughts and opinions and not a reflection of my employer

