Architecting Global Data Collaboration with Delta Sharing

In today's interconnected digital landscape, data sharing and collaboration across organizations and platforms are crucial to modern business operations. Delta Sharing, an open data sharing protocol, empowers organizations to securely share and access data across diverse platforms, prioritizing security and scalability while remaining independent of any particular vendor or data format.

This blog presents data replication options within Delta Sharing, offering architecture guidance tailored to specific data sharing scenarios. Drawing on our experience with many Delta Sharing clients, our goal is to reduce egress costs and improve performance through specific data replication alternatives. While live sharing remains suitable for many cross-region data sharing scenarios, there are instances where replicating the entire dataset and establishing a data refresh process for local regional replicas proves more cost-efficient. Delta Sharing facilitates this through Cloudflare R2 storage, Change Data Feed (CDF) sharing, and Delta deep clone functionality. Thanks to these capabilities, clients value Delta Sharing for empowering users and providing exceptional flexibility in meeting their data sharing needs.

Delta Sharing is Open, Flexible, and Cost-Efficient

Databricks and the Linux Foundation developed Delta Sharing to provide the first open source approach to data sharing across data, analytics, and AI. Customers can share live data across platforms, clouds, and regions with strong security and governance. Whether you self-host the open source project or use the fully managed Delta Sharing on Databricks, both provide a platform-agnostic, flexible, and cost-effective solution for global data delivery. Databricks customers receive additional benefits within a managed environment that minimizes administrative overhead and integrates natively with Databricks Unity Catalog. This integration offers a streamlined experience for data sharing within and across organizations.

Delta Sharing on Databricks has experienced widespread adoption across various collaboration scenarios since its general availability in August 2022.

In this blog, we will explore two common architectural patterns where Delta Sharing has played a pivotal role in enabling and enhancing critical business scenarios:

  1. Intra-Enterprise Cross-Regional Data Sharing
  2. Data Aggregator (Hub and Spoke) Model

As part of this blog, we will also demonstrate that the Delta Sharing deployment architecture is flexible and can be seamlessly extended to fulfill new data sharing requirements.

Intra-Enterprise Cross-Regional Data Sharing

In this use case, we illustrate a common Delta Sharing deployment pattern among our customers, where there is a business need to share some data across regions, such as a QA team in a separate region or a reporting team interested in business activity data on a global basis. Sharing intra-enterprise tables usually involves:

  • Sharing large tables: There is a requirement to share large tables in real time with recipients whose access patterns vary; recipients often execute diverse queries with different predicates. Clickstream and user activity data are good examples, and in those cases remote access is more appropriate.
  • Local replication: To enhance performance and better manage egress costs, some data should be replicated to create a local copy, especially when the recipient's region has a significant number of users who frequently access these tables.

In this scenario, the data provider's and the data recipient's business units belong to the same Databricks account but use different Unity Catalog metastores.
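As a concrete sketch, creating and consuming such a share between two metastores of the same account could look as follows (all object names and the sharing identifier below are hypothetical):

-- Provider side: create a share and add a live table to it
CREATE SHARE IF NOT EXISTS flights_data_share;
ALTER SHARE flights_data_share ADD TABLE db_flights.flights;

-- Provider side: register the recipient metastore via its sharing identifier
CREATE RECIPIENT IF NOT EXISTS reporting_region USING ID 'aws:eu-west-1:<metastore-uuid>';
GRANT SELECT ON SHARE flights_data_share TO RECIPIENT reporting_region;

-- Recipient side: mount the share as a catalog for immediate, live access
CREATE CATALOG IF NOT EXISTS flights_shared USING SHARE provider_name.flights_data_share;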

[Diagram: high-level architecture of intra-enterprise cross-regional Delta Sharing]

The above diagram illustrates a high-level architecture of the Delta Sharing solution, highlighting the key steps in the Delta Sharing process:

  1. Creation of a share: Live tables are shared with the recipient, enabling immediate data access.
  2. On-demand data replication: Creating a regional replica of the data improves performance, reduces cross-region network access, and minimizes the associated egress fees. This is achieved through the following approaches to data replication:

A. Change data feed on a shared table

This option requires sharing the table history and enabling the change data feed (CDF), which must be turned on explicitly by setting the table property delta.enableChangeDataFeed = true in a CREATE TABLE or ALTER TABLE statement, as shown below.
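For example, CDF can be enabled when the table is created or turned on for an existing table (the column list below is illustrative):

-- Enable CDF at table creation time
CREATE TABLE db_flights.flights (flight_id BIGINT, origin STRING, status STRING)
TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Or enable it on an existing table
ALTER TABLE db_flights.flights SET TBLPROPERTIES (delta.enableChangeDataFeed = true);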

Furthermore, when adding the table to the Share, ensure that it is added with the CDF option, as shown in the example below.

ALTER SHARE flights_data_share
ADD TABLE db_flights.flights
AS db_flights.flights_with_cdf
WITH CHANGE DATA FEED;

Once data is added or updated, the changes can be accessed as in this example:

-- View changes as of version 1
SELECT * FROM table_changes('db_flights.flights', 1)
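table_changes also accepts an end version, or timestamps instead of versions, which is useful for replaying a bounded window of changes (the values below are illustrative):

-- View changes between versions 1 and 5 (inclusive)
SELECT * FROM table_changes('db_flights.flights', 1, 5)

-- View changes since a given timestamp
SELECT * FROM table_changes('db_flights.flights', '2024-01-01 00:00:00')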

On the recipient side, changes can be accessed and merged into a local copy of the data in a similar way, as demonstrated in this notebook; a simplified sketch follows. Propagating the changes from the shared table to a local replica can be orchestrated using a Databricks Workflows job.
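Below is a minimal sketch of such a recipient-side merge, assuming a hypothetical key column flight_id, a local replica local_db.flights, and at most one change per key in the processed version range:

-- Apply CDF changes from the shared table to the local replica;
-- 2 is a placeholder for the first shared-table version not yet applied locally
MERGE INTO local_db.flights AS target
USING (
  SELECT *
  FROM table_changes('shared_catalog.db_flights.flights_with_cdf', 2)
  WHERE _change_type != 'update_preimage'  -- keep inserts, deletes, and post-update rows
) AS changes
ON target.flight_id = changes.flight_id
WHEN MATCHED AND changes._change_type = 'delete' THEN DELETE
WHEN MATCHED AND changes._change_type = 'update_postimage' THEN UPDATE SET *
WHEN NOT MATCHED AND changes._change_type = 'insert' THEN INSERT *;

In practice, each run would record the last version it applied (for example, in a small bookkeeping table) so the next scheduled run can start from the following version.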

B. Cloudflare R2 with Databricks

R2 is an excellent option for all Delta Sharing scenarios because customers can fully realize the potential of sharing without worrying about any unpredictable egress charges. It is discussed in detail later in this blog.

C. Delta Deep Clone

Another option, for the special case of intra-enterprise sharing within the same Databricks cloud account, is Delta deep clone. Deep cloning is a Delta feature that copies both the data and the metadata of the source table to the clone target. Additionally, the deep clone command can identify new data and refresh the clone accordingly. Here is the syntax:

CREATE TABLE [IF NOT EXISTS] table_name DEEP CLONE source_table_name [TBLPROPERTIES clause] [LOCATION path]

The previous command runs on the recipient side where source_table_name is the shared table and table_name is the local copy of the data that users can access.

A simple Databricks Workflows job can be scheduled for an incremental refresh of the data with recent updates using the following command:

CREATE OR REPLACE TABLE table_name DEEP CLONE source_table_name
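To confirm that a refresh was applied, the clone's history can be inspected on the recipient side; deep clone refreshes appear as CLONE operations in the table history:

-- Inspect the local copy's history, including clone operations and their metrics
DESCRIBE HISTORY table_name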

The same pattern can easily be extended to share data with external partners and clients on the Databricks Platform or any other platform. In this common extension, partners and external clients who are not on Databricks access the data through Excel, Power BI, pandas, and other compatible software such as Oracle.

Data Aggregator Model (Hub and Spoke model)

Another common scenario pattern arises when a business is focused on sharing data with clients, particularly in cases involving data aggregator enterprises or when the primary business function is collecting data on behalf of clients. A data aggregator, as an entity, specializes in collecting and merging data from diverse sources into a unified, cohesive dataset. These data shares are instrumental in serving diverse business needs such as business decision-making, market analysis, research, and supporting overall business operations.

The data sharing model in this pattern does the following:

  1. Connects recipients that are distributed across various clouds, including AWS, Azure, and GCP.
  2. Supports data consumption on diverse platforms, ranging in complexity from Python code to Excel spreadsheets.
  3. Enables scalability for the number of recipients, the quantity of shares, and data volumes.

Typically, this is achieved by the provider establishing a Databricks workspace in each cloud and replicating data using CDF on a shared table (as discussed above) across all three clouds to enhance performance and reduce egress costs. Within each cloud region, data can then be shared with the appropriate clients and partners.

However, a new, more efficient and straightforward approach can be employed by utilizing R2 through Cloudflare with Databricks, currently in private preview.

Cloudflare R2 integration with Databricks will enable organizations to safely, simply, and affordably share and collaborate on live data. With Cloudflare and Databricks, joint customers can eliminate the complexity and dynamic costs that stand in the way of the full potential of multi-cloud analytics and AI initiatives. Specifically, there will be zero egress fees and no need for complex data transfers or costly replication of data sets across regions.

Using this option requires the following steps, sketched in the example after the list:

  • Add Cloudflare R2 as an external storage location (while keeping the source of truth data in S3/ADLS/etc.)
  • Create new tables in Cloudflare R2, and sync data incrementally
  • Create a Delta Share, as usual, on the R2 table
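A minimal sketch of this flow, assuming an R2 bucket and a Unity Catalog storage credential for it have already been configured (the bucket URL and all object names below are hypothetical):

-- Register R2 as an external storage location
CREATE EXTERNAL LOCATION IF NOT EXISTS r2_shared_location
URL 'r2://my-shared-bucket@my-account-id.r2.cloudflarestorage.com'
WITH (STORAGE CREDENTIAL r2_credential);

-- Create an R2-backed copy of the source table; rerun to sync it incrementally
CREATE OR REPLACE TABLE main.db_flights.flights_r2
DEEP CLONE main.db_flights.flights
LOCATION 'r2://my-shared-bucket@my-account-id.r2.cloudflarestorage.com/flights';

-- Create a Delta Share, as usual, on the R2 table
CREATE SHARE IF NOT EXISTS flights_r2_share;
ALTER SHARE flights_r2_share ADD TABLE main.db_flights.flights_r2;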

As explained above, these approaches demonstrate various methods of on-demand data replication, each with its distinct advantages and specific requirements, making them suitable for various use cases.

[Diagram: data aggregator (hub and spoke) sharing model with Delta Sharing]

Comparing Data Replication Methods for Cross-Region Sharing

All three mechanisms above enable Delta Sharing users to create a local copy of shared data to minimize egress fees, especially across clouds and regions. The table below summarizes the differences between these options.

Data Replication Tool: Change data feed on a shared table
Key highlights:
  • Works within and across accounts
  • CDF needs to be enabled on the table
  • Requires coding to propagate the CDC changes to the destination table
  • The process can be orchestrated via Databricks Workflows
Recommendation: Use for external sharing with partners/clients across regions

Data Replication Tool: Cloudflare R2 with Databricks
Key highlights:
  • Cloudflare account required
  • Ideal for large-scale data sharing across multiple regions and cloud platforms
  • Use Delta deep clone or R2 Super Slurper for efficient data creation and refreshing in R2
Recommendation: Strongly recommended for large-scale Delta Sharing (many shares, two or more regions)

Data Replication Tool: Delta deep clone
Key highlights:
  • Works within the same account
  • Minimal coding
  • Incremental refresh via Databricks Workflows
Recommendation: Recommended when sharing internally across regions

Delta Sharing is open, flexible, and cost-efficient, and on Databricks it supports a broad spectrum of data assets, including notebooks, volumes, and AI models. In addition, several optimizations have significantly enhanced the performance of the Delta Sharing protocol. Databricks' ongoing investment in Delta Sharing capabilities, including improved monitoring, scalability, ease of use, and observability, underscores its commitment to enhancing the user experience and ensuring that Delta Sharing remains at the forefront of data collaboration.

Next steps

Throughout this blog, we have provided architectural guidance based on our experience with many Delta Sharing customers, with a primary focus on cost management and performance. While live sharing is suitable for many cross-region data sharing scenarios, we have explored instances where replicating the entire dataset and establishing a data refresh process for local regional replicas proves more cost-efficient. Delta Sharing facilitates this through R2 storage and CDF sharing, providing users with enhanced flexibility.

In the Intra-Enterprise Cross-Regional Data Sharing use case, Delta Sharing excels in sharing large tables with varied access patterns. Local replication, facilitated by CDF sharing, ensures optimal performance and cost management. Additionally, R2 through Cloudflare with Databricks offers an efficient option for large-scale Delta Sharing across multiple regions and clouds.

To learn more about how to integrate Delta Sharing into your data collaboration strategy, check out the latest Delta Sharing resources.
