Databricks Clean Rooms Demystified
by Yulin Zhou, Databricks Technology Lead
When I left home over a decade ago to pursue a data and IT degree in a city 400 miles from my hometown, my family asked, “What is it that you are going to study?” They were skeptical about the path their only child had taken. It was a time when not everyone had a smartphone. I could not explain it to them back then, but I just had this feeling that data would become one of the greatest assets organisations have, and I wanted to be the first data engineer in my family tree.
Now that I have navigated the data, digital, and cloud world for a number of years, I have come to believe that if I can explain a technology in layman’s terms that my grandma could understand, then I have explained it well. So let me do this for Databricks Clean Rooms in this post.
The concept of Databricks Clean Rooms came out in June 2022. It is built on the Databricks Delta Sharing feature that was released in 2021. Delta Sharing is an open standard for secure data sharing. “Open” because it is open source: you do not need to download any paid software to view a shared data asset. That data asset can be a table in your Databricks workspace, sitting under a catalog and a schema (or “database”, to use the traditional data warehouse term).
The Old Way of Sharing Data
You might wonder: well, I do not need a clean room or Delta Sharing to share my data. Why don’t I just export that table and share the export file with my data consumers on a secured file drive?
~ Your data is stale the second you export it. What if your data consumer requires fresh data every hour? Exporting and re-sharing it would become a full-time job.
~ You have created a physical copy of the data that takes up storage space, unnecessarily.
~ You will quickly lose track of what you have shared with whom, and fingers crossed you are not required to audit the data access and usage any time soon.

How about inviting external users to my Databricks workspace and assigning them selective, read-only permissions on data assets using Unity Catalog?
~ In most organisations, the Active Directory admins need a good reason to invite external users into the company’s domain. If all you are sharing is one data table in a single Databricks workspace belonging to a 3-person team, the benefit of inviting an external user usually does not justify the effort.
~ Having external users in your Databricks workspace also raises a challenge around sharing compute resources. Do you provide a dedicated cluster for your external users to query the shared data only? Who pays for the compute? How do you calculate the VM and DBU usage, convert it to dollar values, and bill the external users for their portion?
~ An external user can join and leave an organisation at any time. Do we want to take on the burden of adding and removing external users from our Databricks workspace whenever there is a change of personnel?
The New Way of Sharing Data
Databricks Clean Rooms allow you to virtually put your data in a “share”, much like a folder, and share that “share” with recipients. A recipient can be an external user with or without a Databricks workspace of their own.
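As a minimal sketch of the provider side, a share can be created and populated with a couple of SQL statements run from a Databricks notebook. The share, catalog, schema, and table names below are made up for illustration:

```python
# Run on the data provider's side, e.g. in a Databricks notebook.
# sales_share and sales_catalog.reporting.orders are illustrative names.

# Create an empty share (a named collection of data assets)
spark.sql("CREATE SHARE IF NOT EXISTS sales_share COMMENT 'Orders data for partners'")

# Put a Unity Catalog table into the share
spark.sql("ALTER SHARE sales_share ADD TABLE sales_catalog.reporting.orders")
```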
If your recipient operates on a Databricks workspace, they should provide you with a string that uniquely identifies them, known as a sharing identifier. They can find this string in the settings of their Databricks workspace profile. Once you create a recipient in your Databricks workspace using this string and grant it access to the “share”, the contents of the share will show up in your recipient’s Unity Catalog. They can then create a cluster and query the shared data however they need.
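Continuing the sketch, and assuming the recipient has sent you their sharing identifier (the identifier and recipient name below are placeholders):

```python
# Create a recipient from the sharing identifier the other party sent you.
# The identifier format is cloud:region:uuid; this one is a placeholder.
spark.sql("""
  CREATE RECIPIENT IF NOT EXISTS partner_co
  USING ID 'aws:us-west-2:12345678-1234-1234-1234-123456789012'
""")

# Grant the recipient read access to the share
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co")
```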
When your recipient does not have a Databricks workspace, they can use Python, Java, Power BI, pandas, or Apache Spark to access the tables, as shown below. This does not require you to create any Databricks clusters for them to use.
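For instance, with the open source delta-sharing Python library, a recipient only needs the credential profile file the provider sends them. The share, schema, and table names here are the illustrative ones from the earlier sketch:

```python
# pip install delta-sharing
import delta_sharing

# Profile file downloaded via the activation link the provider sent
profile_file = "config.share"

# List everything that has been shared with us
client = delta_sharing.SharingClient(profile_file)
print(client.list_all_tables())

# Load one shared table straight into a pandas DataFrame;
# the URL format is <profile-file>#<share>.<schema>.<table>
df = delta_sharing.load_as_pandas(f"{profile_file}#sales_share.reporting.orders")
print(df.head())
```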
What can you include in a “share”?
Tables, views, notebooks, and ML models, all in Public Preview as of June 2024. If a recipient should only see aggregated data based on selective rows and columns, you can use views to easily manage the granularity of the data you share; see the sketch below. Sharing notebooks allows you to communicate with data recipients about how to query the data, call the model, and much more.
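A sketch of the view approach, reusing the illustrative names from earlier: expose an aggregate view in the share instead of the raw table.

```python
# Define a view that exposes only aggregated rows and selected columns
spark.sql("""
  CREATE OR REPLACE VIEW sales_catalog.reporting.orders_by_region AS
  SELECT region, order_date, SUM(amount) AS total_amount
  FROM sales_catalog.reporting.orders
  GROUP BY region, order_date
""")

# Share the view rather than the underlying table
spark.sql("ALTER SHARE sales_share ADD VIEW sales_catalog.reporting.orders_by_region")
```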
Views can be shared cross-platform (in Private Preview as of June 2024) to any recipient, regardless of which cloud, region, or platform they use.
Materialised views and streaming tables, both in Private Preview as of June 2024, can now be easily shared with recipients without the need to create and maintain any additional copies or DLT pipelines.
Governance
All access control on your data assets is managed via Unity Catalog. You can easily check who has what level of access to which data as of today, which helps data admins keep track of data access in an elegant way.
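For example, assuming the share name from the earlier sketch, two statements give you a quick audit view of a share:

```python
# Who has been granted access to this share?
spark.sql("SHOW GRANTS ON SHARE sales_share").show()

# Which data assets does the share currently contain?
spark.sql("SHOW ALL IN SHARE sales_share").show()
```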
You can also secure your data shares with network IP whitelists (IP access lists), so that recipients can only connect from approved networks.
Costs
Merely sharing data with external recipients does not in itself cost the data provider anything. External recipients with a Databricks workspace will spin up and pay for their own clusters when querying the data. External recipients without a Databricks workspace simply use the compute power of their local machine to access the data via the Delta Sharing protocol. HOWEVER, for every directory and file read request made against your shared data files and tables, your cloud provider (Azure, AWS, GCP) charges you its standard storage read rate. So make sure you optimise your shared tables for read performance before sharing your data.
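One way to do that, assuming a Delta table and an illustrative column to cluster on, is to compact and co-locate the files before you share:

```python
# Compact small files and co-locate rows that are often read together,
# reducing the number of storage read requests recipients trigger.
spark.sql("OPTIMIZE sales_catalog.reporting.orders ZORDER BY (order_date)")
```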
Final Notes
A data clean room is such a simple yet powerful concept: organising data recipients and selective data assets in a clean room to prevent cross-contamination, without physically copying data, with independent compute usage and billing, and all of it built on open source technology, meaning even non-Databricks users can benefit from Delta Sharing. What a wonderful world we live in, with the advance of technology.