Databricks Spark Integration with Unity Catalog Overview
Databricks Unity Catalog is a shared metastore at the Databricks account level that streamlines management of multiple Databricks workspaces for users.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. This integration allows you to add your tables to the Unity Catalog metastore so that you can use the metastore from any workspace while protecting your data with Immuta policies.
Unity Catalog Object Model
Databricks clusters with Unity Catalog use the following hierarchy of data objects:
Metastore: Created at the account level and attached to one or more Databricks workspaces. The metastore contains metadata of the configured tables available to query. All clusters in a workspace use that workspace's configured metastore, and all workspaces configured to use the same metastore share those tables.
Catalog: A catalog sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas.
Schema: Organizes tables and views.
Table: Tables can be managed or external.
For details about the Unity Catalog object model, search for Unity Catalog in Databricks documentation.
Policy Enforcement
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. For Immuta to see all relevant tables that have a data source mapped to them, Immuta requires a personal access token (PAT) from a privileged Databricks metastore owner, and that metastore owner must have been granted access to all the relevant data. Immuta stores this token encrypted and uses it to give an Immuta-enabled Databricks cluster access to more data than a specific user on that cluster might otherwise have.
You must use an Immuta-provided cluster policy to start your Databricks cluster. These cluster policies explicitly set the data security mode to the Custom setting, which allows Immuta to enforce policies on top of Unity Catalog, and they add Unity Catalog support to the cluster. Once your configuration is complete, policy enforcement is the same as for the Databricks Spark integration.
For configuration instructions, see the Configure Databricks Spark Integration with Unity Catalog Support guide.
Immuta Data Sources in Unity Catalog
The Unity Catalog data object model introduces a three-tiered namespace, as outlined above. Consequently, your Databricks tables registered as data sources in Immuta now reference the catalog, the schema (also called a database), and the table.
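The three-part reference can be sketched in Python as follows. This is a minimal illustration, not part of Immuta or Databricks: the TableRef class and the catalog, schema, and table names are hypothetical, and the backtick quoting follows the usual Spark SQL identifier convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A table identified by catalog, schema, and table name (hypothetical helper)."""

    catalog: str
    schema: str
    table: str

    def qualified_name(self) -> str:
        # Backtick-quote each part so names with special characters stay valid
        # in Spark SQL; an embedded backtick is escaped by doubling it.
        parts = (self.catalog, self.schema, self.table)
        return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# Hypothetical names, used only for illustration.
ref = TableRef(catalog="sales_catalog", schema="finance", table="orders")
query = f"SELECT * FROM {ref.qualified_name()} LIMIT 10"
print(query)
```

On a cluster, a string like this could be passed to spark.sql(...); the sketch only builds the text so that it runs without a Spark session.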
External Tables
If a Databricks table is not a Delta table (for example, if it uses ORC, Parquet, or another file format), it must be an external table. This is a Databricks Unity Catalog restriction and is not related to Immuta. See the Databricks documentation for details about creating the objects required to use external locations.
External locations and storage credentials must be configured correctly on Immuta-enabled clusters to allow tables to be created in a non-managed path. Immuta does not control access to storage credentials or external locations, and a user has the same level of access to these on an Immuta-enabled cluster as on a cluster that is not Immuta-enabled.
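As a rough sketch of what registering a non-Delta table as an external table might look like, the Python helper below builds a Spark SQL CREATE TABLE statement with an explicit LOCATION. The helper name, the table name, and the storage path are hypothetical, and the path is assumed to fall under an external location that is already configured in Databricks.

```python
def create_external_table_sql(qualified_name: str, file_format: str, location: str) -> str:
    """Build a CREATE TABLE statement for an external (non-managed) table."""
    if not location:
        raise ValueError("an external table needs an explicit storage location")
    if "'" in location:
        # Keep the sketch simple: reject paths that would need SQL escaping.
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE TABLE {qualified_name} "
        f"USING {file_format.upper()} "
        f"LOCATION '{location}'"
    )


# Hypothetical table name and storage path, used only for illustration.
statement = create_external_table_sql(
    "sales_catalog.finance.orders_parquet",
    "parquet",
    "s3://example-bucket/external/orders",
)
print(statement)
```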
Scratch Paths
Scratch paths are locations in storage that users can read from and write to without Immuta policies applied. Immuta's support for scratch paths in Unity Catalog is designed to work with external locations.
You must configure external locations for any scratch path and grant those locations to the metastore owner user being used to connect Immuta. Creating a database in a scratch location in an Immuta-enabled cluster with Unity Catalog differs from how it is supported on a non-Immuta cluster with Unity Catalog; on a non-Immuta cluster, a database will not have a location if it is created against a catalog other than the legacy hive_metastore.
Immuta requires the database location to be specified in the create database call on an Immuta-enabled cluster so that Immuta can validate whether the read or write is permitted.
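A create database call with an explicit location might look like the statement built by the Python sketch below. The helper name, the database name, and the storage path are hypothetical, and the statement uses the general Spark SQL form CREATE DATABASE ... LOCATION; whether that clause is accepted for a given catalog is an assumption to verify against the Databricks documentation and the Configure Scratch Paths guide.

```python
def create_scratch_database_sql(database: str, location: str) -> str:
    """Build a CREATE DATABASE statement that always carries an explicit location."""
    if not location:
        # Per the text above, the location must be part of the call.
        raise ValueError("a scratch database must specify its location")
    if "'" in location:
        # Keep the sketch simple: reject paths that would need SQL escaping.
        raise ValueError("location must not contain single quotes")
    return f"CREATE DATABASE IF NOT EXISTS {database} LOCATION '{location}'"


# Hypothetical database name and scratch path, used only for illustration.
scratch_statement = create_scratch_database_sql(
    "team_scratch",
    "s3://example-bucket/scratch/team_scratch",
)
print(scratch_statement)
```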
For configuration instructions, see the Configure Scratch Paths guide.
Data Flow
The data flow for Unity Catalog is the same as the data flow for the Databricks Spark integration.
The only change is that Databricks metadata is saved in Unity Catalog at the account level, not the workspace level.