Databricks metastore magic allows you to migrate your data from the Databricks legacy Hive metastore to the Unity Catalog metastore while protecting data and maintaining your current processes in a single Immuta instance.
Databricks metastore magic is for customers who intend to use either the Databricks Spark with Unity Catalog support integration or the Databricks Unity Catalog integration, but who would like to protect tables in the Hive metastore. It requires that Unity Catalog support is enabled in Immuta.
Databricks has two built-in metastores that contain metadata about your tables, views, and storage credentials:
Legacy Hive metastore: Created at the workspace level. This metastore contains metadata of the configured tables in that workspace available to query.
Unity Catalog metastore: Created at the account level and is attached to one or more Databricks workspaces. This metastore contains metadata of the configured tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those tables.
Databricks allows you to use the legacy Hive metastore and the Unity Catalog metastore simultaneously. However, Unity Catalog does not support controls on the Hive metastore, so you must attach a Unity Catalog metastore to your workspace and move existing databases and tables to the attached Unity Catalog metastore to use the governance capabilities of Unity Catalog.
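As a minimal sketch of that migration (the catalog, schema, and table names are illustrative, and CREATE TABLE ... AS SELECT is only one of the upgrade paths Databricks supports), a Hive metastore table can be copied into a Unity Catalog catalog with SQL such as:

```sql
-- Copy an existing Hive metastore table into a Unity Catalog catalog.
-- 'main.sales.customers' and 'hive_metastore.sales.customers' are illustrative names.
CREATE TABLE main.sales.customers
AS SELECT * FROM hive_metastore.sales.customers;
```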
Immuta's Databricks Spark integration and Unity Catalog integration enforce access controls on the Hive and Unity Catalog metastores, respectively. However, because these metastores have two distinct security models, users were discouraged from using both in a single Immuta instance before metastore magic; the Databricks Spark integration and Unity Catalog integration were unaware of each other, so using both concurrently caused undefined behavior.
Metastore magic reconciles the distinct security models of the legacy Hive metastore and the Unity Catalog metastore, allowing you to use multiple metastores (specifically, the Hive metastore or AWS Glue Data Catalog alongside Unity Catalog metastores) within a Databricks workspace and single Immuta instance and keep policies enforced on all your tables as you migrate them. The diagram below shows Immuta enforcing policies on registered tables across workspaces.
In clusters A and D, Immuta enforces policies on data sources in each workspace's Hive metastore and in the Unity Catalog metastore shared by those workspaces. In clusters B, C, and E (which don't have Unity Catalog enabled in Databricks), Immuta enforces policies on data sources in the Hive metastores for each workspace.
With metastore magic, the Databricks Spark integration enforces policies only on data in the Hive metastore, while the Databricks Spark integration with Unity Catalog support or the Unity Catalog integration enforces policies on tables in the Unity Catalog metastore. The table below illustrates this policy enforcement.

Table location | Databricks Spark integration | Databricks Spark integration with Unity Catalog support | Databricks Unity Catalog integration |
---|---|---|---|
Hive metastore | Policies enforced | Policies enforced | Policies not enforced |
Unity Catalog metastore | Policies not enforced | Policies enforced | Policies enforced |
Essentially, after you have enabled Unity Catalog in Immuta, you have two options to enforce policies on all your tables as you migrate:
Enforce plugin-based policies on all tables: Enable the Databricks Spark integration with Unity Catalog support. For details about plugin-based policies, see this overview guide.
Enforce plugin-based policies on Hive metastore tables and Unity Catalog native controls on Unity Catalog metastore tables: Enable the Databricks Spark integration and the Databricks Unity Catalog integration. Some Immuta policies are not supported in the Databricks Unity Catalog integration. Reach out to your Immuta representative for documentation of these limitations.
Databricks Spark integration with Unity Catalog support and Databricks Unity Catalog integration
Enabling the Databricks Spark integration with Unity Catalog support and the Databricks Unity Catalog integration is not supported. Do not use both integrations to enforce policies on your tables.
Databricks SQL cannot run the Databricks Spark plugin to protect tables, so Hive metastore data sources will not have policies enforced in Databricks SQL.
To enforce policies on data sources in Databricks SQL, use Hive metastore table access controls to manually lock down Hive metastore data sources and the Databricks Unity Catalog integration to protect tables in the Unity Catalog metastore. Table access control is enabled by default on SQL warehouses, and any Databricks cluster without the Immuta plugin must have table access control enabled.
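As a hedged sketch (the table and group names are illustrative; check the Databricks table access control documentation for the privileges available in your workspace), locking down a Hive metastore table might look like:

```sql
-- Remove broad access to the legacy table and grant SELECT only to an approved group.
REVOKE ALL PRIVILEGES ON TABLE hive_metastore.sales.customers FROM `users`;
GRANT SELECT ON TABLE hive_metastore.sales.customers TO `approved_analysts`;
```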
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.

Legend:
✓ The feature or integration is enabled.
✗ The feature or integration is disabled.

Example cluster | Databricks Runtime | Unity Catalog in Databricks | Databricks Spark integration | Databricks Spark with Unity Catalog support | Databricks Unity Catalog integration |
---|---|---|---|---|---|
Cluster 1 | 9.1 | ✗ | ✓ | Unavailable | Unavailable |
Cluster 2 | 10.4 | ✗ | ✓ | Unavailable | Unavailable |
Cluster 3 | 11.3 | ✗ | ✓ | Unavailable | ✗ |
Cluster 4 | 11.3 | ✓ | ✗ | ✓ | ✗ |
Cluster 5 | 11.3 | ✓ | ✓ | ✗ | ✓ |
The following requirements must be met to use Unity Catalog with Immuta:
Databricks Runtime 11.3.
Unity Catalog enabled on your Databricks cluster.
Unity Catalog metastore created and attached to a Databricks workspace.
The metastore owner you are using to manage permissions has been granted access to all catalogs, schemas, and tables that will be protected by Immuta. Data protected by Immuta should only be granted to privileged users in Unity Catalog so that the only view of that data is through an Immuta-enabled cluster.
You have generated a personal access token for the metastore owner that Immuta can use to read data in Unity Catalog.
You do not plan to use non-Unity Catalog enabled clusters with Immuta data sources. Once enabled, all access to data source tables must be on Databricks clusters with Unity Catalog enabled on runtime 11.3.
For details about the supported features (project workspaces, Databricks tag ingestion, user impersonation, native query audit, and multiple integrations), see the pre-configuration details page for Databricks.
Databricks metastore magic allows you to migrate your data from the Databricks legacy Hive metastore to the Unity Catalog metastore while protecting data and maintaining your current processes in a single Immuta instance.
No configuration is necessary to enable this feature. For more details, see the Databricks metastore magic overview.
The Databricks Spark integration with Unity Catalog support has the following limitations:
Native workspaces are not supported. Creating a native workspace on a Unity Catalog enabled host is undefined behavior and may cause data loss or crashes.
Tables must be GRANTed to the Databricks metastore owner whose token is configured for the integration. For a table to be accessible, the full chain of catalog, schema, and table must all have the appropriate grants to this administrator user to allow them to SELECT from the table.
Direct file access to Immuta data sources is not supported.
Limited Enforcement (called available until protected by policy on the App Settings page), which makes Immuta clusters available to all Immuta users until protected by a policy, is not supported. You must set IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS and IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES to false in your cluster policies manually or by selecting Protected until made available by policy in the Databricks integration section of the App Settings page.
R notebooks may have path-related errors accessing tables.
Databricks on Azure will return errors when creating a database in a scratch location when Unity Catalog is enabled.
Databricks accounts deployed on Google Cloud Platform are not supported.
To set up the integration, see Configure Databricks Spark integration with Unity Catalog support.
Legacy Metastore
If the database or table is created in the legacy metastore (hive_metastore), you don't need a storage credential or an external location, but the cluster will need the correct credentials configured if the path is in remote storage.
Immuta's support for scratch paths in Unity Catalog works with external locations.
Grant those locations to the metastore administrator user being used to connect Immuta.
The following example creates an external location using the preconfigured storage credential cred and configures the grants for the metastore admin admin@company.com:
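A minimal sketch (the external location name and URL are illustrative):

```sql
-- Create an external location backed by the storage credential 'cred', then grant it
-- to the metastore admin so Immuta-enabled clusters can validate access to the path.
CREATE EXTERNAL LOCATION IF NOT EXISTS example_scratch
  URL 's3://example-bucket/scratch/'
  WITH (STORAGE CREDENTIAL cred);

GRANT ALL PRIVILEGES ON EXTERNAL LOCATION example_scratch TO `admin@company.com`;
```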
Immuta requires the database location to be specified in the create database call on an Immuta-enabled cluster so that Immuta can validate the read or write is permitted. For example:
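A minimal sketch (the database name and path are illustrative and assume the external location above covers the path):

```sql
-- The explicit LOCATION lets Immuta validate that the read or write is permitted.
CREATE DATABASE example_scratch_db LOCATION 's3://example-bucket/scratch/example_scratch_db';
```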
Databricks Unity Catalog is a shared metastore at the Databricks account level that streamlines management of multiple Databricks workspaces for users.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. This integration allows you to add your tables to the Unity Catalog metastore so that you can use the metastore from any workspace while protecting your data with Immuta policies.
Databricks clusters with Unity Catalog use the following hierarchy of data objects:
Metastore: Created at the account level and is attached to one or more Databricks workspaces. The metastore contains metadata of the configured tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those tables.
Catalog: A catalog sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas.
Schema: Organizes tables and views.
Table: Tables can be managed or external tables.
For details about the Unity Catalog object model, search for Unity Catalog in Databricks documentation.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. For Immuta to see all relevant tables that have a data source mapped to them, Immuta requires a privileged metastore owner’s personal access token (PAT) from Databricks, and that metastore owner must have been granted access to all the relevant data. This token is stored encrypted to provide an Immuta-enabled Databricks cluster access to more data than a specific user on that cluster might otherwise have.
You must use an Immuta-provided cluster policy to start your Databricks cluster, as these cluster policies explicitly set the data security mode to the Custom setting that allows Immuta to enforce policies on top of Unity Catalog and add Unity Catalog support to the cluster. Once your configuration is complete, policy enforcement will be the same as the policy enforcement for the Databricks Spark integration.
For configuration instructions, see the Configure Databricks Spark Integration with Unity Catalog Support guide.
The Unity Catalog data object model introduces a 3-tiered namespace, as outlined above. Consequently, your Databricks tables registered as data sources in Immuta will now reference the catalog, schema (also called a database), and the table.
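For instance, a query against such a data source now uses a three-part identifier (the names below are illustrative):

```sql
-- catalog.schema.table
SELECT * FROM main.sales.customers;
```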
If a Databricks table is not a Delta table (if it is an ORC, Parquet, or other file format), it must be an external table. This is a Databricks Unity Catalog restriction and is not related to Immuta. See the Databricks documentation for details about creating these objects to allow external locations to be used.
External locations and storage credentials must be configured correctly on Immuta-enabled clusters to allow tables to be created in a non-managed path. Immuta does not control access to storage credentials or external locations, and a user will have the same level of access to these on an Immuta-enabled cluster as they do on a non-Immuta enabled cluster.
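As a hedged sketch (the names and path are illustrative, and it assumes an external location covering the path has already been created and granted), a Parquet table must be created as an external table:

```sql
-- Non-Delta formats must be external tables in Unity Catalog; the LOCATION must be
-- covered by an external location the creating user can access.
CREATE TABLE main.analytics.events_parquet
USING PARQUET
LOCATION 's3://example-bucket/events/';
```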
Scratch paths are locations in storage that users can read and write to without Immuta policies applied. Immuta's support for scratch paths in Unity Catalog is designed to work with external locations.
You must configure external locations for any scratch path and grant those locations to the metastore owner user being used to connect Immuta. Creating a database in a scratch location in an Immuta-enabled cluster with Unity Catalog differs from how it is supported on a non-Immuta cluster with Unity Catalog; on a non-Immuta cluster, a database will not have a location if it is created against a catalog other than the legacy hive_metastore.
Immuta requires the database location to be specified in the create database call on an Immuta-enabled cluster so that Immuta can validate whether the read or write is permitted, as illustrated in the example below:
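A minimal sketch (the database name and scratch path are illustrative):

```sql
-- The explicit LOCATION lets Immuta validate the scratch path read or write.
CREATE DATABASE example_scratch_db LOCATION 's3://example-bucket/scratch/example_scratch_db';
```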
For configuration instructions, see the Configure Scratch Paths guide.
The data flow for Unity Catalog is the same as the data flow for the Databricks Spark integration.
The only change is that Databricks metadata is saved in Unity Catalog at the account level, not the workspace level.
Immuta clusters use the configured metastore owner personal access token (PAT) to interact with the Unity Catalog metastore. Before registering the table as a data source in Immuta, the catalog, schema, and table being registered must be granted to the configured Unity Catalog metastore owner using one of two methods so that the table is visible to Immuta:
Automatically grant access to everything with Privilege Model 1.0: Immuta recommends upgrading the Privilege Model for Unity Catalog to 1.0. This upgrade allows administrators and owners to quickly grant access to everything in a given catalog or schema using a single grant statement. See the Databricks documentation for instructions on enabling Privilege Model 1.0.
Automatically grant select access to everything in a catalog by running the SQL statement below as the metastore owner or catalog owner:
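A minimal sketch (the catalog name and principal are illustrative; under Privilege Model 1.0, privileges granted on the catalog are inherited by its schemas and tables):

```sql
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG example_catalog TO `metastore-owner@company.com`;
```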
If you are not using Privilege Model 1.0, manually grant access to specific tables by running the SQL statements below as the administrator or table owner:
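A minimal sketch (the object and principal names are illustrative; USAGE is the pre-1.0 privilege that corresponds to USE CATALOG and USE SCHEMA in Privilege Model 1.0):

```sql
GRANT USAGE ON CATALOG example_catalog TO `metastore-owner@company.com`;
GRANT USAGE ON SCHEMA example_catalog.example_schema TO `metastore-owner@company.com`;
GRANT SELECT ON TABLE example_catalog.example_schema.example_table TO `metastore-owner@company.com`;
```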
To register a Databricks table as an Immuta data source, Immuta requires a running Databricks cluster that it can use to determine the schema and metadata of the table in Databricks. This cluster can be either
a non-Immuta cluster: Use a non-Immuta cluster if you have over 1,000 tables to register as Immuta data sources. This is the fastest and least error-prone method to add many data sources at a time.
an Immuta-enabled cluster: Use an Immuta-enabled cluster if you have a few tables to register as Immuta data sources.
Limited enforcement (available until protected by policy access model) is not supported
You must set IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS and IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES to false in your cluster policies manually or by selecting Protected until made available by policy in the Databricks integration section of the App Settings page. See the Databricks Spark integration with Unity Catalog support limitations for details.
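As a hedged sketch, the resulting Spark environment variables on the cluster should look like the following (how they are expressed depends on the Immuta-provided cluster policy you use):

```
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS=false
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES=false
```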
Once your cluster is running, register your data from your non-Immuta or Immuta-enabled cluster.
If you used a non-Immuta cluster, convert the cluster to an Immuta cluster with Immuta cluster policies once data sources have been created.
Note: When the Unity Catalog integration is enabled, a schema must be specified when registering data sources backed by tables in the legacy hive_metastore.
Existing Data Sources
Existing data sources will reference the default catalog, hive_metastore, once Unity Catalog is enabled. However, this default catalog will not be used when you create new data sources.
If you already have an Immuta Databricks Spark integration configured, follow the steps below to enable Unity Catalog support in Immuta.
Enable Unity Catalog support on the App Settings page.
Re-push cluster policies to your Databricks cluster. Note that you must set IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS and IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES to false in your cluster policies manually or by selecting Protected until made available by policy in the Databricks integration section of the App Settings page. See the Databricks Spark integration with Unity Catalog support limitations for details.
Re-start your Databricks cluster with the new cluster policy applied.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. This integration provides a pathway for you to add your tables to the Unity Catalog metastore so that you can use the metastore from any workspace while protecting your data with Immuta policies.
This integration has the following requirements:
Databricks Runtime 11.3.
Unity Catalog enabled on your Databricks cluster.
The metastore owner you are using to manage permissions has been granted access to all catalogs, schemas, and tables that will be protected by Immuta. Data protected by Immuta should only be granted to privileged users in Unity Catalog so that the only view of that data is through an Immuta-enabled cluster.
You have generated a personal access token for the metastore owner that Immuta can use to read data in Unity Catalog.
You do not plan to use non-Unity Catalog enabled clusters with Immuta data sources. Once enabled, all access to data source tables must be on Databricks clusters with Unity Catalog enabled on runtime 11.3.
Deprecation notice
Support for this integration has been deprecated. This integration will be removed in the 2024.2 LTS release.
Enabling Unity Catalog
The integration cannot be disabled once enabled, as it will permanently migrate all data sources to support the additional Unity Catalog controls and hierarchy. Unity Catalog support in Immuta is enabled globally across all Databricks data sources and integrations.
In Unity Catalog, catalogs manage permissions across a set of databases.
You can opt to set the default catalog for queries run without explicitly specifying the catalog for a table by adding the following Spark configuration to your Databricks cluster:
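A minimal sketch (the catalog name is illustrative; confirm the property name against the Databricks documentation for your runtime):

```
spark.databricks.sql.initial.catalog.name example_catalog
```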
This configuration does not limit the cluster to only using this catalog; it merely sets the default for queries run without explicitly specifying the catalog for a table.
Click the App Settings icon in the left sidebar.
Scroll to the Global Integration Settings section and check the Enable Databricks Unity Catalog support in Immuta checkbox.
Complete the following fields:
Workspace Host Name: The hostname (also known as the instance name) of a Databricks workspace instance on an account you want to connect to Immuta. This Databricks workspace is used to run short duration Databricks jobs so that Immuta can pull a token for the metastore owner.
Databricks Account Administrator Personal Access Token: Immuta requires you to provide a personal access token of a Databricks metastore administrator so that Immuta can protect all the available data sources. Databricks metastore administrators are set by changing the owner of a metastore in the account console (or by an account-level administrator using DDL statements). Metastores can be owned by a group, which enables more than one user to act as an owner.
Schedule: Immuta uses the administrator token to keep the Immuta-enabled clusters synchronized and needs to periodically refresh it to ensure that the cluster does not use an expired token. This schedule is in cron syntax and will be used to launch the synchronization job.
The default value for this runs the token sync job at midnight daily. This cadence should be sufficient for most Unity Catalog configurations; however, if the timing of the job is problematic you can adjust the time of day to run at a more convenient time.
Token Sync Retries: The number of times Immuta will retry the token request. The default value should work for most systems, but in environments with networking or load issues, consider increasing this number.
Save the configuration.
After saving the configuration, Immuta will be configured to use Unity Catalog data sources and will automatically sync the Databricks metastore administrator API token, which is required for the integration to correctly view and apply policies to any data source in Databricks.
Check that your token sync job was correctly run in Databricks. Navigate to Workflows and click the Job runs tab. Search for a job that starts with Immuta Unity Token Sync.
If the token sync fails, there will be log messages in the web service logs; these should help you diagnose the problem, such as when the connection to Databricks is not functioning. If the token is not synchronized correctly, an error will appear when performing actions in Databricks.
If the token expires, the following error will appear when performing actions on any Immuta-enabled Databricks cluster: ImmutaException: 403: Invalid access token.
In this case, you can re-run the token sync job by modifying the schedule for token synchronization on the App Settings page. When the configuration is saved, the token synchronization job will run again immediately (regardless of schedule) and will refresh the token. Consider shortening the window between token synchronization jobs by editing the schedule if you see this error.
Create any catalog that Immuta will protect on a non-Immuta cluster as the metastore admin, who is tied to a specific metastore attached to one or more Databricks workspaces. That way, the catalog will be owned by the metastore admin, which gives broad permissions to grant or revoke objects in the catalog to other users. If this catalog is intended to be protected by Immuta, the data should not be granted to other users besides the metastore admin. Then register the tables as Immuta data sources and build policies in Immuta to restrict access to data.