Configure Databricks Spark with Unity Catalog

Enabling Unity Catalog

Unity Catalog support in Immuta is enabled globally across all Databricks data sources and integrations. Once enabled, the integration cannot be disabled, because enabling it permanently migrates all data sources to support the additional Unity Catalog controls and hierarchy.

Prerequisites

  • Databricks Runtime 11.3.

  • Unity Catalog enabled on your Databricks cluster.

  • Unity Catalog metastore created and attached to a Databricks workspace.

  • The metastore owner you are using to manage permissions has been granted access to all catalogs, schemas, and tables that will be protected by Immuta. In Unity Catalog, access to data protected by Immuta should be granted only to privileged users, so that the only view of that data is through an Immuta-enabled cluster.

  • You have generated a personal access token for the metastore owner that Immuta can use to read data in Unity Catalog.

  • You do not plan to use non-Unity Catalog enabled clusters with Immuta data sources. Once enabled, all access to data source tables must be on Databricks clusters with Unity Catalog enabled on runtime 11.3.

Create a Catalog in Databricks

In Unity Catalog, catalogs manage permissions across a set of databases.

  1. Create a new catalog on a non-Immuta cluster as the metastore admin, who is tied to a specific metastore attached to one or more Databricks workspaces. The catalog will then be owned by the metastore admin, which gives that admin broad permissions to grant or revoke access to objects in the catalog for other users. If this catalog is intended to be protected by Immuta, its data should not be granted to any user other than the metastore admin.
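As a rough illustration of step 1, the sketch below builds a SQL statement that a metastore admin might run on a non-Immuta cluster to create the catalog. The helper names and the catalog name `immuta_protected` are hypothetical, and using `CREATE CATALOG IF NOT EXISTS` is an assumption about how you choose to create the catalog rather than a step prescribed by this guide.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote a SQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def create_catalog_sql(catalog: str) -> str:
    """Build a CREATE CATALOG statement (hypothetical helper for illustration)."""
    if not catalog:
        raise ValueError("catalog name must not be empty")
    return f"CREATE CATALOG IF NOT EXISTS {quote_identifier(catalog)}"


# Example: the statement to run as the metastore admin.
statement = create_catalog_sql("immuta_protected")
print(statement)
```

Because the admin who runs the statement owns the resulting catalog, no further grants are needed for a catalog that Immuta will protect.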

You can optionally set the default catalog for queries that do not explicitly specify a table's catalog by adding the following Spark configuration to your Databricks cluster:

spark.databricks.sql.initial.catalog.name <catalog name>

This configuration does not restrict the cluster to this catalog; it only sets the default that applies when a query does not name a catalog.
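If you manage cluster definitions as JSON, the setting can be merged into the cluster's Spark configuration programmatically. The following is a minimal sketch, assuming the cluster definition keeps its Spark settings under a `spark_conf` key; the function name, cluster name, and catalog name are hypothetical.

```python
import copy
import json

# Spark configuration key taken from the guide above.
DEFAULT_CATALOG_KEY = "spark.databricks.sql.initial.catalog.name"


def with_default_catalog(cluster_spec: dict, catalog: str) -> dict:
    """Return a copy of cluster_spec with the default catalog set in spark_conf."""
    updated = copy.deepcopy(cluster_spec)
    # Create spark_conf if it is absent, then set (or overwrite) the key.
    updated.setdefault("spark_conf", {})[DEFAULT_CATALOG_KEY] = catalog
    return updated


# Hypothetical cluster definition; only spark_conf matters for this sketch.
spec = {
    "cluster_name": "immuta-unity",
    "spark_conf": {"spark.sql.shuffle.partitions": "8"},
}
print(json.dumps(with_default_catalog(spec, "immuta_protected"), indent=2))
```

The helper copies the definition instead of mutating it, so existing Spark settings are preserved and the original definition stays unchanged.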

Enable Databricks Spark with Unity Catalog Support in Immuta

  1. Click the App Settings icon in the left sidebar.

  2. Scroll to the Native Integration Settings section and check the Enable Databricks Unity Catalog support in Immuta checkbox.

  3. Complete the following fields:

    • Workspace Host Name: The hostname (also known as the instance name) of a Databricks workspace instance on an account you want to connect to Immuta. This Databricks workspace is used to run short duration Databricks jobs so that Immuta can pull a token for the metastore owner.

    • Databricks Account Administrator Personal Access Token: Immuta requires a personal access token of a Databricks metastore administrator so that it can protect all available data sources. Databricks metastore administrators are set by changing the owner of a metastore in the account console (or by an account-level administrator using DDL statements). A metastore can be owned by a group, which enables more than one user to act as owner.

    • Schedule: Immuta uses the administrator token to keep the Immuta-enabled clusters synchronized and needs to periodically refresh it to ensure that the cluster does not use an expired token. This schedule is in cron syntax and will be used to launch the synchronization job.

      The default value runs the token sync job daily at midnight. This cadence should be sufficient for most Unity Catalog configurations; however, if that timing is problematic, you can adjust the schedule so the job runs at a more convenient time of day.

    • Token Sync Retries: The number of times Immuta will retry its request for the token. The default value should work for most systems, but consider increasing it in environments with networking or load issues.

  4. Save the configuration.

After saving the configuration, Immuta will be configured to use Unity Catalog data sources and will automatically sync the Databricks metastore administrator API token, which is required for the integration to correctly view and apply policies to any data source in Databricks.
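To reason about how long a token may go without a refresh, it can help to compute when a given schedule next fires. The sketch below is illustrative only: it assumes the Schedule field accepts a five-field cron expression (minute, hour, day of month, month, day of week), which this guide does not confirm, and it handles only daily schedules. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Next run time for a daily 'M H * * *' cron expression (five-field form assumed)."""
    minute, hour, day_of_month, month, day_of_week = cron.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")
    candidate = now.replace(hour=int(hour), minute=int(minute), second=0, microsecond=0)
    # If today's slot has already passed, the job next runs tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


# A daily-at-midnight schedule, evaluated at noon, next fires the following midnight.
print(next_daily_run("0 0 * * *", datetime(2024, 1, 1, 12, 0)))
```

With a daily schedule the gap between refreshes is up to 24 hours, so a token whose lifetime is shorter than that would call for a more frequent schedule.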

Token Synchronization Troubleshooting

  • Check that your token sync job ran successfully in Databricks. Navigate to Workflows and click the Job runs tab. Search for a job whose name starts with Immuta Unity Token Sync.

  • If the token sync fails, the failure is recorded in the web service logs. These log messages should be present if the connection to Databricks is not functioning. If the token is not synchronized correctly, the following error appears when performing actions in Databricks:

    org.apache.spark.sql.AnalysisException: ImmutaAnalysisException: The Unity Catalog token is missing from the
    Immuta system details.
    
    Make sure the Unity Catalog token job is configured in Immuta and the job was able to successfully push the
    token back to the Immuta Web Service.

  • If the token expires, the following error will appear when performing actions on any Immuta-enabled Databricks cluster: ImmutaException: 403: Invalid access token.

    In this case, you can re-run the token sync job by modifying the schedule for token synchronization on the App Settings page. When the configuration is saved, the token synchronization job will run again immediately (regardless of schedule) and will refresh the token. Consider shortening the window between token synchronization jobs by editing the schedule if you see this error.
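When triaging cluster errors, the two failure modes above can be told apart by the text of the exception. The sketch below is a hypothetical helper, not part of Immuta; it matches only the message fragments quoted in this section, and the returned labels are invented for illustration.

```python
# Fragments copied from the error messages quoted in this section.
MISSING_TOKEN_MARKER = "The Unity Catalog token is missing"
EXPIRED_TOKEN_MARKER = "403: Invalid access token"


def classify_token_error(message: str) -> str:
    """Map an error message to a token-sync failure mode (labels are illustrative)."""
    if MISSING_TOKEN_MARKER in message:
        # The sync job never pushed a token back to the Immuta web service.
        return "token-not-synced"
    if EXPIRED_TOKEN_MARKER in message:
        # A token was synced but has expired; re-run the sync job.
        return "token-expired"
    return "unrelated"


print(classify_token_error("ImmutaException: 403: Invalid access token."))
```

A "token-not-synced" result points to the job configuration and the web service logs, while "token-expired" points to re-running the job or shortening the schedule.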

Existing Databricks Spark Integration Migration

If you already have a Databricks Spark integration configured, follow the Enable Unity Catalog Support for an Existing Databricks Spark Integration guide.

Copyright © 2014-2024 Immuta Inc. All rights reserved.