Databricks Data Source

Deprecation notice

Support for registering Databricks Unity Catalog data sources using this legacy workflow has been deprecated. Instead, register your data using connections.

Requirements

Databricks Spark integration

When exposing a table or view from an Immuta-enabled Databricks cluster, be sure that at least one of these traits is true:

- The user exposing the tables has READ_METADATA and SELECT permissions on the target views/tables (specifically if Table ACLs are enabled).
- The user exposing the tables is listed in the immuta.spark.acl.allowlist configuration on the target cluster.
- The user exposing the tables is a Databricks workspace administrator.

Databricks Unity Catalog integration

When registering Databricks Unity Catalog securables in Immuta, use the service principal from the integration configuration and ensure it has the privileges listed below. Immuta uses this service principal continuously to orchestrate Unity Catalog policies and maintain state between Immuta and Databricks.

- USE CATALOG and MANAGE on all catalogs containing securables registered as Immuta data sources.
- USE SCHEMA on all schemas containing securables registered as Immuta data sources.
- MODIFY and SELECT on all securables you want registered as Immuta data sources. The MODIFY privilege is not required for materialized views registered as Immuta data sources, since MODIFY is not a supported privilege on that object type in Databricks.

MANAGE and MODIFY are required so that the service principal can apply row filters and column masks on the securable. To do so, the service principal must also have SELECT on the securable, USE CATALOG on its parent catalog, and USE SCHEMA on its parent schema. Because privileges are inherited, you can grant the service principal the MODIFY and SELECT privileges on all catalogs or schemas containing Immuta data sources, which automatically grants it those privileges on all current and future securables in the catalog or schema. The service principal also inherits MANAGE from the parent catalog for the purpose of applying row filters and column masks, but that privilege must be set directly on the parent catalog for grants to be fully applied.
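
The privileges above can be expressed as Unity Catalog GRANT statements. The following is a minimal sketch, not an Immuta-provided script: the function name `build_grants` and all catalog, schema, table, and principal names are hypothetical, and it assumes the `ON TABLE` form is accepted for materialized views. Verify the generated statements against the Databricks GRANT documentation before running them.

```python
def build_grants(principal, catalog, schema, tables, materialized_views=()):
    """Build GRANT statements for the service principal (illustrative only)."""
    who = f"`{principal}`"  # backticks allow special characters in the principal name
    statements = [
        # USE CATALOG and MANAGE are set directly on the parent catalog.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {who}",
        f"GRANT MANAGE ON CATALOG {catalog} TO {who}",
        # USE SCHEMA on the parent schema.
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {who}",
    ]
    for table in tables:
        statements.append(
            f"GRANT MODIFY, SELECT ON TABLE {catalog}.{schema}.{table} TO {who}"
        )
    for view in materialized_views:
        # MODIFY is not a supported privilege on materialized views: SELECT only.
        statements.append(
            f"GRANT SELECT ON TABLE {catalog}.{schema}.{view} TO {who}"
        )
    return statements


# Hypothetical names, for illustration.
for statement in build_grants("sp-app-id", "sales", "emea", ["orders"], ["daily_mv"]):
    print(statement + ";")
```

Granting MODIFY and SELECT at the catalog or schema level instead of per table would rely on privilege inheritance, as described above.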

Azure Databricks Unity Catalog limitation

Set all table-level ownership on your Unity Catalog data sources to an individual user or service principal instead of a Databricks group before proceeding. Otherwise, Immuta cannot apply data policies to the table in Unity Catalog. See the Azure Databricks Unity Catalog limitation for details.

Enter connection information

Performance recommendations

Register entire databases with Immuta and run schema monitoring jobs through the Python script provided during data source registration.
Use a Databricks administrator account to register data sources with Immuta using the UI or API. However, do not test Immuta policies with a Databricks administrator account, because administrators can bypass controls.

Navigate to the Data Sources list page and click Register Data Source.
Select the Databricks tile in the Data Platform section. When exposing a table or view from an Immuta-enabled Databricks cluster, be sure that at least one of these traits is true:
- The user exposing the tables has READ_METADATA and SELECT permissions on the target views/tables (specifically if Table ACLs are enabled).
- The user exposing the tables is listed in the immuta.spark.acl.allowlist configuration on the target cluster.
- The user exposing the tables is a Databricks workspace administrator.
Complete the first four fields in the Connection Information box:
- Server: hostname or IP address
- Port: port configured for Databricks, typically port 443
- SSL: when enabled, ensures communication between Immuta and the remote database is encrypted. Immuta recommends that all connections use SSL. Additional connection string arguments may also be provided below. Only Immuta uses the connection you provide and injects all policy controls when users query the system. Users always connect through Immuta with policies enforced and have no direct association with this connection.
- Database: the remote database
Select your authentication method from the dropdown:
- Access Token:
  1. Enter your Databricks API Token. Use a non-expiring token so that access to the data source is not lost unexpectedly.
  2. Enter the HTTP Path of your Databricks cluster or SQL warehouse.
- OAuth machine-to-machine (M2M):
  1. Enter the HTTP Path of your Databricks cluster or SQL warehouse.
  2. Fill out the Token Endpoint with the full URL of the identity provider. This is where the generated token is sent. The default value is https://<your workspace name>.cloud.databricks.com/oidc/v1/token.
  3. Fill out the Client ID. The client ID is a combination of letters, numbers, or symbols used as a public identifier; it is the same as the service principal's application ID.
  4. Enter the Scope (string). The scope limits the operations and roles allowed in Databricks by the access token. See the OAuth 2.0 documentation for details about scopes.
  5. Enter the Client Secret. Immuta uses this secret to authenticate with the authorization server when it requests a token.
If you are using a proxy server with Databricks, specify it in the Additional Connection String Options:
```
UseProxy=1;ProxyHost=my.host.com;ProxyPort=6789
```
Click Test Connection.
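
The OAuth M2M fields above correspond to an OAuth 2.0 client credentials request. To check the values before entering them in Immuta, you could build the same kind of request yourself. This is a rough sketch using only the Python standard library: the helper name `build_token_request` is hypothetical, and both the `all-apis` scope default and the use of HTTP Basic client authentication are assumptions to confirm against the Databricks OAuth documentation.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_endpoint, client_id, client_secret, scope="all-apis"):
    """Build (but do not send) a client credentials token request."""
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    # The client ID and secret are sent as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        token_endpoint,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Placeholder values; substitute your workspace name and service principal credentials.
request = build_token_request(
    "https://example-workspace.cloud.databricks.com/oidc/v1/token", "id", "secret"
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request) returns a JSON body with the token.
```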

Further considerations

Immuta pushes down joins to be processed on the remote database when possible. To ensure this happens, make sure the connection information matches between data sources, including host, port, SSL, username, and password. Joins against the same database will perform worse if this information doesn't match.
If a client certificate is required to connect to the source database, you can add it in the Upload Certificates section.
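
The matching requirement for join pushdown can be pictured as a field-by-field comparison of two connections. The sketch below only illustrates that idea and is not how Immuta implements the check; the `ConnectionInfo` class and `can_push_down_join` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionInfo:
    """The connection fields named above (hypothetical container)."""
    host: str
    port: int
    ssl: bool
    username: str
    password: str


def can_push_down_join(left, right):
    """True only when every connection field matches between the two sources."""
    # Dataclass equality compares all fields, so one differing value is enough to fail.
    return left == right


orders = ConnectionInfo("my.host.com", 443, True, "svc_immuta", "s3cret")
customers = ConnectionInfo("my.host.com", 443, True, "svc_immuta", "s3cret")
legacy = ConnectionInfo("my.host.com", 8443, True, "svc_immuta", "s3cret")

print(can_push_down_join(orders, customers))
print(can_push_down_join(orders, legacy))
```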

Select virtual population

Decide how to virtually populate the data source by selecting one of the options:

Create sources for all tables in this database: This option will create data sources and keep them in sync for every table in the dataset. New tables will be automatically detected and new Immuta views will be created.
Schema / Table: This option will allow you to specify tables or datasets that you want Immuta to register.
1. Opt to Edit in the table selection box that appears.
2. By default, all schemas and tables are selected. Select and deselect by clicking the checkbox to the left of the name in the Import Schemas/Tables menu. You can create multiple data sources at one time by selecting an entire schema or multiple tables.
3. After making your selection(s), click Apply.

Enter basic information

Enter the SQL Schema Name Format to be the SQL name that the data source exists under in Immuta. It must include a schema macro, but you may personalize the format using lowercase letters, numbers, and underscores. It may have up to 255 characters.
Enter the Schema Project Name Format to be the name of the schema project in the Immuta UI. If you enter a name that already exists, the name will automatically be incremented. For example, if the schema project Customer table already exists and you enter that name in this field, the name for this second schema project will automatically become Customer table 2 when you create it.
1. When selecting Create sources for all tables in this database and monitor for changes, you may personalize this field as you wish, but it must include a schema macro.
2. When selecting Schema/Table, this field is prepopulated with the recommended project name, and you can edit it freely.
Select the Data Source Name Format, which will be the format of the name of the data source in the Immuta UI.
- <Tablename>: The data source name will be the name of the remote table, and the case of the data source name will match the case of the macro.
- <Schema><Tablename>: The data source name will be the name of the remote schema followed by the name of the remote table, and the case of the data source name will match the cases of the macros.
- Custom: Enter a custom template for the Data Source Name. You may personalize this field as you wish, but it must include a tablename macro. The case of the macro will apply to the data source name (i.e., <Tablename> will result in "Data Source Name," <tablename> will result in "data source name," and <TABLENAME> will result in "DATA SOURCE NAME").
Enter the SQL Table Name Format, which will be the format of the name of the table in Immuta. It must include a table name macro, but you may personalize the format using lowercase letters, numbers, and underscores. It may have up to 255 characters.
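
To see how the case of a macro carries over to the resulting name, consider the following sketch. It is an illustration under stated assumptions, not Immuta's implementation: `render_name` is a hypothetical helper, only the schema and tablename macros are handled, and the mixed-case form is assumed to map to Python's title case.

```python
import re

# Matches <schema> and <tablename> macros in any letter case.
_MACRO = re.compile(r"<(schema|tablename)>", re.IGNORECASE)


def _apply_case(macro, value):
    """Apply the macro's case to the substituted value."""
    if macro.isupper():
        return value.upper()
    if macro.islower():
        return value.lower()
    # Mixed case such as "Tablename": assumed to mean title case.
    return value.title()


def render_name(template, schema, table):
    """Replace each macro in the template with the schema or table name."""
    values = {"schema": schema, "tablename": table}

    def substitute(match):
        macro = match.group(1)
        return _apply_case(macro, values[macro.lower()])

    return _MACRO.sub(substitute, template)


print(render_name("<Tablename>", "sales", "data source name"))
print(render_name("<TABLENAME>", "sales", "data source name"))
print(render_name("<schema>_<tablename>", "Sales", "Orders"))
```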

Enable or disable schema monitoring

Note: This step will only appear if all tables within a server have been selected for creation.

Schema monitoring best practices

Schema monitoring is a powerful tool that ensures tables are all governed by Immuta.

Consider using schema monitoring later in your onboarding process, not during your initial setup and configuration when tables are not in a stable state.
Consider using Immuta’s API either to run the schema monitoring job when your ETL process adds new tables or to add the new tables directly.
Activate the new column added templated global policy to protect potentially sensitive data. This policy nulls newly added columns until a data owner reviews them, which prevents data leaks from columns that have not yet been reviewed.

Generate your Immuta API Key from your user profile page. The Immuta API key used in the Databricks notebook job for schema detection must either belong to an Immuta admin or the user who owns the schema detection groups that are being targeted.
On the data source creation page, click the checkbox to enable Schema Monitoring or Detect Column Changes.
Click Download Schema Job Detection Template and then the Click Here To Download text.
Before you can run the script, follow the Databricks documentation to create the scope and secret using the Immuta API Key generated on your user profile page.
Import the Python script you downloaded into a Databricks workspace as a notebook. Note: The job template has commented out lines for specifying a particular database or table. With those two lines commented out, the schema detection job will run against ALL databases and tables in Databricks. Additionally, if you need to add proxy configuration to the job template, the template uses the Python requests library, which has a simple mechanism for configuring proxies for a request.
Schedule the script as part of a notebook job to run as often as required. Each time the job runs, it will make an API call to Immuta to trigger schema detection queries, and these queries will run on the cluster from which the request was made. Note: Use the api_immuta cluster for this job. The job in Databricks must use an Existing All-Purpose Cluster so that Immuta can connect to it over ODBC. Job clusters do not support ODBC connections.
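
If the job template needs a proxy, the requests library accepts a `proxies` mapping on each request. The sketch below shows one way to build that mapping. The helper name `build_proxies` is hypothetical, the host and port are the example values from the proxy step earlier on this page, and the variable names in the commented call are placeholders rather than names taken from the template.

```python
def build_proxies(host, port, scheme="http"):
    """Return a proxies mapping in the shape the requests library expects."""
    proxy_url = f"{scheme}://{host}:{port}"
    # Route both plain HTTP and HTTPS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("my.host.com", 6789)
print(proxies)

# In the notebook, pass the mapping to the template's existing request call, e.g.:
#   requests.post(immuta_url, headers=headers, json=payload, proxies=proxies)
# immuta_url, headers, and payload are placeholders for the template's own values.
```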

Opt to configure advanced settings

Although not required, completing these steps will help maximize the utility of your data source. Otherwise, click Create to save the data source.

Column detection

This setting monitors when remote tables' columns have been changed, updates the corresponding data sources in Immuta, and notifies Data Owners of these changes.

To enable, select the checkbox in this section.

See the Schema projects overview page to learn more about column detection.

Event time

An Event Time column denotes the time associated with records returned from this data source. For example, if your data source contains news articles, the time that the article was published would be an appropriate Event Time column.

Click the Edit button in the Event Time section.
Select the column(s).
Click Apply.

Selecting an Event Time column will enable:

- more statistics to be calculated for this data source, including the most recent record time, which is used for determining the freshness of the data source.
- the creation of time-based restrictions in the policy builder.
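
As a rough illustration of the freshness statistic, the most recent record time is the maximum value in the Event Time column, and freshness is the gap between that value and the current time. The function names below are hypothetical and the sample timestamps are made up; this does not reflect how Immuta computes its statistics internally.

```python
from datetime import datetime, timezone


def most_recent_record_time(event_times):
    """Return the latest event time, or None for an empty data source."""
    return max(event_times, default=None)


def freshness(event_times, now):
    """Return how long ago the newest record occurred, or None if there are no records."""
    latest = most_recent_record_time(event_times)
    return None if latest is None else now - latest


# Made-up publication times for a data source of news articles.
published = [
    datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 3, 18, 30, tzinfo=timezone.utc),
    datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc),
]
now = datetime(2024, 1, 4, 18, 30, tzinfo=timezone.utc)
print(freshness(published, now))
```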

Latency

Click Edit in the Latency section.
Complete the Set Time field, and then select MINUTES, HOURS, or DAYS from the subsequent dropdown menu.
Click Apply.

This setting impacts how often Immuta checks for new values in a column that is driving row-level redaction policies. For example, if you are redacting rows based on a country column in the data, and you add a new country, it will not be seen by the Immuta policy until this period expires.
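
One way to picture the latency setting is as a time-to-live on a cached set of column values: the policy keeps using the cached values until the period expires, and only then looks at the column again. The class below is a conceptual sketch with hypothetical names, not Immuta's implementation.

```python
import time


class ColumnValueCache:
    """Cache a column's distinct values and refresh them once the latency period passes."""

    def __init__(self, fetch_values, latency_seconds, clock=time.monotonic):
        self._fetch = fetch_values        # callable returning the column's current values
        self._latency = latency_seconds   # the configured latency, in seconds
        self._clock = clock               # injectable so the example can simulate time
        self._values = set()
        self._fetched_at = None

    def values(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._latency
        if expired:
            self._values = set(self._fetch())
            self._fetched_at = now
        return self._values


# Simulated country column and clock.
countries = ["US", "FR"]
current_time = [0.0]
cache = ColumnValueCache(lambda: list(countries), 3600, clock=lambda: current_time[0])

first = cache.values()        # initial fetch
countries.append("DE")        # a new country appears in the data
current_time[0] = 1800.0
before_expiry = cache.values()
current_time[0] = 3600.0
after_expiry = cache.values()
print(sorted(before_expiry), sorted(after_expiry))
```

In this sketch the new country becomes visible only on the first read after the one-hour period has elapsed, which mirrors the behavior described above.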

Sensitive data discovery

Data owners can disable identification for their data sources in this section.

Click Edit in this section.
Select Enabled or Disabled in the window that appears, and then click Apply.

Data source tags

Adding tags to your data source allows users to search for the data source using the tags and allows Governors to apply Global policies to the data source. Note: If Schema Detection is enabled, any tags added now will also be added to the tables that are detected.

To add tags:

Click the Edit button in the Data Source Tags section.
Begin typing in the Search by Tag Name box to select your tag, and then click Add.

Tags can also be added after you create your data source from the data source details page on the overview tab or the data dictionary tab.

Create the data source

Click Create to save the data source(s).

Databricks Unity Catalog behavior

If a registered data source has no subscription policy set on it, Immuta will REVOKE access to the data in Databricks for all Immuta users, even if they had been directly granted access to the table in Unity Catalog.

If you disable a Unity Catalog data source in Immuta, all existing grants and policies on that object will be removed in Databricks for all Immuta users. All existing grants and policies will be removed, regardless of whether they were set in Immuta or in Unity Catalog directly.

If a user is not registered in Immuta, Immuta will have no effect on that user's access to data in Unity Catalog.

See the Databricks Unity Catalog reference guide for more details about permissions Immuta revokes and how to configure this behavior for your integration.
