If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.
Log in to Immuta as an Application Admin.
Click the App Settings icon in the left sidebar and click the Integrations tab.
Your existing Databricks integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.
Click Add Native Integration and select Databricks Integration to add a new integration.
Enter your Databricks integration settings again as configured previously.
Click Add Native Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.
Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.
Use the options below to view instructions for automatically pushing cluster policies and the init script (recommended) or manually updating your cluster policies.
Automatically push cluster policies
Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.
Select Apply Policies to push the cluster policies and init script again.
Click Save and Confirm to deploy your changes.
Manually update cluster policies
Download the init script and the new cluster policies to your local computer.
Click Save and Confirm to save your changes in Immuta.
Log in to your Databricks workspace with your administrator account to set up cluster policies.
Get the path you will upload the init script (`immuta_cluster_init_script_proxy.sh`) to by opening one of the cluster policy `.json` files and looking for the `defaultValue` of the field `init_scripts.0.dbfs.destination`. This should be a DBFS path in the form of `dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh`.
Click Data in the left pane to upload your init script to DBFS to the path you found above.
To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.
Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.
Restart any Databricks clusters using these updated policies for the changes to take effect.
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
Databricks instance: Premium tier workspace and Cluster access control enabled
Databricks instance has network level access to Immuta tenant
Access to Immuta archives
Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Note: Azure Databricks authenticates users with Microsoft Entra ID. Be sure to configure your Immuta tenant with an IAM that uses the same user ID as Microsoft Entra ID; Immuta's Spark security plugin matches this user ID between the two systems. See the Microsoft Entra ID documentation for details.
Use the table below to determine which version of Immuta supports your Databricks Runtime version:
Databricks Runtime Version | Immuta Version |
---|---|
11.3 LTS | 2023.1 and newer |
10.4 LTS | 2022.2.x and newer |
7.3 LTS, 9.1 LTS | 2021.5.x and newer |
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Example cluster | Databricks Runtime | Unity Catalog in Databricks | Databricks Spark integration | Databricks Unity Catalog integration |
---|---|---|---|---|
Cluster 1 | 9.1 | Unavailable | Enabled | Unavailable |
Cluster 2 | 10.4 | Unavailable | Enabled | Unavailable |
Cluster 3 | 11.3 | Unavailable | Enabled | Disabled |

The Databricks Unity Catalog integration requires Unity Catalog to be enabled in Databricks and a cluster running Databricks Runtime 11.3.
Immuta supports the Custom access mode.
Supported Languages:
Python
SQL
R (requires advanced configuration; work with your Immuta support professional to use R)
Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
Users who can read raw tables on-cluster
If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.
If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the `immuta.spark.acl.whitelist` configuration to become ignored users.
The Immuta Databricks integration injects an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin creates an "immuta" database that is available for querying and intercepts all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta tenant and applied before returning the results to the user.
The Databricks cluster init script provided by Immuta downloads the Immuta artifacts onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for policy enforcement.
The cluster init script uses environment variables in order to
Determine the location of the required artifacts for downloading.
Authenticate with the service/storage containing the artifacts.
Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.
See the Databricks Pre-Configuration Details page for known limitations.
There are two installation options for Databricks. Click a link below to navigate to a tutorial for your chosen method:
Simplified Configuration: The steps to enable the integration with this method include
Adding the integration on the App Settings page.
Downloading or automatically pushing cluster policies to your Databricks workspace.
Creating or restarting your cluster.
Manual Configuration: The steps to enable the integration with this method include
Downloading and configuring Immuta artifacts.
Staging Immuta artifacts somewhere the cluster can read from during its startup procedures.
Protecting Immuta environment variables with Databricks Secrets.
Creating and configuring the cluster to start with the init script and load Immuta into its SparkSQL environment.
For easier debugging of the Immuta Databricks installation, enable cluster init script logging. In the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from `NONE` to `DBFS` and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.
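For example, assuming a destination of `dbfs:/cluster-logs` (a hypothetical path), Databricks typically writes init script output under a cluster-specific subdirectory, which can be inspected with the Databricks CLI:

```
# List init script logs for a given cluster ID (placeholder shown)
databricks fs ls dbfs:/cluster-logs/<cluster-id>/init_scripts/
```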
For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
The Validation and Debugging Notebook (`immuta-validation.ipynb`) is packaged with other Databricks release artifacts (for manual installations), or it can be downloaded from the App Settings page when configuring native Databricks through the Immuta UI. This notebook is designed to be used by or under the guidance of an Immuta Support Professional.
Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
Click the arrow next to your name and select Import.
Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.
This guide details the simplified installation method for enabling native access to Databricks with Immuta policies enforced.
Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction before you begin.
Databricks Unity Catalog: If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
Log in to Immuta and click the App Settings icon in the left sidebar.
Scroll to the System API Key subsection under HDFS and click Generate Key.
Click Save and then Confirm.
Scroll to the Integration Settings section, and click + Add a Native Integration.
Select Databricks Integration from the dropdown menu.
Complete the Hostname field.
Enter a Unique ID for the integration. By default, your Immuta tenant URL populates this field. This ID is used to tie the set of cluster policies to your Immuta tenant and allows multiple Immuta tenants to access the same Databricks workspace without cluster policy conflicts.
Select your configured Immuta IAM from the dropdown menu.
Choose one of the following options for your data access model:
Protected until made available by policy: All tables are hidden until a user is permissioned through an Immuta policy. This is how most databases work; it assumes least-privileged access, but it also means you must register all tables with Immuta.
Available until protected by policy: All tables are open until explicitly registered and protected by Immuta. This approach makes sense if most of your tables are non-sensitive and you want to pick and choose which to protect.
Select the Storage Access Type from the dropdown menu.
Opt to add any Additional Hadoop Configuration Files.
Click Add Native Integration.
Several cluster policies are available on the App Settings page when configuring this integration:
Click a link above to read more about each of these cluster policies before continuing with the tutorial.
Click Configure Cluster Policies.
Select one or more cluster policies in the matrix by clicking the Select button(s).
Opt to check the Enable Unity Catalog checkbox to generate cluster policies that will enable Unity Catalog on your cluster. This option is only available when Databricks runtime 11.3 is selected.
Opt to make changes to these cluster policies by clicking Additional Policy Changes and editing the text field.
Use one of the two Installation Types described below to apply the policies to your cluster:
Automatically Push Cluster Policies: This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.
Select the Automatically Push Cluster Policies radio button.
Enter your Admin Token. This token must be for a user who can create cluster policies in Databricks.
Click Apply Policies.
Manually Push Cluster Policies: Enabling this option will allow you to manually push the cluster policies to the configured Databricks workspace. There will be various files to download and manually push to the configured Databricks workspace.
Select the Manually Push Cluster Policies radio button.
Click Download Init Script.
Follow the steps in the Instructions to upload the init script to DBFS section.
Click Download Policies, and then manually add these Cluster Policies in Databricks.
Opt to click Download Benchmarking Suite to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Click Close, and then click Save and Confirm.
Create a cluster in Databricks by following the Databricks documentation.
In the Policy dropdown, select the Cluster Policies you pushed or manually added from Immuta.
Select the Custom Access mode.
Opt to adjust Autopilot Options and Worker Type settings: The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
Opt to configure the Instances tab in the Advanced Options section:
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
Click Create Cluster.
Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster.
When the Immuta-enabled Databricks cluster has been successfully started, Immuta will create an `immuta` database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the `immuta` database. Additionally, when configuring a Databricks cluster you can hide `immuta` from any calls to `SHOW DATABASES` so that users aren't misled or confused by its presence.
See the Databricks Data Source Creation guide for a detailed walkthrough of creating Databricks data sources in Immuta.
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the `immuta` database.
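For instance, for a data source registered from `my_database.my_table` (hypothetical names; the name exposed under the `immuta` database depends on how the data source was registered), both of the following return policy-enforced results:

```
-- First example: Immuta-qualified name through the immuta database
SELECT * FROM immuta.my_database_my_table;

-- Second example: the raw table name; policies are still applied
SELECT * FROM my_database.my_table;
```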
This guide details the manual installation method for enabling native access to Databricks with Immuta policies enforced. Before proceeding, ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.
Databricks Unity Catalog: If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
The `immuta_conf.xml` file is no longer required
The `immuta_conf.xml` file that was previously used to configure the native Databricks integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. You can still deploy an optional `immuta_conf.xml` file to set properties if you wish.
The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional `immuta_conf.xml` file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.
If you have an existing `immuta_conf.xml` file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method ensures that any property defaults changed in upcoming Immuta releases are propagated to your environment.
Navigate to the Immuta archives page. If you are prompted to log in and need basic authentication credentials, contact your Immuta support professional.
Navigate to the Databricks folder for your Immuta version. Ex: https://archives.immuta.com/hadoop/databricks/2024.2.1/.
Download the .jar file (Immuta plugin) as well as the other scripts listed below, which will load the plugin at cluster startup.
The `immuta-benchmark-suite.dbc` is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries. Note: Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
Specify the following properties as Spark environment variables or in the optional `immuta_conf.xml` file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the config names in all upper case with `_` instead of `.`. For example, to set the value of `immuta.base.url` via an environment variable, you would set the following in the Environment Variables section of cluster configuration: `IMMUTA_BASE_URL=https://immuta.mycompany.com` (see the sketch after the property list below).
`immuta.system.api.key`: Obtain this value from the Immuta Configuration UI under HDFS > System API Key. You will need to be a user with the `APPLICATION_ADMIN` role to complete this action. Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
`immuta.base.url`: The full URL for the target Immuta tenant. Ex: https://immuta.mycompany.com.
`immuta.user.mapping.iamid`: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta App Settings page within the Identity Management section. See Databricks to Immuta User Mapping for more details.
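As a sketch of the naming convention described above, the same three properties could be set in the cluster's Environment Variables section as follows (the values are placeholders; `bim` is the ID of Immuta's built-in IAM):

```
# Placeholder values; substitute your own key, URL, and IAM ID.
IMMUTA_SYSTEM_API_KEY=<your-system-api-key>
IMMUTA_BASE_URL=https://immuta.mycompany.com
IMMUTA_USER_MAPPING_IAMID=bim
```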
Environment variables with Google Cloud Platform
Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in `immuta_conf.xml`.
When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. To do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:
Host files in AWS/S3 and provide access by the cluster
Host files in Azure ADL Gen 1 or Gen 2 and provide access by the cluster
Host files on an HTTPS server accessible by the cluster
Host files in DBFS (Not recommended for production)
These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. In order for the init script to find these files, a URI will have to be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.
URI Structure: s3://[bucket]/[path]
Create an instance profile for clusters by following Databricks documentation.
Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.
If you wish to authenticate using access keys, add the following items to the cluster's environment variables:
If you've assumed a role and received a session token, that can be added here as well:
URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]
Upload the configuration file, JSON file, and JAR file to an ADL gen 2 blob container.
Environment Variables:
If you want to authenticate using an account key, add the following to your cluster's environment variables:
If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:
URI Structure: adl://[account].azuredatalakestore.net/[path]
Upload the configuration file, JSON file, and JAR file to ADL gen 1.
Environment Variables:
If authenticating as a Microsoft Entra ID user,
If authenticating using a service principal,
URI Structure: http(s)://[host](:port)/[path]
Artifacts are available for download from Immuta using basic authentication. Your basic authentication credentials can be obtained from your Immuta support professional.
DBFS does not support access control
Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.
URI Structure: dbfs:/[path]
Upload the artifacts directly to DBFS using the Databricks CLI.
Since any user has access to everything in DBFS:
The artifacts can be stored anywhere in DBFS.
It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
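For example, with the Databricks CLI configured, the artifacts could be copied to a cluster-specific directory (the destination path below is only an illustration):

```
# Copy the Immuta plugin jar and init script to a cluster-specific DBFS path
databricks fs cp immuta-spark-hive.jar dbfs:/immuta-artifacts/my-cluster/immuta-spark-hive.jar
databricks fs cp immuta_cluster_init_script.sh dbfs:/immuta-artifacts/my-cluster/immuta_cluster_init_script.sh
```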
It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the `immuta-spark-hive.jar` file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.
Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. Rather than placing a sensitive value directly in an environment variable, a user can create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope `my_secrets` was created and the user added a secret with the key `my_secret_env_var` containing the desired sensitive value, they would reference it in the Environment Variables section:
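A minimal sketch, assuming the secret holds the Immuta system API key (the variable name is whatever your configuration requires) and using the Databricks CLI to create the secret:

```
# Create the scope and secret (the CLI opens an editor for the secret value)
databricks secrets create-scope --scope my_secrets
databricks secrets put --scope my_secrets --key my_secret_env_var

# In the cluster's Environment Variables section, reference the secret
# instead of the raw value
IMMUTA_SYSTEM_API_KEY={{secrets/my_secrets/my_secret_env_var}}
```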
Then, at runtime, `{{secrets/my_secrets/my_secret_env_var}}` would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
Best practice: Replace sensitive variables with secrets
Immuta recommends that any sensitive environment variables listed below in the various artifact deployment instructions be replaced with secrets.
Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta enabled clusters.
Create a cluster in Databricks by following the Databricks documentation.
Select the Custom Access mode.
Opt to adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
In the Advanced Options section, click the Instances tab.
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
Click the Spark tab. In the Spark Config field, add your configuration.
Cluster Configuration Requirements:
In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be protected with Databricks secrets as mentioned above.
Click the Init Scripts tab and set the following configurations:
Destination: Specify the service you used to host the Immuta artifacts.
File Path: Specify the full URI to the `immuta_cluster_init_script.sh` script.
Add the new key/value to the configuration.
Click the Permissions tab and configure the following setting:
Who has access: Users or groups will need to have the permission Can Attach To to execute queries against Immuta configured data sources.
(Re)start the cluster.
As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to `SparkSession.sparkContext.hadoopConfiguration` in order to read the data composing Immuta data sources.
As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.
To use an additional Hadoop configuration file, you will need to set the `IMMUTA_INIT_ADDITIONAL_CONF_URI` environment variable referenced in the Create and configure the cluster section to be the full URI to this file.
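For example (the bucket and file name are hypothetical, and any of the supported hosting methods above can be used):

```
IMMUTA_INIT_ADDITIONAL_CONF_URI=s3://my-config-bucket/immuta/additional-hadoop-conf.xml
```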
The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.
IAM role for S3 access
S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.
ADL prefix: Prior to Databricks Runtime version 6, the following configuration items should have a prefix of `dfs.adls` rather than `fs.adl`.
Register Databricks securables in Immuta.
When the Immuta-enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta tenant.
Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the `immuta` database.
The following SQL query can be run as an administrator within a notebook to give the user access to the "immuta" database:
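A minimal sketch of such a grant, assuming table access control is enabled on the cluster (substitute the user's Databricks email address):

```
GRANT USAGE, SELECT ON DATABASE immuta TO `user@example.com`;
```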
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the `immuta` database.
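For instance, for a data source registered from `my_database.my_table` (hypothetical names; the name exposed under the `immuta` database depends on how the data source was registered), both of the following return policy-enforced results:

```
-- First example: Immuta-qualified name through the immuta database
SELECT * FROM immuta.my_database_my_table;

-- Second example: the raw table name; policies are still applied
SELECT * FROM my_database.my_table;
```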
See the Databricks Data Source Creation guide for a detailed walkthrough.
By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and the connected Immuta tenant must match.
It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of `immuta.user.mapping.iamid` created and hosted in the previous steps must be updated to be the targeted IAM ID configured within the Immuta tenant. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.
In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source `Upload`, `DBFS`, `DBFS/S3`, or `Maven`. Alternatively, use the Databricks libraries API.
In the Databricks Clusters UI, add the `IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS` property as a Spark environment variable and set it to your artifact's URI:
For Maven artifacts, the URI is `maven:/<maven_coordinates>`, where `<maven_coordinates>` is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. Using that Coordinates value, you would add the following Spark environment variable:
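For instance, if the Coordinates field read `com.databricks:spark-xml_2.12:0.14.0` (a hypothetical artifact), the variable would be:

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.databricks:spark-xml_2.12:0.14.0
```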
For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. Using that Source value, you would add the following Spark environment variable:
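For instance, if the Source field read `dbfs:/FileStore/jars/a1b2c3d4_my_library.jar` (a hypothetical uploaded .jar), the variable would be:

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/FileStore/jars/a1b2c3d4_my_library.jar
```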
Once you've finished making your changes, restart the cluster.
Specifying more than one trusted library
To specify more than one trusted library, comma delimit the URIs:
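Continuing the hypothetical examples above:

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.databricks:spark-xml_2.12:0.14.0,dbfs:/FileStore/jars/a1b2c3d4_my_library.jar
```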
Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see corresponding confirmation messages in the cluster's driver logs.