Manual Databricks Spark Configuration
This guide details the manual installation method for enabling native access to Databricks with Immuta policies enforced. Before proceeding, ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.
Databricks Unity Catalog: If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
immuta_conf.xml
is no longer required
The immuta_conf.xml
file that was previously used to configure the native Databricks Spark integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. However, you can use these snippets if you wish to deploy an immuta_conf.xml
file to set properties.
The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional immuta_conf.xml
file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.
If you have an existing immuta_conf.xml
file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method will ensure that any property defaults changed in upcoming Immuta releases are propagated to your environment.
1 - Download and Configure Immuta Artifacts
Spark Version
Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
Navigate to the Immuta archives page. If you are prompted to log in and need basic authentication credentials, contact your Immuta support professional.
Navigate to the Databricks folder for your Immuta version.
Ex: https://archives.immuta.com/hadoop/databricks/2024.3.5/
.Download the .jar file (Immuta plugin) as well as the other scripts listed below, which will load the plugin at cluster startup.
The
immuta-benchmark-suite.dbc
is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries. Note: Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.Specify the following properties as Spark environment variables or in the optional
immuta_conf.xml
file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the config names in all upper case with_
instead of.
. For example, to set the value ofimmuta.base.url
via an environment variable, you would set the following in theEnvironment Variables
section of cluster configuration:IMMUTA_BASE_URL=https://immuta.mycompany.com
immuta.system.api.key
: Obtain this value from the Immuta Configuration UI under HDFS > System API Key. You will need to be a user with theAPPLICATION_ADMIN
role to complete this action.immuta.base.url
: The full URL for the target Immuta tenantEx: https://immuta.mycompany.com
.immuta.user.mapping.iamid
: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta App Settings page within the Identity Management section. See Databricks to Immuta User Mapping for more details.
Environment variables with Google Cloud Platform
Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml
.
Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
2 - Stage Immuta Artifacts
When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. To do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:
Host files in AWS/S3 and provide access by the cluster
Host files in Azure ADL Gen 1 or Gen 2 and provide access by the cluster
Host files on an HTTPS server accessible by the cluster
Host files in DBFS (Not recommended for production)
These artifacts will be downloaded to the required location within the clusters file-system by the init script downloaded in the previous step. In order for the init script to find these files, a URI will have to be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.
AWS/S3
URI Structure: s3://[bucket]/[path]
Create an instance profile for clusters by following Databricks documentation.
Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.
Authenticating with Access Keys or Session Tokens (Optional)
If you wish to authenticate using access keys, add the following items to the cluster's environment variables:
If you've assumed a role and received a session token, that can be added here as well:
Azure
ADL Gen 2
URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]
Upload the configuration file, JSON file, and JAR file to an ADL gen 2 blob container.
Environment Variables:
If you want to authenticate using an account key, add the following to your cluster's environment variables:
If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:
ADL Gen 1
URI Structure: adl://[account].azuredatalakestore.net/[path]
Upload the configuration file, JSON file, and JAR file to ADL gen 1.
Environment Variables:
If authenticating as a Microsoft Entra ID user,
If authenticating using a service principal,
HTTPS
URI Structure: http(s)://[host](:port)/[path]
Artifacts are available for download from Immuta using basic authentication. Your basic authentication credentials can be obtained from your Immuta support professional.
Environment Variables (Optional)
DBFS
DBFS does not support access control
Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.
URI Structure: dbfs:/[path]
Upload the artifacts directly to DBFS using the Databricks CLI.
Since any user has access to everything in DBFS:
The artifacts can be stored anywhere in DBFS.
It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
3 - Protect Immuta Environment Variables with Databricks Secrets
It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the immuta-spark-hive.jar
file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.
Databricks secrets can be used in the Environment Variables
configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret
they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets
was created, and the user added a secret with the key my_secret_env_var
containing the desired sensitive environment variable, they would reference it in the Environment Variables
section:
Then, at runtime, {{secrets/my_secrets/my_secret_env_var}}
would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
Best practice: Replace sensitive variables with secrets
Immuta recommends that any sensitive environment variables listed below in the various artifact deployment instructions be replaced with secrets.
4 - Create and Configure the Cluster
Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta enabled clusters.
Create a cluster in Databricks by following the Databricks documentation.
Select the Custom Access mode.
Opt to adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
In the Advanced Options section, click the Instances tab.
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
Click the Spark tab. In Spark Config field, add your configuration.
Cluster Configuration Requirements:
In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be protected with Databricks secrets as mentioned above.
Click the Init Scripts tab and set the following configurations:
Destination: Specify the service you used to host the Immuta artifacts.
File Path: Specify the full URI to the
immuta_cluster_init_script.sh
.Add the new key/value to the configuration.
Click the Permissions tab and configure the following setting:
Who has access: Users or groups will need to have the permission Can Attach To to execute queries against Immuta configured data sources.
(Re)start the cluster.
Additional Hadoop Configuration File (Optional)
As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration
in order to read the data composing Immuta data sources.
As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.
To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI
environment variable referenced in the Create and configure the cluster section to be the full URI to this file.
The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.
Amazon S3
IAM role for S3 access
S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.
Azure Data Lake Gen 2
Azure Data Lake Gen 1
ADL prefix: Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls
rather than fs.adl
.
Azure Blob Storage
5 - Register data
Register Databricks securables in Immuta.
6 - Query Immuta Data
When the Immuta enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.
Before users can query an Immuta data source, an administrator must give the user Can Attach To
permissions on the cluster and GRANT
the user access to the immuta
database.
The following SQL query can be run as an administrator within a journal to give the user access to "Immuta":
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the immuta
database.
Creating a Databricks Data Source
See the Databricks Data Source Creation guide for a detailed walkthrough.
Databricks to Immuta User Mapping
By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the users email address in Databricks and the connected Immuta tenant must match.
It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to lookup users from a specified IAM. To do this, the value of immuta.user.mapping.iamid
created and hosted in the previous steps must be updated to be the targeted IAM ID configured within the Immuta tenant. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.
Last updated