Databricks Installation Guide
Audience: System Administrators
Content Summary: This guide provides instructions to enable native access to Databricks with Immuta protection through installation of a plugin within the target cluster.
Prerequisites:
- Databricks instance: Premium tier workspace and Cluster access control enabled
- Databricks instance has network level access to Immuta instance
- Access to Immuta releases
- Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Note: Azure Databricks authenticates users with Azure AD. Be sure to configure your Immuta instance with an IAM that uses the same user IDs as Azure AD. Immuta's Spark security plugin will attempt to match this user ID between the two systems. See the Azure Active Directory documentation for details.
Supported Databricks Runtimes
Immuta supports these Databricks Runtimes:
- 5.5
- 6.4
- 7.3
- 7.4
- 7.5
- 7.6
- 8.0
- 8.1
Supported Cluster Configurations
High Concurrency Clusters
Configuration Requirements
spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.databricks.repl.allowedLanguages python,sql
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.isv.product Immuta
Language Support
High Concurrency Clusters only support the following languages:
- Python
- SQL
- R (requires advanced configuration; work with your Immuta support professional to use R)
Standard Clusters
Configuration Requirements
spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.databricks.repl.allowedLanguages python,sql
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.isv.product Immuta
Language Support
Standard Clusters only support the following languages:
- Python
- SQL
- R (requires advanced configuration; work with your Immuta support professional to use R)
- Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
Caveats
- If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.
- If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.
Installation Overview
The Immuta Databricks integration works by injecting an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin will make an "immuta" database available for querying and will intercept all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta instance and applied prior to returning the results to the user.
The following is a brief overview of the steps that will need to be completed to enable the integration.
- Download and Configure Immuta Artifacts: The plugin, cluster init script, and configuration.
- Stage Immuta Artifacts: Place the aforementioned files somewhere the cluster can read from during its startup procedures.
- Protect Immuta Environment Variables with Databricks Secrets.
- Create and Configure the Cluster: Configure a cluster to start with the init script and load Immuta into its SparkSQL environment.
- Query Immuta Data: Grant users permission to query data and set up sources within Immuta.
1 - Download and Configure Immuta Artifacts
- Navigate to the Immuta releases page.
- Scroll to the All Archives section and click here.
- Alternatively, you can go directly to the Immuta archives and use the credentials provided on the releases page.
- Navigate to the Databricks folder for your Immuta version.
Ex: https://archives.immuta.com/hadoop/databricks/2021.2.4/
- Download the .jar file (the Immuta plugin) as well as the other artifacts listed below, which will load the plugin at cluster startup:
  - immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar
  - immuta_cluster_init_script.sh
  - allowedCallingClasses.json
  - obscuredCommands.yaml
  - immuta_conf.xml
Spark Version
Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
- Open the immuta_conf.xml file in a text editor and edit the following fields to match your Immuta and Databricks configuration.
  - immuta.system.api.key: Obtain this value from the Immuta Configuration UI under "HDFS" > "System API Key". You will need to be a user with the APPLICATION_ADMIN role to complete this action.
Danger
Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
  - immuta.base.url: The full URL for the target Immuta instance. Ex: https://immuta.mycompany.com
  - immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown on the Immuta App Settings page within the "Identity Management" section.
Environment Variable Overrides
Properties in the config file can be overridden during installation using environment variables. The variable names are the config names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of the cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
Environment Variables with Google Cloud Platform
Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml.
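The naming rule for overrides is mechanical, so it can be sketched as a small helper. This is an illustration only; the function name is not part of the Immuta tooling:

```python
def property_to_env_var(prop: str) -> str:
    """Convert an immuta_conf.xml property name to its environment
    variable override: upper case, with '.' replaced by '_'."""
    return prop.upper().replace(".", "_")

# For example, immuta.base.url is overridden via IMMUTA_BASE_URL.
print(property_to_env_var("immuta.base.url"))  # IMMUTA_BASE_URL
```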
2 - Stage Immuta Artifacts
When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. In order to do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:
- Host files in AWS S3 and grant the cluster access
- Host files in Azure ADL Gen 1 or Gen 2 and grant the cluster access
- Host files on an HTTPS server accessible by the cluster
- Host files in DBFS (not recommended for production)
These artifacts will be downloaded to the required location within the cluster's file system by the init script obtained in the previous step. In order for the init script to find these files, a URI must be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.
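Before configuring the cluster, it can be useful to sanity-check that each artifact URI uses one of the supported schemes. The sketch below is illustrative only; the Immuta init script performs its own handling:

```python
from urllib.parse import urlparse

# Schemes accepted by the staging methods described below.
SUPPORTED_SCHEMES = {"s3", "abfs", "abfss", "adl", "http", "https", "dbfs"}

def is_supported_artifact_uri(uri: str) -> bool:
    """Return True if the URI uses one of the supported staging schemes."""
    return urlparse(uri).scheme.lower() in SUPPORTED_SCHEMES

print(is_supported_artifact_uri("s3://my-bucket/immuta/immuta-spark-hive.jar"))  # True
print(is_supported_artifact_uri("ftp://example.com/immuta-spark-hive.jar"))      # False
```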
AWS/S3
URI Structure: s3://[bucket]/[path]
- Create an instance profile for clusters by following Databricks documentation.
- Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.
Authenticating with Access Keys or Session Tokens (Optional)
If you wish to authenticate using access keys, add the following items to the cluster's environment variables:
IMMUTA_INIT_AWS_SECRET_ACCESS_KEY=<aws secret key>
IMMUTA_INIT_AWS_ACCESS_KEY_ID=<aws access key id>
If you've assumed a role and received a session token, that can be added here as well:
IMMUTA_INIT_AWS_SESSION_TOKEN=<aws session token>
Azure
ADL Gen 2
URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]
Upload the configuration file, JSON file, and JAR file to an ADL gen 2 blob container.
Environment Variables:
If you want to authenticate using an account key, add the following to your cluster's environment variables:
IMMUTA_INIT_AZCOPY_CRED_TYPE=SharedKey
IMMUTA_INIT_ACCOUNT_NAME=<ADLg2 account name>
IMMUTA_INIT_ACCOUNT_KEY=<ADLg2 account key>
If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:
IMMUTA_INIT_AZURE_SAS_TOKEN=<SAS token>
ADL Gen 1
URI Structure: adl://[account].azuredatalakestore.net/[path]
Upload the configuration file, JSON file, and JAR file to ADL gen 1.
Environment Variables:
If authenticating as an AD user,
IMMUTA_INIT_AZURE_AD_USER=<azure AD username>
IMMUTA_INIT_AZURE_PASSWORD=<azure AD password>
If authenticating using a service principal,
IMMUTA_INIT_AZURE_SERVICE_PRINCIPAL=<azure service principal>
IMMUTA_INIT_AZURE_PASSWORD=<azure service principal password>
IMMUTA_INIT_AZURE_TENANT=<tenant ID where principal was created>
HTTPS
URI Structure: http(s)://[host](:port)/[path]
Artifacts are available for download from Immuta using basic authentication. Archives and your basic authentication credentials can be found here.
Environment Variables (Optional)
IMMUTA_INIT_HTTPS_USER=<basic auth username>
IMMUTA_INIT_HTTPS_PASSWORD=<basic auth password>
# Note: Credentials can also be included as part of the artifact URI. For example,
IMMUTA_INIT_JAR_URI=https://user:password@download.immuta.com/path/to/file
DBFS
Warning
DBFS does not support access control. Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.
URI Structure: dbfs:/[path]
Upload the artifacts directly to DBFS using the Databricks CLI.
Since any user has access to everything in DBFS:
- The artifacts can be stored anywhere in DBFS.
- It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
3 - Protect Immuta Environment Variables with Databricks Secrets
It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify the Immuta configuration or the immuta-spark-hive.jar file, as this could open a loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.
Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret:
MY_SECRET_ENV_VAR=super_secret_stuff
they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section:
MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}
Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret, provided the owner of the cluster has access to that secret.
Best Practice: Replace Sensitive Variables with Secrets
Immuta recommends that ANY SENSITIVE environment variables listed below in the various artifact deployment instructions be replaced with secrets.
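The secret reference syntax above follows a fixed pattern, which can be generated programmatically when configuring many variables. This is a sketch only; the helper name is illustrative:

```python
def secret_reference(scope: str, key: str) -> str:
    """Build a Databricks secret reference for the Environment Variables
    section, e.g. {{secrets/my_secrets/my_secret_env_var}}."""
    return f"{{{{secrets/{scope}/{key}}}}}"

print(secret_reference("my_secrets", "my_secret_env_var"))
# {{secrets/my_secrets/my_secret_env_var}}
```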
4 - Create and Configure the Cluster
Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta enabled clusters.
- Create a cluster in Databricks by following the
Databricks documentation.
- "Cluster Mode": Immuta supports both High Concurrency and Standard clusters in Databricks. For more information, see the Databricks cluster configuration documentation.
- "Autopilot Options" & "Worker Type": The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
- Configure the "Instances" tab
- "IAM Role" (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
- Configure the "Spark" tab
- "Spark Config": Add the configuration relevant to your cluster type. See Supported Cluster Configurations for details.
- "Environment Variables": Add the environment variables necessary for your configuration.
Remember that these variables should be protected with Databricks secrets as mentioned above.
# Specify the URI to the artifacts that were hosted in the previous steps.
# The URI must adhere to the supported types for each service mentioned above.
IMMUTA_INIT_JAR_URI=<full URI to immuta-spark-hive.jar>
IMMUTA_INIT_CONF_URI=<full URI to Immuta configuration file>
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=<full URI to allowedCallingClasses.json>
IMMUTA_INIT_OBSCURED_COMMANDS_URI=<full URI to obscuredCommands.yaml>

# (OPTIONAL)
# Specify an additional configuration file to be added to spark.sparkContext.hadoopConfiguration.
# This file allows administrators to add sensitive configuration needed by the SparkSession that
# should not be viewable by users.
# Further explanation of this variable as well as examples are provided below.
IMMUTA_INIT_ADDITIONAL_CONF_URI=<full URI to additional configuration file>
- Configure the "Init Scripts" tab
- "Destination": Specify the service you used to host the Immuta artifacts
- "File Path": Specify the full URI to the
immuta_cluster_init_script.sh
- "Add" the new key/value to the configuration
- Configure the "Permissions" tab
- "Who has access": Users or groups will need to have the permission "Can Attach To" in order to execute queries against Immuta configured data sources
- (Re)start the cluster.
Additional Hadoop Configuration File (Optional)
As mentioned in the "Environment Variables" section of the cluster configuration, there may be
some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration
in order to read the data composing Immuta data sources.
As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.
To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI
environment
variable referenced in the Create and configure the cluster section to be the full
URI to this file.
The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.
Amazon S3
IAM Role for S3 Access
S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>[AWS access key ID]</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>[AWS secret key]</value>
</property>
</configuration>
Azure Data Lake Gen 2
<configuration>
<property>
<name>fs.azure.account.key.[storage account name].dfs.core.windows.net</name>
<value>[storage account key]</value>
</property>
</configuration>
Azure Data Lake Gen 1
ADL Prefix
Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.
<configuration>
<property>
<name>fs.adl.oauth2.refresh.url</name>
<value>https://login.microsoftonline.com/[directory ID]/oauth2/token</value>
</property>
<property>
<name>fs.adl.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>fs.adl.oauth2.credential</name>
<value>[client secret from Azure]</value>
</property>
<property>
<name>fs.adl.oauth2.client.id</name>
<value>[client ID from Azure]</value>
</property>
</configuration>
Azure Blob Storage
<configuration>
<property>
<name>fs.azure.account.key.[storage account name].blob.core.windows.net</name>
<value>[storage account key]</value>
</property>
</configuration>
5 - Query Immuta Data
When the Immuta-enabled Databricks cluster has started successfully, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.
Before users can query an Immuta data source, an administrator must give the user "Can Attach To" permissions on the cluster and GRANT the user access to the "immuta" database.
The following SQL query can be run as an administrator within a notebook to give the user access to the "immuta" database:
%sql
GRANT SELECT,READ_METADATA ON DATABASE immuta TO `user@company.com`
Below are some example queries that can be run to obtain data from an Immuta configured data source.
%sql
show tables in immuta;
%sql
select * from immuta.my_data_source limit 5;
Creating a Databricks Data Source
See the Databricks Data Source Creation guide for a detailed walkthrough.
Databricks to Immuta User Mapping
By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and in the connected Immuta instance must match.
It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of immuta.user.mapping.iamid given in the immuta_conf.xml created and hosted in the previous steps must be updated to the ID of the target IAM configured within the Immuta instance. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.
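For example, the corresponding property in immuta_conf.xml would look like the following (the value shown is a placeholder for the IAM ID displayed on your App Settings page):

```xml
<configuration>
  <property>
    <name>immuta.user.mapping.iamid</name>
    <value>[your IAM ID]</value>
  </property>
</configuration>
```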
Debugging Immuta Installation Issues
For easier debugging of the Immuta Databricks installation, enable cluster init script logging. On the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be appended to the provided path.
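The resulting log location can be reconstructed as follows. This is a sketch of the path behavior described above; the cluster ID shown is a made-up example:

```python
def init_script_log_path(base_path: str, cluster_id: str) -> str:
    """Databricks appends the unique cluster ID to the configured
    logging destination path (illustrative reconstruction)."""
    return f"{base_path.rstrip('/')}/{cluster_id}"

print(init_script_log_path("dbfs:/cluster-logs", "0123-456789-abc123"))
# dbfs:/cluster-logs/0123-456789-abc123
```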
For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
Using the Validation and Debugging Notebook
The Validation and Debugging Notebook (immuta-validation.ipynb
) is packaged with other Databricks release artifacts
and is designed to be used by or under the guidance of an Immuta Support Professional.
- Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
- Click the arrow next to your name and select Import.
- Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.
Further Reading
Cluster Init Script
The Databricks cluster init script provided by Immuta downloads the previously
mentioned Immuta artifacts (the configuration file and immuta-spark-hive.jar
) onto the target
cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init
script runs, the Spark application running on the Databricks cluster will have the appropriate
artifacts on its CLASSPATH in order to use Immuta for policy enforcement.
The cluster init script uses environment variables in order to:
- Determine the location of the required artifacts for downloading.
- Authenticate with the service/storage containing the artifacts.
Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.
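As a summary of the staging methods above, the credential environment variables consumed by the init script can be grouped by URI scheme. This is a simplified sketch showing one common credential option per scheme; ADL Gen 1 and ADL Gen 2 each also support the alternative credentials described earlier:

```python
from urllib.parse import urlparse

# Simplified mapping of artifact URI scheme to the credential environment
# variables described in the staging section (one common option per scheme).
CREDENTIAL_VARS = {
    "s3": ["IMMUTA_INIT_AWS_ACCESS_KEY_ID", "IMMUTA_INIT_AWS_SECRET_ACCESS_KEY"],
    "abfss": ["IMMUTA_INIT_ACCOUNT_NAME", "IMMUTA_INIT_ACCOUNT_KEY"],
    "adl": ["IMMUTA_INIT_AZURE_AD_USER", "IMMUTA_INIT_AZURE_PASSWORD"],
    "https": ["IMMUTA_INIT_HTTPS_USER", "IMMUTA_INIT_HTTPS_PASSWORD"],
    "dbfs": [],  # DBFS requires no credentials (and provides no access control)
}

def credential_vars_for(uri: str) -> list:
    """Return the credential variables typically needed for an artifact URI."""
    return CREDENTIAL_VARS.get(urlparse(uri).scheme.lower(), [])

print(credential_vars_for("dbfs:/immuta/immuta-spark-hive.jar"))  # []
```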