
Databricks Spark

Audience: Data Owners and Data Users
Content Summary: This page provides an overview of the Databricks integration. For installation instructions, see the Databricks Installation Introduction and the Databricks Quick Integration Guide.

Overview

Databricks is a plugin integration with Immuta. This integration allows you to protect access to tables and manage row-, column-, and cell-level controls without enabling table ACLs or credential passthrough. Policies are applied to the plan that Spark builds for a user's query and enforced live on-cluster.

Architecture

An Application Admin configures Databricks with one of the following options:

Simplified Databricks Configuration on the Immuta App Settings page
Manual Databricks Configuration where Immuta artifacts must be downloaded and staged to your Databricks clusters

In both configuration options, the Immuta init script adds the Immuta plugin in Databricks: the Immuta Security Manager, wrappers, and Immuta analysis hook plan rewrite. Once an administrator gives users Can Attach To entitlements on the cluster, they can query Immuta-registered data sources directly in their Databricks notebooks.

Simplified Databricks Configuration Additional Entitlements

The credentials used to do the Simplified Databricks configuration with automatic cluster policy push must have the following entitlement:

Allow cluster creation

This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.

Policy Enforcement

Immuta Best Practices: Test User

Test the integration on an Immuta-enabled cluster with a user that is not a Databricks administrator.

Registering Data Sources

You should register entire databases with Immuta and run Schema Monitoring jobs through the Python script provided during data source registration. Additionally, you should use a Databricks administrator account to register data sources with Immuta using the UI or API; however, you should not test Immuta policies using a Databricks administrator account, as they are able to bypass controls. See the Pre-Configuration page for more details.

Table Access

A Databricks administrator can control who has access to specific tables in Databricks through Immuta Subscription Policies or by manually adding users to the data source. Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources.

The `immuta` Database

When a table is registered in Immuta as a data source, users can see that table in the native Databricks database and in the immuta database. This allows for an option to use a single database (immuta) for all tables.

Fine-grained Access Control

After data users have subscribed to data sources, administrators can apply fine-grained access controls, such as restricting rows or masking columns with advanced anonymization techniques, to manage what the users can see in each table. More details on the types of data policies can be found on the Data Policies page, including an overview of masking struct and array columns in Databricks.

Note: Immuta recommends building Global Policies rather than Local Policies, as they allow organizations to easily manage policies as a whole and capture system state in a more deterministic manner.

Accessing Data

All access controls must go through SQL.

Python

df = spark.sql("select * from immuta.table")

Scala

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
val sqlDF = spark.sql("SELECT * FROM immuta.table")

SQL

%sql
select * from immuta.table

R

library(SparkR)
df <- SparkR::sql("SELECT * from immuta.table")

Note: With R, you must load the SparkR library in a cell before accessing the data.

Mapping Users

Usernames in Immuta must match usernames in Databricks. It is best practice to use the same identity manager for Immuta that you use for Databricks (Immuta supports these identity manager protocols and providers). However, for Immuta SaaS users, it's easiest to ensure that usernames match between systems.
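Because usernames have to line up between the two systems, it can help to compare exported user lists before rolling out the integration. The sketch below is a minimal illustration only: the function name and the input lists are hypothetical, and it assumes an exact, case-sensitive comparison, which this page does not specify.

```python
def find_unmapped_users(databricks_users, immuta_users):
    """Return Databricks usernames that have no exact match in Immuta.

    Assumption: matching is exact and case-sensitive. Adjust the
    comparison if your identity manager normalizes usernames.
    """
    immuta_set = set(immuta_users)
    return sorted(user for user in databricks_users if user not in immuta_set)


# Hypothetical user lists exported from each system.
databricks_users = ["ana@company.com", "raj@company.com"]
immuta_users = ["ana@company.com"]
unmapped = find_unmapped_users(databricks_users, immuta_users)
```

Any username returned by the function would need to be created or corrected in one of the systems before that user can be mapped.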

Data Flow

An Immuta Application Administrator configures the Databricks integration and registers the available cluster policies that Immuta generates.
The Immuta init script adds the Immuta plugin in Databricks: the Immuta SecurityManager, wrappers, and Immuta analysis hook plan rewrite.
A Data Owner registers Databricks tables in Immuta as data sources. A Data Owner, Data Governor, or Administrator creates or changes a policy or user in Immuta.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
A Databricks user who is subscribed to the data source in Immuta queries the corresponding table directly in their notebook or workspace.
During Spark Analysis, Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the SecurityManager that it is a trusted node and is allowed to scan raw data. Immuta blocks direct access to S3 unless it backs a registered table in Immuta.
The Physical Plan is applied and filters out and transforms raw data coming back to the user.
The user sees policy-enforced data.
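The net effect of the last steps in this flow (rows filtered out, column values transformed before results reach the user) can be pictured with a toy model. This is a conceptual sketch only and not Immuta's implementation: real enforcement happens inside Spark's logical and physical plans, and the function name, the sample rows, and the use of SHA-256 as the masking transform are all assumptions made for illustration.

```python
import hashlib


def apply_policies(rows, row_filter, masked_columns):
    """Toy model: drop rows the policy excludes, then hash masked columns."""
    visible = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level policy: the user never sees this row
        visible.append({
            col: hashlib.sha256(str(val).encode("utf-8")).hexdigest()
            if col in masked_columns else val
            for col, val in row.items()
        })
    return visible


# Hypothetical raw table and policies.
raw = [
    {"region": "EU", "email": "a@example.com"},
    {"region": "US", "email": "b@example.com"},
]
result = apply_policies(raw, lambda r: r["region"] == "EU", {"email"})
```

In this model the user receives one row, with the region value intact and the email value replaced by a hash.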

How-to Guides

Installation

This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.

Prerequisites

Databricks instance: Premium tier workspace and Cluster access control enabled
Databricks instance has network level access to Immuta instance
Access to Immuta archives
Permissions and access to download (outside Internet access) or transfer files to the host machine

Recommended Databricks Workspace Configurations:

Note: Azure Databricks authenticates users with Microsoft Entra ID. Be sure to configure your Immuta instance with an IAM that uses the same user ID as does Microsoft Entra ID. Immuta's Spark security plugin will look to match this user ID between the two systems. See this Microsoft Entra ID page for details.

Supported Databricks Runtime Versions

Use the table below to determine which version of Immuta supports your Databricks Runtime version:

Databricks Runtime Version

Immuta Version

Supported Databricks Cluster Configurations

The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.

Legend:

Supported Access Mode and Languages

Immuta supports the Custom access mode.

Supported Languages:
- Python
- SQL
- R (requires advanced configuration; work with your Immuta support professional to use R)
- Scala (requires advanced configuration; work with your Immuta support professional to use Scala)

Databricks Installation Overview

Users Who Can Read Raw Tables On-Cluster

If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.
If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.

The Immuta Databricks integration injects an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin creates an "immuta" database that is available for querying and intercepts all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta instance and applied before returning the results to the user.

The Databricks cluster init script provided by Immuta downloads the Immuta artifacts onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for policy enforcement.

The cluster init script uses environment variables in order to

Determine the location of the required artifacts for downloading.
Authenticate with the service/storage containing the artifacts.

Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.

Limitations

See the Databricks Pre-Configuration Details page for known limitations.

Installation Methods

There are two installation options for Databricks. Click a link below to navigate to a tutorial for your chosen method:

Simplified Configuration: The steps to enable the integration with this method include
1. Adding the integration on the App Settings page.
2. Downloading or automatically pushing cluster policies to your Databricks workspace.
3. Creating or restarting your cluster.
Manual Configuration: The steps to enable the integration with this method include
1. Downloading and configuring Immuta artifacts.
2. Staging Immuta artifacts somewhere the cluster can read from during its startup procedures.
3. Protecting Immuta environment variables with Databricks Secrets.
4. Creating and configuring the cluster to start with the init script and load Immuta into its SparkSQL environment.

Debugging Immuta Installation Issues

For easier debugging of the Immuta Databricks installation, enable cluster init script logging. On the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.
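If you manage clusters through the Databricks Clusters API instead of the UI, the same logging destination is expressed with the `cluster_log_conf` field. The sketch below builds that fragment and the directory where logs for a given cluster would land; the helper names, the destination path, and the cluster ID are hypothetical values chosen for illustration.

```python
def cluster_log_conf(destination):
    """Build the cluster_log_conf fragment for a DBFS log destination."""
    if not destination.startswith("dbfs:/"):
        raise ValueError("expected a dbfs:/ destination")
    return {"cluster_log_conf": {"dbfs": {"destination": destination}}}


def expected_log_dir(destination, cluster_id):
    """The cluster ID is appended to the configured path."""
    return destination.rstrip("/") + "/" + cluster_id


# Hypothetical destination and cluster ID.
conf = cluster_log_conf("dbfs:/cluster-logs/immuta")
log_dir = expected_log_dir("dbfs:/cluster-logs/immuta", "0123-456789-abcde")
```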

For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.

Using the Validation and Debugging Notebook

The Validation and Debugging Notebook (immuta-validation.ipynb) is packaged with other Databricks release artifacts (for manual installations), or it can be downloaded from the App Settings page when configuring native Databricks through the Immuta UI. This notebook is designed to be used by or under the guidance of an Immuta Support Professional.

Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
Click the arrow next to your name and select Import.
Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.

Simplified Databricks Configuration

Audience: System Administrators
Content Summary: This guide details the simplified installation method for enabling native access to Databricks with Immuta policies enforced.
Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.

Databricks Unity Catalog

If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.

1 - Add the Integration on the App Settings Page

Log in to Immuta and click the App Settings icon in the left sidebar.
Scroll to the System API Key subsection under HDFS and click Generate Key.
Click Save and then Confirm.
Scroll to the Integration Settings section, and click + Add a Native Integration.
Select Databricks Integration from the dropdown menu.
Complete the Hostname field.
Enter a Unique ID for the integration. By default, your Immuta instance URL populates this field. This ID is used to tie the set of cluster policies to your instance of Immuta and allows multiple instances of Immuta to access the same Databricks workspace without cluster policy conflicts.
Select your configured Immuta IAM from the dropdown menu.
Choose one of the following options for your data access model:
- Protected until made available by policy: All tables are hidden until a user is granted access through an Immuta policy. This model follows the least-privilege approach that most databases use, and it requires you to register all tables with Immuta.
- Available until protected by policy: All tables are open until they are explicitly registered and protected by Immuta. This model is a good fit if most of your tables are non-sensitive and you want to choose which ones to protect.
Select the Storage Access Type from the dropdown menu.
Opt to add any Additional Hadoop Configuration Files.
Click Add Native Integration.
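The difference between the two data access models in the steps above comes down to what happens to tables that are not registered in Immuta. The following sketch models that decision; it is a simplified illustration under the assumptions stated in the comments, and the constant and function names are hypothetical and do not correspond to Immuta settings.

```python
PROTECTED_UNTIL_AVAILABLE = "protected_until_made_available"
AVAILABLE_UNTIL_PROTECTED = "available_until_protected"


def table_visible(model, registered_in_immuta, user_granted_by_policy):
    """Simplified model of whether a user can see a table.

    Assumption: once a table is registered, visibility depends only on
    whether an Immuta policy grants the user access, in either model.
    """
    if registered_in_immuta:
        return user_granted_by_policy
    # Unregistered tables are open only in the "available" model.
    return model == AVAILABLE_UNTIL_PROTECTED


unregistered_protected = table_visible(PROTECTED_UNTIL_AVAILABLE, False, False)
unregistered_available = table_visible(AVAILABLE_UNTIL_PROTECTED, False, False)
```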

2 - Configure Cluster Policies

Several cluster policies are available on the App Settings page when configuring this integration:

Python & SQL
Python & SQL & R
Python & SQL & R with Library Support
Scala
Sparklyr

Click a link above to read more about each of these cluster policies before continuing with the tutorial.

Click Configure Cluster Policies.
Select one or more cluster policies in the matrix by clicking the Select button(s).
Opt to make changes to these cluster policies by clicking Additional Policy Changes and editing the text field.
Use one of the two Installation Types described in the tabs below to apply the policies to your cluster:

Automatically Push Cluster Policies

This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.

Select the Automatically Push Cluster Policies radio button.
Enter your Admin Token. This token must be for a user who can create cluster policies in Databricks.
Click Apply Policies.

Manually Push Cluster Policies

Enabling this option will allow you to manually push the cluster policies to the configured Databricks workspace. There will be various files to download and manually push to the configured Databricks workspace.

Select the Manually Push Cluster Policies radio button.
Click Download Init Script.
Follow the steps in the Instructions to upload the init script to DBFS section.
Click Download Policies, and then manually add these Cluster Policies in Databricks.

Opt to click Download the Benchmarking Suite to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Click Close, and then click Save and Confirm.

3 - Add Policies to Your Cluster

Create a cluster in Databricks by following the Databricks documentation.
In the Policy dropdown, select the Cluster Policies you pushed or manually added from Immuta.
Select the Custom Access mode.
Opt to adjust Autopilot Options and Worker Type settings: The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
Opt to configure the Instances tab in the Advanced Options section:
- IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
Click Create Cluster.

4 - Register Data

5 - Query Immuta Data

When the Immuta-enabled Databricks cluster has been successfully started, Immuta will create an immuta database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the immuta database. Additionally, when configuring a Databricks cluster you can hide immuta from any calls to SHOW DATABASES so that users aren't misled or confused by its presence. For more details, see the Hiding the immuta Database in Databricks page.

Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster.
See the Databricks Data Source Creation guide for a detailed walkthrough of creating Databricks data sources in Immuta.

Example Queries

Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the immuta database.

%sql
select * from immuta.my_data_source limit 5;

%sql
select * from my_data_source limit 5;

Manual Databricks Installation

Audience: System Administrators
Content Summary: This guide details the manual installation method for enabling native access to Databricks with Immuta policies enforced.
Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.

Databricks Unity Catalog

If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.

The immuta_conf.xml is no longer required.

The immuta_conf.xml file that was previously used to configure the native Databricks integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. However, you can use these snippets if you wish to deploy an immuta_conf.xml file to set properties.

The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional immuta_conf.xml file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.

If you have an existing immuta_conf.xml file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method will ensure that any property defaults changed in upcoming Immuta releases are propagated to your environment.

1 - Download and Configure Immuta Artifacts

Spark Version

Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
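The rule above can be written as a small check, for example in a script that selects which artifact to stage. This is a minimal sketch: the function name is hypothetical, and it assumes the runtime version string begins with the major version number (such as "6.4" or "9.1 LTS").

```python
def spark_major_for_runtime(runtime_version):
    """Return the Spark major version to use for a Databricks Runtime version."""
    runtime_major = int(runtime_version.strip().split(".")[0])
    # Runtime 7.x or later pairs with Spark 3; earlier runtimes with Spark 2.
    return 3 if runtime_major >= 7 else 2


spark_for_old_runtime = spark_major_for_runtime("6.4")
spark_for_new_runtime = spark_major_for_runtime("9.1 LTS")
```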

Navigate to the Immuta archives page. If you are prompted to log in and need basic authentication credentials, contact your Immuta support professional.
Navigate to the Databricks folder for your Immuta version. Ex: https://archives.immuta.com/hadoop/databricks/2024.1.13/.
Download the .jar file (Immuta plugin) as well as the other scripts listed below, which will load the plugin at cluster startup.
```
allowedCallingClasses.json
immuta-benchmark-suite.dbc
immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar
immuta_cluster_init_script.sh
obscuredCommands.yaml
```
The immuta-benchmark-suite.dbc is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Specify the following properties as Spark environment variables or in the optional immuta_conf.xml file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the config names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
- immuta.system.api.key: Obtain this value from the Immuta Configuration UI under HDFS > System API Key. You will need to be a user with the APPLICATION_ADMIN role to complete this action. Warning: Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
- immuta.base.url: The full URL for the target Immuta instance. Ex: https://immuta.mycompany.com.
- immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta App Settings page within the Identity Management section. See Databricks to Immuta User Mapping for more details.
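The naming rule for environment variables (upper case, with underscores in place of periods) is mechanical, so it can be applied with a short helper. The sketch below is illustrative: the function names are hypothetical, and the property values are placeholders.

```python
def to_env_var(property_name):
    """Convert a config name such as immuta.base.url to IMMUTA_BASE_URL."""
    return property_name.upper().replace(".", "_")


def render_env_lines(properties):
    """Render NAME=value lines for the cluster's Environment Variables section."""
    return [to_env_var(name) + "=" + value for name, value in properties.items()]


# Placeholder values; substitute your own instance URL and IAM ID.
lines = render_env_lines({
    "immuta.base.url": "https://immuta.mycompany.com",
    "immuta.user.mapping.iamid": "<IAM ID>",
})
```

Sensitive values such as the system API key are better supplied through Databricks secrets than written out in plain text.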

Environment Variables with Google Cloud Platform

Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml.

2 - Stage Immuta Artifacts

When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. To do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:

Host files in AWS/S3 and provide access by the cluster
Host files in Azure ADL Gen 1 or Gen 2 and provide access by the cluster
Host files on an HTTPS server accessible by the cluster
Host files in DBFS (Not recommended for production)

These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. In order for the init script to find these files, a URI will have to be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.

AWS/S3

URI Structure: s3://[bucket]/[path]

Create an instance profile for clusters by following Databricks documentation.
Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.

Authenticating with Access Keys or Session Tokens (Optional)

If you wish to authenticate using access keys, add the following items to the cluster's environment variables:

IMMUTA_INIT_AWS_SECRET_ACCESS_KEY=<aws secret key>
IMMUTA_INIT_AWS_ACCESS_KEY_ID=<aws access key id>

If you've assumed a role and received a session token, that can be added here as well:

IMMUTA_INIT_AWS_SESSION_TOKEN=<aws session token>

Azure

ADL Gen 2

URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]

Upload the configuration file, JSON file, and JAR file to an ADL gen 2 blob container.

Environment Variables:

If you want to authenticate using an account key, add the following to your cluster's environment variables:

IMMUTA_INIT_AZCOPY_CRED_TYPE=SharedKey
IMMUTA_INIT_ACCOUNT_NAME=<ADLg2 account name>
IMMUTA_INIT_ACCOUNT_KEY=<ADLg2 account key>

If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:

IMMUTA_INIT_AZURE_SAS_TOKEN=<SAS token>

ADL Gen 1

URI Structure: adl://[account].azuredatalakestore.net/[path]

Upload the configuration file, JSON file, and JAR file to ADL gen 1.

Environment Variables:

If authenticating as a Microsoft Entra ID user,

IMMUTA_INIT_AZURE_AD_USER=<Microsoft Entra ID username>
IMMUTA_INIT_AZURE_PASSWORD=<Microsoft Entra ID password>

If authenticating using a service principal,

IMMUTA_INIT_AZURE_SERVICE_PRINCIPAL=<azure service principal>
IMMUTA_INIT_AZURE_PASSWORD=<azure service principal password>
IMMUTA_INIT_AZURE_TENANT=<tenant ID where principal was created>

HTTPS

URI Structure: http(s)://[host](:port)/[path]

Artifacts are available for download from Immuta using basic authentication. Your basic authentication credentials can be obtained from your Immuta support professional.

Environment Variables (Optional)

IMMUTA_INIT_HTTPS_USER=<basic auth username>
IMMUTA_INIT_HTTPS_PASSWORD=<basic auth password>

# Note: Credentials can also be included as part of the artifact URI. For example,
IMMUTA_INIT_JAR_URI=https://user:password@archives.immuta.com/path/to/file

DBFS

DBFS does not support access control. Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.

URI Structure: dbfs:/[path]

Upload the artifacts directly to DBFS using the Databricks CLI.

Since any user has access to everything in DBFS:

The artifacts can be stored anywhere in DBFS.
It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
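Before setting the artifact URIs on a cluster, it can be useful to confirm that each one matches a URI structure described in this step. The sketch below classifies a URI by its scheme using Python's standard `urllib.parse`; the mapping and function name are hypothetical helpers, and the check covers only the scheme and not the rest of the URI.

```python
from urllib.parse import urlparse

# Schemes taken from the URI structures listed for each hosting method.
HOSTING_METHODS = {
    "s3": "AWS/S3",
    "abfs": "ADL Gen 2",
    "abfss": "ADL Gen 2",
    "adl": "ADL Gen 1",
    "http": "HTTPS",
    "https": "HTTPS",
    "dbfs": "DBFS",
}


def hosting_method(uri):
    """Return the hosting method implied by an artifact URI's scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in HOSTING_METHODS:
        raise ValueError("unsupported artifact URI scheme: %r" % scheme)
    return HOSTING_METHODS[scheme]


# Hypothetical artifact locations.
s3_method = hosting_method("s3://my-bucket/immuta/allowedCallingClasses.json")
dbfs_method = hosting_method("dbfs:/immuta/obscuredCommands.yaml")
```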

3 - Protect Immuta Environment Variables with Databricks Secrets

It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the immuta-spark-hive.jar file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.

Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret

MY_SECRET_ENV_VAR=super_secret_stuff

they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section:

MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}

Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
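The reference syntax shown above can be generated for several variables at once. The sketch below is one way to do that, under the assumption (made here for illustration) that each secret key is the lower-cased variable name; the function names and the sample variables are hypothetical.

```python
def secret_ref(scope, key):
    """Build a secret reference of the form {{secrets/<scope>/<key>}}."""
    return "{{secrets/" + scope + "/" + key + "}}"


def protect_env_vars(env_vars, sensitive_names, scope):
    """Replace sensitive values with secret references; leave others as-is.

    Assumption: the secret key is the lower-cased variable name.
    """
    return {
        name: secret_ref(scope, name.lower()) if name in sensitive_names else value
        for name, value in env_vars.items()
    }


protected = protect_env_vars(
    {"MY_SECRET_ENV_VAR": "super_secret_stuff", "MY_PLAIN_VAR": "ok"},
    sensitive_names={"MY_SECRET_ENV_VAR"},
    scope="my_secrets",
)
```

The secrets themselves still have to be created in the named scope before the cluster starts.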

Best Practice: Replace Sensitive Variables with Secrets

Immuta recommends that ANY SENSITIVE environment variables listed below in the various artifact deployment instructions be replaced with secrets.

4 - Create and Configure the Cluster

Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users so that other users cannot create clusters that are not Immuta-enabled.

Create a cluster in Databricks by following the Databricks documentation.
Select the Custom Access mode.
Opt to adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
In the Advanced Options section, click the Instances tab.
- IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)

Click the Spark tab. In Spark Config field, add your configuration.

Cluster Configuration Requirements:

spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager /
    -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json /
    -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager /
    -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json /
    -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.databricks.repl.allowedLanguages python,sql
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.isv.product Immuta

In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be protected with Databricks secrets as mentioned above.

# Specify the URI to the artifacts that were hosted in the previous steps
# The URI must adhere to the supported types for each service mentioned above
IMMUTA_INIT_JAR_URI=<Full URI to immuta-spark-hive.jar>
IMMUTA_INIT_CONF_URI=<Full URI to Immuta configuration file>
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=<full URI to allowedCallingClasses.json>
IMMUTA_INIT_OBSCURED_COMMANDS_URI=<full URI to obscuredCommands.yaml>

# (OPTIONAL)
# Specify an additional configuration file to be added to the spark.sparkContext.hadoopConfiguration.
# This file allows administrators to add sensitive configuration needed by the SparkSession that
# should not be viewable by users.
# Further explanation of this variable as well as examples are provided below.
IMMUTA_INIT_ADDITIONAL_CONF_URI=<full URI to additional configuration file>

Click the Init Scripts tab and set the following configurations:
- Destination: Specify the service you used to host the Immuta artifacts.
- File Path: Specify the full URI to the immuta_cluster_init_script.sh.
- Add the new key/value to the configuration.
Click the Permissions tab and configure the following setting:
- Who has access: Users or groups will need to have the permission Can Attach To to execute queries against Immuta configured data sources.
(Re)start the cluster.
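For teams that create clusters through the Databricks Clusters API instead of the UI, the settings from the steps above map onto the `spark_conf`, `spark_env_vars`, and `init_scripts` fields of a cluster specification. The sketch below assembles that fragment as a Python dictionary. It is a hedged illustration: the function name, the URIs, and the secret scope are hypothetical, and it assumes the trailing "/" characters in the Spark Config listing mark line continuations, so the three Java options are joined with spaces.

```python
import json

# Assumption: the three options form one space-separated value.
JAVA_OPTIONS = " ".join([
    "-Djava.security.manager=com.immuta.security.ImmutaSecurityManager",
    "-Dimmuta.security.manager.classes.config="
    "file:///databricks/immuta/allowedCallingClasses.json",
    "-Dimmuta.spark.encryption.fpe.class="
    "com.immuta.spark.encryption.ff1.ImmutaFF1Service",
])


def immuta_cluster_settings(destination_type, init_script_uri, env_vars):
    """Assemble the Immuta-related fields of a cluster specification."""
    return {
        "spark_conf": {
            "spark.executor.extraJavaOptions": JAVA_OPTIONS,
            "spark.driver.extraJavaOptions": JAVA_OPTIONS,
            "spark.databricks.repl.allowedLanguages": "python,sql",
            "spark.databricks.pyspark.enableProcessIsolation": "true",
            "spark.databricks.isv.product": "Immuta",
        },
        "spark_env_vars": dict(env_vars),
        # destination_type names the hosting service, for example "dbfs" or "s3".
        "init_scripts": [{destination_type: {"destination": init_script_uri}}],
    }


# Hypothetical URIs and secret scope.
settings = immuta_cluster_settings(
    "dbfs",
    "dbfs:/immuta/immuta_cluster_init_script.sh",
    {
        "IMMUTA_INIT_JAR_URI": "dbfs:/immuta/immuta-spark-hive.jar",
        "IMMUTA_SYSTEM_API_KEY": "{{secrets/immuta/immuta_system_api_key}}",
    },
)
payload = json.dumps(settings, indent=2)
```

Some destination types need additional fields (an S3 destination also takes a region, for example), so check the Clusters API reference for the service you use.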

Additional Hadoop Configuration File (Optional)

As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration in order to read the data composing Immuta data sources.

As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.

To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI environment variable referenced in the Create and configure the cluster section to be the full URI to this file.

The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.

Amazon S3

IAM Role for S3 Access

S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.

<configuration>
    <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>[AWS access key ID]</value>
    </property>
    <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>[AWS secret key]</value>
    </property>
</configuration>

Azure Data Lake Gen 2

<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].dfs.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>

Azure Data Lake Gen 1

ADL Prefix

Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.

<configuration>
    <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>https://login.microsoftonline.com/[directory ID]/oauth2/token</value>
    </property>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.adl.oauth2.credential</name>
        <value>[client secret from Azure]</value>
    </property>
    <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>[client ID from Azure]</value>
    </property>
</configuration>

Azure Blob Storage

<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].blob.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>
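
All of the example files above share the same shape: a configuration element containing name/value property pairs. As an illustration only, such a file could be generated rather than written by hand. In the Python sketch below, build_hadoop_conf is a hypothetical helper (it is not part of Immuta or Databricks), and the storage account name and key are placeholders.

```python
from xml.sax.saxutils import escape


def build_hadoop_conf(properties):
    """Render a dict of Hadoop properties as a <configuration> XML document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("    <property>")
        # Escape &, < and > so secrets containing them stay valid XML.
        lines.append(f"        <name>{escape(name)}</name>")
        lines.append(f"        <value>{escape(value)}</value>")
        lines.append("    </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Placeholder values (assumptions); substitute your own account name and key.
account = "mystorageaccount"
conf_xml = build_hadoop_conf(
    {f"fs.azure.account.key.{account}.dfs.core.windows.net": "[storage account key]"}
)
print(conf_xml)
```

The resulting file would then be hosted at the URI given in IMMUTA_INIT_ADDITIONAL_CONF_URI.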

5 - Register Data

6 - Query Immuta Data

When the Immuta-enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.

Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the immuta database.

The following SQL query can be run as an administrator within a notebook to give the user access to the immuta database:

%sql
GRANT SELECT,READ_METADATA ON DATABASE immuta TO `user@company.com`

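Once access is granted, users can query a data source either through the immuta database or, because Immuta supports raw tables in Databricks, by its table name without the immuta prefix:

%sql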
select * from immuta.my_data_source limit 5;

%sql
select * from my_data_source limit 5;
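
When several users need access, the GRANT statement shown earlier in this section can be generated per user. The Python sketch below is only an illustration: grant_immuta_access is a hypothetical helper, and the second email address is an invented example. It produces the statement text; an administrator would still run each statement in a %sql cell.

```python
def grant_immuta_access(user):
    """Build the GRANT statement for one Databricks user name."""
    # Backticks delimit the user name in the statement, so reject names
    # that contain one rather than trying to escape them.
    if "`" in user or not user.strip():
        raise ValueError(f"unsupported user name: {user!r}")
    return f"GRANT SELECT,READ_METADATA ON DATABASE immuta TO `{user}`"


# Example user names (assumptions for illustration).
for user in ["user@company.com", "analyst@company.com"]:
    print(grant_immuta_access(user))
```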

Creating a Databricks Data Source

See the Databricks Data Source Creation guide for a detailed walkthrough.

Databricks to Immuta User Mapping

By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and the connected Immuta instance must match.

It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of immuta.user.mapping.iamid created and hosted in the previous steps must be updated to be the targeted IAM ID configured within the Immuta instance. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.
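
The effect of immuta.user.mapping.iamid can be pictured as a lookup keyed by both the IAM ID and the username. The Python sketch below is a simplified model, not Immuta's implementation: the in-memory user set, the IAM IDs "bim" and "okta", and the resolve_user function are all assumptions made for illustration.

```python
# Hypothetical stand-in for Immuta's user directory: (IAM ID, username) pairs.
IMMUTA_USERS = {
    ("bim", "user@company.com"),
    ("okta", "user@company.com"),     # same username, different IAM
    ("okta", "analyst@company.com"),
}


def resolve_user(databricks_username, iam_id="bim"):
    """Return the (IAM ID, username) pair the cluster would match, or None.

    iam_id models the single IAM a cluster is mapped to; "bim" as the
    default value is an assumption for this sketch.
    """
    key = (iam_id, databricks_username)
    return key if key in IMMUTA_USERS else None


print(resolve_user("user@company.com"))             # matched in the default IAM
print(resolve_user("analyst@company.com"))          # no match in the default IAM
print(resolve_user("analyst@company.com", "okta"))  # matched once the cluster targets that IAM
```

Because a cluster maps to exactly one IAM, a username that exists only in another IAM does not resolve on that cluster.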

Manually Update Your Databricks Cluster

Audience: System Administrators
Content Summary: This guide details how to manually update your Databricks cluster after changes to the Immuta init script or cluster policies are made.

If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.

Log in to Immuta as an Application Admin.
Click the App Settings icon in the left sidebar and click the Integrations tab.
Your existing Databricks integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.
Click Add Native Integration and select Databricks Integration to add a new integration.
Enter your Databricks integration settings again as configured previously.
Click Add Native Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.
Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.
Use the tabs below to view instructions for automatically pushing cluster policies and the init script (recommended) or manually updating your cluster policies.

Automatically Push Cluster Policies

Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.
Select Apply Policies to push the cluster policies and init script again.
Click Save and Confirm to deploy your changes.

Manually Update Cluster Policies

Download the init script and the new cluster policies to your local computer.
Click Save and Confirm to save your changes in Immuta.
Log in to your Databricks workspace with your administrator account to set up cluster policies.
Find the path where you will upload the init script (immuta_cluster_init_script_proxy.sh) by opening one of the cluster policy .json files and looking for the defaultValue of the field init_scripts.0.dbfs.destination. This should be a DBFS path in the form of dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh.
Click Data in the left pane and upload your init script to DBFS at the path you found above.
To find the existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.
Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.

Restart any Databricks clusters using these updated policies for the changes to take effect.

Install a Trusted Library

Audience: System Administrators
Content Summary: This page outlines how to install and configure trusted third-party libraries for Databricks.

1 - Install the Library

Specifying More than One Trusted Library

To specify more than one trusted library, comma delimit the URIs:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/my.group.id:my-package-id:1.2.3,dbfs:/path/to/my/library.jar

In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks libraries API.
In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI:

Maven Artifacts

For Maven artifacts, the URI is maven:/<maven_coordinates>, where <maven_coordinates> is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. Here's an example of an installed artifact:

In this example, you would add the following Spark environment variable:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644

.jar Artifacts

For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. Here's an example of an installed artifact:

In this example, you would add the following Spark environment variable:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar

Restart the cluster.
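
The URI rules above (a maven:/ prefix for Maven coordinates, the Source field for .jar artifacts, commas between multiple entries) can be sketched in Python as follows. The helper names maven_uri and trusted_libs_setting are hypothetical, and the validation they perform is an assumption of this sketch rather than documented Immuta behavior.

```python
ENV_VAR = "IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS"


def maven_uri(coordinates):
    """Turn the Coordinates field of a Maven artifact into a maven:/ URI."""
    return f"maven:/{coordinates}"


def trusted_libs_setting(uris):
    """Join one or more artifact URIs into the environment variable line."""
    cleaned = [u.strip() for u in uris]
    # A comma separates entries, so a comma inside one URI would be ambiguous.
    if not cleaned or any(not u or "," in u for u in cleaned):
        raise ValueError("each URI must be non-empty and must not contain a comma")
    return ENV_VAR + "=" + ",".join(cleaned)


print(trusted_libs_setting([
    maven_uri("my.group.id:my-package-id:1.2.3"),
    "dbfs:/path/to/my/library.jar",
]))
```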

2 - Execute a Command in a Notebook

Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:

TrustedLibraryUtils: Successfully found all configured Immuta configured trusted libraries in Databricks.
TrustedLibraryUtils: Wrote trusted libs file to [/databricks/immuta/immutaTrustedLibs.json]: true.
TrustedLibraryUtils: Added trusted libs file with 1 entries to spark context.
TrustedLibraryUtils: Trusted library installation complete.
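
If you have exported the driver log as text, a check for these messages can be scripted. The Python sketch below assumes the messages appear exactly as shown above; trusted_library_status is a hypothetical helper, and how you obtain the log text is left to your environment.

```python
import re

COMPLETE = "TrustedLibraryUtils: Trusted library installation complete."
ENTRIES = re.compile(r"TrustedLibraryUtils: Added trusted libs file with (\d+) entries")


def trusted_library_status(driver_log):
    """Return (installed, entry_count) based on the driver log text."""
    match = ENTRIES.search(driver_log)
    count = int(match.group(1)) if match else None
    return COMPLETE in driver_log, count


sample_log = """\
TrustedLibraryUtils: Successfully found all configured Immuta configured trusted libraries in Databricks.
TrustedLibraryUtils: Wrote trusted libs file to [/databricks/immuta/immutaTrustedLibs.json]: true.
TrustedLibraryUtils: Added trusted libs file with 1 entries to spark context.
TrustedLibraryUtils: Trusted library installation complete.
"""
print(trusted_library_status(sample_log))
```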

Limited Enforcement in Databricks

Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases. However, in some native patterns (such as Snowflake), Immuta adds views to allow users access to Immuta sources but does not impede access to preexisting sources in the underlying database. Therefore, if a user had access in Snowflake to a table before Immuta was installed, they would still have access to that table after.

Unlike the example above, Databricks non-admin users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The Limited Enforcement Scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.

This feature is composed of two configurations:

Allowing non-Immuta reads: Immuta users with regular (unprivileged) Databricks roles may SELECT from tables that are not registered in Immuta.
Allowing non-Immuta writes: Immuta users with regular (unprivileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta.

Additionally, Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not. To configure Immuta to do so, navigate to the Enable Auditing of All Queries in Databricks section.
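
The interaction of the two settings can be summarized as a small decision function. The Python sketch below is a simplified model of the behavior described above for non-admin users, not Immuta's actual enforcement code; the function name and the three result labels are assumptions made for illustration.

```python
def decide(operation, protected_by_immuta, allow_reads=False, allow_writes=False):
    """Classify a non-admin user's request under Limited Enforcement Scope.

    operation: "read" or "write".
    protected_by_immuta: True if the table is registered as a data source
    or belongs to a native workspace.
    """
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation!r}")
    if protected_by_immuta:
        # Subscriptions and policies in Immuta decide; the flags never bypass them.
        return "immuta-enforced"
    allowed = allow_reads if operation == "read" else allow_writes
    return "allowed" if allowed else "denied"


print(decide("read", protected_by_immuta=False))                    # defaults: denied
print(decide("read", protected_by_immuta=False, allow_reads=True))  # allowed
print(decide("write", protected_by_immuta=False, allow_reads=True)) # denied
print(decide("read", protected_by_immuta=True, allow_reads=True))   # immuta-enforced
```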

Enable Non-Immuta Reads

Non-Immuta Reads

This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL.
When non-Immuta reads are enabled, Immuta users will see all databases and tables when they run show databases and/or show tables. However, this does not mean they will be able to query all of them.

Enable non-Immuta Reads by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
```
<property>
    <name>immuta.spark.databricks.allow.non.immuta.reads</name>
    <value>true</value>
</property>
```
Opt to adjust the cache duration by changing the default value in the Spark environment variables (recommended) or immuta_conf.xml (not recommended). (Immuta caches whether a table has been exposed as an Immuta source to improve performance. The default caching duration is 1 hour.)
```
<property>
    <name>immuta.spark.non.immuta.table.cache.seconds</name>
    <value>3600</value>
</property>
```
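
To see what the cache duration implies, consider a time-based cache of the "is this table an Immuta source?" answer. The Python sketch below is an illustrative model only: RegistrationCache and its lookup callback are hypothetical, and Immuta's real cache may behave differently. Under this model, a change in a table's registration status may not be reflected until the cached entry expires.

```python
import time


class RegistrationCache:
    """Remember whether a table is an Immuta source for ttl_seconds."""

    def __init__(self, lookup, ttl_seconds=3600, clock=time.monotonic):
        self._lookup = lookup      # callable: table name -> bool
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the sketch can be tested
        self._entries = {}         # table name -> (expires_at, is_registered)

    def is_registered(self, table):
        now = self._clock()
        cached = self._entries.get(table)
        if cached and cached[0] > now:
            return cached[1]       # still fresh: answer from the cache
        value = self._lookup(table)
        self._entries[table] = (now + self._ttl, value)
        return value


# Usage with a stand-in lookup (an assumption for this example).
cache = RegistrationCache(lambda table: table == "sales", ttl_seconds=3600)
print(cache.is_registered("sales"), cache.is_registered("scratch"))
```

A shorter duration makes registration changes visible sooner at the cost of more lookups.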

Enable Non-Immuta Writes

Non-Immuta Writes

These non-protected tables/spaces have the same exposure as detailed in the read section, but with the distinction that users can write data directly to these paths.
With non-Immuta writes enabled, it will be possible for users on the cluster to mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data, and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s native workspaces.

Enable non-Immuta Writes by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
```
<property>
    <name>immuta.spark.databricks.allow.non.immuta.writes</name>
    <value>true</value>
</property>
```
Opt to adjust the cache duration by changing the default value in the Spark environment variables (recommended) or immuta_conf.xml (not recommended). (Immuta caches whether a table has been exposed as an Immuta source to improve performance. The default caching duration is 1 hour.)
```
<property>
    <name>immuta.spark.non.immuta.table.cache.seconds</name>
    <value>3600</value>
</property>
```

Enable Auditing of All Queries in Databricks

Enable support for auditing all queries run on a Databricks cluster (regardless of whether users touch Immuta-protected data or not) by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):

<property>
    <name>immuta.spark.audit.all.queries</name>
    <value>true</value>
</property>

Default Configuration Values

The controls and default values associated with non-Immuta reads, non-Immuta writes, and audit functionality are outlined below.

<property>
    <name>immuta.spark.databricks.allow.non.immuta.reads</name>
    <value>false</value>
</property>
<property>
    <name>immuta.spark.databricks.allow.non.immuta.writes</name>
    <value>false</value>
</property>
<property>
    <name>immuta.spark.non.immuta.table.cache.seconds</name>
    <value>3600</value>
</property>
<property>
    <name>immuta.spark.audit.all.queries</name>
    <value>false</value>
</property>
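
When reviewing a cluster's settings, it can help to overlay your overrides on these defaults and look at the result. The Python sketch below copies the property names and default values from the list above; effective_settings is a hypothetical helper, and rejecting unknown names is a choice made for this sketch.

```python
DEFAULTS = {
    "immuta.spark.databricks.allow.non.immuta.reads": "false",
    "immuta.spark.databricks.allow.non.immuta.writes": "false",
    "immuta.spark.non.immuta.table.cache.seconds": "3600",
    "immuta.spark.audit.all.queries": "false",
}


def effective_settings(overrides):
    """Return the defaults with the given overrides applied."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        # Catch misspelled property names early.
        raise KeyError(f"unknown properties: {unknown}")
    return {**DEFAULTS, **overrides}


settings = effective_settings({"immuta.spark.audit.all.queries": "true"})
for name, value in settings.items():
    print(f"{name} = {value}")
```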

Hiding the Immuta Database in Databricks

Audience: System Administrators
Content Summary: This page describes how to hide the immuta database in Databricks.

Hiding the database does not disable access to it

Queries can still be performed against tables in the immuta database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table) regardless of whether or not this feature is enabled.

Overview

The immuta database on Immuta-enabled clusters allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, Immuta supports raw tables in Databricks, so table-backed queries do not need to reference this database. When configuring a Databricks cluster, you can hide immuta from any calls to SHOW DATABASES so that users are not confused or misled by that database.

Hide the `immuta` Database

When configuring a Databricks cluster, hide immuta by using the following environment variable in the Spark cluster configuration:

IMMUTA_SPARK_SHOW_IMMUTA_DATABASE=false

Then, Immuta will not show this database when a SHOW DATABASES query is performed.
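
The visible effect of this variable can be modeled as a filter over the SHOW DATABASES result. The Python sketch below is a conceptual model only: visible_databases is hypothetical, and both the assumption that the database is shown when the variable is unset and the case-insensitive comparison are choices made for this sketch.

```python
def visible_databases(databases, env):
    """Filter a SHOW DATABASES result according to the environment variable."""
    raw = env.get("IMMUTA_SPARK_SHOW_IMMUTA_DATABASE", "true")
    show_immuta = raw.strip().lower() != "false"
    return [db for db in databases if show_immuta or db != "immuta"]


all_dbs = ["default", "immuta", "sales"]
print(visible_databases(all_dbs, {}))
print(visible_databases(all_dbs, {"IMMUTA_SPARK_SHOW_IMMUTA_DATABASE": "false"}))
```

As noted above, hiding the database affects only the listing; queries against immuta-qualified table names continue to work.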

Run spark-submit Jobs on Databricks

Audience: System Administrators
Content Summary: This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.

Language Support

R and Scala are supported, but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit jobs are not supported by the Databricks Spark integration.

Using R in a Notebook

Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions.

R `spark-submit`

Prerequisites

Before you can run spark-submit jobs on Databricks you must initialize the Spark session with the settings outlined below.

Initialize the Spark session by entering these settings into the R submit script: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
This will enable the R script to access Immuta data sources, scratch paths, and workspace tables.
Once the script is written, upload it to a location in DBFS/S3/ABFS that the Databricks cluster can access.

Create the R `spark-submit` Job

To create the R spark-submit job,

Go to the Databricks jobs page.
Create a new job, and select Configure spark-submit.

Set up the parameters:

 [
 "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
 "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
 "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
 "dbfs:/path/to/script.R",
 "arg1", "arg2", "..."
 ]

Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.

Edit the cluster configuration, and change the Databricks Runtime to be a supported version (5.5, 6.4, 7.3, or 7.4).
Configure the Environment Variables section as you normally would for an Immuta cluster.

Scala spark-submit

Prerequisites

Before you can run spark-submit jobs on Databricks you must initialize the Spark session with the settings outlined below.

Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.

The spark-submit job needs to be launched using a different classloader that points at the designated user JARs directory. The following Scala template can be used to launch your submit code using a separate classloader:

package com.example.job

import java.net.URLClassLoader
import java.io.File

import org.apache.spark.sql.SparkSession

object ImmutaSparkSubmitExample {
    def main(args: Array[String]): Unit = {
        val jarDir = new File("/databricks/immuta/jars/")
        val urls = jarDir.listFiles.map(_.toURI.toURL)

        // Configure a new ClassLoader which will load jars from the additional jars directory
        val cl = new URLClassLoader(urls)
        val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
        val job = jobClass.newInstance
        jobClass.getMethod("runJob").invoke(job)
    }
}

class ImmutaSparkSubmitExample {

    def getSparkSession(): SparkSession = {
        SparkSession.builder()
            .appName("Example Spark Submit")
            .enableHiveSupport()
            .config("immuta.spark.acl.assume.not.privileged", "true")
            .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
            .getOrCreate()
    }

    def runJob(): Unit = {
        val spark = getSparkSession()
        try {
            val df = spark.table("immuta.<YOUR DATASOURCE>")

            // Run Immuta Spark queries...

        } finally {
            spark.stop()
        }
    }
}

Create the Scala `spark-submit` Job

To create the Scala spark-submit job,

Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.

Select Configure spark-submit, and configure the parameters:

 [
 "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
 "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
 "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
 "--class","org.youorg.package.MainClass",
 "dbfs:/path/to/code.jar",
 "arg1", "arg2", "..."
 ]

Note: In the --class parameter, specify the fully-qualified name of the class whose main function will be used as the entry point for your code.

Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks) assuming the cluster is configured with access to that path.

Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" (where dbfs:/path/to/code.jar is the path to your jar) so that the jar is uploaded to all the cluster nodes.
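
The R and Scala parameter lists above differ only in the optional --class entry and the artifact path, so they can be produced by one function. The Python sketch below assembles the same JSON array; spark_submit_params is a hypothetical helper, and the class name is taken from the Scala template above purely as an example.

```python
import json

# Java options copied from the parameter lists above.
JAVA_OPTS = (
    "-Djava.security.manager=com.immuta.security.ImmutaSecurityManager "
    "-Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json "
    "-Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service"
)


def spark_submit_params(artifact, args=(), main_class=None):
    """Build the spark-submit parameter list for an R script or a Scala JAR."""
    params = [
        "--conf", f"spark.driver.extraJavaOptions={JAVA_OPTS}",
        "--conf", f"spark.executor.extraJavaOptions={JAVA_OPTS}",
        "--conf", "spark.databricks.repl.allowedLanguages=python,sql,scala,r",
    ]
    if main_class:
        # Only JAR jobs need an entry point class.
        params += ["--class", main_class]
    return params + [artifact, *args]


r_params = spark_submit_params("dbfs:/path/to/script.R", ["arg1", "arg2"])
scala_params = spark_submit_params(
    "dbfs:/path/to/code.jar",
    ["arg1", "arg2"],
    main_class="com.example.job.ImmutaSparkSubmitExample",
)
print(json.dumps(scala_params, indent=1))
```

The printed array can be pasted into the parameters field of the spark-submit job configuration.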

Caveats

The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.
There is an option of using the immuta.api.key setting with an Immuta API key generated on the Immuta Profile Page.
Currently when an API key is generated it invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set the immuta.api.key on all the clusters or use a specified job user for the submit job.

Project UDFs Cache Settings

This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the .

Use Project UDFs in Databricks

Currently, caches are not all invalidated outside of Databricks because Immuta caches information pertaining to a user's current project in the NameNode plugin and in Vulcan. Consequently, this feature should only be used in Databricks.

Web Service and On-Cluster Caches

Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project.

To allow use of project UDFs in Spark jobs, raise the cache timeout on-cluster and lower the cache timeouts for the Immuta Web Service. Otherwise, caching could cause inconsistencies among the requests and calls to multiple web workers when users try to change their project contexts.

Recommended Configuration

1 - Lower Web Service Cache Timeout

Click the App Settings icon in the left sidebar and scroll to the HDFS Cache Settings section.
Lower the Cache TTL of HDFS user names (ms) to 0.
Click Save.

2 - Raise Cache Timeout On-Cluster

In the Spark environment variables section, set the IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS to high values (like 10000).

Note: These caches will be invalidated on cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on cluster to avoid periodically reaching out to the web service.

Blocking UDFs

External Metastores

Audience: System Administrators
Content Summary: This document describes how to use an existing Hive external metastore instead of the built-in metastore.

Local or Remote Mode

Immuta supports the use of external metastores in , following the same configuration detailed in the .

Configure External Hive Metastore

Download the metastore jars and point to them as specified in . Metastore jars must end up on the cluster's local disk at this explicit path: /databricks/hive_metastore_jars.

If using DBR 7.x with Hive 2.3.x, either

Set spark.sql.hive.metastore.version to 2.3.7 and spark.sql.hive.metastore.jars to builtin or
Download the metastore jars and set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/* as before.
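
The two DBR 7.x options can be written out as Spark configuration entries. The Python sketch below only restates the values given above; hive_metastore_conf is a hypothetical helper, and the space-separated "key value" output format is an assumption about how the entries are entered in the cluster's Spark Config field. Other metastore settings (connection URL, credentials, and so on) are outside this sketch.

```python
def hive_metastore_conf(use_builtin_jars):
    """Return the Spark settings for DBR 7.x with Hive 2.3.x."""
    if use_builtin_jars:
        return {
            "spark.sql.hive.metastore.version": "2.3.7",
            "spark.sql.hive.metastore.jars": "builtin",
        }
    # Downloaded jars must already be on the cluster's local disk at this path.
    return {"spark.sql.hive.metastore.jars": "/databricks/hive_metastore_jars/*"}


for key, value in hive_metastore_conf(use_builtin_jars=True).items():
    print(f"{key} {value}")
```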

Configure AWS Glue Data Catalog

To use AWS Glue Data Catalog as the metastore for Databricks, see the .

Reference Guides

Databricks Spark Pre-Configuration Details

Audience: System Administrators, Data Owners, and Data Users
Content Summary: This page describes the Databricks integration, configuration options, and features.
See the for a tutorial on enabling Databricks and these features through the App Settings page.

Feature Availability

Project Workspaces, Databricks Tag Ingestion, User Impersonation, Native Query Audit, Multiple Integrations

Supported Databricks Cluster Configurations

Table columns: Example cluster, Databricks Runtime, Unity Catalog in Databricks, Databricks Spark integration, Databricks Spark with Unity Catalog support, Databricks Unity Catalog integration.

Legend: each cell marks the feature or integration as enabled or disabled.

Databricks-Specific Details

Prerequisites

Databricks instance has network level access to Immuta instance
Permissions and access to download (outside Internet access) or transfer files to the host machine

Recommended Databricks Workspace Configurations:

Supported Databricks Runtime Versions

Supported Databricks Cluster Types

Supported Access Mode and Languages

Immuta supports the Custom access mode.

Supported Languages:
- Python
- SQL
- R (requires advanced configuration; work with your Immuta support professional to use R)
- Scala (requires advanced configuration; work with your Immuta support professional to use Scala)

Supported Features

The Immuta Databricks integration supports the following Databricks features:

Workspaces

Tag Ingestion

User Impersonation

Native Query Audit

Audit Limitations

Capturing the code or query that triggers the Spark plan makes audit records more useful in assessing what users are doing.

Multiple Databricks Instances

A user can configure multiple integrations of Databricks to a single Immuta instance and use them dynamically or with workspaces.

Limitation

Immuta does not support Databricks clusters with Photon acceleration enabled.

Cluster Policies

Python & SQL

Audience: System Administrators
Content Summary: This page describes the Python & SQL cluster policy.

Performance

This is the most performant policy configuration.

In this configuration, Immuta is able to rely on Databricks-native security controls, reducing overhead. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes. This Immuta cluster configuration relies on Py4J security being enabled.

Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library. Additionally, only Python and SQL are available as supported languages.

For full details on Databricks’ best practices in configuring clusters, please read their .

Python & SQL & R

Audience: System Administrators
Content Summary: This page describes the Python & SQL & R cluster policy.

Additional Overhead

In relation to the Python & SQL cluster policy, this configuration trades some additional overhead for added support of the R language.

In this configuration, you are able to rely on the Databricks-native security controls. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes.

Like the Python & SQL configuration, Py4j security is enabled for the Python & SQL & R configuration. However, because R has been added, Immuta enables the SecurityManager, in addition to Py4j security, to provide more security guarantees. For example, by default all actions in R execute as the root user; among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To address these security issues, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user with limited filesystem and network access and installs the Immuta SecurityManager, which prevents users from bypassing policies and protects against the above vulnerabilities from within the JVM.

Consequently, the cost of introducing R is that the SecurityManager incurs a small increase in performance overhead; however, average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

For full details on Databricks’ best practices in configuring clusters, please read their .

Python & SQL & R with Library Support

Audience: System Administrators
Content Summary: This page describes the Python & SQL & R with Library Support cluster policy.

Py4j Security Disabled

In addition to support for Python, SQL, and R, this configuration adds support for additional Python libraries and utilities by disabling Databricks-native Py4j security.

This configuration does not rely on Databricks-native Py4j security to secure the cluster, while process isolation is still enabled to secure filesystem and network access from within Python processes. On an Immuta-enabled cluster, once Py4J security is disabled the Immuta SecurityManager is installed to prevent nefarious actions from Python in the JVM. Disabling Py4J security also allows for expanded Python library support, including many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs.

By default, all actions in R will execute as the root user. Among other things, this permits access to the entire filesystem (including sensitive configuration data). And without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To properly support the use of the R language, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user. This user has limited filesystem and network access. The Immuta SecurityManager is also installed to prevent users from bypassing policies and to protect against the above vulnerabilities from within the JVM.

The SecurityManager will incur a small increase in performance overhead; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

A homogeneous cluster is recommended for configurations where Py4J security is disabled. If all users have the same level of authorization, there would not be any data leakage, even if a nefarious action was taken.

For full details on Databricks’ best practices in configuring clusters, please read their .

Scala

Audience: System Administrators
Content Summary: This page describes the Scala cluster policy.

Scala Clusters

This configuration is for Scala-only clusters.

Where Scala language support is needed, this configuration can be used in the Custom access mode.

According to Databricks’ cluster type support documentation, Scala clusters are intended for single users. However, nothing inherently prevents a Scala cluster from being configured for multiple users. Even with the Immuta SecurityManager enabled, there are limitations to user isolation within a Scala job.

For a secure configuration, it is recommended that clusters intended for Scala workloads are limited to Scala jobs only and are made homogeneous through the use of or externally via convention/cluster ACLs. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

For full details on Databricks’ best practices in configuring clusters, please read their .

Sparklyr

Audience: System Administrators
Content Summary: This page describes the sparklyr cluster policy.

Single-User Clusters Recommended

Like Databricks, Immuta recommends single-user clusters for sparklyr when user isolation is required. A single-user cluster can either be a job cluster or a cluster with credential passthrough enabled. Note: spark-submit jobs are not currently supported.

Two cluster types can be configured with sparklyr: Single-User Clusters (recommended) and Multi-User Clusters (discouraged).

Single-User Clusters: Credential Passthrough (required on Databricks) allows a single-user cluster to be created. This setting automatically configures the cluster to assume the role of the attached user when reading from storage (S3). Because Immuta requires that raw data is readable by the cluster, the instance profile associated with the cluster should be used rather than a role assigned to the attached user.
Multi-User Clusters: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, it is not recommended to deploy a multi-user cluster. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).

Single-User Cluster Configuration

1 - Enable sparklyr

In addition to the configuration for an Immuta cluster with R, add this environment variable to the Environment Variables section of the cluster:

This configuration makes changes to the iptables rules on the cluster to allow the sparklyr client to connect to the required ports on the JVM used by the sparklyr backend service.

2 - Set Up a sparklyr Connection in Databricks

Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so library(sparklyr) in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from CRAN. Additionally, loading library(DBI) will allow you to execute SQL queries.
Set up a sparklyr connection:
Pass the connection object to execute queries:

3 - Configure a Single-User Cluster

Add the following items to the Spark Config section of the cluster:

The trustedFileSystems setting is required to allow Immuta’s wrapper FileSystem (used in conjunction with the ImmutaSecurityManager for data security purposes) to be used with credential passthrough. Additionally, the InstanceProfileCredentialsProvider must be configured to continue using the cluster’s instance profile for data access, rather than a role associated with the attached user.

Multi-User Cluster Configuration

Immuta Discourages Deploying Multi-User Clusters with sparklyr Configuration

It is possible, but not recommended, to deploy a multi-user cluster sparklyr configuration. Immuta cannot guarantee user isolation in a multi-user sparklyr configuration.

The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta native workspaces.

Add the following environment variables to the Environment Variables section of your cluster configuration:
Add the following items to the Spark Config section:

Limitations

Immuta’s integration with sparklyr does not currently support

spark-submit jobs,
UDFs, or
Databricks Runtimes 5, 6, or 7.

Databricks Change Data Feed

Audience: Databricks Users
Content Summary: This page describes Immuta's support of Databricks Change Data Feed (CDF).

Overview

CDF shows the row-level changes between versions of a Delta table. The changes displayed include row data and metadata that indicates whether the row was inserted, deleted, or updated.

Immuta does not support applying policies to the changed data, and the CDF cannot be read for data source tables if the user does not have access to the raw data in Databricks. However, the CDF can be read if the querying user is allowed to read the raw data and one of the following statements is true:

the table is in the current workspace,
the table is in a scratch path,
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting, or
non-Immuta reads are enabled AND the table is not part of an Immuta data source.
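
The conditions above can be read as a single boolean rule. The sketch below restates them in Python purely for illustration; the CdfReadContext class, its field names, and the can_read_cdf function are hypothetical and are not part of Immuta's API.

```python
from dataclasses import dataclass


@dataclass
class CdfReadContext:
    """Hypothetical flags describing one CDF read (names are illustrative)."""
    user_can_read_raw_data: bool
    table_in_current_workspace: bool = False
    table_in_scratch_path: bool = False
    non_immuta_reads_enabled: bool = False
    # True when the table intersects a workspace the user is NOT acting under.
    table_in_other_workspace: bool = False
    table_is_immuta_data_source: bool = True


def can_read_cdf(ctx: CdfReadContext) -> bool:
    """Return True when the rules listed above would allow reading the CDF."""
    # Access to the raw data is a precondition for every case.
    if not ctx.user_can_read_raw_data:
        return False
    return (
        ctx.table_in_current_workspace
        or ctx.table_in_scratch_path
        or (ctx.non_immuta_reads_enabled and not ctx.table_in_other_workspace)
        or (ctx.non_immuta_reads_enabled and not ctx.table_is_immuta_data_source)
    )
```

For instance, a user with raw access to a table in a scratch path satisfies the rule, while a user without raw access never does, regardless of the other flags.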

Configure Change Data Feed

There are no configuration changes necessary to use this feature.

Limitation

Immuta does not support reading changes in .

Databricks Libraries

Audience: Databricks Administrators
Content Summary: This page provides an overview of Immuta's trusted libraries feature and support of notebook-scoped libraries on machine learning clusters.

Databricks Libraries and Immuta's Security Manager

The Immuta security manager blocks users from executing code that could allow them to gain access to sensitive data by only allowing select code paths to access sensitive files and methods. These select code paths provide Immuta's code access to sensitive resources while blocking end users from these sensitive resources directly.

Similarly, when users install third-party libraries, those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

Databricks Trusted Libraries

The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. An administrator can specify an installed library as "trusted," which will enable that library's code to bypass the Immuta security manager. Contact your Immuta support professional for custom security configurations for your libraries.

This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through what previously would have been blocked by the security manager.

Security Vulnerability

Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine whether this risk applies to your environment or use case.

Support

Databricks Libraries API

Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.

The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:

Library source is Upload, DBFS or DBFS/S3 and the Library Type is Jar.
Library source is Maven.

Limitations

Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
- waiting for library installation to complete before running any third-party library commands or
- executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util library, the user will not be able to directly use the file-util library to read a sensitive file that is normally protected by the Immuta security manager.
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
1. Add the transitive dependency jar paths to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable. In the driver log4j logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for a log message similar to the following:
2. In the above example, where slf4j is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.

Troubleshooting

In case of failure, check the driver logs for details. Some possible causes of failure include

One of the Immuta configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.
For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version.
Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.
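
A quick way to catch a malformed Maven URI before restarting a cluster is to check it against the format given above. This is a minimal sketch, not Immuta's own validation: the function name is hypothetical, and the assumption that coordinates contain no colons, slashes, or whitespace is mine rather than the source's.

```python
import re

# Shape taken from the format above: maven:/group.id:artifact-id:version
# Assumption: no coordinate contains ':', '/', or whitespace.
_MAVEN_URI = re.compile(r"^maven:/([^:/\s]+):([^:/\s]+):([^:/\s]+)$")


def parse_trusted_maven_uri(uri: str) -> tuple:
    """Split a trusted-library Maven URI into (group_id, artifact_id, version)."""
    match = _MAVEN_URI.match(uri)
    if match is None:
        raise ValueError(
            f"{uri!r} does not follow maven:/group.id:artifact-id:version"
        )
    return match.groups()
```

A value such as maven:/org.slf4j:slf4j-api:1.7.25 parses into its three coordinates, while a value missing the maven:/ prefix raises a ValueError.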

Configuration

Notebook-Scoped Libraries on Machine Learning Clusters

Configuration

No additional configuration is needed to enable this feature. Users only need to be running on clusters with DBR 8+.

DBFS Access

Audience: System Administrators
Content Summary: This page outlines how to access DBFS in Databricks for non-sensitive data. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or the immuta_conf.xml file (not recommended).

DBFS FUSE Mount

DBFS FUSE Mount Limitation

This feature cannot be used in environments with E2 Private Link enabled.

This feature (provided by Databricks) mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.

For example,

In Python,

Note: This solution also works in R and Scala.
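
As a sketch of the Python case, files under the mount are handled with ordinary local file I/O once the FUSE mount is enabled. The helper name and the mount_root parameter below are hypothetical; the parameter exists only so the snippet can also be tried against a temporary directory off-cluster.

```python
import os


def write_then_read(relative_path: str, text: str, mount_root: str = "/dbfs") -> str:
    """Write text to a file under the mount, then read it back.

    On a cluster with the FUSE mount enabled, mount_root would stay "/dbfs".
    """
    full_path = os.path.join(mount_root, relative_path.lstrip("/"))
    # Create parent directories as for any local path.
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "w") as handle:
        handle.write(text)
    with open(full_path) as handle:
        return handle.read()
```

Because every user on the cluster can read and write these files, the same caution applies here as elsewhere on this page: do not write sensitive data this way.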

Enable DBFS FUSE Mount

To enable the DBFS FUSE mount, set this configuration: immuta.spark.databricks.dbfs.mount.enabled=true.

Mounting a Bucket

Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount.
Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.
Mounting must be performed from a non-Immuta cluster.

Scala DBUtils (and %fs magic) with Scratch Paths

Scratch paths will work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example,

Configure Scala DBUtils (and %fs magic) with Scratch Paths

To support %fs magic and Scala DBUtils with scratch paths, configure

Configure DBUtils in Python

To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.

Example Workflow

This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.

Get the file from remote storage:
Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:
Write the new file back to remote storage:

Delta Lake API

When using Delta Lake, the API does not go through the normal Spark execution path. This means that Immuta's Spark extensions do not provide protection for the API. To solve this issue and ensure that Immuta has control over what a user can access, the Delta Lake API is blocked.

Spark SQL can be used instead to give the same functionality with all of Immuta's data protections.

Requests

Below is a table of the Delta Lake API with the Spark SQL that may be used instead.

Delta Lake API

Spark SQL

See here for a complete list of the .

Merging Tables in Native Workspaces

When a table is created in a native workspace, you can merge a different Immuta data source from that workspace into the table you created.

Create a table in the native workspace.
Create a temporary view of the Immuta data source you want to merge into that table.
Use that temporary view as the data source you add to the project workspace.
Run the following command:

Environment Variables

This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended).

This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.

Environment Variable Overrides

Properties in the config file can be overridden during installation using environment variables. The variable names are the config names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
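
The naming rule is mechanical, so it can be expressed as a one-line transformation. The helper names below are hypothetical and shown only to illustrate the rule.

```python
def config_to_env_var(config_name: str) -> str:
    """Apply the rule above: upper-case the name and swap periods for underscores."""
    return config_name.upper().replace(".", "_")


def env_var_line(config_name: str, value: str) -> str:
    """Format a NAME=value line for the cluster's Environment Variables section."""
    return f"{config_to_env_var(config_name)}={value}"
```

For example, env_var_line("immuta.base.url", "https://immuta.mycompany.com") yields the IMMUTA_BASE_URL line shown above.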

immuta.ephemeral.host.override
- Default: true
- Description: Set this to false if ephemeral overrides should not be enabled for Spark. When true, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.
immuta.ephemeral.host.override.httpPath
- Description: This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
immuta.ephemeral.table.path.check.enabled
- Default: true
- Description: When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.
immuta.spark.acl.enabled
- Default: true
- Description: Immuta Access Control List (ACL). Controls whether Databricks users are blocked from accessing non-Immuta tables. Ignored if Databricks Table ACLs are enabled (i.e., spark.databricks.acl.dfAclsEnabled=true).
immuta.spark.acl.whitelist
- Description: Comma-separated list of Databricks usernames who may access raw tables when the Immuta ACL is in use.
immuta.spark.acl.privileged.timeout.seconds
- Default: 3600
- Description: The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is whitelisted in immuta.spark.acl.whitelist.
immuta.spark.acl.assume.not.privileged
- Default: false
- Description: Session property that overrides privileged user status when the Immuta ACL is in use. This should only be used in R scripts associated with spark-submit jobs.
immuta.spark.audit.all.queries
- Default: false
- Description: Enables auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.
immuta.spark.databricks.allow.non.immuta.reads
- Default: false
- Description: Allows non-privileged users to SELECT from tables that are not protected by Immuta. See for details about this feature.
immuta.spark.databricks.allow.non.immuta.writes
- Default: false
- Description: Allows non-privileged users to run DDL commands and data-modifying commands against tables or spaces that are not protected by Immuta. See for details about this feature.
immuta.spark.databricks.allowed.impersonation.users
- Description: This configuration is a comma-separated list of Databricks users who are allowed to impersonate Immuta users.
immuta.spark.databricks.dbfs.mount.enabled
- Default: false
- Description: Exposes the DBFS FUSE mount located at /dbfs. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.
immuta.spark.databricks.disabled.udfs
- Description: Block one or more Immuta UDFs from being used on an Immuta cluster. This should be a Java regular expression that matches the set of UDFs to block by name (excluding the immuta database). For example, to block all project UDFs, you may configure this to be ^.*_projects?$. For a list of functions, see the .
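
To preview which names a pattern such as ^.*_projects?$ would match, it can be tried against a list of candidate names. This sketch uses Python's re module as a stand-in for Java's regular expression engine (the two agree for this simple pattern), and the UDF names are hypothetical placeholders, not a list of Immuta's actual functions.

```python
import re

# The example pattern from the description above.
BLOCK_PATTERN = re.compile(r"^.*_projects?$")


def blocked_udfs(udf_names):
    """Return the names the pattern would block, in their original order."""
    return [name for name in udf_names if BLOCK_PATTERN.search(name)]


# Hypothetical UDF names, used only to exercise the pattern.
candidates = ["current_project", "list_projects", "mask_column", "project_list"]
matched = blocked_udfs(candidates)
```

Here matched contains only the names ending in _project or _projects; a name that merely contains the word elsewhere, such as project_list, is not matched.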
immuta.spark.databricks.filesystem.blacklist
- Default: hdfs
- Description: A list of filesystem protocols that this instance of Immuta will not support for workspaces. This is useful in cases where a filesystem is available to a cluster but should not be used on that cluster.
immuta.spark.databricks.filesystem.is3a.path.style.access.config
- Default: false
- Description: Enables the is3a filesystem that retrieves your API key and communicates with Immuta as if it were talking directly to S3, allowing users to access data sources through Immuta's s3p endpoint. This setting is only available on Databricks 7+ clusters.
immuta.spark.databricks.jar.uri
- Default: file:///databricks/jars/immuta-spark-hive.jar
- Description: The location of immuta-spark-hive.jar on the filesystem for Databricks. This should not need to change unless a custom initialization script that places immuta-spark-hive in a non-standard location is necessary.
immuta.spark.databricks.local.scratch.dir.enabled
- Default: true
- Description: Creates a world-readable/writable scratch directory on local disk to facilitate the use of dbutils and 3rd party libraries that may write to local disk. Its location is non-configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR. Note: Sensitive data should not be stored at this location.
immuta.spark.databricks.log.level
- Default: INFO
- Description: The SLF4J log level to apply to Immuta's Spark plugins.
immuta.spark.databricks.log.stdout.enabled
- Default: false
- Description: If true, writes logging output to stdout/the console as well as the log4j-active.txt file (default in Databricks).
immuta.spark.databricks.py4j.strict.enabled
- Default: true
- Description: Disable to allow the use of the dbutils API in Python. Note: This setting should only be disabled for customers who employ a homogeneous integration (i.e., all users have the same level of data access).
immuta.spark.databricks.scratch.database
- Description: This configuration is a comma-separated list of additional databases that will appear as scratch databases when running a SHOW DATABASES query. This configuration increases performance by circumventing the Metastore to get the metadata for all the databases to determine what to display for a SHOW DATABASES query; it won't affect access to the scratch databases. Instead, use immuta.spark.databricks.scratch.paths to control read and write access to the underlying database paths.
  Additionally, this configuration will only display the scratch databases that are configured and will not validate that the configured databases exist in the Metastore. Therefore, it is up to the Databricks administrator to properly set this value and keep it current.
immuta.spark.databricks.scratch.paths
- Description: Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db for the default location).
  To create a scratch path to a location or a database stored at that location, configure
  To create a scratch path to a database created using the default location,
immuta.spark.databricks.scratch.paths.create.db.enabled
- Default: false
- Description: Enables non-privileged users to create or drop scratch databases.
immuta.spark.databricks.single.impersonation.user
- Default: false
- Description: When true, this configuration prevents users from changing their impersonation user once it has been set for a given Spark session. This configuration should be set when the BI tool or other service allows users to submit arbitrary SQL or issue SET commands.
immuta.spark.databricks.submit.tag.job
- Default: true
- Description: Denotes whether the Spark job will be run that "tags" a Databricks cluster as being associated with Immuta.
immuta.spark.databricks.trusted.lib.uris
- Description:
immuta.spark.non.immuta.table.cache.seconds
- Default: 3600
- Description: The number of seconds Immuta caches whether a table has been exposed as a source in Immuta. This setting only applies when immuta.spark.databricks.allow.non.immuta.writes or immuta.spark.databricks.allow.non.immuta.reads is enabled.
immuta.spark.require.equalization
- Default: false
- Description: Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages.
immuta.spark.resolve.raw.tables.enabled
- Default: true
- Description: Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or whitelisted users can set immuta.spark.session.resolve.raw.tables.enabled to false to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.
immuta.spark.session.resolve.raw.tables.enabled
- Default: true
- Description: Same as above, but a session property that allows users to toggle this functionality. If users run set immuta.spark.session.resolve.raw.tables.enabled=false, they will see raw data only (not Immuta data policy-enforced data). Note: This property is not set in immuta_conf.xml.
immuta.spark.show.immuta.database
- Default: true
- Description: This shows the immuta database in the configured Databricks cluster. When set to false Immuta will no longer show this database when a SHOW DATABASES query is performed. However, queries can still be performed against tables in the immuta database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table) regardless of whether or not this feature is enabled.
immuta.spark.version.validate.enabled
- Default: true
- Description: Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to true, if versions are incompatible, that information will be logged to the Databricks driver logs and the cluster will not be usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), this check can be set to false so that the versions don't get logged as incompatible and make the cluster unusable.
immuta.user.context.class
- Default: com.immuta.spark.OSUserContext
- Description: The class name of the UserContext that will be used to determine the current user in immuta-spark-hive. The default implementation gets the OS user running the JVM for the Spark application.
immuta.user.mapping.iamid
- Default: bim
- Description: Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (bim) but should be updated to reflect an actual production IAM.

Ephemeral Overrides

Audience: System Administrators
Content Summary: This page describes ephemeral overrides for Databricks data sources.

Best Practices: Ephemeral Overrides

Disable ephemeral overrides for clusters when using multiple workspaces and dedicate a single cluster to serve queries from Immuta in a single workspace.
If you use multiple E2 workspaces without disabling ephemeral overrides, avoid applying the where user row-level policy to data sources.

Overview

In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.

Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations and queries that the user runs through the Query Editor.

When a user runs a Spark job in Databricks, the Immuta plugins automatically submit ephemeral overrides for that user to Immuta for all applicable data sources. These overrides make the current cluster the compute for all subsequent metadata operations that user runs against those data sources.

Example Query and Ephemeral Override Request

A user runs a query on cluster B.
The Immuta plugins on the cluster check if there is a source in the Metastore with a matching database, table name, and location for its underlying data. Note: If tables are dynamic or change over time, users can disable the comparison of the location of the underlying data by setting immuta.ephemeral.table.path.check.enabled to false; disabling this configuration allows users to avoid keeping the relevant data sources in Immuta up-to-date (which would require API calls and automation).
The Immuta plugins on the cluster detect that the user is subscribed to data sources 1, 2, and 3 and that data sources 1 and 3 are both present in the Metastore for cluster B, so the plugins submit ephemeral override requests for data sources 1 and 3 to override their connections with the HTTP path from cluster B.
Since data source 2 is not present in the Metastore, it is marked as a JDBC source.

If the user attempts to query data source 2 and they have not enabled JDBC sources, they will be presented with an error message telling them to do so:

com.immuta.spark.exceptions.ImmutaConfigurationException: This query plan will cause data to be pulled over JDBC. This spark context is not configured to allow this. To enable JDBC set immuta.enable.jdbc=true in the spark context hadoop configuration.
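
The example above can be sketched as a small decision function. This is a simplified model for illustration, not Immuta's implementation: the function name, the dictionary layout, and the HTTP path value are hypothetical, and the match on database, table name, and location is reduced to simple set membership.

```python
def plan_overrides(subscribed_sources, metastore_sources, cluster_http_path):
    """Decide, per subscribed data source, how queries on this cluster are served.

    Sources found in the cluster's Metastore get an ephemeral override that
    points at this cluster; all others are treated as JDBC sources.
    """
    plan = {}
    for source in subscribed_sources:
        if source in metastore_sources:
            plan[source] = {"mode": "ephemeral_override", "httpPath": cluster_http_path}
        else:
            plan[source] = {"mode": "jdbc"}
    return plan


# The scenario above: the user is subscribed to data sources 1, 2, and 3,
# and only 1 and 3 exist in cluster B's Metastore.
plan = plan_overrides([1, 2, 3], {1, 3}, "hypothetical/http/path/cluster-b")
```

In this model, data sources 1 and 3 receive overrides pointing at cluster B, and data source 2 falls back to JDBC, which is the case that triggers the error above unless JDBC is enabled.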

Immuta Operations that Use Ephemeral Overrides

Ephemeral overrides are enabled by default because Immuta must be aware of a cluster that is running to serve metadata queries. The operations that use the ephemeral overrides include

Visibility checks on the data source for a particular user. These checks assess how to apply row-level policies for specific users.
Stats collection triggered by a specific user.
Validating a custom WHERE clause policy against a data source. When owners or governors create custom WHERE clause policies, Immuta uses compute resources to validate the SQL in the policy. In this case, the ephemeral overrides for the user writing the policy are used to contact a cluster for SQL validation.
High Cardinality Column detection. Certain advanced policy types (e.g., minimization and randomized response) in Immuta require a High Cardinality Column, and that column is computed on data source creation. It can be recomputed on demand and, if so, will use the ephemeral overrides for the user requesting computation.

However, ephemeral overrides can be problematic in environments that have a dedicated cluster to handle maintenance activities, since ephemeral overrides can cause these operations to execute on a different cluster than the dedicated one.

Configure Overrides in Immuta-Enabled Clusters

To reduce the risk that a user has overrides set to a cluster (or multiple clusters) that aren't currently up,

direct all clusters' HTTP paths for overrides to a cluster dedicated for metadata queries or
disable overrides completely.

Disable Ephemeral Overrides

To disable ephemeral overrides, set immuta.ephemeral.host.override in spark-defaults.conf to false.

Py4j Security Error

Audience: Data Users and System Administrators
Content Summary: This page provides an explanation and solution for this common Databricks error.

Py4j Security Error Details

Error Message: py4j.security.Py4JSecurityException: Constructor <> is not whitelisted
Explanation: This error indicates you are being blocked by Py4j security rather than the Immuta Security Manager. Py4j security is strict and generally ends up blocking many ML libraries.
Solution: Turn off Py4j security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4j security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.

S3 Access in Databricks

Overview

You can use a library (like Boto 3 in Python) to access standard Amazon S3 and point it at Immuta to access your data. The integration with Databricks uses a file system (is3a) that retrieves your API key and communicates with Immuta as if it were talking directly to S3, allowing users to access S3 and Azure Blob data sources through Immuta's s3p endpoint.

This mechanism never goes to S3 directly. To access S3 directly, you will need to expose an S3-backed table or view in the Databricks Metastore as a source or use native workspaces/scratch paths.

Accessing Object-Backed Data Sources in Spark or Databricks

Configure Your Cluster

To use the is3a filesystem, add the following snippet to your cluster configuration:

IMMUTA_SPARK_DATABRICKS_FS_IS3A_PATH_STYLE_ACCESS=true

This configuration is needed to allow any access to is3a on Databricks 7+.

Query Your Data

Register your blob as an Immuta data source.
In Databricks or Spark, write queries that access this data by referencing the S3 path (shown in the Basic Information section of the Upload Files modal above), but using the URL scheme is3a:
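
As a minimal sketch of the scheme swap described in the last step, the helper below rewrites an S3 path to use is3a. The function name and the example bucket and key are hypothetical, and accepting both s3 and s3a as input schemes is an assumption.

```python
from urllib.parse import urlsplit


def to_is3a(s3_path: str) -> str:
    """Return the same bucket and key with the URL scheme replaced by is3a."""
    parts = urlsplit(s3_path)
    # Assumption: the path shown in Immuta uses the s3 or s3a scheme.
    if parts.scheme not in ("s3", "s3a"):
        raise ValueError(f"expected an s3:// or s3a:// path, got {s3_path!r}")
    return parts._replace(scheme="is3a").geturl()


# Hypothetical bucket and key, for illustration only.
example = to_is3a("s3://example-bucket/reports/2020.csv")
```

On a cluster, the resulting path would then be handed to whichever Spark reader suits the file format.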

Limitations

This integration is only available for object-backed data sources. Consequently, all the standard limitations that apply to object-backed data sources in Immuta apply here.
Additional configuration is necessary to allow is3a paths to function as scratch paths. Contact your Immuta support professional for guidance.

Scala Cluster Security Details

Audience: System Administrators
Content Summary: It is most secure to leverage an equalized project when working in a Scala cluster; however, it is not required to limit Scala to equalized projects. This document outlines security recommendations for Scala clusters and discusses the security risks involved when equalized projects are not used.

Language Support

R and Scala are both supported, but require advanced configuration; work with your Immuta support professional to use these languages.

Recommendations

There are limitations to isolation among users in Scala jobs on a Databricks cluster, even when using Immuta’s SecurityManager. When data is broadcast, cached (spilled to disk), or otherwise saved to SPARK_LOCAL_DIR, it is impossible to tell which user’s data a given file or block contains. If you are concerned about this vulnerability, Immuta suggests that Scala clusters

be limited to Scala jobs only.
use project equalization, which forces all users to act under the same set of attributes, groups, and purposes with respect to their data access.

Context for Security: Why Project Equalization is Recommended

When data is read in Spark using an Immuta policy-enforced plan, the masking and redaction of rows is performed at the leaf level of the physical Spark plan, so a policy such as "Mask using hashing the column social_security_number for everyone" would be implemented as an expression on a project node right above the FileSourceScanExec/LeafExec node at the bottom of the plan. This process prevents raw data from being shuffled in a Spark application and, consequently, from ending up in SPARK_LOCAL_DIR.

This policy implementation coupled with an equalized project guarantees that data being dropped into SPARK_LOCAL_DIR will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached/spilled data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.

Configuration for Requiring Equalized Projects with Scala

To require that Scala clusters be used in equalized projects and avoid the risk described above, change the immuta.spark.require.equalization value to true in your Immuta configuration file when you spin up Scala clusters:

<property>
  <name>immuta.spark.require.equalization</name>
  <value>true</value>
</property>

Once this configuration is complete, users on the cluster will need to switch to an Immuta equalized project before running a job. (Remember that when working under an Immuta Project, only tables within that project can be seen.) Once the first job is run using that equalized project, all subsequent jobs, no matter the user, must also be run under that same equalized project. If you need to change a cluster's project, you must restart the cluster.
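
The pinning rule described above can be pictured with a small model. The following is a minimal sketch in plain Python, not Immuta code: the class ClusterProjectPin and its methods are hypothetical names chosen for illustration, and the real enforcement happens inside the Immuta plugin on the cluster.

```python
class ClusterProjectPin:
    """Toy model: the first job's equalized project is pinned until restart."""

    def __init__(self):
        # No job has run since the last cluster restart.
        self.project = None

    def check_job(self, user, current_project):
        """Return the pinned project, or raise if the job may not run."""
        if current_project is None:
            raise PermissionError(
                f"{user} must switch to an equalized project before running a job"
            )
        if self.project is None:
            # The first job on the cluster pins its project for everyone.
            self.project = current_project
        elif current_project != self.project:
            raise PermissionError(
                f"{user} must run under project {self.project!r}; "
                "restart the cluster to change projects"
            )
        return self.project

    def restart(self):
        """Model a cluster restart, which clears the pinned project."""
        self.project = None
```

In this model, a second user running under a different equalized project is rejected until restart() is called, which mirrors the requirement to restart the cluster before changing its project.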

Security Configuration for Performance

Audience: System Administrators
Content Summary: This page describes how the Security Manager is disabled for Databricks clusters that do not allow R or Scala code to be executed. Databricks Administrators should place the desired configuration in the immuta_conf.xml file.

Automatic Disabling of the Security Manager

The Immuta Security Manager is an essential element of the Databricks deployment that ensures users can't perform unauthorized actions when using Scala and R, since those languages have features that allow users to circumvent policies without the Security Manager enabled. However, the Security Manager must inspect the call stack every time a permission check is triggered, which adds overhead to queries. To improve Immuta's query performance on Databricks, Immuta disables the Security Manager when Scala and R are not being used.

The cluster init script checks the cluster’s configuration and automatically removes the Security Manager configuration when both of the following are true:

spark.databricks.repl.allowedLanguages is a subset of {python, sql}
IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED is true

When the cluster is configured this way, Immuta can rely on Databricks' process isolation and Py4J security to prevent user code from performing unauthorized actions.
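
As a rough illustration of the check described above, the sketch below evaluates the two conditions in plain Python. It is not the init script's actual logic: the function name can_disable_security_manager is hypothetical, the sketch assumes both conditions must hold, and it assumes the allowed languages arrive as a comma-separated string.

```python
import os

# Languages that do not need the Security Manager, per the text above.
SAFE_LANGUAGES = {"python", "sql"}


def can_disable_security_manager(allowed_languages, env=None):
    """Return True only when both documented conditions hold.

    allowed_languages: comma-separated value of the cluster's allowed
        REPL languages setting, or None if it is unset.
    env: mapping of environment variables; defaults to os.environ.
    """
    env = os.environ if env is None else env
    if not allowed_languages:
        # Assumption: an unset value does not restrict languages at all.
        return False
    languages = {
        lang.strip().lower()
        for lang in allowed_languages.split(",")
        if lang.strip()
    }
    strict = (
        env.get("IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED", "")
        .strip()
        .lower()
        == "true"
    )
    # Subset test: every allowed language must be python or sql.
    return bool(languages) and languages <= SAFE_LANGUAGES and strict
```

Under these assumptions, a cluster that allows Scala or R, or that leaves the strict Py4J variable unset, keeps the Security Manager.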

Note: Immuta still expects the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions to be set and pointing at the Security Manager.

Beyond disabling the Security Manager, Immuta will skip several startup tasks that are required to secure the cluster when Scala and R are configured, and fewer permission checks will occur on the Driver and Executors in the Databricks cluster, reducing overhead and improving performance.

Caveats

There are still cases that require the Security Manager; in those instances, Immuta creates a fallback Security Manager to check the code path, so the IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI environment variable must always point to a valid calling class file.
Databricks’ dbutils.fs is blocked by Databricks’ Py4J security; therefore, it can’t be used to access scratch paths.

Spark Direct File Reads

In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to interact with their data using file paths or who have existing workflows revolving around file paths can continue to use these workflows without rewriting those queries for Immuta.

When reading from a path in Spark, the Immuta Databricks plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.

Read Data

Spark Direct File Reads in EMR

EMR uses the same integration as Databricks, but you will need to use the immuta SparkSession just as you normally would to interact with Immuta data sources.

For example, instead of spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet"), use immuta.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet").

Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). The examples below show each method.

Read Data from an Individual Parquet File

To read from an individual file, load a partition file from a sub-directory:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")

Read Partitioned Data from a Sub-Directory

To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01")

Alternatively, load a parquet partition using a where predicate:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table").where("partition_column=01")

Object-Backed Data Sources

Direct file reads in Spark are also supported for object-backed Immuta data sources (such as S3 or Azure Blob data sources) using the is3a file system:

spark.read.format("parquet").load("is3a://immuta/test/path")

Limitations

Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
On Databricks, multiple input paths are supported as long as they belong to the same data source. However, for EMR only a single input path is supported.
CSV-backed tables are not currently supported.

Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:

# Not recommended by Spark and not supported in Immuta
spark.read.format("delta").load("s3://my_bucket/path/to/my_delta_table/partition_column=01")

# Recommended by Spark and supported in Immuta
spark.read.format("delta").load("s3://my_bucket/path/to/my_delta_table").where("partition_column=01")
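
When an existing workflow holds a partition sub-directory path, the path can be split into a table path and a where predicate before loading. The helper below is a minimal sketch under stated assumptions: split_partition_path is a hypothetical name, not part of Immuta or Spark, and it assumes partition directories are the trailing name=value segments of the path.

```python
def split_partition_path(path):
    """Split a partition sub-directory path into (table_path, predicate).

    Trailing segments of the form name=value are treated as partition
    directories and joined with AND. The predicate is an empty string
    when the path has no partition segments.
    """
    segments = path.rstrip("/").split("/")
    predicates = []
    # Peel partition directories off the end of the path, keeping order.
    while segments and "=" in segments[-1]:
        predicates.insert(0, segments.pop())
    table_path = "/".join(segments)
    return table_path, " AND ".join(predicates)


table_path, predicate = split_partition_path(
    "s3://my_bucket/path/to/my_delta_table/partition_column=01"
)
```

The result can then be passed to the supported form, for example spark.read.format("delta").load(table_path).where(predicate) when predicate is not empty. Partition values are copied into the predicate as written, so string-valued partitions would need quoting first.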
