How-to Guides

Project UDFs Cache Settings

This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the Use Project UDFs (Databricks) page.

Use project UDFs in Databricks Spark

Immuta caches information pertaining to a user's current project, and not all of those caches are invalidated outside of Databricks. Consequently, this feature should only be used in Databricks.

  1. Lower the web service cache timeout in Immuta:

    1. Click the App Settings icon and scroll to the HDFS Cache Settings section.

    2. Lower the Cache TTL of HDFS user names (ms) to 0.

    3. Click Save.

  2. Raise the cache timeout on your Databricks cluster: In the Spark environment variables section, set the IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS variables to high values (such as 10000).

    Note: These caches will be invalidated on cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on cluster to avoid periodically reaching out to the web service.
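
    For example, the Spark environment variables section of the cluster configuration might contain the following (10000 seconds is only an illustrative high value):

    IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS=10000
    IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS=10000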

Install a Trusted Library

Databricks Libraries API: Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.

  1. In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks libraries API.

  2. In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI. To specify more than one trusted library, comma delimit the URIs:

For Maven artifacts, the URI is maven:/<maven_coordinates>, where <maven_coordinates> is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For example, if the installed artifact's Coordinates field is com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644, you would add the following Spark environment variable:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644

For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. For example, if the installed artifact's Source field is dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar, you would add the following Spark environment variable:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar
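
To trust more than one library, combine the URIs with commas. A hypothetical combination of the two examples above, shown only to illustrate the comma-delimited format:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644,dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar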

  3. Restart the cluster.

  4. Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:

TrustedLibraryUtils: Successfully found all configured Immuta configured trusted libraries in Databricks.
TrustedLibraryUtils: Wrote trusted libs file to [/databricks/immuta/immutaTrustedLibs.json]: true.
TrustedLibraryUtils: Added trusted libs file with 1 entries to spark context.
TrustedLibraryUtils: Trusted library installation complete.

Troubleshooting

This page provides guidelines for troubleshooting issues with the Databricks Spark integration and resolving Py4J security and Databricks trusted library errors.

Debugging the integration

For easier debugging of the Databricks Spark integration, follow the recommendations below.

  • Enable cluster init script logging:

    • In the cluster page in Databricks for the target cluster, navigate to Advanced Options -> Logging.

    • Change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.

  • View the Spark UI on your target Databricks cluster: On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.

Using the validation and debugging notebook

The validation and debugging notebook is designed to be used by or under the guidance of an Immuta support professional. Reach out to your Immuta representative for assistance.

  1. Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.

  2. Click the arrow next to your name and select Import.

  3. Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.

Py4J security error

  • Error Message: py4j.security.Py4JSecurityException: Constructor <> is not allowlisted

  • Explanation: This error indicates you are being blocked by Py4J security rather than the Immuta Security Manager. Py4J security is strict and generally ends up blocking many ML libraries.

  • Solution: Turn off Py4J security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4J security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.
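
    For example, add this to the Spark environment variables section of the offending cluster:

    IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false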

Databricks trusted library errors

Check the driver logs for details. Some possible causes of failure include:

  • One of the Immuta-configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.

  • For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version (for example, IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/my.group.id:my-package-id:1.2.3).

  • Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.


Manually Update Your Databricks Cluster

If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.

  1. Log in to Immuta as an Application Admin.

  2. Click the App Settings icon in the left sidebar and scroll to the Integration Settings section.

  3. Your existing Databricks Spark integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.

  4. Click Add Integration and select Databricks Integration to add a new integration.

  5. Enter your Databricks Spark integration settings again as configured previously.

  6. Click Add Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.

  7. Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.

  8. Automatically push cluster policies and the init script (recommended) or manually update your cluster policies.

    • Automatically push cluster policies

      1. Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.

      2. Select Apply Policies to push the cluster policies and init script again.

      3. Click Save and Confirm to deploy your changes.

  • Manually update cluster policies

    1. Download the init script and the new cluster policies to your local computer.

    2. Click Save and Confirm to save your changes in Immuta.

    3. Log in to your Databricks workspace with your administrator account to set up cluster policies.

    4. Get the path you will upload the init script (`immuta_cluster_init_script_proxy.sh`) to by opening one of the cluster policy `.json` files and looking for the `defaultValue` of the field `init_scripts.0.dbfs.destination`. This should be a DBFS path in the form of `dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh`.

    5. Click Data in the left pane to upload your init script to DBFS to the path you found above.

    6. To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.

    7. Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.

  9. Restart any Databricks clusters using these updated policies for the changes to take effect.

Configure a Databricks Spark Integration

Permissions

  • APPLICATION_ADMIN Immuta permission

  • CAN MANAGE Databricks privilege on the cluster

    Requirements

    • A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)

    • A cluster that uses one of these supported Databricks Runtimes:

      • 11.3 LTS

      • 14.3 (private preview)

    • Supported languages

      • Python

      • R (not supported for Databricks Runtime 14.3)

      • Scala (not supported for Databricks Runtime 14.3)

      • SQL

    • A Databricks cluster that is one of these supported compute types:

      • All-purpose compute

      • Job compute

    • Custom access mode

    • A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service also must be able to connect to and perform queries on the Databricks cluster, and to call Databricks workspace APIs.

    Prerequisites

    • Enable OAuth M2M authentication (recommended) or personal access tokens.

    • Disable Photon by setting runtime_engine to STANDARD using the Clusters API; a sketch of such a request appears after this list. Immuta does not support clusters with Photon enabled. Photon is enabled by default on compute running Databricks Runtime 9.1 LTS or newer and must be manually disabled before setting up the integration with Immuta.

    • Restrict the set of Databricks principals who have CAN MANAGE privileges on Databricks clusters where the Spark plugin is installed. This is to prevent editing environment variables or Spark configuration, editing cluster policies, or removing the Spark plugin from the cluster, all of which would cause the Spark plugin to stop working.

    • If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster. See the configure cluster policies section below for guidance.

    • If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:

      1. Navigate to the App Settings page and click Integration Settings.

      2. Uncheck the Enable Unity Catalog checkbox.
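
    As referenced in the Photon prerequisite above, runtime_engine can be changed with the Databricks Clusters API (for example, a POST to /api/2.0/clusters/edit). The following is only a sketch: the edit endpoint expects your cluster's full existing specification, and the cluster_id, spark_version, node_type_id, and worker count below are placeholders.

     {
       "cluster_id": "<your-cluster-id>",
       "spark_version": "11.3.x-scala2.12",
       "node_type_id": "<your-node-type>",
       "num_workers": 2,
       "runtime_engine": "STANDARD"
     }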

    Add the integration on the app settings page

    1. Click the App Settings icon in Immuta.

    2. Navigate to HDFS > System API Key and click Generate Key.

    3. Click Save and then Confirm. If you do not save and confirm, the system API key will not be saved.

    4. Scroll to the Integration Settings section.

    5. Click + Add Native Integration and select Databricks Spark Integration from the dropdown menu.

    6. Complete the Hostname field.

    7. Enter a Unique ID for the integration. The unique ID is used to name cluster policies clearly, which is important when managing several Databricks Spark integrations. As cluster policies are workspace-scoped, but multiple integrations might be made in one workspace, this ID lets you distinguish between different sets of cluster policies.

    8. Select the identity manager that should be used when mapping the current Spark user to their corresponding identity in Immuta from the Immuta IAM dropdown menu. This should be set to reflect the identity manager you use in Immuta (such as Entra ID or Okta).

    9. Choose an Access Model. The Protected until made available by policy option disallows reading and writing tables not protected by Immuta, whereas the Available until protected by policy option allows it.

    Behavior change in Immuta v2025.1 and newer

    If a table is registered in Immuta and does not have a subscription policy applied to it, that data will be visible to users in Databricks, even if the Protected until made available by policy setting is enabled.

    If you have enabled this setting, author an "Allow individually selected users" global subscription policy that applies to all data sources.

    10. Select the Storage Access Type from the dropdown menu.

    11. Opt to add any Additional Hadoop Configuration Files.

    12. Click Add Native Integration, and then click Save and Confirm. This will restart the application and save your Databricks Spark integration. (It is normal for this restart to take some time.)

    The Databricks Spark integration will not do anything until your cluster policies are configured, so even though your integration is saved, continue to the next section to configure your cluster policies so the Spark plugin can manage authorization on the Databricks cluster.

    Configure cluster policies

    1. Click Configure Cluster Policies.

    2. Select one or more cluster policies in the matrix. Clusters running Immuta with Databricks Runtime 14.3 can only use Python and SQL. You can make changes to the policy by clicking Additional Policy Changes and editing the environment variables in the text field or by downloading it. See the Spark environment variables reference guide for information about each variable and its default value. Some common settings are linked below:

      1. Audit all queries

      2. Scratch paths

      3. User impersonation (you can also prevent users from changing impersonation in a session)

    3. Select your Databricks Runtime.

    4. Use one of the two installation types described below to apply the policies to your cluster:

      • Automatically push cluster policies: This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.

        1. Select the Automatically Push Cluster Policies radio button.

        2. Enter your Admin Token. This token must be for a user who has the required Databricks privilege. This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.

        3. Click Apply Policies.

      • Manually push cluster policies: Enabling this option allows you to manually push the cluster policies and the init script to the configured Databricks workspace.

        1. Select the Manually Push Cluster Policies radio button.

        2. Click Download Init Script and set the Immuta plugin init script as a cluster-scoped init script in Databricks by following the Databricks documentation.

        3. Click Download Policies, and then manually add this cluster policy to your Databricks workspace.

          1. Ensure that the init_scripts.0.workspace.destination in the policy matches the file path to the init script you configured above.

          2. The Immuta cluster policy references Databricks Secrets for several of the sensitive fields. These secrets must be manually created if the cluster policy is not automatically pushed. Use the Databricks API or CLI to push the proper secrets.

    5. Click Close, and then click Save and Confirm.

    6. Apply the cluster policy generated by Immuta to the cluster with the Spark plugin installed by following the Databricks documentation.

    Map users and grant them access to the cluster

    1. Map external user IDs from Databricks to Immuta.

    2. Give users the Can Attach To permission on the cluster.

Run R and Scala spark-submit Jobs on Databricks

This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.

R spark-submit

    Prerequisites

    Before you can run spark-submit jobs on Databricks, complete the following steps.

    1. Initialize the Spark session:

      1. Enter these settings into the R submit script to allow the R script to access Immuta data sources, scratch paths, and workspace tables: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false". (A minimal example script is sketched after this list.)

      2. Once the script is written, upload the script to a location in dbfs/S3/ABFS to give the Databricks cluster access to it.

    2. Because of how some user properties are populated in Databricks, load the SparkR library in a separate cell before attempting to use any SparkR functions.
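
    The following is a minimal sketch of an R submit script that applies the settings above. The data source name is a placeholder for any Immuta data source exposed to the cluster, and the showDF call and explicit session stop are illustrative only:

    library(SparkR)

    # Initialize the Spark session with the Immuta settings described above
    sparkR.session(sparkConfig = list(
      "immuta.spark.acl.assume.not.privileged" = "true",
      "spark.hadoop.immuta.databricks.config.update.service.enabled" = "false"
    ))

    # Query an Immuta data source (placeholder name)
    df <- sql("SELECT * FROM immuta.<YOUR DATASOURCE>")
    showDF(df)

    # Stop the Spark session at the end of the job
    sparkR.session.stop()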

    Create the R spark-submit Job

    To create the R spark-submit job,

    1. Go to the Databricks jobs page.

    2. Create a new job, and select Configure spark-submit.

    3. Set up the parameters:

     [
     "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
     "dbfs:/path/to/script.R",
     "arg1", "arg2", "..."
     ]

      Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.

    4. Edit the cluster configuration, and change the Databricks Runtime to be a supported version.

    5. Configure the Environment Variables section as you normally would for an Immuta cluster.

    Scala spark-submit

    Prerequisites

    Before you can run spark-submit jobs on Databricks you must initialize the Spark session with the settings outlined below.

    1. Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".

      Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.

    2. The spark submit job needs to be launched using a different classloader which will point at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:

    package com.example.job

    import java.net.URLClassLoader
    import java.io.File

    import org.apache.spark.sql.SparkSession

    object ImmutaSparkSubmitExample {
    def main(args: Array[String]): Unit = {
        val jarDir = new File("/databricks/immuta/jars/")
        val urls = jarDir.listFiles.map(_.toURI.toURL)

        // Configure a new ClassLoader which will load jars from the additional jars directory
        val cl = new URLClassLoader(urls)
        val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
        val job = jobClass.newInstance
        jobClass.getMethod("runJob").invoke(job)
    }
    }

    class ImmutaSparkSubmitExample {

    def getSparkSession(): SparkSession = {
        SparkSession.builder()
        .appName("Example Spark Submit")
        .enableHiveSupport()
        .config("immuta.spark.acl.assume.not.privileged", "true")
        .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
        .getOrCreate()
    }

    def runJob(): Unit = {
        val spark = getSparkSession
        try {
        val df = spark.table("immuta.<YOUR DATASOURCE>")

        // Run Immuta Spark queries...

        } finally {
        spark.stop()
        }
    }
    }

    Create the Scala spark-submit Job

    To create the Scala spark-submit job,

    1. Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.

    2. Select Configure spark-submit, and configure the parameters:

     [
     "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
     "--class","org.youorg.package.MainClass",
     "dbfs:/path/to/code.jar",
     "arg1", "arg2", "..."
     ]

      Note: Pass the fully qualified class name of the class whose main function will be used as the entry point for your code in the --class parameter.

      Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks) assuming the cluster is configured with access to that path.

    3. Edit the cluster configuration, and change the Databricks Runtime to a supported version.

    4. Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" (where dbfs:/path/to/code.jar is the path to your jar) so that the jar is uploaded to all the cluster nodes.

    Caveats

    • The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.

    • Privileged users (Databricks admins and allowlisted users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.

    • There is an option to use the immuta.api.key setting with an Immuta API key generated on the Immuta profile page.

    • Currently when an API key is generated it invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set the immuta.api.key on all the clusters or use a specified job user for the submit job.


    DBFS Access

    This page outlines how to enable access to DBFS in Databricks for non-sensitive data. Databricks administrators should place the desired configuration in the Spark environment variables.

    DBFS FUSE mount

    This Databricks feature mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.

    DBFS FUSE mount limitation: This feature cannot be used in environments with E2 Private Link enabled.

    For example,

    %sh echo "I'm creating a new file in DBFS" > /dbfs/my/newfile.txt

    In Python,

    %python
    with open("/dbfs/my/newfile.txt", "w") as f:
      f.write("I'm creating a new file in DBFS")

    Note: This solution also works in R and Scala.

    Enable DBFS FUSE mount

    To enable the DBFS FUSE mount, set this configuration in the Spark environment variables: IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED=true.

    Mounting a bucket

    • Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount. Mounting must be performed from a non-Immuta cluster.

    • Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.

    Scala DBUtils (and %fs magic) with scratch paths

    Scratch paths will work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example,

    %fs put -f s3://my-bucket/my/scratch/path/mynewfile.txt "I'm creating a new file in S3"
    %scala dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")

    Configure Scala DBUtils (and %fs magic) with scratch paths

    To support %fs magic and Scala DBUtils with scratch paths, configure the following property:

    <property>
       <name>immuta.spark.databricks.scratch.paths</name>
       <value>s3://my-bucket/my/scratch/path</value>
    </property>

    Configure DBUtils in Python

    To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.

    Example workflow

    This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.

    1. Get the file from remote storage:

    %python
    import os

    s3ScratchFile = "s3://some-bucket/path/to/scratch/file"
    localScratchDir = os.environ.get("IMMUTA_LOCAL_SCRATCH_DIR")
    localScratchFile = "{}/myfile.txt".format(localScratchDir)
    dbutils.fs.cp(s3ScratchFile, "file://{}".format(localScratchFile))

    2. Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:

    %python
    import shutil

    localScratchFileCopy = "{}/myfile_copy.txt".format(localScratchDir)
    shutil.copy(localScratchFile, localScratchFileCopy)
    with open(localScratchFileCopy, "a") as f:
        f.write("Some appended file content")

    3. Write the new file back to remote storage:

    %python
    dbutils.fs.cp("file://{}".format(localScratchFileCopy), s3ScratchFile)
