The how-to guides linked on this page illustrate how to integrate Databricks Spark with Immuta.
Requirements
If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster.
If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:
Navigate to the App Settings page and click Integration Settings.
Uncheck the Enable Unity Catalog checkbox.
Click Save.
Connect your technology
These guides provide instructions for getting your data set up in Immuta.
This integration enforces policies on Databricks securables registered in the legacy Hive metastore. Once these securables are registered as Immuta data sources, users can query policy-enforced data on Databricks clusters.
The guides in this section outline how to integrate Databricks Spark with Immuta.
This getting started guide outlines how to integrate Databricks with Immuta.
Register your users
These guides provide instructions on setting up your users in Immuta.
Integrate an IAM with Immuta: Connect the IAM your organization already uses and allow Immuta to register your users for you.
Map external user IDs from Databricks to Immuta: Ensure the user IDs in Immuta, Databricks, and your IAM are aligned so that the right policies impact the right users.
Add data metadata
These guides provide instructions on getting your data metadata set up in Immuta for use in policies.
Connect an external catalog: Connect the external catalog your organization already uses and allow Immuta to continually sync your tags with your data sources for you.
Run identification: Identification allows you to automate data tagging using identifiers that detect certain data patterns.
Protect and monitor data access
These guides provide instructions on authoring policies and auditing data access.
Author a global subscription policy: Once you add your data metadata to Immuta, you can immediately create policies that utilize your tags and apply to your tables. Subscription policies can be created to dictate access to data sources.
Author a global data policy: Data metadata can also be used to create data policies that apply to data sources as they are registered in Immuta. Data policies dictate what data a user can see once they are granted access to a data source. Using catalog and identification tags you can create proactive policies, knowing that they will apply to data sources as they are added to Immuta with the automated tagging.
: Once you have your data sources and users, and policies granting them access, you can set up audit export. This will export the audit logs from user queries, policy changes, and tagging updates.
Configure a Databricks Spark integration: Configure the Databricks Spark integration.
Manually update your Databricks cluster: Manually update your cluster to reflect changes in the Immuta init script or cluster policies.
Install a trusted library: Register a Databricks library with Immuta as a trusted library to avoid Immuta security manager errors when using third-party libraries.
Project UDFs cache settings: Raise the caching on-cluster and lower the cache timeouts for the Immuta web service to allow use of project UDFs in Spark jobs.
: Run R and Scala spark-submit jobs on your Databricks cluster.
: Access DBFS in Databricks for non-sensitive data.
: Resolve errors in the Databricks Spark configuration.
Databricks Spark integration configuration: This guide describes the design and components of the integration.
Security and compliance: This guide provides an overview of the Immuta features that provide security for your users and Databricks clusters and that allow you to prove compliance and monitor for anomalies.
Registering and protecting data: This guide provides an overview of registering Databricks securables and protecting them with Immuta policies.
Accessing data: This guide provides an overview of how Databricks users access data registered in Immuta.
This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the Use Project UDFs (Databricks) page.
Lower the web service cache timeout in Immuta:
Click the App Settings icon and scroll to the HDFS Cache Settings section.
Lower the Cache TTL of HDFS user names (ms) to 0.
Click Save.
Raise the cache timeout on your Databricks cluster: In the Spark environment variables section, set the IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS to high values (like 10000).
Note: These caches will be invalidated on cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on cluster to avoid periodically reaching out to the web service.
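The two cluster-side settings above can be written out as Spark environment variables. The following is a minimal sketch, assuming the example value of 10000 mentioned above; the helper name project_cache_env is hypothetical and only renders text you could paste into the Spark environment variables section.

```python
# Hypothetical helper (not part of Immuta): renders the two cache timeout
# variables named above as KEY=VALUE lines for the cluster configuration.
def project_cache_env(timeout_seconds: int = 10000) -> str:
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    names = (
        "IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS",
        "IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS",
    )
    return "\n".join(f"{name}={timeout_seconds}" for name in names)


print(project_cache_env())
```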
When the Databricks Spark plugin is running on a Databricks cluster, all Databricks users running jobs or queries are either a privileged user or a non-privileged user:
Privileged users: Privileged users can effectively read from and write to any table or view in the cluster Metastore, or any file path accessible by the cluster, without restriction. Privileged users are either or users specified in . Any user writing queries or jobs impersonating another user is a non-privileged user, even if they are impersonating a privileged user.
Privileged users have effective authority to read from and write to any securable in the cluster metastore or file path, because in almost all cases Databricks clusters running with the Immuta Spark plug-in installed have disabled . However, if Hive metastore table access control is enabled on the cluster, privileged users will have the authority granted to them that is specified by table access control.
In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks libraries API.
In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI:
Whether a user is privileged or non-privileged for a given query or job is cached once it is first determined, based on the IMMUTA_SPARK_ACL_PRIVILEGED_TIMEOUT_SECONDS environment variable. This caching can be disabled entirely by setting that environment variable to 0.
Usernames in Databricks must match the usernames in the connected Immuta tenant. By default, the Immuta Spark plugin checks the Databricks username against the username within Immuta's internal IAM to determine access. However, you can integrate your existing IAM with Immuta and use that instead of the default internal IAM. Ideally, you should use the same identity manager for Immuta that you use for Databricks. See the Immuta support matrix page for a list of supported identity providers and protocols.
Within Immuta, multiple users can share the same username if they exist in different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, set the IMMUTA_USER_MAPPING_IAMID Spark environment variable to the targeted IAM ID configured within the Immuta tenant. The targeted IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.
Databricks user impersonation allows a Databricks user to impersonate an Immuta user. With this feature,
the Immuta user who is being impersonated does not have to have a Databricks account, but they must have an Immuta account.
the Databricks user who is impersonating an Immuta user does not have to be associated with Immuta. For example, this could be a service account.
When acting under impersonation, the Databricks user loses their privileged access, so they can only access the tables the Immuta user has access to and only perform DDL commands when that user is acting under an allowed circumstance (such as workspaces, scratch paths, or non-Immuta reads/writes).
Use the IMMUTA_SPARK_DATABRICKS_ALLOWED_IMPERSONATION_USERS Spark environment variable to enable user impersonation.
Scala clusters
Immuta discourages use of this feature with Scala clusters, as the proper security mechanisms were not built to account for user isolation limitations in Scala clusters. Instead, this feature was developed for the BI tool use case in which service accounts connecting to the Databricks cluster need to impersonate Immuta users so that policies can be enforced.
This page provides guidelines for troubleshooting issues with the Databricks Spark integration and resolving Py4J security and Databricks trusted library errors.
For easier debugging of the Databricks Spark integration, follow the recommendations below.
Enable cluster init script logging:
In the cluster page in Databricks for the target cluster, navigate to Advanced Options -> Logging.
Change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.
View the Spark UI on your target Databricks cluster: On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
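If you manage clusters through the Databricks Clusters API rather than the UI, the same logging destination is expressed with the cluster_log_conf field. The following is a minimal sketch, assuming a hypothetical DBFS path and cluster ID. It only builds the JSON fragment and the resulting log location (the destination with the cluster ID appended, as noted above) and does not call any API.

```python
import json


def cluster_log_conf(destination: str) -> dict:
    """Build the cluster_log_conf fragment for a DBFS log destination."""
    if not destination.startswith("dbfs:/"):
        raise ValueError("expected a dbfs:/ destination")
    return {"cluster_log_conf": {"dbfs": {"destination": destination.rstrip("/")}}}


def delivered_log_path(destination: str, cluster_id: str) -> str:
    """Return the configured path with the unique cluster ID appended."""
    return f"{destination.rstrip('/')}/{cluster_id}"


# Both the path and the cluster ID below are hypothetical placeholders.
print(json.dumps(cluster_log_conf("dbfs:/cluster-logs/immuta"), indent=2))
print(delivered_log_path("dbfs:/cluster-logs/immuta", "0123-456789-abcde"))
```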
The validation and debugging notebook is designed to be used by or under the guidance of an Immuta support professional. Reach out to your Immuta representative for assistance.
Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
Click the arrow next to your name and select Import.
Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.
Error Message: py4j.security.Py4JSecurityException: Constructor <> is not allowlisted
Explanation: This error indicates you are being blocked by Py4J security rather than the Immuta Security Manager. Py4J security is strict and generally ends up blocking many ML libraries.
Solution: Turn off Py4J security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4J security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.
Check the driver logs for details. Some possible causes of failure include
One of the Immuta-configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.
For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version.
Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.
For Maven artifacts, the URI is maven:/<maven_coordinates>, where <maven_coordinates> is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. Here's an example of an installed artifact:
In this example, you would add the following Spark environment variable:
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644

For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. Here's an example of an installed artifact:
In this example, you would add the following Spark environment variable:
Once you've finished making your changes, restart the cluster.
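The URI rules above (maven:/ plus the Coordinates field for Maven artifacts, and the Source field as-is for jar artifacts) can be sketched as a small Python helper. The function names are hypothetical, and the jar path in the usage lines is a placeholder rather than a real artifact.

```python
ENV_NAME = "IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS"


def maven_uri(coordinates: str) -> str:
    """Prefix the Coordinates field from the Libraries tab with maven:/."""
    return f"maven:/{coordinates.strip()}"


def trusted_lib_env(uri: str) -> str:
    """Render the Spark environment variable line for one trusted library URI."""
    return f"{ENV_NAME}={uri}"


# Maven artifact: coordinates follow the group.id:artifact-id:version format.
print(trusted_lib_env(maven_uri("group.id:artifact-id:version")))
# Jar artifact: pass the Source field unchanged (placeholder path).
print(trusted_lib_env("dbfs:/path/to/library.jar"))
```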
Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:
If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.
Log in to Immuta as an Application Admin.
Click the App Settings icon in the navigation menu and scroll to the Integration Settings section.
Your existing Databricks Spark integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.
Click Add Integration and select Databricks Integration to add a new integration.
Enter your Databricks Spark integration settings again as configured previously.
Click Add Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.
Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.
Automatically push cluster policies and the init script (recommended) or manually update your cluster policies.
Automatically push cluster policies
Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.
Select
Restart any Databricks clusters using these updated policies for the changes to take effect.
When using Delta Lake, the API does not go through the normal Spark execution path. This means that Immuta's Spark extensions do not provide protection for the API. To solve this issue and ensure that Immuta has control over what a user can access, the Delta Lake API is blocked.
Spark SQL can be used instead to give the same functionality with all of Immuta's data protections.
Below is a table of the Delta Lake API with the Spark SQL that may be used instead.
See here for a complete list of the .
When a table is created in a project workspace, you can merge a different Immuta data source from that workspace into that table you created.
Create a temporary view of the Immuta data source you want to merge into that table.
Use that temporary view as the data source you add to the project workspace.
Run the following command:
The Databricks Spark integration is one of two integrations Immuta offers for Databricks.
In this integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.
The reference guides in this section are written for Databricks administrators who are responsible for setting up the integration, securing Databricks clusters, and setting up users:
Installation and compliance: This guide includes information about what Immuta creates in your Databricks environment and securing your Databricks clusters.
: Consult this guide for information about customizing the Databricks Spark integration settings.
: Consult this guide for information about connecting data users and setting up user impersonation.
: This guide provides a list of Spark environment variables used to configure the integration.
: This guide describes ephemeral overrides and how to configure them to reduce the risk that a user has overrides set to a cluster (or multiple clusters) that aren't currently up.
This page outlines how to enable access to DBFS in Databricks for non-sensitive data. Databricks administrators should place the desired configuration in the Spark environment variables.
This Databricks feature mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.
In the context of the Databricks Spark integration, Immuta uses the term ephemeral to describe data sources where the associated compute resources can vary over time. This means that the compute bound to these data sources is not fixed and can change. All Databricks data sources in Immuta are ephemeral.
Ephemeral overrides are specific to each data source and user. They effectively bind cluster compute resources to a data source for a given user. Immuta uses these overrides to determine which cluster compute to use when connecting to Databricks for various maintenance operations.
The operations that use the ephemeral overrides include
Visibility checks on the data source for a particular user. These checks assess how to apply row-level policies for specific users.
Stats collection triggered by a specific user
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar

TrustedLibraryUtils: Successfully found all configured Immuta configured trusted libraries in Databricks.
TrustedLibraryUtils: Wrote trusted libs file to [/databricks/immuta/immutaTrustedLibs.json]: true.
TrustedLibraryUtils: Added trusted libs file with 1 entries to spark context.
TrustedLibraryUtils: Trusted library installation complete.

Click Save and Confirm to deploy your changes.
Manually update cluster policies
Download the init script and the new cluster policies to your local computer.
Click Save and Confirm to save your changes in Immuta.
Log in to your Databricks workspace with your administrator account to set up cluster policies.
Find the path where you will upload the init script (immuta_cluster_init_script_proxy.sh): open one of the cluster policy .json files and look for the defaultValue of the field init_scripts.0.dbfs.destination. This should be a DBFS path in the form dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh.
Click Data in the left pane to upload your init script to DBFS to the path you found above.
To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.
Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.
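Looking up the init script destination in a downloaded policy file can be sketched in Python. This is a minimal sketch that assumes the policy file is a JSON object keyed by attribute path, with the init_scripts.0.dbfs.destination entry carrying a defaultValue as described above. The sample policy is illustrative and is not a real Immuta policy.

```python
import json

FIELD = "init_scripts.0.dbfs.destination"


def init_script_destination(policy_json: str) -> str:
    """Return the DBFS path the init script should be uploaded to."""
    policy = json.loads(policy_json)
    try:
        return policy[FIELD]["defaultValue"]
    except KeyError as exc:
        raise KeyError(f"{FIELD} defaultValue not found in policy") from exc


# Illustrative policy fragment; a real file contains many more attributes.
sample_policy = json.dumps({
    FIELD: {
        "type": "unlimited",
        "defaultValue": "dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh",
    }
})
print(init_script_destination(sample_policy))
```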
Validating a custom WHERE clause policy against a data source. When owners or governors create custom WHERE clause policies, Immuta uses compute resources to validate the SQL in the policy. In this case, the ephemeral overrides for the user writing the policy are used to contact a cluster for SQL validation.
High cardinality column detection. Certain advanced policy types (e.g., minimization) in Immuta require a high cardinality column, and that column is computed on data source creation. It can be recomputed on demand and, if so, will use the ephemeral overrides for the user requesting computation.
An ephemeral override request can be triggered when a user queries the securable corresponding to a data source in a Databricks cluster with the Spark plug-in configured. The actual triggering of this request depends on the configuration settings.
Ephemeral overrides can also be set for a data source in the Immuta UI by navigating to a data source page, clicking on the data source actions button, and selecting Ephemeral overrides from the dropdown menu.
Ephemeral override requests made from a cluster for data sources and users where ephemeral overrides were set in the UI will not be successful.
If ephemeral overrides are never set (either through the user interface or the cluster configuration), the system will continue to use the connection details directly associated with the data source, which are set during data source registration.
Ephemeral overrides can be problematic in environments that have a dedicated cluster to handle maintenance activities, since ephemeral overrides can cause these operations to execute on a different cluster than the dedicated one.
To reduce the risk that a user has overrides set to a cluster (or multiple clusters) that aren't currently up, complete one of the following actions:
Direct all clusters' HTTP paths for overrides to a cluster dedicated for metadata queries using the IMMUTA_EPHEMERAL_HOST_OVERRIDE_HTTPPATH Spark environment variable.
Disable ephemeral overrides completely by setting the IMMUTA_EPHEMERAL_HOST_OVERRIDE Spark environment variable to false.
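The two options above are alternatives, which the following Python sketch makes explicit. The helper name and the HTTP path value are hypothetical; only the two environment variable names come from this page.

```python
from typing import Dict, Optional


def ephemeral_override_env(dedicated_http_path: Optional[str] = None) -> Dict[str, str]:
    """Pick one of the two mitigations as Spark environment variables.

    With a path: point all overrides at a dedicated metadata cluster.
    Without a path: disable ephemeral overrides entirely.
    """
    if dedicated_http_path:
        return {"IMMUTA_EPHEMERAL_HOST_OVERRIDE_HTTPPATH": dedicated_http_path}
    return {"IMMUTA_EPHEMERAL_HOST_OVERRIDE": "false"}


# Hypothetical HTTP path of a cluster dedicated to metadata queries.
print(ephemeral_override_env("sql/protocolv1/o/0/0123-456789-abcde"))
print(ephemeral_override_env())
```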


Delta Lake API: Spark SQL
DeltaTable.convertToDelta: CONVERT TO DELTA parquet.`/path/to/parquet/`
DeltaTable.delete: DELETE FROM [table_identifier | delta.`/path/to/delta/`] WHERE condition
DeltaTable.generate: GENERATE symlink_format_manifest FOR TABLE [table_identifier | delta.`/path/to/delta`]
DeltaTable.history: DESCRIBE HISTORY [table_identifier | delta.`/path/to/delta`] (LIMIT x)
DeltaTable.merge: MERGE INTO
DeltaTable.update: UPDATE [table_identifier | delta.`/path/to/delta/`] SET column = value WHERE (condition)
DeltaTable.vacuum: VACUUM [table_identifier | delta.`/path/to/delta`]
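A few of the Spark SQL replacements listed above can be assembled as strings and passed to spark.sql in a notebook. The following is a minimal sketch with hypothetical helper names. It only builds the statements, so it runs without a cluster, and it assumes the table name, path, and condition come from trusted input because they are interpolated directly.

```python
from typing import Optional


def delta_target(table: Optional[str] = None, path: Optional[str] = None) -> str:
    """Return either a table identifier or a delta.`/path` reference."""
    if (table is None) == (path is None):
        raise ValueError("provide exactly one of table or path")
    return table if table is not None else f"delta.`{path}`"


def delete_sql(condition: str, table: Optional[str] = None, path: Optional[str] = None) -> str:
    # Replacement for DeltaTable.delete
    return f"DELETE FROM {delta_target(table, path)} WHERE {condition}"


def history_sql(table: Optional[str] = None, path: Optional[str] = None,
                limit: Optional[int] = None) -> str:
    # Replacement for DeltaTable.history
    statement = f"DESCRIBE HISTORY {delta_target(table, path)}"
    return f"{statement} LIMIT {limit}" if limit is not None else statement


def vacuum_sql(table: Optional[str] = None, path: Optional[str] = None) -> str:
    # Replacement for DeltaTable.vacuum
    return f"VACUUM {delta_target(table, path)}"


# In a notebook you would run, for example: spark.sql(history_sql(path="/path/to/delta", limit=5))
print(delete_sql("id = 1", path="/path/to/delta/"))
print(history_sql(path="/path/to/delta", limit=5))
```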
DBFS FUSE mount limitation: This feature cannot be used in environments with E2 Private Link enabled.
For example,
In Python,
Note: This solution also works in R and Scala.
To enable the DBFS FUSE mount, set this configuration in the Spark environment variables: IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED=true.
Scratch paths will work when performing arbitrary remote filesystem operations with fs magic or Scala dbutils.fs functions. For example,
To support %fs magic and Scala DBUtils with scratch paths, configure
To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.
This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.
Get the file from remote storage:
Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:
Write the new file back to remote storage:
MERGE INTO delta_native.target_native as target
USING immuta_temp_view_data_source as source
ON target.dr_number = source.dr_number
WHEN MATCHED THEN
UPDATE SET target.date_reported = source.date_reported

dbutils.fs.cp(s3ScratchFile, "file://{}".format(localScratchFile))

shutil.copy(localScratchFile, localScratchFileCopy)
with open(localScratchFileCopy, "a") as f:
    f.write("Some appended file content")

dbutils.fs.cp("file://{}".format(localScratchFileCopy), s3ScratchFile)

%sh echo "I'm creating a new file in DBFS" > /dbfs/my/newfile.txt

%python
with open("/dbfs/my/newfile.txt", "w") as f:
    f.write("I'm creating a new file in DBFS")

%fs put -f s3://my-bucket/my/scratch/path/mynewfile.txt "I'm creating a new file in S3"
%scala dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")

<property>
<name>immuta.spark.databricks.scratch.paths</name>
<value>s3://my-bucket/my/scratch/path</value>
</property>

%python
import os
import shutil
s3ScratchFile = "s3://some-bucket/path/to/scratch/file"
localScratchDir = os.environ.get("IMMUTA_LOCAL_SCRATCH_DIR")
localScratchFile = "{}/myfile.txt".format(localScratchDir)
localScratchFileCopy = "{}/myfile_copy.txt".format(localScratchDir)

Before you can run spark-submit jobs on Databricks, complete the following steps.
Initialize the Spark session:
Enter these settings into the R submit script to allow the R script to access Immuta data sources, scratch paths, and workspace tables: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
Once the script is written, upload it to a location in DBFS, S3, or ABFS that the Databricks cluster can access.
Because of how some user properties are populated in Databricks, load the SparkR library in a separate cell before attempting to use any SparkR functions.
To create the R spark-submit job,
Go to the Databricks jobs page.
Create a new job, and select Configure spark-submit.
Set up the parameters:
Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to be a .
Configure the section as you normally would for an Immuta cluster.
Before you can run spark-submit jobs on Databricks you must initialize the Spark session with the settings outlined below.
Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.
The spark-submit job needs to be launched using a different classloader that points at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:
To create the Scala spark-submit job,
Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.
Select Configure spark-submit, and configure the parameters:
Note: In the --class parameter, provide the fully qualified name of the class whose main function will be used as the entry point for your code.
Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks) assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to a .
Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" (where dbfs:/path/to/code.jar is the path to your jar) so that the jar is uploaded to all the cluster nodes.
The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
Privileged users (Databricks admins and allowlisted users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.
There is an option of using the immuta.api.key setting with an Immuta API key generated on the Immuta profile page.
Currently, generating an API key invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set immuta.api.key on all the clusters, or use a specified job user for the submit job.
Immuta offers several features to provide security for your users and Databricks clusters and to prove compliance and monitor for anomalies.
Immuta supports the following authentication methods to configure the Databricks Spark integration and register data sources:
OAuth machine-to-machine (M2M): Immuta uses the to integrate with , which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal’s identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the for more details.
Personal access token (PAT): This token gives Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace when configuring the integration or to register securables as Immuta data sources.
The built-in Immuta IAM can be used as a complete solution for authentication and fine-grained user entitlement. However, you can connect your existing identity management provider to Immuta to use that system for authentication and fine-grained user entitlement instead.
Each of the supported identity providers includes a specific set of configuration options that enable Immuta to communicate with the IAM system and map the users, permissions, groups, and attributes into Immuta.
See the for a list of supported providers and details.
See the for details and instructions on mapping Databricks user accounts to Immuta.
See the and the guides for more information about transmission of policy decision data, encryption of data in transit and at rest, and encryption key management.
Non-administrator users on an Immuta-enabled Databricks cluster must not have access to view or modify Immuta configuration, as this would create a loophole around Immuta policy enforcement. Databricks secrets allow you to securely apply environment variables to Immuta-enabled clusters.
Databricks secrets can be used in the environment variables configuration section for a cluster by referencing the secret path instead of the actual value of the environment variable.
See the for details and instructions on using Databricks secrets.
There are limitations to isolation among users in Scala jobs on a Databricks cluster. When data is broadcast, cached (spilled to disk), or otherwise saved to SPARK_LOCAL_DIR, it is impossible to distinguish which user’s data is contained in each file or block. To address this vulnerability, Immuta suggests that you
limit Scala clusters to Scala jobs only and
require equalized projects, which will force all users to act under the same set of attributes, groups, and purposes with respect to their data access. This requirement guarantees that data being dropped into SPARK_LOCAL_DIR will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached/spilled data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.
See the for more details and configuration instructions.
Immuta provides auditing features and governance reports so that data owners and governors can monitor users' access to data and detect anomalies in behavior.
You can view the information in these audit logs on or export the full audit logs to S3 and ADLS for long-term backup and processing with log data processors and tools. This capability fosters convenient integrations with log monitoring services and data pipelines.
See the for details about these capabilities and how they work with the Databricks Spark integration.
Immuta captures the code or query that triggers the Spark plan in Databricks, making audit records more useful in assessing what users are doing.
To audit what triggers the Spark plan, Immuta hooks into Databricks where notebook cells and JDBC queries execute and saves the cell or query text. Then, Immuta pulls this information into the audits of the resulting Spark jobs.
Immuta will audit queries that come from interactive notebooks, notebook jobs, and JDBC connections, but will not audit . Furthermore, Immuta only audits Spark jobs that are associated with Immuta tables. Consequently, Immuta will not audit a query in a notebook cell that does not trigger a Spark job, unless is set to true.
See the page for examples of saved queries and the resulting audit records. To exclude query text from audit events, see the .
Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.
See the for details and instructions.
When a query is run by a user impersonating another user, the extra.impersonationUser field in the audit log payload is populated with the Databricks username of the user impersonating another user. The userId field will return the Immuta username of the user being impersonated:
See the for details about user impersonation.
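As a rough illustration of how the two fields described above might be read from an exported audit record, the sketch below inspects a dictionary shaped like the audit payload. Only `userId` and `extra.impersonationUser` come from this page; the function name and the sample record are hypothetical.

```python
from typing import Optional


def describe_impersonation(audit_record: dict) -> Optional[str]:
    """Return a short summary if the query ran under impersonation, else None."""
    # extra.impersonationUser holds the Databricks username of the impersonating user.
    impersonator = audit_record.get("extra", {}).get("impersonationUser")
    if not impersonator:
        return None
    # userId holds the Immuta username of the user being impersonated.
    return f"{impersonator} queried as {audit_record.get('userId')}"


# Hypothetical record; real audit payloads contain many more fields.
record = {
    "id": "query-example",
    "userId": "<immuta_username>",
    "extra": {"impersonationUser": "<databricks_username>"},
}
print(describe_impersonation(record))
```

A record without `extra.impersonationUser` returns `None`, which makes it easy to filter an export down to impersonated queries.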
Immuta governance reports allow users with the GOVERNANCE Immuta permission to use a natural language builder to instantly create reports that delineate user activity across Immuta. These reports can be based on various entity types, including users, groups, projects, data sources, purposes, policy types, or connection types.
See the for a list of report types and guidance.
Once a Databricks securable is registered in Immuta as a data source and you are subscribed to that data source, you must access that data through SQL:
Python:
df = spark.sql("select * from immuta.table")
Scala:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
val sqlDF = spark.sql("SELECT * FROM immuta.table")
SQL:
%sql
select * from immuta.table
R:
library(SparkR)
df <- SparkR::sql("SELECT * from immuta.table")
With R, you must load the SparkR library in a cell before accessing the data.
See the sections below for more guidance on accessing data using Delta Lake, direct file reads in Spark for file paths, and user impersonation.
When using Delta Lake, the API does not go through the normal Spark execution path. This means that Immuta's Spark extensions do not provide protection for the API. To solve this issue and ensure that Immuta has control over what a user can access, the Delta Lake API is blocked.
Spark SQL can be used instead to give the same functionality with all of Immuta's data protections. See the for a list of corresponding Spark SQL calls to use.
In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to interact with their data using file paths or who have existing workflows revolving around file paths can continue to use these workflows without rewriting those queries for Immuta.
When reading from a path in Spark, the Immuta Databricks Spark plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.
Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). Expand the blocks below to view examples of reading data using these methods.
Direct file reads for Immuta data sources only apply to data sources created from tables, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
In Databricks, multiple input paths are supported as long as they belong to the same data source.
CSV-backed tables are not currently supported.
User impersonation allows Databricks users to query data as another Immuta user. To impersonate another user, see the .
[
"--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
"--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
"--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
"dbfs:/path/to/script.R",
"arg1", "arg2", "..."
]
package com.example.job

import java.net.URLClassLoader
import java.io.File
import org.apache.spark.sql.SparkSession

object ImmutaSparkSubmitExample {
  def main(args: Array[String]): Unit = {
    val jarDir = new File("/databricks/immuta/jars/")
    val urls = jarDir.listFiles.map(_.toURI.toURL)
    // Configure a new ClassLoader which will load jars from the additional jars directory
    val cl = new URLClassLoader(urls)
    val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
    val job = jobClass.newInstance
    jobClass.getMethod("runJob").invoke(job)
  }
}

class ImmutaSparkSubmitExample {
  def getSparkSession(): SparkSession = {
    SparkSession.builder()
      .appName("Example Spark Submit")
      .enableHiveSupport()
      .config("immuta.spark.acl.assume.not.privileged", "true")
      .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
      .getOrCreate()
  }

  def runJob(): Unit = {
    val spark = getSparkSession
    try {
      val df = spark.table("immuta.<YOUR DATASOURCE>")
      // Run Immuta Spark queries...
    } finally {
      spark.stop()
    }
  }
}
[
"--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
"--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
"--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
"--class","org.youorg.package.MainClass",
"dbfs:/path/to/code.jar",
"arg1", "arg2", "..."
]
Read data from an individual parquet file in a sub-directory:
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
Read partitioned data from a sub-directory:
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01")
Read partitioned data by using a where predicate:
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table").where("partition_column=01")
Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:
# Not recommended by Spark and not supported in Immuta
spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table/partition_column=01")
# Recommended by Spark and supported in Immuta.
spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table").where("partition_column=01")
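For existing workflows that still point at a Delta partition sub-directory, a small helper can turn such a path into the table path plus a where predicate. This is a minimal sketch: the function name and regular expression are hypothetical, and it assumes a single trailing Hive-style `column=value` directory.

```python
import re
from typing import Tuple

# Matches a trailing Hive-style partition directory such as ".../partition_column=01".
_PARTITION_DIR = re.compile(r"^(?P<base>.+)/(?P<column>[^/=]+)=(?P<value>[^/=]+)/?$")


def split_partition_path(path: str) -> Tuple[str, str]:
    """Split a partition sub-directory path into (table_path, where_predicate)."""
    match = _PARTITION_DIR.match(path)
    if match is None:
        raise ValueError(f"no trailing partition directory in {path!r}")
    predicate = f"{match.group('column')}={match.group('value')}"
    return match.group("base"), predicate


table_path, predicate = split_partition_path(
    "s3:/my_bucket/path/to/my_delta_table/partition_column=01"
)
# On an Immuta-enabled cluster, the supported read would then be:
# spark.read.format("delta").load(table_path).where(predicate)
```

Only the last path segment is converted; nested partitions would need the helper to be applied repeatedly or extended.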
{
  "id": "query-a20e-493e-id-c1ada0a23a26",
  [...]
  "userId": "<immuta_username>",
  [...]
  "extra": {
    [...]
    "impersonationUser": "<databricks_username>"
  }
  [...]
}
A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)
A cluster that uses one of these supported Databricks Runtimes:
11.3 LTS
14.3 LTS
Supported languages
Python
R (not supported for Databricks Runtime 14.3 LTS)
Scala (not supported for Databricks Runtime 14.3 LTS)
SQL
A Databricks cluster that is one of these supported compute types:
Custom access mode
A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service also must be able to connect to and perform queries on the Databricks cluster, and to call .
Enable OAuth M2M authentication (recommended) or personal access tokens.
Disable Photon by setting runtime_engine to STANDARD using the Clusters API. Immuta does not support clusters with Photon enabled. Photon is enabled by default on compute running Databricks Runtime 9.1 LTS or newer and must be manually disabled before setting up the integration with Immuta.
Restrict the set of Databricks principals who have CAN MANAGE privileges on Databricks clusters where the Spark plugin is installed. This is to prevent editing , editing cluster policies, or removing the Spark plugin from the cluster, all of which would cause the Spark plugin to stop working.
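The Photon requirement above can be sketched as a change to a cluster specification before it is submitted to the Databricks Clusters API. The helper name and the sample specification are hypothetical placeholders; only the `runtime_engine` field and its `STANDARD` value come from this page, and the HTTP call itself is deliberately left out.

```python
import copy


def with_photon_disabled(cluster_spec: dict) -> dict:
    """Return a copy of a cluster specification with Photon turned off."""
    updated = copy.deepcopy(cluster_spec)
    # Immuta does not support Photon, so force the standard runtime engine.
    updated["runtime_engine"] = "STANDARD"
    return updated


# Hypothetical cluster specification; the ID and node type are placeholders.
spec = {
    "cluster_id": "<cluster-id>",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "<node-type>",
    "runtime_engine": "PHOTON",
}
payload = with_photon_disabled(spec)
```

The resulting `payload` is what would be sent as the request body when editing the cluster; the original `spec` is left unchanged.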
If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster. See the section below for guidance.
If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:
Navigate to the App Settings page and click Integration Settings.
Uncheck the Enable Unity Catalog checkbox.
Click Save.
Click the App Settings icon in Immuta.
Navigate to HDFS > System API Key and click Generate Key.
Click Save and then Confirm. If you do not save and confirm, the system API key will not be saved.
Scroll to the Integration Settings section.
Click + Add Native Integration and select Databricks Spark Integration from the dropdown menu.
Complete the Hostname field.
Enter a Unique ID for the integration. The unique ID is used to name cluster policies clearly, which is important when managing several Databricks Spark integrations. As cluster policies are workspace-scoped, but multiple integrations might be made in one workspace, this ID lets you distinguish between different sets of cluster policies.
Select the identity manager that should be used when mapping the current Spark user to their corresponding identity in Immuta from the Immuta IAM dropdown menu. This should be set to reflect the identity manager you use in Immuta (such as Entra ID or Okta).
Choose an Access Model. The Protected until made available by policy option disallows reading and writing tables not protected by Immuta, whereas the Available until protected by policy option allows it.
Behavior change
If a table is registered in Immuta and does not have a subscription policy applied to it, that data will be visible to users, even if the Protected until made available by policy setting is enabled.
If you have enabled this setting, author an "Allow individually selected users" global subscription policy that applies to all data sources.
Select the Storage Access Type from the dropdown menu.
Opt to add any Additional Hadoop Configuration Files.
Click Add Native Integration, and then click Save and Confirm. This will restart the application and save your Databricks Spark integration. (It is normal for this restart to take some time.)
The Databricks Spark integration will not do anything until your cluster policies are configured, so even though your integration is saved, continue to the next section to configure your cluster policies so the Spark plugin can manage authorization on the Databricks cluster.
Click Configure Cluster Policies.
Select one or more cluster policies in the matrix. Clusters running Immuta with Databricks Runtime 14.3 can only use Python and SQL. You can make changes to the policy by clicking Additional Policy Changes and editing the environment variables in the text field or by downloading it. See the Spark environment variables reference guide for information about each variable and its default value. Some common settings are linked below:
(you can also )
Select your Databricks Runtime.
Use one of the two installation types described below to apply the policies to your cluster:
Automatically push cluster policies: This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.
Select the Automatically Push Cluster Policies radio button.
Click Close, and then click Save and Confirm.
Apply the cluster policy generated by Immuta to the cluster with the Spark plugin installed by following the .
Give users the Can Attach To permission on the cluster.
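When editing the policy under Additional Policy Changes, each Spark environment variable is expressed as a `spark_env_vars.<NAME>` entry with a fixed value. The sketch below generates such entries; the helper name is hypothetical and the chosen values are only examples, while the variable names are the ones documented in the Spark environment variables reference.

```python
import json


def fixed_env_var(name: str, value: str) -> dict:
    """Build a cluster policy entry that pins a Spark environment variable."""
    return {f"spark_env_vars.{name}": {"type": "fixed", "value": value}}


policy_overrides = {}
# Example values only; choose settings that match your access model.
policy_overrides.update(fixed_env_var("IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS", "true"))
policy_overrides.update(fixed_env_var("IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES", "false"))
print(json.dumps(policy_overrides, indent=2))
```

The printed JSON fragment can be merged into the downloaded policy before it is applied to the workspace.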
In the Databricks Spark integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.
The sequence diagram below breaks down this process of events when an Immuta user queries data in Databricks.
When data owners register Databricks securables in Immuta, the securable metadata is registered and Immuta creates a corresponding data source for those securables. The data source metadata is stored in the Immuta Metadata Database so that it can be referenced in policy definitions.
The image below illustrates what happens when a data owner registers the Accounts, Claims, and Customers securables in Immuta.
Users who are subscribed to the data source in Immuta can then query the corresponding securable directly in their Databricks notebook or workspace.
See the for details about the authentication methods supported for registering data.
When schema monitoring is enabled, Immuta monitors your servers to detect when new tables or columns are created or deleted, and automatically registers (or disables) those tables in Immuta. These newly updated data sources will then have any global policies and tags that are set in Immuta applied to them. The Immuta data dictionary will be updated with any column changes, and the Immuta environment will be in sync with your data environment.
For Databricks Spark, the automatic is disabled because of the . In this case, Immuta requires you to download a schema detection job template (a Python script) and import that into your Databricks workspace.
See the for instructions on enabling schema monitoring.
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta. Consequently, subsequent metadata operations for that user will use the current cluster as compute.
See the for more details about ephemeral overrides and how to configure or disable them.
The Spark plugin has the capability to send ephemeral override requests to Immuta. These requests are distinct from ephemeral overrides themselves. Ephemeral overrides cannot be turned off, but the Spark plugin can be configured to not send ephemeral override requests.
Tags can be used in Immuta in a variety of ways:
Use tags for global subscription or data policies that will apply to all data sources in the organization. In doing this, company-wide data security restrictions can be controlled by the administrators and governors, while the users and data owners need only to worry about tagging the data correctly.
Generate Immuta reports from tags for insider threat surveillance or data access monitoring.
Filter search results with tags in the Immuta UI.
The Databricks Spark integration cannot ingest tags from Databricks, but you can connect any of these to work with your integration.
You can also manage tags in Immuta by to your data sources and columns. Alternatively, you can use to automatically tag your sensitive data.
Immuta allows you to author subscription and data policies to automate access controls on your Databricks data.
Subscription policies: After registering data sources in Immuta, you can control who has access to specific securables in Databricks through Immuta subscription policies or by . Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources. See the for a list of policy types supported.
Data policies: You can create data policies to apply fine-grained access controls (such as restricting rows or masking columns) to manage what users can see in each table after they are subscribed to a data source. See the for details about specific types of data policies supported.
The image below illustrates how Immuta enforces a subscription policy that only allows users in the Analysts group to access the yellow-table.
See the for details about the benefits of using Immuta subscription and data policies.
Once a Databricks user who is subscribed to the data source in Immuta queries the corresponding securable directly in their workspace, Spark Analysis initiates and the following events take place:
Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the Security Manager that it is a trusted node and is allowed to scan raw data.
The image below illustrates what happens when an Immuta user who is subscribed to the Customers data source queries the securable in Databricks.
Regardless of the policies on the data source, the users will be able to read raw data on the cluster if they meet one of the criteria listed below:
A Databricks administrator is tied to an Immuta account
A Databricks user is listed as an ignored user (Users can be specified in the to become ignored users.)
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.
Databricks non-admin users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. To address this challenge, Immuta allows administrators to change this default setting when configuring the integration so that Immuta users can access securables that are not registered as a data source. Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
See the for details about this setting.
Immuta projects combine users and data sources under a common purpose. Sometimes that purpose is simply for a single user to organize their data sources or to control an entire schema of data sources through a single project screen. Most often, however, it is an Immuta purpose for which the data has been approved; such a project restricts access to data and streamlines team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.
When a user is working within the context of a project, they will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose.
When users change project contexts (either through the Immuta UI or with ), queries reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.
See the for details about how to prevent users from switching project contexts in a session.
Users can have additional write access in their integration using project workspaces. Users can integrate a single workspace or multiple workspaces with a single Immuta tenant.
See the for more details.
This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks administrators should place the desired configuration in the Spark environment variables.
If you add additional Hadoop configuration during the integration setup, this variable sets the path to that file.
The additional Hadoop configuration is where sensitive configuration goes for remote filesystems (if you are using a secret key pair to access S3, for example).
Default value: true
Set this to false if ephemeral overrides should not be enabled for Spark. When true, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.
This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
Default value: true
When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.
A URI that points to a valid calling class file, which is an Immuta artifact you download during the process.
This is a comma-separated list of Databricks users who can access any table or view in the cluster metastore without restriction.
Default value: 3600
The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is allowlisted in IMMUTA_SPARK_ACL_ALLOWLIST.
Default value: false
Enables auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.
Default value: false
Allows non-privileged users to SELECT from tables that are not protected by Immuta. See the for details about this feature.
Default value: false
Allows non-privileged users to run DDL commands and data-modifying commands against tables or spaces that are not protected by Immuta. See the for details about this feature.
This is a comma-separated list of Databricks users who are allowed to impersonate Immuta users:
Default value: false
Exposes the DBFS FUSE mount located at /dbfs. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.
Block one or more Immuta user-defined functions (UDFs) from being used on an Immuta cluster. This should be a Java regular expression that matches the set of UDFs to block by name (excluding the immuta database). For example, to block all project UDFs, you may configure this to be ^.*_projects?$. For a list of functions, see the .
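To preview what the example expression would block, the pattern can be tried against a list of function names. The sketch uses Python's `re` module, which handles this particular pattern the same way as a Java regular expression; the function names are illustrative only and are not an authoritative list of Immuta UDFs.

```python
import re

# Pattern from the example above: names ending in "_project" or "_projects".
BLOCK_PROJECT_UDFS = re.compile(r"^.*_projects?$")

# Illustrative function names only.
candidates = ["set_current_project", "list_projects", "project_summary", "current_user"]
blocked = [name for name in candidates if BLOCK_PROJECT_UDFS.match(name)]
print(blocked)
```

Names that merely contain "project" elsewhere, such as `project_summary`, are not matched because the pattern anchors the suffix to the end of the name.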
Default value: file:///databricks/jars/immuta-spark-hive.jar
The location of immuta-spark-hive.jar on the filesystem for Databricks. This should not need to change unless a custom initialization script that places immuta-spark-hive in a non-standard location is necessary.
Default value: true
Creates a world-readable or writable scratch directory on local disk to facilitate the use of dbutils and 3rd party libraries that may write to local disk. Its location is non-configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR. Note: Sensitive data should not be stored at this location.
Default value: INFO
The SLF4J log level to apply to Immuta's Spark plugins.
Default value: false
If true, writes logging output to stdout/the console as well as the log4j-active.txt file (default in Databricks).
This configuration is a comma-separated list of additional databases that will appear as scratch databases when running a SHOW DATABASES query. This configuration increases performance by circumventing the Metastore to get the metadata for all the databases to determine what to display for a SHOW DATABASES query; it won't affect access to the scratch databases. Instead, use IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS to control read and write access to the underlying database paths.
Additionally, this configuration will only display the scratch databases that are configured and will not validate that the configured databases exist in the Metastore. Therefore, it is up to the Databricks administrator to properly set this value and keep it current.
Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db for the default location).
To create a scratch path to a location or a database stored at that location, configure
To create a scratch path to a database created using the default location,
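A scratch paths value is a comma-separated list that can mix explicit locations with databases created at the default warehouse location. The sketch below assembles such a value; the helper name is hypothetical, and it assumes the `dbfs:/user/hive/warehouse/<db_name>.db` convention described above.

```python
from typing import Iterable

# Default warehouse location used for databases created without an explicit path.
DEFAULT_WAREHOUSE = "dbfs:/user/hive/warehouse"


def scratch_paths_value(locations: Iterable[str], default_location_dbs: Iterable[str] = ()) -> str:
    """Join explicit locations and default-location databases into one setting value."""
    paths = list(locations)
    paths += [f"{DEFAULT_WAREHOUSE}/{db}.db" for db in default_location_dbs]
    return ",".join(paths)


value = scratch_paths_value(["s3://path/to/the/dir"], ["any_db_name"])
print(f"IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS={value}")
```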
Default value: false
Enables non-privileged users to create or drop scratch databases.
Default value: false
When true, this configuration prevents users from changing their impersonation user once it has been set for a given Spark session. This configuration should be set when the BI tool or other service allows users to submit arbitrary SQL or issue SET commands.
Default value: true
Denotes whether to run the Spark job that "tags" a Databricks cluster as being associated with Immuta.
A comma-separated list of URIs.
Default value: 3600
The number of seconds Immuta caches whether a table has been exposed as a data source in Immuta. This setting only applies when IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES or IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS is enabled.
Default value: false
Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages.
Default value: true
Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or allowlisted users can set IMMUTA_SPARK_RESOLVE_RAW_TABLES_ENABLED to false to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.
Default value: true
Same as the IMMUTA_SPARK_RESOLVE_RAW_TABLES_ENABLED variable, but this is a session property that allows users to toggle this functionality. If users run set immuta.spark.session.resolve.raw.tables.enabled=false, they will see raw data only (not Immuta data policy-enforced data). Note: This property is not set in immuta_conf.xml.
Default value: true
This shows the immuta database in the configured Databricks cluster. When set to false Immuta will no longer show this database when a SHOW DATABASES query is performed. However, queries can still be performed against tables in the immuta database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table) regardless of whether or not this feature is enabled.
Default value: true
Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to true, if versions are incompatible, that information will be logged to the Databricks driver logs and the cluster will not be usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), this check can be set to false so that the versions don't get logged as incompatible and make the cluster unusable.
Default value: bim
Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (bim) but should be updated to reflect an actual production IAM.
The Physical Plan is applied and filters out and transforms raw data coming back to the user.
The user sees policy-enforced data.




"spark_env_vars.IMMUTA_SPARK_DATABRICKS_ALLOWED_IMPERSONATION_USERS": {
"type": "fixed",
"value": "[email protected],[email protected]"
}
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=s3://path/to/the/dir
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=s3://path/to/the/dir,dbfs:/user/hive/warehouse/any_db_name.db
Click Apply Policies.
Manually push cluster policies: Enabling this option allows you to manually push the cluster policies and the init script to the configured Databricks workspace.
Select the Manually Push Cluster Policies radio button.
Click Download Init Script and set the Immuta plugin init script as a cluster-scoped init script in Databricks by following the Databricks documentation.
Click Download Policies, and then workspace.
Ensure that the init_scripts.0.workspace.destination in the policy matches the file path to the init script you configured above.
The Immuta cluster policy references Databricks Secrets for several of the sensitive fields. These secrets must be manually created if the cluster policy is not automatically pushed. Use Databricks API or CLI to push the proper secrets.
A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)
A cluster that uses one of these supported Databricks Runtimes:
11.3 LTS
14.3 LTS
For a comparison of features supported for both Databricks Runtimes, see the .
Supported languages
Python
R (not supported for Databricks Runtime 14.3 LTS)
Scala (not supported for Databricks Runtime 14.3 LTS)
A Databricks cluster that is one of these supported compute types:
Custom access mode
A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service also must be able to connect to and perform queries on the Databricks cluster, and to call .
When an administrator configures the Databricks Spark integration, Immuta generates a cluster policy that the administrator then applies to the Databricks cluster. When the cluster starts after the cluster policy has been applied, the init script that Immuta provides downloads the Spark plugin artifacts onto the cluster and puts them in the appropriate locations on local disk for use by Spark.
Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for authorization and policy enforcement.
Immuta adds the following artifacts to your Databricks environment:
Once the Immuta-enabled cluster is running, the following user actions spur various processes. The list below provides an overview of each process:
Data source is registered: When a data owner registers a Databricks securable as a data source, data source metadata (column type, securable name, column names, etc.) is retrieved from the Metastore and stored in the Immuta Metadata Database. If tags are then applied to the data source, Immuta stores this metadata in the Metadata Database as well.
Data source is deleted: When a data source is deleted, the data source metadata is deleted from the Metadata Database. Depending on the settings configured for the integration, users will either be able to query that data now that it is no longer registered in Immuta, or access to the securable will be revoked for all users. See the for details about this setting.
Policy is created or edited: Information about the policy and the columns or securables it applies to is stored in the Metadata Database. When a user queries the data in Databricks, the Spark plugin retrieves the policy information, the user metadata, and the data source metadata from the Metadata Database and injects this information as policy logic into the Spark logical plan. Immuta caches policy information and data source definitions in memory on the Spark application to reduce calls to the Metadata Database and boost performance.
The image below illustrates these processes and how they interact.
The Databricks Spark integration allows users to author subscription and data policies to enforce access controls. See the corresponding pages for details about specific types of policies supported:
Immuta supports clusters on Databricks Runtime 14.3. The integration for this Databricks Runtime differs from the integration for Databricks Runtime 11.3 in the following ways:
: The Security Manager is disabled for Databricks Runtime 14.3. Because the Security Manager is used to prevent users from circumventing access controls when using R and Scala, those languages are unsupported. Only Python and SQL clusters are supported.
Py4J security and process isolation automatically enabled: Immuta relies on Databricks process isolation and Py4J security to prevent user code from performing unauthorized actions. After selecting Runtime 14.3 during configuration, Immuta will automatically enable process isolation and Py4J security.
dbutils is unsupported: Immuta relies on Databricks process isolation and Py4J security to prevent user code from performing unauthorized actions. This means that dbutils is not supported for Databricks Spark integrations using Databricks Runtime 14.3 LTS.
The table below compares the features supported for clusters on Databricks Runtime 11.3 and Databricks Runtime 14.3.
The Databricks Spark integration supports the following authentication methods to configure the integration:
OAuth machine-to-machine (M2M): Immuta uses the to integrate with , which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal’s identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the for more details.
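The token lifecycle described above (reuse the temporary token until it expires, then request a new one) can be sketched as follows. This is a minimal illustration, not Immuta's implementation: the class name, the fetch callable, and the injected clock are all hypothetical, and the actual client-secret exchange with Databricks is deliberately left out.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical type: a function that performs the client-secret exchange and
# returns (access_token, lifetime_in_seconds). The real request to Databricks
# is not shown here.
TokenFetcher = Callable[[], Tuple[str, float]]


class CachedOAuthToken:
    """Reuses a temporary token until it expires, then requests a new one."""

    def __init__(self, fetch: TokenFetcher,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Request a new token only when nothing is cached or the cached
        # token has reached its expiry time.
        if self._token is None or self._clock() >= self._expires_at:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token
```

With a one-hour lifetime (3600 seconds), repeated calls to get() within that hour return the same token, and the first call after expiry triggers a new exchange.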
Personal access token (PAT): This token gives Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace when configuring the integration or to register securables as Immuta data sources.
Immuta captures the code or query that triggers the Spark plan in Databricks, making audit records more useful in assessing what users are doing. To audit what triggers the Spark plan, Immuta hooks into Databricks where notebook cells and JDBC queries execute and saves the cell or query text. Then, Immuta pulls this information into the audits of the resulting Spark jobs.
Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not. To configure Immuta to do so, set the in the Spark cluster configuration when configuring your integration.
See the for more details about the audit capabilities in the Databricks Spark integration.
Non-administrator users on an Immuta-enabled Databricks cluster must not have access to view or modify Immuta configuration or the immuta-spark-hive.jar file, because such access would create a loophole around Immuta policy enforcement. Databricks secrets allow you to securely apply environment variables to Immuta-enabled clusters.
Databricks secrets can be used in the environment variables configuration section for a cluster by referencing the secret path instead of the actual value of the environment variable. For example, if a user wanted to make the MY_SECRET_ENV_VAR=abcd_1234 value secret, they could instead create a Databricks secret and reference it as the value of that variable by following these steps:
Create the secret scope my_secrets and add a secret with the key my_secret_env_var containing the sensitive environment variable.
Reference the secret in the environment variables section as MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}.
At runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
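As a small sketch of the steps above, the snippet below builds the secret reference string for the example scope and key. The helper name secret_reference and the env_vars dictionary are illustrative only; the dictionary simply mirrors what would be typed into the environment variables section of the cluster.

```python
def secret_reference(scope: str, key: str) -> str:
    """Build the {{secrets/<scope>/<key>}} placeholder that is resolved at runtime."""
    return "{{secrets/" + scope + "/" + key + "}}"


# The environment variable holds a reference to the secret, never the
# sensitive value itself.
env_vars = {
    "MY_SECRET_ENV_VAR": secret_reference("my_secrets", "my_secret_env_var"),
}

for name, value in env_vars.items():
    print(f"{name}={value}")
```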
There are limitations to isolation among users in Scala jobs on a Databricks cluster, even when using Immuta’s Security Manager. When data is broadcast, cached (spilled to disk), or otherwise saved to SPARK_LOCAL_DIR, it is impossible to distinguish which user’s data is contained in each file or block. If you are concerned about this vulnerability, Immuta suggests that you
limit Scala clusters to Scala jobs only and
require equalized projects, which will force all users to act under the same set of attributes, groups, and purposes with respect to their data access. To require that Scala clusters be used in equalized projects and avoid the risk described above, set the IMMUTA_SPARK_REQUIRE_EQUALIZATION environment variable to true.
Once this configuration is complete, users on the cluster will need to switch to an Immuta equalized project before running a job. Once the first job is run using that equalized project, all subsequent jobs, no matter the user, must also be run under that same equalized project. If you need to change a cluster's project, you must restart the cluster.
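The pinning behavior described above can be pictured with a toy model. The class and method names are hypothetical and the model only mirrors the rule stated in the text (the first job fixes the project, later jobs must match it, and a restart clears it); it is not how Immuta enforces equalization.

```python
from typing import Optional


class EqualizedClusterModel:
    """Toy model: the first job pins the cluster to one equalized project."""

    def __init__(self) -> None:
        self.pinned_project: Optional[str] = None

    def run_job(self, user: str, project: str) -> str:
        if self.pinned_project is None:
            # The first job on the cluster decides the project for everyone.
            self.pinned_project = project
        elif project != self.pinned_project:
            raise PermissionError(
                f"cluster is pinned to {self.pinned_project!r}; "
                "restart the cluster to change projects"
            )
        return f"{user} ran a job under {project}"

    def restart(self) -> None:
        # Restarting the cluster is the only way to clear the pinned project.
        self.pinned_project = None
```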
When data is read in Spark using an Immuta policy-enforced plan, the masking and redaction of rows is performed at the leaf level of the physical Spark plan, so a policy such as "Mask using hashing the column social_security_number for everyone" would be implemented as an expression on a project node right above the FileSourceScanExec/LeafExec node at the bottom of the plan. This process prevents raw data from being shuffled in a Spark application and, consequently, from ending up in SPARK_LOCAL_DIR.
This policy implementation coupled with an equalized project guarantees that data being dropped into SPARK_LOCAL_DIR will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.
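To make the ordering concrete, here is a conceptual sketch in plain Python rather than Spark. The stage functions, the sample rows, and the choice of SHA-256 are assumptions made for illustration; the point is only that the masking step sits directly above the scan, so every later stage (including anything that might spill to disk) receives masked values.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def scan() -> Iterator[Row]:
    """Stand-in for the leaf scan node; yields raw rows (made-up sample data)."""
    yield {"name": "bob", "social_security_number": "987-65-4321"}
    yield {"name": "alice", "social_security_number": "123-45-6789"}


def mask_ssn(rows: Iterable[Row]) -> Iterator[Row]:
    """Stand-in for the project node placed right above the scan."""
    for row in rows:
        masked = dict(row)
        raw = row["social_security_number"].encode("utf-8")
        masked["social_security_number"] = hashlib.sha256(raw).hexdigest()
        yield masked


def shuffle(rows: Iterable[Row]) -> List[Row]:
    """Stand-in for a later stage that may cache or spill its input."""
    return sorted(rows, key=lambda r: r["name"])


# Raw values never reach the shuffle stage because masking happens first.
result = shuffle(mask_ssn(scan()))
```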
The has guidance for resolving issues with your installation.
You can customize the Databricks Spark integration settings using these components Immuta provides:
Immuta provides cluster policies that set the and configuration on your Databricks cluster once you apply that policy to your cluster. These policies generated by Immuta must be applied to your cluster manually. The includes instructions for generating and applying these cluster policies. Each cluster policy is described below.
The lists the various possible settings controlled by these variables that you can set in your cluster policy before attaching it to your cluster.
In some cases it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration to allow Spark to read data.
For example, when accessing external tables stored in Azure Data Lake Gen2, Spark must have credentials to access the target containers or filesystems in Azure Data Lake Gen2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access Azure Data Lake Gen2.
To use an additional Hadoop configuration file, set the to be the full URI to this file.
Databricks non-privileged users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. Immuta addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
Protected until made available by policy: This setting means that users can only see tables that Immuta has explicitly subscribed them to.
Behavior change
If a table is registered in Immuta and does not have a subscription policy applied to it, that data will be visible to users, even if the Protected until made available by policy setting is enabled.
If you have enabled this setting, author an "Allow individually selected users" subscription policy that applies to all data sources.
Available until protected by policy: This setting means all tables are open until explicitly registered and protected by Immuta. This setting allows both non-Immuta reads and non-Immuta writes:
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS: Immuta users with regular (non-privileged) Databricks roles may SELECT from tables that are not registered in Immuta. This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL. When non-Immuta reads are enabled through the cluster policy, Immuta users will see all databases and tables when they run show databases or show tables. However, this does not mean they will be able to query all of them.
The includes instructions for applying these settings to your cluster.
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta for all applicable data sources to use the current cluster as compute for all subsequent metadata operations for that user against the applicable data sources.
For more details about ephemeral overrides and how to configure or disable them, see the .
Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; however, most often this is an Immuta purpose for which the data has been approved to be used and will restrict access to data and streamline team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.
When users work within the context of a project, they will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose. Consider adjusting the following project settings to suit your organization's needs:
Project UDFs (web service and on-cluster caches): Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project. To allow use of project UDFs in Spark jobs, . Otherwise, caching could cause dissonance among the requests and calls to multiple web workers when users try to change their project contexts.
Preventing users from changing projects within a session: If your compliance requirements restrict users from changing projects within a session, you can block the use of Immuta's project UDFs on a Databricks Spark cluster. To do so, .
This section describes how Immuta interacts with common Databricks features.
Databricks users can see the Databricks change data feed (CDF) on queried tables if they are allowed to read raw data and meet specific qualifications. Immuta does not support applying policies to the changed data, and the CDF cannot be read for data source tables if the user does not have access to the raw data in Databricks or for .
The CDF can be read if the querying user is allowed to read the raw data and ONE of the following statements is true:
the table is in the current workspace
the table is in a scratch path
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting
non-Immuta reads are enabled AND the table is not part of an Immuta data source
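The conditions above can be read as a single boolean rule. The sketch below encodes it directly; the dataclass and its field names are hypothetical labels for the facts named in the list, not an Immuta API.

```python
from dataclasses import dataclass


@dataclass
class CdfContext:
    can_read_raw_data: bool
    table_in_current_workspace: bool
    table_in_scratch_path: bool
    non_immuta_reads_enabled: bool
    # True if the table intersects with a workspace under which the
    # current user is not acting.
    intersects_other_workspace: bool
    table_is_immuta_data_source: bool


def can_read_cdf(ctx: CdfContext) -> bool:
    """Raw-data access is mandatory; any one of the four conditions then suffices."""
    if not ctx.can_read_raw_data:
        return False
    return (
        ctx.table_in_current_workspace
        or ctx.table_in_scratch_path
        or (ctx.non_immuta_reads_enabled and not ctx.intersects_other_workspace)
        or (ctx.non_immuta_reads_enabled and not ctx.table_is_immuta_data_source)
    )
```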
Security vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine whether this risk applies to your environment or use case.
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta Security Manager errors when using third-party libraries. An administrator can specify an installed library as trusted, which will enable that library's code to bypass the Immuta Security Manager. This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through that otherwise would have been blocked by the Security Manager.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
Library source is Upload, DBFS or DBFS/S3 and the Library Type is Jar.
Library source is Maven.
When users install third-party libraries, those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta. See the to add a trusted library to your configuration.
Limitations
Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
waiting for library installation to complete before running any third-party library commands or
Connect any of these to work with your Databricks Spark integration so data owners can tag their data.
Immuta supports the use of external metastores in local or remote mode:
Local mode: The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.
Remote mode: Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.
For more details about these deployment modes, see .
Users on Databricks Runtimes 8+ can manage notebook-scoped libraries with %pip commands.
However, this functionality differs from the trusted libraries feature, and Python libraries are not supported as trusted libraries. The Immuta Security Manager will block code in libraries installed with %pip from accessing sensitive resources.
Scratch paths are cluster-specific remote file paths that Databricks users are allowed to directly read from and write to without restriction. The creator of a Databricks cluster specifies the set of remote file paths that are designated as scratch paths on that cluster when they configure a Databricks cluster. Scratch paths are useful for scenarios where non-sensitive data needs to be written out to a specific location using a Databricks cluster protected by Immuta.
To configure a scratch path, use the .
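As a rough illustration of what "designated as a scratch path" means for a given file location, the sketch below checks whether a path falls under one of the configured scratch paths. The prefix-matching rule, the function name, and the example bucket are assumptions for illustration; Immuta's actual matching logic is not described in this section.

```python
from typing import Iterable


def is_under_scratch_path(path: str, scratch_paths: Iterable[str]) -> bool:
    """Return True if path equals or is nested under one of the scratch paths."""
    # Compare with a trailing slash so that ".../scratch-other" is not
    # treated as being inside ".../scratch".
    normalized = path.rstrip("/") + "/"
    for scratch in scratch_paths:
        prefix = scratch.rstrip("/") + "/"
        if normalized.startswith(prefix):
            return True
    return False


# Hypothetical scratch path configured by the cluster creator.
scratch = ["s3://example-bucket/scratch"]
print(is_under_scratch_path("s3://example-bucket/scratch/out.parquet", scratch))
print(is_under_scratch_path("s3://example-bucket/scratch-other/out.parquet", scratch))
```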
spark.databricks.repl.allowedLanguages is a subset of {python, sql}
IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED is true
When the cluster is configured this way, Immuta can rely on Databricks' process isolation and Py4J security to prevent user code from performing unauthorized actions.
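The two conditions above can be checked mechanically. The following sketch takes a cluster's Spark configuration and environment variables as plain dictionaries and reports whether both conditions hold; the function name is hypothetical, the allowed-languages value is assumed to be a comma-separated list, and the key lookup is case-insensitive purely as a convenience.

```python
from typing import Mapping

PY4J_SAFE_LANGUAGES = {"python", "sql"}


def can_rely_on_py4j_isolation(spark_conf: Mapping[str, str],
                               env_vars: Mapping[str, str]) -> bool:
    """True when allowed languages are within {python, sql} and strict Py4J is on."""
    conf = {key.lower(): value for key, value in spark_conf.items()}
    raw_languages = conf.get("spark.databricks.repl.allowedlanguages", "")
    languages = {
        lang.strip().lower() for lang in raw_languages.split(",") if lang.strip()
    }
    strict = (
        env_vars.get("IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED", "")
        .strip()
        .lower()
        == "true"
    )
    # An unset or empty language list is treated as not meeting the condition.
    return bool(languages) and languages <= PY4J_SAFE_LANGUAGES and strict
```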
Note: Immuta still expects the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions to be set and pointing at the Security Manager.
Beyond disabling the Security Manager, Immuta will skip several startup tasks that are required to secure the cluster when Scala and R are configured, and fewer permission checks will occur on the Driver and Executors in the Databricks cluster, reducing overhead and improving performance.
Caveats
There are still cases that require the Security Manager; in those instances, Immuta creates a fallback Security Manager to check the code path, so the IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI environment variable must always point to a valid calling class file.
Databricks’ dbutils is blocked by their Py4J security; therefore, it can’t be used to access scratch paths.
SHOW DATABASES
immuta
immuta.my_schema_my_table
To hide the immuta database, use the following environment variable in the Spark cluster configuration when configuring your integration:
Then, Immuta will not show this database when a SHOW DATABASES query is performed.
A policy is deleted: When a policy is deleted, the policy information is deleted from the Metadata Database. If users were granted access to the data source by that policy, their access is revoked.
Databricks user is mapped to Immuta: When a Databricks user is mapped to Immuta, their metadata is stored in the Metadata Database.
Databricks user queries data: When a user queries the data in Databricks, Immuta intercepts the call from Spark down to the Metastore. Then, the Immuta-maintained Spark plugin retrieves the policy information, the user metadata, and the data source metadata from the Metadata Database and injects this information as policy logic into the Spark logical plan. Once the physical plan is applied, Databricks returns policy-enforced data to the user.
Databricks Connect is unsupported: Databricks Connect is unsupported because Py4J security must be enabled to use it.
Non-Immuta reads and writes
✅
✅
✅
✅
✅
✅
Feature | Databricks Runtime 11.3 | Databricks Runtime 14.3
Python | ✅ | ✅
SQL | ✅ | ✅
R | ✅ | ❌
Scala | ✅ | ❌
Immuta project workspaces | ✅ | ❌
Smart mask ordering | ✅ | ❌
Masking and tagging complex columns (STRUCT, ARRAY, MAP) | ✅ | ❌
Photon support | ✅ | ❌
dbutils | ✅ | ❌
Databricks Connect | ✅ | ❌
Write policies | ❌ | ❌
Support for allowlisting networks or local filesystem paths | ❌ | ✅
Subscription policies | ✅ | ✅
Data policies | ✅ | ✅
For full details on Databricks’ best practices in configuring clusters, read their governance documentation.
When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
The following Databricks features are unsupported when this cluster policy is applied:
Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier)
dbutils.fs
Databricks Connect client library
The Security Manager will incur a small increase in performance overhead; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
A homogeneous cluster is recommended for configurations where Py4J security is disabled. If all users have the same level of authorization, there would not be any data leakage, even if a nefarious action was taken.
Multi-User Clusters: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, it is not recommended to deploy a multi-user cluster. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).
1 - Enable sparklyr
In addition to the configuration for an Immuta cluster with R, add this environment variable to the Environment Variables section of the cluster:
This configuration makes changes to the iptables rules on the cluster to allow the sparklyr client to connect to the required ports on the JVM used by the sparklyr backend service.
2 - Set up a sparklyr connection in Databricks
Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so library(sparklyr) in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from CRAN. Additionally, loading library(DBI) will allow you to execute SQL queries.
Set up a sparklyr connection:
Pass the connection object to execute queries:
3 - Configure a single-user cluster
Add the following items to the Spark Config section of the cluster:
The trustedFileSystems setting is required to allow Immuta’s wrapper FileSystem (used in conjunction with the Security Manager for data security purposes) to be used with credential passthrough. Additionally, the InstanceProfileCredentialsProvider must be configured to continue using the cluster’s instance profile for data access, rather than a role associated with the attached user.
Avoid deploying multi-user clusters with sparklyr configuration
It is possible, but not recommended, to deploy a multi-user cluster sparklyr configuration. Immuta cannot guarantee user isolation in a multi-user sparklyr configuration.
The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta native workspaces.
Add the following environment variables to the Environment Variables section of your cluster configuration:
Add the following items to the Spark Config section:
Immuta’s integration with sparklyr does not currently support
spark-submit jobs
UDFs
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES: Immuta users with regular (non-privileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta. With non-Immuta writes enabled through the cluster policy, users on the cluster can mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s project workspaces.
executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util library, the user will not be able to directly use the file-util library to read a sensitive file that is normally protected by the Immuta security manager.
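The rule in the paragraph above (a code path is allowed only if it includes a class from a trusted library) can be pictured with a simplified model. This is not how the Security Manager is implemented; the function name and all package names are hypothetical, and a call stack is represented as a plain list of class names.

```python
from typing import Iterable, Sequence


def code_path_is_trusted(call_stack: Sequence[str],
                         trusted_packages: Iterable[str]) -> bool:
    """True if any class on the call stack belongs to a trusted library package."""
    prefixes = tuple(pkg.rstrip(".") + "." for pkg in trusted_packages)
    return any(cls.startswith(prefixes) for cls in call_stack)


trusted = ["com.example.trustedlib"]  # hypothetical trusted third-party library

# The trusted library calls down into its transitive dependency: allowed.
via_trusted = [
    "user.notebook.Cell",
    "com.example.trustedlib.Reader",
    "org.example.fileutil.FileReader",
]

# User code calls the transitive dependency directly: still blocked.
direct = ["user.notebook.Cell", "org.example.fileutil.FileReader"]

print(code_path_is_trusted(via_trusted, trusted))
print(code_path_is_trusted(direct, trusted))
```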
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
Add the transitive dependency jar paths to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable. In the driver log4j logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for a log message similar to the following:
In the above example, where slf4j is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.
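The workaround can be partly scripted. The sketch below pulls the source jar location out of a driver log line like the one in the example and appends it to the current value of the variable. The helper names and the existing entry are hypothetical, and treating the variable as a comma-separated list is an assumption to confirm against your configuration.

```python
import re
from typing import Optional

# Matches the "Downloaded library <uri> as" part of the driver log message.
LOG_PATTERN = re.compile(r"LibraryDownloadManager: Downloaded library (\S+) as")


def downloaded_library_uri(log_line: str) -> Optional[str]:
    """Return the source jar URI from a library download log line, if present."""
    match = LOG_PATTERN.search(log_line)
    return match.group(1) if match else None


def add_trusted_uri(current_value: str, uri: str) -> str:
    """Append uri to a comma-separated list (assumed format), skipping duplicates."""
    uris = [item for item in current_value.split(",") if item]
    if uri not in uris:
        uris.append(uri)
    return ",".join(uris)


log_line = (
    "INFO LibraryDownloadManager: Downloaded library "
    "dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar as"
)
uri = downloaded_library_uri(log_line)

# Hypothetical existing value of IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS.
existing = "dbfs:/FileStore/jars/example-trusted-lib.jar"
if uri is not None:
    print(add_trusted_uri(existing, uri))
```

The cluster still has to be restarted after the environment variable is updated, as noted above.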
builtin
Download the metastore jars and set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/* as before.
IMMUTA_SPARK_SHOW_IMMUTA_DATABASE=false
sc <- spark_connect(method = "databricks")
dbGetQuery(sc, "show tables in immuta")
IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true
IMMUTA_SPARK_REQUIRE_EQUALIZATION=true
IMMUTA_SPARK_CURRENT_USER_SCIM_FALLBACK=false
immuta.spark.acl.assume.not.privileged true
immuta.api.key=<user’s API key>
IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.trustedFilesystems com.databricks.s3a.S3AFileSystem,shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,com.databricks.adl.AdlFileSystem,shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem,shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem,shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem,org.apache.hadoop.fs.ImmutaSecureFileSystemWrapper
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.InstanceProfileCredentialsProvider
INFO LibraryDownloadManager: Downloaded library dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar as
local file /local_disk0/tmp/addedFile8569165920223626894slf4j_api_1_7_25-784af.jar