Installation and Compliance
In the Databricks Spark integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.
The sequence diagram below illustrates what happens when an Immuta user queries data in Databricks.
A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)
A cluster that uses one of these supported Databricks Runtimes:
9.1 LTS
10.4 LTS
11.3 LTS
14.3 (private preview) - Requires Immuta version 2025.1.x or newer
Supported languages
Python
R (not supported for Databricks Runtime 14.3)
Scala (not supported for Databricks Runtime 14.3)
SQL
A Databricks cluster that is one of these supported compute types:
Custom access mode
The Databricks Spark integration only works with Spark 3.
Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for authorization and policy enforcement.
Immuta adds several artifacts to your Databricks environment, including the immuta-spark-hive.jar plugin JAR.
Once the Immuta-enabled cluster is running, the following user actions spur various processes. The list below provides an overview of each process:
A policy is deleted: When a policy is deleted, the policy information is deleted from the Metadata Database. If users were granted access to the data source by that policy, their access is revoked.
Databricks user queries data: When a user queries the data in Databricks, Immuta intercepts the call from Spark down to the Metastore. Then, the Immuta-maintained Spark plugin retrieves the policy information, the user metadata, and the data source metadata from the Metadata Database and injects this information as policy logic into the Spark logical plan. Once the physical plan is applied, Databricks returns policy-enforced data to the user.
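For example, once the plugin is installed, a user can query a registered data source with ordinary Spark code and receive policy-enforced results. The sketch below assumes a Databricks notebook (where spark is predefined) and a hypothetical table name:

```python
# Hypothetical example: querying a table that is registered in Immuta as a
# data source. The Immuta Spark plugin injects policy logic into the plan,
# so the rows and columns returned reflect what this user is allowed to see.
result = spark.sql("SELECT * FROM marketing.customers LIMIT 10")
result.show()

# The same enforcement applies when using the DataFrame API.
spark.table("marketing.customers").select("customer_id", "region").show()
```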
The image below illustrates these processes and how they interact.
The Databricks Spark integration allows users to author subscription and data policies to enforce access controls. See the corresponding subscription policy and data policy pages for details about the specific types of policies supported.
Immuta supports clusters on Databricks Runtime 14.3. The integration for this Databricks Runtime differs from the integration for other supported Runtimes in the following ways:
Py4J security and process isolation automatically enabled: Immuta relies on Databricks process isolation and Py4J security to prevent user code from performing unauthorized actions. After selecting Runtime 14.3 during configuration, Immuta will automatically enable process isolation and Py4J security.
dbutils is unsupported: Immuta relies on Databricks process isolation and Py4J security to prevent user code from performing unauthorized actions. This means that dbutils is not supported for Databricks Spark integrations using Runtime 14.3.
The Databricks Spark integration supports the following authentication methods to configure the integration:
Personal access token (PAT): This token gives Immuta temporary permission to push cluster policies to the configured Databricks workspace (overwriting any cluster policy templates previously applied to the workspace) when configuring the integration, and to register securables as Immuta data sources.
Immuta captures the code or query that triggers the Spark plan in Databricks, making audit records more useful in assessing what users are doing. To audit what triggers the Spark plan, Immuta hooks into Databricks where notebook cells and JDBC queries execute and saves the cell or query text. Then, Immuta pulls this information into the audits of the resulting Spark jobs.
Databricks secrets can be used in the environment variables configuration section for a cluster by referencing the secret path instead of the actual value of the environment variable. For example, if a user wanted to make the MY_SECRET_ENV_VAR=abcd_1234 value secret, they could instead create a Databricks secret and reference it as the value of that variable by following these steps:
Create the secret scope my_secrets and add a secret with the key my_secret_env_var containing the sensitive environment variable.
Reference the secret in the environment variables section as MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}.
At runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
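For reference, the scope and secret above could be created with the Databricks SDK for Python; this is a minimal sketch that assumes the SDK is installed and authenticated against your workspace (the same steps can be performed with the Databricks CLI or UI):

```python
# Minimal sketch: create the secret scope and secret referenced above using
# the Databricks SDK for Python (authentication via standard environment
# variables such as DATABRICKS_HOST and DATABRICKS_TOKEN is assumed).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create the scope that will hold the sensitive value.
w.secrets.create_scope(scope="my_secrets")

# Store the sensitive environment variable value as a secret.
w.secrets.put_secret(
    scope="my_secrets",
    key="my_secret_env_var",
    string_value="abcd_1234",
)
```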
There are limitations to isolation among users in Scala jobs on a Databricks cluster, even when using Immuta's Security Manager. When data is broadcast, cached (spilled to disk), or otherwise saved to SPARK_LOCAL_DIR, it is impossible to distinguish which user's data is contained in each file or block. If you are concerned about this vulnerability, Immuta suggests that you limit Scala clusters to Scala jobs only and require equalized projects (see the guidance on equalized projects below).
When data is read in Spark using an Immuta policy-enforced plan, the masking and redaction of rows is performed at the leaf level of the physical Spark plan, so a policy such as "Mask using hashing the column social_security_number for everyone" would be implemented as an expression on a project node directly above the FileSourceScanExec/LeafExec node at the bottom of the plan. This process prevents raw data from being shuffled in a Spark application and, consequently, from ending up in SPARK_LOCAL_DIR.
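For intuition, the hashing mask described above behaves roughly like the following PySpark projection applied directly over the scan. This is only a conceptual sketch with a hypothetical table name, not the plugin's actual implementation:

```python
# Conceptual sketch of what "mask using hashing" on social_security_number
# roughly corresponds to; the Immuta plugin injects an equivalent expression
# just above the leaf scan node of the physical plan.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("hr.employees")  # hypothetical Immuta-registered table

masked = df.withColumn(
    "social_security_number",
    F.sha2(F.col("social_security_number").cast("string"), 256),
)

# Because the mask is applied before any shuffle, only hashed values can be
# cached or spilled to SPARK_LOCAL_DIR.
masked.show()
```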
This policy implementation (masking at the leaf of the plan), coupled with an equalized project, guarantees that data being dropped into SPARK_LOCAL_DIR will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.
A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service must also be able to connect to and perform queries on the Databricks cluster, and to call the Databricks API.
When an administrator configures the Databricks Spark integration, Immuta generates a cluster policy that the administrator then applies to the Databricks cluster. When the cluster starts after the cluster policy has been applied, the init script that Immuta provides downloads the Spark plugin artifacts onto the cluster and puts them in the appropriate locations on local disk for use by Spark.
Note: The Security Manager is disabled for Databricks Runtime 14.3.
To hide the immuta database, set the appropriate environment variable in the environment variables configuration section when configuring your integration.
Data source is registered: When a data owner registers a Databricks securable as a data source, data source metadata (column type, securable name, column names, etc.) is retrieved from the Metastore and stored in the Immuta Metadata Database. If tags are then applied to the data source, Immuta stores this metadata in the Metadata Database as well.
Data source is deleted: When a data source is deleted, the data source metadata is deleted from the Metadata Database. Depending on the settings configured for the integration, users will either be able to query that data now that it is no longer registered in Immuta, or access to the securable will be revoked for all users. See the integration configuration documentation for details about this setting.
A policy is applied: Information about the policy and the columns or securables it applies to is stored in the Metadata Database. When a user queries the data in Databricks, the Spark plugin retrieves the policy information, the user metadata, and the data source metadata from the Metadata Database and injects this information as policy logic into the Spark logical plan. Immuta caches policy information and data source definitions in memory on the Spark application to reduce calls to the Metadata Database and boost performance.
Databricks user is mapped to Immuta: When a Databricks user is mapped to Immuta, their metadata is stored in the Metadata Database.
R and Scala are unsupported: The Security Manager is disabled for Databricks Runtime 14.3. Because the Security Manager is used to prevent users from circumventing access controls when using R and Scala, those languages are unsupported. Only Python and SQL clusters are supported.
Databricks Connect is unsupported: Because Py4J security must be enabled for this integration, Databricks Connect cannot be used.
OAuth machine-to-machine (M2M): Immuta uses the client credentials flow to integrate with Databricks OAuth machine-to-machine authentication, which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal's identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the Databricks OAuth machine-to-machine (M2M) authentication documentation for more details.
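For illustration only, the client credentials exchange underlying the M2M flow looks roughly like the sketch below. Immuta performs this exchange internally; the workspace URL, client ID, and client secret are placeholders, and the /oidc/v1/token endpoint and all-apis scope follow Databricks' documented M2M flow:

```python
# Illustrative sketch of the OAuth client credentials exchange used by the
# M2M flow. Immuta performs this internally; all values are placeholders.
import requests

WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"   # placeholder
CLIENT_ID = "my-service-principal-client-id"                  # placeholder
CLIENT_SECRET = "my-service-principal-client-secret"          # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/oidc/v1/token",
    auth=(CLIENT_ID, CLIENT_SECRET),  # HTTP basic auth with the client secret
    data={"grant_type": "client_credentials", "scope": "all-apis"},
)
resp.raise_for_status()

# The short-lived access token (valid for about an hour) is used as a bearer
# token on subsequent requests and refreshed when it expires.
access_token = resp.json()["access_token"]
```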
Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not. To configure Immuta to do so, enable the corresponding setting in the Spark cluster configuration when configuring your integration.
See the audit documentation for more details about the audit capabilities in the Databricks Spark integration.
Non-administrator users on an Immuta-enabled Databricks cluster must not have access to view or modify the Immuta configuration or the immuta-spark-hive.jar file, as this poses a security loophole around Immuta policy enforcement. Databricks secrets allow you to securely apply environment variables to Immuta-enabled clusters.
Immuta recommends that Scala clusters require equalized projects, which will force all users to act under the same set of attributes, groups, and purposes with respect to their data access. To require that Scala clusters be used in equalized projects and avoid the risk described above, set the project equalization requirement setting to true.
Once this configuration is complete, users on the cluster will need to switch to an Immuta equalized project before running a job. Once the first job is run using that equalized project, all subsequent jobs, no matter the user, must also be run under that same equalized project. If you need to change a cluster's project, you must restart the cluster.
The troubleshooting guide has guidance for resolving issues with your installation.