# Installation and Compliance

In the Databricks Spark integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.

The sequence diagram below breaks down the events that occur when an Immuta user queries data in Databricks.

<figure><img src="/files/idQL9wNk1cToB20FDrLA" alt=""><figcaption><p>Immuta intercepts Spark calls to the Metastore. Immuta then modifies the logical plan so that policies are applied to the data for the querying user.</p></figcaption></figure>

## System requirements

* A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)
* A cluster that uses one of these supported Databricks Runtimes:
  * 11.3 LTS
  * 14.3
* Supported languages
  * Python
  * R (not supported for Databricks Runtime 14.3)
  * Scala (not supported for Databricks Runtime 14.3)
  * SQL
* A Databricks cluster that is one of these supported compute types:
  * [All-purpose compute](https://docs.databricks.com/en/compute/index.html#types-of-compute)
  * [Job compute](https://docs.databricks.com/en/compute/index.html#types-of-compute)
* Custom access mode
* A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service also must be able to connect to and perform queries on the Databricks cluster, and to call [Databricks workspace APIs](https://docs.databricks.com/api/workspace/introduction).
* The Databricks Spark integration only works with Spark 3.

## What does Immuta do in my Databricks environment?

When an administrator configures the Databricks Spark integration, Immuta generates a cluster policy that the administrator then applies to the Databricks cluster. When the cluster starts with that policy applied, the [init script](https://docs.databricks.com/en/init-scripts/index.html) that Immuta provides downloads the Spark plugin artifacts onto the cluster and places them in the appropriate locations on local disk for use by Spark.

<figure><img src="/files/dzJWMB7JjLFz0DPcMxCv" alt=""><figcaption></figcaption></figure>

Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for authorization and policy enforcement.

Immuta adds the following artifacts to your Databricks environment:

<details>

<summary>Immuta-maintained Spark plugin</summary>

The Databricks Spark integration injects this Immuta-maintained Spark plugin into the SparkSQL stack at cluster startup time. Policy determinations are obtained from the connected Immuta tenant and applied before returning results to the user. The plugin includes wrappers and Immuta analysis hook plan rewrites to enforce policies.

</details>

<details>

<summary>Immuta Security Manager</summary>

*Note: The Security Manager is disabled for [Databricks Runtime 14.3](#databricks-runtime-14.3).*

The Immuta Security Manager ensures users can't perform unauthorized actions when using Scala and R, since those languages have features that allow users to circumvent policies without the Security Manager enabled. The Immuta Security Manager blocks users from executing code that could allow them to gain access to sensitive data by only allowing select code paths to access sensitive files and methods. These select code paths provide Immuta's code access to sensitive resources while blocking end users from these sensitive resources directly.

**Performance**

The Security Manager must inspect the call stack every time a permission check is triggered, which adds overhead to queries. To improve Immuta's query performance on Databricks, Immuta disables the Security Manager when Scala and R are not being used.

The cluster init script checks the cluster’s configuration and automatically removes the Security Manager configuration when both of the following conditions are met:

* `spark.databricks.repl.allowedlanguages` is a subset of `{python, sql}`
* `IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED` is `true`

When the cluster is configured this way, Immuta can rely on Databricks' process isolation and Py4J security to prevent user code from performing unauthorized actions.
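
The check described above can be sketched in a few lines of Python. This is an illustration only, not Immuta's init script: the function name `security_manager_can_be_removed` is hypothetical, the two conditions are assumed to be combined with AND, and an unset language list is assumed not to qualify.

```python
# Languages that do not rely on the Security Manager, per the conditions above.
SAFE_LANGUAGES = {"python", "sql"}


def security_manager_can_be_removed(spark_conf, env):
    """Return True when the cluster is limited to Python/SQL and strict Py4J is on.

    spark_conf and env are plain dicts standing in for the cluster's Spark
    configuration and environment variables (hypothetical inputs).
    """
    raw = spark_conf.get("spark.databricks.repl.allowedlanguages", "")
    allowed = {lang.strip().lower() for lang in raw.split(",") if lang.strip()}
    # Assumption: an empty or missing language list does not count as a subset.
    languages_ok = bool(allowed) and allowed <= SAFE_LANGUAGES
    py4j_strict = env.get("IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED", "").lower() == "true"
    return languages_ok and py4j_strict


python_sql_cluster = security_manager_can_be_removed(
    {"spark.databricks.repl.allowedlanguages": "python,sql"},
    {"IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED": "true"},
)
```

A cluster that also allows `scala` or `r` would fail the subset test, so the Security Manager configuration would stay in place.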

*Note: Immuta still expects `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` to be set and to point to the Security Manager.*

Beyond disabling the Security Manager, Immuta will skip several startup tasks that are required to secure the cluster when Scala and R are configured, and fewer permission checks will occur on the Driver and Executors in the Databricks cluster, reducing overhead and improving performance.

**Caveats**

* There are still cases that require the Security Manager; in those instances, Immuta creates a fallback Security Manager to check the code path, so the `IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI` environment variable must always point to a valid calling class file.
* Databricks’ dbutils is blocked by their Py4J security; therefore, it can’t be used to access scratch paths.

</details>

<details>

<summary><code>immuta</code> database</summary>

When a table is registered in Immuta as a data source, users can see that table in the native Databricks database and in the `immuta` database. This gives users the option of working from a single database (`immuta`) for all tables.

The `immuta` database on Immuta-enabled clusters allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, Immuta supports raw tables in Databricks, so table-backed queries do not need to reference this database.

When configuring a Databricks cluster, you can hide `immuta` from any calls to `SHOW DATABASES` so that users are not confused or misled by that database. Hiding the database does not disable access to it. Queries can still be performed against tables in the `immuta` database using the Immuta-qualified table name (e.g., `immuta.my_schema_my_table`) regardless of whether this database is hidden.

To hide the `immuta` database, use the following environment variable in the [Spark cluster configuration](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/configuration.md#immuta_spark_show_immuta_database) when configuring your integration:

```conf
IMMUTA_SPARK_SHOW_IMMUTA_DATABASE=false
```

Then, Immuta will not show this database when a `SHOW DATABASES` query is performed.
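
As a rough mental model of this setting, the sketch below filters a database listing the way a hidden `immuta` database would appear to users. It is a toy illustration, not Immuta code: `visible_databases` is a hypothetical helper, and treating an unset variable as "show the database" is an assumption.

```python
def visible_databases(all_databases, env):
    """Return the database names a SHOW DATABASES call would list.

    env is a plain dict standing in for the cluster's environment variables.
    Assumption: the immuta database is listed unless the variable is "false".
    """
    show_immuta = env.get("IMMUTA_SPARK_SHOW_IMMUTA_DATABASE", "true").lower() != "false"
    return [db for db in all_databases if show_immuta or db != "immuta"]


databases = ["default", "immuta", "my_schema"]
hidden_view = visible_databases(databases, {"IMMUTA_SPARK_SHOW_IMMUTA_DATABASE": "false"})
default_view = visible_databases(databases, {})
```

Only the listing changes. A query against `immuta.my_schema_my_table` would still run whether or not the database is hidden.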

</details>

Once the Immuta-enabled cluster is running, the following user actions trigger the processes described below:

* [**Data source is registered**](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/registering-and-protecting-data.md#registering-data): When a data owner registers a Databricks securable as a data source, the data source metadata (column type, securable name, column names, etc.) is retrieved from the Metastore and stored in the Immuta Metadata Database. If tags are then applied to the data source, Immuta stores this metadata in the Metadata Database as well.
* **Data source is deleted**: When a data source is deleted, the data source metadata is deleted from the Metadata Database. Depending on the settings configured for the integration, users will either be able to query that data now that it is no longer registered in Immuta, or access to the securable will be revoked for all users. See the [Protected and unprotected tables section](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/customizing-the-integration.md#protected-and-unprotected-tables) for details about this setting.
* [**Policy is created or edited on a data source**](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/registering-and-protecting-data.md#protecting-data): Information about the policy and the columns or securables it applies to is stored in the Metadata Database. When a user queries the data in Databricks, the Spark plugin retrieves the policy information, the user metadata, and the data source metadata from the Metadata Database and injects this information as policy logic into the Spark logical plan. Immuta caches policy information and data source definitions in memory on the Spark application to reduce calls to the Metadata Database and boost performance.
* **Policy is deleted**: When a policy is deleted, the policy information is deleted from the Metadata Database. If users were granted access to the data source by that policy, their access is revoked.
* [**Databricks user is mapped to Immuta**](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/setting-up-users.md#mapping-databricks-users-to-immuta): When a Databricks user is mapped to Immuta, their metadata is stored in the Metadata Database.
* **Databricks user queries data**: When a user queries the data in Databricks, Immuta intercepts the call from Spark down to the Metastore. Then, the Immuta-maintained Spark plugin retrieves the policy information, the user metadata, and the data source metadata from the Metadata Database and injects this information as policy logic into the Spark logical plan. Once the physical plan is applied, Databricks returns policy-enforced data to the user.

The image below illustrates these processes and how they interact.

<figure><img src="/files/p4fVnBoFjspHc6HOsfwC" alt=""><figcaption></figcaption></figure>
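
The in-memory caching of policy information mentioned in the list above can be pictured with a minimal sketch. Everything here is hypothetical (the `PolicyCache` class, the `fake_metadata_db` function, and the shape of the returned record), and cache invalidation is left out because the source does not describe it.

```python
class PolicyCache:
    """In-memory cache keyed by data source name (hypothetical helper)."""

    def __init__(self, fetch_from_metadata_db):
        self._fetch = fetch_from_metadata_db
        self._entries = {}
        self.lookups = 0  # number of calls that reached the Metadata Database

    def get(self, data_source):
        # Only go to the Metadata Database the first time a data source is seen.
        if data_source not in self._entries:
            self.lookups += 1
            self._entries[data_source] = self._fetch(data_source)
        return self._entries[data_source]


def fake_metadata_db(data_source):
    # Stand-in for a Metadata Database lookup; the record shape is made up.
    return {"data_source": data_source, "masked_columns": ["social_security_number"]}


cache = PolicyCache(fake_metadata_db)
first = cache.get("my_schema.my_table")
second = cache.get("my_schema.my_table")  # served from memory, no second lookup
```

The second call returns the cached record, which is the effect the integration relies on to reduce calls to the Metadata Database.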

### Supported policies

The Databricks Spark integration allows users to author subscription and data policies to enforce access controls. See the corresponding pages for details about specific types of policies supported:

* [Subscription policy access types](/latest/governance/author-policies-for-data-access-control/authoring-policies-in-secure/section-contents/reference-guides/subscription-access-types.md)
* [Data policy types](/latest/governance/author-policies-for-data-access-control/authoring-policies-in-secure/data-policies/reference-guides/data-policies.md#policy-support-matrix)

### Databricks Runtime 14.3

Immuta supports clusters on Databricks Runtime 14.3. The integration for this Databricks Runtime differs from the integration for other supported Runtimes in the following ways:

* [**Security Manager is disabled**](#immuta-security-manager): The Security Manager is disabled for Databricks Runtime 14.3. Because the Security Manager is used to prevent users from circumventing access controls when using R and Scala, those languages are unsupported. Only Python and SQL clusters are supported.
* **Py4J security and process isolation automatically enabled**: Immuta relies on Databricks process isolation and Py4J security to prevent user code from performing unauthorized actions. After selecting Runtime 14.3 during configuration, Immuta will automatically enable process isolation and Py4J security.
* **dbutils is unsupported**: Immuta relies on Databricks process isolation and Py4J security to prevent user code from performing unauthorized actions. This means that dbutils is not supported for Databricks Spark integrations using Runtime 14.3.
* [**Databricks Connect is unsupported**](https://docs.databricks.com/en/dev-tools/databricks-connect/index.html): Databricks Connect is unsupported because Py4J security must be enabled to use it.

The table below compares the features supported for clusters on Databricks Runtime 11.3 and Databricks Runtime 14.3.

| Feature                                                                                                                                                                                 | Databricks Runtime 11.3 | Databricks Runtime 14.3 |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------- | ----------------------- |
| Subscription policies                                                                                                                                                                   | :white\_check\_mark:    | :white\_check\_mark:    |
| Data policies                                                                                                                                                                           | :white\_check\_mark:    | :white\_check\_mark:    |
| [Scratch paths](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/customizing-the-integration.md#scratch-paths)                                | :white\_check\_mark:    | :white\_check\_mark:    |
| [Project UDFs](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/customizing-the-integration.md#restricting-users-access-with-immuta-projects) | :white\_check\_mark:    | :white\_check\_mark:    |
| [Non-Immuta reads and writes](#user-content-fn-1)[^1]                                                                                                                                   | :white\_check\_mark:    | :white\_check\_mark:    |
| [Impersonation](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/setting-up-users.md#user-impersonation)                                      | :white\_check\_mark:    | :white\_check\_mark:    |
| [Metastore magic](/latest/configuration/integrations/databricks.md#metastore-magic)                                                                                                     | :white\_check\_mark:    | :white\_check\_mark:    |
| Python                                                                                                                                                                                  | :white\_check\_mark:    | :white\_check\_mark:    |
| SQL                                                                                                                                                                                     | :white\_check\_mark:    | :white\_check\_mark:    |
| R                                                                                                                                                                                       | :white\_check\_mark:    | :x:                     |
| Scala                                                                                                                                                                                   | :white\_check\_mark:    | :x:                     |
| Immuta project workspaces                                                                                                                                                               | :white\_check\_mark:    | :x:                     |
| Smart mask ordering                                                                                                                                                                     | :white\_check\_mark:    | :x:                     |
| Masking and tagging complex columns (STRUCT, ARRAY, MAP)                                                                                                                                | :white\_check\_mark:    | :x:                     |
| Photon support                                                                                                                                                                          | :white\_check\_mark:    | :x:                     |
| dbutils                                                                                                                                                                                 | :white\_check\_mark:    | :x:                     |
| Databricks Connect                                                                                                                                                                      | :white\_check\_mark:    | :x:                     |
| Write policies                                                                                                                                                                          | :x:                     | :x:                     |
| Support for allowlisting networks or local filesystem paths                                                                                                                             | :x:                     | :white\_check\_mark:    |

## Cluster security and compliance

### Authentication methods

The Databricks Spark integration supports the following authentication methods to configure the integration:

* **OAuth machine-to-machine (M2M)**: Immuta uses the [Client Credentials Flow](https://auth0.com/docs/get-started/authentication-and-authorization-flow/client-credentials-flow) to integrate with [Databricks OAuth machine-to-machine authentication](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html), which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal’s identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the [Databricks OAuth machine-to-machine (M2M) authentication page](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html) for more details.
* **Personal access token (PAT)**: When you configure the integration, this token gives Immuta temporary permission to push the cluster policies to the configured Databricks workspace and to overwrite any cluster policy templates previously applied to that workspace. The token also gives Immuta temporary permission to register securables as Immuta data sources.
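
The token lifecycle described for OAuth M2M (obtain a temporary token, reuse it, request a new one after it expires) can be sketched as follows. This is a minimal illustration, not Immuta's implementation: `TokenCache` and `request_token` are hypothetical names, and the actual HTTP call that exchanges the client secret for a token is deliberately left to the injected callable.

```python
import time


class TokenCache:
    """Caches a temporary OAuth token and requests a new one once it expires."""

    def __init__(self, request_token, clock=time.monotonic):
        # request_token is a hypothetical callable that performs the client
        # credentials exchange and returns (access_token, lifetime_seconds).
        self._request_token = request_token
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token, lifetime = self._request_token()
            self._expires_at = now + lifetime
        return self._token


# Example with a fake exchange and a fake clock so no network call is made.
fake_now = [0.0]
issued = []


def fake_request_token():
    issued.append(len(issued) + 1)
    return f"token-{len(issued)}", 3600  # one-hour lifetime, as described above


tokens = TokenCache(fake_request_token, clock=lambda: fake_now[0])
initial = tokens.get()
```

Injecting the clock and the exchange function keeps the refresh logic easy to exercise without contacting Databricks.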

### Audit

Immuta captures the code or query that triggers the Spark plan in Databricks, making audit records more useful in assessing what users are doing. To audit what triggers the Spark plan, Immuta hooks into Databricks where notebook cells and JDBC queries execute and saves the cell or query text. Then, Immuta pulls this information into the audits of the resulting Spark jobs.

Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not. To configure Immuta to do so, set the [`IMMUTA_SPARK_AUDIT_ALL_QUERIES` environment variable](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/configuration.md#immuta_spark_audit_all_queries) in the Spark cluster configuration when configuring your integration.

See the [Security and compliance guide](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/security-and-compliance.md#auditing-and-compliance) for more details about the audit capabilities in the Databricks Spark integration.

### Protecting the Immuta configuration

Non-administrator users on an Immuta-enabled Databricks cluster must not be able to view or modify the Immuta configuration or the `immuta-spark-hive.jar` file, because that access would create a loophole around Immuta policy enforcement. [Databricks secrets](https://docs.databricks.com/security/secrets/index.html#spark-conf-env-var) allow you to securely apply environment variables to Immuta-enabled clusters.

Databricks secrets can be used in the environment variables configuration section for a cluster by referencing the secret path instead of the actual value of the environment variable. For example, if a user wanted to make the `MY_SECRET_ENV_VAR=abcd_1234` value secret, they could instead create a Databricks secret and reference it as the value of that variable by following these steps:

1. Create the secret scope `my_secrets` and add a secret with the key `my_secret_env_var` containing the sensitive environment variable.
2. Reference the secret in the environment variables section as `MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}`.

At runtime, `{{secrets/my_secrets/my_secret_env_var}}` would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
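
To make the substitution concrete, the sketch below mimics how a `{{secrets/<scope>/<key>}}` reference is swapped for its value. Databricks performs this replacement itself; `resolve_secret_refs` and the in-memory `secrets` dictionary are hypothetical stand-ins used only to illustrate the reference syntax.

```python
import re

# Matches references of the form {{secrets/<scope>/<key>}}.
SECRET_REF = re.compile(r"\{\{secrets/([^/{}]+)/([^/{}]+)\}\}")


def resolve_secret_refs(value, secrets):
    """Replace each secret reference in value with the matching secret.

    secrets maps (scope, key) tuples to secret values (a hypothetical store).
    A KeyError is raised if a referenced secret does not exist.
    """

    def lookup(match):
        scope, key = match.group(1), match.group(2)
        return secrets[(scope, key)]

    return SECRET_REF.sub(lookup, value)


secrets = {("my_secrets", "my_secret_env_var"): "abcd_1234"}
resolved = resolve_secret_refs("{{secrets/my_secrets/my_secret_env_var}}", secrets)
```

Values that contain no reference pass through unchanged, so non-secret environment variables are unaffected.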

### Scala clusters

There are limitations to isolation among users in Scala jobs on a Databricks cluster, even when using Immuta’s Security Manager. When data is broadcast, cached (spilled to disk), or otherwise saved to `SPARK_LOCAL_DIR`, it's impossible to tell which user’s data is contained in each file or block. If you are concerned about this vulnerability, Immuta suggests that you

* **limit Scala clusters to Scala jobs only** and
* **require equalized projects**, which will force all users to act under the same set of attributes, groups, and purposes with respect to their data access. To require that Scala clusters be used in equalized projects and avoid the risk described above, set the [`IMMUTA_SPARK_REQUIRE_EQUALIZATION` Spark environment variable](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/configuration.md#immuta_spark_require_equalization) to `true`.\
  \
  Once this configuration is complete, users on the cluster will need to switch to an Immuta equalized project before running a job. Once the first job is run using that equalized project, all subsequent jobs, no matter the user, must also be run under that same equalized project. If you need to change a cluster's project, you must restart the cluster.
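
The equalization rule described above (the first job pins the project, later jobs must match it, and only a restart clears it) can be modeled with a small state machine. This is an illustrative sketch; `EqualizedCluster` and its methods are hypothetical and do not correspond to an Immuta API.

```python
class EqualizedCluster:
    """Toy model of a cluster that requires equalized projects."""

    def __init__(self):
        self.active_project = None  # pinned by the first job, cleared by restart

    def run_job(self, user, project):
        if project is None:
            raise PermissionError(f"{user} must switch to an equalized project first")
        if self.active_project is None:
            # The first job on the cluster decides the project for everyone.
            self.active_project = project
        elif project != self.active_project:
            raise PermissionError(
                f"{user} must run under {self.active_project}, not {project}"
            )
        return f"{user} ran a job under {self.active_project}"

    def restart(self):
        # Restarting the cluster is the only way to change its project.
        self.active_project = None


cluster = EqualizedCluster()
result = cluster.run_job("alice", "project_a")
```

A later job from any user under a different project would be rejected until `restart()` is called.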

When data is read in Spark using an Immuta policy-enforced plan, the masking and redaction of rows is performed at the leaf level of the physical Spark plan, so a policy such as "Mask using hashing the column `social_security_number` for everyone" would be implemented as an expression on a project node right above the `FileSourceScanExec/LeafExec` node at the bottom of the plan. This process prevents raw data from being shuffled in a Spark application and, consequently, from ending up in `SPARK_LOCAL_DIR`.
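
The idea of masking directly above the scan can be illustrated without Spark. In the sketch below, plain Python generators stand in for plan nodes, and unsalted SHA-256 stands in for the hashing mask; both are assumptions made for illustration, and the function names `scan` and `project_masked` are hypothetical.

```python
import hashlib


def scan(rows):
    """Leaf node: yields raw rows, standing in for a file scan."""
    yield from rows


def project_masked(rows, column):
    """Project node placed directly above the scan: hashes one column.

    Later stages (shuffles, caches) consume this output, so they never
    receive the raw value of the masked column.
    """
    for row in rows:
        masked = dict(row)  # copy so the source row is left untouched
        masked[column] = hashlib.sha256(str(row[column]).encode("utf-8")).hexdigest()
        yield masked


source_rows = [{"id": 1, "social_security_number": "123-45-6789"}]
downstream_rows = list(project_masked(scan(source_rows), "social_security_number"))
```

Because the mask is applied before any later stage runs, anything written to a location such as `SPARK_LOCAL_DIR` in this model would contain only the hashed value.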

This policy implementation coupled with an equalized project guarantees that data being dropped into `SPARK_LOCAL_DIR` will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.

## Troubleshooting the installation

The [Troubleshooting page](/latest/configuration/integrations/databricks/databricks-spark/how-to-guides/troubleshooting.md) has guidance for resolving issues with your installation.

[^1]: When you set up a Databricks Spark integration, you can enable the setting **Available until protected by policy.** This setting means all tables are open until explicitly registered and protected by Immuta, which allows Databricks users to read and write to data objects that are not registered in Immuta.\
    \
    See the [Customizing the integration page](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/customizing-the-integration.md#protected-and-unprotected-tables) for details.

