Registering and Protecting Data

In the Databricks Spark integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.

The sequence diagram below breaks down this process of events when an Immuta user queries data in Databricks.

Registering data

When data owners register Databricks securables in Immuta, the securable metadata is registered and Immuta creates a corresponding for those securables. The data source metadata is stored in the Immuta Metadata Database so that it can be referenced in policy definitions.

The image below illustrates what happens when a data owner registers the Accounts, Claims, and Customers securables in Immuta.

Users who are subscribed to the data source in Immuta can then query the corresponding securable directly in their Databricks notebook or workspace.

Authentication methods

See the Installation and compliance page for details about the authentication methods supported for registering data.

Schema monitoring

When schema monitoring is enabled, Immuta monitors your servers to detect when new tables or columns are created or deleted, and automatically registers (or disables) those tables in Immuta. These newly updated data sources will then have any global policies and tags that are set in Immuta applied to them. The Immuta data dictionary will be updated with any column changes, and the Immuta environment will be in sync with your data environment.

For Databricks Spark, the automatic schema monitoring job is disabled because of the ephemeral nature of Databricks clusters. In this case, Immuta requires you to download a schema detection job template (a Python script) and import that into your Databricks workspace.

See the Databricks data source guide for instructions on enabling schema monitoring.

Ephemeral overrides

In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.

Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.

When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta. Consequently, .

See the Ephemeral overrides page for more details about ephemeral overrides and how to configure or disable them.

Ephemeral override requests

The Spark plugin has the capability to send ephemeral override requests to Immuta. These requests are distinct from ephemeral overrides themselves. Ephemeral overrides cannot be turned off, but the Spark plugin can be configured to not send ephemeral override requests.

Tag ingestion

Tags can be used in Immuta in a variety of ways:

Use tags for global subscription or data policies that will apply to all data sources in the organization. In doing this, company-wide data security restrictions can be controlled by the administrators and governors, while the users and data owners need only to worry about tagging the data correctly.
Generate Immuta reports from tags for insider threat surveillance or data access monitoring.
Filter search results with tags in the Immuta UI.

The Databricks Spark integration cannot ingest tags from Databricks, but you can connect any of these supported external catalogs to work with your integration.

You can also manage tags in Immuta by manually adding tags to your data sources and columns. Alternatively, you can use identification to automatically tag your sensitive data.

Protecting data

Immuta allows you to author subscription and data policies to automate access controls on your Databricks data.

Subscription policies: After registering data sources in Immuta, you can control who has access to specific securables in Databricks through Immuta subscription policies or by manually adding users to the data source. Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources. See the Subscription policy access types page for a list of policy types supported.
Data policies: You can create data policies to apply fine-grained access controls (such as restricting rows or masking columns) to manage what users can see in each table after they are subscribed to a data source. See the Data policy types page for details about specific types of data policies supported.

The image below illustrates how Immuta enforces a subscription policy that only allows users in the Analysts group to access to yellow-table.

See the Automate data access control decisions page for details about the benefits of using Immuta subscription and data policies.

Policy enforcement in Databricks

Once a Databricks user who is subscribed to the data source in Immuta queries the corresponding securable directly in their workspace, Spark Analysis initiates and the following events take place:

Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the Security Manager that it is a trusted node and is allowed to scan raw data.
The Physical Plan is applied and filters out and transforms raw data coming back to the user.
The user sees policy-enforced data.

The image below illustrates what happens when an Immuta user who is subscribed to the Customers data source queries the securable in Databricks.

Users who can read raw tables on-cluster

Regardless of the policies on the data source, the users will be able to read raw data on the cluster if they meet one of the criteria listed below:

Databricks administrator is tied to an Immuta account
A Databricks user is listed as an ignored user (Users can be specified in the IMMUTA_SPARK_ACL_ALLOWLIST Spark environment variable to become ignored users.)

Protected and unprotected tables

Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.

Databricks non-admin users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. To address this challenge, Immuta allows administrators to change this default setting when configuring the integration so that Immuta users can access securables that are not registered as a data source. Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.

See the Customizing the integration guide for details about this setting.

Restricting users' access to data with Immuta projects

Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; however, most often this is an Immuta purpose for which the data has been approved to be used and will restrict access to data and streamline team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.

When a user is working within the context of a project, they will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose.

When users change project contexts (either through the Immuta UI or with project UDFs), queries reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.

See the Customizing the integration page for details about how to prevent users from switching project contexts in a session.

Project workspaces

Users can have additional write access in their integration using project workspaces. Users can integrate a single or multiple workspaces with a single Immuta tenant.

See the Writing to projects page for more details.

PreviousSecurity and Compliance NextAccessing Data

Last updated 2 months ago

Was this helpful?