This page provides an overview of the Databricks Spark integration. For installation instructions, see the Databricks Installation Introduction.
Databricks Spark is a plugin integration with Immuta. This integration allows you to protect access to tables and manage row-, column-, and cell-level controls without enabling table ACLs or credential passthrough. Policies are applied to the plan that Spark builds for a user's query and enforced live on-cluster.
An Application Admin will configure Databricks Spark with either the Simplified Databricks Spark Configuration on the Immuta App Settings page or the Manual Databricks Spark Configuration, where Immuta artifacts must be downloaded and staged to your Databricks clusters.
In both configuration options, the Immuta init script adds the Immuta plugin in Databricks: the Immuta Security Manager, wrappers, and Immuta analysis hook plan rewrite. Once an administrator gives users Can Attach To entitlements on the cluster, they can query Immuta-registered data sources directly in their Databricks notebooks.
Simplified Databricks Spark configuration additional entitlements
The credentials used to perform the Simplified Databricks Spark configuration with automatic cluster policy push must have the Allow cluster creation entitlement.
This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.
Best practice
Test the integration on an Immuta-enabled cluster with a user that is not a Databricks administrator.
You should register entire databases with Immuta and run Schema Monitoring jobs through the Python script provided during data source registration. Additionally, you should use a Databricks administrator account to register data sources with Immuta using the UI or API; however, you should not test Immuta policies using a Databricks administrator account, because Databricks administrators are able to bypass controls.
A Databricks administrator can control who has access to specific tables in Databricks through Immuta Subscription Policies or by manually adding users to the data source. Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources.
immuta Database
When a table is registered in Immuta as a data source, users can see that table in the native Databricks database and in the immuta database. This allows for an option to use a single database (immuta) for all tables.
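As a rough illustration of the single-database option, the sketch below builds the two names under which one registered table could be queried. The names `sales` and `orders` are hypothetical, and the name a table carries inside the immuta database is assumed to be whatever was assigned at registration, so it is passed in separately rather than derived.

```python
def qualified_names(native_db: str, native_table: str, immuta_table: str) -> dict:
    """Return the two fully qualified names for one registered table."""
    return {
        "native": f"{native_db}.{native_table}",
        "immuta": f"immuta.{immuta_table}",
    }


# Hypothetical example: a table registered from the `sales` database.
names = qualified_names("sales", "orders", "orders")

# In a notebook on an Immuta-enabled cluster, either name could be used in SQL,
# for example: spark.sql(f"SELECT * FROM {names['immuta']}")
```

Using only the `immuta.` names gives notebooks a single database to reference, regardless of where each table lives natively.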
After data users have subscribed to data sources, administrators can apply fine-grained access controls, such as restricting rows or masking columns with advanced anonymization techniques, to manage what the users can see in each table. More details on the types of data policies can be found on the Data Policies page, including an overview of masking struct and array columns in Databricks.
Note: Immuta recommends building Global Policies rather than Local Policies, as they allow organizations to easily manage policies as a whole and capture system state in a more deterministic manner.
All access controls must go through SQL.
Note: With R, you must load the SparkR library in a cell before accessing the data.
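A minimal Python sketch of SQL-based access from a notebook follows. The table name `immuta.my_table` is a placeholder, and `spark` is assumed to be the SparkSession that Databricks provides in notebooks attached to an Immuta-enabled cluster. The helper `read_protected_table` is hypothetical; its only point is that the read is expressed as a SQL statement passed to `spark.sql`.

```python
def read_protected_table(spark, table_name: str, limit: int = 10):
    """Query a table with a SQL statement via spark.sql and return the result."""
    # int() guards against a non-numeric limit being interpolated into the query.
    query = f"SELECT * FROM {table_name} LIMIT {int(limit)}"
    return spark.sql(query)


# Usage in a Databricks notebook (not run here; the table name is a placeholder):
# df = read_protected_table(spark, "immuta.my_table")
# df.show()
```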
Usernames in Immuta must match usernames in Databricks. It is best practice to use the same identity manager for Immuta that you use for Databricks (Immuta supports multiple identity manager protocols and providers); however, for Immuta SaaS users, it is easiest to ensure that usernames match between systems.
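One way to spot mismatches before testing policies is to compare exported username lists from both systems. The sketch below is plain Python with hypothetical usernames; how the lists are exported is left open, and the comparison is an exact string match, which is an assumption about how the two systems compare names.

```python
def find_unmatched(immuta_users, databricks_users):
    """Report usernames that appear in only one of the two systems."""
    immuta_set = set(immuta_users)
    databricks_set = set(databricks_users)
    return {
        "only_in_immuta": sorted(immuta_set - databricks_set),
        "only_in_databricks": sorted(databricks_set - immuta_set),
    }


# Hypothetical exports from each system.
report = find_unmatched(
    ["ana@example.com", "bo@example.com"],
    ["ana@example.com", "Bo@example.com"],
)
# "bo@example.com" and "Bo@example.com" are reported as a mismatch
# because the comparison is exact.
```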
An Immuta Application Administrator configures the Databricks Spark integration and registers available cluster policies Immuta generates.
The Immuta init script adds the immuta plugin in Databricks: the Immuta SecurityManager, wrappers, and Immuta analysis hook plan rewrite.
A Data Owner registers Databricks tables in Immuta as data sources. A Data Owner, Data Governor, or Administrator creates or changes a policy or user in Immuta.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
A Databricks user who is subscribed to the data source in Immuta queries the corresponding table directly in their notebook or workspace.
During Spark Analysis, Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the SecurityManager that it is a trusted node and is allowed to scan raw data.
The Physical Plan is applied, filtering out and transforming the raw data before it is returned to the user.
The user sees policy-enforced data.
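The end of the flow above can be pictured with a toy model in plain Python. This is only a conceptual sketch and not Immuta's implementation: the `Policy` class, the `enforce` function, and the sample rows are all hypothetical, and real enforcement happens inside Spark's Logical and Physical Plans rather than on Python lists. It shows the two effects the steps describe: rows being filtered out and column values being transformed before the user sees them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


def _allow_all(row: Row) -> bool:
    """Default row filter: every row is visible."""
    return True


@dataclass
class Policy:
    """Hypothetical stand-in for the policies that apply to one user."""
    row_filter: Callable[[Row], bool] = _allow_all
    column_masks: Dict[str, Callable[[object], object]] = field(default_factory=dict)


def enforce(rows: List[Row], policy: Policy) -> List[Row]:
    """Drop rows the user may not see, then mask the protected columns."""
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row-level control: this row never reaches the user
        masked = {
            col: policy.column_masks[col](val) if col in policy.column_masks else val
            for col, val in row.items()
        }
        visible.append(masked)
    return visible


# Hypothetical raw data and a policy limiting the user to US rows with masked SSNs.
raw = [
    {"id": 1, "region": "EU", "ssn": "123-45-6789"},
    {"id": 2, "region": "US", "ssn": "987-65-4321"},
]
policy = Policy(
    row_filter=lambda row: row["region"] == "US",
    column_masks={"ssn": lambda value: "REDACTED"},
)
result = enforce(raw, policy)
print(result)  # [{'id': 2, 'region': 'US', 'ssn': 'REDACTED'}]
```

The raw rows are left unmodified; only the copy returned to the caller reflects the policy, which mirrors the idea that the user receives policy-enforced data while the underlying table is unchanged.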