Immuta offers two integrations for Databricks:
Databricks Unity Catalog integration: This integration supports working with database objects registered in Unity Catalog.
Databricks Spark integration: This integration supports working with database objects registered in the legacy Hive metastore.
To determine which integration you should use, evaluate the following elements:
Cluster runtime
Databricks Runtime 9.1 or 10.4: Use the Databricks Spark integration.
Databricks Runtime 11.3 and newer: See the list below to determine which integration is supported for your data's location.
Location of data: Where is your data?
Legacy Hive metastore: Databricks recommends that you migrate all data from the legacy Hive metastore to Unity Catalog. However, when this migration is not possible, use the Databricks Spark integration to protect securables registered in the Hive metastore.
Unity Catalog: To protect securables registered in the Unity Catalog metastore, use the Databricks Unity Catalog integration.
Legacy Hive metastore and Unity Catalog: If you need to work with database objects registered in both the legacy Hive metastore and in Unity Catalog, metastore magic allows you to use both integrations.
Databricks metastore magic allows you to migrate your data from the Databricks legacy Hive metastore to the Unity Catalog metastore while protecting data and maintaining your current processes in a single Immuta instance.
Databricks metastore magic is for organizations who intend to use the Databricks Unity Catalog integration, but must still protect tables in the Hive metastore until they can migrate all of their data to Unity Catalog.
Requirement: Unity Catalog support must be enabled in Immuta to use metastore magic.
Databricks has two built-in metastores that contain metadata about your tables, views, and storage credentials:
Legacy Hive metastore: Created at the workspace level. This metastore contains metadata of the registered securables in that workspace available to query.
Unity Catalog metastore: Created at the account level and is attached to one or more Databricks workspaces. This metastore contains metadata of the registered securables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those securables.
Databricks allows you to use the legacy Hive metastore and the Unity Catalog metastore simultaneously. However, Unity Catalog does not support controls on the Hive metastore, so you must attach a Unity Catalog metastore to your workspace and move existing databases and tables to the attached Unity Catalog metastore to use the governance capabilities of Unity Catalog.
Immuta's Databricks Spark integration and Unity Catalog integration enforce access controls on the Hive and Unity Catalog metastores, respectively. However, because these metastores have two distinct security models, users were discouraged from using both in a single Immuta instance before metastore magic; the Databricks Spark integration and Unity Catalog integration were unaware of each other, so using both concurrently caused undefined behavior.
Metastore magic reconciles the distinct security models of the legacy Hive metastore and the Unity Catalog metastore, allowing you to use multiple metastores (specifically, the Hive metastore or AWS Glue Data Catalog alongside Unity Catalog metastores) within a Databricks workspace and single Immuta instance and keep policies enforced on all your tables as you migrate them. The diagram below shows Immuta enforcing policies on registered tables across workspaces.
In clusters A and D, Immuta enforces policies on data sources in each workspace's Hive metastore and in the Unity Catalog metastore shared by those workspaces. In clusters B, C, and E (which don't have Unity Catalog enabled in Databricks), Immuta enforces policies on data sources in the Hive metastores for each workspace.
With metastore magic, the Databricks Spark integration enforces policies only on data in the Hive metastore, while the Unity Catalog integration enforces policies on tables in the Unity Catalog metastore. The table below illustrates this policy enforcement.
To enforce plugin-based policies on Hive metastore tables and Unity Catalog native controls on Unity Catalog metastore tables, enable the Databricks Spark integration and the Databricks Unity Catalog integration. Note that some Immuta policies are not supported in the Databricks Unity Catalog integration. See the Databricks Unity Catalog integration reference guide for details.
Databricks SQL cannot run the Databricks Spark plugin to protect tables, so policies will not be enforced on Hive metastore data sources queried through Databricks SQL.
To enforce policies on data sources in Databricks SQL, use Hive metastore table access controls to manually lock down Hive metastore data sources and the Databricks Unity Catalog integration to protect tables in the Unity Catalog metastore. Table access control is enabled by default on SQL warehouses, and any Databricks cluster without the Immuta plugin must have table access control enabled.
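For example, a minimal sketch of manually locking down a legacy Hive metastore table with Databricks table access controls from a SQL warehouse; the catalog, schema, table, and group names are placeholders, not Immuta functionality:

```python
# A minimal sketch, not Immuta functionality: manually restricting a legacy Hive
# metastore table with Databricks table access controls. The schema, table, and
# group names are placeholders. Run from a SQL warehouse or a cluster with table
# access control enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Remove broad access, then grant SELECT only to the analyst group.
spark.sql("REVOKE ALL PRIVILEGES ON TABLE hive_metastore.sales.transactions FROM `data-consumers`")
spark.sql("GRANT SELECT ON TABLE hive_metastore.sales.transactions TO `analysts`")
```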
This integration allows you to manage and access data in your Databricks account across all of your workspaces. With Immuta’s Databricks Unity Catalog integration, you can write your policies in Immuta and have them enforced automatically by Databricks across data in your Unity Catalog metastore.
This getting started guide outlines how to integrate Databricks Unity Catalog with Immuta.
Databricks Unity Catalog configuration: Configure the Databricks Unity Catalog integration.
Migrate to Databricks Unity Catalog: Migrate from the legacy Databricks Spark integrations to the Databricks Unity Catalog integration.
Databricks Unity Catalog integration reference guide: This guide describes the design and components of the integration.
The how-to guides linked on this page illustrate how to integrate Databricks Unity Catalog with Immuta. See the reference guide for information about the Databricks Unity Catalog integration.
Requirements:
Unity Catalog metastore created and attached to a Databricks workspace. Immuta supports configuring a single metastore for each configured integration, and that metastore may be attached to multiple Databricks workspaces.
Unity Catalog enabled on your Databricks cluster or SQL warehouse. All SQL warehouses have Unity Catalog enabled if your workspace is attached to a Unity Catalog metastore.
These guides provide instructions on getting your data set up in Immuta for the Marketplace and Governance apps.
Register your Databricks Unity Catalog connection: Using a single setup process, connect Databricks Unity Catalog to Immuta. This will register your data objects into Immuta and allow you to start dictating access through Marketplace or global policies.
Organize your data sources into domains and assign domain permissions to accountable teams: Use domains to segment your data and assign responsibilities to the appropriate team members. These domains will then be used in Marketplace, policies, audit, and sensitive data discovery.
These guides provide instructions on getting your users set up in Immuta for the Marketplace and Governance apps.
Connect an IAM: Bring the IAM your organization already uses and allow Immuta to register your users for you.
Map external user IDs from Databricks to Immuta: Ensure the user IDs in Immuta, Databricks, and your IAM are aligned so that the right policies impact the right users.
These guides provide instructions on using Marketplace for the first time.
Publish a data product: Once you register your tables and users, you can immediately start publishing data products in Marketplace.
Request access to a data product: Users must then request access to your data products in Marketplace.
Respond to an access request: To grant access to a data product and its tables, respond to the access request.
These guides provide instructions on getting your data metadata set up in Immuta for the Governance app.
Connect an external catalog: Bring the external catalog your organization already uses and allow Immuta to continually sync your tags with your data sources for you.
Run sensitive data discovery: Sensitive data discovery (SDD) allows you to automate data tagging using identifiers that detect certain data patterns.
These guides provide instructions on using the Governance app for the first time.
Author a global subscription policy: Once you add your data metadata to Immuta, you can immediately create policies that utilize your tags and apply to your tables. Subscription policies can be created to dictate access to data sources.
Author a global data policy: Data metadata can also be used to create data policies that apply to data sources as they are registered in Immuta. Data policies dictate what data a user can see once they are granted access to a data source. Using catalog and SDD tags you can create proactive policies, knowing that they will apply to data sources as they are added to Immuta with the automated tagging.
Configure audit: Once you have your data sources and users, and policies granting them access, you can set up audit export. This will export the audit logs from user queries, policy changes, and tagging updates.
Immuta’s integration with Unity Catalog allows you to enforce fine-grained access controls on Unity Catalog securable objects with Immuta policies. Instead of manually creating UDFs or granting access to each table in Databricks, you can author your policies in Immuta and have Immuta manage and orchestrate Unity Catalog access-control policies on your data in Databricks clusters or SQL warehouses:
Subscription policies: Immuta subscription policies automatically grant and revoke access to specific Databricks securable objects.
Data policies: Immuta data policies enforce row- and column-level security.
Unity Catalog uses the following hierarchy of data objects:
Metastore: Created at the account level and is attached to one or more Databricks workspaces. The metastore contains metadata of all the catalogs, schemas, and tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those objects.
Catalog: Sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas.
Schema: Organizes tables and views.
Tables and other securables: Tables (managed or external), views, volumes, models, and functions.
For details about the Unity Catalog object model, see the Databricks Unity Catalog documentation.
The Databricks Unity Catalog integration supports:
applying column masks and row filters on specific securable objects
applying subscription policies on tables and views
enforcing Unity Catalog access controls, even if Immuta becomes disconnected
allowing non-Immuta reads and writes
using Photon
using a proxy server
Unity Catalog supports managing permissions account-wide in Databricks through controls applied directly to objects in the metastore. To establish a connection with Databricks and apply controls to securable objects within the metastore, Immuta requires a service principal with permissions to manage all data protected by Immuta. Databricks OAuth for service principals (OAuth M2M) or a personal access token (PAT) can be provided for Immuta to authenticate as the service principal. (See the permissions requirements section for a list of specific Databricks privileges.)
Immuta uses this service principal to run queries that set up user-defined functions (UDFs) and other data necessary for policy enforcement. Upon enabling the integration, Immuta will create a catalog that contains these schemas:
immuta_system: Contains internal Immuta data.
immuta_policies_n: Contains policy UDFs.
When policies require changes to be pushed to Unity Catalog, Immuta updates the internal tables in the immuta_system schema with the updated policy information. If necessary, new UDFs are pushed to replace any out-of-date policies in the immuta_policies_n schemas, and any row filters or column masks are updated to point at the new policies. Many of these operations require compute on the configured Databricks cluster or SQL warehouse, so compute must be available for these policies to succeed.
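If you want to confirm what the integration created, a minimal sketch follows, assuming an Immuta-created catalog named immuta_catalog (a placeholder; use the catalog name from your own configuration):

```python
# A minimal sketch for inspecting the catalog Immuta creates; immuta_catalog is a
# placeholder for the Immuta-created catalog name in your integration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expect immuta_system plus one or more immuta_policies_n schemas.
spark.sql("SHOW SCHEMAS IN immuta_catalog").show(truncate=False)

# Internal tables that Immuta updates when policies change live in immuta_system.
spark.sql("SHOW TABLES IN immuta_catalog.immuta_system").show(truncate=False)
```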
Workspace-catalog binding allows users to leverage Databricks’ catalog isolation mode to limit catalog access to specific Databricks workspaces. The default isolation mode is OPEN, meaning all workspaces can access the catalog (with the exception of the automatically-created workspace catalog), provided they are in the metastore attached to the catalog. Setting this mode to ISOLATED allows the catalog owner to specify a workspace-catalog binding, which means the owner can dictate which workspaces are authorized to access the catalog. This prevents other workspaces from accessing the specified catalogs. To bind a catalog to a specific workspace in Databricks Unity Catalog, see the Databricks documentation.
Typical use cases for binding a catalog to specific workspaces include
Ensuring users can only access production data from a production workspace environment.
For example, you may have production data in a prod_catalog, as well as a production workspace you are introducing to your organization. Binding the prod_catalog to the prod_workspace ensures that workspace admins and users can only access prod_catalog from the prod_workspace environment.
Ensuring users can only process sensitive data from a specific workspace. Limiting the environments from which users can access sensitive data helps better secure your organization’s data. Limiting access to one workspace also simplifies any monitoring, auditing, and understanding of which users are accessing specific data. This would entail a similar setup as the example above.
Giving users read-only access to production data from a developer workspace.
This enables your organization to effectively conduct development and testing, while minimizing risk to production data. All user access to this catalog from this workspace can be specified as read-only, ensuring developers can access the data they need for testing without risk of any unwanted updates.
Immuta’s Databricks Unity Catalog integration allows users to configure additional workspace connections to support using Databricks' workspace-catalog binding feature. Users can configure additional workspace connections in their Immuta integrations to be consistent with the workspace-catalog bindings that are set up in Databricks. Immuta will use each additional workspace connection to govern the catalog(s) that workspace is bound to in Databricks. If desired, each set of bound catalogs can also be configured to run on its own compute.
To use this feature, you should first set up a workspace-catalog binding in your Databricks account. Once that is configured, you can use Immuta's Integrations API to configure an additional workspace connection. This can be added when you initially set up the integration or by updating your existing integration configuration.
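A hedged sketch of the Databricks side of this setup, binding a prod_catalog to a single workspace before configuring the matching workspace connection through Immuta's Integrations API; the endpoint paths, payload fields, catalog name, and workspace ID are assumptions to verify against the Databricks documentation:

```python
# A hedged sketch of binding a catalog to one workspace with the Databricks REST
# API before adding the matching workspace connection in Immuta. The endpoint
# paths, payload fields, catalog name, and workspace ID are assumptions; verify
# them against the Databricks workspace-binding documentation for your version.
import requests

HOST = "https://my-workspace.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<databricks-admin-token>"
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Switch the catalog from OPEN to ISOLATED.
requests.patch(
    f"{HOST}/api/2.1/unity-catalog/catalogs/prod_catalog",
    headers=headers,
    json={"isolation_mode": "ISOLATED"},
).raise_for_status()

# 2. Authorize only the production workspace to access the catalog.
requests.patch(
    f"{HOST}/api/2.1/unity-catalog/bindings/catalog/prod_catalog",
    headers=headers,
    json={"add": [{"workspace_id": 1234567890, "binding_type": "BINDING_TYPE_READ_WRITE"}]},
).raise_for_status()
```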
Limitations
Additional workspace connections in Databricks Unity Catalog are not currently supported in Immuta's connections.
Each additional workspace connection must be in the same metastore as the primary workspace used to set up the integration.
No two additional workspace connections can be responsible for the same catalog.
Immuta’s Unity Catalog integration applies Databricks table-, row-, and column-level security controls that are enforced natively within Databricks. Immuta's management of these Databricks security controls is automated and ensures that they synchronize with Immuta policy or user entitlement changes.
Table-level security: Immuta manages REVOKE and GRANT privileges on securable objects in Databricks through subscription policies. When you create a subscription policy in Immuta, Immuta uses the Unity Catalog API to issue GRANTS or REVOKES against the catalog, schema, or table in Databricks for every user affected by that subscription policy.
Row-level security: Immuta applies SQL UDFs to restrict access to rows for querying users.
Column-level security: Immuta applies column-mask SQL UDFs to tables for querying users. These column-mask UDFs run for any column that requires masking.
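For illustration only, the sketch below shows the kind of Unity Catalog row filter and column mask the integration orchestrates; Immuta generates and assigns its own UDFs automatically, and the function, table, column, and group names here are hypothetical:

```python
# Illustration only: Immuta creates and assigns its own UDFs automatically. This
# sketch shows the kind of Unity Catalog row filter and column mask the
# integration orchestrates; all names here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A row filter UDF that only returns rows matching the querying user's region group.
spark.sql("""
  CREATE OR REPLACE FUNCTION demo.policies.region_filter(region STRING)
  RETURNS BOOLEAN
  RETURN is_account_group_member(concat('analysts_', region))
""")
spark.sql("ALTER TABLE demo.sales.orders SET ROW FILTER demo.policies.region_filter ON (region)")

# A column mask UDF that hashes a sensitive column for everyone outside an exemption group.
spark.sql("""
  CREATE OR REPLACE FUNCTION demo.policies.mask_ssn(ssn STRING)
  RETURNS STRING
  RETURN CASE WHEN is_account_group_member('exempt') THEN ssn ELSE sha2(ssn, 256) END
""")
spark.sql("ALTER TABLE demo.sales.orders ALTER COLUMN ssn SET MASK demo.policies.mask_ssn")
```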
The Unity Catalog integration supports the following policy types:
Conditional masking
Constant
Custom masking
Hashing
Null (including on ARRAY, MAP, and STRUCT type columns)
Regex: You must use the global regex flag (g) when creating a regex masking policy in this integration, and you cannot use the case insensitive regex flag (i). See the limitations section for examples.
Rounding (date and numeric rounding)
Matching (only show rows where)
Custom WHERE
Never
Where user
Where value in column
Minimization
Time-based restrictions
Project-scoped purpose exceptions for Databricks Unity Catalog integrations allow you to apply purpose-based policies to Databricks data sources in a project. As a result, users can only access that data when they are working within that specific project.
If you are using views in Databricks Unity Catalog, one of the following must be true for project-scoped purpose exceptions to apply to the views in Databricks:
The view and underlying table are registered as Immuta data sources and added to a project: If a view and its underlying table are both added as Immuta data sources, both of these assets must be added to the project for the project-scoped purpose exception to apply. If a view and underlying table are both added as data sources but the table is not added to an Immuta project, the purpose exception will not apply to the view because Databricks does not support fine-grained access controls on views.
Only the underlying table is registered as an Immuta data source and added to a project: If only the underlying table is registered as an Immuta data source but the view is not registered, the purpose exception will apply to both the table and corresponding view in Databricks. Views are the only Databricks object that will have Immuta policies applied to them even if they're not registered as Immuta data sources (as long as their underlying tables are registered).
This feature allows masked columns to be joined across data sources that belong to the same project. When data sources do not belong to a project, Immuta uses a unique salt per data source for hashing to prevent masked values from being joined. (See the Why use masked joins? guide for an explanation of that behavior.) However, once you add Databricks Unity Catalog data sources to a project and enable masked joins, Immuta uses a consistent salt across all the data sources in that project to allow the join.
For more information about masked joins and enabling them for your project, see the Masked joins section of documentation.
Some users may need to be exempt from masking and row-level policy enforcement. When you add user accounts to the configured exemption group in Databricks, Immuta will not enforce policies for those users. Exemption groups are created when the Unity Catalog integration is configured, and no policies will apply to these users' queries, regardless of the policies enforced on the tables they query.
The principal used to register a data source in Immuta is automatically added to this exemption group for that Databricks table. Consequently, the principals used to register data sources in Immuta should be limited to service accounts.
When enabling Unity Catalog support in Immuta, the catalog for all Databricks data sources will be updated to point at the default hive_metastore catalog. Internally, Databricks exposes this catalog as a proxy to the workspace-level Hive metastore that schemas and tables were kept in before Unity Catalog. Since this catalog is not a real Unity Catalog catalog, it does not support any Unity Catalog policies. Therefore, Immuta will ignore any data sources in the hive_metastore in any Databricks Unity Catalog integration, and policies will not be applied to tables there.
However, with Databricks metastore magic you can use hive_metastore and enforce subscription and data policies with the Databricks Spark integration.
The Databricks Unity Catalog integration supports the following authentication methods to configure the integration and create data sources:
Personal access token (PAT): This is the access token for the Immuta service principal. This service principal must have the metastore privileges listed in the permissions section for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
OAuth machine-to-machine (M2M): Immuta uses the Client Credentials Flow to integrate with Databricks OAuth machine-to-machine authentication, which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal’s identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the Databricks OAuth machine-to-machine (M2M) authentication page for more details.
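For reference, a hedged sketch of the client credentials flow described above; the token endpoint path and all-apis scope are drawn from Databricks OAuth M2M documentation, and the workspace URL and credentials are placeholders:

```python
# A hedged sketch of the OAuth M2M client credentials flow. The /oidc/v1/token
# path and "all-apis" scope reflect Databricks OAuth M2M documentation; the
# workspace URL, client ID, and secret are placeholders.
import requests

WORKSPACE = "https://my-workspace.cloud.databricks.com"
CLIENT_ID = "<service-principal-application-id>"
CLIENT_SECRET = "<oauth-client-secret>"

resp = requests.post(
    f"{WORKSPACE}/oidc/v1/token",
    auth=(CLIENT_ID, CLIENT_SECRET),
    data={"grant_type": "client_credentials", "scope": "all-apis"},
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # temporary token, expires after roughly one hour

# Subsequent REST calls authenticate with the temporary token.
headers = {"Authorization": f"Bearer {access_token}"}
```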
The status of the integration is visible on the integrations tab of the Immuta application settings page. If errors occur in the integration, a banner will appear in the Immuta UI with guidance for remediating the error.
The definitions for each status and the state of configured data platform integrations are available in the response schema of the integrations API. However, the UI consolidates these error statuses and provides detail in the error messages.
The Unity Catalog data object model introduces a 3-tiered namespace, as outlined above. Consequently, your Databricks tables registered as data sources in Immuta will reference the catalog, schema (also called a database), and table.
The supported object types for Databricks Unity Catalog are listed below. When applying read and write access policies to these data sources, the privileges granted by Immuta vary depending on the object type. See an outline of privileges granted by Immuta on the Subscription policy access types page.
Table
View
Materialized view
Streaming table
External table
Foreign table
External data connectors and query-federated tables are preview features in Databricks. See the Databricks documentation for details about the support and limitations of these features before registering them as data sources in the Unity Catalog integration.
The Databricks Unity Catalog integration audits user queries run in clusters or SQL warehouses configured with the integration. Audit ingest is configured when setting up the integration, and the audit logs can be scoped to ingest only specific workspaces if needed.
See the Unity Catalog audit page for details about manually prompting ingest of audit logs and the contents of the logs.
You can enable tag ingestion to allow Immuta to ingest Databricks Unity Catalog table and column tags so that you can use them in Immuta policies to enforce access controls. When you enable this feature, Immuta uses the credentials and connection information from the Databricks Unity Catalog integration to pull tags from Databricks and apply them to data sources as they are registered in Immuta. If Databricks data sources were registered before tag ingestion was enabled, those data sources will automatically sync to the catalog and the tags will be applied. Immuta checks for changes to tags in Databricks and syncs Immuta data sources to those changes every 24 hours.
Once external tags are applied to Databricks data sources, those tags can be used to create subscription and data policies.
To enable Databricks Unity Catalog tag ingestion, see the Configure a Databricks Unity Catalog integration page.
After making changes to tags in Databricks, you can manually sync the catalog so that the changes immediately apply to the data sources in Immuta. Otherwise, tag changes will automatically sync within 24 hours.
When syncing data sources to Databricks Unity Catalog tags, Immuta pulls the following information:
Table tags: These tags apply to the table and appear on the data source details tab. Databricks tags' key and value pairs are reflected in Immuta as a hierarchy with each level separated by a . delimiter. For example, the Databricks Unity Catalog tag Location: US would be represented as Location.US in Immuta.
Column tags: These tags are applied to data source columns and appear on the columns listed in the data dictionary tab. Databricks tags' key and value pairs are reflected in Immuta as a hierarchy with each level separated by a . delimiter. For example, the Databricks Unity Catalog tag Location: US would be represented as Location.US in Immuta.
Table comments field: This content appears as the data source description on the data source details tab.
Column comments field: This content appears as dictionary column descriptions on the data dictionary tab.
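If you need to reason about how Databricks tags will appear in Immuta, a minimal sketch of the key/value-to-hierarchy mapping described above; the helper function is ours, not an Immuta or Databricks API:

```python
# A minimal sketch of the Databricks-to-Immuta tag mapping described above.
# to_immuta_tag is a hypothetical helper, not an Immuta or Databricks API.
from typing import Optional

def to_immuta_tag(key: str, value: Optional[str] = None) -> str:
    """Represent a Databricks Unity Catalog tag as an Immuta hierarchical tag."""
    return f"{key}.{value}" if value else key

print(to_immuta_tag("Location", "US"))  # -> Location.US
```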
Only tags that apply to Databricks data sources in Immuta are available to build policies in Immuta. Immuta will not pull tags in from Databricks Unity Catalog unless those tags apply to registered data sources.
Cost implications: Tag ingestion in Databricks Unity Catalog requires compute resources. Therefore, having many Databricks data sources or frequently manually syncing data sources to Databricks Unity Catalog may incur additional costs.
Databricks Unity Catalog tag ingestion only supports tenants with fewer than 2,500 data sources registered.
See the Enable Unity Catalog guide for a list of requirements.
Row access policies with more than 1023 columns are unsupported. This is an underlying limitation of UDFs in Databricks. Immuta will only create row access policies with the minimum number of referenced columns. This limit will therefore apply to the number of columns referenced in the policy and not the total number in the table.
If you disable table grants, Immuta revokes the grants. Therefore, if users had access to a table before enabling Immuta, they’ll lose access.
You must use the global regex flag (g) when creating a regex masking policy in this integration, and you cannot use the case insensitive regex flag (i). See the examples below for guidance:
regex with a global flag (supported): /^ssn|social ?security$/g
regex without a global flag (unsupported): /^ssn|social ?security$/
regex with a case insensitive flag (unsupported): /^ssn|social ?security$/gi
regex without a case insensitive flag (supported): /^ssn|social ?security$/g
If a registered data source is owned by a Databricks group at the table level, then the Unity Catalog integration cannot apply data masking policies to that table in Unity Catalog.
Therefore, set all table-level ownership on your Unity Catalog data sources to an individual user or service principal instead of a Databricks group. Catalogs and schemas can still be owned by a Databricks group, as ownership at that level doesn't interfere with the integration.
The following features are currently unsupported:
Databricks change data feed support
Immuta projects
Multiple IAMs on a single cluster
Column masking policies on views
Mixing masking policies on the same column
Row-redaction policies on views
R and Scala cluster support
Scratch paths
User impersonation
Policy enforcement on raw Spark reads
Python UDFs for advanced masking functions
Direct file-to-SQL reads
Data policies (except for masking with NULL) on ARRAY, MAP, or STRUCT type columns
Shallow clones
Snippets for Databricks data sources may be empty in the Immuta UI.
This integration enforces policies on Databricks securables registered in the legacy Hive metastore. Once these securables are registered as Immuta data sources, users can query policy-enforced data on Databricks clusters.
The guides in this section outline how to integrate Databricks Spark with Immuta.
This getting started guide outlines how to integrate Databricks with Immuta.
Configure a Databricks Spark integration: Configure the Databricks Spark integration.
Manually update your Databricks cluster: Manually update your cluster to reflect changes in the Immuta init script or cluster policies.
Install a trusted library: Register a Databricks library with Immuta as a trusted library to avoid Immuta security manager errors when using third-party libraries.
Project UDFs cache settings: Raise the caching on-cluster and lower the cache timeouts for the Immuta web service to allow use of project UDFs in Spark jobs.
Run R and Scala spark-submit jobs on Databricks: Run R and Scala spark-submit jobs on your Databricks cluster.
DBFS access: Access DBFS in Databricks for non-sensitive data.
Troubleshooting: Resolve errors in the Databricks Spark configuration.
Databricks Spark integration configuration: This guide describes the design and components of the integration.
Security and compliance: This guide provides an overview of the Immuta features that provide security for your users and Databricks clusters and that allow you to prove compliance and monitor for anomalies.
Registering and protecting data: This guide provides an overview of registering Databricks securables and protecting them with Immuta policies.
Accessing data: This guide provides an overview of how Databricks users access data registered in Immuta.
The how-to guides linked on this page illustrate how to integrate Databricks Spark with Immuta.
Requirements
If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster.
If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:
Navigate to the App Settings page and click Integration Settings.
Uncheck the Enable Unity Catalog checkbox.
Click Save.
These guides provide instructions for getting your data set up in Immuta.
Organize your data sources into domains and assign domain permissions to accountable teams (recommended): Use domains to segment your data and assign responsibilities to the appropriate team members. These domains will then be used in policies, audit, and sensitive data discovery.
These guides provide instructions on setting up your users in Immuta.
Integrate an IAM with Immuta: Connect the IAM your organization already uses and allow Immuta to register your users for you.
Map external user IDs from Databricks to Immuta: Ensure the user IDs in Immuta, Databricks, and your IAM are aligned so that the right policies impact the right users.
These guides provide instructions on getting your data metadata set up in Immuta for use in policies.
Connect an external catalog: Connect the external catalog your organization already uses and allow Immuta to continually sync your tags with your data sources for you.
Run sensitive data discovery: Sensitive data discovery (SDD) allows you to automate data tagging using identifiers that detect certain data patterns.
These guides provide instructions on authoring policies and auditing data access.
Author a global subscription policy: Once you add your data metadata to Immuta, you can immediately create policies that utilize your tags and apply to your tables. Subscription policies can be created to dictate access to data sources.
Author a global data policy: Data metadata can also be used to create data policies that apply to data sources as they are registered in Immuta. Data policies dictate what data a user can see once they are granted access to a data source. Using catalog and SDD tags you can create proactive policies, knowing that they will apply to data sources as they are added to Immuta with the automated tagging.
Configure audit: Once you have your data sources and users, and policies granting them access, you can set up audit export. This will export the audit logs from user queries, policy changes, and tagging updates.
APPLICATION_ADMIN Immuta permission
CAN MANAGE Databricks privilege on the cluster
A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)
A cluster that uses one of these supported Databricks Runtimes:
9.1 LTS
10.4 LTS
11.3 LTS
14.3 (private preview)
Supported languages
Python
R (not supported for Databricks Runtime 14.3)
Scala (not supported for Databricks Runtime 14.3)
SQL
A Databricks cluster that is one of these supported compute types:
Custom access mode
A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service also must be able to connect to and perform queries on the Databricks cluster, and to call Databricks workspace APIs.
The Databricks Spark integration only works with Spark 3.
Enable OAuth M2M authentication (recommended) or personal access tokens.
Disable Photon by setting runtime_engine to STANDARD using the Clusters API. Immuta does not support clusters with Photon enabled. Photon is enabled by default on compute running Databricks Runtime 9.1 LTS or newer and must be manually disabled before setting up the integration with Immuta (see the example sketch after these requirements).
Restrict the set of Databricks principals who have CAN MANAGE privileges on Databricks clusters where the Spark plugin is installed. This is to prevent editing environment variables or Spark configuration, editing cluster policies, or removing the Spark plugin from the cluster, all of which would cause the Spark plugin to stop working.
If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster. See the configure cluster policies section below for guidance.
If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:
Navigate to the App Settings page and click Integration Settings.
Uncheck the Enable Unity Catalog checkbox.
Click Save.
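Following up on the Photon requirement above, a hedged sketch of disabling Photon on an existing cluster with the Clusters API; the host, token, and cluster ID are placeholders, and clusters/edit expects the full cluster specification:

```python
# A hedged sketch of disabling Photon with the Databricks Clusters API. The
# clusters/edit endpoint expects the full cluster specification, so this sketch
# re-submits the current spec with runtime_engine forced to STANDARD; the host,
# token, and cluster ID are placeholders.
import requests

HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = "<databricks-admin-token>"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current cluster specification.
spec = requests.get(
    f"{HOST}/api/2.0/clusters/get",
    headers=headers,
    params={"cluster_id": "0123-456789-abcdefgh"},
).json()

# Keep the core fields clusters/edit requires and turn Photon off.
edit_payload = {
    "cluster_id": spec["cluster_id"],
    "cluster_name": spec["cluster_name"],
    "spark_version": spec["spark_version"],
    "node_type_id": spec["node_type_id"],
    "num_workers": spec.get("num_workers", 0),  # adjust if your cluster uses autoscale
    "runtime_engine": "STANDARD",
}
requests.post(f"{HOST}/api/2.0/clusters/edit", headers=headers, json=edit_payload).raise_for_status()
```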
Click the App Settings icon in Immuta.
Navigate to HDFS > System API Key and click Generate Key.
Click Save and then Confirm. If you do not save and confirm, the system API key will not be saved.
Scroll to the Integration Settings section.
Click + Add Native Integration and select Databricks Spark Integration from the dropdown menu.
Complete the Hostname field.
Enter a Unique ID for the integration. The unique ID is used to name cluster policies clearly, which is important when managing several Databricks Spark integrations. As cluster policies are workspace-scoped, but multiple integrations might be made in one workspace, this ID lets you distinguish between different sets of cluster policies.
Select the identity manager that should be used when mapping the current Spark user to their corresponding identity in Immuta from the Immuta IAM dropdown menu. This should be set to reflect the identity manager you use in Immuta (such as Entra ID or Okta).
Choose an Access Model. The Protected until made available by policy option disallows reading and writing tables not protected by Immuta, whereas the Available until protected by policy option allows it.
Select the Storage Access Type from the dropdown menu.
Opt to add any Additional Hadoop Configuration Files.
Click Add Native Integration, and then click Save and Confirm. This will restart the application and save your Databricks Spark integration. (It is normal for this restart to take some time.)
The Databricks Spark integration will not do anything until your cluster policies are configured, so even though your integration is saved, continue to the next section to configure your cluster policies so the Spark plugin can manage authorization on the Databricks cluster.
Click Configure Cluster Policies.
Select one or more cluster policies in the matrix. Clusters running Immuta with Databricks Runtime 14.3 can only use Python and SQL. You can make changes to the policy by clicking Additional Policy Changes and editing the environment variables in the text field or by downloading it. See the Spark environment variables reference guide for information about each variable and its default value. Some common settings are linked below:
Select your Databricks Runtime.
Use one of the two installation types described below to apply the policies to your cluster:
Automatically push cluster policies: This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.
Select the Automatically Push Cluster Policies radio button.
Enter your Admin Token. This token must be for a user who has the required Databricks privilege. This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.
Click Apply Policies.
Manually push cluster policies: Enabling this option allows you to manually push the cluster policies and the init script to the configured Databricks workspace.
Select the Manually Push Cluster Policies radio button.
Click Download Init Script and set the Immuta plugin init script as a cluster-scoped init script in Databricks by following the Databricks documentation.
Click Download Policies, and then manually add this cluster policy to your Databricks workspace.
Ensure that the init_scripts.0.workspace.destination in the policy matches the file path to the init script you configured above.
The Immuta cluster policy references Databricks secrets for several of the sensitive fields. These secrets must be created manually if the cluster policy is not automatically pushed. Use the Databricks API or CLI to push the proper secrets (see the example sketch after this procedure).
Click Close, and then click Save and Confirm.
Apply the cluster policy generated by Immuta to the cluster with the Spark plugin installed by following the Databricks documentation.
Give users the Can Attach To permission on the cluster.
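Following up on the secrets step in the manual installation above, a hedged sketch of pushing secrets with the Databricks Secrets API; the scope and key names are placeholders, so use the names referenced in your downloaded cluster policy JSON:

```python
# A hedged sketch of creating the Databricks secrets that a manually pushed Immuta
# cluster policy references. The scope and key names are placeholders; use the
# names that appear in your downloaded cluster policy JSON. Host and token are
# placeholders as well.
import requests

HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = "<databricks-admin-token>"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Create a secret scope (fails if it already exists).
requests.post(
    f"{HOST}/api/2.0/secrets/scopes/create",
    headers=headers,
    json={"scope": "immuta"},
).raise_for_status()

# Store one of the sensitive values the cluster policy references.
requests.post(
    f"{HOST}/api/2.0/secrets/put",
    headers=headers,
    json={"scope": "immuta", "key": "immuta-api-key", "string_value": "<value-from-immuta>"},
).raise_for_status()
```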
If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.
Log in to Immuta as an Application Admin.
Click the App Settings icon in the navigation menu and scroll to the Integration Settings section.
Your existing Databricks Spark integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.
Click Add Integration and select Databricks Integration to add a new integration.
Enter your Databricks Spark integration settings again as configured previously.
Click Add Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.
Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.
Automatically push cluster policies and the init script (recommended) or manually update your cluster policies.
Automatically push cluster policies
Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.
Select Apply Policies to push the cluster policies and init script again.
Click Save and Confirm to deploy your changes.
Manually update cluster policies
Download the init script and the new cluster policies to your local computer.
Click Save and Confirm to save your changes in Immuta.
Log in to your Databricks workspace with your administrator account to set up cluster policies.
Get the path you will upload the init script (immuta_cluster_init_script_proxy.sh) to by opening one of the cluster policy .json files and looking for the defaultValue of the field init_scripts.0.dbfs.destination. This should be a DBFS path in the form of dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh.
Click Data in the left pane to upload your init script to DBFS at the path you found above (or push it with the DBFS API; see the example sketch after this procedure).
To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.
Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.
Restart any Databricks clusters using these updated policies for the changes to take effect.
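As referenced in the procedure above, a hedged sketch of uploading the init script with the DBFS API instead of the workspace UI; the host, token, and destination path are placeholders, so use the defaultValue from your cluster policy JSON:

```python
# A hedged sketch of uploading the downloaded init script to the DBFS path found
# in the cluster policy via the DBFS API. Host, token, and destination path are
# placeholders; the DBFS API takes the path without the "dbfs:" prefix.
import base64
import requests

HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = "<databricks-admin-token>"
headers = {"Authorization": f"Bearer {TOKEN}"}

with open("immuta_cluster_init_script_proxy.sh", "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")

requests.post(
    f"{HOST}/api/2.0/dbfs/put",
    headers=headers,
    json={
        "path": "/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh",
        "contents": contents,
        "overwrite": True,
    },
).raise_for_status()
```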
In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks Libraries API.
In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI.
Once you've finished making your changes, restart the cluster.
Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:
This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the Use Project UDFs (Databricks) page.
Lower the web service cache timeout in Immuta:
Click the App Settings icon and scroll to the HDFS Cache Settings section.
Lower the Cache TTL of HDFS user names (ms) to 0.
Click Save.
Raise the cache timeout on your Databricks cluster: In the Spark environment variables section, set IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS to high values (like 10000).
Note: These caches will be invalidated on the cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on the cluster to avoid periodically reaching out to the web service.
This page provides guidelines for troubleshooting issues with the Databricks Spark integration and resolving Py4J security and Databricks trusted library errors.
For easier debugging of the Databricks Spark integration, follow the recommendations below.
Enable cluster init script logging:
In the cluster page in Databricks for the target cluster, navigate to Advanced Options -> Logging.
Change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.
View the Spark UI on your target Databricks cluster: On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
The validation and debugging notebook is designed to be used by or under the guidance of an Immuta support professional. Reach out to your Immuta representative for assistance.
Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
Click the arrow next to your name and select Import.
Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.
Error Message: py4j.security.Py4JSecurityException: Constructor <> is not allowlisted
Explanation: This error indicates you are being blocked by Py4J security rather than the Immuta Security Manager. Py4J security is strict and generally ends up blocking many ML libraries.
Solution: Turn off Py4J security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4J security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.
If a trusted library installation fails, check the driver logs for details. Some possible causes of failure include
One of the Immuta-configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.
For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version.
Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.
The Databricks Spark integration is one of the two integrations Immuta offers for Databricks.
In this integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.
The reference guides in this section are written for Databricks administrators who are responsible for setting up the integration, securing Databricks clusters, and setting up users:
Installation and compliance: This guide includes information about what Immuta creates in your Databricks environment and securing your Databricks clusters.
Customizing the integration: Consult this guide for information about customizing the Databricks Spark integration settings.
Setting up users: Consult this guide for information about connecting data users and setting up user impersonation.
Spark environment variables: This guide provides a list of Spark environment variables used to configure the integration.
Ephemeral overrides: This guide describes ephemeral overrides and how to configure them to reduce the risk that a user has overrides set to a cluster (or multiple clusters) that aren't currently up.
You can customize the Databricks Spark integration settings using these components Immuta provides:
In some cases it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration to allow Spark to read data.
For example, when accessing external tables stored in Azure Data Lake Gen2, Spark must have credentials to access the target containers or filesystems in Azure Data Lake Gen2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access Azure Data Lake Gen2.
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.
Databricks non-privileged users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The limited enforcement scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
Protected until made available by policy: This setting means all tables are hidden until a user is granted access through an Immuta policy. This is how most databases work and assumes least privileged access; under this setting, you will have to register all of your tables with Immuta for users to access them.
Available until protected by policy: This setting means all tables are open until explicitly registered and protected by Immuta. This makes sense if most of your tables are non-sensitive and you can pick and choose which to protect. This setting allows both non-Immuta reads and non-Immuta writes:
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta for all applicable data sources to use the current cluster as compute for all subsequent metadata operations for that user against the applicable data sources.
Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; however, most often this is an Immuta purpose for which the data has been approved to be used and will restrict access to data and streamline team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.
When a user is working within the context of a project, data users will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose. Consider adjusting the following project settings to suit your organization's needs:
This section describes how Immuta interacts with common Databricks features.
The CDF can be read if the querying user is allowed to read the raw data and ONE of the following statements is true:
the table is in the current workspace
the table is in a scratch path
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting
non-Immuta reads are enabled AND the table is not part of an Immuta data source
Security vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine whether the risk applies to your environment or use case.
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. An administrator can specify an installed library as trusted, which will enable that library's code to bypass the Immuta security manager. This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through that otherwise would have been blocked by the security manager.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
Library Source is Upload, DBFS, or DBFS/S3 and the Library Type is Jar.
Library Source is Maven.
Limitations
Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
waiting for library installation to complete before running any third-party library commands or
executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util library, the user will not be able to directly use the file-util library to read a sensitive file that is normally protected by the Immuta security manager.
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
For example, if slf4j were the transitive dependency that needed to be trusted, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.
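As a concrete illustration of this workaround, a hedged sketch of the resulting Spark environment variable value; the Maven coordinates are hypothetical, and the comma separator between URIs is an assumption to confirm against the Spark environment variables reference guide:

```python
# A hedged sketch of the Spark environment variable that trusts both a Maven
# artifact and the DBFS jar of one of its transitive dependencies. The artifact
# coordinates are hypothetical, and the comma separator between multiple URIs is
# an assumption; confirm the format in the Spark environment variables guide.
trusted_lib_env = {
    "IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS": ",".join([
        "maven:/com.example:my-ml-library:1.2.3",                     # the trusted third-party library
        "dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar",  # its transitive dependency, trusted explicitly
    ])
}
```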
Local mode: The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.
Remote mode: Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.
Scratch paths are cluster-specific remote file paths that Databricks users are allowed to directly read from and write to without restriction. The creator of a Databricks cluster specifies the set of remote file paths that are designated as scratch paths on that cluster when they configure a Databricks cluster. Scratch paths are useful for scenarios where non-sensitive data needs to be written out to a specific location using a Databricks cluster protected by Immuta.
In the Databricks Spark integration, Immuta installs an Immuta-maintained Spark plugin on your Databricks cluster. When a user queries data that has been registered in Immuta as a data source, the plugin injects policy logic into the plan Spark builds so that the results returned to the user only include data that specific user should see.
The sequence diagram below breaks down this process of events when an Immuta user queries data in Databricks.
When data owners register Databricks securables in Immuta, the securable metadata is registered and Immuta creates a corresponding data source for each securable. The data source metadata is stored in the Immuta Metadata Database so that it can be referenced in policy definitions.
The image below illustrates what happens when a data owner registers the Accounts, Claims, and Customers securables in Immuta.
Users who are subscribed to the data source in Immuta can then query the corresponding securable directly in their Databricks notebook or workspace.
When schema monitoring is enabled, Immuta monitors your servers to detect when new tables or columns are created or deleted, and automatically registers (or disables) those tables in Immuta. These newly updated data sources will then have any global policies and tags that are set in Immuta applied to them. The Immuta data dictionary will be updated with any column changes, and the Immuta environment will be in sync with your data environment.
For Databricks Spark, the automatic schema monitoring job is disabled because of the ephemeral nature of Databricks clusters. In this case, Immuta requires you to download a schema detection job template (a Python script) and import that into your Databricks workspace.
See the Databricks data source guide for instructions on enabling schema monitoring.
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta. Consequently, subsequent metadata operations for that user against the applicable data sources use the current cluster as compute.
See the Ephemeral overrides page for more details about ephemeral overrides and how to configure or disable them.
The Spark plugin has the capability to send ephemeral override requests to Immuta. These requests are distinct from ephemeral overrides themselves. Ephemeral overrides cannot be turned off, but the Spark plugin can be configured to not send ephemeral override requests.
Tags can be used in Immuta in a variety of ways:
Use tags for global subscription or data policies that will apply to all data sources in the organization. In doing this, company-wide data security restrictions can be controlled by the administrators and governors, while the users and data owners need only to worry about tagging the data correctly.
Generate Immuta reports from tags for insider threat surveillance or data access monitoring.
Filter search results with tags in the Immuta UI.
The Databricks Spark integration cannot ingest tags from Databricks, but you can connect any of these supported external catalogs to work with your integration.
You can also manage tags in Immuta by manually adding tags to your data sources and columns. Alternatively, you can use sensitive data discovery (SDD) to automatically tag your sensitive data.
Immuta allows you to author subscription and data policies to automate access controls on your Databricks data.
Subscription policies: After registering data sources in Immuta, you can control who has access to specific securables in Databricks through Immuta subscription policies or by manually adding users to the data source. Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources. See the Subscription policy access types page for a list of supported policy types.
Data policies: You can create data policies to apply fine-grained access controls (such as restricting rows or masking columns) to manage what users can see in each table after they are subscribed to a data source. See the Data policy types page for details about specific types of data policies supported.
The image below illustrates how Immuta enforces a subscription policy that only allows users in the Analysts group to access yellow-table.
See the Automate data access control decisions page for details about the benefits of using Immuta subscription and data policies.
Once a Databricks user who is subscribed to the data source in Immuta queries the corresponding securable directly in their workspace, Spark Analysis initiates and the following events take place:
Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the Security Manager that it is a trusted node and is allowed to scan raw data.
The Physical Plan is applied and filters out and transforms raw data coming back to the user.
The user sees policy-enforced data.
The image below illustrates what happens when an Immuta user who is subscribed to the Customers data source queries the securable in Databricks.
Regardless of the policies on the data source, users will be able to read raw data on the cluster if they meet one of the criteria listed below:
Their Databricks administrator account is tied to an Immuta account.
They are listed as an ignored user. (Users can be specified in the IMMUTA_SPARK_ACL_ALLOWLIST Spark environment variable to become ignored users.)
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.
Databricks non-admin users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. To address this challenge, Immuta allows administrators to change this default setting when configuring the integration so that Immuta users can access securables that are not registered as a data source. Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
See the Customizing the integration guide for details about this setting.
Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; however, most often this is an Immuta purpose for which the data has been approved to be used and will restrict access to data and streamline team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.
When a user is working within the context of a project, they will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose.
When users change project contexts (either through the Immuta UI or with project UDFs), queries reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.
See the Customizing the integration page for details about how to prevent users from switching project contexts in a session.
Users can have additional write access in their integration using project workspaces. Users can integrate a single or multiple workspaces with a single Immuta tenant.
See the Writing to projects page for more details.