Immuta manages access to Snowflake tables by administering Snowflake row access policies and column masking policies on those tables, allowing users to query tables directly in Snowflake while dynamic policies are enforced.
This getting started guide outlines how to integrate your Snowflake account with Immuta.
Configure a Snowflake integration: Configure the Snowflake integration.
Edit or remove an existing integration: Manage integration settings or delete your existing Snowflake integration.
Integration settings:
Enable Snowflake table grants: Enable Snowflake table grants and configure the Snowflake role prefix.
Use Snowflake data sharing with Immuta: Use Snowflake data sharing with table grants or project workspaces.
Snowflake low row access policy mode: Enable Snowflake low row access policy mode.
Snowflake lineage tag propagation: Configure your Snowflake integration to automatically apply tags added to a Snowflake table to its descendant data source columns in Immuta.
Phased Snowflake onboarding approach: A phased onboarding approach to configuring the Snowflake integration ensures that your users will not be immediately affected by changes as you add data sources and policies. This guide describes the settings and requirements for implementing this phased approach.
Snowflake integration reference guide: This reference guide describes the design and features of the Snowflake integration.
Integration health statuses: This reference guide provides descriptions of the possible statuses of a configured integration.
Snowflake table grants: Snowflake table grants simplifies the management of privileges in Snowflake when using Immuta. Instead of manually granting users access to tables registered in Immuta, you allow Immuta to manage privileges on your Snowflake tables and views according to subscription policies. This guide describes the components of Snowflake table grants and how they are used in Immuta's Snowflake integration.
Snowflake data sharing with Immuta: Organizations can share the policy-protected data of their Snowflake database with other Snowflake accounts with Immuta policies enforced in real time. This guide describes the components of using Immuta with Snowflake data shares.
Snowflake low row access policy mode: The Snowflake low row access policy mode improves query performance in Immuta's Snowflake integration. To do so, this mode decreases the number of Snowflake row access policies Immuta creates and uses table grants to manage user access. This guide describes the design and requirements of this mode.
Snowflake lineage tag propagation: Snowflake column lineage specifies how data flows from source tables or columns to the target tables in write operations. When Snowflake lineage tag propagation is enabled in Immuta, Immuta automatically applies tags added to a Snowflake table to its descendant data source columns in Immuta so you can build policies using those tags to restrict access to sensitive data.
Warehouse sizing recommendations: Adjust the size and scale of clusters for your warehouse to manage workloads so that you use Snowflake compute resources as cost effectively as possible.
Immuta integrates with your data platforms and external catalogs so you can register your data and effectively manage access controls on that data.
This section includes concept, reference, and how-to guides for configuring your data platform integration, registering data sources, and connecting your external catalog so that you can discover, monitor, and protect sensitive data using Immuta's flagship modules: Discover, Detect, and Secure.
This reference guide outlines the features, policies, and audit capabilities supported by each integration.
This section includes how-to and reference guides for Snowflake and how it integrates with Immuta.
This section includes how-to and reference guides for Databricks Unity Catalog and how it integrates with Immuta.
This section includes how-to and reference guides for Databricks Spark and how it integrates with Immuta.
This section includes how-to and reference guides for Starburst (Trino) and how it integrates with Immuta.
This section includes how-to and reference guides for Redshift and how it integrates with Immuta.
This section includes how-to and reference guides for Azure Synapse Analytics and how it integrates with Immuta.
This page includes how-to and reference content for Amazon S3 and how it integrates with Immuta.
This page includes how-to and reference content for Google BigQuery and how it integrates with Immuta.
This section covers the various data catalogs Immuta integrates with.
This reference guide outlines the actions and features that trigger Immuta queries in your remote platform that may incur cost.
Navigate to the App Settings page.
Scroll to the Global Integrations Settings section.
Ensure the Snowflake Table Grants checkbox is checked. It is enabled by default.
Opt to change the Role Prefix. Snowflake table grants creates a new Snowflake role for each Immuta user. To ensure these Snowflake role names do not collide with existing Snowflake roles, each Snowflake role created for Snowflake table grants requires a common prefix. When using multiple Immuta accounts within a single Snowflake account, the Snowflake table grants role prefix should be unique for each Immuta account. The prefix must adhere to Snowflake identifier requirements and be less than 50 characters. Once the configuration is saved, the prefix cannot be modified; however, the Snowflake table grants feature can be disabled and re-enabled to change the prefix.
Finish configuring your integration by following one of these guidelines:
New Snowflake integration: Set up a new Snowflake integration by following the configuration tutorial.
Existing Snowflake integration (automatic setup): You will be prompted to enter connection information for a Snowflake user. Immuta will execute the migration to Snowflake table grants using a connection established with this Snowflake user. The Snowflake user you provide here must have Snowflake privileges to run these privilege grants.
Existing Snowflake integration (manual setup): Immuta will display a link to a migration script you must run in Snowflake and a link to a rollback script for use in the event of a failed migration. Important: Execute the migration script in Snowflake before clicking Save on the app settings page.
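The Role Prefix constraints described in the steps above can be checked before you save the configuration. The following is a minimal sketch, not part of Immuta: the function name is hypothetical, and it assumes the prefix is written as an unquoted Snowflake identifier (starts with a letter or underscore and contains only letters, digits, underscores, and dollar signs).

```python
import re

# Assumption: the prefix is an unquoted Snowflake identifier, which starts
# with a letter or underscore and contains only letters, digits,
# underscores, and dollar signs.
_UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")

# "Less than 50 characters" means at most 49.
MAX_PREFIX_LENGTH = 49


def is_valid_role_prefix(prefix: str) -> bool:
    """Return True if prefix looks usable as a table grants role prefix."""
    if not 0 < len(prefix) <= MAX_PREFIX_LENGTH:
        return False
    return _UNQUOTED_IDENTIFIER.fullmatch(prefix) is not None
```

Because the prefix cannot be modified after the configuration is saved, a check like this is cheapest to run before clicking Save.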
Snowflake table grants private preview migration
To migrate from the private preview version of Snowflake table grants (available before September 2022) to the generally available version of Snowflake table grants, follow the steps in the migration guide.
To edit or remove a Snowflake integration, you have two options:
Automatic: Grant Immuta one-time use of credentials to automatically edit or remove the integration.
The credentials provided must have the following permissions:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
Manual: Run the Immuta script in your Snowflake environment yourself to edit or remove the integration.
The specified role used to run the bootstrap needs to have the following privileges:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
APPLY MASKING POLICY ON ACCOUNT WITH GRANT OPTION
APPLY ROW ACCESS POLICY ON ACCOUNT WITH GRANT OPTION
Select one of the following options for editing your integration:
Automatic: Grant Immuta one-time use of credentials to automatically edit the integration.
Manual: Run the Immuta script in your Snowflake environment yourself to edit the integration.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Edit the field you want to change or check the checkbox of a feature you would like to enable. Note that any shadowed field is not editable; to change it, you must disable and re-install the integration.
From the Select Authentication Method Dropdown, select either Username and Password or Key Pair Authentication:
Username and Password option: Complete the Username, Password, and Role fields.
Key Pair Authentication option:
Complete the Username field.
Click Key Pair (Required), and upload a Snowflake key pair file.
Complete the Role field.
Click Save.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Edit the field you want to change or check the checkbox of a feature you would like to enable. Note that any shadowed field is not editable; to change it, you must disable and re-install the integration.
Click edit script to download the script, and then run it in Snowflake.
Click Save.
Select one of the following options for deleting your integration:
Automatic: Grant Immuta one-time use of credentials to automatically remove the integration and Immuta-managed resources from your Snowflake environment.
Manual: Run the Immuta script in your Snowflake environment yourself to remove Immuta-managed resources and policies from Snowflake.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Click the checkbox to disable the integration.
Enter the Username, Password, and Role that were entered when the integration was configured.
Click Save.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Click the checkbox to disable the integration.
Click cleanup script to download the script.
Click Save.
Run the cleanup script in Snowflake.
Your guide to discovering, securing, and monitoring your data with Immuta.
Immuta does not require users to learn a new API or language to access protected data. Instead, Immuta integrates with existing tools and ongoing work while remaining invisible to downstream consumers.
The following data platforms integrate with Immuta:
Snowflake integration: With this integration, policies administered in Immuta are pushed down into Snowflake as Snowflake governance features (row access policies and masking policies).
Databricks:
Databricks Unity Catalog integration: This integration allows you to manage multiple Databricks workspaces through Unity Catalog while protecting your data with Immuta policies. Instead of manually creating UDFs or granting access to each table in Databricks, you can author your policies in Immuta and have Immuta manage and enforce Unity Catalog access-control policies on your data in Databricks clusters or SQL warehouse.
Databricks Spark integration: This integration enforces policies on Databricks tables registered as data sources in Immuta, allowing users to query policy-enforced data on Databricks clusters (including job clusters). Immuta policies are applied to the plan that Spark builds for users' queries, all executed directly against Databricks tables.
Google BigQuery: In this integration, Immuta generates policy-enforced views in your configured Google BigQuery dataset for tables registered as Immuta data sources.
Starburst (Trino) integration: The Starburst (Trino) integration allows you to access policy-protected data directly in your Starburst (Trino) catalogs without rewriting queries or changing your workflows. Immuta policies are translated into Starburst (Trino) rules and permissions and applied directly to tables within your existing catalogs.
Redshift integration: With the Redshift integration, Immuta applies policies directly in Redshift. This allows data analysts to query their data directly in Redshift instead of going through a proxy.
Azure Synapse Analytics integration: The Azure Synapse Analytics integration allows Immuta to apply policies directly in Azure Synapse Analytics dedicated SQL pools without needing users to go through a proxy. Instead, users can work within their existing Synapse Studio and have per-user policies dynamically applied at query time.
Amazon S3 integration: The Amazon S3 integration allows users to apply subscription policies to data in S3 to restrict what prefixes, buckets, or objects users can access. To enforce access controls on this data, Immuta creates S3 grants that are administered by S3 Access Grants, an AWS feature that defines access permissions to data in S3.
The table below outlines the features supported by each of Immuta's integrations.
Project workspaces | Tag ingestion | User impersonation | Query audit | Multiple integrations | |
---|---|---|---|---|---|
Certain policies are unsupported or supported with caveats*, depending on the integration:
*Supported with caveats:
On Databricks data sources, joins will not be allowed on data protected with replace with NULL or constant policies.
Databricks Unity Catalog ARRAY, MAP, or STRUCT type columns only support masking with NULL.
On Starburst data sources, the Immuta @iam function for WHERE clause policies can block the creation of views.
For details about each of these policies, see the Policies in Immuta page.
The table below outlines what information is included in the query audit logs for each integration where query audit is supported.
Legend:
Immuta is compatible with Snowflake Secure Data Sharing. Using both Immuta and Snowflake, organizations can share the policy-protected data of their Snowflake database with other Snowflake accounts with Immuta policies enforced in real time.
Prerequisites:
Required Permission: Immuta: GOVERNANCE
to fit your organization's compliance requirements.
It's important to understand that subscription policies are not relevant to Snowflake data shares, because the act of sharing the data is the subscription policy. Data policies can be enforced on the consuming account from the producer account on a share following these instructions.
Required Permission: Immuta: USER_ADMIN
To register the Snowflake data consumer in Immuta,
.
to match the account ID for the data consumer. This value is the output on the data consumer side when SELECT CURRENT_ACCOUNT() is run in Snowflake.
for your organization's policies.
.
Required Permission: Snowflake ACCOUNTADMIN
To share the policy-protected data source,
Grant reference usage on the Immuta database to the share you created:
GRANT REFERENCE_USAGE ON DATABASE <immuta_database> TO SHARE <data_share>;
Replace the content in angle brackets above with the name of your Immuta database and Snowflake data share.
This upgrade step is necessary if you meet both of the following criteria:
You have the Snowflake low row access policy mode enabled in private preview.
You have user impersonation enabled.
If you do not meet these criteria, follow the instructions on the .
To upgrade to the generally available version of the feature, on the app settings page and then re-enable it.
To migrate from the private preview version of table grants (available before September 2022) to the GA version, complete the steps below.
Navigate to the App Settings page.
Scroll to the Global Integrations Settings section.
Uncheck the Snowflake Table Grants checkbox to disable the feature.
Click Save, and then wait about 1 minute per 1,000 users. This gives Immuta time to drop all the previously created user roles.
Use the to re-enable the feature.
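The waiting time in the migration steps above scales with the number of users. As a rough sketch of that arithmetic (the function name is hypothetical, and the figure of about 1 minute per 1,000 users is only the approximate guidance given above):

```python
import math


def estimated_wait_minutes(user_count: int) -> int:
    """Estimate the wait after disabling table grants, rounded up to whole minutes."""
    if user_count < 0:
        raise ValueError("user_count must not be negative")
    # About 1 minute per 1,000 users, per the guidance above.
    return math.ceil(user_count / 1000)
```

For example, an account with 2,500 users would call for a wait of roughly 3 minutes before re-enabling the feature.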
The how-to guides linked on this page illustrate how to integrate Snowflake with Immuta.
Requirement: Snowflake Enterprise Edition
These guides provide information on the recommended features to enable with Snowflake.
with the following features enabled:
(enabled by default)
(enabled by default)
(enabled by default)
Select None as your .
.
.
These guides provide instructions for organizing your Snowflake data to align with your governance structure.
.
These guides provide instructions for discovering, classifying, and tagging your data.
Validate the policy. You do not have to validate every policy you create in Immuta; instead, examine a few to validate the behavior you expect to see:
Validate that the Immuta users impacted now have an Immuta role in Snowflake dedicated to them.
Validate that when acting under the Immuta role those users have access to the table(s) in question.
Validate that users without access in Immuta can still access the table with a different Snowflake role that has access.
they were not granted access by Immuta and
they have a role that provides them access, even if they are not currently acting under that role.
Validate that a user with a role that can access the table in question (whether it's an Immuta role or not) sees the impact of that data policy.
Once all Immuta policies are in place, remove or alter old roles.
If you have Snowflake low row access policy mode enabled in private preview and have impersonation enabled, see these . Otherwise, query performance will be negatively affected.
Click the App Settings icon in the sidebar and scroll to the Global Integration Settings section.
Click the Enable Snowflake Low Row Access Policy Mode checkbox to enable the feature.
Confirm to allow Immuta to automatically disable impersonation for the Snowflake integration. If you do not confirm, you will not be able to enable Snowflake low row access policy mode.
Click Save.
If you already have a Snowflake integration configured, you don't need to reconfigure your integration. Your Snowflake policies automatically refresh when you enable Snowflake low row access policy mode.
. Note that you will not be able to enable project workspaces or user impersonation with Snowflake low row access policy mode enabled.
Click Save and Confirm your changes.
Private preview: This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Contact your Immuta representative to enable this feature in your Immuta tenant.
Navigate to the App Settings page and click the Integrations tab.
Click +Add Native Integration and select Snowflake from the dropdown menu.
Complete the Host, Port, and Default Warehouse fields.
Enable Native Query Audit.
Enable Native Lineage and complete the following fields:
Ingest Batch Sizes: This setting configures the number of rows Immuta ingests per batch when streaming Access History data from your Snowflake instance.
Table Filter: This filter determines which tables Immuta will ingest lineage for. Enter a regular expression, without the / at the beginning and end, to filter tables. Without this filter, Immuta will attempt to ingest lineage for every table on your Snowflake instance.
Tag Filter: This filter determines which tags to propagate using lineage. Enter a regular expression, without the / at the beginning and end, to filter tags. Without this filter, Immuta will ingest lineage for every tag on your Snowflake instance.
Select Manual or Automatic Setup and
.
The Snowflake lineage sync endpoint triggers the native lineage ingestion job that allows Immuta to propagate Snowflake tags added through lineage to Immuta data sources.
Copy the example and replace the Immuta URL and API key with your own.
Change the payload attribute values to your own, where
tableFilter (string): This regular expression determines which tables Immuta will ingest lineage for. Enter a regular expression that excludes / from the beginning and end to filter tables. Without this filter, Immuta will attempt to ingest lineage for every table on your Snowflake instance.
batchSize (integer): This parameter configures the number of rows Immuta ingests per batch when streaming Access History data from your Snowflake instance. Minimum 1.
lastTimestamp (string): Setting this parameter will only return lineage events later than the value provided. Use a format like 2022-06-29T09:47:06.012-07:00.
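The payload attributes above can be assembled and sanity-checked before they are sent. The sketch below only builds the JSON body; it does not call the Immuta API, and the function name and the example filter value are hypothetical. It also assumes Python's regular expression syntax is close enough to the one Immuta uses for a basic validity check.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def build_lineage_sync_payload(table_filter: str, batch_size: int,
                               last_timestamp: datetime) -> str:
    """Return a JSON body with tableFilter, batchSize, and lastTimestamp."""
    if batch_size < 1:
        raise ValueError("batchSize must be at least 1")
    # The filter is entered without surrounding / delimiters.
    if len(table_filter) > 1 and table_filter.startswith("/") and table_filter.endswith("/"):
        raise ValueError("remove the leading and trailing / from tableFilter")
    re.compile(table_filter)  # raises re.error if the pattern is malformed
    if last_timestamp.tzinfo is None:
        raise ValueError("lastTimestamp needs a UTC offset")
    return json.dumps({
        "tableFilter": table_filter,
        "batchSize": batch_size,
        # Produces a value such as 2022-06-29T09:47:06.012-07:00
        "lastTimestamp": last_timestamp.isoformat(timespec="milliseconds"),
    })


# Example with hypothetical values.
example_payload = build_lineage_sync_payload(
    r"MYDATABASE\.MYSCHEMA\..*",
    1000,
    datetime(2022, 6, 29, 9, 47, 6, 12000, tzinfo=timezone(timedelta(hours=-7))),
)
```

Requiring a timezone-aware datetime keeps the timestamp in the offset-qualified format shown above rather than leaving the offset ambiguous.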
Once the sync job is complete, you can complete the following steps:
Snowflake | Databricks Spark | Databricks Unity Catalog | Starburst (Trino) | |
---|---|---|---|---|
This is available and the information is included in audit logs.
This is not available and the information is not included in audit logs.
of the Snowflake table that has been registered in Immuta.
These guides provide instructions for auditing and detecting your users' activity, or see the for a comprehensive guide on the benefits of these features and other recommendations.
or for your .
.
.
to configure and validate SDD.
to discover entities of interest for your policy needs.
.
Register your remaining tables at the with .
.
These guides provide instructions for configuring and securing your data with governance policies, or see the for a comprehensive guide on creating policies to fit your organization's use case.
.
Validate that a user with enabled retains access if
.
Immuta helps you achieve the following outcomes in your data platform:
Simplify Operations: Immuta’s dynamic access control and policy management require 93x fewer data policies to manage access control in your data platform according to the GigaOm study. It is simple and scalable, which improves change management and lowers the total cost of ownership of cloud data management.
Improve data security: Immuta helps prove compliance with rules and regulations, even when securing hundreds of thousands of tables. An Immuta customer, Swedbank, migrated all critical analytics workloads to the cloud in less than 12 months, including over 100 terabytes from more than 2,500 sources.
Unlock data’s value: Immuta helps organizations get access to more data 100x faster, which translates to improved productivity. An Immuta customer, Thomson Reuters, enabled faster access to data, resulting in a 60x increase in data usage and greater productivity.
Immuta provides three modules to create a full data security platform suite.
Discover sensitive data from millions of fields without manual effort. With over 60 pre-built and domain-specific identifiers, you can tailor data classification to your unique business needs based on your desired confidence level.
Leverage timely insights into data access and user activity with anomaly indicators for faster analysis and proactive actions.
Immuta’s attribute-based access control (ABAC) delivers scalable data access without role explosion, and dynamic data masking ensures the right users can access the right data.
The Snowflake low row access policy mode improves query performance in Immuta's Snowflake integration by decreasing the number of Snowflake row access policies Immuta creates and by using table grants to manage user access.
Immuta manages access to Snowflake tables by administering Snowflake row access policies and column masking policies on those tables, allowing users to query them directly in Snowflake while policies are enforced.
Without Snowflake low row access policy mode enabled, row access policies are created and administered by Immuta in the following scenarios:
Table grants are disabled and a subscription policy that does not automatically subscribe everyone to the data source is applied. Immuta administers Snowflake row access policies to filter out all the rows to restrict access to the entire table when the user doesn't have privileges to query it. However, if table grants are disabled and a subscription policy is applied that grants everyone access to the data source automatically, Immuta does not create a row access policy in Snowflake. See the subscription policies page for details about these policy types.
A purpose-based policy is applied to a data source. A row access policy filters out all the rows of the table if users aren't acting under the purpose specified in the policy when they query the table.
A row-level security policy is applied to a data source. A row access policy filters out the rows that querying users don't have access to.
User impersonation is enabled. A row access policy is created for every Snowflake table registered in Immuta.
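The four scenarios above amount to a simple decision rule. The sketch below restates that rule for a single table when low row access policy mode is disabled; the class and field names are hypothetical, and this is an illustration of the listed conditions, not Immuta's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableState:
    table_grants_enabled: bool
    subscription_grants_everyone: bool  # policy subscribes everyone automatically
    has_purpose_based_policy: bool
    has_row_level_policy: bool
    impersonation_enabled: bool


def needs_row_access_policy(state: TableState) -> bool:
    """Return True if any of the scenarios listed above applies."""
    # Scenario 1: without table grants, a restrictive subscription policy is
    # enforced by filtering out every row for users who lack access.
    restricted_without_grants = (
        not state.table_grants_enabled
        and not state.subscription_grants_everyone
    )
    return (
        restricted_without_grants
        or state.has_purpose_based_policy
        or state.has_row_level_policy
        or state.impersonation_enabled
    )
```

Read this way, enabling table grants and disabling impersonation removes two of the four triggers, which is how the low row access policy mode reduces the number of row access policies.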
Deprecation notice
Support for using the Snowflake integration with low row access policy mode disabled has been deprecated. You must enable this feature and table grants for your integration to continue working. See the release notes for EOL dates.
Snowflake low row access policy mode is enabled by default to reduce the number of row access policies Immuta creates and improve query performance. Snowflake low row access policy mode requires user impersonation to be disabled. User impersonation diminishes the performance of interactive queries because of the number of row access policies Immuta creates when it's enabled.
Project-scoped purpose exceptions for Snowflake integrations allow you to apply purpose-based policies to Snowflake data sources in a project. As a result, users can only access that data when they are working within that specific project.
This feature allows masked columns to be joined across data sources that belong to the same project. When data sources do not belong to a project, Immuta uses a unique salt per data source for hashing to prevent masked values from being joined. (See the Why use masked joins? guide for an explanation of that behavior.) However, once you add Snowflake data sources to a project and enable masked joins, Immuta uses a consistent salt across all the data sources in that project to allow the join.
For more information about masked joins and enabling them for your project, see the Masked joins section of documentation.
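To see why the choice of salt controls whether masked columns can be joined, consider the following sketch. It uses SHA-256 purely as a stand-in; the hashing algorithm, the salt values, and the function name are assumptions for illustration and do not describe how Immuta computes masked values.

```python
import hashlib


def mask_with_hash(value: str, salt: str) -> str:
    """Return a salted hash of value (illustrative stand-in for hashing-based masking)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


email = "pat@example.com"

# A unique salt per data source: the same value masks to different tokens,
# so the masked columns cannot be joined.
token_in_source_a = mask_with_hash(email, "salt-for-source-a")
token_in_source_b = mask_with_hash(email, "salt-for-source-b")

# A consistent salt across a project: the same value masks to the same
# token in both data sources, so an equality join on the column works.
project_token_a = mask_with_hash(email, "salt-for-project")
project_token_b = mask_with_hash(email, "salt-for-project")
```

With per-source salts the two tokens differ; with the shared project salt they are identical, which is what allows the join while the underlying value stays hidden.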
Project workspaces are not compatible with this feature.
Impersonation is not supported when the Snowflake low row access policy mode is enabled.
Immuta is compatible with Snowflake Secure Data Sharing. Using both Immuta and Snowflake, organizations can share the policy-protected data of their Snowflake database with other Snowflake accounts with Immuta policies enforced in real time. This integration gives data consumers a live connection to the data and relieves data providers of the legal and technical burden of creating static data copies that leave their Snowflake environment.
Requirements:
Snowflake Enterprise Edition or higher
Immuta's table grants feature
This method requires that the data consumer account is registered as an Immuta user with the Snowflake user name equal to the consuming account.
At that point, the user that represents the account being shared with can be assigned the attributes and groups relevant to the data policies that need to be enforced. Once that user has access to the share in the consuming account (not managed by Immuta), they can query the share with the data policies from the producer account enforced, because Immuta treats that account as if it were a single user in Immuta.
For a tutorial on this workflow, see the Using Snowflake Data Sharing page.
Using Immuta with Snowflake Data Sharing allows the sharer to
Only need limited knowledge of the context or goals of the existing policies in place: Because the sharer is not editing or creating policies to share their data, they only need a limited knowledge of how the policies work. Their main responsibility is making sure they properly represent the attributes of the data consumer (the account being shared to).
Leave policies untouched.
This page details how to configure the Snowflake integration using the legacy workflow. To configure the Snowflake integration and register data sources using the simplified workflow, see this how-to guide.
Warehouse sizing recommendations
Before configuring the integration, review the Warehouse sizing recommendations guide to ensure that you use Snowflake compute resources cost effectively.
When performing an automated installation, Immuta requires temporary, one-time use of credentials with the following permissions:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
APPLY MASKING POLICY ON ACCOUNT WITH GRANT OPTION
APPLY ROW ACCESS POLICY ON ACCOUNT WITH GRANT OPTION
These permissions will be used to create and configure a new IMMUTA database within the specified Snowflake instance. The credentials are not stored or saved by Immuta, and Immuta doesn’t retain access to them after initial setup is complete.
You can create a new account for Immuta to use that has these permissions, or you can grant temporary use of a pre-existing account. By default, the pre-existing account with appropriate permissions is ACCOUNTADMIN. If you create a new account, it can be deleted after initial setup is complete.
Alternatively, you can create the IMMUTA database within the specified Snowflake instance manually using the manual setup option.
The specified role used to run the bootstrap needs to have the following privileges:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
APPLY MASKING POLICY ON ACCOUNT WITH GRANT OPTION
APPLY ROW ACCESS POLICY ON ACCOUNT WITH GRANT OPTION
It will create a user called IMMUTA_SYSTEM_ACCOUNT and grant the following privileges to that user:
APPLY MASKING POLICY ON ACCOUNT
APPLY ROW ACCESS POLICY ON ACCOUNT
Additional grants associated with the IMMUTA database
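If you prepare a dedicated role for running the bootstrap, the six account-level privileges listed above can be written as Snowflake GRANT statements. The sketch below only generates the SQL text; the function name and the role name in the example are hypothetical, and the role name is inserted as-is without quoting or validation.

```python
from typing import List

# Account-level privileges required by the role that runs the bootstrap,
# as listed above.
BOOTSTRAP_PRIVILEGES = [
    "CREATE DATABASE",
    "CREATE ROLE",
    "CREATE USER",
    "MANAGE GRANTS",
    "APPLY MASKING POLICY",
    "APPLY ROW ACCESS POLICY",
]


def bootstrap_grant_statements(role: str) -> List[str]:
    """Return one GRANT ... WITH GRANT OPTION statement per required privilege."""
    return [
        f"GRANT {privilege} ON ACCOUNT TO ROLE {role} WITH GRANT OPTION;"
        for privilege in BOOTSTRAP_PRIVILEGES
    ]


# Example with a hypothetical role name.
statements = bootstrap_grant_statements("IMMUTA_BOOTSTRAP")
```

Review the generated statements before running them in Snowflake with a role that is itself allowed to grant these privileges.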
Snowflake resource names: Use uppercase for the names of the Snowflake resources you create below.
Click the App Settings icon in the navigation panel.
Click the Integrations tab.
Click the +Add Native Integration button and select Snowflake from the dropdown menu.
Complete the Host, Port, and Default Warehouse fields.
Opt to check the Enable Project Workspace box. This will allow for managed write access within Snowflake. Note: Project workspaces still use Snowflake views, so the default role of the account used to create the data sources in the project must be added to the Excepted Roles List. This option is unavailable when table grants is enabled.
Opt to check the Enable Impersonation box and customize the Impersonation Role to allow users to natively impersonate another user. You cannot edit this choice after you configure the integration.
Snowflake query audit is enabled by default; you can disable it by clicking the Enable Native Query Audit checkbox.
Configure the audit frequency by scrolling to Integrations Settings and finding the Snowflake Audit Sync Schedule section.
Enter how often, in hours, you want Immuta to ingest audit events from Snowflake as an integer between 1 and 24.
Continue with your integration configuration.
Altering parameters in Snowflake at the account level may cause unexpected behavior of the Snowflake integration in Immuta
The QUOTED_IDENTIFIERS_IGNORE_CASE parameter must be set to false (the default setting in Snowflake) at the account level. Changing this value to true causes unexpected behavior of the Snowflake integration.
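One way to confirm this setting before configuring the integration is to inspect the output of Snowflake's SHOW PARAMETERS command. The sketch below holds the statement as a string and checks a result row; it assumes your Snowflake client returns the row as a dictionary keyed by column name (including the value column), and the function name is hypothetical.

```python
# Statement to run in Snowflake with your own client or worksheet.
SHOW_PARAMETER_SQL = (
    "SHOW PARAMETERS LIKE 'QUOTED_IDENTIFIERS_IGNORE_CASE' IN ACCOUNT"
)


def quoted_identifiers_setting_is_safe(row: dict) -> bool:
    """Return True if the account-level parameter value is false.

    Assumption: row maps SHOW PARAMETERS column names to values.
    """
    return str(row["value"]).strip().lower() == "false"
```

If the check returns False, the parameter has been changed from its default at the account level and should be set back to false before you continue.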
You have two options for configuring your Snowflake environment:
Automatic setup: Grant Immuta one-time use of credentials to automatically configure your Snowflake environment and the integration.
Manual setup: Run the Immuta script in your Snowflake environment yourself to configure your Snowflake environment and the integration.
Required permissions: When performing an automated installation, Immuta requires temporary, one-time use of credentials with the Snowflake permissions listed above.
From the Select Authentication Method Dropdown, select one of the following authentication methods:
Username and Password: Complete the Username, Password, and Role fields.
Key Pair Authentication:
Complete the Username field.
When using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
Click Key Pair (Required), and upload a Snowflake key pair file.
Complete the Role field.
Account creation best practice
The account you create for Immuta should only be used for the integration and should not be used as the credentials for creating data sources in Immuta; doing so will cause issues. Instead, create a separate, dedicated READ-ONLY account for creating and registering data sources within Immuta.
Required permissions: The specified role used to run the bootstrap needs to have the Snowflake permissions listed above.
It will create a user called IMMUTA_SYSTEM_ACCOUNT, and grant the following privileges to that user:
APPLY MASKING POLICY ON ACCOUNT
APPLY ROW ACCESS POLICY ON ACCOUNT
Additional grants associated with the IMMUTA database
Select Manual.
Use the Dropdown Menu to select your Authentication Method:
Username and password: Enter the Username and Password and set them in the bootstrap script for the Immuta system account credentials.
Key pair authentication: Upload the Key Pair file and when using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
Snowflake External OAuth:
Create a security integration for your Snowflake External OAuth. Note that if you have an existing security integration, then the Immuta system role must be added to the existing EXTERNAL_OAUTH_ALLOWED_ROLES_LIST. The Immuta system role name is the name of the Immuta database provided above with _SYSTEM appended. If you used the default database name, it will be IMMUTA_SYSTEM.
Fill out the Token Endpoint. This is where the generated token is sent.
Fill out the Client ID. This is the subject of the generated token.
Select the method Immuta will use to obtain an access token:
Certificate
Keep the Use Certificate checkbox enabled.
Opt to fill out the Resource field with a URI of the resource where the requested token will be used.
Enter the x509 Certificate Thumbprint. This identifies the corresponding key to the token and is often abbreviated as `x5t` or is called `sub` (Subject).
Upload the PEM Certificate, which is the client certificate that is used to sign the authorization request.
Client secret
Uncheck the Use Certificate checkbox.
Enter the Scope (string). The scope limits the operations and roles allowed in Snowflake by the access token. See the OAuth 2.0 scopes documentation for details about scopes.
Enter the Client Secret (string). Immuta uses this secret to authenticate with the authorization server when it requests a token.
In the Setup section, click bootstrap script to download the script. Then, fill out the appropriate fields and run the bootstrap script in Snowflake.
Different accounts
The account used to enable the integration must be different from the account used to create data sources in Immuta. Otherwise, workspace views won't be generated properly.
If you enabled a Snowflake workspace, select Warehouses from the dropdown menu that will be available to project owners when creating native Snowflake workspaces. Select from a list of all the warehouses available to the privileged account entered above. Note that any warehouse accessible by the PUBLIC role does not need to be explicitly added.
Enter the Excepted Roles/User List. Each role or username (both case-sensitive) in this list should be separated by a comma.
Excepted roles/users will have no policies applied to queries
Any user with the username or acting under the role in this list will have no policies applied to them when querying Immuta protected Snowflake tables in Snowflake. Therefore, this list should be used for service or system accounts and the default role of the account used to create the data sources in the Immuta projects (if you have Snowflake workspace enabled).
Click Save.
To allow Immuta to automatically import table and column tags from Snowflake, enable Snowflake tag ingestion in the external catalog section of the Immuta app settings page.
Snowflake user authentication
To configure Snowflake tag ingestion, which syncs Snowflake tags into Immuta, you must provide a Snowflake user who has, at minimum, the ability to set the following privileges:
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake
GRANT APPLY TAG ON ACCOUNT
Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter a Display Name and select Snowflake from the dropdown menu.
Enter the Account.
Enter the Authentication information: Username, Password, Port, Default Warehouse, and Role.
Opt to enter the Proxy Host, Proxy Port, and Encrypted Key File Passphrase.
Opt to Upload Certificates.
Click the Test Connection button.
Click the Test Data Source Link.
Once both tests are successful, click Save.
If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.
Log in to Immuta as an Application Admin.
Click the App Settings icon in the left sidebar and scroll to the Integration Settings section.
Your existing Databricks Spark integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.
Click Add Native Integration and select Databricks Integration to add a new integration.
Enter your Databricks Spark integration settings again as configured previously.
Click Add Native Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.
Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.
Automatically push cluster policies and the init script (recommended) or manually update your cluster policies.
Automatically push cluster policies
Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.
Select Apply Policies to push the cluster policies and init script again.
Click Save and Confirm to deploy your changes.
Manually update cluster policies
Download the init script and the new cluster policies to your local computer.
Click Save and Confirm to save your changes in Immuta.
Log in to your Databricks workspace with your administrator account to set up cluster policies.
Find the path to which you will upload the init script (immuta_cluster_init_script_proxy.sh) by opening one of the cluster policy .json files and looking for the defaultValue of the field init_scripts.0.dbfs.destination. This should be a DBFS path in the form of dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh.
Click Data in the left pane to upload your init script to DBFS to the path you found above.
To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.
Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.
Restart any Databricks clusters using these updated policies for the changes to take effect.
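The lookup of the init script path in the steps above can also be scripted. The following is a minimal sketch, assuming the downloaded cluster policy file is a JSON object keyed by field name, as the step that reads init_scripts.0.dbfs.destination suggests. The sample policy content and the function name init_script_destination are hypothetical; a real policy file contains many more fields.

```python
import json

# Hypothetical excerpt of a downloaded cluster policy file (assumption:
# fields are top-level keys whose values carry a "defaultValue").
SAMPLE_POLICY = """
{
  "init_scripts.0.dbfs.destination": {
    "defaultValue": "dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh"
  }
}
"""


def init_script_destination(policy_json: str) -> str:
    """Return the defaultValue of the init_scripts.0.dbfs.destination field."""
    policy = json.loads(policy_json)
    field = policy["init_scripts.0.dbfs.destination"]
    return field["defaultValue"]


print(init_script_destination(SAMPLE_POLICY))
```

The returned value is the DBFS path to use when uploading the init script.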
Snowflake
Databricks Unity Catalog
Databricks Spark
Google BigQuery
Starburst
Redshift
Azure Synapse Analytics
Amazon S3
Table and user coverage
Registered data sources and users
Registered data sources and users
All tables and users
Registered data sources and users
Object queried
Columns returned
Query text
Unauthorized information
Policy details
User's entitlements
Column tags
Table tags
Snowflake Enterprise Edition required
In this integration, Immuta manages access to Snowflake tables by administering Snowflake row access policies and column masking policies on those tables, allowing users to query tables directly in Snowflake while dynamic policies are enforced.
As with all Immuta integrations, Immuta can inject its ABAC model into policy building and administration to remove policy management burden and significantly reduce role explosion.
When an administrator configures the Snowflake integration with Immuta, Immuta creates an IMMUTA database and schemas (immuta_procedures, immuta_policies, and immuta_functions) within Snowflake to contain policy definitions and user entitlements. Immuta then creates a system role and gives that system account the following privileges:
APPLY MASKING POLICY
APPLY ROW ACCESS POLICY
ALL PRIVILEGES ON DATABASE "IMMUTA" WITH GRANT OPTION
ALL PRIVILEGES ON ALL SCHEMAS IN DATABASE "IMMUTA" WITH GRANT OPTION
USAGE ON FUTURE PROCEDURES IN SCHEMA "IMMUTA".immuta_procedures WITH GRANT OPTION
USAGE ON WAREHOUSE
OWNERSHIP ON SCHEMA "IMMUTA".immuta_policies TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
OWNERSHIP ON SCHEMA "IMMUTA".immuta_procedures TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
OWNERSHIP ON SCHEMA "IMMUTA".immuta_functions TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
OWNERSHIP ON SCHEMA "IMMUTA".public TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
Optional features, such as automatic object tagging and native query auditing, require additional permissions to be granted to the Immuta system account. These permissions are listed in the supported features section.
Snowflake is a policy push integration with Immuta. When Immuta users create policies, they are then pushed into the Immuta database within Snowflake; there, the Immuta system account applies Snowflake row access policies and column masking policies directly onto Snowflake tables. Changes in Immuta policies, user attributes, or data sources trigger webhooks that keep the Snowflake policies up-to-date.
For a user to query Immuta-protected data, they must meet two qualifications:
They must be subscribed to the Immuta data source.
They must be granted SELECT access on the table by the Snowflake object owner or automatically via the Snowflake table grants feature.
After a user has met these qualifications they can query Snowflake tables directly.
See the integration support matrix on the Data policy types reference guide for a list of supported data policy types in Snowflake.
When a user applies a masking policy to a Snowflake data source, Immuta truncates masked values to align with Snowflake column length (VARCHAR(X) types) and precision (NUMBER(X,Y) types) requirements.
Consider these columns in a data source that have the following masking policies applied:
Column A (VARCHAR(6)): Mask using hashing for everyone
Column B (VARCHAR(5)): Mask using a constant REDACTED for everyone
Column C (VARCHAR(6)): Mask by making null for everyone
Column D (NUMBER(3, 0)): Mask by rounding to the nearest 10 for everyone
Querying this data source in Snowflake would return the following values:
Hashing collisions
Hashing collisions are more likely to occur across or within Snowflake columns restricted to short lengths, since Immuta truncates the hashed value to the limit of the column. (Hashed values truncated to 5 characters have a higher risk of collision than hashed values truncated to 20 characters.) Therefore, avoid applying hashing policies to Snowflake columns with such restrictions.
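The effect of truncation on collisions can be illustrated with a small experiment. This is a sketch under stated assumptions: it uses SHA-256 with hex encoding as a stand-in hash (the hashing Immuta actually applies is not described here), and the function names and sample values are hypothetical.

```python
import hashlib


def truncated_hash(value: str, length: int) -> str:
    """Hash a value and keep only the first `length` hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def count_collisions(values, length: int) -> int:
    """Count inputs whose truncated hash duplicates an earlier input's hash."""
    seen = set()
    collisions = 0
    for value in values:
        digest = truncated_hash(value, length)
        if digest in seen:
            collisions += 1
        else:
            seen.add(digest)
    return collisions


# Hypothetical column contents: 5000 distinct values.
values = [f"customer-{i}" for i in range(5000)]
print("truncated to 5 characters:", count_collisions(values, 5))
print("truncated to 20 characters:", count_collisions(values, 20))
```

Five hex characters allow only about one million distinct outputs, so a few thousand distinct inputs are already likely to produce some duplicates, while 20 characters leave the space large enough that duplicates are practically never seen at this scale.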
For more details about Snowflake column length and precision requirements, see the Snowflake behavior change release documentation.
When a policy is applied to a column, Immuta uses Snowflake memoizable functions to cache the result of the called function. Then, when a user queries a column that has that policy applied to it, Immuta uses that cached result to dramatically improve query performance.
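As an analogy only, the caching behavior described above resembles memoization in general-purpose code. The sketch below uses Python's functools.lru_cache rather than Snowflake memoizable functions; the lookup function allowed_regions and its data are hypothetical stand-ins for an expensive entitlement lookup.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying lookup actually runs


@lru_cache(maxsize=None)
def allowed_regions(role: str) -> frozenset:
    """Hypothetical stand-in for an expensive entitlement lookup."""
    global calls
    calls += 1
    return frozenset({"US"}) if role == "ANALYST" else frozenset()


rows = [("US", 1), ("EU", 2), ("US", 3)]
# The lookup is requested once per row, but computed only the first time.
visible = [row for row in rows if row[0] in allowed_regions("ANALYST")]
print(visible, calls)
```

The point of the sketch is that the per-row policy check reuses a cached result instead of recomputing it for every row.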
Register Snowflake data sources using a dedicated Snowflake role. Avoid using individual user accounts for data source onboarding. Instead, create a service account (Snowflake user account TYPE=SERVICE) with SELECT access for onboarding data sources. No policies will apply to that account, ensuring that your integration works with the following use cases:
Snowflake project workspaces: Snowflake workspaces generate static views with the credentials used to register the table as an Immuta data source. Those tables must be registered in Immuta by an excepted role so that policies applied to the backing tables are not applied to the project workspace views.
Using views and tables within Immuta: Because this integration uses Snowflake governance policies, users can register tables and views as Immuta data sources. However, if you want to register views and apply different policies to them than their backing tables, the owner of the view must be an excepted role; otherwise, the backing table’s policies will be applied to that view.
Private preview: This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Bulk data source creation is the more efficient process when loading more than 5000 data sources from Snowflake and allows for data sources to be registered in Immuta before running sensitive data discovery or applying policies.
To use this feature, see the Bulk create Snowflake data sources guide.
Based on performance tests that create 100,000 data sources, Immuta recommends a SaaS XL environment.
Performance gains are limited when enabling sensitive data discovery at the time of data source creation.
External catalog integrations are not recognized during bulk data source creation. Users must manually trigger a catalog sync for tags to appear on the data source through the data source's health check.
Excepted roles and users are assigned when the integration is installed, and no policies will apply to these users' queries, despite any Immuta policies enforced on the tables they are querying. Credentials used to register a data source in Immuta will be automatically added to this excepted list for that Snowflake table. Consequently, roles and users added to this list and used to register data sources in Immuta should be limited to service accounts.
Immuta excludes the listed roles and users from policies by wrapping all policies in a CASE statement that checks whether a user is acting under one of the listed usernames or roles. If so, the policy is not applied to the queried table. If not, the policy is executed as normal. Immuta does not distinguish between role and username, so if you have a role and user with the exact same name, both the user and any user acting under that role will have full access to the data sources and no policies will be enforced for them.
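The exemption logic described above can be sketched in a few lines. This is an illustration of the behavior, not the CASE statement Immuta generates; the function names and sample names are hypothetical, and stripping whitespace around commas is an assumption.

```python
def parse_excepted(raw: str) -> set:
    """Split the comma-separated list; names remain case-sensitive."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def policy_applies(current_user: str, current_role: str, excepted: set) -> bool:
    """Skip the policy when either the username or the active role is listed.

    Roles and usernames share one list, so a match on either exempts the query.
    """
    return not (current_user in excepted or current_role in excepted)


excepted = parse_excepted("SVC_LOADER, ETL_ROLE")
print(policy_applies("SVC_LOADER", "PUBLIC", excepted))  # listed username
print(policy_applies("ALICE", "ETL_ROLE", excepted))     # listed role
print(policy_applies("ALICE", "ANALYST", excepted))      # neither listed
print(policy_applies("alice", "etl_role", excepted))     # case does not match
```

Because a single list is checked against both the username and the role, a user named ETL_ROLE would be exempt in the same way as anyone acting under the ETL_ROLE role.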
An Immuta application administrator configures the Snowflake integration and registers Snowflake warehouse and databases with Immuta.
Immuta creates a database inside the configured Snowflake warehouse that contains Immuta policy definitions and user entitlements.
A data owner registers Snowflake tables in Immuta as data sources.
If Snowflake tag ingestion was enabled during the configuration, Immuta uses the host provided in the configuration and ingests internal tags on Snowflake tables registered as Immuta data sources.
A data owner, data governor, or administrator creates or changes a policy or a user's attributes change in Immuta.
The Immuta web service calls a stored procedure that modifies the user entitlements or policies.
Immuta manages and applies Snowflake governance column and row access policies to Snowflake tables that are registered as Immuta data sources.
If Snowflake table grants is not enabled, a Snowflake object owner or a user with the global MANAGE GRANTS privilege grants the SELECT privilege on relevant Snowflake tables to users. Note: Although users are granted access, they will not see data unless they are subscribed to the table via Immuta-authored policies.
A Snowflake user who is subscribed to the data source in Immuta queries the corresponding table directly in Snowflake and sees policy-enforced data.
The Snowflake integration supports the following authentication methods to configure the integration and create data sources:
Username and password: Users can authenticate with their Snowflake username and password.
Key pair: Users can authenticate with a Snowflake key pair authentication.
Snowflake External OAuth: Users can authenticate with Snowflake External OAuth.
Immuta's OAuth authentication method uses the Client Credentials Flow to integrate with Snowflake External OAuth. When a user configures the Snowflake integration or connects a Snowflake data source, Immuta uses the token credentials (obtained using a certificate or passing a client secret) to craft an authenticated access token to connect with Snowflake. This allows organizations that already use Snowflake External OAuth to use that secure authentication with Immuta.
An Immuta application administrator configures the Snowflake integration or creates a data source.
Immuta creates a custom token and sends it to the authorization server.
The authorization server confirms the information sent from Immuta and issues an access token to Immuta.
Immuta sends the access token it received from the authorization server to Snowflake.
Snowflake authenticates the token and grants access to the requested resources from Immuta.
The integration is connected and users can query data.
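For the client secret method, the token request in the flow above follows the standard OAuth 2.0 client credentials grant. The sketch below only builds the form-encoded request body and sends nothing; the client ID, secret, and scope values are placeholders, and whether Immuta sends the client credentials in the body or in an HTTP header is not stated here, so the body form is an assumption.

```python
from urllib.parse import parse_qs, urlencode


def build_token_request(client_id: str, client_secret: str, scope: str) -> str:
    """Build the form-encoded body of a client credentials token request."""
    return urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })


# Placeholder values for illustration only.
body = build_token_request("my-client-id", "my-secret", "session:role:ANALYST")
print(body)
```

The body would be posted to the Token Endpoint configured for the integration, and the access token in the response is what gets presented to Snowflake.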
The Immuta Snowflake integration supports the following Snowflake features:
Private connectivity for Snowflake: While Immuta does not persist any of your data, data is temporarily held in memory in some instances, like when a user generates a data source fingerprint. This data is encrypted using TLS from the data source to Immuta as it traverses the public internet. Alternatively, Immuta can be connected to a user's Snowflake Account over either AWS PrivateLink or Azure Private Link so that any data moving between the user's data source and the Immuta tenant is over a private network.
Snowflake external tables: Note that you cannot add a masking policy to an external table column while creating the external table in Snowflake because masking policies cannot be attached to virtual columns.
The Snowflake integration supports the Immuta features outlined below. Click the links provided for more details.
Immuta project workspaces: Users can have additional write access in their integration using project workspaces.
Tag ingestion: Immuta automatically ingests Snowflake object tags from your Snowflake instance and adds them to the appropriate data sources.
User impersonation: Native impersonation allows users to natively query data as another Immuta user. To enable native user impersonation, see the Integration user impersonation page.
Native query audit: Immuta audits queries run natively in Snowflake against Snowflake data registered as Immuta data sources.
Snowflake low row access policy mode: The Snowflake low row access policy mode improves query performance in Immuta's Snowflake integration by decreasing the number of Snowflake row access policies Immuta creates.
Snowflake table grants: This feature allows Immuta to manage privileges on your Snowflake tables and views according to the subscription policies on the corresponding Immuta data sources.
Immuta system account required Snowflake privileges
CREATE [OR REPLACE] PROCEDURE
DROP ROLE
REVOKE ROLE
Users can have additional write access in their integration using project workspaces. For more details, see the Snowflake project workspaces page.
To use project workspaces with the Snowflake integration, the default role of the account used to create data sources in the project must be added to the "Excepted Roles/Users List." If the role is not added, you will not be able to query the equalized view using the project role in Snowflake.
You can enable Snowflake tag ingestion so that Immuta will ingest Snowflake object tags from your Snowflake instance into Immuta and add them to the appropriate data sources.
The Snowflake tags' key and value pairs will be reflected in Immuta as two levels: the key will be the top level and the value the second. As Snowflake tags are hierarchical, Snowflake tags applied to a database will also be applied to all of the schemas in that database, all of the tables within those schemas, and all of the columns within those tables. For example: If a database is tagged PII, all of the tables and columns in that database will also be tagged PII.
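The hierarchical inheritance described above can be sketched as a walk up an object's path. This is a simplified illustration, assuming dot-separated object names that do not themselves contain dots; the function name and sample objects are hypothetical.

```python
def inherited_tags(obj_path: str, tags_by_object: dict) -> set:
    """Collect tags set on an object and on every ancestor in its path."""
    parts = obj_path.split(".")
    tags = set()
    for depth in range(1, len(parts) + 1):
        ancestor = ".".join(parts[:depth])
        tags |= tags_by_object.get(ancestor, set())
    return tags


# Hypothetical tags: one on a database, one on a table.
tags_by_object = {
    "SALES": {"PII"},
    "SALES.PUBLIC.ORDERS": {"FINANCE"},
}
print(sorted(inherited_tags("SALES.PUBLIC.ORDERS.EMAIL", tags_by_object)))
print(sorted(inherited_tags("SALES.PUBLIC.CUSTOMERS", tags_by_object)))
```

The EMAIL column picks up both the database-level and the table-level tag, while the CUSTOMERS table inherits only the database-level tag.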
To enable Snowflake tag ingestion, see the Configure a Snowflake integration page.
Snowflake has some natural data latency. If you manually refresh the governance page to see all tags created globally, users can experience a delay of up to two hours. However, if you run schema detection or a health check to find where those tags are applied, the delay will not occur because Immuta will only refresh tags for those specific tables.
Immuta system account required Snowflake privilege
IMPORTED PRIVILEGES ON DATABASE snowflake
Once this feature has been enabled with the Snowflake integration, Immuta will query Snowflake to retrieve user query histories. These histories provide audit records for queries against Snowflake data sources that are queried natively in Snowflake.
This process will happen automatically every hour by default but can be configured to a different frequency when configuring or editing the integration. Additionally, audit ingestion can be manually requested at any time from the Immuta audit page. When manually requested, it will only search for new queries that were created since the last native query that had been audited. The job is run in the background, so the new queries will not be immediately available.
For details about prompting these logs and the contents of these audit logs, see the Snowflake query audit logs page.
A user can configure multiple integrations of Snowflake to a single Immuta tenant and use them dynamically or with workspaces.
There can only be one integration connection with Immuta per host.
The host of the data source must match the host of the integration for the view to be created.
Projects can only be configured to use one Snowflake host.
If there are errors in generating or applying policies natively in Snowflake, the data source will be locked and only users on the excepted roles/users list and the credentials used to create the data source will be able to access the data.
Once a Snowflake integration is disabled in Immuta, the user must remove the access that was granted in Snowflake. If that access is not revoked, users will be able to access the raw table in Snowflake.
Migration must be done using the credentials and credential method (automatic or bootstrap) used to configure the integration.
When configuring one Snowflake instance with multiple Immuta tenants, the user or system account that enables the integration on the app settings page must be unique for each Immuta tenant.
A Snowflake table can only have one set of policies enforced at a given time, so creating multiple data sources pointing to the same table is not supported. If this is a use case you need to support, create views in Snowflake and expose those instead.
You cannot add a masking policy to an external table column while creating the external table because a masking policy cannot be attached to a virtual column.
If you create an Immuta data source from a Snowflake view created using a select * from query, Immuta column detection will not work as expected because Snowflake views are not automatically updated based on backing table changes. To remedy this, you can create views that have the specific columns you want or you can CREATE OR REPLACE the view in Snowflake whenever the backing table is updated and manually run the column detection job on the data source page.
If a user is created in Snowflake after that user is already registered in Immuta, Immuta does not grant usage on the per-user role automatically - meaning Immuta does not govern this user's access without manual intervention. If a Snowflake user is created after that user is registered in Immuta, the user account must be disabled and re-enabled to trigger a sync of Immuta policies to govern that user. Whenever possible, Snowflake users should be created before registering those users in Immuta.
Snowflake tables from imported databases are not supported. Instead, create a view of the table and register that view as a data source.
The Immuta Snowflake integration uses Snowflake governance features to let users query data natively in Snowflake. This means that Immuta also inherits some Snowflake limitations using correlated subqueries with row access policies and column-level security. These limitations appear when writing custom WHERE policies, but do not remove the utility of row-level policies.
All column names must be fully qualified: Any column names that are unqualified (i.e., just the column name) will default to a column of the data source the policy is being applied to (if one matches the name).
The Immuta system account must have SELECT privileges on all tables/views referenced in a subquery: The Immuta system role name is specified by the user, and the role is created when the Snowflake instance is integrated.
Any subqueries that error in Snowflake will also error in Immuta.
Including one or more subqueries in the Immuta policy condition may cause errors in Snowflake. If an error occurs, it may happen during policy creation or at query-time. To avoid these errors, limit the number of subqueries, limit the number of JOIN operations, and simplify WHERE clause conditions.
For more information on the Snowflake subquery limitations see
Snowflake table grants simplifies the management of privileges in Snowflake when using Immuta. Instead of having to manually grant users access to tables registered in Immuta, you allow Immuta to manage privileges on your Snowflake tables and views according to subscription policies. Then, users subscribed to a data source in Immuta can view and query the Snowflake table, while users who are not subscribed to the data source cannot view or query the Snowflake table.
Enabling Snowflake table grants gives the following privileges to the Immuta Snowflake role:
MANAGE GRANTS ON ACCOUNT allows the Immuta Snowflake role to grant and revoke SELECT privileges on Snowflake tables and views that have been added as data sources in Immuta.
CREATE ROLE ON ACCOUNT allows for the creation of a Snowflake role for each user in Immuta, enabling fine-grained, attribute-based access controls to determine which tables are available to which individuals.
Since table privileges are granted to roles and not to users in Snowflake, Immuta's Snowflake table grants feature creates a new Snowflake role for each Immuta user. This design allows Immuta to manage table grants through fine-grained access controls that consider the individual attributes of users.
Each Snowflake user with an Immuta account will be granted a role that Immuta manages. The naming convention for this role is <IMMUTA>_USER_<username>, where
<IMMUTA> is the prefix you specified when enabling the feature on the Immuta app settings page.
<username> is the user's Immuta username.
Users are granted access to each Snowflake table or view automatically when they are subscribed to the corresponding data source in Immuta.
Users have two options for querying Snowflake tables that are managed by Immuta:
Use the role that Immuta creates and manages. (For example, USE ROLE IMMUTA_USER_<username>. See the section above for details about the role and name conventions.) If the current active primary role is used to query tables, USAGE on a Snowflake warehouse must be granted to the Immuta-managed Snowflake role for each user.
USE SECONDARY ROLES ALL, which allows users to use the privileges from all roles that they have been granted, including IMMUTA_USER_<username>, in addition to the current active primary role. Users may also set a value for DEFAULT_SECONDARY_ROLES as an object property on a Snowflake user. To learn more about primary roles and secondary roles in Snowflake, see Snowflake documentation.
Immuta uses an algorithm to determine the optimal way to group users in a role hierarchy in order to minimize the number of GRANTs (or REVOKEs) executed in Snowflake. It does this by determining the fewest possible permutations of access across tables and users based on the policies in place; those permutations become intermediate roles in the hierarchy, and each user is granted the intermediate roles that correspond to their access.
As an example, take the below users and data sources they have access to. To do this naively by individually granting every user to the tables they have access to would result in 37 grants:
Conversely, using the Immuta algorithm, we can optimize the number of grants in the same scenario down to 29:
It’s important to consider a few things here:
If the number of access permutations is small, the optimization is large (very few intermediate roles). If every user has a unique permutation of access, the optimization is negligible (one intermediate role per user). Most commonly, the number of access permutations is many times smaller than the user count, so the optimization should be large: there are far fewer intermediate roles, and the total number of grants is reduced because tables are granted to roles and roles to users.
This only happens once up front. After that, changes are incremental based on policy changes and user attribute changes (smaller updates), unless a policy makes a sweeping change across all users. Adding new users who have access also becomes more straightforward for the same reason: a user's access is granted through an intermediate role, so much of the work is front-loaded in creating the intermediate roles.
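The effect of grouping users by their access permutation can be sketched with a simple count. This is not Immuta's algorithm and does not reproduce the figures in the example above; it assumes a single flat layer of intermediate roles, one per distinct set of tables, and uses hypothetical users and tables.

```python
from collections import defaultdict


def naive_grant_count(access: dict) -> int:
    """One grant per (user, table) pair."""
    return sum(len(tables) for tables in access.values())


def grouped_grant_count(access: dict) -> int:
    """One intermediate role per distinct set of tables.

    Grants = each table granted to its role + each role granted to its users.
    """
    users_by_tables = defaultdict(list)
    for user, tables in access.items():
        if tables:  # users with no access need no role
            users_by_tables[frozenset(tables)].append(user)
    return sum(len(tables) + len(users)
               for tables, users in users_by_tables.items())


# Hypothetical access map: three users share one permutation of access.
access = {
    "alice": {"t1", "t2", "t3"},
    "bob": {"t1", "t2", "t3"},
    "carol": {"t1", "t2", "t3"},
    "dave": {"t4"},
}
print(naive_grant_count(access), grouped_grant_count(access))
```

In this simplified model the savings come entirely from users who share a permutation of access; when every user has a unique set of tables, grouping saves nothing.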
Project workspaces are not supported when Snowflake table grants is enabled.
If an Immuta tenant is connected to an external IAM and that external IAM has a username identical to another username in Immuta's built-in IAM, those users will have the same Snowflake role, leading both to see the same data.
Sometimes the role generated can contain special characters such as @ because it's based on the user name configured from your identity manager. Because of this, it is recommended that any code references to the Immuta-generated role be enclosed with double quotes.
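A small helper can build and quote the Immuta-generated role name when it is referenced from code. This is a minimal sketch, assuming the role name is exactly the prefix, _USER_, and the Immuta username as described in the naming convention earlier; whether Immuta alters the case of the username is not stated here. The function names and the sample username are hypothetical.

```python
def immuta_user_role(prefix: str, username: str) -> str:
    """Build the role name following the <IMMUTA>_USER_<username> convention."""
    return f"{prefix}_USER_{username}"


def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


role = immuta_user_role("IMMUTA", "jane@example.com")
print(f"USE ROLE {quote_identifier(role)};")
```

Quoting keeps characters such as @ from being rejected or misread when the role name appears in a SQL statement.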
Private preview: This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Snowflake column lineage specifies how data flows from source tables or columns to the target tables in write operations. When Snowflake lineage tag propagation is enabled in Immuta, Immuta automatically applies tags added to a Snowflake table to its descendant data source columns in Immuta so you can build policies using those tags to restrict access to sensitive data.
Snowflake Access History tracks user read and write operations. Snowflake column lineage extends this Access History to specify how data flows from source columns to the target columns in write operations, allowing data stewards to understand how sensitive data moves from ancestor tables to target tables so that they can
trace data back to its source to validate the integrity of dashboards and reports,
identify who performed write operations to meet compliance requirements,
evaluate data quality and pinpoint points of failure, and
tag sensitive data on source tables without having to tag columns on their descendant tables.
However, tagging sensitive data doesn’t innately protect that data in Snowflake; users need Immuta to disseminate these lineage tags automatically to descendant tables registered in Immuta so data stewards can build policies using the semantic and business context captured by those tags to restrict access to sensitive data. When Snowflake lineage tag propagation is enabled, Immuta propagates tags applied to a data source to its descendant data source columns in Immuta, which keeps your data inventory in Immuta up-to-date and allows you to protect your data with policies without having to manually tag every new Snowflake data source you register in Immuta.
An application administrator enables the feature on the Immuta app settings page.
Snowflake lineage metadata (column names and tags) for the Snowflake tables is stored in the metadata database.
A data owner creates a new data source (or adds a new column to a Snowflake table) that initiates a job that applies all tags for each column from its ancestor columns.
A data owner or governor adds a tag to a column in Immuta that has descendants, which initiates a job that propagates the tag to all descendants.
An audit record is created that includes which tags were applied and from which columns those tags originated.
The Snowflake Account Usage ACCESS_HISTORY view contains column lineage information.
To appropriately propagate tags to descendant data sources, Immuta fetches Access History metadata to determine what column tags have been updated, stores this metadata in the Immuta metadata database, and then applies those tags to relevant descendant columns of tables registered in Immuta.
Consider the following example using the Customer, Customer 2, and Customer 3 tables that were all registered in Immuta as data sources.
Customer: source table
Customer 2: descendant of Customer
Customer 3: descendant of Customer 2
If the Discovered.Electronic Mail Address tag is added to the Customer data source in Immuta, that tag will propagate through lineage to the Customer 2 and Customer 3 data sources.
After an application administrator has enabled Snowflake lineage tag propagation, data owners can register data in Immuta and have tags in Snowflake propagated from ancestor tables to descendant data sources. Whenever new tags are added to those tables in Immuta, those upstream tags will propagate to descendant data sources.
By default all tags are propagated, but these tags can be filtered on the app settings page or using the Immuta API.
Lineage tag propagation works with any tag added to the data dictionary. Tags can be manually added, synced from an external catalog, or discovered by SDD. Consider the following example using the Customer, Customer 2, and Customer 3 tables that were all registered in Immuta as data sources.
Customer: source table
Customer 2: descendant of Customer
Customer 3: descendant of Customer 2
Immuta added the Discovered.Electronic Mail Address tag to the Customer data source, and that tag propagated through lineage to the Customer 2 and Customer 3 data sources.
Removing the tag from the Customer 2 table soft deletes it from the Customer 2 data source. When a tag is deleted, downstream lineage tags are removed, unless another parent data source still has that tag. The tag remains visible, but it will not be re-added if a future propagation event specifies the same tag again. Immuta prevents you from removing Snowflake object tags from data sources. You can only remove Immuta-managed tags. To remove Snowflake object tags from tables, you must remove them in Snowflake.
However, the Discovered.Electronic Mail Address tag still applies to the Customer 3 data source because Customer still has the tag applied. The only way a tag will be removed from descendant data sources is if no other ancestor of the descendant still prescribes the tag.
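The propagation and removal rules in this example can be sketched with a small in-memory model. This is a simplified illustration, not Immuta's implementation: the class name, its methods, and the choice to let a manual re-add clear a soft delete are all assumptions made for the sketch.

```python
from collections import defaultdict


class LineageTagModel:
    """Toy model of lineage tag propagation between registered tables."""

    def __init__(self, edges):
        # edges: iterable of (ancestor, descendant) pairs taken from lineage.
        self.parents = defaultdict(set)
        for parent, child in edges:
            self.parents[child].add(parent)
        self.direct = defaultdict(set)        # tags applied directly to a table
        self.soft_deleted = defaultdict(set)  # tags a user removed from a table

    def add_tag(self, table, tag):
        # Assumption: a manual add clears an earlier soft delete on the same table.
        self.soft_deleted[table].discard(tag)
        self.direct[table].add(tag)

    def remove_tag(self, table, tag):
        # Soft delete: remember the removal so propagation does not re-add the tag.
        self.direct[table].discard(tag)
        self.soft_deleted[table].add(tag)

    def ancestors(self, table):
        # Walk the lineage graph upward and collect every ancestor once.
        seen, stack = set(), list(self.parents.get(table, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents.get(node, ()))
        return seen

    def has_tag(self, table, tag):
        if tag in self.soft_deleted.get(table, ()):
            return False
        if tag in self.direct.get(table, ()):
            return True
        # A descendant keeps the tag while any ancestor still prescribes it.
        return any(tag in self.direct.get(a, ()) for a in self.ancestors(table))


if __name__ == "__main__":
    tag = "Discovered.Electronic Mail Address"
    model = LineageTagModel([("Customer", "Customer 2"), ("Customer 2", "Customer 3")])
    model.add_tag("Customer", tag)
    model.remove_tag("Customer 2", tag)
    for table in ("Customer", "Customer 2", "Customer 3"):
        print(table, model.has_tag(table, tag))
```

In this model, removing the tag from Customer 2 leaves it on Customer 3 because Customer, an ancestor of Customer 3, still carries it.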
If the Snowflake lineage tag propagation feature is disabled, tags will remain on Immuta data sources.
Immuta audit records include Snowflake lineage tag events when a tag is added or removed.
The example audit record below illustrates the SNOWFLAKE_TAGS.pii tag successfully propagating from the Customer table to Customer 2:
Without tableFilter set, Immuta will ingest lineage for every table on the Snowflake instance.
Tag propagation based on lineage is not retroactive. For example, if you add a table, add tags to that table, and then run the lineage ingestion job, tags will not get propagated. However, if you add a table, run the lineage ingestion job, and then add tags to the table, the tags will get propagated.
The native lineage job needs to pull in lineage data before any tag is applied in Immuta. When Immuta gets new lineage information from Snowflake, Immuta does not update existing tags in Immuta.
There can be up to a 3-hour delay in Snowflake for a lineage event to make it into the ACCESS_HISTORY view.
Immuta does not ingest lineage information for views.
Snowflake only captures lineage events for CTAS, CLONE, MERGE, and INSERT write operations. Snowflake does not capture lineage events for DROP, RENAME, ADD, or SWAP. Instead of using these latter operations, you need to recreate a table with the same name if you need to make changes.
Immuta cannot enforce coherence of your Snowflake lineage. If a column, table, or schema in the middle of the lineage graph gets dropped, Immuta will not do anything unless a table with that same name gets recreated. This means a table that gets dropped but not recreated could live in Immuta’s system indefinitely.
This integration allows you to manage and access data in your Databricks account across all of your workspaces. With Immuta’s Databricks Unity Catalog integration, you can write your policies in Immuta and have them enforced automatically by Databricks across data in your Unity Catalog metastore.
This getting started guide outlines how to integrate Databricks Unity Catalog with Immuta.
Configure the Databricks Unity Catalog integration.
Migrate from the legacy Databricks Spark integrations to the Databricks Unity Catalog integration.
This guide describes the design and components of the integration.
This reference guide provides descriptions of the possible statuses of a configured integration.
The warehouse you select when configuring the Snowflake integration uses compute resources to set up the integration, register data sources, orchestrate policies, and run jobs like sensitive data discovery. Snowflake credit charges are based on the size of and amount of time the warehouse is active, not the number of queries run.
This document describes how and when to adjust the size and scale of clusters for your warehouse to manage workloads so that you can use Snowflake compute resources as cost-effectively as possible.
In general, increase the size of and number of clusters for the warehouse to handle heavy workloads and multiple queries. Workloads are typically lighter after data sources are onboarded and policies are established in Immuta, so compute resources can be reduced after those workloads complete.
The Snowflake integration uses warehouse compute resources to sync policies created in Immuta to the Snowflake objects registered as data sources and, if enabled, to run sensitive data discovery and schema monitoring. Follow the guidelines below to adjust the warehouse size and scale according to your needs.
Increase the size of and number of clusters for the warehouse during large policy syncs, updates, and changes.
Enable auto suspend to optimize resource use in Snowflake. In the Snowflake UI, the lowest auto suspend time setting is 5 minutes. However, through a SQL query, you can set auto_suspend to 61 seconds (since the minimum uptime for a warehouse is 60 seconds).
If it is enabled, sensitive data discovery uses compute resources for each table registered. Consider disabling sensitive data discovery when registering data sources if you have an external catalog or a tagging strategy in place.
Register data before creating global policies. By default, Immuta does not apply a subscription policy on registered data (unless an existing global policy applies to it), which allows Immuta to only pull metadata instead of also applying policies when data sources are created. Registering data before policies are created reduces the workload and the Snowflake compute resources needed.
Begin onboarding with a small dataset of tables, and then review and monitor query performance in Snowflake. Adjust the virtual warehouse accordingly to handle heavier loads.
Schema monitoring uses the compute warehouse that was employed during the initial ingestion to periodically monitor the schema for changes. If you expect a low number of new tables or minimal changes to the table structure, consider scaling down the warehouse size.
Resize the warehouse after data sources are registered and policies are established.
For more details and guidance about warehouse sizing, see the Snowflake documentation.
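The auto suspend and resizing guidelines above can be expressed as Snowflake ALTER WAREHOUSE statements. The sketch below only builds the SQL text. The warehouse name IMMUTA_WH and the helper names are hypothetical, and the list of sizes is a subset of what Snowflake offers.

```python
# Subset of Snowflake warehouse sizes, listed here for validation only.
VALID_SIZES = {"XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE"}


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def auto_suspend_statement(warehouse: str, seconds: int = 61) -> str:
    """Build the statement that suspends an idle warehouse after the given seconds."""
    if seconds < 0:
        raise ValueError("seconds must not be negative")
    return f"ALTER WAREHOUSE {quote_identifier(warehouse)} SET AUTO_SUSPEND = {seconds};"


def resize_statement(warehouse: str, size: str, max_clusters=None) -> str:
    """Build the statement that resizes a warehouse and optionally caps its clusters."""
    size = size.upper()
    if size not in VALID_SIZES:
        raise ValueError(f"unsupported size in this sketch: {size}")
    statement = f"ALTER WAREHOUSE {quote_identifier(warehouse)} SET WAREHOUSE_SIZE = {size}"
    if max_clusters is not None:
        # Multi-cluster warehouses depend on your Snowflake edition.
        statement += f" MAX_CLUSTER_COUNT = {int(max_clusters)}"
    return statement + ";"


if __name__ == "__main__":
    # Hypothetical warehouse name, for illustration only.
    print(auto_suspend_statement("IMMUTA_WH"))
    print(resize_statement("IMMUTA_WH", "large", max_clusters=3))  # before a large policy sync
    print(resize_statement("IMMUTA_WH", "xsmall"))                 # after onboarding completes
```

Scaling up before a heavy sync and back down afterward keeps the warehouse active at a larger size only for as long as the workload needs it.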
Even after your integration is configured, data sources are registered, and policies are established, changes to those data sources or policies may initiate heavy workloads. Follow the guidelines below to adjust your warehouse size and scale according to your needs.
Check how many credits queries have consumed:
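One way to approach this, sketched under assumptions, is to total credits per warehouse from the Snowflake Account Usage WAREHOUSE_METERING_HISTORY view. The helper name and the seven-day default window are hypothetical. The query reports credits at the warehouse level rather than per query, so treat it as a starting point.

```python
def credits_by_warehouse_query(days: int = 7) -> str:
    """Build a query that totals credits used per warehouse over the last N days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        "SELECT warehouse_name, SUM(credits_used) AS total_credits\n"
        "FROM snowflake.account_usage.warehouse_metering_history\n"
        f"WHERE start_time >= DATEADD(day, -{int(days)}, CURRENT_TIMESTAMP())\n"
        "GROUP BY warehouse_name\n"
        "ORDER BY total_credits DESC;"
    )


if __name__ == "__main__":
    print(credits_by_warehouse_query(days=30))
```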
The how-to guides linked on this page illustrate how to integrate Databricks Unity Catalog with Immuta.
Requirements:
Unity Catalog metastore created and attached to a Databricks workspace. Immuta supports configuring a single metastore for each configured integration, and that metastore may be attached to multiple Databricks workspaces.
Unity Catalog enabled on your Databricks cluster or SQL warehouse. All SQL warehouses have Unity Catalog enabled if your workspace is attached to a Unity Catalog metastore.
These guides provide information on the recommended features to enable with Databricks Unity Catalog.
with the following feature enabled: (enabled by default)
Select None as your .
These guides provide instructions for organizing your Databricks Unity Catalog data to align with your governance structure.
These guides provide instructions for discovering, classifying, and tagging your data.
Validate the policies. You do not have to validate every policy you create in Immuta; instead, examine a few to validate the behavior you expect to see.
Once all Immuta policies are in place, remove or alter old permissions and revoke access to the ungoverned tables.
Databricks Unity Catalog allows you to manage and access data in your Databricks account across all of your workspaces. With Immuta’s Databricks Unity Catalog integration, you can write your policies in Immuta and have them enforced automatically by Databricks across data in your Unity Catalog metastore.
This page details how to configure the integration. To configure the Databricks Unity Catalog integration and register data sources using the , see this .
Several different accounts are used to set up and maintain the Databricks Unity Catalog integration. The permissions required for each are outlined below.
Immuta account (required): This user configures the integration on the app settings page in Immuta. To access the app settings page, this user needs the following permission:
APPLICATION_ADMIN Immuta permission
Databricks service principal (required): This service principal is used continuously by Immuta to orchestrate Unity Catalog policies and maintain state between Immuta and Databricks. In the automatic setup, Immuta also uses this service principal to create the Immuta-managed catalog. This service principal needs the following Databricks privileges:
CREATE CATALOG privilege on the Unity Catalog metastore. This is only required if you have Immuta configure the integration in Databricks for you. If a separate user will run the Immuta script in Databricks to manually configure the integration, that Databricks user account needs this privilege instead.
OWNER permission on the Immuta catalog you configure.
OWNER privilege on one of the securables below so that Immuta can administer Unity Catalog row-level and column-level security controls.
on catalogs with schemas and tables registered as Immuta data sources. This permission could also be applied by granting OWNER on a catalog to a Databricks group that includes the Immuta service principal to allow for multiple owners.
on schemas with tables registered as Immuta data sources.
on all tables registered as Immuta data sources - if the OWNER permission cannot be applied at the catalog- or schema-level. In this case, each table registered as an Immuta data source must individually have the OWNER permission granted to the Immuta service principal.
USE CATALOG and USE SCHEMA on parent catalogs and schemas of tables registered as Immuta data sources so that the Immuta service principal can SELECT and MODIFY securables within the parent catalog and schema.
SELECT and MODIFY on all tables registered as Immuta data sources so that the Immuta service principal can grant and revoke access to tables and apply Unity Catalog row- and column-level security controls.
For native query audit (optional)
USE CATALOG on the system catalog
USE SCHEMA on the system.access schema
SELECT on the following system tables:
system.access.audit
system.access.table_lineage
system.access.column_lineage
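As a rough sketch, the optional native query audit privileges listed above could be granted with Databricks SQL GRANT statements like the ones this helper generates. The helper name and the example principal identifier are hypothetical; substitute the application ID of your own service principal.

```python
# System tables named in the native query audit requirements.
AUDIT_TABLES = (
    "system.access.audit",
    "system.access.table_lineage",
    "system.access.column_lineage",
)


def audit_grants(principal: str) -> list:
    """Return GRANT statements that give a principal read access for query audit."""
    # Principals are wrapped in backticks; an embedded backtick is doubled.
    quoted = "`" + principal.replace("`", "``") + "`"
    statements = [
        f"GRANT USE CATALOG ON CATALOG system TO {quoted};",
        f"GRANT USE SCHEMA ON SCHEMA system.access TO {quoted};",
    ]
    statements += [f"GRANT SELECT ON TABLE {table} TO {quoted};" for table in AUDIT_TABLES]
    return statements


if __name__ == "__main__":
    # Hypothetical service principal application ID, for illustration only.
    for statement in audit_grants("00000000-0000-0000-0000-000000000000"):
        print(statement)
```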
Databricks account (recommended): This user account can manually configure the integration in Databricks to create the Immuta-managed catalog. To do so, this account requires the following Databricks privileges:
CREATE CATALOG on the Unity Catalog metastore
ACCOUNT ADMIN on the Unity Catalog metastore for native query audit (optional)
Before you configure the Databricks Unity Catalog integration, ensure that you have fulfilled the following requirements:
Unity Catalog enabled on your Databricks cluster or SQL warehouse. All SQL warehouses have Unity Catalog enabled if your workspace is attached to a Unity Catalog metastore. Immuta recommends linking a SQL warehouse to your Immuta tenant rather than a cluster for both performance and availability reasons.
Unity Catalog best practices
Ensure your integration with Unity Catalog goes smoothly by following these guidelines:
Use a Databricks SQL warehouse to configure the integration. Databricks SQL warehouses are faster to start than traditional clusters, require less management, and can run all the SQL that Immuta requires for policy administration. A serverless warehouse provides nearly instant startup time and is the preferred option for connecting to Immuta.
Move all data into Unity Catalog before configuring Immuta with Unity Catalog. The default catalog used once Unity Catalog support is enabled in Immuta is the hive_metastore, which is not supported by the Unity Catalog native integration. Data sources in the Hive Metastore must be managed by the Databricks Spark integration. Existing data sources will need to be re-created after they are moved to Unity Catalog and the Unity Catalog integration is configured.
Ensure that all Databricks clusters that have Immuta installed are stopped and the Immuta configuration is removed from the cluster. Immuta-specific cluster configuration is no longer needed with the Databricks Unity Catalog integration.
USE CATALOG on the system catalog
USE SCHEMA on the system.access schema
SELECT on the following system tables:
system.access.audit
system.access.table_lineage
system.access.column_lineage
You have two options for configuring your Databricks Unity Catalog integration:
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click + Add Native Integration and select Databricks Unity Catalog from the dropdown menu.
Complete the following fields:
Server Hostname is the hostname of your Databricks workspace.
HTTP Path is the HTTP path of your Databricks cluster or SQL warehouse.
Immuta Catalog is the name of the catalog Immuta will create to store internal entitlements and other user data specific to Immuta. This catalog will only be readable for the Immuta service principal and should not be granted to other users. The catalog name may only contain letters, numbers, and underscores and cannot start with a number.
If using a proxy server with Databricks Unity Catalog, click the Enable Proxy Support checkbox and complete the Proxy Host and Proxy Port fields. The username and password fields are optional.
Optionally, fill out the Exemption Group field with the name of a Databricks group that will be excluded from having data policies applied and that must not be changed from the default value. Create this account-level group for privileged users and service accounts that require an unmasked view of data before configuring the integration in Immuta.
Optionally, scope the query audit ingestion by entering Unity Catalog Workspace IDs. Enter a comma-separated list of the workspace IDs for which you want Immuta to ingest audit records. If left empty, Immuta will audit all tables and users in Unity Catalog.
Enter how often, in hours, you want Immuta to ingest audit events from Unity Catalog as an integer between 1 and 24.
Continue with your integration configuration.
Select your authentication method from the dropdown:
OAuth machine-to-machine (M2M):
AWS Databricks:
Fill out the Token Endpoint with the full URL of the identity provider. This is where the generated token is sent. The default value is https://<your workspace name>.cloud.databricks.com/oidc/v1/token.
Enter the Client Secret you created above. Immuta uses this secret to authenticate with the authorization server when it requests a token.
Azure Databricks:
Within Immuta, fill out the Token Endpoint with the full URL of the identity provider. This is where the generated token is sent. The default value is https://<your workspace name>.azuredatabricks.net/oidc/v1/token.
Enter the Client Secret you created above. Immuta uses this secret to authenticate with the authorization server when it requests a token.
Click Save.
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click + Add Native Integration and select Databricks Unity Catalog from the dropdown menu.
Complete the following fields:
Server Hostname is the hostname of your Databricks workspace.
HTTP Path is the HTTP path of your Databricks cluster or SQL warehouse.
Immuta Catalog is the name of the catalog Immuta will create to store internal entitlements and other user data specific to Immuta. This catalog will only be readable for the Immuta service principal and should not be granted to other users. The catalog name may only contain letters, numbers, and underscores and cannot start with a number.
If using a proxy server with Databricks Unity Catalog, click the Enable Proxy Support checkbox and complete the Proxy Host and Proxy Port fields. The username and password fields are optional.
Optionally, fill out the Exemption Group field with the name of a Databricks group that will be excluded from having data policies applied and that must not be changed from the default value. Create this account-level group for privileged users and service accounts that require an unmasked view of data before configuring the integration in Immuta.
Optionally, scope the query audit ingestion by entering Unity Catalog Workspace IDs. Enter a comma-separated list of the workspace IDs for which you want Immuta to ingest audit records. If left empty, Immuta will audit all tables and users in Unity Catalog.
Enter how often, in hours, you want Immuta to ingest audit events from Unity Catalog as an integer between 1 and 24.
Continue with your integration configuration.
Select your authentication method from the dropdown:
OAuth machine-to-machine (M2M):
AWS Databricks:
Fill out the Token Endpoint with the full URL of the identity provider. This is where the generated token is sent. The default value is https://<your workspace name>.cloud.databricks.com/oidc/v1/token.
Enter the Client Secret you created above. Immuta uses this secret to authenticate with the authorization server when it requests a token.
Azure Databricks:
Within Immuta, fill out the Token Endpoint with the full URL of the identity provider. This is where the generated token is sent. The default value is https://<your workspace name>.azuredatabricks.net/oidc/v1/token.
Enter the Client Secret you created above. Immuta uses this secret to authenticate with the authorization server when it requests a token.
Select the Manual toggle and copy or download the script. You can modify the script to customize your storage location for tables, schemas, or catalogs.
Run the script in Databricks.
Click Save.
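Before filling out the fields described in the steps above, it can help to check the values locally. The sketch below encodes the stated constraints: the catalog naming rule, the comma-separated workspace IDs, the 1 to 24 hour audit frequency, and the default token endpoint pattern. All function names and example values are hypothetical, and Immuta's own validation may differ.

```python
import re

# Letters, numbers, and underscores only; the first character cannot be a number.
CATALOG_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Default token endpoint domains given in the steps above.
TOKEN_DOMAINS = {"aws": "cloud.databricks.com", "azure": "azuredatabricks.net"}


def valid_catalog_name(name: str) -> bool:
    """Check the Immuta Catalog naming rule."""
    return CATALOG_NAME.fullmatch(name) is not None


def parse_workspace_ids(raw: str) -> list:
    """Split a comma-separated list of workspace IDs, ignoring blanks and spaces."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def valid_audit_frequency(hours) -> bool:
    """Check that the audit ingestion frequency is an integer between 1 and 24."""
    return type(hours) is int and 1 <= hours <= 24


def default_token_endpoint(workspace_name: str, cloud: str) -> str:
    """Build the default OAuth M2M token endpoint for an AWS or Azure workspace."""
    return f"https://{workspace_name}.{TOKEN_DOMAINS[cloud]}/oidc/v1/token"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(valid_catalog_name("immuta_catalog"))
    print(parse_workspace_ids("1234567890, 9876543210"))
    print(valid_audit_frequency(12))
    print(default_token_endpoint("my-workspace", "aws"))
```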
If the usernames in Immuta do not match usernames in Databricks, map each Databricks username to each Immuta user account to ensure Immuta properly enforces policies using one of the methods linked below:
Design partner preview
This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Requirement: A Databricks Unity Catalog integration must be configured for tags to be ingested.
To allow Immuta to automatically import table and column tags from Databricks Unity Catalog, enable Databricks Unity Catalog tag ingestion in the external catalog section of the Immuta app settings page.
Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter a Display Name and select Databricks Unity Catalog from the dropdown menu.
Click Save and confirm your changes.
While you're onboarding Snowflake data sources and designing policies, you don't want to disrupt your Snowflake users' existing workflows. Instead, you want to gradually onboard Immuta through a series of successive changes that will not impact your existing Snowflake users.
A phased onboarding approach to configuring the Snowflake integration ensures that your users will not be immediately affected by changes as you add data sources and configure policies.
Several features allow you to gradually onboard data sources and policies in Immuta:
: By default, no policy is applied at registration time; instead of applying a restrictive policy immediately upon registration, the table is registered in Immuta and waits for a policy to be applied, if ever.
There are several benefits to this design:
All existing roles maintain access to the data and registration of the table or view with Immuta has zero impact on your data platform.
It gives you time to configure tags on the Immuta-registered tables and views, either manually or through automatic means, such as Immuta’s sensitive data discovery (SDD) or an external catalog integration that includes Snowflake tags.
It gives you time to assess and validate the sensitive data tags that were applied.
You can build only row and column controls with Immuta and let your existing roles manage table access instead of using Immuta subscription policies for table access.
Snowflake table grants coupled with Snowflake low row access policy mode: With these features enabled, Immuta manages access to tables (subscription policies) through GRANTs. Immuta assigns each user their own unique role, and all table access is managed using that single role.
Without these two features enabled, Immuta uses a Snowflake row access policy (RAP) to manage table access. A RAP only allows users to access rows in the table if they were explicitly granted access through an Immuta subscription policy; otherwise, the user sees no rows. This behavior means all existing Snowflake roles lose access to the table contents until explicitly granted access through Immuta subscription policies. Essentially, roles outside of Immuta don't control access anymore.
By using table grants and the low row access policy mode, users and roles outside Immuta continue to work.
There are two benefits to this approach:
All pre-existing Snowflake roles retain access to the data until you explicitly revoke access (outside Immuta).
It provides a way to test that Immuta GRANTs are working without impacting production workloads.
The following configuration is required for phased Snowflake onboarding:
Impersonation is disabled
Project workspaces are disabled
If either of these capabilities is necessary for your use case, you cannot do phased Snowflake onboarding as described below.
This integration enforces policies on Databricks tables registered as data sources in Immuta, allowing users to query policy-enforced data on Databricks clusters (including job clusters). Immuta policies are applied to the plan that Spark builds for users' queries, all executed directly against Databricks tables.
The guides in this section outline how to integrate Databricks Spark with Immuta.
Configure the Databricks Spark integration.
Access DBFS in Databricks for non-sensitive data.
Allow Immuta users to access tables that are not protected by Immuta.
Hide the Immuta database from users in Databricks, since user queries do not need to reference it.
Run R and Scala spark-submit jobs on your Databricks cluster.
Raise the caching on-cluster and lower the cache timeouts for the Immuta web service to allow use of project UDFs in Spark jobs.
Use an existing Hive external metastore instead of the built-in metastore.
This guide describes the design and components of the integration.
This reference guide provides descriptions of the possible statuses of a configured integration.
Configuration settings: These guides describe various integration settings that can be configured, including , cluster policies, and .
This guide describes Immuta's support of Databricks change data feed.
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. This guide describes the feature and its configuration.
When using Delta Lake, the API does not go through the normal Spark execution path. This means that Immuta's Spark extensions do not provide protection for the API. To solve this issue and ensure that Immuta has control over what a user can access, the Delta Lake API is blocked. This reference guide outlines the Spark SQL options that can be substituted for the Delta Lake API.
Immuta allows direct file reads in Spark for file paths. This guide describes that process.
Immuta’s integration with Unity Catalog allows you to enforce fine-grained access controls on Unity Catalog securable objects with Immuta policies. Instead of manually creating UDFs or granting access to each table in Databricks, you can author your policies in Immuta and have Immuta manage and orchestrate Unity Catalog access-control policies on your data in Databricks clusters or SQL warehouses:
Subscription policies: Immuta subscription policies automatically grant and revoke access to specific Databricks securable objects.
Data policies: Immuta data policies enforce row- and column-level security.
Unity Catalog uses the following hierarchy of data objects:
Metastore: Created at the account level and is attached to one or more Databricks workspaces. The metastore contains metadata of all the catalogs, schemas, and tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those objects.
Catalog: Sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas
Schema: Organizes tables and views
Table-etc: Table (managed or external tables), view, volume, model, and function
For details about the Unity Catalog object model, see the Databricks Unity Catalog documentation.
The Databricks Unity Catalog integration supports:
applying column masks and row filters on specific securable objects
applying subscription policies on tables and views
enforcing Unity Catalog access controls, even if Immuta becomes disconnected
allowing non-Immuta reads and writes
using Photon
using a proxy server
Immuta uses this service principal to run queries that set up user-defined functions (UDFs) and other data necessary for policy enforcement. Upon enabling the native integration, Immuta will create a catalog that contains these schemas:
immuta_system: Contains internal Immuta data.
immuta_policies_n: Contains policy UDFs.
When policies require changes to be pushed to Unity Catalog, Immuta updates the internal tables in the immuta_system schema with the updated policy information. If necessary, new UDFs are pushed to replace any out-of-date policies in the immuta_policies_n schemas and any row filters or column masks are updated to point at the new policies. Many of these operations require compute on the configured Databricks cluster or SQL warehouse, so compute must be available for these policies to succeed.
Typical use cases for binding a catalog to specific workspaces include
Ensuring users can only access production data from a production workspace environment.
For example, you may have production data in a prod_catalog, as well as a production workspace you are introducing to your organization. Binding the prod_catalog to the prod_workspace ensures that workspace admins and users can only access prod_catalog from the prod_workspace environment.
Ensuring users can only process sensitive data from a specific workspace. Limiting the environments from which users can access sensitive data helps better secure your organization’s data. Limiting access to one workspace also simplifies any monitoring, auditing, and understanding of which users are accessing specific data. This would entail a similar setup as the example above.
Giving users read-only access to production data from a developer workspace.
This enables your organization to effectively conduct development and testing, while minimizing risk to production data. All user access to this catalog from this workspace can be specified as read-only, ensuring developers can access the data they need for testing without risk of any unwanted updates.
Immuta’s Unity Catalog integration applies Databricks table-, row-, and column-level security controls that are enforced natively within Databricks. Immuta's management of these Databricks security controls is automated and ensures that they synchronize with Immuta policy or user entitlement changes.
Row-level security: Immuta applies SQL UDFs to restrict access to rows for querying users.
Column-level security: Immuta applies column-mask SQL UDFs to tables for querying users. These column-mask UDFs run for any column that requires masking.
The Unity Catalog integration supports the following policy types:
Conditional masking
Constant
Custom masking
Hashing
Null
Rounding (date and numeric rounding)
Matching (only show rows where)
Custom WHERE
Never
Where user
Where value in column
Minimization
Time-based restrictions
If you are using views in Databricks Unity Catalog, one of the following must be true for project-scoped purpose exceptions to apply to the views in Databricks:
The view and underlying table are registered as Immuta data sources and added to a project: If a view and its underlying table are both added as Immuta data sources, both of these assets must be added to the project for the project-scoped purpose exception to apply. If a view and underlying table are both added as data sources but the table is not added to an Immuta project, the purpose exception will not apply to the view because Databricks does not support fine-grained access controls on views.
Only the underlying table is registered as an Immuta data source and added to a project: If only the underlying table is registered as an Immuta data source but the view is not registered, the purpose exception will apply to both the table and corresponding view in Databricks. Views are the only Databricks object that will have Immuta policies applied to them even if they're not registered as Immuta data sources (as long as their underlying tables are registered).
Some users may need to be exempt from masking and row-level policy enforcement. When you add user accounts to the configured exemption group in Databricks, Immuta will not enforce policies for those users. Exemption groups are created when the Unity Catalog integration is configured, and no policies will apply to these users' queries, despite any policies enforced on the tables they query.
The principal used to register data sources in Immuta will be automatically added to this exemption group for that Databricks table. Consequently, users added to this list and used to register data sources in Immuta should be limited to service accounts.
hive_metastore
When enabling Unity Catalog support in Immuta, the catalog for all Databricks data sources will be updated to point at the default hive_metastore catalog. Internally, Databricks exposes this catalog as a proxy to the workspace-level Hive metastore that schemas and tables were kept in before Unity Catalog. Since this catalog is not a real Unity Catalog catalog, it does not support any Unity Catalog policies. Therefore, Immuta will ignore any data sources in the hive_metastore in any Databricks Unity Catalog integration, and policies will not be applied to tables there.
The Databricks Unity Catalog integration supports the following authentication methods to configure the integration and create data sources:
Access requirements
For Databricks Unity Catalog audit to work, Immuta must have, at minimum, the following access.
USE CATALOG on the system catalog
USE SCHEMA on the system.access schema
SELECT on the following system tables:
system.access.audit
system.access.table_lineage
system.access.column_lineage
Design partner preview: This feature is available to select accounts. Reach out to your Immuta representative to enable this feature.
You can enable tag ingestion to allow Immuta to ingest Databricks Unity Catalog table and column tags so that you can use them in Immuta policies to enforce access controls. When you enable this feature, Immuta uses the credentials and connection information from the Databricks Unity Catalog integration to pull tags from Databricks and apply them to data sources as they are registered in Immuta. If Databricks data sources were registered before tag ingestion was enabled, those data sources will automatically sync to the catalog and tags will apply. Immuta checks for changes to tags in Databricks and syncs Immuta data sources to those changes every 24 hours.
When syncing data sources to Databricks Unity Catalog tags, Immuta pulls the following information:
Table tags: These tags apply to the table and appear on the data source overview tab. Databricks tags' key and value pairs are reflected in Immuta as a hierarchy with each level separated by a . delimiter. For example, the Databricks Unity Catalog tag Location: US would be represented as Location.US in Immuta.
Column tags: These tags are applied to data source columns and appear on the columns listed in the data dictionary tab. Databricks tags' key and value pairs are reflected in Immuta as a hierarchy with each level separated by a . delimiter. For example, the Databricks Unity Catalog tag Location: US would be represented as Location.US in Immuta.
Table comments field: This content appears as the data source description on the data source overview tab.
Column comments field: This content appears as dictionary column descriptions on the data dictionary tab.
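The key-and-value mapping described for table and column tags can be sketched in a few lines of Python. This is only an illustration of the naming convention: the function name to_immuta_tag is hypothetical and is not part of any Immuta or Databricks API, and the handling of a tag with no value is an assumption.

```python
def to_immuta_tag(key: str, value: str = "") -> str:
    """Join a Databricks tag key and value into a dot-delimited hierarchy."""
    # Assumption: a tag that has a key but no value maps to the key alone.
    return f"{key}.{value}" if value else key


# The example from the text: the tag "Location: US" becomes "Location.US".
print(to_immuta_tag("Location", "US"))
```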
Only tags that apply to Databricks data sources in Immuta are available to build policies in Immuta. Immuta will not pull tags in from Databricks Unity Catalog unless those tags apply to registered data sources.
Cost implications: Tag ingestion in Databricks Unity Catalog requires compute resources. Therefore, having many Databricks data sources or frequently manually syncing data sources to Databricks Unity Catalog may incur additional costs.
Databricks Unity Catalog tag ingestion only supports tenants with fewer than 2,500 data sources registered.
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Legend:
Row access policies with more than 1023 columns are unsupported. This is an underlying limitation of UDFs in Databricks. Immuta will only create row access policies with the minimum number of referenced columns. This limit will therefore apply to the number of columns referenced in the policy and not the total number in the table.
If you disable table grants, Immuta revokes the grants. Therefore, if users had access to a table before enabling Immuta, they’ll lose access.
You must use the global regex flag (g) when creating a regex masking policy in this integration, and you cannot use the case insensitive regex flag (i) when creating a regex masking policy in this integration. See the examples below for guidance:
regex with a global flag (supported): /^ssn|social ?security$/g
regex without a global flag (unsupported): /^ssn|social ?security$/
regex with a case insensitive flag (unsupported): /^ssn|social ?security$/gi
regex without a case insensitive flag (supported): /^ssn|social ?security$/g
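As a rough pre-check, the two flag rules above can be tested before a masking policy is created. The sketch below assumes the expression is written as a /pattern/flags literal, as in the examples; the helper name flags_supported is hypothetical and does not correspond to an Immuta API.

```python
import re

# Splits a "/pattern/flags" literal into its pattern and trailing flag letters.
_LITERAL = re.compile(r"^/(?P<pattern>.*)/(?P<flags>[a-z]*)$", re.DOTALL)


def flags_supported(regex_literal: str) -> bool:
    """Return True if the literal has the global flag and lacks the case insensitive flag."""
    match = _LITERAL.match(regex_literal)
    if match is None:
        return False  # Not written in /pattern/flags form.
    flags = match.group("flags")
    return "g" in flags and "i" not in flags


for literal in ("/^ssn|social ?security$/g",
                "/^ssn|social ?security$/",
                "/^ssn|social ?security$/gi"):
    print(literal, flags_supported(literal))
```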
If a registered data source is owned by a Databricks group at the table level, then the Unity Catalog integration cannot apply data masking policies to that table in Unity Catalog.
Therefore, set all table-level ownership on your Unity Catalog data sources to an individual user or service principal instead of a Databricks group. Catalogs and schemas can still be owned by a Databricks group, as ownership at that level doesn't interfere with the integration.
The following features are currently unsupported:
Databricks change data feed support
Immuta projects
Multiple IAMs on a single cluster
Column masking policies on views
Mixing masking policies on the same column
Row-redaction policies on views
R and Scala cluster support
Scratch paths
User impersonation
Policy enforcement on raw Spark reads
Python UDFs for advanced masking functions
Direct file-to-SQL reads
Data policies (except for masking with NULL) on ARRAY, MAP, or STRUCT type columns
Snippets for Databricks data sources may be empty in the Immuta UI.
This guide details the manual installation method for enabling native access to Databricks with Immuta policies enforced. Before proceeding, ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the .
immuta_conf.xml is no longer required
The immuta_conf.xml file that was previously used to configure the native Databricks Spark integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. However, you can use these snippets if you wish to deploy an immuta_conf.xml file to set properties.
The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional immuta_conf.xml file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.
If you have an existing immuta_conf.xml file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method will ensure that any property defaults changed in upcoming Immuta releases are propagated to your environment.
If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster.
If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:
Navigate to the App Settings page and click Integration Settings.
Uncheck the Enable Unity Catalog checkbox.
Click Save.
Spark Version
Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
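The compatibility rule above can be expressed as a small check to run before choosing a jar. This is a sketch only: the function name spark_major_for_runtime is hypothetical, and it assumes the runtime version is given as a string whose first dot-separated component is the major version (for example, "9.1").

```python
def spark_major_for_runtime(runtime_version: str) -> int:
    """Return the Spark major version to pair with a Databricks Runtime version."""
    runtime_major = int(runtime_version.split(".")[0])
    # Runtime 7.x or later pairs with Spark 3; anything earlier pairs with Spark 2.
    return 3 if runtime_major >= 7 else 2


print(spark_major_for_runtime("6.4"))
print(spark_major_for_runtime("9.1"))
```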
Navigate to the page. If you are prompted to log in and need basic authentication credentials, contact your Immuta support professional.
Navigate to the Databricks folder for the newest Immuta version (for example, https://archives.immuta.com/hadoop/databricks/2024.2.1/).
Download the .jar file (Immuta plugin) as well as the other scripts listed below, which will load the plugin at cluster startup.
The immuta-benchmark-suite.dbc is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Specify the following properties as Spark environment variables or in the optional immuta_conf.xml file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the config names in all upper case with underscores (_) instead of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
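The naming convention and the precedence rule can be illustrated with a short Python sketch. The helper names env_var_name and resolve_property are hypothetical, and the plain dictionary standing in for the contents of immuta_conf.xml is an assumption made for illustration; this is not how the Immuta plugin itself reads its configuration.

```python
import os
from typing import Mapping, Optional


def env_var_name(property_name: str) -> str:
    """Convert a config name such as immuta.base.url to IMMUTA_BASE_URL."""
    return property_name.upper().replace(".", "_")


def resolve_property(property_name: str,
                     xml_properties: Mapping[str, str],
                     environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the effective value, letting the environment variable win."""
    env_value = environ.get(env_var_name(property_name))
    if env_value is not None:
        return env_value
    # Fall back to the value from the optional configuration file, if any.
    return xml_properties.get(property_name)


print(env_var_name("immuta.base.url"))
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.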
immuta.system.api.key: Obtain this value from the App Settings page under HDFS > System API Key. You will need to be a user with the APPLICATION_ADMIN role to complete this action.
immuta.base.url: The full URL for the target Immuta tenant (for example, https://immuta.mycompany.com).
immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta App Settings page within the Identity Management section. See Databricks to Immuta User Mapping for more details.
Environment variables with Google Cloud Platform
Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml.
Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. To do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:
These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. For the init script to find these files, a URI must be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.
URI Structure: s3://[bucket]/[path]
Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.
If you wish to authenticate using access keys, add the following items to the cluster's environment variables:
If you've assumed a role and received a session token, that can be added here as well:
URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]
Environment Variables:
If you want to authenticate using an account key, add the following to your cluster's environment variables:
If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:
URI Structure: adl://[account].azuredatalakestore.net/[path]
Environment Variables:
If authenticating as a Microsoft Entra ID user,
If authenticating using a service principal,
URI Structure: http(s)://[host](:port)/[path]
Artifacts are available for download from Immuta using basic authentication. Your basic authentication credentials can be obtained from your Immuta support professional.
DBFS does not support access control
Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.
URI Structure: dbfs:/[path]
Since any user has access to everything in DBFS:
The artifacts can be stored anywhere in DBFS.
It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
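The URI structures listed above differ by scheme, so a quick way to confirm which hosting method an artifact URI belongs to is to inspect its scheme. The sketch below is illustrative only: the function name hosting_method and the descriptive labels are assumptions, not names used by the init script.

```python
from urllib.parse import urlparse

# Scheme-to-method labels derived from the URI structures in this section.
_HOSTING_BY_SCHEME = {
    "s3": "AWS S3",
    "abfs": "Azure Data Lake Storage Gen2",
    "abfss": "Azure Data Lake Storage Gen2",
    "adl": "Azure Data Lake Storage Gen1",
    "http": "HTTP(S) server",
    "https": "HTTP(S) server",
    "dbfs": "DBFS",
}


def hosting_method(artifact_uri: str) -> str:
    """Return a label for the hosting method implied by the URI scheme."""
    scheme = urlparse(artifact_uri).scheme.lower()
    try:
        return _HOSTING_BY_SCHEME[scheme]
    except KeyError:
        raise ValueError(f"Unsupported artifact URI scheme: {scheme!r}") from None


print(hosting_method("dbfs:/immuta-artifacts/my-cluster/plugin.jar"))
```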
Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret
they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section:
Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
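To make the reference format concrete, the sketch below builds the value that would be entered in the Environment Variables section. The helper names and the variable name MY_SECRET_ENV_VAR are hypothetical; only the {{secrets/<scope>/<key>}} form comes from the text above.

```python
def secret_reference(scope: str, key: str) -> str:
    """Build a Databricks secret reference of the form {{secrets/<scope>/<key>}}."""
    return "{{secrets/" + scope + "/" + key + "}}"


def env_var_line(name: str, scope: str, key: str) -> str:
    """Build a NAME=value line whose value is a secret reference, not the secret itself."""
    return f"{name}={secret_reference(scope, key)}"


# MY_SECRET_ENV_VAR is a placeholder name chosen for this example.
print(env_var_line("MY_SECRET_ENV_VAR", "my_secrets", "my_secret_env_var"))
```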
Best practice: Replace sensitive variables with secrets
Immuta recommends that any sensitive environment variables listed below in the various artifact deployment instructions be replaced with secrets.
Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta enabled clusters.
Select the Custom Access mode.
Opt to adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
In the Advanced Options section, click the Instances tab.
Click the Spark tab. In Spark Config field, add your configuration.
Cluster Configuration Requirements:
Click the Init Scripts tab and set the following configurations:
Destination: Specify the service you used to host the Immuta artifacts.
File Path: Specify the full URI to the immuta_cluster_init_script.sh.
Add the new key/value to the configuration.
Click the Permissions tab and configure the following setting:
Who has access: Users or groups will need to have the permission Can Attach To to execute queries against Immuta configured data sources.
(Re)start the cluster.
As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration in order to read the data composing Immuta data sources.
As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.
The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.
IAM role for S3 access
ADL prefix: Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.
When the Immuta-enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta tenant.
Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the immuta database.
The following SQL query can be run as an administrator within a notebook to give the user access to "Immuta":
By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and the connected Immuta tenant must match.
This guide details the simplified installation method for enabling native access to Databricks with Immuta policies enforced.
Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the before you begin.
If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster. See the section below for guidance.
If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:
Navigate to the App Settings page and click Integration Settings.
Uncheck the Enable Unity Catalog checkbox.
Click Save.
Log in to Immuta and click the App Settings icon in the left sidebar.
Scroll to the System API Key subsection under HDFS and click Generate Key.
Click Save and then Confirm.
Scroll to the Integration Settings section.
Click + Add Native Integration and select Databricks Integration from the dropdown menu.
Complete the Hostname field.
Enter a Unique ID for the integration. By default, your Immuta tenant URL populates this field. This ID is used to tie the set of cluster policies to your Immuta tenant and allows multiple Immuta tenants to access the same Databricks workspace without cluster policy conflicts.
Select your configured Immuta IAM from the dropdown menu.
Choose one of the following options for your data access model:
Protected until made available by policy: All tables are hidden until a user is permissioned through an Immuta policy. This is how most databases work and assumes least privileged access and also means you will have to register all tables with Immuta.
Available until protected by policy: All tables are open until explicitly registered and protected by Immuta. This makes a lot of sense if most of your tables are non-sensitive and you can pick and choose which to protect.
Select the Storage Access Type from the dropdown menu.
Opt to add any Additional Hadoop Configuration Files.
Click Add Native Integration.
Several cluster policies are available on the App Settings page when configuring this integration:
Click a link above to read more about each of these cluster policies before continuing with the tutorial.
Click Configure Cluster Policies.
Select one or more cluster policies in the matrix by clicking the Select button(s).
Opt to check the Enable Unity Catalog checkbox to generate cluster policies that will enable Unity Catalog on your cluster. This option is only available when Databricks runtime 11.3 is selected.
Opt to make changes to these cluster policies by clicking Additional Policy Changes and editing the text field.
Use one of the two Installation Types described below to apply the policies to your cluster:
Automatically push cluster policies: This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.
Select the Automatically Push Cluster Policies radio button.
Enter your Admin Token. This token must be for a user who can create cluster policies in Databricks.
Click Apply Policies.
Manually push cluster policies: Enabling this option will allow you to manually push the cluster policies to the configured Databricks workspace. There will be various files to download and manually push to the configured Databricks workspace.
Select the Manually Push Cluster Policies radio button.
Click Download Init Script.
Follow the steps in the Instructions to upload the init script to DBFS section.
Click Download Policies, and then manually add these Cluster Policies in Databricks.
Opt to click the Download the Benchmarking Suite to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Click Close, and then click Save and Confirm.
In the Policy dropdown, select the Cluster Policies you pushed or manually added from Immuta.
Select the Custom Access mode.
Opt to adjust Autopilot Options and Worker Type settings: The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
Opt to configure the Instances tab in the Advanced Options section:
Click Create Cluster.
Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster.
When you enable Unity Catalog, Immuta automatically migrates your existing Databricks data sources in Immuta to reference the legacy hive_metastore
catalog to account for Unity Catalog's . New data sources will reference the Unity Catalog metastore you create and attach to your Databricks workspace.
Because the hive_metastore catalog is not managed by Unity Catalog, existing data sources in the hive_metastore cannot have Unity Catalog access controls applied to them.
To allow Immuta to administer Unity Catalog access controls on that data, move the data to Unity Catalog and re-register those tables in Immuta by completing the steps below. If you don't move all data before configuring the integration, will protect your existing data sources throughout the migration process.
Ensure that all Databricks clusters that have Immuta installed are stopped and the Immuta configuration is removed from the cluster. Immuta-specific cluster configuration is no longer needed with the Databricks Unity Catalog integration.
Move all data into Unity Catalog before configuring Immuta with Unity Catalog. Existing data sources will need to be re-created after they are moved to Unity Catalog and the Unity Catalog integration is configured.
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
Databricks instance: Premium tier workspace and
Databricks instance has network level access to Immuta tenant
Databricks instance is either publicly accessible or has been configured for .
Access to
Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Note: Azure Databricks authenticates users with Microsoft Entra ID. Be sure to configure your Immuta tenant with an IAM that uses the same user ID as does Microsoft Entra ID. Immuta's Spark security plugin will look to match this user ID between the two systems. See this for details.
Use the table below to determine which version of Immuta supports your Databricks Runtime version:
Databricks Runtime Version | Immuta Version |
---|
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Legend:
Immuta supports the Custom access mode.
Supported Languages:
Python
SQL
R (requires advanced configuration; work with your Immuta support professional to use R)
Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
Users who can read raw tables on-cluster
If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.
If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.
The Immuta Databricks Spark integration injects an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin creates an "immuta" database that is available for querying and intercepts all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta tenant and applied before returning the results to the user.
The Databricks cluster init script provided by Immuta downloads the Immuta artifacts onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for policy enforcement.
The cluster init script uses environment variables in order to:
Determine the location of the required artifacts for downloading.
Authenticate with the service/storage containing the artifacts.
Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.
There are two installation options for Databricks. Click a link below to navigate to a tutorial for your chosen method:
Adding the integration on the App Settings page.
Downloading or automatically pushing cluster policies to your Databricks workspace.
Creating or restarting your cluster.
Downloading and configuring Immuta artifacts.
Staging Immuta artifacts somewhere the cluster can read from during its startup procedures.
Protecting Immuta environment variables with Databricks Secrets.
Creating and configuring the cluster to start with the init script and load Immuta into its SparkSQL environment.
For easier debugging of the Immuta Databricks installation, enable cluster init script logging. In the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.
For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
The Validation and Debugging Notebook (immuta-validation.ipynb) is packaged with other Databricks release artifacts (for manual installations), or it can be downloaded from the App Settings page when configuring native Databricks through the Immuta UI. This notebook is designed to be used by or under the guidance of an Immuta Support Professional.
Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
Click the arrow next to your name and select Import.
Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.
This page outlines how to access DBFS in Databricks for non-sensitive data. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or the immuta_conf.xml file (not recommended).
This feature (provided by Databricks) mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.
DBFS FUSE Mount limitation: This feature cannot be used in environments with E2 Private Link enabled.
For example,
In Python,
Note: This solution also works in R and Scala.
To enable the DBFS FUSE mount, set this configuration: immuta.spark.databricks.dbfs.mount.enabled=true.
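Because the FUSE mount exposes DBFS as ordinary local paths, standard file I/O is enough to read and write files there. The following is a minimal sketch, assuming the mount is enabled and available at /dbfs on the cluster; the function name write_then_read and the example path are hypothetical.

```python
from pathlib import Path


def write_then_read(mount_root: str, relative_path: str, text: str) -> str:
    """Write text under the mount root with local file I/O, then read it back."""
    target = Path(mount_root) / relative_path
    # Create intermediate directories the same way as on a local disk.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target.read_text()


# On a cluster with the mount enabled, the call might look like:
# write_then_read("/dbfs", "path/to/my/file", "hello from the FUSE mount")
```

Taking the mount root as a parameter lets the same function be tried against a temporary directory outside Databricks.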
Mounting a bucket
Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount.
Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.
Mounting must be performed from a non-Immuta cluster.
Scratch paths will work when performing arbitrary remote filesystem operations with fs magic or Scala dbutils.fs functions. For example,
To support %fs magic and Scala DBUtils with scratch paths, configure
To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.
This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.
Get the file from remote storage:
Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:
Write the new file back to remote storage:
Hiding the database does not disable access to it
Queries can still be performed against tables in the immuta database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table) regardless of whether or not this feature is enabled.
The immuta database on Immuta-enabled clusters allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, Immuta supports raw tables in Databricks, so table-backed queries do not need to reference this database. When configuring a Databricks cluster, you can hide immuta from any calls to SHOW DATABASES so that users are not confused or misled by that database.
immuta Database
When configuring a Databricks cluster, hide immuta by using the following environment variable in the :
This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the .
Use project UDFs in Databricks Spark
Currently, caches are not all invalidated outside of Databricks because Immuta caches information pertaining to a user's current project. Consequently, this feature should only be used in Databricks.
Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project.
To allow use of project UDFs in Spark jobs, raise the caching on-cluster and lower the cache timeouts for the Immuta Web Service. Otherwise, caching could cause dissonance among the requests and calls to multiple web workers when users try to change their project contexts.
Click the App Settings icon in the left sidebar and scroll to the HDFS Cache Settings section.
Lower the Cache TTL of HDFS user names (ms) to 0.
Click Save.
In the Spark environment variables section, set the IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS to high values (like 10000).
Note: These caches will be invalidated on cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on cluster to avoid periodically reaching out to the web service.
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases. However, in some native patterns (such as Snowflake), Immuta adds views to allow users access to Immuta sources but does not impede access to preexisting sources in the underlying database. Therefore, if a user had access in Snowflake to a table before Immuta was installed, they would still have access to that table after.
Unlike the example above, Databricks non-admin users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The Limited Enforcement Scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
This feature is composed of two configurations:
Allowing non-Immuta reads: Immuta users with regular (unprivileged) Databricks roles may SELECT from tables that are not registered in Immuta.
Allowing non-Immuta writes: Immuta users with regular (unprivileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta.
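As a rough illustration of how the two settings combine, the sketch below models the access decision for an unprivileged user. Everything in it is an assumption made for illustration: EnforcementScope, is_allowed, and the table names are hypothetical, and the model deliberately ignores native workspaces and the details of policy enforcement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnforcementScope:
    """Hypothetical stand-in for the two Limited Enforcement Scope settings."""
    allow_non_immuta_reads: bool = False
    allow_non_immuta_writes: bool = False


def is_allowed(action, table, registered, subscribed, scope):
    """Return True if an unprivileged user may perform `action` on `table`.

    action:     "read" or "write"
    registered: tables registered as Immuta data sources
    subscribed: registered tables the user is subscribed to
    """
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action}")
    if table in registered:
        # Immuta-protected table: reads require a subscription (policies still
        # apply); this simplified model never allows direct writes.
        return action == "read" and table in subscribed
    if action == "read":
        return scope.allow_non_immuta_reads
    return scope.allow_non_immuta_writes


registered = {"sales.customers", "sales.orders"}
subscribed = {"sales.orders"}
reads_only = EnforcementScope(allow_non_immuta_reads=True)

print(is_allowed("read", "lake.weather", registered, subscribed, reads_only))     # True
print(is_allowed("write", "lake.weather", registered, subscribed, reads_only))    # False
print(is_allowed("read", "sales.customers", registered, subscribed, reads_only))  # False
print(is_allowed("read", "sales.orders", registered, subscribed, reads_only))     # True
```

The point of the model is that neither setting loosens access to registered tables; they only affect tables Immuta does not protect.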
Additionally, Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not. To configure Immuta to do so, navigate to the .
Non-Immuta reads
This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL.
When non-Immuta reads are enabled, Immuta users will see all databases and tables when they run show databases and/or show tables. However, this does not mean they will be able to query all of them.
Enable non-Immuta Reads by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
Optionally, adjust the cache duration by changing the default value in the Spark environment variables (recommended) or immuta_conf.xml (not recommended). (Immuta caches whether a table has been exposed as an Immuta source to improve performance. The default caching duration is 1 hour.)
Non-Immuta writes
These non-protected tables/spaces have the same exposure as detailed in the read section, but with the distinction that users can write data directly to these paths.
With non-Immuta writes enabled, it will be possible for users on the cluster to mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data, and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s native workspaces.
Enable non-Immuta Writes by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
Optionally, adjust the cache duration by changing the default value in the Spark environment variables (recommended) or immuta_conf.xml (not recommended). (Immuta caches whether a table has been exposed as an Immuta source to improve performance. The default caching duration is 1 hour.)
Enable support for auditing all queries run on a Databricks cluster (regardless of whether users touch Immuta-protected data or not) by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
The controls and default values associated with non-Immuta reads, non-Immuta writes, and audit functionality are outlined below.
This page describes the Databricks Spark integration, configuration options, and features. See the for a tutorial on enabling Databricks and these features through the App Settings page.
Project Workspaces | Databricks Tag Ingestion | User Impersonation | Native Query Audit | Multiple Integrations |
---|---|---|---|---|
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Example cluster | Databricks Runtime | Unity Catalog in Databricks | Databricks Spark integration | Databricks Unity Catalog integration |
---|---|---|---|---|
Legend:
The feature or integration is enabled.
The feature or integration is disabled.
Databricks instance has network level access to Immuta tenant
Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Immuta supports the Custom access mode.
Supported Languages:
Python
SQL
R (requires advanced configuration; work with your Immuta support professional to use R)
Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
The Immuta Databricks Spark integration supports the following Databricks features:
Audit limitations
Capturing the code or query that triggers the Spark plan makes audit records more useful in assessing what users are doing.
A user can configure multiple integrations of Databricks to a single Immuta tenant and use them dynamically or with workspaces.
Immuta does not support Databricks clusters with Photon acceleration enabled.
will still run on data sources and can be manually triggered. Tags applied through sensitive data discovery will propagate as tags added through lineage to descendant Immuta data sources.
Review your to identify query performance and bottlenecks.
After reviewing query performance and cost, implement to adjust your warehouse.
These guides provide instructions for auditing and detecting your users' activity, or see the for a comprehensive guide on the benefits of these features and other recommendations.
or for your .
to configure and validate SDD.
to discover entities of interest for your policy needs.
Register your remaining tables at the with .
These guides provide instructions for configuring and securing your data with governance policies, or see the for a comprehensive guide on creating policies to fit your organization's use case.
Unity Catalog and attached to a Databricks workspace. Immuta supports configuring a single metastore for each configured integration, and that metastore may be attached to multiple Databricks workspaces.
Move all data into Unity Catalog before configuring Immuta with Unity Catalog. Existing data sources will need to be re-created after they are moved to Unity Catalog and the Unity Catalog integration is configured. If you don't move all data before configuring the integration, will protect your existing data sources throughout the migration process.
In Databricks, with the .
If you will configure the integration using the manual setup option, the Immuta script you will use includes the SQL statements for granting required privileges to the service principal, so you can skip this step and continue to the . Otherwise, . For Databricks Unity Catalog audit to work, the service principal must have the following access at minimum:
Existing data source migration: If you have existing Databricks data sources, complete these steps before proceeding.
: Immuta creates the catalogs, schemas, tables, and functions using the integration's configured service principal.
: Run the Immuta script in Databricks yourself to create the catalog. You can also modify the script to customize your storage location for tables, schemas, or catalogs. The user running the script must have the .
Required permissions: When performing an automatic setup, the Immuta service principal must have the .
is enabled by default; you can disable it by clicking the Enable Native Query Audit checkbox. .
Configure the by scrolling to Integrations Settings and find the Unity Catalog Audit Sync Schedule section.
Access Token: Enter a Databricks Personal Access Token. This is the access token for the Immuta service principal. This service principal must have the for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
Follow for the Immuta service principal and assign this service principal the for the metastore associated with the Databricks workspace.
Fill out the Client ID. This is a combination of letters, numbers, or symbols, used as a public identifier and is the .
Enter the Scope (string). The scope limits the operations and roles allowed in Databricks by the access token. See the for details about scopes.
Follow to create a service principal within Azure and then populate to your Databricks account and workspace.
Assign this service principal the for the metastore associated with the Databricks workspace.
Within Databricks, . This completes your Databricks-based service principal setup.
Fill out the Client ID. This is a combination of letters, numbers, or symbols, used as a public identifier and is the (note that Azure Databricks uses the Azure SP Client ID; it will be identical).
Enter the Scope (string). The scope limits the operations and roles allowed in Databricks by the access token. See the for details about scopes.
Required permissions: When performing a manual setup, a service principal and a Databricks account must have the .
If the Databricks user doesn't exist in Databricks when you configure the integration, after they are created in Databricks. Otherwise, policies will not be enforced correctly for them in Databricks. Databricks user identities for Immuta users are automatically marked as invalid when the user is not found during policy application, preventing them from being affected by Databricks policy until their Immuta user identity is manually mapped to their Databricks identity.
See the for step-by-step guidance to implement phased Snowflake onboarding.
Unity Catalog supports managing permissions account-wide in Databricks through controls applied directly to objects in the metastore. To establish a connection with Databricks and apply controls to securable objects within the metastore, Immuta requires a service principal with permissions to manage all data protected by Immuta. (OAuth M2M) or a personal access token (PAT) can be provided for Immuta to authenticate as the service principal. (See the for a list of specific Databricks privileges.)
Workspace-catalog binding allows users to leverage Databricks’ catalog isolation mode to limit catalog access to specific Databricks workspaces. The default isolation mode is OPEN, meaning all workspaces can access the catalog (with the exception of the automatically-created ), provided they are in the metastore attached to the catalog. Setting this mode to ISOLATED allows the catalog owner to specify a workspace-catalog binding, which means the owner can dictate which workspaces are authorized to access the catalog. This prevents other workspaces from accessing the specified catalogs. To bind a catalog to a specific workspace in Databricks Unity Catalog, see the .
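The OPEN and ISOLATED behavior described above can be modeled with a few lines. This is a simplified sketch rather than Databricks or Immuta code: the dictionary layout, the workspace_can_access function, and the identifiers are hypothetical, and the exception for the automatically created catalog mentioned above is not modeled.

```python
def workspace_can_access(catalog, workspace):
    """Return True if `workspace` may access `catalog` under its isolation mode.

    Both arguments are plain dicts used only for this illustration.
    """
    if workspace["metastore"] != catalog["metastore"]:
        return False  # the workspace must be in the catalog's metastore
    if catalog["isolation_mode"] == "OPEN":
        return True  # default: every workspace in the metastore has access
    # ISOLATED: only workspaces bound by the catalog owner have access.
    return workspace["id"] in catalog["bound_workspaces"]


finance = {
    "metastore": "metastore-1",
    "isolation_mode": "ISOLATED",
    "bound_workspaces": {"ws-finance"},
}
shared = {
    "metastore": "metastore-1",
    "isolation_mode": "OPEN",
    "bound_workspaces": set(),
}
ws_finance = {"id": "ws-finance", "metastore": "metastore-1"}
ws_marketing = {"id": "ws-marketing", "metastore": "metastore-1"}

print(workspace_can_access(finance, ws_finance))    # True
print(workspace_can_access(finance, ws_marketing))  # False
print(workspace_can_access(shared, ws_marketing))   # True
```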
Immuta’s Databricks Unity Catalog integration allows users to configure additional workspace connections to support using Databricks' feature. Prior to supporting additional workspace connections, a Unity Catalog metastore in Databricks was available unrestricted across workspaces in Databricks, which made integrating against a metastore independent of the workspace attached to that metastore possible. However, when additional workspace connections are configured, you can assign additional workspaces to a specific catalog (provided those workspaces are still within the same metastore). Each additional workspace connection is responsible for a catalog or set of catalogs and can be attached to its own compute if desired.
Users can configure their Immuta integrations to be consistent with the workspace-catalog boundaries that they’ve configured in Databricks. If you are using additional workspace connections, you can configure them in your Databricks Unity Catalog integration through the Integrations API when you or by .
Additional workspace connections in Databricks Unity Catalog are not currently supported in Immuta's .
Table-level security: Immuta manages and privileges on securable objects in Databricks through subscription policies. When you create a subscription policy in Immuta, Immuta uses the Unity Catalog API to issue GRANTS or REVOKES against the catalog, schema, or table in Databricks for every user affected by that subscription policy.
Regex: You must use the global regex flag (g) when creating a regex masking policy in this integration. You cannot use the case insensitive regex flag (i) when creating a regex masking policy in this integration. See the for examples.
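To make the effect of these flags concrete, the sketch below uses Python's re module purely as an illustration; it is not the regex engine Immuta or Databricks uses, and the sample strings are made up. In Python, re.sub replaces every match by default (comparable to the g flag), while count=1 approximates a non-global replacement. Because the i flag is unavailable in this integration, one workaround is to spell out both cases in character classes.

```python
import re

text = "Call 555-1234 or 555-9876"

# Global behavior: every digit is masked.
all_masked = re.sub(r"\d", "#", text)

# Non-global behavior (first match only), shown for contrast.
first_only = re.sub(r"\d", "#", text, count=1)

# Matching regardless of case without a case-insensitive flag:
# list both cases explicitly in character classes.
company = re.sub(r"[Aa][Cc][Mm][Ee]", "REDACTED", "ACME and acme and Acme")

print(all_masked)  # Call ###-#### or ###-####
print(first_only)  # Call #55-1234 or 555-9876
print(company)     # REDACTED and REDACTED and REDACTED
```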
Project-scoped purpose exceptions for Databricks Unity Catalog integrations allow you to apply to Databricks data sources in a project. As a result, users can only access that data when they are working within that specific project.
This feature allows masked columns to be joined across data sources that belong to the same project. When data sources do not belong to a project, Immuta uses a unique salt per data source for hashing to prevent masked values from being joined. (See the guide for an explanation of that behavior.) However, once you add Databricks Unity Catalog data sources to a project and enable masked joins, Immuta uses a consistent salt across all the data sources in that project to allow the join.
For more information about masked joins and enabling them for your project, see the of documentation.
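The role of the salt can be sketched as follows. The hash function, salt values, and mask helper are assumptions chosen for illustration; the source does not state which hashing algorithm or salt format Immuta uses.

```python
import hashlib


def mask(value, salt):
    """Hash a value with a salt (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


email = "pat@example.com"

# Outside a project: each data source has its own salt, so the same raw value
# produces different masked values and the masked columns cannot be joined.
orders_masked = mask(email, salt="salt-for-orders")
customers_masked = mask(email, salt="salt-for-customers")

# Inside a project with masked joins enabled: one salt is shared by all of the
# project's data sources, so equal raw values produce equal masked values.
project_salt = "salt-for-project"
orders_in_project = mask(email, project_salt)
customers_in_project = mask(email, project_salt)

print(orders_masked == customers_masked)          # False
print(orders_in_project == customers_in_project)  # True
```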
However, with you can use hive_metastore and enforce subscription and data policies with the .
Personal access token (PAT): This is the access token for the Immuta service principal. This service principal must have the metastore privileges listed in the section for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
OAuth machine-to-machine (M2M): Immuta uses the to integrate with , which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal’s identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the for more details.
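The token lifecycle described for OAuth M2M (obtain a temporary token, reuse it, and request a new one after it expires) can be sketched as a small cache. The TokenCache class, the fetch_token callable, and the fake clock are hypothetical; no real Databricks endpoint is called, and the one-hour lifetime is taken from the description above.

```python
import time


class TokenCache:
    """Reuse a temporary token until it expires, then request a new one."""

    def __init__(self, fetch_token, lifetime_seconds=3600, clock=time.monotonic):
        self.fetch_token = fetch_token  # callable that exchanges the client secret for a token
        self.lifetime_seconds = lifetime_seconds
        self.clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self.clock()
        if self._token is None or now >= self._expires_at:
            self._token = self.fetch_token()
            self._expires_at = now + self.lifetime_seconds
        return self._token


class FakeClock:
    """Controllable clock so the example does not need to wait an hour."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


issued = []


def fake_fetch_token():
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


clock = FakeClock()
tokens = TokenCache(fake_fetch_token, lifetime_seconds=3600, clock=clock)

t0 = tokens.get()   # first request: a token is fetched
clock.now = 1800
t1 = tokens.get()   # 30 minutes later: the same token is reused
clock.now = 3600
t2 = tokens.get()   # one hour later: the token has expired, so a new one is fetched

print(t0, t1, t2)   # token-1 token-1 token-2
```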
The Unity Catalog data object model introduces a 3-tiered namespace, as . Consequently, your Databricks tables registered as data sources in Immuta will reference the catalog, schema (also called a database), and table.
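A minimal sketch of the three-tier reference is shown below. TableRef and parse_table_name are hypothetical helpers, and the parser assumes simple identifiers that contain no dots or quoting.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    catalog: str
    schema: str  # also called a database
    table: str


def parse_table_name(name):
    """Split 'catalog.schema.table' into its three parts (simple identifiers only)."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got: {name!r}")
    return TableRef(*parts)


ref = parse_table_name("main.sales.orders")
print(ref.catalog, ref.schema, ref.table)  # main sales orders
```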
External data connectors and query-federated tables are preview features in Databricks. See the for details about the support and limitations of these features before registering them as data sources in the Unity Catalog integration.
The Databricks Unity Catalog integration audits user queries run in clusters or SQL warehouses for deployments configured with the Databricks Unity Catalog integration. The audit ingest is set when and the audit logs can be scoped to only ingest specific workspaces if needed.
See the for details about manually prompting ingest of audit logs and the contents of the logs.
Once external tags are applied to Databricks data sources, those tags can be used to create and .
To enable Databricks Unity Catalog tag ingestion, see the page.
After making changes to tags in Databricks, you can so that the changes immediately apply to the data sources in Immuta. Otherwise, tag changes will automatically sync within 24 hours.
for a list of requirements.
Unity Catalog row- and column-level security controls are unsupported for single-user clusters. See the for details about this limitation.
Host files in and provide access by the cluster
Host files in Gen 1 or Gen 2 and provide access by the cluster
Host files on an server accessible by the cluster
Host files in (Not recommended for production)
Create an instance profile for clusters by following .
Upload the configuration file, JSON file, and JAR file to an .
Upload the configuration file, JSON file, and JAR file to .
Upload the artifacts directly to using the .
It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the immuta-spark-hive.jar file, as this could create a loophole around Immuta policy enforcement. Therefore, use to apply environment variables to an Immuta-enabled cluster in a secure way.
Create a cluster in Databricks by following the .
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the section.)
In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be as mentioned above.
To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI environment variable referenced in the section to be the full URI to this file.
S3 can also be accessed using an IAM role attached to the cluster. See the for more details.
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the .
See the for a detailed walkthrough.
It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of immuta.user.mapping.iamid created and hosted in the previous steps must be updated to be the targeted IAM ID configured within the Immuta tenant. The IAM ID can be found on the . Each Databricks cluster can only be mapped to one IAM.
Create a cluster in Databricks by following the .
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the section.)
When the Immuta-enabled Databricks cluster has been successfully started, Immuta will create an immuta database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the immuta database. Additionally, when configuring a Databricks cluster you can hide immuta from any calls to SHOW DATABASES so that users aren't misled or confused by its presence. For more details, see the page.
See the for a detailed walkthrough of creating Databricks data sources in Immuta.
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the .
See the for known limitations.
: The steps to enable the integration with this method include
: The steps to enable the integration with this method include
If your compliance requirements restrict users from changing projects within a session, you can block the use of Immuta's project UDFs on a Databricks Spark cluster. To do so, configure the immuta.spark.databricks.disabled.udfs option as described on the .
Databricks instance: Premium tier workspace and
Access to
Note: Azure Databricks authenticates users with Microsoft Entra ID. Be sure to configure your Immuta tenant with an IAM that uses the same user ID as does Microsoft Entra ID. Immuta's Spark security plugin will look to match this user ID between the two systems. See this for details.
See for a list of Databricks Runtimes Immuta supports.
: Databricks users can see the on queried tables if they are allowed to read raw data and meet specific qualifications.
: Users can register their Databricks Libraries with Immuta as trusted libraries, allowing Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries.
: Immuta supports the use of external metastores in local or remote mode.
: In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths.
Users can have additional write access in their integration using project workspaces. Users can integrate a single or multiple workspaces with a single Immuta tenant. For more details, see the page.
The Immuta Databricks Spark integration cannot ingest tags from Databricks, but you can connect any of these to work with your integration.
Native impersonation allows users to natively query data as another Immuta user. To enable native user impersonation, see the page.
Immuta will audit queries that come from interactive notebooks, notebook jobs, and JDBC connections, but will not audit . Furthermore, Immuta only audits Spark jobs that are associated with Immuta tables. Consequently, Immuta will not audit a query in a notebook cell that does not trigger a Spark job, unless immuta.spark.audit.all.queries is set to true; for more details about this configuration and auditing all queries in Databricks, see .
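The audit rule above can be condensed into a small decision function. This is a simplified model, not Immuta's implementation: is_audited and its parameters are hypothetical, and the query types that are never audited are not modeled.

```python
def is_audited(triggers_spark_job, touches_immuta_table, audit_all_queries=False):
    """Simplified model of whether a notebook cell or query produces an audit record.

    audit_all_queries mirrors immuta.spark.audit.all.queries being set to true.
    """
    if audit_all_queries:
        return True
    # Default: only Spark jobs associated with Immuta tables are audited.
    return triggers_spark_job and touches_immuta_table


print(is_audited(True, True))                            # True
print(is_audited(True, False))                           # False
print(is_audited(False, False))                          # False
print(is_audited(False, False, audit_all_queries=True))  # True
```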
To audit the code or query that triggers the Spark plan, Immuta hooks into Databricks where notebook cells and JDBC queries execute and saves the cell or query text. Then, Immuta pulls this information into the audits of the resulting Spark jobs. Examples of a saved cell/query and the resulting audit record are provided on the page.
In most cases, Immuta’s runs automatically from the Immuta web service. For Databricks, that automatic job is disabled because of the . In this case, Immuta requires users to download a schema detection job template (a Python script) and import that into their Databricks workspace. See the guide for details.
5w4502 | REDAC | 990 |
6e3611 | REDAC | 750 |
9s7934 | REDAC | 380 |
11.3 LTS | 2023.1 and newer |
10.4 LTS | 2022.2.x and newer |
7.3 LTS 9.1 LTS | 2021.5.x and newer |
Immuta supports the use of external metastores in local or remote mode, following the same configuration detailed in the Databricks documentation.
Download the metastore jars and point to them as specified in Databricks documentation. Metastore jars must end up on the cluster's local disk at this explicit path: /databricks/hive_metastore_jars.
If using DBR 7.x with Hive 2.3.x, either
Set spark.sql.hive.metastore.version to 2.3.7 and spark.sql.hive.metastore.jars to builtin
or
Download the metastore jars and set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/* as before.
To use AWS Glue Data Catalog as the metastore for Databricks, see the Databricks documentation.
Additional overhead: In relation to the Python & SQL cluster policy, this configuration trades some additional overhead for added support of the R language.
In this configuration, you are able to rely on the Databricks-native security controls. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes.
Like the Python & SQL configuration, Py4j security is enabled for the Python & SQL & R configuration. However, because R has been added, Immuta enables the SecurityManager, in addition to Py4j security, to provide more security guarantees. For example, by default all actions in R execute as the root user; among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To address these security issues, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user with limited filesystem and network access and installs the Immuta SecurityManager, which prevents users from bypassing policies and protects against the above vulnerabilities from within the JVM.
Consequently, the cost of introducing R is that the SecurityManager incurs a small increase in performance overhead; however, average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library.
When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
For full details on Databricks’ best practices in configuring clusters, read their governance documentation.
Py4j security disabled: In addition to support for Python, SQL, and R, this configuration adds support for additional Python libraries and utilities by disabling Databricks-native Py4j security.
This configuration does not rely on Databricks-native Py4j security to secure the cluster, while process isolation is still enabled to secure filesystem and network access from within Python processes. On an Immuta-enabled cluster, once Py4J security is disabled the Immuta SecurityManager is installed to prevent nefarious actions from Python in the JVM. Disabling Py4J security also allows for expanded Python library support, including many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs.
By default, all actions in R will execute as the root user. Among other things, this permits access to the entire filesystem (including sensitive configuration data). And without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To properly support the use of the R language, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user. This user has limited filesystem and network access. The Immuta SecurityManager is also installed to prevent users from bypassing policies and protects against the above vulnerabilities from within the JVM.
The SecurityManager will incur a small increase in performance overhead; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
A homogeneous cluster is recommended for configurations where Py4J security is disabled. If all users have the same level of authorization, there would not be any data leakage, even if a nefarious action was taken.
For full details on Databricks’ best practices in configuring clusters, read their governance documentation.
Scala clusters: This configuration is for Scala-only clusters.
Where Scala language support is needed, this configuration can be used in the Custom access mode.
According to Databricks’ cluster type support documentation, Scala clusters are intended for single users only. However, nothing inherently prevents a Scala cluster from being configured for multiple users. Even with the Immuta SecurityManager enabled, there are limitations to user isolation within a Scala job.
For a secure configuration, it is recommended that clusters intended for Scala workloads are limited to Scala jobs only and are made homogeneous through the use of project equalization or externally via convention/cluster ACLs. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
For full details on Databricks’ best practices in configuring clusters, read their governance documentation.
Performance: This is the most performant policy configuration.
In this configuration, Immuta is able to rely on Databricks-native security controls, reducing overhead. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes. This Immuta cluster configuration relies on Py4J security being enabled.
Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library. Additionally, only Python and SQL are available as supported languages.
For full details on Databricks’ best practices in configuring clusters, read their governance documentation.
This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.
Language support: R and Scala are supported, but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit jobs are not supported by the Databricks Spark integration.
Using R in a notebook: Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions.
spark-submit
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
Initialize the Spark session by entering these settings into the R submit script: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
This will enable the R script to access Immuta data sources, scratch paths, and workspace tables.
Once the script is written, upload the script to a location in dbfs/S3/ABFS to give the Databricks cluster access to it.
spark-submit Job
To create the R spark-submit job,
Go to the Databricks jobs page.
Create a new job, and select Configure spark-submit.
Set up the parameters:
Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to be a supported version.
Configure the Environment Variables section as you normally would for an Immuta cluster.
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.
The spark-submit job must be launched using a different classloader that points at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:
spark-submit job
To create the Scala spark-submit job,
Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.
Select Configure spark-submit, and configure the parameters:
Note: The --class parameter takes the fully-qualified name of the class whose main function will be used as the entry point for your code.
Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to a supported version.
Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" (where dbfs:/path/to/code.jar is the path to your jar) so that the jar is uploaded to all the cluster nodes.
The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.
You can optionally use the immuta.api.key setting with an Immuta API key generated on the Immuta profile page.
Currently, generating an API key invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set immuta.api.key on all the clusters, or use a specified job user for the submit job.
Single-user clusters recommended
Like Databricks, Immuta recommends single-user clusters for sparklyr when user isolation is required. A single-user cluster can either be a job cluster or a cluster with credential passthrough enabled. Note: spark-submit jobs are not currently supported.
Two cluster types can be configured with sparklyr: Single-User Clusters (recommended) and Multi-User Clusters (discouraged).
Single-User Clusters: Credential Passthrough (required on Databricks) allows a single-user cluster to be created. This setting automatically configures the cluster to assume the role of the attached user when reading from storage. Because Immuta requires that raw data is readable by the cluster, the instance profile associated with the cluster should be used rather than a role assigned to the attached user.
Multi-User Clusters: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, it is not recommended to deploy a multi-user cluster. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).
In addition to the configuration for an Immuta cluster with R, add this environment variable to the Environment Variables section of the cluster:
This configuration makes changes to the iptables rules on the cluster to allow the sparklyr client to connect to the required ports on the JVM used by the sparklyr backend service.
Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so library(sparklyr) in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from CRAN. Additionally, loading library(DBI) will allow you to execute SQL queries.
Set up a sparklyr connection:
Pass the connection object to execute queries:
Add the following items to the Spark Config section of the cluster:
The trustedFileSystems setting is required to allow Immuta’s wrapper FileSystem (used in conjunction with the ImmutaSecurityManager for data security purposes) to be used with credential passthrough. Additionally, the InstanceProfileCredentialsProvider must be configured to continue using the cluster’s instance profile for data access, rather than a role associated with the attached user.
Avoid deploying multi-user clusters with sparklyr configuration
It is possible, but not recommended, to deploy a multi-user cluster sparklyr configuration. Immuta cannot guarantee user isolation in a multi-user sparklyr configuration.
The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta native workspaces.
Add the following environment variables to the Environment Variables section of your cluster configuration:
Add the following items to the Spark Config section:
Immuta’s integration with sparklyr does not currently support
spark-submit jobs,
UDFs, or
Databricks Runtimes 5, 6, or 7.
Error Message: py4j.security.Py4JSecurityException: Constructor <> is not whitelisted
Explanation: This error indicates you are being blocked by Py4J security rather than the Immuta Security Manager. Py4J security is strict and generally ends up blocking many ML libraries.
Solution: Turn off Py4J security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4J security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.
This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended).
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
Environment variable overrides
Properties in the config file can be overridden during installation using environment variables. The variable names are the config names in all upper case with _ instead of . (a period). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
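The naming rule above (upper-case the config name and replace each period with an underscore) can be sketched as a small Python helper. The function name is hypothetical and only illustrates the rule; it is not part of Immuta.

```python
def config_to_env_var(config_name: str) -> str:
    """Return the environment variable name that overrides a config property.

    Follows the rule stated above: all upper case, with "_" in place of ".".
    """
    return config_name.upper().replace(".", "_")


# Example: the property from the text and one other property from this page.
print(config_to_env_var("immuta.base.url"))
print(config_to_env_var("immuta.spark.databricks.py4j.strict.enabled"))
```

The second call yields the same variable name used elsewhere on this page for Py4J security (IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED).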
immuta.ephemeral.host.override
Default: true
Description: Set this to false if ephemeral overrides should not be enabled for Spark. When true, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.
immuta.ephemeral.host.override.httpPath
Description: This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
immuta.ephemeral.table.path.check.enabled
Default: true
Description: When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.
immuta.spark.acl.enabled
Default: true
Description: Immuta Access Control List (ACL). Controls whether Databricks users are blocked from accessing non-Immuta tables. Ignored if Databricks Table ACLs are enabled (i.e., spark.databricks.acl.dfAclsEnabled=true).
immuta.spark.acl.whitelist
Description: Comma-separated list of Databricks usernames who may access raw tables when the Immuta ACL is in use.
immuta.spark.acl.privileged.timeout.seconds
Default: 3600
Description: The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is whitelisted in immuta.spark.acl.whitelist.
immuta.spark.acl.assume.not.privileged
Default: false
Description: Session property that overrides privileged user status when the Immuta ACL is in use. This should only be used in R scripts associated with spark-submit jobs.
immuta.spark.audit.all.queries
Default: false
Description: Enables auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.
immuta.spark.databricks.allow.non.immuta.reads
Default: false
Description: Allows non-privileged users to SELECT from tables that are not protected by Immuta. See Limited Enforcement in Databricks Spark for details about this feature.
immuta.spark.databricks.allow.non.immuta.writes
Default: false
Description: Allows non-privileged users to run DDL commands and data-modifying commands against tables or spaces that are not protected by Immuta. See Limited Enforcement in Databricks Spark for details about this feature.
immuta.spark.databricks.allowed.impersonation.users
Description: This configuration is a comma-separated list of Databricks users who are allowed to impersonate Immuta users.
immuta.spark.databricks.dbfs.mount.enabled
Default: false
Description: Exposes the DBFS FUSE mount located at /dbfs. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.
immuta.spark.databricks.disabled.udfs
Description: Block one or more Immuta user-defined functions (UDFs) from being used on an Immuta cluster. This should be a Java regular expression that matches the set of UDFs to block by name (excluding the immuta database). For example, to block all project UDFs, you may configure this to be ^.*_projects?$. For a list of functions, see the project UDFs page.
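To see which names the example expression would match, here is a minimal Python sketch. The setting itself expects a Java regular expression; for this simple pattern, Python's re module behaves the same way. The UDF names below are made up for illustration and are not the actual Immuta function names.

```python
import re

# The example pattern from the description above.
BLOCK_PATTERN = re.compile(r"^.*_projects?$")

# Hypothetical UDF names, used only to illustrate the matching behavior.
candidate_udfs = [
    "example_project",
    "example_projects",
    "example_udf",
    "projects_helper",
]

# A name is blocked when the whole name ends in "_project" or "_projects".
blocked = [name for name in candidate_udfs if BLOCK_PATTERN.match(name)]
print(blocked)
```

Names that merely contain "projects" elsewhere (such as the made-up projects_helper) are not matched, because the pattern is anchored to the end of the name.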
immuta.spark.databricks.filesystem.blacklist
Default: hdfs
Description: A list of filesystem protocols that this instance of Immuta will not support for workspaces. This is useful in cases where a filesystem is available to a cluster but should not be used on that cluster.
immuta.spark.databricks.jar.uri
Default: file:///databricks/jars/immuta-spark-hive.jar
Description: The location of immuta-spark-hive.jar on the filesystem for Databricks. This should not need to change unless a custom initialization script that places immuta-spark-hive in a non-standard location is necessary.
immuta.spark.databricks.local.scratch.dir.enabled
Default: true
Description: Creates a world-readable/writable scratch directory on local disk to facilitate the use of dbutils and 3rd party libraries that may write to local disk. Its location is non-configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR. Note: Sensitive data should not be stored at this location.
immuta.spark.databricks.log.level
Default: INFO
Description: The SLF4J log level to apply to Immuta's Spark plugins.
immuta.spark.databricks.log.stdout.enabled
Default: false
Description: If true, writes logging output to stdout/the console as well as the log4j-active.txt file (default in Databricks).
immuta.spark.databricks.py4j.strict.enabled
Default: true
Description: Disable to allow the use of the dbutils API in Python. Note: This setting should only be disabled for customers who employ a homogeneous integration (i.e., all users have the same level of data access).
immuta.spark.databricks.scratch.database
Description: This configuration is a comma-separated list of additional databases that will appear as scratch databases when running a SHOW DATABASES query. This configuration increases performance by circumventing the Metastore to get the metadata for all the databases to determine what to display for a SHOW DATABASES query; it won't affect access to the scratch databases. Instead, use immuta.spark.databricks.scratch.paths to control read and write access to the underlying database paths.
Additionally, this configuration will only display the scratch databases that are configured and will not validate that the configured databases exist in the Metastore. Therefore, it is up to the Databricks administrator to properly set this value and keep it current.
immuta.spark.databricks.scratch.paths
Description: Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db for the default location).
To create a scratch path to a location or a database stored at that location, configure
To create a scratch path to a database created using the default location,
immuta.spark.databricks.scratch.paths.create.db.enabled
Default: false
Description: Enables non-privileged users to create or drop scratch databases.
immuta.spark.databricks.single.impersonation.user
Default: false
Description: When true, this configuration prevents users from changing their impersonation user once it has been set for a given Spark session. This configuration should be set when the BI tool or other service allows users to submit arbitrary SQL or issue SET commands.
immuta.spark.databricks.submit.tag.job
Default: true
Description: Denotes whether the Spark job will be run that "tags" a Databricks cluster as being associated with Immuta.
immuta.spark.databricks.trusted.lib.uris
Description: Databricks Trusted Libraries
immuta.spark.non.immuta.table.cache.seconds
Default: 3600
Description: The number of seconds Immuta caches whether a table has been exposed as a source in Immuta. This setting only applies when immuta.spark.databricks.allow.non.immuta.writes or immuta.spark.databricks.allow.non.immuta.reads is enabled.
immuta.spark.require.equalization
Default: false
Description: Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages.
immuta.spark.resolve.raw.tables.enabled
Default: true
Description: Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or whitelisted users can set immuta.spark.session.resolve.raw.tables.enabled to false to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.
immuta.spark.session.resolve.raw.tables.enabled
Default: true
Description: Same as above, but a session property that allows users to toggle this functionality. If users run set immuta.spark.session.resolve.raw.tables.enabled=false, they will see raw data only (not Immuta data policy-enforced data). Note: This property is not set in immuta_conf.xml.
immuta.spark.show.immuta.database
Default: true
Description: This shows the immuta database in the configured Databricks cluster. When set to false, Immuta will no longer show this database when a SHOW DATABASES query is performed. However, queries can still be performed against tables in the immuta database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table) regardless of whether or not this feature is enabled.
immuta.spark.version.validate.enabled
Default: true
Description: Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to true, if versions are incompatible, that information will be logged to the Databricks driver logs and the cluster will not be usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), this check can be set to false so that the versions don't get logged as incompatible and make the cluster unusable.
immuta.user.context.class
Default: com.immuta.spark.OSUserContext
Description: The class name of the UserContext that will be used to determine the current user in immuta-spark-hive. The default implementation gets the OS user running the JVM for the Spark application.
immuta.user.mapping.iamid
Default: bim
Description: Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (bim) but should be updated to reflect an actual production IAM.
Change data feed (CDF) shows the row-level changes between versions of a Delta table. The changes displayed include row data and metadata that indicates whether the row was inserted, deleted, or updated.
Immuta does not support applying policies to the changed data, and the CDF cannot be read for data source tables if the user does not have access to the raw data in Databricks. However, the CDF can be read if the querying user is allowed to read the raw data and one of the following statements is true:
the table is in the current workspace,
the table is in a scratch path,
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting, or
non-Immuta reads are enabled AND the table is not part of an Immuta data source.
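The conditions listed above can be read as a single boolean rule. The following Python sketch only restates that rule; the function and parameter names are hypothetical and do not correspond to any Immuta API.

```python
def can_read_cdf(
    can_read_raw_data: bool,
    in_current_workspace: bool,
    in_scratch_path: bool,
    non_immuta_reads_enabled: bool,
    intersects_workspace_user_is_not_acting_under: bool,
    is_immuta_data_source: bool,
) -> bool:
    """Restate the CDF read conditions listed above as one expression."""
    # The querying user must always be allowed to read the raw data.
    if not can_read_raw_data:
        return False
    # At least one of the listed statements must also be true.
    return (
        in_current_workspace
        or in_scratch_path
        or (non_immuta_reads_enabled
            and not intersects_workspace_user_is_not_acting_under)
        or (non_immuta_reads_enabled and not is_immuta_data_source)
    )
```

For example, a table that is an Immuta data source, sits outside the current workspace and scratch paths, and intersects a workspace the user is not acting under would not be readable through CDF even with non-Immuta reads enabled.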
There are no configuration changes necessary to use this feature.
Immuta does not support reading changes in streaming queries.
It is most secure to leverage an equalized project when working in a Scala cluster; however, it is not required to limit Scala to equalized projects. This document outlines security recommendations for Scala clusters and discusses the security risks involved when equalized projects are not used.
Language support: R and Scala are both supported, but require advanced configuration; work with your Immuta support professional to use these languages.
There are limitations to isolation among users in Scala jobs on a Databricks cluster, even when using Immuta’s SecurityManager. When data is broadcast, cached (spilled to disk), or otherwise saved to SPARK_LOCAL_DIR, it is impossible to distinguish which user’s data is contained in each file/block. If you are concerned about this vulnerability, Immuta suggests that Scala clusters
be limited to Scala jobs only and
use project equalization, which forces all users to act under the same set of attributes, groups, and purposes with respect to their data access.
When data is read in Spark using an Immuta policy-enforced plan, the masking and redaction of rows is performed at the leaf level of the physical Spark plan, so a policy such as "Mask using hashing the column social_security_number for everyone" would be implemented as an expression on a project node right above the FileSourceScanExec/LeafExec node at the bottom of the plan. This process prevents raw data from being shuffled in a Spark application and, consequently, from ending up in SPARK_LOCAL_DIR.
This policy implementation coupled with an equalized project guarantees that data being dropped into SPARK_LOCAL_DIR will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached/spilled data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.
To require that Scala clusters be used in equalized projects and avoid the risk described above, change the immuta.spark.require.equalization value to true in your Immuta configuration file when you spin up Scala clusters:
Once this configuration is complete, users on the cluster will need to switch to an Immuta equalized project before running a job. (Remember that when working under an Immuta Project, only tables within that project can be seen.) Once the first job is run using that equalized project, all subsequent jobs, no matter the user, must also be run under that same equalized project. If you need to change a cluster's project, you must restart the cluster.
This page describes how the Security Manager is disabled for Databricks clusters that do not allow R or Scala code to be executed. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended).
The Immuta Security Manager is an essential element of the Databricks Spark deployment that ensures users can't perform unauthorized actions when using Scala and R, since those languages have features that allow users to circumvent policies without the Security Manager enabled. However, the Security Manager must inspect the call stack every time a permission check is triggered, which adds overhead to queries. To improve Immuta's query performance on Databricks, Immuta disables the Security Manager when Scala and R are not being used.
The cluster init script checks the cluster’s configuration and automatically removes the Security Manager configuration when
spark.databricks.repl.allowedLanguages is a subset of {python, sql} and
IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED is true.
When the cluster is configured this way, Immuta can rely on Databricks' process isolation and Py4J security to prevent user code from performing unauthorized actions.
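As a rough illustration of the check described above, the following Python sketch assumes that both conditions must hold and that an unset or empty language list does not qualify. The function name and the comma-separated input format are assumptions made for this example; the actual init script logic is not shown on this page.

```python
def security_manager_can_be_removed(allowed_languages: str,
                                    py4j_strict_enabled: bool) -> bool:
    """Sketch of the condition under which the Security Manager is removed.

    allowed_languages is assumed to be a comma-separated value such as
    "python,sql" (an assumption for this example).
    """
    languages = {
        lang.strip().lower()
        for lang in allowed_languages.split(",")
        if lang.strip()
    }
    # Assumption: an empty list means languages are unrestricted, so it
    # does not qualify.
    if not languages:
        return False
    # The allowed languages must be a subset of {python, sql}, and Py4J
    # strict security must be enabled.
    return languages <= {"python", "sql"} and py4j_strict_enabled


print(security_manager_can_be_removed("python,sql", True))
print(security_manager_can_be_removed("python,sql,scala", True))
```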
Note: Immuta still expects the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions to be set and pointing at the Security Manager.
Beyond disabling the Security Manager, Immuta will skip several startup tasks that are required to secure the cluster when Scala and R are configured, and fewer permission checks will occur on the Driver and Executors in the Databricks cluster, reducing overhead and improving performance.
There are still cases that require the Security Manager; in those instances, Immuta creates a fallback Security Manager to check the code path, so the IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI environment variable must always point to a valid calling class file.
Databricks’ dbutils.fs is blocked by their Py4J security; therefore, it can’t be used to access scratch paths.
Ephemeral overrides best practices
Disable ephemeral overrides for clusters when using multiple workspaces and dedicate a single cluster to serve queries from Immuta in a single workspace.
If you use multiple E2 workspaces without disabling ephemeral overrides, avoid applying the where user row-level policy to data sources.
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, Immuta plugins automatically submit ephemeral overrides for that user to Immuta for all applicable data sources to use the current cluster as compute for all subsequent metadata operations for that user against the applicable data sources.
A user runs a query on cluster B.
The Immuta plugins on the cluster check if there is a source in the Metastore with a matching database, table name, and location for its underlying data. Note: If tables are dynamic or change over time, users can disable the comparison of the location of the underlying data by setting immuta.ephemeral.table.path.check.enabled to false; disabling this configuration allows users to avoid keeping the relevant data sources in Immuta up-to-date (which would require API calls and automation).
The Immuta plugins on the cluster detect that the user is subscribed to data sources 1, 2, and 3 and that data sources 1 and 3 are both present in the Metastore for cluster B, so the plugins submit ephemeral override requests for data sources 1 and 3 to override their connections with the HTTP path from cluster B.
Since data source 2 is not present in the Metastore, it is marked as a JDBC source.
If the user attempts to query data source 2 and they have not enabled JDBC sources, they will be presented with an error message telling them to do so:
com.immuta.spark.exceptions.ImmutaConfigurationException: This query plan will cause data to be pulled over JDBC. This spark context is not configured to allow this. To enable JDBC set immuta.enable.jdbc=true in the spark context hadoop configuration.
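The walkthrough above amounts to a set comparison between the data sources a user is subscribed to and the tables found in the cluster's Metastore. The Python sketch below models only that decision; the function name and the HTTP path string are placeholders for illustration, not Immuta internals.

```python
def plan_ephemeral_overrides(subscribed, in_metastore, cluster_http_path):
    """Split subscribed data sources into override requests and JDBC sources.

    subscribed and in_metastore are sets of data source identifiers.
    """
    # Sources present in this cluster's Metastore get an override that
    # points their connection at the current cluster's HTTP path.
    overrides = {
        source: cluster_http_path
        for source in sorted(subscribed & in_metastore)
    }
    # Subscribed sources missing from the Metastore are treated as JDBC.
    jdbc_sources = sorted(subscribed - in_metastore)
    return overrides, jdbc_sources


# The scenario from the walkthrough: sources 1, 2, and 3 on cluster B.
overrides, jdbc_sources = plan_ephemeral_overrides(
    subscribed={1, 2, 3},
    in_metastore={1, 3},
    cluster_http_path="cluster-b-http-path",  # placeholder value
)
print(overrides)
print(jdbc_sources)
```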
Ephemeral overrides are enabled by default because Immuta must be aware of a cluster that is running to serve metadata queries. The operations that use the ephemeral overrides include
Visibility checks on the data source for a particular user. These checks assess how to apply row-level policies for specific users.
Stats collection triggered by a specific user.
Validating a custom WHERE clause policy against a data source. When owners or governors create custom WHERE clause policies, Immuta uses compute resources to validate the SQL in the policy. In this case, the ephemeral overrides for the user writing the policy are used to contact a cluster for SQL validation.
High Cardinality Column detection. Certain advanced policy types (e.g., minimization and randomized response) in Immuta require a High Cardinality Column, and that column is computed on data source creation. It can be recomputed on demand and, if so, will use the ephemeral overrides for the user requesting computation.
However, ephemeral overrides can be problematic in environments that have a dedicated cluster to handle maintenance activities, since ephemeral overrides can cause these operations to execute on a different cluster than the dedicated one.
To reduce the risk that a user has overrides set to a cluster (or multiple clusters) that aren't currently up,
direct all clusters' HTTP paths for overrides to a cluster dedicated for metadata queries or
disable overrides completely.
To disable ephemeral overrides, set immuta.ephemeral.host.override in spark-defaults.conf to false.
In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to interact with their data using file paths or who have existing workflows revolving around file paths can continue to use these workflows without rewriting those queries for Immuta.
When reading from a path in Spark, the Immuta Databricks Spark plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.
Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). Use the tabs below to view examples of reading data using these methods.
To read from an individual file, load a partition file from a sub-directory:
To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:
Alternatively, load a parquet partition using a where predicate:
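As a rough PySpark sketch of the sub-directory and where-predicate approaches: the function names are hypothetical, spark is assumed to be the active SparkSession, and the table path, partition column, and value are placeholders supplied by the caller.

```python
def read_partition_by_path(spark, table_path: str, column: str, value: str):
    """Load one parquet partition by pointing at its sub-directory."""
    partition_path = f"{table_path}/{column}={value}"
    return spark.read.format("parquet").load(partition_path)


def read_partition_by_predicate(spark, table_path: str, column: str, value: str):
    """Load the table path, then select the partition with a where predicate."""
    # The value is compared as a string literal here; adjust the predicate
    # if the partition column has a different type.
    return (
        spark.read.format("parquet")
        .load(table_path)
        .where(f"{column} = '{value}'")
    )
```

In a notebook, either function would be called with the session object and a path backed by an Immuta data source the current user is subscribed to.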
Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
In Databricks, multiple input paths are supported as long as they belong to the same data source.
CSV-backed tables are not currently supported.
Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:
This page provides an overview of Immuta's Databricks Trusted Libraries feature and support of Notebook-Scoped Libraries on Machine Learning Clusters.
The Immuta security manager blocks users from executing code that could allow them to gain access to sensitive data by only allowing select code paths to access sensitive files and methods. These select code paths provide Immuta's code access to sensitive resources while blocking end users from these sensitive resources directly.
Similarly, when users install third-party libraries those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. An administrator can specify an installed library as "trusted," which will enable that library's code to bypass the Immuta security manager. Contact your Immuta support professional for custom security configurations for your libraries.
This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through what previously would have been blocked by the security manager.
Security vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine whether this risk applies to your environment or use case.
Databricks Libraries API: Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
Library source is Upload, DBFS or DBFS/S3 and the Library Type is Jar.
Library source is Maven.
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
waiting for library installation to complete before running any third-party library commands or
executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util library, the user will not be able to directly use the file-util library to read a sensitive file that is normally protected by the Immuta security manager.
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
Add the transitive dependency jar paths to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable. In the driver log4j logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for a log message similar to the following:
In the above example, where slf4j is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.
In case of failure, check the driver logs for details. Some possible causes of failure include the following:
One of the Immuta configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.
For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version.
Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.
For details about configuring trusted libraries, navigate to the installation guide.
Users on Databricks runtimes 8+ can manage notebook-scoped libraries with %pip commands.
However, this functionality differs from Immuta's trusted libraries feature, and Python libraries are still not supported as trusted libraries. The Immuta Security Manager will deny the code of libraries installed with %pip access to sensitive resources.
No additional configuration is needed to enable this feature. Users only need to be running on clusters with DBR 8+.
Cluster 1 | 9.1 | Unavailable | Unavailable |
Cluster 2 | 10.4 | Unavailable | Unavailable |
Cluster 3 | 11.3 | Unavailable |
Cluster 4 | 11.3 |
Cluster 5 | 11.3 |
Databricks Spark is a plugin integration with Immuta. This integration allows you to protect access to tables and manage row-, column-, and cell-level controls without enabling table ACLs or credential passthrough. Policies are applied to the plan that Spark builds for a user's query and enforced live on-cluster.
An Application Admin will configure Databricks Spark with either the
Simplified Databricks Spark Configuration on the Immuta App Settings page
Manual Databricks Spark Configuration where Immuta artifacts must be downloaded and staged to your Databricks clusters
In both configuration options, the Immuta init script adds the Immuta plugin in Databricks: the Immuta Security Manager, wrappers, and Immuta analysis hook plan rewrite. Once an administrator gives users Can Attach To entitlements on the cluster, they can query Immuta-registered data sources directly in their Databricks notebooks.
Simplified Databricks Spark configuration additional entitlements
The credentials used for the Simplified Databricks Spark configuration with automatic cluster policy push must have the Allow cluster creation entitlement.
This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.
Best practice
Test the integration on an Immuta-enabled cluster with a user that is not a Databricks administrator.
You should register entire databases with Immuta and run Schema Monitoring jobs through the Python script provided during data source registration. Additionally, you should use a Databricks administrator account to register data sources with Immuta using the UI or API; however, you should not test Immuta policies using a Databricks administrator account, as they are able to bypass controls.
A Databricks administrator can control who has access to specific tables in Databricks through Immuta Subscription Policies or by manually adding users to the data source. Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources.
immuta Database
When a table is registered in Immuta as a data source, users can see that table in the native Databricks database and in the immuta database. This allows for an option to use a single database (immuta) for all tables.
After data users have subscribed to data sources, administrators can apply fine-grained access controls, such as restricting rows or masking columns with advanced anonymization techniques, to manage what the users can see in each table. More details on the types of data policies can be found on the Data Policies page, including an overview of masking struct and array columns in Databricks.
Note: Immuta recommends building Global Policies rather than Local Policies, as they allow organizations to easily manage policies as a whole and capture system state in a more deterministic manner.
All access controls must go through SQL.
Note: With R, you must load the SparkR library in a cell before accessing the data.
Usernames in Immuta must match usernames in Databricks. It is best practice to use the same identity manager for Immuta that you use for Databricks (Immuta supports these identity manager protocols and providers). However, for Immuta SaaS users, it’s easiest to just ensure usernames match between systems.
An Immuta Application Administrator configures the Databricks Spark integration and registers available cluster policies Immuta generates.
The Immuta init script adds the immuta plugin in Databricks: the Immuta SecurityManager, wrappers, and Immuta analysis hook plan rewrite.
A Data Owner registers Databricks tables in Immuta as data sources. A Data Owner, Data Governor, or Administrator creates or changes a policy or user in Immuta.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
A Databricks user who is subscribed to the data source in Immuta queries the corresponding table directly in their notebook or workspace.
During Spark Analysis, Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the SecurityManager that it is a trusted node and is allowed to scan raw data.
The Physical Plan is applied and filters out and transforms raw data coming back to the user.
The user sees policy-enforced data.
In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks libraries API.
In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI:
For Maven artifacts, the URI is maven:/<maven_coordinates>, where <maven_coordinates> is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. Here's an example of an installed artifact:
In this example, you would add the following Spark environment variable:
For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. Here's an example of an installed artifact:
In this example, you would add the following Spark environment variable:
Once you've finished making your changes, restart the cluster.
Specifying more than one trusted library
To specify more than one trusted library, comma delimit the URIs:
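As a rough sketch of how such a value could be assembled, the Python below joins several URIs with commas and checks that any Maven URI follows the maven:/group.id:artifact-id:version format described in this guide. The helper name trusted_lib_uris, the regular expression, and the sample URIs (taken from the formats and the slf4j path mentioned above) are illustrative assumptions, not part of Immuta or Databricks.

```python
import re

ENV_VAR = "IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS"

# Assumed shape of a Maven URI: maven:/group.id:artifact-id:version
MAVEN_URI = re.compile(r"^maven:/[^:/\s]+:[^:/\s]+:[^:/\s]+$")


def trusted_lib_uris(uris):
    """Validate Maven URIs and return the comma-delimited value."""
    for uri in uris:
        if uri.startswith("maven:") and not MAVEN_URI.match(uri):
            raise ValueError(f"Maven URI must look like maven:/group.id:artifact-id:version: {uri}")
    return ",".join(uris)


value = trusted_lib_uris([
    "maven:/group.id:artifact-id:version",
    "dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar",
])
print(f"{ENV_VAR}={value}")
```

Validating before joining catches a malformed Maven URI early, rather than discovering it later in the driver logs.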
Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:
Delta Lake API reference guide
When using Delta Lake, the API does not go through the normal Spark execution path. This means that Immuta's Spark extensions do not provide protection for the API. To solve this issue and ensure that Immuta has control over what a user can access, the Delta Lake API is blocked.
Spark SQL can be used instead to give the same functionality with all of Immuta's data protections.
Below is a table of the Delta Lake API with the Spark SQL that may be used instead.
Delta Lake API | Spark SQL |
---|---|
See here for a complete list of the Delta SQL Commands.
When a table is created in a native workspace, you can merge a different Immuta data source from that workspace into that table you created.
Create a table in the native workspace.
Create a temporary view of the Immuta data source you want to merge into that table.
Use that temporary view as the data source you add to the project workspace.
Run the following command:
Private preview: Write policies are only available to select accounts. Contact your Immuta representative to enable this feature.
Starburst (Trino) version 438 or newer
Write policies for Starburst (Trino) enabled. Contact your Immuta representative to get this feature enabled on your account.
In its default setting, the Starburst (Trino) integration's write access value controls the authorization of SQL operations that perform data modification (such as INSERT, UPDATE, DELETE, MERGE, and TRUNCATE). However, administrators can allow table modification operations (such as ALTER and DROP tables) to be authorized as write operations. Two locations allow administrators to specify how read and write access policies are applied to data in Starburst (Trino). Select one or both of the options below to customize these settings. If the access-control.properties file is used, it may override the policies configured in the Immuta web service.
Immuta web service: Configure write policies in the Immuta web service to allow all Starburst (Trino) clusters targeting that Immuta tenant to receive the same write policy configuration for data sources. This configuration will only affect tables or views registered as Immuta data sources.
Starburst (Trino) cluster: Configure write policies using the access-control.properties file in Starburst or Trino to broadly customize access for Immuta users on a specific cluster. This configuration file takes precedence over write policies passed from the Immuta web service. Use this option if all Immuta users should have the same level of access to tables regardless of the write policy setting in the Immuta web service.
Contact your Immuta representative to configure read and write access in the Immuta web service if all Starburst (Trino) data source operations should be affected identically across Starburst (Trino) clusters connected to your Immuta tenant. A configuration example is provided below.
The following example maps WRITE to READ, WRITE, and OWN permissions and READ to just READ. Both READ and WRITE permissions should always include READ:
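The mapping described above can be pictured as a simple lookup. The Python sketch below is not the Immuta web service configuration syntax (that is set up with your Immuta representative). It only models the described mapping and the rule that every entry must include READ. The names ACCESS_MAPPING and validate_mapping are hypothetical.

```python
# Hypothetical model of the mapping: Immuta access type -> permissions granted.
ACCESS_MAPPING = {
    "READ": ["READ"],
    "WRITE": ["READ", "WRITE", "OWN"],
}


def validate_mapping(mapping):
    """Raise if any access type maps to a permission list without READ."""
    missing = [access for access, perms in mapping.items() if "READ" not in perms]
    if missing:
        raise ValueError(f"these access types must include READ: {missing}")
    return mapping


validated = validate_mapping(ACCESS_MAPPING)
print(validated["WRITE"])
```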
Configure the integration to allow read and write policies to apply to any data source (registered or unregistered in Immuta) on a Starburst cluster.
Create the Immuta access control configuration file in the Starburst configuration directory (/etc/starburst/immuta-access-control.properties for Docker installations or <starburst_install_directory>/etc/immuta-access-control.properties for standalone installations).
Modify one or both properties below to customize the behavior of read or write access policies for all users:
immuta.allowed.immuta.datasource.operations: This property governs objects (catalogs, schemas, tables, etc.) that are registered as data sources in Immuta. For these permissions to apply, the user must be subscribed to the data source in Immuta and not be an administrator (who gets all permissions).
READ: Grants SELECT on tables or views; grants SHOW on tables, views, or columns.
WRITE: Grants INSERT, UPDATE, DELETE, MERGE, or TRUNCATE on tables; grants REFRESH on materialized views.
OWN: Grants ALTER and DROP on tables; grants SET on comments and properties.
immuta.allowed.non.immuta.datasource.operations: This property governs objects (catalogs, schemas, tables, etc.) that are not registered as data sources in Immuta. Use all or a combination of the following access values:
READ: Grants SELECT on tables or views; grants SHOW on tables, views, or columns.
WRITE: Grants INSERT, UPDATE, DELETE, MERGE, or TRUNCATE on tables; grants REFRESH on materialized views.
OWN: Grants ALTER and DROP on tables; grants SET on comments and properties.
CREATE: Grants CREATE on catalogs, schemas, tables, and views. This is the only property that can allow CREATE permissions, since CREATE is enforced on new objects that do not exist in Starburst or Immuta yet (such as a new table being created with CREATE TABLE).
For example, the following configuration allows READ, WRITE, and OWN operations to be authorized on data sources registered in Immuta and permits all operations on data that is not registered in Immuta:
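To make the relationship between access values and SQL operations concrete, the Python sketch below models the grants listed above and renders the two properties for the case just described. The comma-separated key=value rendering is an assumption about the properties file syntax, and the helper names render_properties and is_permitted are hypothetical. Check the rendered lines against your own access control file before relying on them.

```python
# Operations granted by each access value, as listed in this section.
GRANTS = {
    "READ": {"SELECT", "SHOW"},
    "WRITE": {"INSERT", "UPDATE", "DELETE", "MERGE", "TRUNCATE", "REFRESH"},
    "OWN": {"ALTER", "DROP", "SET"},
    "CREATE": {"CREATE"},
}

REGISTERED = "immuta.allowed.immuta.datasource.operations"
UNREGISTERED = "immuta.allowed.non.immuta.datasource.operations"


def render_properties(registered, unregistered):
    """Render both properties as key=value lines (comma-separated values assumed)."""
    if "CREATE" in registered:
        raise ValueError("CREATE can only be allowed on objects not registered in Immuta")
    for value in [*registered, *unregistered]:
        if value not in GRANTS:
            raise ValueError(f"unknown access value: {value}")
    return "\n".join([
        f"{REGISTERED}={','.join(registered)}",
        f"{UNREGISTERED}={','.join(unregistered)}",
    ])


def is_permitted(operation, access_values):
    """Return True if any configured access value grants the SQL operation."""
    return any(operation in GRANTS[value] for value in access_values)


config = render_properties(["READ", "WRITE", "OWN"], ["READ", "WRITE", "OWN", "CREATE"])
print(config)
print(is_permitted("DROP", ["READ", "WRITE", "OWN"]))    # True, granted by OWN
print(is_permitted("CREATE", ["READ", "WRITE", "OWN"]))  # False, needs CREATE
```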
Enable the Immuta access control plugin in the Starburst cluster's configuration file (/etc/starburst/config.properties for Docker installations or <starburst_install_directory>/etc/config.properties for standalone installations). For example,
Create the Immuta access control configuration file in the Trino configuration directory (/etc/trino/config.properties for Docker installations or <trino_install_directory>/etc/config.properties for standalone installations).
Modify one or both properties below to customize the behavior of read or write access policies for all users:
immuta.allowed.immuta.datasource.operations: This property governs objects (catalogs, schemas, tables, etc.) that are registered as data sources in Immuta. For these permissions to apply, the user must be subscribed to the data source in Immuta and not be an administrator (who gets all permissions).
READ: Grants SELECT on tables or views; grants SHOW on tables, views, or columns.
WRITE: Grants INSERT, UPDATE, DELETE, MERGE, or TRUNCATE on tables; grants REFRESH on materialized views.
OWN: Grants ALTER and DROP on tables; grants SET on comments and properties.
immuta.allowed.non.immuta.datasource.operations: This property governs objects (catalogs, schemas, tables, etc.) that are not registered as data sources in Immuta. Use all or a combination of the following access values:
READ: Grants SELECT on tables or views; grants SHOW on tables, views, or columns.
WRITE: Grants INSERT, UPDATE, DELETE, MERGE, or TRUNCATE on tables; grants REFRESH on materialized views.
OWN: Grants ALTER and DROP on tables; grants SET on comments and properties.
CREATE: Grants CREATE on catalogs, schemas, tables, and views. This is the only property that can allow CREATE permissions, since CREATE is enforced on new objects that do not exist in Starburst or Immuta yet (such as a new table being created with CREATE TABLE).
For example, the following configuration allows READ, WRITE, and OWN operations to be authorized on data sources registered in Immuta and permits all operations on data that is not registered in Immuta:
Enable the Immuta access control plugin in Trino's configuration file (/etc/trino/config.properties for Docker installations or <trino_install_directory>/etc/config.properties for standalone installations). For example,
Databricks metastore magic allows you to migrate your data from the Databricks legacy Hive metastore to the Unity Catalog metastore while protecting data and maintaining your current processes in a single Immuta tenant.
Databricks metastore magic is for customers who intend to use the Databricks Unity Catalog integration but would also like to protect tables in the Hive metastore.
Unity Catalog support is enabled in Immuta.
Databricks has two built-in metastores that contain metadata about your tables, views, and storage credentials:
Legacy Hive metastore: Created at the workspace level. This metastore contains metadata of the configured tables in that workspace available to query.
Unity Catalog metastore: Created at the account level and is attached to one or more Databricks workspaces. This metastore contains metadata of the configured tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those tables.
Databricks allows you to use the legacy Hive metastore and the Unity Catalog metastore simultaneously. However, Unity Catalog does not support controls on the Hive metastore, so you must attach a Unity Catalog metastore to your workspace and move existing databases and tables to the attached Unity Catalog metastore to use the governance capabilities of Unity Catalog.
Immuta's Databricks Spark integration and Unity Catalog integration enforce access controls on the Hive and Unity Catalog metastores, respectively. However, because these metastores have two distinct security models, users were discouraged from using both in a single Immuta tenant before metastore magic; the Databricks Spark integration and Unity Catalog integration were unaware of each other, so using both concurrently caused undefined behavior.
Metastore magic reconciles the distinct security models of the legacy Hive metastore and the Unity Catalog metastore, allowing you to use multiple metastores (specifically, the Hive metastore or alongside Unity Catalog metastores) within a Databricks workspace and single Immuta tenant and keep policies enforced on all your tables as you migrate them. The diagram below shows Immuta enforcing policies on registered tables across workspaces.
In clusters A and D, Immuta enforces policies on data sources in each workspace's Hive metastore and in the Unity Catalog metastore shared by those workspaces. In clusters B, C, and E (which don't have Unity Catalog enabled in Databricks), Immuta enforces policies on data sources in the Hive metastores for each workspace.
With metastore magic, the Databricks Spark integration enforces policies only on data in the Hive metastore, while the Unity Catalog integration enforces policies on tables in the Unity Catalog metastore. The table below illustrates this policy enforcement.
Databricks SQL cannot run the Databricks Spark plugin to protect tables, so Hive metastore data sources will not be policy enforced in Databricks SQL.
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Legend:
The how-to guides linked on this page illustrate how to integrate Starburst (Trino) with Immuta.
These guides provide information on the recommended features to enable with Starburst (Trino).
Select None as your .
These guides provide instructions for organizing your Starburst (Trino) data to align with your governance structure.
These guides provide instructions for auditing and detecting your users' activity, or see the for a comprehensive guide on the benefits of these features and other recommendations.
or for your .
Public preview: Native SDD for Starburst (Trino) is currently in public preview and available to all accounts.
These guides provide instructions for discovering, classifying, and tagging your data.
Validate the policies. You do not have to validate every policy you create in Immuta; instead, examine a few to validate the behavior you expect to see.
Once all Immuta policies are in place, remove or alter old permissions and revoke access to the ungoverned tables.
In this integration, Immuta generates policy-enforced views in your configured Redshift schema for tables registered as Immuta data sources.
This guide outlines how to integrate Redshift with Immuta.
: Configure the integration in Immuta.
: Configure Redshift Spectrum in Immuta.
: This guide describes the design and components of the integration.
: This reference guide provides descriptions of the possible statuses of a configured integration.
In this integration, Immuta policies are translated into Starburst rules and permissions and applied directly to tables within users’ existing catalogs.
This guide outlines how to integrate Starburst with Immuta.
: Configure the integration in Immuta.
: Configure how read and write access subscription policies translate to Starburst (Trino) privileges and apply to Starburst (Trino) data sources.
: This guide describes the design and components of the integration.
: This reference guide provides descriptions of the possible statuses of a configured integration.
The how-to guides linked on this page illustrate how to integrate Redshift with Immuta.
Requirement: Redshift cluster with an RA3 node is required for the multi-database integration. For other instance types, you may configure a single-database integration using one of the .
These guides provide information on the recommended feature to enable with Redshift.
or .
Select None as your .
These guides provide instructions for organizing your Redshift data to align with your governance structure.
Private preview: Native SDD for Redshift is currently in private preview and available to all accounts.
These guides provide instructions for discovering, classifying, and tagging your data.
Validate the policies. You do not have to validate every policy you create in Immuta; instead, examine a few to validate the behavior you expect to see.
Once all Immuta policies are in place, remove or alter old permissions and revoke access to the ungoverned tables.
Starburst and Trino
Starburst is based on open-source Trino. Consequently, this page occasionally refers to the Trino Execution Engine and Trino methods.
The Starburst (Trino) integration allows you to access policy-enforced data directly in your Starburst catalogs without rewriting queries or changing workflows. Instead of generating policy-enforced views and adding them to an Immuta catalog that users have to query (like in the legacy Starburst (Trino) integration), Immuta policies are translated into Starburst (Trino) rules and permissions and applied directly to tables within users’ existing catalogs.
Once an Immuta Application Admin configures the Starburst (Trino) integration, the ImmutaSystemAccessControl plugin is installed on the . This plugin provides policy decisions to the Trino Execution Engine whenever an Immuta user queries a Starburst (Trino) table registered in Immuta. Then, the Trino Execution Engine applies policies to the backing catalogs and retrieves the data with appropriate policy enforcement.
By default, this integration is designed to be minimally invasive: if a catalog is not registered as an Immuta data source, users will still have access to it in Starburst (Trino). However, this limited enforcement can be changed in the provided by Immuta. Additionally, you can continue to use Trino's file-based access control provider or on catalogs that are not protected or controlled by Immuta.
When you configure the integration, Immuta generates an API key for you to add to your Immuta access control properties file for API authentication between Starburst (Trino) and Immuta. You can rotate this shared secret to mitigate potential security risks and comply with your organizational policies.
To rotate this API key, see the .
When a user queries a table in Starburst, the Trino Execution Engine reaches out to the Immuta plugin to determine what the user is allowed to see:
masking policies: For each column, Starburst (Trino) requests a view expression from the Immuta plugin. If there is a masking policy on the column, the Immuta plugin returns the corresponding view expression for that column. Otherwise, nothing is returned.
row-level policies: For each table, Starburst (Trino) requests the rows a user can see in a table from Immuta. If there is a WHERE clause policy on the data source, Immuta returns the corresponding view expression as a WHERE clause. Otherwise, nothing is returned.
The Immuta plugin then requests policy information about the tables being queried from the Immuta Web Service and sends this information to the Trino Execution Engine. Finally, the Trino Execution Engine constructs the SQL statement, executes it on the backing tables to apply the policies, and returns the response to the user.
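The effect of this flow can be pictured with a small Python sketch. It is a conceptual simplification, not how the Trino Execution Engine is implemented: the engine applies mask and filter expressions internally rather than concatenating strings. The function name build_policy_sql, the table and column names, and the mask and filter expressions are all hypothetical.

```python
def build_policy_sql(table, columns, masks, row_filter=None):
    """Compose a SELECT in which masked columns use their view expression
    and an optional row-level policy is applied as a WHERE clause."""
    select_list = []
    for column in columns:
        expression = masks.get(column)  # None means no masking policy on this column
        select_list.append(f"{expression} AS {column}" if expression else column)
    sql = f"SELECT {', '.join(select_list)} FROM {table}"
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


sql = build_policy_sql(
    table="public.customer",
    columns=["id", "email", "region"],
    masks={"email": "'REDACTED'"},   # hypothetical masking expression
    row_filter="region = 'US'",      # hypothetical row-level policy
)
print(sql)
# SELECT id, 'REDACTED' AS email, region FROM public.customer WHERE region = 'US'
```

Columns without a masking policy pass through unchanged, which mirrors the plugin returning nothing for them.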
See the integration support matrix on the for a list of supported data policy types in Starburst (Trino).
Users cannot bypass Immuta controls by changing roles in their system access control provider.
Multiple system access control providers can be configured in the Starburst (Trino) integration. This approach allows Immuta to work with existing Starburst (Trino) installations that already have an access control provider configured.
Immuta does not manage all permissions in Starburst (Trino) and will default to allowing access to anything Immuta does not manage so that the Starburst (Trino) integration complements existing controls. For example, if the Starburst (Trino) integration is configured to allow users write access to tables that are not protected by Immuta, you can still lock down write access for specific non-Immuta tables using an additional access control provider.
If you have multiple access control providers configured, those providers interact in the following ways:
For a user to have access to a resource (catalog, schema, or a table), that user must have access in all of the configured access control providers.
In catalog, schema, or table filtering (such as show catalogs, show schemas, or show tables), the user will see the intersection of all access control providers. For example, if a Starburst (Trino) environment includes the catalogs public, demo, and restricted and one provider restricts a user from accessing the restricted catalog and another provider restricts the user from accessing the demo catalog, running show catalogs will only return the public catalog for that user.
Only one column masking policy can be applied per column across all system access control providers. If two or more access control providers return a mask for a column, Starburst (Trino) will throw an error at query time.
For row filtering policies, the expression for each system access control provider is applied one after the other.
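The interaction rules above can be summarized in a short Python sketch. It models the described behavior only and is not Starburst (Trino) code. The function names are hypothetical, and treating row filters applied one after the other as a logical AND is an interpretation of the rule above.

```python
def visible_catalogs(all_catalogs, denied_per_provider):
    """A catalog is visible only if no provider denies it (intersection of access)."""
    visible = set(all_catalogs)
    for denied in denied_per_provider:
        visible -= set(denied)
    return sorted(visible)


def column_mask(masks_from_providers):
    """At most one provider may return a mask for a column; more is an error."""
    masks = [mask for mask in masks_from_providers if mask is not None]
    if len(masks) > 1:
        raise ValueError("multiple access control providers returned a mask for this column")
    return masks[0] if masks else None


def combined_row_filter(filters_from_providers):
    """Apply each provider's row filter in turn, modeled here as a conjunction."""
    active = [f"({expr})" for expr in filters_from_providers if expr]
    return " AND ".join(active) if active else None


# The catalog example from the text: one provider denies restricted, another denies demo.
print(visible_catalogs(["public", "demo", "restricted"], [["restricted"], ["demo"]]))
# ['public']
print(combined_row_filter(["region = 'US'", None, "active = true"]))
# (region = 'US') AND (active = true)
```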
Starburst (Trino) query passthrough is available in most connectors using the query table function or raw_query in the Elasticsearch connector. Consequently, Immuta blocks functions named raw_query or query, as those table functions would completely bypass Immuta’s access controls.
For example, without blocking those functions, this query would access the public.customer table directly:
select * from table(postgres.system.query(query => 'select * from public.customer limit 10'));
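As a rough illustration of a name-based block like the one described, the Python sketch below checks only the final part of a qualified table function name. This is not Immuta's implementation. The helper name, the case-insensitive comparison, and the sample function names other than postgres.system.query are assumptions.

```python
BLOCKED_TABLE_FUNCTIONS = {"query", "raw_query"}


def is_blocked_table_function(qualified_name: str) -> bool:
    """Return True if the function's own name (last dotted part) is blocked."""
    function_name = qualified_name.split(".")[-1].strip('"').lower()
    return function_name in BLOCKED_TABLE_FUNCTIONS


print(is_blocked_table_function("postgres.system.query"))            # True
print(is_blocked_table_function("elasticsearch.system.raw_query"))   # True
print(is_blocked_table_function("catalog.system.other_function"))    # False
```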
An Immuta Application Administrator configures the Starburst (Trino) integration, adding the ImmutaSystemAccessControl plugin on their Starburst (Trino) node.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
The Trino Execution Engine calls various methods on the interface to ask the ImmutaSystemAccessControl plugin where the policies should be applied. The masking and row-level security methods apply the actual policy expressions.
The Immuta System Access Control plugin calls the Immuta Web Service to retrieve policy information for that data source for the querying user, using the querying user's project, purpose, and entitlements.
The Immuta System Access Control plugin provides the SQL view expression (for masked columns) or WHERE clause SQL view expression (for row filtering) to the Trino Execution Engine.
The Trino Execution Engine constructs and executes the SQL statement on the backing catalogs and retrieves the data with appropriate policy enforcement.
User sees policy-enforced data.
The Starburst (Trino) integration supports the following authentication methods to create data sources in Immuta:
Username and password: You can authenticate with your Starburst (Trino) username and password.
Configure JWT authentication method in Starburst (Trino)
When using OAuth authentication to create data sources in Immuta, configure your Starburst (Trino) cluster to use JWT authentication, not OpenID Connect or OAuth.
When users query a Starburst data source, Immuta sends a username with the view SQL so that policies apply in the right context. Since OAuth authentication does not require a username to be associated with a data source upon data source creation, Immuta does not send a username and Starburst queries fail. To avoid this error, you must configure a global admin username.
If you are using OAuth or asynchronous authentication to create Starburst data sources, work with your Immuta representative to configure the globalAdminUsername property.
The descriptions below provide guidance for applying policies to Starburst (Trino)-created logical views in the
However, there are other approaches you can use to apply policies to Starburst (Trino)-created logical views. The examples below are the simplest approaches.
DEFINER security mode
For views created using the DEFINER security mode,
ensure the user who created the view is configured as an admin user in the Immuta plugin so that policies are never applied to the underlying tables.
create Immuta data sources and apply policies to logical views exposing those tables.
lock down access to the underlying tables in Starburst (Trino) so that all end user access is provided through the views.
INVOKER security mode
Applying policies to views or tables
Avoid creating data policies for both a logical view and its underlying tables. Instead, apply policies to the logical view or the underlying tables.
For views created using the INVOKER security mode, the querying user needs access to the logical view and underlying tables.
If non-Immuta table reads are disabled, provide access to the views and tables through Immuta. To do so, create Immuta data sources for the view and underlying tables, and grant access to the querying user in Immuta. If creating data policies, apply the policies to either the view or underlying tables, not both.
If non-Immuta table reads are enabled, the user already has access to the table and view. Create Immuta data sources and apply policies to the underlying table; this approach will enforce access controls for both the table and view in Starburst (Trino).
In addition to the information included on the Starburst (Trino) Audit Logs page, the audit logs payload in the Starburst (Trino) integration includes immutaPlanningDuration, which represents the planning overhead in Immuta.
You can configure multiple Starburst (Trino) integrations with a single Immuta tenant and use them dynamically. Configure the integration once in Immuta to use it in multiple Starburst (Trino) clusters. However, consider the following limitations:
Names of catalogs cannot overlap because Immuta cannot distinguish among them.
A combination of cluster types on a single Immuta tenant is supported unless your Trino cluster is configured to use a proxy. In that case, you can only connect either Trino clusters or Starburst clusters to the same Immuta tenant.
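The catalog-name limitation above can be checked before you connect several clusters to one tenant. The following is a minimal sketch of such a pre-flight check; the cluster names, catalog lists, and the `find_overlapping_catalogs` helper are hypothetical, and comparing names case-insensitively is an assumption of this sketch rather than documented Immuta behavior.

```python
from collections import defaultdict


def find_overlapping_catalogs(clusters):
    """Return {catalog_name: [cluster, ...]} for names used by more than one cluster.

    `clusters` maps a (hypothetical) cluster name to its list of catalog names.
    Names are compared case-insensitively, which is an assumption of this sketch.
    """
    owners_by_catalog = defaultdict(set)
    for cluster, catalogs in clusters.items():
        for catalog in catalogs:
            owners_by_catalog[catalog.lower()].add(cluster)
    return {
        name: sorted(owners)
        for name, owners in owners_by_catalog.items()
        if len(owners) > 1
    }


# Hypothetical cluster inventory, for illustration only.
clusters = {
    "starburst-prod": ["hive", "sales"],
    "trino-dev": ["hive", "sandbox"],
}
print(find_overlapping_catalogs(clusters))
```

Any catalog name reported by the check would need to be renamed on one of the clusters before both are connected to the same tenant.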
Limit your masked joins to columns with matching column types. Starburst truncates the result of the masking expression to conform to the native column type when performing the join, so joining two masked columns with different data types produces invalid results when one of the columns' lengths is less than the length of the masked value.
For example, if the value of a hashed column is 64 characters, joining a hashed varchar(50) and a hashed varchar(255) column will not be joined correctly, since the varchar(50) value is truncated and doesn’t match the varchar(255) value.
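The truncation problem described above can be reproduced outside of Starburst. The sketch below is only an illustration: it assumes SHA-256 as the hashing function (it yields a 64-character hex string, matching the example) and uses a hypothetical `fit_to_varchar` helper to mimic truncation to the column's declared length.

```python
import hashlib


def mask(value: str) -> str:
    """Hash a value to a 64-character hex string (SHA-256 is an assumption here)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def fit_to_varchar(value: str, length: int) -> str:
    """Mimic truncating a masked value to fit a varchar(length) column."""
    return value[:length]


raw = "alice@example.com"  # hypothetical join key
masked = mask(raw)

left = fit_to_varchar(masked, 50)    # hashed varchar(50) column: truncated
right = fit_to_varchar(masked, 255)  # hashed varchar(255) column: kept whole

print(len(masked), len(left), len(right))
print("join keys match:", left == right)
```

Because the varchar(50) side keeps only the first 50 characters of the hash, the two join keys differ even though they were derived from the same raw value.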
The plugin comes pre-installed with Starburst Enterprise, so this page provides separate sets of guidelines for configuration:
: These instructions are specific to Starburst Enterprise clusters.
: These instructions are specific to open-source Trino clusters.
A valid .
The Starburst Cluster must be publicly accessible or have configured.
Starburst does not support using Starburst built-in access control (BIAC) concurrently with any other access control providers such as Immuta. If Starburst BIAC is in use, it must be disabled to allow Immuta to enforce policies on the cluster.
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click Add Native Integration and select Trino from the Native Integration Type dropdown menu.
Click Save.
If you are using OAuth or asynchronous authentication to create Starburst data sources, work with your Immuta representative to configure the globalAdminUsername property.
Default configuration property values
If you use the default property values in the configuration file described in this section, you will give users read and write access to tables that are not registered in Immuta, and results for SHOW queries will not be filtered on table metadata.
These default settings help ensure that a new Starburst integration installation is minimally disruptive for existing Starburst deployments, allowing you to then add Immuta data sources and update configuration to enforce more controls as you see fit.
However, the access-control.config-files property can be configured to allow Immuta to work with existing Starburst installations that have already configured an access control provider. For example, if the Starburst integration is configured to allow users write access to tables that are not protected by Immuta, you can still lock down write access for specific non-Immuta tables using an additional access control provider.
Create the Immuta access control configuration file in the Starburst configuration directory (/etc/starburst/immuta-access-control.properties for Docker installations or <starburst_install_directory>/etc/immuta-access-control.properties for standalone installations).
The table below describes the properties that can be set during configuration.
Enable the Immuta access control plugin in Starburst's configuration file (/etc/starburst/config.properties for Docker installations or <starburst_install_directory>/etc/config.properties for standalone installations). For example,
All Starburst users must map to Immuta users or match the immuta.user.admin regex configured on the cluster, and their Starburst username must be mapped to Immuta so they can query policy-enforced data.
A user impersonating a different user in Starburst requires the IMPERSONATE_USER permission in Immuta. Both users must be mapped to an Immuta user, or the querying user must match the configured immuta.user.admin regex.
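The user-mapping rule above can be pictured as a simple check: a username is recognized if it maps to an Immuta user or matches the admin regex. The sketch below is illustrative only. The usernames and the pattern are hypothetical, `is_recognized` is not part of any Immuta API, and it uses Python's `re` module with full-string matching, which may differ from how the plugin evaluates the immuta.user.admin value.

```python
import re


def is_recognized(username, mapped_usernames, admin_pattern):
    """Return True if the user maps to an Immuta user or matches the admin regex.

    `mapped_usernames` stands in for usernames mapped to Immuta users, and
    `admin_pattern` stands in for the immuta.user.admin value. Full-string
    matching is an assumption of this sketch.
    """
    if username in mapped_usernames:
        return True
    return re.fullmatch(admin_pattern, username) is not None


# Hypothetical values, for illustration only.
mapped = {"alice", "bob"}
admin_regex = r"svc_.*"

for user in ("alice", "svc_etl", "mallory"):
    print(user, is_recognized(user, mapped, admin_regex))
```

In this sketch, a user such as `mallory`, who is neither mapped nor matched by the pattern, is the case the requirement above is meant to rule out.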
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click Add Native Integration and select Trino from the dropdown menu.
Click Save.
If you are using OAuth or asynchronous authentication to create Starburst data sources, work with your Immuta representative to configure the globalAdminUsername property.
Default configuration property values
If you use the default property values in the configuration file described in this section, you will give users read and write access to tables that are not registered in Immuta, and results for SHOW queries will not be filtered on table metadata.
These default settings help ensure that a new Starburst integration installation is minimally disruptive for existing Trino deployments, allowing you to then add Immuta data sources and update configuration to enforce more controls as you see fit.
However, the access-control.config-files property can be configured to allow Immuta to work with existing Trino installations that have already configured an access control provider. For example, if the Starburst (Trino) integration is configured to allow users write access to tables that are not protected by Immuta, you can still lock down write access for specific non-Immuta tables using an additional access control provider.
Enable Immuta on your cluster. Select the tab below that corresponds to your installation method for instructions:
Docker (Trino 413 and older)
Create the Immuta access control configuration file in the Trino configuration directory: /etc/trino/immuta-access-control.properties.
Pull the image and start the container. The example below specifies the Immuta Trino plugin version 414 with the 414 tag, but any supported Trino version newer than 414 can be used:
Create the Immuta access control configuration file in the Trino configuration directory: /etc/trino/immuta-access-control.properties.
Standalone installations
Create the Immuta access control configuration file in the Trino configuration directory: <trino_install_directory>/etc/immuta-access-control.properties.
Configure the properties described in the table below.
Enable the Immuta access control plugin in Trino's configuration file (/etc/trino/config.properties for Docker installations or <trino_install_directory>/etc/config.properties for standalone installations). For example,
All Trino users must map to Immuta users or match the immuta.user.admin regex configured on the cluster, and their Trino username must be mapped to Immuta so they can query policy-enforced data.
A user impersonating a different user in Trino requires the IMPERSONATE_USER permission in Immuta. Both users must be mapped to an Immuta user, or the querying user must match the configured immuta.user.admin regex.
This page illustrates how to configure the on the Immuta app settings page. To configure this integration via the Immuta API, see the .
For instructions on configuring Redshift Spectrum, see the guide.
A Redshift cluster with an RA3 node is required for the multi-database integration. You must use a Redshift RA3 instance type because Immuta requires cross-database views, which are only supported in Redshift RA3 instance types. For other instance types, you may configure a single-database integration using one of the .
For automated installations, the credentials provided must be a Superuser or have the ability to create databases and users and modify grants.
The must be set to false (default setting) for your Redshift cluster.
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click the +Add Native Integration button and select Redshift from the dropdown menu.
Complete the Host and Port fields.
Enter an Immuta Database. This is a new database where all secure schemas and Immuta created views will be stored.
Opt to check the Enable Impersonation box and customize the Impersonation Role name as needed. This will allow users to natively impersonate another user.
You have two options for configuring your Redshift environment:
Immuta requires temporary, one-time use of credentials with specific privileges
When performing an automated installation, Immuta requires temporary, one-time use of credentials with the following privileges:
CREATE DATABASE
CREATE USER
REVOKE ALL PRIVILEGES ON DATABASE
GRANT TEMP ON DATABASE
MANAGE GRANTS ON ACCOUNT
These privileges will be used to create and configure a new IMMUTA database within the specified Redshift instance. The credentials are not stored or saved by Immuta, and Immuta doesn’t retain access to them after initial setup is complete.
You can create a new account for Immuta to use that has these privileges, or you can grant temporary use of a pre-existing account. By default, the pre-existing account with appropriate privileges is a Superuser. If you create a new account, it can be deleted after initial setup is complete.
Alternatively, you can create the IMMUTA database within the specified Redshift instance without giving Immuta user credentials for a Superuser using the manual setup option.
Select Automatic.
Enter an Initial Database from your Redshift integration for Immuta to use to connect.
Use the dropdown menu to select your Authentication Method.
Username and Password: Enter the Username and Password of the privileged user.
AWS Access Key: Enter the Database User, Access Key ID, and Secret Key. Opt to enter in the Session Token.
Required privileges
The specified role used to run the bootstrap needs to have the following privileges:
CREATE DATABASE
CREATE USER
REVOKE ALL PRIVILEGES ON DATABASE
GRANT TEMP ON DATABASE
MANAGE GRANTS ON ACCOUNT
Select Manual and download both of the bootstrap scripts.
Run the bootstrap script (initial database) in the Redshift initial database.
Run the bootstrap script (Immuta database) in the new Immuta Database in Redshift.
Choose your authentication method, and enter the information of the newly created account.
Click Save.
Click the App Settings icon in the left sidebar.
Navigate to the Integrations tab and click the down arrow next to the Redshift Integration.
Edit the field you want to change. Note that any shaded field is not editable; to change it, the integration must be disabled and re-installed.
Enter Username and Password.
Click Save.
Required privileges
When performing edits to an integration, Immuta requires temporary, one-time use of credentials of a Superuser or a user with the following permissions:
Create Databases
Create users
Modify grants
Alternatively, you can download the Edit Script and run it in Redshift.
Disabling Redshift Spectrum
Click the App Settings icon in the left sidebar.
Navigate to the Integrations tab and click the down arrow next to the Redshift Integration.
Click the checkbox to disable the integration.
Enter the username and password that were used to initially configure the integration.
Click Save.
This page provides an overview of the Redshift integration in Immuta. For a tutorial detailing how to enable this integration, see the .
Redshift is a policy push integration that allows Immuta to apply policies directly in Redshift. This allows data analysts to query Redshift views directly instead of going through a proxy and have per-user policies dynamically applied at query time.
The Redshift integration will create views from the tables within the database specified when configured. Then, the user can choose the name for the schema where all the Immuta generated views will reside. Immuta will also create the schemas immuta_system, immuta_functions, and immuta_procedures to contain the tables, views, UDFs, and stored procedures that support the integration. Immuta then creates a system role and gives that system account the following privileges:
ALL PRIVILEGES ON DATABASE IMMUTA_DB
ALL PRIVILEGES ON ALL SCHEMAS IN DATABASE IMMUTA_DB
USAGE ON FUTURE PROCEDURES IN SCHEMA IMMUTA_DB.IMMUTA_PROCEDURES
USAGE ON LANGUAGE PLPYTHONU
Additionally, the PUBLIC role will be granted the following privileges:
USAGE ON DATABASE IMMUTA_DB
TEMP ON DATABASE IMMUTA_DB
USAGE ON SCHEMA IMMUTA_DB.IMMUTA_PROCEDURES
USAGE ON SCHEMA IMMUTA_DB.IMMUTA_FUNCTIONS
USAGE ON FUTURE FUNCTIONS IN SCHEMA IMMUTA_DB.IMMUTA_FUNCTIONS
USAGE ON SCHEMA IMMUTA_DB.IMMUTA_SYSTEM
SELECT ON TABLES TO public
Immuta supports the Redshift integration as both multi-database and single-database integrations.
If using a multi-database integration, you must use a Redshift cluster with an RA3 node because Immuta requires cross-database views.
If using a single-database integration, all Redshift cluster types are supported. However, because cross-database queries are not supported in any types other than RA3, Immuta's views must exist in the same database as the raw tables. Consequently, the steps for configuring the integration for Redshift clusters with external tables differ slightly from those that don't have external tables. Allow Immuta to create secure views of your external tables through one of these methods:
SQL statements are used to create all views, including a join to the secure view immuta_system.user_profile. This secure view is a select from the immuta_system.profile table (which contains all Immuta users and their current groups, attributes, projects, and a list of valid tables they have access to) with the constraint immuta__userid = current_user() to ensure it only contains the profile row for the current user. The immuta_system.user_profile view is readable by all users, but will only display the data that corresponds to the user executing the query.
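The idea of a profile view that only exposes the current user's row can be sketched with an in-memory SQLite database. This is not the SQL Immuta generates: SQLite has no current_user(), so a one-row `session` table stands in for it, and the column `user_groups`, the sample users, and the `profile_rows_for` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for immuta_system.profile: one row per user.
    CREATE TABLE profile (immuta__userid TEXT PRIMARY KEY, user_groups TEXT);
    -- Stand-in for current_user(): holds the name of the querying user.
    CREATE TABLE session (username TEXT);
    -- Stand-in for immuta_system.user_profile: only the current user's row.
    CREATE VIEW user_profile AS
        SELECT p.immuta__userid, p.user_groups
        FROM profile AS p
        JOIN session AS s ON p.immuta__userid = s.username;
    INSERT INTO profile VALUES ('alice', 'analysts'), ('bob', 'finance');
    """
)


def profile_rows_for(user):
    """Simulate `user` running a query and return what the view shows them."""
    conn.execute("DELETE FROM session")
    conn.execute("INSERT INTO session VALUES (?)", (user,))
    return conn.execute("SELECT immuta__userid, user_groups FROM user_profile").fetchall()


print(profile_rows_for("alice"))
print(profile_rows_for("bob"))
```

Every user reads the same view, but each sees only their own profile row, which is what lets the generated data views join against it to apply per-user policies.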
The Redshift integration uses webhooks to keep views up-to-date with Immuta data sources. When a data source or policy is created, updated, or disabled, a webhook will be called that will create, modify, or delete the dynamic view. The immuta_system.profile table is updated through webhooks when a user's groups or attributes change, they switch projects, they acknowledge a purpose, or when their data source access is approved or revoked. The profile table can only be read and updated by the Immuta system account.
Immuta creates a database inside the configured Redshift ecosystem that contains Immuta policy definitions and user entitlements.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
The Immuta Web Service calls a stored procedure that modifies the user entitlements or policies.
Allow Immuta to create secure views of your external tables through one of these methods:
that contains the external tables: Instead of creating an immuta database that manages all schemas and views created when Redshift data is registered in Immuta, the integration adds the Immuta-managed schemas and views to an existing database in Redshift.
and re-create all of your external tables in that database.
For an overview of the integration, see the documentation.
A Redshift cluster with an AWS row-level security patch applied. for guidance.
that is .
The must be set to false (default setting) for your Redshift cluster.
The Redshift role used to run the Immuta bootstrap script must have the following privileges when configuring the integration to
Use an existing database:
ALL PRIVILEGES ON DATABASE for the database you configure the integration with, as you must manage grants on that database.
CREATE USER
GRANT TEMP ON DATABASE
Create a new database:
CREATE DATABASE
CREATE USER
GRANT TEMP ON DATABASE
REVOKE ALL PRIVILEGES ON DATABASE
.
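The two privilege lists above can be captured as a small lookup that reports which privileges a bootstrap role still lacks. This is a minimal sketch: the mode keys, the `missing_privileges` helper, and the example grants are hypothetical, and the privilege names are copied as written in the lists above rather than validated against Redshift.

```python
# Privileges required by the bootstrap role, keyed by setup mode.
# The mode names are hypothetical labels for the two options listed above.
REQUIRED_PRIVILEGES = {
    "existing_database": {
        "ALL PRIVILEGES ON DATABASE",
        "CREATE USER",
        "GRANT TEMP ON DATABASE",
    },
    "new_database": {
        "CREATE DATABASE",
        "CREATE USER",
        "GRANT TEMP ON DATABASE",
        "REVOKE ALL PRIVILEGES ON DATABASE",
    },
}


def missing_privileges(mode, granted):
    """Return the sorted privileges required for `mode` that are not in `granted`."""
    return sorted(REQUIRED_PRIVILEGES[mode] - set(granted))


# Hypothetical grants held by the role that will run the bootstrap script.
granted = ["CREATE USER", "GRANT TEMP ON DATABASE"]
print(missing_privileges("existing_database", granted))
print(missing_privileges("new_database", granted))
```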
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click the +Add Native Integration button and select Redshift from the dropdown menu.
Complete the Host and Port fields.
Enter the name of the database you created the external schema in as the Immuta Database. This database will store all secure schemas and Immuta-created views.
Opt to check the Enable Impersonation box and customize the Impersonation Role name as needed. This will allow users to natively impersonate another user.
Select Manual and download both of the bootstrap scripts from the Setup section. The specified role used to run the bootstrap needs to have the following privileges:
ALL PRIVILEGES ON DATABASE for the database you configure the integration with, as you must manage grants on that database.
CREATE USER
GRANT TEMP ON DATABASE
Run the bootstrap script (Immuta database) in the Redshift database that contains the external schema.
Choose your authentication method, and enter the credentials from the bootstrap script for the Immuta_System_Account.
Click Save.
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click the +Add Native Integration button and select Redshift from the dropdown menu.
Complete the Host and Port fields.
Enter an Immuta Database. This is a new database where all secure schemas and Immuta created views will be stored.
Opt to check the Enable Impersonation box and customize the Impersonation Role name as needed. This will allow users to natively impersonate another user.
Select Manual and download both of the bootstrap scripts from the Setup section. The specified role used to run the bootstrap needs to have the following privileges:
ALL PRIVILEGES ON DATABASE for the database you configure the integration with, as you must manage grants on that database.
CREATE DATABASE
CREATE USER
GRANT TEMP ON DATABASE
Run the bootstrap script (initial database) in the Redshift initial database.
Run the bootstrap script (Immuta database) in the new Immuta Database in Redshift.
Choose your authentication method, and enter the credentials from the bootstrap script for the Immuta_System_Account.
Click Save.
Then, add your external tables to the Immuta database.
The how-to guides linked on this page illustrate how to integrate Azure Synapse Analytics with Immuta.
Requirement: A running Dedicated SQL pool
These guides provide information on the recommended feature to enable with Azure Synapse Analytics.
Select None as your .
These guides provide instructions for organizing your Azure Synapse Analytics data to align with your governance structure.
These guides provide instructions for configuring and securing your data with governance policies, or see the for a comprehensive guide on creating policies to fit your organization's use case.
Validate the policies. You do not have to validate every policy you create in Immuta; instead, examine a few to validate the behavior you expect to see.
Once all Immuta policies are in place, remove or alter old permissions and revoke access to the ungoverned tables.
Given the above configuration, when a user gets write access to a Starburst (Trino) data source, they will have both data and table modification permissions on that data source. See the for details about these operations.
Table location | Databricks Spark integration | Databricks Unity Catalog integration |
---|
To enforce plugin-based policies on Hive metastore tables and Unity Catalog native controls on Unity Catalog metastore tables, enable the and the Databricks Unity Catalog integration. Note that some Immuta policies are not supported in the Databricks Unity Catalog integration. See the for details.
To enforce policies on data sources in Databricks SQL, use to manually lock down Hive metastore data sources and the Databricks Unity Catalog integration to protect tables in the Unity Catalog metastore. Table access control is enabled by default on SQL warehouses, and any Databricks cluster without the Immuta plugin must have table access control enabled.
Example cluster | Databricks Runtime | Unity Catalog in Databricks | Databricks Spark integration | Databricks Unity Catalog integration |
---|
The feature or integration is enabled.
The feature or integration is disabled.
to configure and validate SDD.
to discover entities of interest for your policy needs.
Register your remaining tables at the with .
These guides provide instructions for configuring and securing your data with governance policies, or see the for a comprehensive guide on creating policies to fit your organization's use case.
to configure and validate SDD.
to discover entities of interest for your policy needs.
Register your remaining tables at the with .
These guides provide instructions for configuring and securing your data with governance policies, or see the for a comprehensive guide on creating policies to fit your organization's use case.
See the for instructions on configuring multiple access control providers.
You can add or remove functions that are blocked by Immuta in the Starburst (Trino) integration configuration file. See the for instructions.
A data owner . A data owner, data governor, or administrator or user in Immuta.
A Starburst (Trino) user who is subscribed to the data source in Immuta directly in their Starburst catalog.
OAuth 2.0: You can authenticate with OAuth 2.0. Immuta's OAuth authentication method uses the ; when you register a data source, Immuta reaches out to your OAuth server to generate a JSON web token (JWT) and then passes that token to the Starburst (Trino) cluster. If you use OAuth to authenticate when creating a data source, you must configure the globalAdminUsername property. See the section for details.
Immuta policies can be applied to .
and
User impersonation: Native impersonation allows users to natively query data as another Immuta user. To enable native user impersonation, see the .
: Immuta audits queries run natively in Starburst (Trino) against Starburst (Trino) data registered as Immuta data sources.
The Immuta Trino Event Listener allows Immuta to translate events into comprehensive audit logs for users with the Immuta AUDIT permission to view. For more information about what is included in those audit logs, see the page.
Property | Starburst version | Required or optional | Description |
---|
The example configuration snippet below uses the default configuration settings for immuta.allowed.immuta.datasource.operations and immuta.allowed.non.immuta.datasource.operations, which allow read access for data registered as Immuta data sources and read and write access on data that is not registered in Immuta. See the for details about customizing and enforcing read and write access controls in Starburst.
to add users to Immuta.
when configuring your IAM (or map usernames manually) to Immuta.
A user with access to Immuta's Archives site is required to conduct the download in this step at . If you are prompted to log in and need basic authentication credentials, contact your Immuta support professional.
The Immuta Trino plugin version is updated alongside Trino so that a matching version of the plugin is published for corresponding Trino releases. For example, the Immuta plugin version supporting Trino version 403 is simply version 403. Download the plugin version that corresponds with the Trino version you use from Immuta's Archives site.
Follow to install the plugin archive on all nodes in your cluster.
Docker (Trino 414 and newer): For Trino versions 414 and newer, you can use the `immuta-trino` Docker image (which includes the Trino plugin jars) from registry.immuta.com instead of the .
Follow to install the plugin archive on all nodes in your cluster.
Property | Trino version | Required or optional | Description |
---|
The example configuration snippet below uses the default configuration settings for immuta.allowed.immuta.datasource.operations and immuta.allowed.non.immuta.datasource.operations, which allow read access for data registered as Immuta data sources and read and write access on data that is not registered in Immuta. See the for details about customizing and enforcing read and write access controls in Starburst.
to add users to Immuta.
when configuring your IAM (or map usernames manually) to Immuta.
: Grant Immuta one-time use of credentials to automatically configure your Redshift environment and the integration.
: Run the Immuta script in your Redshift environment yourself to configure your environment and the integration.
Disabling the Redshift integration is not supported when you set the fields nativeWorkspaceName, nativeViewName, and nativeSchemaName to . Disabling the integration when these fields are used in metadata ingestion causes undefined behavior.
: Instead of creating an immuta database that manages all schemas and views created when Redshift data is registered in Immuta, the integration adds the Immuta-managed schemas and views to an existing database in Redshift.
and re-create all of your external tables in that database.
An Immuta Application Administrator and registers Redshift warehouse and databases with Immuta.
A Data Owner registers Redshift tables in Immuta as .
A Data Owner, Data Governor, or Administrator or user in Immuta.
A Redshift user who is subscribed to the data source in Immuta directly in Redshift through the immuta database and sees policy-enforced data.
Redshift Spectrum () allows Redshift users to query external data directly from files on Amazon S3. Because cross-database queries are not supported in Redshift Spectrum, Immuta's views must exist in the same database as the raw tables. Consequently, the steps for configuring the integration for Redshift clusters with external tables differ slightly from those that don't have external tables. Allow Immuta to create secure views of your external tables through one of these methods:
: Instead of creating an immuta database that manages all schemas and views created when Redshift data is registered in Immuta, the integration adds the Immuta-managed schemas and views to an existing database in Redshift.
and re-create all of your external tables in that database.
Once the integration is configured, Data Owners must .
DeltaTable.convertToDelta: CONVERT TO DELTA parquet.`/path/to/parquet/`
DeltaTable.delete: DELETE FROM [table_identifier | delta.`/path/to/delta/`] WHERE condition
DeltaTable.generate: GENERATE symlink_format_manifest FOR TABLE [table_identifier | delta.`/path/to/delta`]
DeltaTable.history: DESCRIBE HISTORY [table_identifier | delta.`/path/to/delta`] (LIMIT x)
DeltaTable.merge: MERGE INTO
DeltaTable.update: UPDATE [table_identifier | delta.`/path/to/delta/`] SET column = value WHERE (condition)
DeltaTable.vacuum: VACUUM [table_identifier | delta.`/path/to/delta`]
This page provides a tutorial for enabling the native Azure Synapse Analytics integration on the Immuta app settings page. To configure this integration via the Immuta API, see the Configure an Azure Synapse Analytics integration API guide.
For an overview of the integration, see the Azure Synapse Analytics overview documentation.
A running Dedicated SQL pool is required.
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click the +Add Native Integration button and select Azure Synapse Analytics from the dropdown menu.
Complete the Host, Port, Immuta Database, and Immuta Schema fields.
Opt to check the Enable Impersonation box and customize the Impersonation Role name as needed. This will allow users to natively impersonate another user.
Opt to update the User Profile Delimiters. This will be necessary if any of the provided symbols are used in user profile information.
You have two options for configuring your Azure Synapse Analytics environment:
Automatic setup: Grant Immuta one-time use of credentials to automatically configure your environment and the integration.
Manual setup: Run the Immuta script in your Azure Synapse Analytics environment yourself to configure the integration.
Enter the username and password in the Privileged User Credentials section.
Select Manual.
Download, fill out the appropriate fields, and run the bootstrap master script and bootstrap script linked in the Setup section.
Enter the username and password in the Immuta System Account Credentials section. The username and password provided must be the credentials that were set in the bootstrap master script when you created the user.
Click Save.
Register Azure Synapse Analytics data in Immuta.
Click the App Settings icon in the left sidebar.
Navigate to the Integrations tab and click the down arrow next to the Azure Synapse Analytics Integration.
Edit the field you want to change. Note that any shaded field is not editable; to change it, the integration must be disabled and re-installed.
Enter Username and Password.
Click Save.
Immuta requires temporary, one-time use of credentials with specific permissions
When performing edits to an integration, Immuta requires temporary, one-time use of credentials of a Superuser or a user with the Manage GRANTS permission.
Alternatively, you can download the Edit Script from your Azure Synapse Analytics configuration on the Immuta app settings page and run it in Azure Synapse Analytics.
Click the App Settings icon in the left sidebar.
Navigate to the Integrations tab and click the down arrow next to the Azure Synapse Analytics Integration.
Click the checkbox to disable the integration.
Enter the username and password that were used to initially configure the integration.
Click Save.
Connect an external catalog to use tagging capabilities outside of Immuta and pull tags from external table schemas. Once the catalog has been connected, Immuta ingests a data dictionary from the catalog and applies data source and column tags directly to the data source. These tags can then be used to create policies.
This getting started guide outlines how to use external catalogs in Immuta to gain value from the Immuta modules Discover and Secure.
Configure an external catalog: Configure Alation, Collibra, or a custom REST catalog to ingest tags into Immuta.
External catalog integrations: This reference guide describes the requirements of the external catalogs Immuta supports.
Custom REST catalog introduction: This reference guide describes the custom catalog option for users to make API calls to retrieve metadata on their data.
Custom REST catalog interface endpoints: This reference guide describes the endpoints for configuring a custom REST catalog.
The how-to guides linked on this page illustrate how to link an external catalog with Immuta to ingest tags and add value to the Immuta modules: Secure and Discover.
Best practice: Use a single catalog; having more than one can lead to multiple truths and data leaks.
Requirement: A catalog with tags that correspond to your Immuta data sources
When changes are made to the external catalog, refresh external tags.
Requirements:
A physical data dictionary with assets that correspond to your Immuta data sources
The Collibra global role Catalog or Catalog Author
When changes are made to the external catalog, refresh external tags.
Requirements:
A catalog with assets that correspond to your Immuta data sources
The ability to create a registered app in the Azure portal
When changes are made to the external catalog, refresh external tags.
Requirements:
A catalog with tags that correspond to your Immuta data sources
When changes are made to the external catalog, refresh external tags.
Requirements:
Fewer than 2,500 Databricks Unity Catalog data sources registered in Immuta
Databricks privileges listed on the Configure a Databricks Unity Catalog integration page
Once you register data sources, table and column tags from Databricks Unity Catalog will be ingested and applied to the corresponding data sources in Immuta.
Requirements:
A Snowflake user who can set the following permissions:
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake
GRANT APPLY TAG ON ACCOUNT
Snowflake Enterprise Edition or higher
Configure Snowflake tag ingestion in Immuta.
When changes are made to the tags in Snowflake, refresh external tags.
This page outlines how to connect an external catalog on the Immuta app settings page. For details on external catalogs in Immuta, see the External catalog reference guide.
Requirements:
APPLICATION_ADMIN Immuta permission
An Alation API access token connected to a user with the Server Admin permission
To change the default expiration period for your Alation catalog's API tokens, see configure the expiration period for Alation API tokens.
Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter a Display Name and select Alation from the dropdown menu.
Complete the URL and API key fields. The API key must be an API access token for your Alation instance connected to a user with the Server Admin permission.
Configure whether or not Alation tags and custom fields are imported as Immuta tags:
Link Alation tags: When selected, Immuta imports Alation tags as Immuta tags.
Link Alation Custom Fields: When selected, Immuta imports Alation custom fields as Immuta tags. Follow the Alation documentation to create an Alation custom field, add permissions to your custom field, and apply custom fields to tables and columns.
Opt to select Upload Certificates.
Upload the Certificate Authority, Certificate File, and Key File.
Opt to enable Strict SSL by selecting the checkbox.
Click the Test Connection button.
Once the connection is successful, click Save.
Requirement: APPLICATION_ADMIN Immuta permission
Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter the Display Name and select Collibra from the dropdown menu.
Enter the HTTP endpoint of the catalog in the URL field.
Complete the Username and Password fields. Note: This is the username and the password that Immuta can use to connect to the external catalog.
Complete the Asset Mappings modal to set which Collibra asset types align to the Immuta data source and column. Immuta will only link data sources from the asset types you specify.
Complete the Attributes as Tags modal to specify which Collibra attributes you want in Immuta. These attributes will come in as parent tags with their values as children tags.
Opt to select Upload Certificates.
Upload the Certificate Authority, Certificate File, and Key File.
Opt to enable Strict SSL by selecting the checkbox.
Click the Test Connection button.
Once the connection is successful, click Save.
Private preview
The Microsoft Purview catalog integration is only available to select accounts. Contact your Immuta representative to enable this feature.
Requirement: APPLICATION_ADMIN Immuta permission
Register an app in the Azure portal with the following settings:
Supported account type: "Accounts in this organizational directory only"
Microsoft-Graph: User.Read API permission
A client secret
Using that registered app, navigate to Immuta and complete the following:
Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter the Display Name and select Microsoft Purview from the dropdown menu.
Complete the following fields:
Enter the Microsoft Purview endpoint URL including the Azure Account Name, like https://<ACCOUNTNAME>.purview.azure.com, in the Purview Endpoint URL field.
Complete the Microsoft Entra Directory (tenant) ID and Microsoft Entra (client) ID fields.
Enter the Microsoft Entra Application Client Secret ID for Immuta to authenticate and connect to the Purview API. The secret cannot be expired.
Click the Test Connection button.
Once the test is successful, click Save.
Requirement: APPLICATION_ADMIN Immuta permission
Integrating a custom REST catalog service with Immuta requires implementing a REST interface. For details about the necessary endpoints that must be serviced, see the Custom REST catalog interface endpoints page.
Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter the Display Name and select Rest from the dropdown menu.
Select the Internal Plugin checkbox if the catalog has been uploaded to Immuta as a custom server plugin.
Complete the following fields:
Enter the HTTP endpoint of the catalog in the URL field.
Complete the Username and Password fields.
Enter the path of the Tags Endpoint.
Enter the path of the Data Source Endpoint.
Enter the path to the information page for a data source in the Data Source Link Template field.
Opt to enter the path to the information page for a column in the Column Link Template field.
Opt to upload a Catalog Image.
Opt to select Upload Certificates.
Upload the Certificate Authority, Certificate File, and Key File.
Opt to enable Strict SSL by selecting the checkbox.
Click the Test Connection button.
Click the Test Data Source Link.
Once both tests are successful, click Save.
See the Configure a Snowflake integration page for guidance on configuring tag ingestion.
If Snowflake data sources existed before configuring tag ingestion, Immuta will automatically sync those data sources to the catalog and apply tags to them. Immuta will automatically check the external catalog for changes and sync data sources to the catalog every 24 hours.
See the Configure a Databricks Unity Catalog integration page for guidance on configuring tag ingestion.
If Databricks Unity Catalog data sources existed before configuring tag ingestion, Immuta will automatically sync those data sources to the catalog and apply tags to them. Immuta will automatically check the external catalog for changes and sync data sources to the catalog every 24 hours.
You can manually link and remove external catalogs from data sources on the data source overview tab.
Navigate to your data source.
In the connection information section, click the Link Catalog icon (or Unlink Catalog to remove an external catalog from a data source).
Select your external catalog from the dropdown menu.
Click Link to confirm.
Navigate to your data source and click the data source Health dropdown menu.
Click Re-run in the External Catalog section.
Users who want to use tags from outside of Immuta can connect an external catalog to automatically pull and apply them to Immuta data sources. These tags can then be used to drive policies or classification frameworks.
Immuta supports the following external catalogs:
To configure an external catalog, see the Configure an external catalog guide.
Once an external catalog has been configured on the Immuta app settings page, there are two recurring process steps:
Linking to data sources and columns: Whenever a new data source is created or an external catalog is set up, Immuta will attempt to automatically link data sources to their corresponding assets in the external catalog. This is done by comparing the fully qualified name of a data source in Immuta with its corresponding asset name in the external catalog, so data sources must have the same name in Immuta and the external catalog. Alternatively, a user can also manually link a data source to an asset in an external catalog. Once a data source has been linked to an external catalog, it can be seen on the data source's detail page.
Pull and apply tags in Immuta: Using the link established in the first step, Immuta polls the external catalog to ingest and apply tags to each data source and its columns. Immuta checks every 24 hours for any relevant metadata changes in the connected external catalog. Tags originating from an external catalog can be found on the tags list page and on the data dictionary page for each data source.
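The linking step described above can be pictured as a lookup by name. The following is a minimal sketch, not Immuta's implementation: the `auto_link` function, the dictionary shape, and the sample names and asset ID are hypothetical, and exact string matching is an assumption based on the requirement that names be the same in both systems.

```python
from typing import Dict, List, Optional


def auto_link(
    data_sources: List[str], catalog_assets: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Map each data source's fully qualified name to a catalog asset ID.

    catalog_assets maps an asset's full name to its ID (hypothetical shape).
    A data source with no identically named asset stays unlinked (None).
    """
    return {name: catalog_assets.get(name) for name in data_sources}


# Hypothetical names: only the first data source has a matching asset.
assets = {"analytics.public.customers": "asset-1"}
links = auto_link(
    ["analytics.public.customers", "analytics.public.orders"], assets
)
```

In this sketch, a data source left with `None` corresponds to one that would need to be linked manually.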
See below for more information about the way Immuta integrates with each supported external catalog provider.
Immuta's Alation integration supports importing both tags and custom fields, Alation's two primary ways of allowing data stewards to apply metadata to data assets.
Tags: Tags are a single word or phrase that can be attached to most Alation objects by nearly anyone. For instance, users can add a PCI
tag for financial data.
Custom fields: Custom fields are key-value pairs that can only be attached and removed by authorized users. Unlike tags, custom fields can have multiple values associated with a single key. For example, the custom field DK_STEWARD
could have MARKETING
, FINANCE
, and CUSTOMER
values associated with it. Using Alation custom fields allows you to explicitly control who can modify information associated with that field inside of Alation, whereas Alation standard tags are modifiable by any user inside of Alation.
When pulled into Immuta, Alation tags and custom fields will be applied to data sources as either column or data source tags in Immuta. Importing both Alation tags and custom fields into Immuta provides full flexibility for customers leveraging the Alation enterprise data catalog, no matter what operating model they choose to document their metadata in Alation.
Collibra tags using the dot "." delimiter will be transformed into hierarchical tags in Immuta. To learn more about the benefits of hierarchical tags for policy authoring, see tag hierarchy.
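As an illustration of the dot-delimiter behavior, the sketch below splits a tag name into its chain of ancestors. It is only a conceptual model: the `tag_hierarchy` function and the sample tag name are hypothetical, and how Immuta stores hierarchical tags internally is not described here.

```python
from typing import List


def tag_hierarchy(tag: str, delimiter: str = ".") -> List[str]:
    """Return the chain of tags from the top-level parent down to the full tag."""
    parts = tag.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


# Hypothetical Collibra tag name containing two dots.
levels = tag_hierarchy("Classification.Sensitive.PHI")
```

Under this model, a policy written against a parent level would also cover every tag beneath it, which is the benefit the tag hierarchy page describes.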
Immuta's Collibra integration supports importing both tags and attributes. Additionally, data source and column descriptions from the connected Collibra catalog will be pulled into Immuta.
Tags: Tags are a single word or phrase that can be attached to objects in Collibra. For instance, users can add a PHI tag on health-related data assets.
Attributes: An attribute in Collibra is a characteristic that describes an asset with an individual field. Unlike tags, attributes can have multiple values associated with a single key. For example, the attribute classification could have non sensitive, sensitive, and highly sensitive values associated with it. Using Collibra attributes allows you to explicitly control who can modify information associated with that field inside of Collibra, whereas Collibra standard tags are modifiable by any user inside of Collibra.
When pulled into Immuta, Collibra tags and attributes will be applied to data sources as either column or data source tags in Immuta. Importing both Collibra tags and attributes into Immuta provides full flexibility for customers leveraging the Collibra data catalog, no matter what operating model they choose to document their metadata in Collibra.
Linking to data sources and columns in Collibra: Immuta links data sources to assets in Collibra by looking up the full name. To ensure unique names that Immuta can easily link to, it is recommended that customers use Collibra Edge to onboard their data sources into Collibra.
Pull and apply tags in Immuta from Collibra: Immuta checks Collibra every 24 hours by observing the linked assets history for any relevant metadata changes. Based on these changes, Immuta then only polls and ingests tags from Collibra for the relevant data sources. However, if Immuta observes more than 25,000 metadata changes in Collibra within 24 hours, it will poll all data sources for tags during that run of external catalog tag synchronization.
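The choice between an incremental and a full synchronization can be sketched as follows. This is a hedged illustration of the rule stated above, not Immuta's code: the function name, its parameters, and the way changed data sources are identified are assumptions.

```python
from typing import List, Set

# From the text: more than 25,000 observed changes triggers a full poll.
FULL_SYNC_THRESHOLD = 25_000


def data_sources_to_poll(
    all_sources: List[str], changed_sources: Set[str], change_count: int
) -> List[str]:
    """Pick which data sources to poll for tags during one sync run."""
    if change_count > FULL_SYNC_THRESHOLD:
        # Too many changes to track individually: poll every data source.
        return list(all_sources)
    # Otherwise poll only the data sources whose linked assets changed.
    return [source for source in all_sources if source in changed_sources]
```

For example, with three linked data sources and one changed asset, only that data source is polled unless the change count exceeds the threshold.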
Collibra assets must have unique full names in order for Immuta to guarantee exact matching. If there are multiple Collibra assets with the same name, Immuta will link to the first asset it matches to.
Columns must have a direct relation to their parent asset in Collibra. Indirect/inherited relations are not supported and will result in column tags and attributes not being ingested into Immuta.
Private preview
The Microsoft Purview catalog integration is only available to select accounts. Contact your Immuta representative to enable this feature.
The Microsoft Purview catalog integration with Immuta currently supports ingestion of Classifications and Managed attributes as tags. Additionally, data source and column descriptions from the connected Microsoft Purview catalog will be pulled into Immuta.
Linking to data sources and columns in Microsoft Purview: Immuta links data sources to assets in Microsoft Purview by looking up the fully qualified name of an entity. The composition of the fully qualified name in Microsoft Purview differs depending on the technology type backing the data source.
Pull and apply tags in Immuta from Microsoft Purview: Immuta polls Microsoft Purview every 24 hours for all tags.
Standard tags from Purview do not get ingested into Immuta.
The current implementation only supports Databricks Unity Catalog, Snowflake, and Azure Synapse Analytics data sources and their associated columns.
Managed attributes are supported, but have the following limitations:
If a managed attribute is applied to an Immuta data source but later expires, it will still appear as a tag on the data source. Expired attributes must be removed from the object in Purview for the tag to be removed from the Immuta data source.
The following managed attribute data types are not supported and will not be applied to Immuta data sources as tags:
Dates
Number types
Rich text
If users have an unsupported catalog, or have customized their catalog integration, they can connect through the REST Catalog using the Immuta API.
For more details about using a custom REST catalog with Immuta, see the Custom REST Catalog Interface Introduction.
Design partner preview: This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Users can connect their Databricks Unity Catalog account to allow Immuta to ingest Databricks tags and apply them to Databricks data sources. To learn more about Databricks Unity Catalog tag ingestion, see the Databricks Unity Catalog reference guide.
Users can connect a Snowflake account to allow Immuta to ingest Snowflake tags onto Snowflake data sources. To learn more about Snowflake tag ingestion, see the Snowflake reference guide.
Tags ingested from external catalogs cannot be edited within Immuta. To edit, delete, or add a tag from an external catalog to a data source or column, make the change in the external catalog.
You can configure multiple external catalogs within a single tenant of Immuta, but only one external catalog can be linked to a data source.
S3 data sources cannot currently be linked to external catalogs.
To configure an external catalog, see the Configuration how-to guide.
To learn more about how Immuta can automatically tag your data with Discover, see the Discover introduction.
In this integration, Immuta generates policy-enforced views in a schema in your configured Azure Synapse Analytics Dedicated SQL pool for tables registered as Immuta data sources.
This guide outlines how to integrate Azure Synapse Analytics with Immuta.
Azure Synapse Analytics configuration: Configure the integration in Immuta.
Azure Synapse Analytics integration reference guide: This guide describes the design and components of the integration.
Integration health statuses: This reference guide provides descriptions of the possible statuses of a configured integration.
Hive metastore |
Unity Catalog metastore |
Cluster 1 | 9.1 | Unavailable | Unavailable |
Cluster 2 | 10.4 | Unavailable | Unavailable |
Cluster 3 | 11.3 | Unavailable |
Cluster 4 | 11.3 |
Cluster 5 | 11.3 |
| 392 and newer | Required | This property enables the integration. |
| 392 and newer | Optional |
| 413 and newer | Optional |
| 392 and newer | Optional |
| 392 and newer | Required |
| 392 and newer | Optional | This property allows you to specify a path to your CA file. |
| 392 and newer | Optional | Amount of time in seconds for which a user's specific representation of an Immuta data source will be cached for. Changing this will impact how quickly policy changes are reflected for users actively querying Starburst. By default, cache expires after 30 seconds. |
| 392 and newer | Optional | Amount of time in seconds for which a user's available Immuta data sources will be cached for. Changing this will impact how quickly data sources will be available due to changing projects or subscriptions. By default, cache expires after 30 seconds. |
| 392 and newer | Required | The protocol and fully qualified domain name (FQDN) for the Immuta tenant used by Starburst (for example, |
| 392 and newer | Optional | When set to false, Immuta won't filter unallowed table metadata, which helps ensure Immuta remains noninvasive and performant. If this property is set to true, running |
| 420 and newer | Required if | This property identifies the Starburst group that is the Immuta administrator. The users in this group will not have Immuta policies applied to them. Therefore, data sources should be created by users in this group so that they have access to everything. This property can be used in conjunction with the |
| 392 and newer | Required if | This property identifies the Starburst user who is an Immuta administrator (for example, |
| 392 and newer | Required | This property enables the integration. |
| 392 and newer | Optional | Trino allows you to enable multiple system access control providers at the same time. To do so, add providers to this property as comma-separated values. This approach allows Immuta to work with existing Trino installations that have already configured an access control provider. Immuta does not manage all permissions in Trino and will default to allowing access to anything Immuta does not manage so that the Starburst (Trino) integration complements existing controls. For example, if the Starburst (Trino) integration is configured to allow users write access to tables that are not protected by Immuta, you can still lock down write access for specific non-Immuta tables using an additional access control provider. |
| 413 and newer | Optional |
| 392 and newer | Optional |
| 392 and newer | Required |
| 392 and newer | Optional | This property allows you to specify a path to your CA file. |
| 392 and newer | Optional | Amount of time in seconds for which a user's specific representation of an Immuta data source will be cached for. Changing this will impact how quickly policy changes are reflected for users actively querying Trino. By default, cache expires after 30 seconds. |
| 392 and newer | Optional | Amount of time in seconds for which a user's available Immuta data sources will be cached for. Changing this will impact how quickly data sources will be available due to changing projects or subscriptions. By default, cache expires after 30 seconds. |
| 392 and newer | Required | The protocol and fully qualified domain name (FQDN) for the Immuta tenant used by Trino (for example, |
| 392 and newer | Optional | When set to false, Immuta won't filter unallowed table metadata, which helps ensure Immuta remains noninvasive and performant. If this property is set to true, running |
| 420 and newer | Required if | This property identifies the Trino group that is the Immuta administrator. The users in this group will not have Immuta policies applied to them. Therefore, data sources should be created by users in this group so that they have access to everything. This property can be used in conjunction with the |
| 392 and newer | Required if | This property identifies the Trino user who is an Immuta administrator (for example, |
This page describes the Azure Synapse Analytics integration, through which Immuta applies policies directly in Azure Synapse Analytics. For a tutorial on configuring Azure Synapse Analytics see the Azure Synapse Integration page.
The Azure Synapse Analytics integration is a policy push integration that allows Immuta to apply policies directly in Azure Synapse Analytics Dedicated SQL pools without the need for users to go through a proxy. Instead, users can work within their existing Synapse Studio and have per-user policies dynamically applied at query time.
This integration works on a per-Dedicated-SQL-pool basis: all of Immuta's policy definitions and user entitlements data need to be in the same pool as the target data sources because Dedicated SQL pools do not support cross-database joins. Immuta creates schemas inside the configured Dedicated SQL pool that contain policy-enforced views that users query.
When the integration is configured, the Application Admin specifies the following:
Immuta Database: This is the pre-existing database Immuta uses. Immuta will create views from the tables contained in this database, and all schemas and views created by Immuta will exist in this database, such as the schemas immuta_system, immuta_functions, and immuta_procedures, which contain the tables, views, UDFs, and stored procedures that support the integration.
Immuta Schema: The schema that Immuta manages. All views generated by Immuta for tables registered as data sources will be created in this schema.
User Profile Delimiters: Since Azure Synapse Analytics dedicated SQL pools do not support array or hash objects, certain user access information is stored as delimited strings; the Application Admin can modify those delimiters to ensure they do not conflict with possible characters in strings.
For a tutorial on configuring the integration see the Azure Synapse Integration page.
Synapse data sources are represented as views and are under one schema instead of a database, so their view names are a combination of their schema and table name, separated by an underscore.
For example, with a configuration that uses IMMUTA as the schema in the database dedicated_pool, the view name for the data source dedicated_pool.tpc.case would be dedicated_pool.IMMUTA.tpc_case.
You can see the view information on the data source overview page under Connection Information.
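The naming rule above can be expressed as a small helper. This is a sketch for illustration only; the `synapse_view_name` function is hypothetical and simply reproduces the documented pattern of joining the source schema and table name with an underscore inside the Immuta schema.

```python
def synapse_view_name(
    database: str, immuta_schema: str, source_schema: str, table: str
) -> str:
    """Build the fully qualified name of the Immuta view for a Synapse table."""
    return f"{database}.{immuta_schema}.{source_schema}_{table}"


# The example from the text: data source dedicated_pool.tpc.case.
view = synapse_view_name("dedicated_pool", "IMMUTA", "tpc", "case")
```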
This integration uses webhooks to keep views up-to-date with the corresponding Immuta data sources. When a data source or policy is created, updated, or disabled, a webhook is called that creates, modifies, or deletes the dynamic view in the Immuta schema. Note that only standard views are available because Azure Synapse Analytics Dedicated SQL pools do not support secure views.
An Immuta Application Administrator configures the Synapse integration, registering their initial Synapse Dedicated SQL pool with Immuta.
Immuta creates Immuta schemas inside the configured Synapse Dedicated SQL pool.
A Data Owner registers Synapse tables in Immuta as data sources. A Data Owner, Data Governor, or Administrator creates or changes a policy or user in Immuta.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
The Immuta Web Service calls a stored procedure that modifies the user entitlements or policies and updates data source view definitions as necessary.
A Synapse user who is subscribed to the data source in Immuta queries the corresponding data source view in Synapse and sees policy-enforced data.
Private preview: This integration is available to select accounts. Reach out to your Immuta representative for details.
The Google BigQuery integration allows users to query policy protected data directly in BigQuery as secure views within an Immuta-created dataset. Immuta controls who can see what within the views, allowing data governors to create complex ABAC policies and data users to query the right data within the BigQuery console.
Google BigQuery is configured through the Immuta console and a script provided by Immuta. While you can complete some steps within the BigQuery console, it is easiest to install using gcloud and the Immuta script.
Once Google BigQuery has been configured, BigQuery admins can start creating subscription and data policies to meet compliance requirements and users can start querying policy protected data directly in BigQuery.
Create a global subscription or supported data policy.
Revoke user access to the original datasets and grant users access to the Immuta-created datasets in BigQuery.
Users query data from the Immuta-created datasets directly in BigQuery.
What permissions will Immuta have in my BigQuery environment?
You can find a list of the permissions the custom Immuta role has here.
What integration features will Immuta support for BigQuery?
For private preview, Immuta supports a basic version of the BigQuery integration where Immuta can enforce specific policies on data in a single BigQuery project. At this time, workspaces, tag ingestion, user impersonation, native query audit, and multiple integrations are not supported.
In this policy push integration, Immuta creates views that contain all policy logic. Each view has a 1-to-1 relationship with the original table. Access controls are applied in the view, allowing customers to leverage Immuta’s powerful set of attribute-based policies and query data directly in BigQuery.
BigQuery is organized by projects (which can be thought of as databases), datasets (which can be compared to schemas), tables, and views. When you enable the integration, an Immuta dataset is created in BigQuery that contains the Immuta-required user entitlements information. These objects within the Immuta dataset are intended to only be used and altered by the Immuta application.
After data sources are registered, Immuta uses the custom user and role, created before the integration is enabled, to push the Immuta data sources as views into a mirrored dataset of the original table. Immuta manages grants on the created view to ensure only users subscribed to the Immuta data source will see the data.
The Immuta integration uses a mirrored dataset approach. That is, if the source dataset is named mydataset, Immuta will create a dataset named mydataset_secure, assuming that _secure is the specified Immuta dataset suffix. This mirrored dataset is an authorized dataset, allowing it to access the data of the original dataset. It will contain the Immuta-managed views, which have identical names to the original tables they’re based on.
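To make the mirrored dataset naming concrete, the sketch below derives the path of the Immuta-managed view for a given table. The `mirrored_view_path` function and the project and table names are hypothetical; the dataset suffix and the rule that views keep the original table name come from the description above.

```python
def mirrored_view_path(
    project: str, dataset: str, table: str, suffix: str = "_secure"
) -> str:
    """Return the path of the Immuta-managed view that mirrors a table.

    The view keeps the table's name and lives in a dataset named
    <dataset><suffix> within the same project.
    """
    return f"{project}.{dataset}{suffix}.{table}"


# Hypothetical project and table; dataset and suffix follow the text's example.
path = mirrored_view_path("my-project", "mydataset", "orders")
```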
Following the principle of least privilege, Immuta does not have permission to manage Google Cloud Platform users, specifically in granting or denying access to a project and its datasets. This means that data governors should limit user access to original datasets to ensure data users are accessing the data through the Immuta created views and not the backing tables. The only users who need to have access to the backing tables are the credentials used to register the tables in Immuta.
Additionally, a data governor must grant users access to the mirrored datasets that Immuta will create and populate with views. Immuta and BigQuery’s best practice recommendation is to grant access via groups in Google Cloud Platform. Because users still must be registered in Immuta and subscribed to an Immuta data source to be able to query Immuta views, all Immuta users can be granted access to the mirrored datasets that Immuta creates.
This integration can only be enabled through a manual bootstrap using the Immuta API.
This integration can only be enabled to work in a single region.
This integration supports the following policy types:
Column masking
Mask using hashing (SHA256())
Mask by making NULL
Mask using constant
Mask using a regular expression
Mask by date rounding
Mask by numeric rounding
Mask using custom functions
Row-level masking
Row visibility based on user attributes and/or object attributes
Only show rows that fall within a given time window
Minimize rows
Filter rows using custom WHERE clause
Always hide rows
See the resources below to start implementing and using the BigQuery integration:
Building global subscription and data policies to govern data
Creating projects to collaborate
Follow this guide to connect your Google BigQuery data warehouse to Immuta.
Immuta SaaS or Immuta v2023.1 or newer with the Google BigQuery integration (private preview) enabled.
Immuta role with SYSTEM_ADMIN permissions and an API key.
The Google BigQuery integration requires you to create a Google Cloud service account and role that will be used by Immuta to
create a Google BigQuery dataset that will be used to store a table of user entitlements, UDFs for policy enforcement, etc.
manage the table of user entitlements via updates when entitlements change in Immuta.
create datasets and secure views with access control policies enforced, which mirror tables inside of datasets you ingest as Immuta data sources.
You have two options to create the required Google Cloud service account and role:
The bootstrap.sh script is a shell script provided by Immuta that creates prerequisite Google Cloud IAM objects for the integration to connect. When you run this script from your command line, it will create the following items:
A new Google Cloud IAM role
A new Google Cloud service account, which will be granted the newly-created role
A JSON keyfile for the newly-created service account
You will need to use the objects created in these steps to enable the Google BigQuery integration.
Google Cloud IAM roles required to run the script
To execute bootstrap.sh from your command line, you must be authenticated to the gcloud CLI utility as a user with all of the following roles:
roles/iam.roleAdmin
roles/iam.serviceAccountAdmin
roles/serviceusage.serviceUsageAdmin
These three roles are the least-privilege set of Google Cloud IAM roles required to successfully run the bootstrap.sh script from your command line. However, having either of the following Google Cloud IAM roles will also allow you to run the script successfully:
roles/editor
roles/owner
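The role requirement above amounts to a simple check: either all three least-privilege roles are present, or one of the two broader roles is. The sketch below encodes that rule; the `can_run_bootstrap` function is hypothetical and is not part of the Immuta script or the gcloud CLI.

```python
from typing import Iterable

# All three of these roles are required together (least-privilege set).
LEAST_PRIVILEGE_ROLES = frozenset(
    {
        "roles/iam.roleAdmin",
        "roles/iam.serviceAccountAdmin",
        "roles/serviceusage.serviceUsageAdmin",
    }
)
# Either one of these broader roles is sufficient on its own.
BROAD_ROLES = frozenset({"roles/editor", "roles/owner"})


def can_run_bootstrap(granted_roles: Iterable[str]) -> bool:
    """Return True if the granted roles satisfy the documented requirement."""
    granted = set(granted_roles)
    return LEAST_PRIVILEGE_ROLES <= granted or bool(BROAD_ROLES & granted)
```

A user holding only two of the three least-privilege roles would not pass this check unless they also hold roles/editor or roles/owner.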
Install gcloud.
Set the account property in the core section for Google Cloud CLI to the account gcloud should use for authentication. (You can run gcloud auth list to see your currently available accounts):
In Immuta, navigate to the App Settings page and click the Integrations tab.
Click Add Native Integration and select Google BigQuery from the dropdown menu.
Click Select Authentication Method and select Key File.
Click Download Script(s).
Before you run the script, update your permissions to execute it:
Run the script, where
PROJECT_ID is the Google Cloud Platform project to operate on.
ROLE_ID is the name of the custom role to create.
NAME is the name of the service account to create.
OUTPUT_FILE is the path where the resulting private key should be written. File system write permission will be checked on the specified path prior to the key creation.
undelete-role (optional) will undelete the custom role from the project. Roles that have been deleted for a long time can't be undeleted. This option can fail for the following reasons:
The role specified does not exist.
The active user does not have permission to access the given role.
enable-api (optional) will enable the Google BigQuery API, provided you’ve been granted access to enable it.
Alternatively, you may use the Google Cloud Console to create the prerequisite role, service account, and private key file for the integration to connect to Google BigQuery.
Create a custom role using the console with the following privileges:
bigquery.datasets.create
bigquery.datasets.delete
bigquery.datasets.get
bigquery.datasets.update
bigquery.jobs.create
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.routines.create
bigquery.routines.delete
bigquery.routines.get
bigquery.routines.list
bigquery.routines.update
bigquery.tables.create
bigquery.tables.delete
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
bigquery.tables.setCategory
bigquery.tables.update
bigquery.tables.updateData
bigquery.tables.updateTag
Create a service account and grant it the custom role you just created.
Once the Google Cloud IAM custom role and service account are created, you can enable the Google BigQuery integration. This section illustrates how to enable the integration on the Immuta app settings page. To configure this integration via the Immuta API, see the Configure a Google BigQuery integration API guide.
In Immuta, navigate to the App Settings page and click the Integrations tab.
Click Add Native Integration and select Google BigQuery from the dropdown menu.
Click Select Authentication Method and select Key File.
Upload your GCP Service Account Key File. This is the private key file generated in create a Google Cloud service account and role for Immuta to use to connect to Google BigQuery. Uploading this file will auto-populate the following fields:
Project Id: The Google Cloud Platform project to operate on, where your Google BigQuery data warehouse is located. A new dataset will be provisioned in this Google BigQuery project to store the integration configuration.
Service Account: The service account you created in create a Google Cloud service account and role for Immuta to use to connect to Google BigQuery.
Complete the following fields:
Immuta Dataset: The name of the Google BigQuery dataset to provision inside of the project. Important: if you are using multiple environments in the same Google BigQuery project, this dataset name must be unique across environments.
Immuta Role: The custom role you created in the "Create a Google Cloud service account and role for Immuta to use to connect to Google BigQuery" section.
Dataset Suffix: The suffix that will be appended to the name of each dataset created to store secure views, one per dataset that you ingest a table for as a data source in Immuta. Important: if you are using multiple environments in the same Google BigQuery project, this suffix must be unique across environments.
GCP Location: The dataset’s location. After a dataset is created, the location can't be changed. Note that if you choose EU for the dataset location, your Core BigQuery Customer Data resides in the EU.
Click Test Google BigQuery Integration.
Click Save.
GCP location must match dataset region
The region set for the GCP location must match the region of your datasets. Set GCP location to a general region (for example, US) to include child regions.
You can disable the Google BigQuery integration automatically or manually.
Click the App Settings icon, and then click the Integrations tab.
Select the Google BigQuery integration you would like to disable, and select the Disable Integration checkbox.
Click Save.
The privileges required to run the cleanup script are the same as the Google Cloud IAM roles required to run the bootstrap.sh script.
Click the App Settings icon, and then click the Integrations tab.
Select the Google BigQuery integration you would like to disable, and click Download Scripts.
Click Save. Wait until Immuta has finished saving your configuration changes before proceeding.
Before you run the script, update your permissions to execute it:
Run the cleanup script.
Build global subscription policies and data policies
Create projects to securely collaborate on analytical workloads
Private preview: The Amazon S3 integration is available to select accounts. Reach out to your Immuta representative for details.
Immuta's Amazon S3 integration allows users to apply subscription policies to data in S3 to restrict what prefixes, buckets, or objects users can access. To enforce access controls on this data, Immuta creates S3 grants that are administered by S3 Access Grants, an AWS feature that defines access permissions to data in S3.
No location is registered in your S3 Access Grants instance before configuring the integration in Immuta
Write policies private preview enabled for your account; contact your Immuta representative to get this feature enabled
APPLICATION_ADMIN Immuta permission to configure the integration
CREATE_S3_DATASOURCE Immuta permission to register S3 prefixes
The AWS account credentials or optional AWS IAM role you provide Immuta when configuring the integration must have the permissions to perform the following actions to create locations and issue grants:
accessgrantslocation resource:
s3:CreateAccessGrant
s3:DeleteAccessGrantsLocation
s3:GetAccessGrantsLocation
accessgrantsinstance resource:
s3:CreateAccessGrantsLocation
s3:GetAccessGrantsInstance
s3:GetAccessGrantsInstanceForPrefix
s3:GetAccessGrantsInstanceResourcePolicy
s3:ListAccessGrants
s3:ListAccessGrantsLocations
accessgrant resource:
s3:DeleteAccessGrant
s3:GetAccessGrant
bucket resource: s3:ListBucket
role resource:
iam:GetRole
iam:PassRole
all resources: s3:ListAccessGrantsInstances
Follow AWS documentation to create an Access Grants instance using the S3 console, AWS CLI, AWS SDKs, or the REST API. AWS supports one Access Grants instance per region per AWS account.
Follow the instructions at the top of the "Register a location" page in AWS documentation to create an AWS IAM role and give the S3 Access Grants service principal access to this role in the resource policy file. You will add this role to your integration configuration in Immuta so that Immuta can register this role with your Access Grants location. The AWS documentation linked above gives a complete policy example, but your policy should include the following permissions:
sts:AssumeRole
sts:SetSourceIdentity
sts:SetContext
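As a rough illustration, a trust policy that grants these three actions to the S3 Access Grants service could be assembled as shown below. This is a minimal sketch: the access-grants.s3.amazonaws.com service principal follows the AWS "Register a location" documentation, and the condition keys that AWS recommends are omitted here, so treat the AWS example as the authoritative version.

```python
import json

# Actions listed above that the S3 Access Grants service needs on the role.
TRUST_ACTIONS = ["sts:AssumeRole", "sts:SetSourceIdentity", "sts:SetContext"]


def build_trust_policy() -> dict:
    """Return a trust policy that lets S3 Access Grants assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Service principal for S3 Access Grants (per AWS documentation).
                "Principal": {"Service": "access-grants.s3.amazonaws.com"},
                "Action": TRUST_ACTIONS,
            }
        ],
    }


print(json.dumps(build_trust_policy(), indent=2))
```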
Follow the instructions at the top of the "Register a location" page in AWS documentation to create an IAM policy with the following permissions, and attach the policy to the IAM role you created to grant the permissions to the role. The AWS documentation linked above gives a complete example, but the policy should at least include the following permissions:
s3:GetObject
s3:GetObjectVersion
s3:GetObjectAcl
s3:GetObjectVersionAcl
s3:ListMultipartUploadParts
s3:PutObject
s3:PutObjectAcl
s3:PutObjectVersionAcl
s3:DeleteObject
s3:DeleteObjectVersion
s3:AbortMultipartUpload
s3:ListBucket
s3:ListAllMyBuckets
iam:PassRole
If you use server-side encryption with AWS Key Management Service (AWS KMS) keys to encrypt your data, the following permissions are required for the IAM role in the policy. If you do not use this feature, do not include these permissions in your IAM policy:
kms:Decrypt
kms:GenerateDataKey
Optionally, create an AWS IAM role that Immuta can use to create Access Grants locations and issue grants. This role must have the S3 permissions listed in the permissions section.
In Immuta, click App Settings in the navigation menu and click the Integrations tab.
Click + Add Native Integration.
Select Amazon S3 from the dropdown menu and click Continue Configuration.
Complete the connection details fields, where
Friendly Name is a name for the integration that is unique across all Amazon S3 integrations configured in Immuta.
AWS Account ID is the ID of your AWS account.
AWS Region is the AWS region to use.
S3 Access Grants Location IAM Role ARN is the role the S3 Access Grants service assumes to vend credentials to the grantee. When a grantee accesses S3 data, the Access Grants service attaches session policies and assumes this role in order to vend credentials scoped to a prefix or bucket to the grantee. This role needs full access to all paths under the S3 location prefix.
S3 Access Grants S3 Location Scope is the base S3 location that Immuta will use for this connection when registering S3 prefixes. This path must be unique across all S3 integrations configured in Immuta. During data source registration, this prefix is prepended to the data source prefixes to build the final path used to grant or revoke access to that data in S3. For example, a location prefix of s3://research-data would be prepended to the data source prefix /demographics to generate a final path of s3://research-data/demographics.
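The way the location scope and a data source prefix combine can be sketched in a few lines of Python. The function name and the slash handling are assumptions made for illustration; Immuta's own normalization rules are not described here.

```python
def build_final_path(location_scope: str, data_source_prefix: str) -> str:
    """Prepend the location scope to a data source prefix."""
    prefix = data_source_prefix.lstrip("/")
    # A scope such as "s3://" already ends with a slash, so avoid adding another.
    if location_scope.endswith("/"):
        return location_scope + prefix
    return location_scope + "/" + prefix


print(build_final_path("s3://research-data", "/demographics"))
# prints: s3://research-data/demographics
```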
Select your authentication method:
Access using AWS IAM role: Provide an AWS IAM Role that Immuta will assume when interacting with the AWS API. This option allows you to provide Immuta with an IAM role from your AWS account that is granted a trust relationship with Immuta's IAM role for providing S3 access grants operations. Immuta will assume this IAM role from Immuta's AWS account in order to perform any operations in your AWS account. Before proceeding, contact your Immuta representative for the AWS account to add to your trust policy. Then, complete the steps below.
Enter the role ARN in the AWS IAM Role field. Immuta will assume this role when interacting with AWS.
Set the external ID provided in a condition on the trust relationship for the cross-account IAM role specified above. See the AWS documentation for guidance.
Access using access key and secret access key: Provide your AWS Access Key ID and AWS Secret Access Key.
Click Verify Credentials.
Click Next to review and confirm your connection information, and then click Complete Setup.
You can edit the following settings for an existing Amazon S3 integration on the app settings page:
friendly name
authentication type and values (access key, secret, and role)
To edit settings for an existing integration via the API, see the Configure an Amazon S3 integration API guide.
Follow the Create an S3 data source guide to register prefixes in Immuta.
To create an S3 data source using the API, see the Configure an S3 integration and create an S3 data source API guide.
Requirements: USER_ADMIN Immuta permission and either the GOVERNANCE or CREATE_S3_DATASOURCE Immuta permission
Build read or write subscription policies in Immuta to enforce access controls.
Map AWS IAM principals to each Immuta user to ensure Immuta properly enforces policies:
Click People and select Users in the navigation menu.
Navigate to the user's page and click the more actions icon next to their username.
Select Change S3 User or AWS IAM Role from the dropdown menu.
Use the dropdown menu to select the User Type. Then complete the S3 field. When selecting Unset (fallback to Immuta username), the S3 username is assumed to be the same as the Immuta username. User and role names are case-sensitive. See the AWS documentation for details.
Click Save.
See the Mapping IAM principals in Immuta section for details about supported principals.
Requirement: User must be subscribed to the data source in Immuta
Request access to Amazon S3 data through S3 Access Grants. If you're accessing S3 data through one of the supported S3 Access Grants integrations (such as Amazon EMR on EC2), that application will make this request on your behalf, so you can skip this step.
Immuta's Amazon S3 integration allows users to apply subscription policies to data in S3 to restrict what prefixes, buckets, or objects users can access. To enforce access controls on this data, Immuta creates S3 grants that are administered by S3 Access Grants, an AWS feature that defines access permissions to data in S3.
With this integration, users can avoid
hand-writing AWS IAM policies
managing AWS IAM role limits
manually tracking what user or role has access to what files in AWS S3 and verifying those are consistent with intent
To enforce controls on S3 data, Immuta interacts with several S3 Access Grants components:
Access Grants instance: An Access Grants instance is a logical container for individual grants that specify who can access what level of data in S3 in your AWS account and region. AWS supports one Access Grants instance per region per AWS account.
Location: A location specifies what data the Access Grants instance can grant access to. For example, registering a location with a scope of s3:// allows Access Grants to manage access to all S3 buckets in that AWS account and region, whereas setting the bucket s3://research-data as the scope limits Access Grants to managing access to that single bucket for that location. When you configure the S3 integration in Immuta, you specify a location's scope and IAM assumed role, and Immuta registers the location in your Access Grants instance and associates it with the provided IAM role for you. Each S3 integration you configure in Immuta is associated with one location, and Immuta manages all grants in that location. Therefore, grants cannot be manually created by users in an Access Grants instance location that Immuta has registered and manages. During data source registration, this location scope is prepended to the data source prefixes to build the final path used to grant or revoke access to that data in S3. For example, a location scope of s3://research-data would be prepended to the data source prefix /demographics to generate a final path of s3://research-data/demographics.
Individual grants: Individual permission grants in S3 Access Grants specify the identity that can access the data, the access level, and the location of the S3 data. Immuta creates a grant for each user subscribed to a prefix, bucket, or object by interacting with the Access Grants API. Each grant has its own ID and gives the user or role principal access to the data.
IAM assumed role: This is an IAM role you create in S3 that has full access to all prefixes, buckets, and objects in the Access Grants location registered by Immuta. This IAM role is used to vend temporary credentials to users or applications. When a grantee requests temporary credentials, the S3 Access Grants service assumes this role to vend credentials scoped to the prefix, bucket, or object specified in the grant to the grantee. The grantee then uses these credentials to access S3 data. When configuring the integration in Immuta, you specify this role, and then Immuta associates this role with the registered location in the Access Grants instance.
Temporary credentials: These just-in-time access credentials provide access to a prefix, bucket, or object with a permission level of READ or READWRITE in S3. When a user or application requests temporary credentials to access S3 data, the S3 Access Grants instance evaluates the request against the grants Immuta has created for that user. If a matching grant exists, S3 Access Grants assumes the IAM role associated with the location of the matching grant and scopes the permissions of the IAM session to the S3 prefix, bucket, or object specified by the grant and vends these temporary credentials to the requester. These credentials have a default timeout of 1 hour, but this duration can be changed by the requester.
The diagram below illustrates how these S3 Access Grants components interact.
For more details about these Access Grants concepts, see the S3 Access Grants documentation.
After an administrator creates an Access Grants instance and an assumed IAM role in their AWS account, an application administrator configures the Amazon S3 integration in Immuta. During configuration, the administrator provides the following connection information so that Immuta can create and register a location in that Access Grants instance:
AWS account ID and region
ARN for the existing Access Grants instance
ARN for the assumed IAM role
When Immuta registers this location, it associates the assumed IAM role with the location. This allows the IAM role to create temporary credentials with access scoped to a particular S3 prefix, bucket, or object in the location. The IAM role you create for this location must have all the object- and bucket-level permissions listed in the set up S3 Access Grants instance section on all buckets and objects in the location; if it is missing permissions, the IAM role will not be able to grant those missing permissions to users or applications requesting temporary credentials.
In the example below, an application administrator registers the following location prefix and IAM role for their Access Grants instance in AWS account 123456:
Location path: s3://. This path allows a single Amazon S3 integration to manage all objects in S3 in that AWS account and region. Data owners can scope down access further when registering specific S3 prefixes and applying policies.
Location IAM role: The arn:aws:iam::123456:role/access-grants-role IAM role will be used to vend temporary credentials to users and applications.
Immuta registers this location and associated IAM role in the user's Access Grants instance:
After the S3 integration is configured, a data owner can register S3 prefixes and buckets that are in the configured Access Grants location path to enforce access controls on resources. Immuta stores the connection information for the prefix so that the metadata can be used to create and enforce subscription policies on S3 data.
A data owner or governor can apply a subscription policy to a registered prefix, bucket, or object to control who can access objects beginning with that prefix or in that bucket after it is registered in Immuta. Once a subscription policy is created and Immuta users are subscribed to the prefix, bucket, or object, Immuta calls the Access Grants API to create a grant for each subscribed user, specifying the following parameters in the payload so that Access Grants can create and store a grant for each user:
Access Grants location
READ access
User or role principal
Registered prefix, bucket, or object
In the example below, a data owner registers the s3://research-data/* bucket, and Immuta stores the connection information in the Immuta metadata database. Once the user, Taylor, is subscribed to s3://research-data/*, Immuta calls the Access Grants API to create a grant for that user to allow them to read and write S3 data in that bucket:
To access S3 data registered in Immuta, users must be subscribed to the prefix, bucket, or object in Immuta, and their principals must be mapped to their Immuta user accounts. Once users are subscribed, they request temporary credentials from S3 Access Grants. Access Grants looks up the grant ID associated with the requester. If no matching grant exists, they receive an access denied error. If one exists, Access Grants assumes the IAM role associated with the location and requests temporary credentials that are scoped to the prefix, bucket, or object and permissions specified by the individual grant. Access Grants vends the credentials to the requester, who uses those temporary credentials to access the data in S3.
In the example below, Taylor requests temporary credentials from S3 Access Grants. Access Grants looks up the grant ID (1) for that user, assumes the arn:aws:iam::123456:role/access-grants-role IAM role for the location, and vends temporary credentials to Taylor, who then uses the credentials to access the research-data bucket in S3:
Note that when accessing data through S3 Access Grants, the user or application interacts directly with the Access Grants API to request temporary credentials; Immuta does not act in this process at all. See the diagram below for an illustration of the process for accessing data through S3 Access Grants.
AWS services that support S3 Access Grants will request temporary credentials for users automatically. If users are not using a service that supports S3 Access Grants, they must have the permissions listed in the AWS documentation to call the Access Grants API directly themselves to request temporary credentials to access data through the access grant.
For a list of AWS services that support S3 Access Grants, see the AWS documentation.
Immuta's S3 integration allows data owners and governors to apply object-level access controls on data in S3 through subscription policies. When a user is subscribed to a registered prefix, bucket, or object, Immuta calls the Access Grants API to create an individual grant that narrows the scope of access within the location to that registered prefix, bucket, or object. See the diagram below for a visualization of this process.
When a user's entitlements change or a subscription policy is added to, updated, or deleted from a prefix, Immuta performs one of the following processes for each user subscribed to the registered prefix:
User added to the prefix: Immuta specifies a permission (READ or READWRITE) for each user and uses the Access Grants API to create an individual grant for each user.
User updated: Immuta deletes the current grant ID and creates a new one using the Access Grants API.
User deleted: Immuta deletes the grant ID using the Access Grants API.
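The three cases above amount to reconciling the grants that exist with the grants that should exist. The sketch below models that logic with plain dictionaries; it is an illustration only, not Immuta's implementation, and the function name, principal names, and tuple format are all hypothetical.

```python
def plan_grant_changes(current: dict, desired: dict) -> list:
    """Return (action, principal, permission) tuples describing the grant changes.

    Both arguments map a principal name to "READ" or "READWRITE".
    """
    actions = []
    for principal, permission in desired.items():
        if principal not in current:
            # User added: create a new grant.
            actions.append(("create", principal, permission))
        elif current[principal] != permission:
            # User updated: delete the existing grant, then create a new one.
            actions.append(("delete", principal, current[principal]))
            actions.append(("create", principal, permission))
    for principal, permission in current.items():
        if principal not in desired:
            # User deleted: remove the grant.
            actions.append(("delete", principal, permission))
    return actions


current = {"taylor": "READ", "alex": "READ"}
desired = {"taylor": "READWRITE", "sam": "READ"}
for action in plan_grant_changes(current, desired):
    print(action)
```

In a real deployment each tuple would correspond to a call to the Access Grants API; here the plan is only printed.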
Immuta offers two subscription policy access types to manage read and write access to data in S3:
Read access policies manage who can get objects from S3.
Write access policies manage who can modify data in S3.
Data policies, which provide more granular controls by redacting or masking values in a table, are not supported for S3.
Data owners can register an S3 prefix at any level in the S3 path by creating an Immuta data source. During this process, Immuta stores the connection information for use in subscription policies.
Each prefix added in the data registration workflow is created as a single Immuta data source, and a subscription policy added to a data source applies to any objects in that bucket or beginning with that prefix:
Therefore, data owners should register prefixes or buckets at the lowest level of access control they need for that data. Using the example above, if the data owner needed to allow different users to access s3://yellow-bucket/research-data/* than those who should access s3://yellow-bucket/analyst-data/*, the data owner must register the research-data/* and analyst-data/* prefixes separately and then apply a subscription policy to those prefixes:
When an S3 data source is deleted, Immuta deletes all the grants associated with that prefix, bucket, or object in that location.
Names are case-sensitive
The IAM role name and IAM user name are case-sensitive. See the AWS documentation for details.
Immuta supports mapping an Immuta user to one of the following AWS IAM principals:
IAM role principals: Only a single Immuta user can be mapped to an IAM role. This restriction prohibits enforcing policies on AWS users who could assume that role. Therefore, if using role principals, create a new user in Immuta that represents the role so that the role then has the permissions applied specifically to it.
See the protect data section for instructions on mapping principals to user accounts in Immuta.
The Amazon S3 integration will not interfere with existing legacy S3 integrations, and multiple S3 integrations can exist in a single Immuta tenant.
During private preview, Immuta supports up to 500 prefixes (data sources) and up to 20 Immuta users that are mapped to S3 principals. This is a preview limitation that will be removed in a future phase of the integration.
S3 Access Grants allows 100,000 grants per region per account. Thus, if you have 5 Immuta users with access to 20,000 registered prefixes, you would reach this limit. See AWS documentation for details.
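The arithmetic behind that example is simple to check. The sketch below assumes one grant per user per registered prefix, which is how the example above counts them; the function name is hypothetical.

```python
GRANT_LIMIT = 100_000  # grants per region per account, as stated above


def grants_needed(users: int, prefixes_per_user: int) -> int:
    """Estimate grants as one per user per registered prefix the user can access."""
    return users * prefixes_per_user


needed = grants_needed(5, 20_000)
print(needed, needed >= GRANT_LIMIT)
# prints: 100000 True
```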
The following Immuta features are not currently supported by the integration in private preview:
Audit
Automatically syncing Immuta with AWS IAM identities: you cannot set the S3 User Type field to AWS IAM User when configuring your identity provider (IdP) in Immuta
Data policies
Schema monitoring
Tag ingestion
The table below provides definitions for each status and the state of configured data platform integrations. The status of the integration appears on the integrations tab of the Immuta application settings page and in the response schema of the integrations API.
If any errors occur with the integration configuration, a banner will appear in the Immuta UI with guidance for remediating the error.
Status | Description | State |
---|---|---|
This page describes the Azure Synapse integration, configuration options, and features. See the Azure Synapse integration page for a tutorial on enabling the integration and these features through the App Settings page.
A running Dedicated SQL pool
The Azure Synapse Analytics integration supports the username and password authentication method to configure the integration and create data sources.
Immuta cannot ingest tags from Synapse, but you can connect any of these supported external catalogs to work with your integration.
Impersonation allows users to query data as another Immuta user in Synapse. To enable user impersonation, see the User Impersonation page.
A user can configure multiple integrations of Synapse to a single Immuta tenant.
Immuta does not support the following masking types in this integration because of limitations with Dedicated SQL pools (linked below). Any column assigned one of these masking types will be masked to NULL:
Reversible Masking: Synapse UDFs currently only support SQL, but Immuta needs to execute code (such as JavaScript or Python) to support this masking feature. See the Synapse Documentation for details.
Format Preserving Masking: Synapse UDFs currently only support SQL, but Immuta needs to execute code (such as JavaScript or Python) to support this masking feature. See the Synapse Documentation for details.
Regex: The built-in string replace function does not support full regex. See the Synapse Documentation for details.
The delimiters configured when enabling the integration cannot be changed once they are set. To change the delimiters, the integration has to be disabled and re-enabled.
If the generated view name is more than 128 characters, then the view name is shortened to 128 characters. This could cause collisions between view names if the shortened version is the same for two different data sources.
For proper updates, the Dedicated SQL pools have to be running when changes are made to users or data sources in Immuta.
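The view-name limitation above can be illustrated with a short sketch. It assumes that shortening keeps the first 128 characters, which is an assumption for illustration; the exact shortening rule is not described here, and the view names are hypothetical.

```python
MAX_VIEW_NAME_LENGTH = 128  # limit stated above


def shorten(view_name: str) -> str:
    """Truncate a view name to the maximum length (assumed: keep the leading characters)."""
    return view_name[:MAX_VIEW_NAME_LENGTH]


# Two hypothetical view names that differ only after the 128th character.
base = "sales_" + "x" * 125
a = base + "_2023"
b = base + "_2024"

print(a != b, shorten(a) == shorten(b))
# prints: True True
```

Keeping generated view names well under the limit, or making them differ early in the name, avoids this kind of collision.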
The table below provides definitions for each status and the state of configured data platform integrations. The status of the integration appears on the integrations tab of the Immuta application settings page and in the response schema of the integrations API.
If any errors occur with the integration configuration, a banner will appear in the Immuta UI with guidance for remediating the error.
Status | Description | State |
---|---|---|
The diagram below contrasts Immuta's provided catalog integration architecture with this Custom REST Catalog interface, which gives the customer extensive control over the metadata being provided to Immuta.
The custom-developed service must be built to receive and handle calls to the REST endpoints specified below. Immuta will call these endpoints as detailed below when certain events occur and at various intervals. The required responses to complete the connection are also detailed.
Tags are attributes applied to data, either at the data source (top) level or at the individual column level.
Tags in Immuta take the form of a nested tree structure. There are "parents", "children", "grand-children", etc.:
The REST Catalog interface interprets a tag's relationship mapping from a string based on standard "dot" (.) notation, in which each period separates a parent tag from its child.
Tags returned must meet the following constraints:
They must be no longer than 500 characters. Longer tags will not throw an error but will be truncated silently at 500 characters.
They must be composed of letters, digits, underscores, dashes, and whitespace characters. A period (.) is used as a separator as described above. Other special characters are not supported.
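A catalog service may want to check its tags against these constraints before returning them. The sketch below is one possible approach in Python; the function name and the example tag are hypothetical, and the decision to raise an error on unsupported characters is a design choice for this sketch rather than documented Immuta behavior.

```python
import re

MAX_TAG_LENGTH = 500
# Each dot-separated segment: letters, digits, underscores, dashes, whitespace.
SEGMENT_PATTERN = re.compile(r"[\w\s-]+")


def parse_tag(tag: str) -> list:
    """Truncate to 500 characters, split on periods, and validate each segment."""
    tag = tag[:MAX_TAG_LENGTH]  # mirrors the silent truncation described above
    segments = tag.split(".")
    for segment in segments:
        if not SEGMENT_PATTERN.fullmatch(segment):
            raise ValueError(f"unsupported characters in tag segment: {segment!r}")
    return segments


print(parse_tag("Parent.Child.Grandchild"))
# prints: ['Parent', 'Child', 'Grandchild']
```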
A tag object has a single id property, which is used to uniquely identify the tag within the catalog. This id may be of either a string or integer type, and its value is completely up to the designer of the REST Catalog service. Common examples include: a standard integer value, a UUID, or perhaps a hash of the tag's string value (if it is unique within the system).
For this Custom REST Catalog interface, tags are represented in JSON like:
For example, the object below specifies 3 different tags:
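A sketch of such an object, built in Python and serialized to JSON, is shown here. The tag names and ids are hypothetical, and the exact envelope is an assumption based on the description in this guide: each fully qualified tag name maps to a tag object with a single id property.

```python
import json

# Hypothetical tags: each dot-notation name maps to an object with one `id`.
tags = {
    "PII": {"id": 1},
    "PII.Contact": {"id": 2},
    "PII.Contact.Email": {"id": "3f2a"},  # ids may be strings or integers
}

print(json.dumps(tags, indent=2))
```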
For more information on tags and how they are created, managed, and displayed within Immuta, see our tag documentation.
Descriptions are strings that, like tags, can be applied to either a data source or an individual column. These strings support UTF-8, including special and various language characters.
Immuta can make requests to your REST Catalog service using any of the following authentication methods:
Username and password: Immuta can send requests with a username and a password in the Authorization HTTP header. In this case, the custom REST service will need to be able to parse a Basic Authorization Header and validate the credentials sent with it.
PKI Certificate: Immuta can also send requests using a CA certificate, a certificate, and a key.
No authentication: Immuta can make unauthenticated requests to your REST Catalog service. However, this should only be used if you have other security measures in place (e.g., if the service is in an isolated network that's reachable only by your Immuta environment).
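For the username and password method, the custom service has to decode the Basic Authorization header itself. The following is a minimal sketch using only the Python standard library; the function names and credentials are hypothetical, and a production service would normally rely on its web framework's authentication support instead.

```python
import base64
import hmac


def parse_basic_auth(header_value: str):
    """Return (username, password) from a Basic Authorization header, or None."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers invalid base64 and invalid UTF-8
        return None
    username, separator, password = decoded.partition(":")
    return (username, password) if separator else None


def is_authorized(header_value: str, expected_user: str, expected_password: str) -> bool:
    """Check the header against the expected credentials."""
    credentials = parse_basic_auth(header_value)
    if credentials is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    user_ok = hmac.compare_digest(credentials[0].encode(), expected_user.encode())
    password_ok = hmac.compare_digest(credentials[1].encode(), expected_password.encode())
    return user_ok and password_ok


header = "Basic " + base64.b64encode(b"immuta:s3cret").decode("ascii")
print(is_authorized(header, "immuta", "s3cret"))
# prints: True
```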
Authentication and specific endpoints
When accessing the /dataSource and /tags endpoints, Immuta will use the configured username and password. If you choose to also protect the human-readable pages with authentication, users will be prompted to authenticate when they first visit those pages.
/tags
The /tags endpoint is used to collect all the tags the catalog can provide. It is used by Immuta to populate Immuta's tags list in the Governance section. These tags can then be used for policy creation before any data sources that use them are created. This enables policies to apply immediately when data sources are registered with Immuta.
As with all external catalogs, tags ingested by Immuta from the REST catalog interface are not able to be modified locally within Immuta as this catalog becomes the "source of truth" for them. This results in the tags showing in Immuta with either a lock icon next to them, or without the delete button that would allow a user to manually remove them from an assigned data source or column.
The /tags endpoint receives a simple GET request from Immuta. Neither a payload nor query parameters are required.
Example request:
The Custom REST service must respond with an object that maps all tag name strings to associated ids. The tag name string fully-qualifies the location of the tag in the tree structure as detailed previously, and the id is a globally unique identifier assigned by the REST catalog to that tag.
Example response:
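A minimal sketch of a handler that could produce such a response, using Python's standard http.server module, follows. The catalog contents are hypothetical, and the response shape (tag name mapped to an object holding its id) is an assumption drawn from the description above rather than a confirmed schema.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical catalog contents: fully qualified tag name -> unique id.
CATALOG_TAGS = {"PII": 1, "PII.Contact": 2, "PII.Contact.Email": 3}


def build_tags_body(catalog_tags: dict) -> bytes:
    """Serialize every tag as name -> {"id": ...}."""
    payload = {name: {"id": tag_id} for name, tag_id in catalog_tags.items()}
    return json.dumps(payload).encode("utf-8")


class CatalogHandler(BaseHTTPRequestHandler):
    """Answers GET /tags; every other path returns 404."""

    def do_GET(self):
        if self.path != "/tags":
            self.send_error(404)
            return
        body = build_tags_body(CATALOG_TAGS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler could be served with http.server.HTTPServer(("", 8080), CatalogHandler).serve_forever(), where the port is an arbitrary choice.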
/dataSource
The /dataSource endpoint does the vast majority of the work. It receives a POST request from Immuta, and returns the mapping of a data source and its columns to the applied tags and descriptions.
Immuta will try to fetch metadata for a data source in the system at various times:
During data source creation. During data source creation, Immuta will send metadata to the REST Catalog service, most notably the connection details of the data source, which includes the schema and table name. It is important that the Custom REST service implemented can parse this information and search its records for an appropriate record to return with an ID unique to this data source in its catalogMetadata object.
When a user manually links the data source. Data sources that either fail to auto-link, or that were created prior to the Custom REST catalog being configured, can still be manually linked. To do so, a data source owner can provide the ID of the asset as defined by the Custom REST Catalog via the Immuta UI. In order for this to work, the Custom REST Catalog service must support matching data source assets by unique ID.
During various refreshes. Once linked, Immuta will periodically call the /dataSource
endpoint to ensure information is up to date.
Immuta's POST requests to the /dataSource endpoint consist of a payload containing many of the elements outlined below:
The Custom REST Catalog must parse this object in order to determine the specific data source metadata being requested.
For the most part, Immuta will provide the id of the data source as part of the catalogMetadata. This should be used as the primary metadata lookup value.
When a data source is being created, such an id will not yet be known to Immuta. Immuta will instead send handlerInfo information as part of the request.
When an id is not specified, the schema and table name elements should be parsed in an attempt to identify the desired catalog entry and provide an appropriate id. If such a lookup is successful and an id is returned to Immuta in the catalogMetadata section, Immuta will establish an automatic link between the new data source and the catalog entry, and future references will use that id.
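The lookup order described above can be sketched in Python as follows. The in-memory `CATALOG`, the function name, and the case-insensitive schema and table comparison are assumptions for illustration only; a real service would query its own records and apply its own matching rules.

```python
# Hypothetical in-memory catalog keyed by catalog id.
CATALOG = {
    "ds-100": {"name": "Customers", "schema": "public", "table": "customers"},
    "ds-200": {"name": "Orders", "schema": "sales", "table": "orders"},
}


def resolve_catalog_id(payload, catalog=CATALOG):
    """Return the catalog id for a /dataSource request payload, or None."""
    # Prefer the id Immuta sends once the data source is linked.
    catalog_id = (payload.get("catalogMetadata") or {}).get("id")
    if catalog_id is not None:
        key = str(catalog_id)
        return key if key in catalog else None

    # No id yet (data source creation): fall back to schema and table.
    handler = payload.get("handlerInfo") or {}
    schema, table = handler.get("schema"), handler.get("table")
    if not schema or not table:
        return None
    wanted = (schema.lower(), table.lower())  # case-insensitive match is an assumption
    for key, entry in catalog.items():
        if (entry["schema"].lower(), entry["table"].lower()) == wanted:
            return key
    return None
```

Returning `None` when nothing matches leaves the data source unlinked, so an owner can still link it manually by id later.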
The schema for the /dataSource response uses the same tag object structure as the /tags response, along with the following set of metadata keys for both data sources and columns.
Example response:
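A hedged Python sketch of assembling a response from the metadata keys this document lists for the /dataSource response (catalogMetadata, description, tags, and dictionary). The input `entry` layout, the helper names, and the `{"id": ...}` shape of each tag value are assumptions for illustration.

```python
import json


def tag_object(tags):
    """Build a tags object from {tag name: tag id}; the nested shape is an assumption."""
    return {name: {"id": tag_id} for name, tag_id in tags.items()}


def build_data_source_response(entry):
    """Assemble a /dataSource response from a hypothetical catalog entry."""
    return {
        "catalogMetadata": {"id": entry["id"], "name": entry["name"]},
        "description": entry.get("description", ""),
        "tags": tag_object(entry.get("tags", {})),
        # One key per column name, each holding that column's metadata.
        "dictionary": {
            column["name"]: {
                "catalogMetadata": {"id": column["id"]},
                "description": column.get("description", ""),
                "tags": tag_object(column.get("tags", {})),
            }
            for column in entry.get("columns", [])
        },
    }


# Hypothetical catalog entry used only to exercise the function.
EXAMPLE_ENTRY = {
    "id": "ds-100",
    "name": "Customers",
    "description": "Customer master data.",
    "tags": {"Sensitivity.Confidential": "tag-001"},
    "columns": [
        {"name": "email", "id": "col-1", "description": "Contact email.", "tags": {"PII": "tag-003"}},
        {"name": "signup_date", "id": "col-2"},
    ],
}

if __name__ == "__main__":
    print(json.dumps(build_data_source_response(EXAMPLE_ENTRY), indent=2))
```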
/dataSource/page/{id}
This endpoint returns a human-readable information page from the REST catalog for the data source associated with {id}. Immuta provides this as a mechanism for allowing the REST catalog to provide additional information about the data source that may not be directly ingested by or visible within Immuta. This link is accessed in the Immuta UI when a user clicks the catalog logo associated with the data source.
Immuta will send a GET request to the /dataSource/page/{id} endpoint, where {id} will be:
Example request:
The Custom REST Catalog can either provide such a page directly or redirect the user to any resource that provides the appropriate page. For example, it can redirect to a backing full-service catalog such as Collibra when the Custom REST catalog is only being used to support a custom data model.
Example response:
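The two options (serve a page directly or redirect) can be sketched as a small Python handler that returns a status code, headers, and body. The lookup tables, the example URL, the function name, and the choice of a 302 redirect are hypothetical; they only illustrate the decision, not a required implementation.

```python
from html import escape
from typing import Dict, Tuple

# Hypothetical: ids whose pages live in a backing catalog.
EXTERNAL_PAGES = {"ds-100": "https://catalog.example.com/assets/ds-100"}
# Hypothetical: ids whose pages this service renders itself.
LOCAL_DESCRIPTIONS = {"ds-200": "Orders placed through the web store."}


def data_source_page(asset_id: str) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for GET /dataSource/page/{id}."""
    if asset_id in EXTERNAL_PAGES:
        # Redirect the user to the backing catalog's own page.
        return 302, {"Location": EXTERNAL_PAGES[asset_id]}, ""
    if asset_id in LOCAL_DESCRIPTIONS:
        # Serve a simple human-readable page directly.
        body = (
            f"<html><body><h1>{escape(asset_id)}</h1>"
            f"<p>{escape(LOCAL_DESCRIPTIONS[asset_id])}</p></body></html>"
        )
        return 200, {"Content-Type": "text/html; charset=utf-8"}, body
    return 404, {"Content-Type": "text/plain; charset=utf-8"}, "Unknown catalog id"
```

The same pattern would apply to the column page endpoint, keyed by column id instead of data source id.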
/column/{id}
This endpoint returns the catalog's human-readable information page for the column associated with {id}. Immuta provides this as a mechanism for allowing the REST catalog to provide additional information about the specific column that may not be directly ingested by or visible within Immuta.
Immuta will send a GET request to the /column/{id} endpoint, where {id} will be:
Example request:
The Custom REST Catalog can either provide such a page directly or redirect the user to any resource that provides the appropriate page. For example, it can redirect to a backing full-service catalog such as Collibra when the Custom REST catalog is only being used to support a custom data model.
Example response:
This page describes the Redshift integration, configuration options, and features. For a tutorial to enable this integration, see the installation guide.
For automated installations, the credentials provided must be a Superuser or have the ability to create databases and users and modify grants.
Redshift Serverless.
Redshift Spectrum: For configuration and data source registration instructions, see the configuration page.
The Redshift integration supports the following authentication methods to configure the integration and create data sources:
Username and Password: Users can authenticate with their Redshift username and password.
AWS Access Key: Users can authenticate with an AWS access key.
Okta: Users can authenticate with their Okta credentials when installing the integration with the manual configuration.
Immuta cannot ingest tags from Redshift, but you can connect any of these supported external catalogs to work with your integration.
Required Redshift privileges
Setup User:
OWNERSHIP ON GROUP IMMUTA_IMPERSONATOR_ROLE
CREATE GROUP
Immuta System Account:
GRANT EXECUTE ON PROCEDURE grant_impersonation
GRANT EXECUTE ON PROCEDURE revoke_impersonation
Impersonation allows users to query data as another Immuta user in Redshift. To enable user impersonation, see the User Impersonation page.
Users can enable multiple Redshift integrations with a single Immuta tenant.
The host of the data source must match the host of the native connection for the native view to be created.
When using multiple Redshift integrations, a user has to have the same user account across all hosts.
Registering Redshift datashares as Immuta data sources is unsupported.
Case sensitivity of database, table, and column identifiers is not supported. The enable_case_sensitive_identifier parameter must be set to false (the default setting) for your Redshift cluster in order to configure the integration and register data sources.
For most policy types in Redshift, Immuta uses SQL clauses to implement enforcement logic. However, Immuta uses Python UDFs in the Redshift integration to implement the following masking policies:
Masking using a regular expression
Reversible masking
Format-preserving masking
Randomized response
The number of Python UDFs that can run concurrently per Redshift cluster is limited to one-fourth of the total concurrency level for the cluster. For example, if the Redshift cluster is configured with a concurrency of 15, a maximum of three Python UDFs can run concurrently. After the limit is reached, Python UDFs are queued for execution within workload management queues.
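As a worked example of this limit, the short Python sketch below computes the number of Python UDFs that can run concurrently from a cluster's concurrency level. The function name is hypothetical, and rounding down is an assumption inferred from the example above (a concurrency of 15 yields three, not 3.75).

```python
def max_concurrent_python_udfs(cluster_concurrency: int) -> int:
    """One-fourth of the cluster's total concurrency level, rounded down (assumed)."""
    if cluster_concurrency < 0:
        raise ValueError("concurrency must be non-negative")
    return cluster_concurrency // 4


if __name__ == "__main__":
    for level in (15, 20):
        print(level, max_concurrent_python_udfs(level))
```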
The SVL_QUERY_QUEUE_INFO view in Redshift, which is visible to a Redshift superuser, summarizes details for queries that spent time in a workload management (WLM) query queue. Queries must be completed in order to appear as results in the SVL_QUERY_QUEUE_INFO view.
If you find that queries on Immuta-built views are spending time in the workload management (WLM) query queue, you should either edit your Redshift cluster configuration to increase concurrency, or use fewer of the masking policies which leverage Python UDFs. For more information on increasing concurrency, see the Redshift docs on implementing workload management.
The custom REST catalog integration allows Immuta to make a defined set of API calls to a Custom REST service you develop to retrieve metadata. The Custom REST service receives Immuta's calls, and then collects the relevant information and delivers it back to Immuta.
The diagram below highlights the main feature of Immuta's Custom REST Catalog integration.
Through a Custom REST Catalog, you can build and maintain your own solutions that provide metadata required to effectively use Immuta within your organization.
API Interface Specification Documentation: This page details the endpoints and data schemas of the API and contains example requests and responses.
Starburst allows you to enable multiple system access control providers at the same time. To do so, add providers to this property as comma-separated values. Immuta has tested the Immuta system access control provider alongside the . This approach allows Immuta to work with existing Starburst installations that have already configured an access control provider. Immuta does not manage all permissions in Starburst and will default to allowing access to anything Immuta does not manage so that the Starburst integration complements existing controls. For example, if the Starburst integration is configured to allow users write access to tables that are not protected by Immuta, you can still lock down write access for specific non-Immuta tables using an additional access control provider.
This property defines a comma-separated list of allowed operations for users on Immuta data sources they are subscribed to: READ, WRITE, and OWN. (See the for details about the OWN operation.) When set to WRITE, all users granted access to a data source through a subscription policy are allowed read and write operations to data source schemas and tables. By default, this property is set to READ, which blocks write operations on data source tables and schemas. If are enabled for your Immuta tenant, this property is set to READ,WRITE by default, so users granted write access to a data source through a write access subscription policy are allowed read and write operations to data source schemas and tables.
This property defines a comma-separated list of allowed operations users will have on tables not registered as Immuta data sources: READ, WRITE, CREATE, and OWN. (See the for details about CREATE and OWN operations.) When set to READ, users are allowed read operations on tables not registered as Immuta data sources. When set to WRITE, users are allowed read and write operations on tables not registered as Immuta data sources. If this property is left empty, users will not get access to any tables outside Immuta. By default, this property is set to READ,WRITE. If are enabled for your Immuta tenant, this property is set to READ,WRITE,OWN,CREATE by default.
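To illustrate the semantics described above, the following Python sketch interprets a comma-separated operations value. It is not the Starburst or Immuta parser: the function name, the whitespace handling, and the rejection of unknown values are assumptions. It only encodes two statements from the text, namely that WRITE allows both read and write operations and that an empty value grants no access.

```python
# Operation names taken from the property descriptions above.
VALID_OPERATIONS = {"READ", "WRITE", "CREATE", "OWN"}


def effective_operations(value: str) -> set:
    """Return the set of operations implied by a comma-separated property value."""
    # Tolerating surrounding whitespace is an assumption of this sketch.
    operations = {item.strip() for item in value.split(",") if item.strip()}
    unknown = operations - VALID_OPERATIONS
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    # Per the text, WRITE allows read and write operations.
    if "WRITE" in operations:
        operations.add("READ")
    return operations
```

For example, `effective_operations("WRITE")` yields both READ and WRITE, while `effective_operations("")` yields an empty set, matching the "no access" behavior for an empty property.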
This should be set to the Immuta API key displayed when enabling the integration on the app settings page. To rotate this API key, use the to generate a new API key, and then replace the existing immuta.apikey value with the new one.
/dataSource request payload attributes:
Attribute | Data Type | Description |
---|---|---|
catalogMetadata | dictionary | Object holding the data source's catalog metadata. |
catalogMetadata.id | string or integer | The unique identifier of the data source in the catalog. |
catalogMetadata.name | string | The name of the data source in the catalog. |
handlerInfo | dictionary | Object holding the data source's connection details. |
handlerInfo.schema | string | The data source’s schema name in the source system. |
handlerInfo.table | string | The data source’s table name in the source system. |
handlerInfo.hostname | string | The data source’s connection hostname in the source storage system. |
handlerInfo.port | integer | The data source’s connection port in the source storage system. |
handlerInfo.query | string | The data source’s connection schema in the source storage system, if applicable. |
dataSource | dictionary | Object holding general data source information from Immuta. This can be viewed with debugging, but is not usually required for catalog purposes. |
/dataSource response attributes:
Attribute | Data Type | Description |
---|---|---|
catalogMetadata | dictionary | Object holding the data source's catalog metadata. |
catalogMetadata.id | string or integer | The unique identifier of the data source in the catalog. |
catalogMetadata.name | string | The name of the data source in the catalog. |
description | string | A description of the data source. |
tags | <tags object> | Object containing the data source-level tags. |
dictionary | dictionary | Object containing the column names of the data source as its keys. |
dictionary.<column> | dictionary | Object containing a single column's metadata. |
dictionary.<column>.catalogMetadata.id | string or integer | The unique identifier of the column in the catalog. |
dictionary.<column>.description | string | A description of the column. |
dictionary.<column>.tags | <tags object> | Object containing the column-level tags as keys. |
/dataSource/page/{id} URL parameter:
Attribute | Data Type | Description |
---|---|---|
id | URL Parameter, integer or string | The unique identifier of the data source in the remote catalog system. |
/column/{id} URL parameter:
Attribute | Data Type | Description |
---|---|---|
id | URL Parameter, integer or string | The unique identifier of the column in the remote catalog system. |
Status | Description | Notes |
---|---|---|
createError | Error occurred during creation of the integration. | |
creating | Integration is in the process of being created and set up. | |
deleted | Integration is deleted. | Not in use |
deleteError | Error occurred while deleting the integration. The integration has been rolled back to the previous state. | |
deleting | Integration is in the process of being disabled or deleted. | |
disabled | Integration was force disabled and no cleanup was performed on the native platform. | Not in use |
editError | Error occurred while editing the integration. The integration has been rolled back to the previous state. | |
editing | The integration is in the process of being edited. | |
enabled | The integration is enabled and active. | |
migrateError | Error occurred while performing a migration of the integration. The integration has been rolled back to the previous state. | |
migrating | Migration is being performed on the integration. An example of a migration is a stored procedure update. | |
recurringValidationError | Validation has failed during the periodic check and the integration may be misconfigured. | |
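A client that reads an integration's status might group these values as in the Python sketch below. The group names (`in_progress`, `error`, `active`, `inactive`) and the helper functions are hypothetical and not part of any Immuta API; only the status strings and the rollback notes come from the list above.

```python
# Statuses that describe an operation still running.
IN_PROGRESS = {"creating", "deleting", "editing", "migrating"}
# Statuses that report a failure.
ERRORS = {"createError", "deleteError", "editError", "migrateError", "recurringValidationError"}
# Error statuses described above as rolled back to the previous state.
ROLLED_BACK = {"deleteError", "editError", "migrateError"}


def classify_status(status: str) -> str:
    """Return a coarse, hypothetical category for an integration status."""
    if status in IN_PROGRESS:
        return "in_progress"
    if status in ERRORS:
        return "error"
    if status == "enabled":
        return "active"
    if status in {"deleted", "disabled"}:
        return "inactive"
    raise ValueError(f"unknown integration status: {status}")


def should_keep_polling(status: str) -> bool:
    """Poll again only while an operation is still running."""
    return classify_status(status) == "in_progress"
```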