Policy decision data is transmitted to ensure end users querying data are limited to the appropriate access as defined by the policies in Immuta.
Spark Plugin
In the Databricks integration, the user, data source information, and query are sent to Immuta through the Spark plugin to determine what policies need to be applied while the query is being processed. Data that travels from Immuta to the Databricks cluster could include
user attributes.
what columns to mask.
the entire predicate itself (for row-level policies).
A user runs a query against data in their environment.
The query is sent to the Immuta Web Service.
The Web Service queries the Metadata Database to obtain the policy definition, which includes data source metadata (tags, column names, etc.) and user entitlements (groups and attributes).
The policy information is transmitted to the remote data system for native policy enforcement.
Query results are displayed based on what policy definition was applied.
Sample data is processed and aggregated or reduced during Immuta's fingerprinting process and specific policy processes. Note: Data Owners can see sample data when editing a data source. However, this action requires the database password, and the small sample of data visible is only displayed in the UI and is not stored in Immuta.
When enabled, statistical queries made during data source health checks are distilled into summary statistics, called fingerprints. The sample data processed for fingerprinting allows Immuta to track data source changes.
During this process, statistical query results and data samples (which may contain PII) are temporarily held in memory by the Fingerprint Service.
The fingerprinting process checks for new tables through schema monitoring (when enabled) and captures summary statistics of changes to data sources, including when policies were applied, external views were created, or sensitive data elements were added.
Immuta does not sample data for row redaction policies; Immuta only pulls samples of data to determine if a column is a candidate for randomized response and aggregates of user-defined cohorts for k-anonymization. Both datasets only exist in memory during the computation.
Sample data is processed when k-anonymization or randomized response policies are applied to data sources.
Sample data exists temporarily in memory in the Fingerprint Service during the computation.
k-Anonymization Policies: At the time of its application, the columns of a k-anonymization policy are queried under a separate fingerprinting process that generates rules enforcing k-anonymity. The results of this query, which may contain PII, are temporarily held in memory by the Fingerprint Service. The final rules are stored in the Metadata Database as the policy definition for enforcement.
Randomized Response Policies: If the list of substitution values for a categorical column is not part of the policy specification (e.g., when specified via the API), a list is obtained via query and merged into the policy definition in the Metadata Database.
Raw data is processed for masking, producing either a distinct set of values or aggregated groups of values.
Immuta captures metadata and stores it in an internal PostgreSQL database. Customers can encrypt the volumes backing the database using an external Key Management Service to ensure that data is encrypted at rest.
To encrypt data in transit, Immuta uses TLS protocol, which is configured by the customer.
Immuta encrypts values with data encryption keys, either system-generated or managed using an external key management service (KMS). Immuta recommends using a KMS to encrypt or decrypt data keys and supports the AWS Key Management Service. If no KMS is configured, Immuta generates a data encryption key on a user-defined rollover schedule, using the most recent data key to encrypt new values while preserving old data keys to decrypt old values.
Immuta employs three families of functions in its masking policies:
One-way Hashing: One-way (irreversible) hashing is performed via a salted SHA256 hash. A consistent salt is used for values throughout the data source, so users can count or track the specific values without revealing the true value. Since hashed values are different across data sources, users are unable to join on hashed values. Note: joining on masked values can be enabled in Immuta Projects.
Reversible Masking: For reversible masking, values are encrypted using AES-256 CBC encryption. Encryption is performed using a cell-specific initialization vector. The resulting values can be unmasked by an authorized user. Note that this is dynamic encryption of individual fields as results are streamed to the querying system; Immuta is not modifying records in the data store.
Reversible Format Preserving Masking: Format preserving masking maintains the format of the data while masking the value, and is achieved by initializing and applying the NIST standard method FF1 at the column level. The resulting values can be unmasked by an authorized user.
Immuta communicates with remote databases over a TCP connection.
If you want to disable the metadata collection that requires sampling data, you must complete the steps below.
These steps will ensure that Immuta queries no data, under any circumstances. Without this sample data, some Immuta features will be unavailable. Sensitive Data Discovery (SDD) cannot be used to automatically detect sensitive data in your data sources, and the following masking policies will not work:
Masking with format preserving masking
Masking with k-anonymization
Masking using randomized response
To stop Immuta from running fingerprints on all data sources,
Navigate to the App Settings page, and scroll to the Advanced Configuration section.
Enter the following YAML:
Click Save.
To stop Immuta from running data source health checks on all data sources,
Navigate to the App Settings page, and scroll to the Advanced Configuration section.
Enter the following YAML:
Click Save.
Tag each data source with the seeded Skip Stats Job tag to stop Immuta from collecting a sample and running table stats on the sample. You can tag data sources as you create them in the UI or via the Immuta API.
Immuta does not require users to learn a new API or language to access the data it exposes. Instead, Immuta integrates with existing tools and ongoing work while remaining invisible to downstream consumers. This page outlines those integrations.
The Snowflake integration differs based on your Snowflake Edition:
Snowflake Integration Using Snowflake Governance Features: With this integration, policies administered in Immuta are pushed down into Snowflake as Snowflake row access policies and masking policies. This integration requires Snowflake Enterprise Edition or higher.
Snowflake Integration Without Snowflake Governance Features: With this integration, policies administered by Immuta are pushed down into Snowflake as views with a 1-to-1 relationship to the original table and all policy logic is contained in that view.
Click a link below for details about each question:
This integration allows you to manage multiple Databricks workspaces through Unity Catalog while protecting your data with Immuta policies. Instead of manually creating UDFs or granting access to each table in Databricks, you can author your policies in Immuta and have Immuta manage and enforce Unity Catalog access-control policies on your data in Databricks clusters or SQL warehouses.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. This integration allows you to add your tables to the Unity Catalog metastore so that you can use the metastore from any workspace while protecting your data with Immuta policies.
This integration enforces policies on Databricks tables registered as data sources in Immuta, allowing users to query policy-enforced data on Databricks clusters (including job clusters). Immuta policies are applied to the plan that Spark builds for users' queries, all executed directly against Databricks tables.
Deprecation notice
The Starburst (Trino) integration enables Immuta to apply policies directly in Starburst and Trino clusters without going through a proxy. This means users can use their existing Starburst and Trino tooling (querying, reporting, etc.) and have per-user policies dynamically applied at query time.
With the Redshift integration, Immuta applies policies directly in Redshift. This allows data analysts to query their data directly in Redshift instead of going through a proxy.
The Azure Synapse Analytics integration allows Immuta to apply policies directly in Azure Synapse Analytics dedicated SQL pools without needing users to go through a proxy. Instead, users can work within their existing Synapse Studio and have per-user policies dynamically applied at query time.
Private preview
This integration is available to select accounts. Reach out to your Immuta representative for details.
The Amazon S3 integration allows users to apply subscription policies to data in S3 to restrict what prefixes, buckets, or objects users can access. To enforce access controls on this data, Immuta creates S3 grants that are administered by S3 Access Grants, an AWS feature that defines access permissions to data in S3.
Private preview
This integration is available to select accounts. Reach out to your Immuta representative for details.
In this integration, Immuta generates policy-enforced views in your configured Google BigQuery dataset for tables registered as Immuta data sources.
If users have another catalog, or have customized their Collibra or Alation integrations, they can connect through the REST Catalog using the Immuta API.
Users can also connect a Snowflake account to allow Immuta to ingest Snowflake tags onto Snowflake data sources.
External identity managers configured in Immuta allow users to authenticate using an existing identity management system and can optionally be used to synchronize user groups and attributes into Immuta.
The table below outlines the features supported by each of Immuta's integrations.
The table below outlines the audit support by each of Immuta's integrations and what information is included in the audit logs.
Legend:
Limited support: There is limited support for audit for this integration.
Certain policies are unsupported or supported with caveats, depending on the integration:
*Supported with Caveats:
On Databricks data sources, joins will not be allowed on data protected with replace with NULL/constant policies.
On Trino data sources, the Immuta functions @iam and @interpolatedComparison for WHERE clause policies can block the creation of views.
The Immuta UI allows users to share, access, and analyze data from one secure location efficiently and easily. This section of documentation introduces all Immuta users to pages and basic features found in the Immuta console.
Data:
Data Sources: Create, manage, and subscribe to data sources.
Projects: Combine data sources, work under specified purposes, and collaborate with other users.
People: Manage user roles, groups, and attributes.
Policies: Manage global policies and view all policies and the data sources they apply to.
Governance: Configure purposes, run governance reports, and view notifications.
Audit: Analyze how data is being used across your organization.
Query Editor: Write, modify, and execute queries against data sources you're subscribed to in the Immuta UI.
App Settings: Configure Immuta to meet your organization's needs.
Notifications: View access requests and receive activity updates.
Profile: Manage username and password, access SQL credentials, and generate API keys.
The data sources page allows Immuta users to view, subscribe to, and create data sources in Immuta. On the main data source page is a list of data sources. Users can navigate between the All Data Sources tab and the My Data Sources tab to filter this list. Additionally, the Search bar can be used to filter search results by data source name, tag, project, connection strings, or columns.
To navigate to a specific data source, click on it from this list, and you will be taken to the data source overview page.
In addition to the data source's health, this page provides detailed information about the data source, organized by tabs across the top of the page. The visibility and appearance of the tabs will vary slightly depending on the type of user accessing the data source.
This section includes detailed information regarding Data Source Health and Data Source Health Checks. The health status of a data source is visible in the top right corner of the data source details page.
If you click the health status text, a dropdown menu displays the status of specific data source checks.
Health Check: When an Immuta data source is created, a background job is submitted to compute the row count and high cardinality column for the data source. This job uses the connection information provided at data source creation time. A data source initially has a health status of “healthy” because the initial health check performed is a simple SQL query against the source to make sure the source can be queried at all. After the background job for the row count/high cardinality column computation is complete, the health status is updated. If one or both of those jobs failed, the health status will change to “Unhealthy.”
Fingerprint: Captures summary statistics of a data source when a data source is created, when a policy is applied or changed, or when a user manually updates the data source fingerprint.
View: Depending on the integration, this records if a view has been created to represent the data source in an integration, when it was created, and gives a button to re-create the view if policies have been changed.
Row Count: Calculates the number of rows in the data source.
High Cardinality: Calculates the high cardinality columns, which contain unique values such as identification numbers, email addresses, or usernames. A high cardinality column is required to generate consistent random subsets of data for use in certain minimization techniques.
Global Policies Applied: Verifies that relevant Global Policies are successfully applied.
Schema Detection: Detects when a new table has been added in a remote database and automatically creates a new data source. Correspondingly, if a remote table is removed, that data source will be disabled in the console. Schema detection is set to run every night.
Column Detection: Detects when a column has been added or removed in a remote database and automatically updates the data source in Immuta. This detection is set to run every night, but users can manually trigger the job here.
This tab includes detailed information about the data source, including its Description, Technology, Table Name, Remote Database, Remote Table, the Parent Server, and the Data Source ID.
This tab contains information about the users associated with the data source, their username, when their access expires, what their role is, how they are subscribed to the data source, and an Actions button that details the users' subscription history, including the reason users need access to the data and how they plan to use it.
Members can be filtered by Role or Subscription using the Filters button.
This tab lists the policies associated with the data source and includes three components:
Subscribers: Lists who may access the data source. If a Subscription Policy has already been set by a Global Policy, a notification and a Disable button appear at the bottom of this section. Data Owners can click the Disable button to make changes to the Subscription Policy.
Activity Panel: Records all changes made to policies by Data Owners or Governors, including when the data source was created, the name and type of the policy, when the policy was applied or changed, and if the policy is in conflict on the data source. Global policy changes are identified by the Governance icon; all other updates are labeled by the Data Sources icon.
The Data Dictionary is a table that details information about each column in a data source. The information within the Data Dictionary is generated automatically when the data source is created if the remote data platform supports SQL. Otherwise, Data Owners or Experts can manually create Data Dictionaries. The Data Dictionary tab includes three sections:
Name: The name of the column in the table.
Type: The type of value, which may be text, integer, decimal, or timestamp.
Actions: Users may use the buttons in this column to edit, comment, or tag items in the Data Dictionary.
Deprecation notice
Support for this feature has been deprecated.
Users are able to comment on or ask questions about the Data Dictionary columns and definitions, public queries, and the data source in general. Resolved comments and questions are available for review to keep a complete history of all the knowledge sharing that has occurred on a data source.
Contact information for Data Owners is provided for each data source, which allows users to ask questions about accessibility and attributes required for viewing the data.
This tab lists all projects, derived data sources, or parent data sources associated with the data source and includes the reason the data source was added to a project, who added the data source to the project or created it, and when the data source was added to the project or created.
When users submit an Unmask request in the UI, a Tasks tab appears beside the Relationships tab for the requesting user and the user receiving the request. This tab contains information about the request and allows users to view and manage the tasks listed.
Your guide to discovering, securing, and monitoring your data with Immuta.
For Immuta to enforce policies, it needs to catalog the resources policies are being applied to by performing metadata ingestion. Metadata ingestion is the process that occurs when you register a data source, during which Immuta gathers details about your tables. However, Immuta does not need access to the data within the tables in order to protect it, with the exception of a few specific and advanced masking policies detailed below.
Immuta collects and stores the following kinds of information in Immuta's Metadata Database for policy enforcement. Further, policy information may be transmitted to data source host systems for enforcement purposes as part of a query or to enable the host system to perform native enforcement.
Identity Management Information: Usernames, group information, and other kinds of personal identifiers may be stored and referenced for the purposes of performing authentication and access control and may be retained in audit logs. When such information is relevant for access determination under policy, it may be retained as part of the policy definition.
Schema Information: Data source metadata such as schema, column data types, and information about the host.
Immuta's Metadata Database can also contain the following forms of metadata for policy enforcement. These forms contain sample data from your tables and can be disabled if you do not want Immuta to have access to the data being protected.
Fingerprints: When enabled, additional statistical queries made during the health check are distilled into summary statistics, called fingerprints. During this process, statistical query results and data samples (which may contain PII) are temporarily held in memory by the Fingerprint Service.
k-Anonymization Policies: When a k-anonymization policy is applied, the columns under the k-anonymization policy are queried within a separate fingerprinting process which generates rules enforcing k-anonymity. The results of this query, which may contain PII, are temporarily held in memory by the Fingerprint Service. The final rules are stored for enforcement.
Randomized Response Policies: If the list of substitution values for a categorical column is not part of the policy specification (e.g., when specified via the API), a list is obtained via query and merged into the policy definition.
If no metadata collection types have been disabled, data is processed in the following workflow to support data source creation, health checks, policy enforcement, and dictionary features.
A System Administrator configures the integration in Immuta.
A Data Owner registers data sources from their remote data platforms with Immuta. Note: Data Owners can see sample data when editing a data source. However, this action requires the database password, and the small sample of data visible is only displayed in the UI and is not stored in Immuta.
When a data source is created or updated, the Metadata Database pulls in and stores statistics about the data source, including row count and high cardinality calculations.
The data source health check runs daily to ensure existing tables are still valid.
If an external catalog is enabled, the daily health check will pull in data source attributes (e.g., tags and definitions) and store them in the Metadata Database.
Immuta requires certain privileges to perform metadata ingestion. The user connecting a table to Immuta as a data source must have privileges specific to their data platform to perform metadata ingestion.
For example, a user registering a Snowflake table as an Immuta data source must have the REFERENCES privilege to view the structure of the table and allow Immuta access to that information as well. This does not require the user (or Immuta) to have access to view the data itself.
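In Snowflake, granting that privilege might look like the following sketch; the table and role names are placeholders, not values from this page.

```sql
-- Illustrative only: object and role names are placeholders.
-- REFERENCES exposes the table's structure without granting access to the rows themselves.
GRANT REFERENCES ON TABLE analytics.public.customers TO ROLE immuta_registration_role;
```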
Support for this integration has been deprecated. Use the Starburst (Trino) integration v2.0 instead.
The Starburst (Trino) integration v2.0 allows you to access policy-protected data directly in your Starburst (Trino) catalogs without rewriting queries or changing your workflows. Instead of generating policy-enforced views and adding them to an Immuta catalog that users have to query (as in the older integration), Immuta policies are translated into Starburst (Trino) rules and permissions and applied directly to tables within users' existing catalogs.
Users who want to use tagging capabilities outside of Immuta and pull tags from external table schemas can connect Collibra or Alation as an external catalog. Once they have been connected, Immuta will ingest a data dictionary from the catalog that will apply data source and column tags directly onto data sources. These tags can then be used to write and drive policies.
This is available and the information is included in audit logs.
This is not available and the information is not included in audit logs.
For details about each of these policies, see the .
From here, data owners can also edit or create a data source description.
This tab is visible to everyone, but Data Owners and Governors can .
Data Policies: Lists policies that enforce privacy controls on the data source. Data Owners can use this section to .
This tab is visible to everyone, but Data Owners and Governors can from this page.
Governors manage purposes for data use across Immuta. After creating a purpose, governors can customize acknowledgement statements that users must agree to before accessing a project or data source. Project owners also have the ability to create purposes that will populate on the purposes tab of the governance page.
Governors can build reports to analyze how data is being used and accessed across Immuta using this report builder. Reports can be based on users, groups, projects, data sources, tags, purposes, policies, and connections within Immuta.
For detailed information on how to run reports, see Immuta reports.
This tab contains a list of all activity associated with the governor, data sources, and global and local policies.
This tab contains a list of all tags within the Immuta environment. This includes built-in Immuta tags, tags created by governors, and tags imported from an external catalog. These tags can then be applied to projects, data sources, and the data dictionary by governors, data owners, or data source experts.
Governors can click on the tags listed here to open up a tag details page. This details page has an overview tab with information about the tag's description, origin, and creation. It also includes a data sources tab that lists the data sources the tag has been applied to and information about its application. The tag details page also includes a columns tab with the columns the tag has been applied to and information about its application, like the other tags applied to that column.
For more information on tags, see the Tags in Immuta page.
The Immuta people page is visible only to user administrators; the following actions can be completed on the Immuta people page:
Create, manage, and delete users.
Add or delete permissions from users and groups.
Add or delete attributes from users and groups.
Create, manage, and delete groups.
On this tab, administrators can add users, filter the list of users, or navigate to users' profiles by clicking on their name.
After clicking on an individual user from this list, the user's email, position, and last login and update appear. From here, admins can manage the user's permissions, attributes, and groups.
Similar to the Users tab, the Groups tab includes a list of groups. After clicking on a specific group, administrators can view the group details, add and remove group members, and manage attributes for the group.
Deprecation notice
Support for the audit page has been deprecated. Instead, pull audit logs from Kubernetes and push them to your SIEM.
All activity in Immuta is audited. This process provides rich audit logs that detail who subscribes to each data source, why they subscribe, when they access data, what SQL queries and blob fetches they run, and which files they access. Audit logs can be used for a number of purposes, including insider threat surveillance and data access monitoring for billing. Audit logs can also be shipped to your enterprise auditing capability.
For more details about using audit logs, see the Audit Logs User Guide.
Immuta's logging system is designed to easily connect with enterprise log collection and aggregation systems. Please see the Immuta System Audit Logs page for full details.
Immuta provides access to all of the audit logs via the Audit page.
Only users with the AUDIT permission can access this page. See the Administration section for more information.
Users can sort these logs by ascending (oldest entries first) or descending (latest entries first) order. By default, 50 log entries are displayed to a page, but that can be changed to 100 or 200. Additionally, users can filter the entries in a variety of ways, including by project purpose, blobId, remote query id, the entry timestamp, data source, project, record type, user, and SQL query. These query audit records detail the query run, the columns that were masked, and how the masking was enforced.
Snowflake
Databricks Unity Catalog
Databricks Spark
Databricks SQL
Starburst (Trino)
Redshift
Azure Synapse Analytics
Native query audit type
Legacy audit and UAM
Legacy audit and UAM
Legacy audit and UAM
Legacy audit
Table and user coverage
Registered data sources and users
Registered data sources and users
All tables and users
Registered data sources and users
Object queried
Limited support
Columns returned
Limited support
Query text
Limited support
Unauthorized information
Limited support
Limited support
Policy details
Limited support
User's entitlements
Limited support
The user profile page contains personal information about your user account, including contact information, API keys, and pending requests. To navigate to the user profile page or quick actions, click the profile icon in the header of the Immuta UI and select Profile.
The following information about the user is displayed on their profile page. With the exception of the Databricks, Redshift, Snowflake, or Synapse username, this information may be edited by the user at any time.
Name: The user's full name.
Email: The user's email address.
Position: The user's current position.
Last Updated: The time of the user's last profile update.
About: A short description about the user.
Location: The user's work location.
Organization: The organization that a user is associated with.
Phone Number: The user's phone number.
Databricks Username: The user's Databricks username. Only an admin may set this field.
Redshift Username: The user's Redshift username. Only an admin may set this field.
Snowflake Username: The user's Snowflake username. Only an admin may set this field.
Synapse Username: The user's Synapse username. Only an admin may set this field.
Receive System Notifications as Emails: The user can opt to receive email notifications.
API keys provide a secure way to communicate with the Immuta REST API without requiring a username and password. Each key can be revoked at any time and new ones generated. Once a key is revoked, it can no longer be used to access the REST API, and any tool that was using the revoked API key will need to be re-authenticated with a new key.
Once in the API keys tab, a user can generate API keys or revoke API keys.
An API key can be linked to a project. By linking an API key to a project, you will be limiting that API key's visibility to only data sources associated with that project.
The requests tab allows users to view and manage all pending access requests directly from their profile page.
Audience: All Immuta users
Content Summary: Notifications in the Immuta UI fall into two categories: Access Requests and Activity. This page illustrates these basic Notification features in the Immuta UI.
Request notifications alert Data Owners that users wish to subscribe to their data sources.
Users can view their request notifications by clicking on the cell phone icon in the top right corner of the Immuta Console.
After clicking on the icon, Data Owners can grant or reject requests directly in the notifications drop-down.
Users will see their pending access requests in the same dropdown.
Activity notifications are used to alert users to actions that other users have performed within Immuta. The activity notifications that each user receives depend on their permissions and responsibilities.
Data Users: Data Users receive activity notifications when Data Owners accept or deny their pending access requests.
Data Owners: Data Owners receive notifications about activity in their data sources and projects and when users query their data sources that have policies enforced. These notifications are shown when the user selects the bell icon in the upper right-hand corner.
Governors: Governors receive notifications for all data source activity, including policy updates within Immuta. These notifications are shown when the user selects the bell icon in the upper right-hand corner.
Administrators: Administrators receive notifications for user, group, and attribute activity, such as when a new user is created or when an attribute is added to a group. These notifications are shown when the user selects the bell icon in the upper right-hand corner.
For an extensive list of notifications, see the .
Users can subscribe to email notifications by completing the following steps:
Navigate to the User Profile page, and select Edit from the dropdown menu in the top right corner of the user profile information panel.
Select the Receive System Notifications as Emails checkbox at the bottom of the window that appears.
Click Save.
Once this setting is enabled, Immuta will compile notifications and distribute these compilations via email at 8-hour intervals.
Deprecation notice
Support for this feature has been deprecated.
The Query Editor allows users to write, modify, and execute queries against data sources they are subscribed to.
Click the Query Editor icon in the left sidebar.
Select a data source in the Tables list.
Click the dropdown menu icon next to the data source and select Preview Sample Data, or click Preview Sample Data in the Table Schema panel.
View data in the Results panel.
Filter results by clicking the overflow menu next to the column name.
Rearrange and resize columns by clicking and dragging.
Run and export full results or export current results to .csv by clicking one of the corresponding download buttons in the top right corner of the table.
Click the Query Editor icon in the left sidebar.
Write your query in the Query Editor panel.
Execute your query by clicking the Run Query button. Note: Clicking this button will only run the currently highlighted query. Queries (or portions of queries) can be executed by manually highlighting the query (or portion of the query) and clicking Run Query.
View data in the Results panel.
Filter results by clicking the overflow menu next to the column name.
Rearrange and resize columns by clicking and dragging.
Export results to .csv by clicking the download button in the top right corner of the table.
Audience: Data Owners, Data Users, and Data Governors
Content Summary: Projects allow users to collaborate in their data analysis by combining data sources and providing special access to data for project members. Projects are created, managed, and joined from the Projects page.
This page highlights the major features of the Projects page. For conceptual details or specific tutorials, click the links below or navigate to .
The All Projects tab lists all public projects available to be joined by others, and the My Projects tab lists all projects users own or belong to. Additionally, users with the CREATE_PROJECT permission can create a new project from this page.
To view details about a specific project, users click the project name.
After navigating to a specific project from the Projects page, the following information about the project is visible to users on the Overview tab:
Project Details: Information about the project appears in the sidebar on the left of the Overview tab. Details include when the project was created, the purposes associated with the project, a description of the project, the project ID, and credentials.
Documentation: If Project Owners choose, they may add documentation about their project, which will appear in this section to viewers. If no additional documentation about the project is added, only the project name will appear here.
Data Sources: The data sources associated with the project are listed here. Users can click on individual data sources to view the reason why it was added to the project and they can navigate to the data source itself. Project Owners can also manage their project data sources in this section.
Tags: Tags associated with the data source are listed here. Project Owners can manage tags from this section.
Activity Panel: All activity associated with the project is listed in the sidebar on the right of the screen. Information recorded here includes who added data sources and tags to the project, members who have been added to and removed from the project, and policy updates to the project.
This page includes a list of project members, their contact information and role, how they are subscribed, and when their membership expires. From this page, Project Owners can add and remove members from the project.
Members can be filtered by Role or Subscription using the Filters button.
This tab allows Project Owners to choose who may request access to their project or whether or not their project is visible at all to users who are not project members.
The Project Equalization section enables Project Owners to level all members' access to data so that data appears the same to all project members, regardless of their individual attributes or groups.
The Subscribers section allows Project Owners to make their project open to anyone, to users who request and are granted access, to users with specified groups and attributes, or only to users the Project Owners manually add.
Deprecation notice
Support for this feature has been deprecated.
Project members can view, create, reply to, delete, and resolve discussion threads in this tab.
A list of data sources within the project appears in this tab. Project members can view, comment on, and add data sources to the project here as well. Any project member can add data sources to the project, unless the Allow Masked Joins or Project Equalization features are enabled; in those instances, only Project Owners can add data sources to the project.
Audience: Application Administrators
Content Summary: The App Settings page is visible only to Application Administrators and allows them to configure Immuta settings, manage licenses, and generate status bundles.
This tab is where the Administrator can add IAMs, external catalogs, and data providers. They can also adjust various Immuta settings to configure it better to their organization's needs.
For a tutorial on changing settings on this tab see .
This tab includes a list of licenses and details the universally unique identifier (UUID), the features associated with specific licenses, the expiration dates, the total number of seats, and the date the keys were added. Administrators can also add new license keys from this tab.
This tab allows Administrators to export a zip file called the Immuta status bundle. This bundle includes information helpful for assessing and solving issues within an Immuta instance by providing a snapshot of Immuta, associated services, and information about the remote source backing any of the selected data sources. When generating the status bundle, the Administrator may select the particular information that will help solve the issue at hand.
Audience: System Administrators
Content Summary: This page outlines how to configure an external metadata database for Immuta instead of using Immuta's built-in PostgreSQL Metadata Database that runs in Kubernetes.
Helm Chart Version
Update to the latest Helm Chart before proceeding any further.
The Metadata Database can optionally be configured to run outside of Kubernetes, which eliminates the variability introduced by the Kubernetes scheduler and/or scaler without compromising high-availability. This is the preferred configuration, as it offers infrastructure administrators a greater level of control in the event of disaster recovery.
PostgreSQL Version incompatibilities
PostgreSQL 12 through 16 are only supported when Query Engine rehydration is enabled; otherwise, the PostgreSQL version must be pinned at 12. PostgreSQL abstraction layers such as AWS Aurora are not supported.
Enable an external metadata database by setting database.enabled=false in the immuta-values.yaml file and passing the connection information for the PostgreSQL instance under the key externalDatabase.
Set queryEngine.rehydration.enabled=true. If set to false, then externalDatabase.superuser.username and externalDatabase.superuser.password must be provided.
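A minimal sketch of how these values might fit together in immuta-values.yaml is shown below. Only database.enabled, queryEngine.rehydration.enabled, externalDatabase.password, and externalDatabase.superuser.* come from this page; the remaining connection fields are assumptions and should be confirmed against the Helm Chart reference for your chart version.

```yaml
# Sketch only -- connection field names (host, port, etc.) are assumptions.
database:
  enabled: false              # disable the built-in metadata database
queryEngine:
  rehydration:
    enabled: true             # when false, the superuser credentials below are required
externalDatabase:
  host: postgres.example.internal
  port: 5432
  password: <METADATA_DATABASE_PASSWORD>
  superuser:                  # optional when Query Engine rehydration is enabled
    username: postgres
    password: <SUPERUSER_PASSWORD>
```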
Superuser Role
Prior to Helm Chart 4.13, declaring externalDatabase.superuser.username and externalDatabase.superuser.password was required. This requirement has since been made optional when Query Engine rehydration is enabled. If a superuser is omitted, then the chart will no longer manage the database backup/restore process. In this configuration, customers are responsible for backing up their external metadata database.
Additionally, it is possible to use existingSecret instead of setting externalDatabase.password in the Helm values. These passwords map to the same keys that are used for the built-in database.
Role Creation
The role's password set below should match the Helm value externalDatabase.password.
Azure Database for PostgreSQL
During restore, the built-in database's backup expects the role postgres to exist. This role is not present by default and must be created when using Azure Database for PostgreSQL.
Log in to the external metadata database as a user with the superuser role attribute (such as the postgres user) using your preferred tool (e.g., psql, pgAdmin).
Connect to the database postgres, and execute the following SQL.
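The exact statements are not reproduced here; the following is a sketch that assumes the application role is named immuta (a placeholder) and that its password matches the Helm value externalDatabase.password, as noted above.

```sql
-- Sketch only: the role name "immuta" is a placeholder.
-- The password must match the Helm value externalDatabase.password.
CREATE ROLE immuta WITH LOGIN ENCRYPTED PASSWORD '<externalDatabase.password>';
CREATE DATABASE bometadata OWNER immuta;

-- Azure Database for PostgreSQL only: the restore process expects a role named
-- postgres, which is not present by default (see the note above).
-- CREATE ROLE postgres;
```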
Connect to the database bometadata that was created in the previous step, and execute the following SQL. Azure Database for PostgreSQL: Extensions must be configured in the web portal.
Helm Releases
For existing deployments, you can migrate from the built-in database to an external database. To migrate, backups must be configured. Reach out to your Immuta representative for instructions.
(Optional) Set default namespace:
Trigger manual backup:
Validate backup succeeded:
Edit immuta-values.yaml to enable the external metadata database and restore.
Apply the immuta-values.yaml changes made in the previous step:
Wait until the Kubernetes resources become ready.
Edit immuta-values.yaml to enable Query Engine rehydration and disable backup/restore.
Rerun the previous helm upgrade command to apply the latest immuta-values.yaml changes.
Connect to the database postgres, and execute the following SQL. Azure Database for PostgreSQL: Delete the previously created role by running DROP ROLE postgres;
Kubernetes: 1.16 or greater
Helm: 3.2 or greater
Rocky Linux 9
Review the potential impacts of Immuta's Rocky Linux 9 upgrade to your environment before proceeding:
ODBC Drivers
Your ODBC drivers should use a driver compatible with Enterprise Linux 9 or Red Hat Enterprise Linux 9.
Container Runtimes
You must run a supported container runtime.
Use at least Docker v20.10.10 if using Docker as the container runtime.
Use at least containerd 1.4.10 if using containerd as the container runtime.
OpenSSL 3.0
CentOS Stream 9 uses OpenSSL 3.0, which has deprecated support for older insecure hashes and TLS versions, such as TLS 1.0 and TLS 1.1. This shouldn't impact you unless you are using an old, insecure certificate. In that case, the certificate will no longer work. See the for more information.
FIPS Environments
If you run Immuta 2022.5.x containers in a FIPS-enabled environment, they will now fail. Helm Chart 4.11 contains a feature for you to override the openssl.cnf file, which can be used to allow Immuta to run in your environment, mimicking the CentOS 7 behavior.
Using a Kubernetes namespace
If deploying Immuta into a Kubernetes namespace other than the default, you must include the --namespace option in all helm and kubectl commands provided throughout this section.
Immuta's Helm Chart requires Helm version 3+.
Run helm version to verify the version of Helm you are using:
Helm 3 Example Output
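The exact output depends on your build of Helm; on Helm 3 it has roughly the following shape (version numbers are illustrative):

```
$ helm version
version.BuildInfo{Version:"v3.11.2", GitCommit:"...", GitTreeState:"clean", GoVersion:"go1.20.3"}
```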
In order to deploy Immuta to your Kubernetes cluster, you must be able to access the Immuta Helm Chart Repository and the Immuta Docker Registry. You can obtain credentials from your Immuta support professional.
--pass-credentials Flag
If you encounter an unauthorized error when adding Immuta's Helm Chart repository, you can run helm repo add with the --pass-credentials flag.
Usernames and passwords are only passed to the URL location of the Helm repository by default, and are scoped to the scheme, host, and port of that repository. To pass the username and password to other domains Helm may encounter when it goes to retrieve a chart, use the --pass-credentials flag. This flag restores the old behavior for a single repository as an opt-in behavior.
If you use a username and password for a Helm repository, you can audit the Helm repository to check for another domain that could have received the credentials. In the index.yaml file for that repository, look for another domain in the URLs list for the chart versions. If another domain is found and that chart version is pulled or installed, the credentials will be passed on.
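A hedged example of adding the repository with the flag is shown below; the repository URL is a placeholder, and the credentials come from your Immuta support professional.

```
helm repo add immuta <IMMUTA HELM REPOSITORY URL> \
  --username <USERNAME> \
  --password <PASSWORD> \
  --pass-credentials
```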
Run helm repo list to ensure Immuta's Helm Chart repository has been successfully added:
Example Output
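The output has roughly the following shape (the URL shown is a placeholder):

```
$ helm repo list
NAME    URL
immuta  <IMMUTA HELM REPOSITORY URL>
```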
Don't forget the image pull secret!
You must create a Kubernetes image pull secret in the namespace that you are deploying Immuta in, or the Pods will fail to start due to ErrImagePull.
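The secret can be created with kubectl; the secret name and registry address below are placeholders and must match what your immuta-values.yaml references.

```
kubectl create secret docker-registry immuta-registry \
  --docker-server=<IMMUTA DOCKER REGISTRY> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD>
```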
Run kubectl get secrets to confirm your Kubernetes image pull secret is in place:
Example Output
Run helm search repo immuta to check the version of your local copy of Immuta's Helm Chart:
Example Output
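The output has roughly the following shape (chart and app versions are placeholders):

```
$ helm search repo immuta
NAME            CHART VERSION     APP VERSION     DESCRIPTION
immuta/immuta   <CHART VERSION>   <APP VERSION>   ...
```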
Update your local copy of the Chart by running helm repo update.
To perform an upgrade without upgrading to the latest version of the Chart, run helm list to determine the Chart version of the installed release, and then specify that version using the --version argument of helm upgrade.
Once you have the Immuta Docker Registry and Helm Chart Repository configured, download the immuta-values.yaml file. This file is a recommended starting point for your installation.
Modify immuta-values.yaml based on the determined configuration for your Kubernetes cluster and the desired Immuta installation. You can change a number of settings in this file, such as the options described below.
Replace the placeholder password value "<SPECIFY_PASSWORD_THAT_MEETS_YOUR_ORGANIZATIONS_POLICIES>" with a secure password that meets your organization's password policies.
Avoid these special characters in generated passwords: whitespace, $, &, :, \, /, '
Default Helm Values
If you would like to disable persistence to disk for the database and query-engine components, you can do so by configuring database.persistence.enabled=false and/or queryEngine.persistence.enabled=false in immuta-values.yaml. Disabling persistence can be done for test environments; however, we strongly recommend against disabling persistence in production environments, as this leaves your database in ephemeral storage.
By default, database.persistence.enabled and queryEngine.persistence.enabled are set to true and request 120Gi of storage for each component. Recommendations for the Immuta Metadata Database storage size for POV, Staging, and Production deployments are provided in the immuta-values.yaml as shown below. However, the actual size needed is a function of the number of data sources you intend to create and the amount of logging/auditing (and its retention) that will be used in your system.
Provide Room for Growth
Provide plenty of room for growth here, as Immuta's operation will be severely impacted should database storage reach capacity.
While the Immuta Query Engine persistent storage size is configurable as well, the default size of 20Gi should be sufficient for operations in nearly all environments.
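As a sketch, the relevant values look like the following; the 120Gi and 20Gi figures come from the defaults described above, while the exact key names (for example, size) are assumptions to verify against your immuta-values.yaml.

```yaml
database:
  persistence:
    enabled: true
    size: 120Gi        # size the metadata database generously; see the growth note above
queryEngine:
  persistence:
    enabled: true
    size: 20Gi         # the default is sufficient for nearly all environments
```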
Limitations on modifying database and query-engine persistence
At this point this procedure forks depending on whether you are installing with the intent of restoring from a backup or not. Use the bullets below to determine which step to follow.
If this is a new install with no restoration needed, follow Step 4.1.
Immuta's Helm Chart has support for taking backups and storing them in a PersistentVolume or copying them directly to cloud provider blob storage, including AWS S3, Azure Blob Storage, and Google Storage.
To configure backups with blob storage, reference the backup section in immuta-values.yaml and consult the subsections of this documentation that are specific to your cloud provider for assistance in configuring a compatible resource. If your Kubernetes environment is not represented there, or a workable solution does not appear available, please contact your Immuta representative to discuss options.
If using volumes, the Kubernetes cluster Immuta is being installed into must support PersistentVolumes with an access mode of ReadWriteMany. If such a resource is available, Immuta's Helm Chart will set everything up for you if you enable backups and comment out the volume and claimName.
If using the volume backup type, an existing PersistentVolumeClaim name needs to be configured in your immuta-values.yaml because the persistentVolumeClaimSpec is only used to create a new, empty volume.
If you are unsure of the value for <YOUR ReadWriteMany PersistentVolumeClaim NAME>, the command kubectl get pvc will list it for you.
Example Output
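The output has roughly the following shape; the claim name, capacity, and storage class shown are placeholders.

```
$ kubectl get pvc
NAME                 STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
immuta-backup-data   Bound    pvc-...    100Gi      RWX            nfs-client     12d
```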
Adhering to the guidelines and best practices for replicas and resource limits outlined below is essential for optimizing performance, ensuring cluster stability, controlling costs, and maintaining a secure and manageable environment. These settings help strike a balance between providing sufficient resources to function optimally and making efficient use of the underlying infrastructure.
Set the following replica parameters in your Helm Chart to the values listed below:
Add this YAML snippet to your Helm values file:
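The recommended values themselves are not reproduced here; the snippet below only sketches the shape of these settings, with placeholder replica counts and resource figures (and key names that should be verified against your immuta-values.yaml) rather than Immuta's recommendations.

```yaml
# Shape only -- replica counts, resource figures, and key names are placeholders.
web:
  replicas: 2
queryEngine:
  replicas: 2
database:
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
```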
The database is the only component that needs a lot of resources, especially if you don't use the Query Engine. For a small installation, you can set the database memory resources to 2Gi; if you see slower performance over time, you can increase this number to improve performance.
Setting CPU resources and limits is optional. Resource contention over CPU is not common for Immuta, so setting a CPU resource and limit won't have a significant effect.
Run the following command to deploy Immuta:
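Based on the release and chart names used elsewhere in this documentation, the command looks roughly like the following; the release name immuta is an example.

```
helm install immuta immuta/immuta --values immuta-values.yaml
```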
Troubleshooting
HTTP communication using TLS certificates is enabled by default in Immuta's Helm Chart for both internal (inside the Kubernetes cluster) and external (between the Kubernetes ingress and the outside world) communications. This is accomplished through the generation of a local certificate authority (CA) which signs certificates for each service, all handled automatically by the Immuta installation. While not recommended, if TLS must be disabled for some reason, this can be done by setting tls.enabled to false in the values file.
Best Practice: TLS Certification
Immuta recommends using your own TLS certificate for external (outside the Kubernetes cluster) communications in production deployments.
Using your own certificates requires you to create a Kubernetes Secret containing the private key, certificate, and certificate authority certificate. This can be easily done using kubectl:
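For example (the secret name immuta-external-tls matches the value referenced below; the file names and the key names expected inside the secret are assumptions to verify against the Helm Chart reference):

```
kubectl create secret generic immuta-external-tls \
  --from-file=tls.crt=server.crt \
  --from-file=tls.key=server.key \
  --from-file=ca.crt=ca.crt
```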
Make sure your certificates are correct
Make sure the certificate's Common Name (CN) and/or Subject Alternative Name (SAN) matches the specified externalHostname or contains an appropriate wildcard.
After creating the Kubernetes Secret, specify its use in the external ingress by setting tls.externalSecretName to immuta-external-tls in your immuta-values.yaml file:
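In Helm values form, that setting is:

```yaml
tls:
  externalSecretName: immuta-external-tls
```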
For Argo CD versions older than 1.7.0, you must use the following Helm values in order for the TLS generation hook to run successfully.
Starting with Argo CD version 1.7.0, the default TLS generation hook values can be used.
tls.manageGeneratedSecret must be set to true when using Argo CD to deploy Immuta; otherwise, the generated TLS secret will be shown as OutOfSync (requires pruning) in Argo CD. Pruning the Secret would break TLS for the deployment, so it is important to set this value to prevent that from happening.
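The pre-1.7.0 hook values are not reproduced here, but the manageGeneratedSecret setting described above looks like this in Helm values form:

```yaml
tls:
  manageGeneratedSecret: true
```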
Rocky Linux 9
Review the potential impacts of Immuta's Rocky Linux 9 upgrade to your environment before proceeding:
ODBC Drivers
Your ODBC drivers should use a driver compatible with Enterprise Linux 9 or Red Hat Enterprise Linux 9.
Container Runtimes
You must run a supported container runtime.
Use at least Docker v20.10.10 if using Docker as the container runtime.
Use at least containerd 1.4.10 if using containerd as the container runtime.
OpenSSL 3.0
CentOS Stream 9 uses OpenSSL 3.0, which has deprecated support for older insecure hashes and TLS versions, such as TLS 1.0 and TLS 1.1. This shouldn't impact you unless you are using an old, insecure certificate. In that case, the certificate will no longer work. See the for more information.
FIPS Environments
If you run Immuta 2022.5.x containers in a FIPS-enabled environment, they will now fail. Helm Chart 4.11 contains a feature for you to override the openssl.cnf file, which can be used to allow Immuta to run in your environment, mimicking the CentOS 7 behavior.
Helm: 3.2 or greater
Kubernetes: See the for a list of versions Immuta supports.
Database backups for the metadata database and Query Engine may be stored in either cloud-based blob storage or a Persistent Volume in Kubernetes.
Backups may be stored using one of the following cloud-based blob storage services:
AWS S3
Supports authentication via AWS Access Key ID / Secret Key, IAM Roles via kube2iam or kiam, or IAM Roles in EKS.
Azure Blob Storage
Supports authentication via Azure Storage Key, Azure SAS Token, or Azure Managed Identities.
Google Cloud Storage
Supports authentication via Google Service Account Key
When database persistence is enabled, Immuta requires access to PersistentVolumes through the use of a persistent volume claim template. These volumes should normally be provided by a block device, such as AWS EBS, Azure Disk, or GCE Persistent Disk.
Additionally, when database persistence is enabled, Immuta requires the ability to run an initContainer as root. When PodSecurityPolicies are in place, service accounts must be granted access to use a PodSecurityPolicy with the ability to RunAsAny user.
The Immuta Helm Chart supports RBAC and will try to create all needed RBAC roles by default.
Best Practice: Use Nginx Ingress Controller
Immuta needs Ingress for two services:
Immuta Web Service (HTTP)
Immuta Query Engine (TCP)
Immuta’s suggested minimum node size has 4 CPUs and 16GB RAM. The default Immuta Helm deployment requires at least 3 nodes.
Internal HTTPS communication refers to all communication between Immuta services. External HTTPS communication refers to communication between clients and the Immuta Query Engine and Web Service, which is configured using a Kubernetes Ingress resource.
Audience: System Administrators
Content Summary: This page outlines instructions for troubleshooting specific issues with Helm.
Using a Kubernetes namespace
If deploying Immuta into a Kubernetes namespace other than the default, you must include the --namespace option in the helm and kubectl commands provided throughout this section.
If you encounter Immuta Pods that have had the status Pending or Init:0/1 for an extended period of time, then there may be an issue mounting volumes to the Pods. You may find error messages by describing one of the pods that had the Pending or Init:0/1 status.
If an event with the message pod has unbound PersistentVolumeClaims is seen on the frozen pod, then there is most likely an issue with the database backup storage Persistent Volume Claims. Typically this is caused by the database backup PVC not binding because there are no Kubernetes Storage Classes configured to provide the correct storage type.
Solution
Review your Helm values and ensure that you either have the proper storageClassName or claimName set.
Once you have updated the immuta-values.yaml to contain the proper PVC configuration, first delete the Immuta deployment and then run helm install.
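For example (the release name is a placeholder):

```
helm delete <YOUR RELEASE NAME>
helm install <YOUR RELEASE NAME> immuta/immuta --values immuta-values.yaml
```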
Occasionally Helm has bugs or loses track of Kubernetes resources that it has created. Immuta has created a Bash script that you may download and use to clean up all resources that are tied to an Immuta deployment. This script should only be run after helm delete <YOUR RELEASE NAME>.
Download cleanup-immuta-deployment.sh.
Run the script:
After a configuration change or cluster outage, you may need to perform a rolling restart to refresh the database pods without data loss. Use the command below to update a restart annotation on the database pods to instruct the database StatefulSet to roll the pods.
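The exact command is not reproduced here; a sketch that patches a restart annotation onto the StatefulSet's pod template follows, with a placeholder StatefulSet name.

```
kubectl patch statefulset <DATABASE STATEFULSET NAME> \
  -p '{"spec":{"template":{"metadata":{"annotations":{"restart":"'"$(date +%s)"'"}}}}}'
```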
After a configuration change or cluster outage, you may need to perform a rolling restart to refresh the Web Service pods. Use the command below to update a restart annotation on the web pods to instruct the Deployment to roll the pods.
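A sketch of the equivalent patch for the Web Service Deployment, with a placeholder Deployment name:

```
kubectl patch deployment <WEB SERVICE DEPLOYMENT NAME> \
  -p '{"spec":{"template":{"metadata":{"annotations":{"restart":"'"$(date +%s)"'"}}}}}'
```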
Solution
Should you need to regenerate internal TLS certificates, follow the instructions below.
Solution
Delete the internal TLS secret
Recreate the internal TLS secret by running Helm Upgrade.
Note
If you need to modify any Postgres settings, such as TLS certificate verification for the Query Engine, be sure to modify the values.yaml file before running this command.
Helm 3: helm upgrade <YOUR RELEASE NAME> immuta/immuta --values <YOUR VALUES FILE>
Helm 2: helm upgrade immuta/immuta --values <YOUR VALUES FILE> --name <YOUR RELEASE NAME>
WAIT FOR PODS TO RESTART BEFORE CONTINUING
Restart Query Engine:
WAIT FOR PODS TO RESTART BEFORE CONTINUING
Restart Web Service:
Should you need to rotate external TLS certificates, follow the instructions below:
Solution
Create a new secret with the relevant TLS files.
Update your tls.externalSecretName
in immuta_values.yaml with the new external TLS secret.
Run Helm Upgrade to update the certificates for the deployment.
Delete the old secret.
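As a sketch of the steps above, with placeholder secret and certificate file names:

    # 1. Create a new secret with the relevant TLS files.
    kubectl create secret tls immuta-external-tls-new --cert=immuta.example.com.crt --key=immuta.example.com.key
    # 2. Update tls.externalSecretName in immuta_values.yaml to immuta-external-tls-new, then:
    helm upgrade <YOUR RELEASE NAME> immuta/immuta --values immuta_values.yaml
    # 3. After the rollout completes, delete the old secret.
    kubectl delete secret immuta-external-tls-old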
In order to connect to the Immuta Query Engine, each user must create SQL credentials. SQL credentials can be accessed by clicking the SQL Credentials tab.
For more information, see the documentation on SQL credentials.
If email is configured for an organization's Immuta instance, users may also receive notifications at the email address they configure in their profile.
The externalDatabase object is detailed below.
Run helm list to view all existing Helm releases.
Follow the steps outlined in section .
Guidance for configuring several key options is provided below. See the Helm Chart Options page for a comprehensive list of configuration options.
Modifying any file bundled inside the Helm Chart could cause unforeseen issues and as such is not supported by Immuta. This includes, but is not limited to, the values.yaml file that contains default configurations for the Helm deployment. Make any custom configurations in the immuta-values.yaml file, which can then be passed to helm install immuta using the --values flag.
Once persistence is set to either true or false for database or query-engine, it cannot be changed for the deployment. Modifying persistence will require a fresh installation or a full backup and restore procedure.
If you are upgrading a previous installation using the full backup and restore procedure, follow Step 4.2: a valid backup configuration must be available in the Helm values. Enable the functionality to restore from backups by setting the restore.enabled option to true in immuta-values.yaml.
The Immuta Helm Chart supports resource requests and limits for all components. Set resource requests and limits for the database and query engine in the Helm values. Without those limits, the pods will be the first targets for eviction, which can cause issues during backup and restore, since that process consumes a lot of memory.
If you encounter errors while deploying the Immuta Helm Chart, see the troubleshooting section.
The Immuta Helm Chart (version 4.5.0+) can be deployed using .
For detailed assistance in troubleshooting your installation, contact your Immuta representative.
Immuta uses Helm to manage and orchestrate Kubernetes deployments. Check that you are using the correct Helm Chart version for your version of Immuta.
Immuta recommends that you use the Nginx Ingress Controller because it supports both HTTP and TCP ingress.
The Immuta Helm Chart creates Ingress resources for HTTP services (the Immuta Web Service), but because of limitations with Kubernetes Ingress resources, TCP ingress must be configured separately. The configuration for TCP ingress depends on the Ingress Controller you are using in your cluster. TCP ingress is also optional for some deployments and can be disabled.
To simplify the configuration for cluster Ingress, the Immuta Helm Chart contains an optional Nginx Ingress component that may be used to configure an ingress controller specifically for Immuta. Contact your Immuta Support Professional for more information.
All Immuta services use TLS certificates to enable communication over HTTPS. To support many configurations, the Immuta Helm Chart can configure internal and external communication independently. If TLS is enabled, a certificate authority is generated by default and then used to sign a certificate for both internal and external communication. See the TLS configuration options below for instructions on configuring TLS.
host (required): Hostname of the external PostgreSQL database instance. Default: nil
port: Port of the external PostgreSQL database instance. Default: 5432
sslmode (required): The mode for the database connection. Supported values are disable, require, verify-ca, and verify-full. Default: nil
superuser.username: Username for the superuser used to initialize the PostgreSQL instance. Default: nil
superuser.password: Password for the superuser used to initialize the PostgreSQL instance. Default: nil
username: Username that Immuta creates and uses for the application. Default: bometa
password (required): Password associated with username. Default: nil
dbname: Database name that Immuta uses. Default: bometadata
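For example, using the parameters above, an installation against an external PostgreSQL instance might be configured as follows; the hostname and passwords are placeholders, and in practice these keys would normally live in your immuta-values.yaml rather than on the command line:

    helm install <YOUR RELEASE NAME> immuta/immuta \
      --values immuta-values.yaml \
      --set database.enabled=false \
      --set externalDatabase.host=postgres.example.com \
      --set externalDatabase.port=5432 \
      --set externalDatabase.sslmode=require \
      --set externalDatabase.superuser.username=postgres \
      --set externalDatabase.superuser.password='<SUPERUSER PASSWORD>' \
      --set externalDatabase.username=bometa \
      --set externalDatabase.password='<APPLICATION PASSWORD>' \
      --set externalDatabase.dbname=bometadata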
Using a Kubernetes namespace
If deploying Immuta into a Kubernetes namespace other than the default, you must include the --namespace option in all helm and kubectl commands provided throughout this section.
Immuta's Helm Chart requires Helm version 3+.
New installations of Immuta must use the latest version of Helm 3 and Immuta's latest Chart.
Run helm version
to verify the version of Helm you are using:
In order to deploy Immuta to your Kubernetes cluster, you must be able to access the Immuta Helm Chart Repository and the Immuta Docker Registry. You can obtain credentials from your Immuta support professional.
Run helm repo list
to ensure Immuta's Helm Chart repository has been successfully added:
Example Output
You must create a Kubernetes Image Pull Secret in the namespace where you are deploying Immuta, or the installation will fail.
Run kubectl get secrets
to confirm your Kubernetes image pull secret is in place:
Example Output
No Rollback
Immuta's database migrations are one way; there is no way to revert to an earlier version of the software. If you must roll back, you will need to back up and delete your existing installation, and then restore from the backup on the appropriate version of the software.
No Modifying Persistence
Once persistence is set to either true or false for the database or query-engine, it cannot be changed for the deployment. Modifying persistence will require a fresh installation or a full backup and restore procedure as per Method B: Complete Backup and Restore Upgrade.
Run helm search repo immuta
to check the version of your local copy of Immuta's Helm Chart:
Example Output
Update your local Chart by running helm repo update
.
To perform an upgrade without upgrading to the latest version of the Chart, run helm list to determine the Chart version of the installed release, and then specify that version using the --version argument of helm upgrade.
Run helm list
to confirm Helm connectivity and verify the current Immuta installation:
Example Output
Make note of:
NAME - This is the '<YOUR RELEASE NAME>
' that will be used in the remainder of these instructions.
CHART - This is the version of Immuta's Helm Chart that your instance was deployed under.
You will need the Helm values associated with your installation, which are typically stored in an immuta-values.yaml
file. If you do not possess the original values file, these can be extracted from the existing installation using:
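With Helm 3, a minimal sketch of recovering them is:

    helm get values <YOUR RELEASE NAME> > immuta-values.yaml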
Select your method:
Method B - Backup and Restore: This method is intended primarily for recovery scenarios and is only to be used if you have been advised to by an Immuta representative. Reach out to your Immuta representative for instructions.
Rocky Linux 9
Review the potential impacts of Immuta's Rocky Linux 9 upgrade to your environment before proceeding:
ODBC Drivers
Your ODBC drivers must be compatible with Enterprise Linux 9 or Red Hat Enterprise Linux 9.
Container Runtimes
You must run a supported version of Kubernetes.
Use at least Docker v20.10.10 if using Docker as the container runtime.
Use at least containerd 1.4.10 if using containerd as the container runtime.
OpenSSL 3.0
CentOS Stream 9 uses OpenSSL 3.0, which has deprecated support for older insecure hashes and TLS versions, such as TLS 1.0 and TLS 1.1. This shouldn't impact you unless you are using an old, insecure certificate. In that case, the certificate will no longer work. See the OpenSSL migration guide for more information.
FIPS Environments
If you run Immuta 2022.5.x containers in a FIPS-enabled environment, they will now fail. Helm Chart 4.11 contains a feature for you to override the openssl.cnf
file, which can be used to allow Immuta to run in your environment, mimicking the CentOS 7 behavior.
After you make any desired changes in your immuta-values.yaml
file, you can apply these changes by running helm upgrade
:
Note: Errors can result when upgrading Chart versions on an existing installation. These are typically easy to resolve by making slight modifications to your values to accommodate changes in the Chart. Downloading an updated copy of immuta-values.yaml and comparing it to your existing values is often a good way to debug such occurrences.
If you are on Kubernetes 1.22+, remove nginxIngress.controller.image.tag=v0.49.3
when upgrading; otherwise, your ingress service may not start after the upgrade.
Audience: System Administrators
Content Summary: Before installing Immuta, you will need to spin up your AKS or ACS cluster. This page outlines how to deploy Immuta cluster infrastructure on AKS and ACS.
If you would like to install Immuta on an existing AKS or ACS cluster, you can skip this section. However, we recommend deploying a dedicated resource group and cluster for Immuta if possible.
Once you have deployed your cluster infrastructure, please visit Helm Installation on Microsoft Azure Kubernetes Service to finish installing Immuta.
Best Practice: Use AKS
Immuta highly recommends using AKS. Immuta on AKS will exhibit superior stability, performance, and scalability compared to a deployment on the deprecated version known as ACS.
You will need a resource group to deploy your AKS or ACS cluster in:
Note: There is no naming requirement for the Immuta resource group.
Now it is time to spin up your cluster resources in Azure. This step will be different depending on whether you are deploying an AKS or ACS cluster.
After running the command, you will have to wait a few moments as the cluster resources are starting up.
Create AKS Cluster (Recommended):
Create ACS Cluster (Deprecated):
You will need to configure the kubectl
command line utility to use the Immuta cluster.
If you do not have kubectl
installed, you can install it through the Azure CLI.
If you are using AKS, run
For ACS clusters, run
If you are using AKS, run
For ACS clusters, run
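As a sketch, assuming a resource group named immuta-rg and a cluster named immuta-cluster (the az acs equivalents exist only in older Azure CLI releases):

    # Install kubectl through the Azure CLI (AKS; for ACS: az acs kubernetes install-cli).
    az aks install-cli
    # Configure kubectl to use the Immuta cluster (for ACS: az acs kubernetes get-credentials).
    az aks get-credentials --resource-group immuta-rg --name immuta-cluster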
The following procedure walks through the process of changing passwords for the database users in the Immuta Database.
The commands outlined here will need to be altered depending on your Helm release name and chosen passwords. Depending on your environment, there may be other changes required for the commands to complete successfully, including, but not limited to, Kubernetes namespace, kubectl context, and Helm values file name.
This process results in downtime.
Scale database StatefulSet
to 1 replica:
Change database.superuserPassword
:
Alter Postgres user password:
Update database.superuserPassword
with <new-password>
in immuta-values.yaml
.
Change database.replicationPassword
:
Alter replicator user password:
Update database.replicationPassword
with <new-password>
in immuta-values.yaml
.
Change database.password
:
Alter bometa
user password:
Update database.password
with <new-password>
in immuta-values.yaml
.
Update database.patroniApiPassword
with <new-password>
in immuta-values.yaml
.
Run helm upgrade
to persist the changes and scale the database StatefulSet
up:
Restart web pods:
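Condensed into a sketch, the procedure above might look like the following; the workload names and the psql invocation are assumptions to adjust for your environment:

    # Scale the database StatefulSet down to a single replica.
    kubectl scale statefulset <YOUR RELEASE NAME>-immuta-database --replicas=1
    # Alter each database user from inside the remaining database pod.
    kubectl exec <YOUR RELEASE NAME>-immuta-database-0 -- psql -U postgres -c "ALTER USER postgres WITH PASSWORD '<new-password>';"
    kubectl exec <YOUR RELEASE NAME>-immuta-database-0 -- psql -U postgres -c "ALTER USER replicator WITH PASSWORD '<new-password>';"
    kubectl exec <YOUR RELEASE NAME>-immuta-database-0 -- psql -U postgres -c "ALTER USER bometa WITH PASSWORD '<new-password>';"
    # Persist the new passwords from immuta-values.yaml and scale the StatefulSet back up.
    helm upgrade <YOUR RELEASE NAME> immuta/immuta --values immuta-values.yaml
    # Restart the web pods so they pick up the new credentials.
    kubectl rollout restart deployment <YOUR RELEASE NAME>-immuta-web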
Users have the option to use an existing Kubernetes secret for Immuta database passwords used in Helm installations.
Update your existingSecret
values in your Kubernetes environment.
Get the current replica counts:
Scale database StatefulSet
to 1 replica:
Change the value corresponding to database.superuserPassword
in the existing Kubernetes Secret.
Alter Postgres user password:
Change the value corresponding to database.replicationPassword
in the existing Kubernetes Secret.
Alter replicator user password:
Change the value corresponding to database.password
in the existing Kubernetes Secret.
Alter bometa
user password:
Scale the immuta-database StatefulSet
up to the previous replica count determined in the previous step:
Restart web pods:
Audience: System Administrators
Content Summary: This page outlines how to install Immuta in an air-gapped environment.
Process for Saving and Loading Docker Images
The process outlined for saving and loading the Docker images will be different for everyone. With the exception of the list of Docker images that all users need to copy to their container registry, all code blocks provided are merely examples.
This high-level overview makes these assumptions:
a container registry is accessible from inside the air-gapped environment
Docker and Helm are already installed
All users should copy these Docker images to their container registry.
See the Helm Chart Options page for the values: IMMUTA_DEPLOY_TOOLS_VERSION, MEMCACHED_TAG, and INGRESS_NGINX_TAG.
Docker Registry Authentication
Contact your Immuta support professional for your Immuta Docker Registry credentials.
Authenticate with Immuta's Docker registry.
Pull the images.
Save the images.
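For example, using the Immuta image repositories documented in the Helm Chart options later in this document (substitute the image tags for your target release):

    # Authenticate with Immuta's Docker registry.
    docker login registry.immuta.com
    # Pull each image, then save it as a compressed archive.
    for image in immuta/immuta-service:<IMMUTA VERSION> immuta/immuta-db:<IMMUTA VERSION> immuta/immuta-fingerprint:<IMMUTA VERSION>; do
      docker pull "registry.immuta.com/${image}"
      docker save "registry.immuta.com/${image}" | gzip > "$(echo "${image}" | tr '/:' '_').tar.gz"
    done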
The .tar.gz
files will now be in your working directory.
Add Immuta's Chart repository to Helm.
Download the Helm Chart.
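A sketch of those two steps, with the repository URL and credentials as placeholders supplied by your Immuta support professional:

    helm repo add immuta <IMMUTA CHART REPOSITORY URL> --username <USERNAME> --password <PASSWORD>
    helm repo update
    helm pull immuta/immuta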
The .tgz
files will now be in your working directory.
Move the Helm Chart and Docker images onto a machine connected to the air-gapped network.
Copy these Docker images to your container registry. Note: You may need to reload the environment variables.
Validate that the images are present.
Tag the images.
Push the images to your registry.
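A sketch of loading, validating, tagging, and pushing the images, assuming the CUSTOMER_REGISTRY environment variable is set to your registry URL:

    # Load the saved image archives.
    for archive in *.tar.gz; do docker load --input "${archive}"; done
    # Validate that the images are present.
    docker images
    # Tag and push each image (repeat for immuta-db and immuta-fingerprint).
    docker tag registry.immuta.com/immuta/immuta-service:<IMMUTA VERSION> "${CUSTOMER_REGISTRY}/immuta/immuta-service:<IMMUTA VERSION>"
    docker push "${CUSTOMER_REGISTRY}/immuta/immuta-service:<IMMUTA VERSION>"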
Create the Helm values file (e.g., myValues.yaml) and point it to your registry (e.g., web.imageRepository). Be sure to replace $CUSTOMER_REGISTRY with the URL of your registry, including any additional prefixes before immuta.
Deploy the Helm Chart.
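For example, installing from the downloaded Chart archive (the archive file name varies by Chart version):

    helm install <YOUR RELEASE NAME> ./immuta-<CHART VERSION>.tgz --values myValues.yaml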
Use case
While you're onboarding Snowflake data sources and designing policies, you don't want to disrupt your Snowflake users' existing workflows.
Instead, you want to gradually onboard Immuta through a series of successive changes that will not impact your existing Snowflake users.
A phased onboarding approach to configuring the Snowflake integration ensures that your users will not be immediately affected by changes as you add data sources and configure policies.
Several features allow you to gradually onboard data sources and policies in Immuta:
Subscription policy of “None” by default: By default, no policy is applied at registration time; instead of applying a restrictive policy immediately upon registration, the table is registered in Immuta and waits for a policy to be applied, if ever.
There are several benefits to this design:
All existing roles maintain access to the data and registration of the table or view with Immuta has zero impact on your data platform.
It gives you time to configure tags on the Immuta registered tables and views, either manually or through automatic means, such as Immuta’s sensitive data detection (SDD), or an external catalog integration to include Snowflake tags.
It gives you time to assess and validate the sensitive data tags that were applied.
You can build only row and column controls with Immuta and let your existing roles manage table access instead of using Immuta subscription policies for table access.
Snowflake table grants coupled with Snowflake low row access policy mode: With these features enabled, Immuta manages access to tables (subscription policies) through GRANTs. Immuta assigns each user their own unique role that it creates, and all table access is managed using that single role.
Without these two features enabled, Immuta uses a Snowflake row access policy (RAP) to manage table access. A RAP only allows users to access rows in the table if they were explicitly granted access through an Immuta subscription policy; otherwise, the user sees no rows. This behavior means all existing Snowflake roles lose access to the table contents until explicitly granted access through Immuta subscription policies. Essentially, roles outside of Immuta don't control access anymore.
By using table grants and the low row access policy mode, users and roles outside Immuta continue to work.
There are two benefits to this approach:
All pre-existing Snowflake roles retain access to the data until you explicitly revoke access (outside Immuta).
It provides a way to test that Immuta GRANTs are working without impacting production workloads.
The following configuration is required for phased Snowflake onboarding:
Impersonation is disabled
Project workspaces are disabled
If either user impersonation or project workspaces is necessary for your use case, you cannot do phased Snowflake onboarding as described below.
Configure your Snowflake integration with the following features enabled:
Snowflake table grants (enabled by default)
Snowflake low row access policy mode
Select None as your default subscription policy.
Plan the policies you need to have in place, the tags that will apply to your data, and how the tags will be applied to your data.
Register a subset of your tables to configure and validate SDD.
Configure SDD to discover entities of interest for your policy needs.
Validate that the SDD tags are applied correctly.
Register your remaining tables at the schema level with schema detection turned on. This setting allows Immuta to continuously monitor for schema changes (new tables or columns, dropped tables or columns, and changed column types).
Let SDD or external catalog synchronization complete, and then validate that SDD tags are applied correctly.
Further customize SDD as necessary.
At this point, no policies are in place because of the default subscription policy setting. Now you can write and apply the policies you planned. You do not have to do all policies at once.
In the steps below, you do not have to validate every policy you create in Immuta; instead, examine a few to validate the behavior you expect to see.
Subscription policies grant or revoke access to Snowflake tables.
If necessary, you could use your existing roles for table access and only use Immuta for row access policies and masking policies.
Immuta roles are created for users once they are subscribed to a table by a policy. SECONDARY ROLES ALL
allows you to combine warehouse access with the Immuta role.
Validate that the Immuta users impacted now have an Immuta role in Snowflake dedicated to them.
Validate that when acting under the Immuta role those users have access to the table(s) in question.
Validate that users without access in Immuta can still access the table with a different Snowflake role that has access.
Validate that a user with SECONDARY ROLES ALL
enabled retains access if
they were not granted access by Immuta and
they have a role that provides them access, even if they are not currently acting under that role.
Data policies enforce fine-grained access controls on a table (for example, row access policies or masking policies).
Validate that a user with a role that can access the table in question (whether it's an Immuta role or not) sees the impact of that data policy.
Once all Immuta policies are in place, remove or alter old roles.
Delete irrelevant roles instead of revoking access to avoid confusion.
Ensure deleting roles will not have other implications, like impacting warehouse access. If deleting those roles will have unintended effects, alter those roles to remove the access control logic instead of deleting them.
Immuta is compatible with Snowflake Secure Data Sharing. Using both Immuta and Snowflake, organizations can share the policy-protected data of their Snowflake database with other Snowflake accounts with Immuta policies enforced in real time. This integration gives data consumers a live connection to the data and relieves data providers of the legal and technical burden of creating static data copies that leave their Snowflake environment.
There are two options to use Snowflake Data Sharing with Immuta:
Snowflake Data Shares with Immuta Users (Public Preview): This option utilizes Snowflake table grants and requires the data viewer to be registered as an Immuta user.
Snowflake Data Shares with Non-Immuta Users: This option utilizes Snowflake project workspaces to share policy-protected data without data viewers being registered as Immuta users.
This method allows data providers to share policy-enforced data with data consumers registered in Immuta.
The data consumer will register in Immuta as a user with the appropriate Immuta attributes and groups. Once that user has subscribed to the data source, they will be able to see the policy-protected data of a Snowflake data share.
For a tutorial on this workflow, see the Using Snowflake Data Sharing page.
Snowflake Enterprise Edition or higher
Immuta's table grants feature
Using Immuta users with Snowflake Data Sharing allows the sharer to
Only need limited knowledge of the context or goals of the existing policies in place: Because the sharer is not editing or creating policies to share their data, they only need a limited knowledge of how the policies work. Their main responsibility is making sure they properly represent the attributes of the data consumer.
Leave policies untouched.
In this method, Immuta projects can be used to protect and share data with data consumers, even without those users being registered in Immuta.
Using Immuta projects, organizations can create projects and then adjust the equalized entitlements of the project to represent attributes and groups of the data consumer. This allows the project to function as a user, with the data being protected for a particular set of attributes and groups. Once the entitlements have been set, the project owner can enable a project workspace that will create a Snowflake secure view of that policy-protected data that is ready to share with the data consumer. Because of the Immuta project, equalized entitlements, and workspace, the data is restricted to data consumers who possess the relevant attributes and groups.
For a tutorial on this workflow, see the Using Snowflake Data Sharing page.
Any Snowflake integration
Immuta attribute based access control (ABAC) data policies
Using Immuta project workspaces with Snowflake Data Sharing allows the sharer to
Only need limited knowledge of the context or goals of the existing policies in place: Because the sharer is not editing or creating policies to share their data, they only need a limited knowledge of how the policies work. Their main responsibility is making sure they properly represent the attributes of the data consumer.
Leave policies untouched.
Only share data that the sharer is allowed to see: Users who can create data shares shouldn’t necessarily be the same users who can make policy changes.
Let Immuta create the policy-enforced secure view, ready to share.
Project workspaces are generally recommended to allow WRITE access; however, Snowflake's Data Sharing feature does not support WRITE access to shared data.
Actions of the data consumer after the data has been shared are not audited when using project workspaces.
In this legacy integration, all enforcement is done by creating views that contain all policy logic. Each view has a 1-to-1 relationship with the original table. All policy-enforced views are accessible through the PUBLIC
role and access controls are applied in the view, allowing customers to leverage Immuta's powerful set of attribute-based policies. Additionally, users can continue using roles to enforce compute-based policies through "warehouse" roles, without needing to grant each of those roles access to the underlying data or create multiple views of the data for each specific business unit.
This integration leverages webhooks to keep Snowflake views up-to-date with the corresponding Immuta data sources. Whenever a data source or policy is created, updated, or disabled, a webhook will be called that will create, modify, or delete the Snowflake view with Immuta policies.
The SQL that makes up all views includes a join to the secure view: immuta_system.user_profile
. This view is a select from the immuta_system.profile
table (which contains all Immuta users and their current groups, attributes, projects, and a list of valid tables they have access to) with a constraint immuta__userid = current_user()
to ensure it only contains the profile row for the current user. This secure view is readable by all users and will only display the data that corresponds to the user executing the query.
Note: The immuta_system.profile
table is updated through webhooks whenever a user's groups or attributes change, they switch projects, they acknowledge a purpose, or when their data source access is approved or revoked. The profile table can only be read and updated by the Immuta system account.
By default, all views are created within the immuta
database, which is accessible by the PUBLIC
role, so users acting under any Snowflake role can connect. All views within the database have the SELECT
permission granted to the PUBLIC
role as well, and access is enforced by the access_check
function built into the individual views. Consequently, there is no need for users to manage any role-based access to any of the database objects managed by Immuta.
When creating a Snowflake data source, users have the option to use a regular view (traditional database view) or a secure view; however, according to Snowflake's documentation, "the Snowflake query optimizer, when evaluating secure views, bypasses certain optimizations used for regular views. This may result in some impact on query performance for secure views." To use the data source with both Snowflake and Snowflake workspaces, secure views are necessary. Note: If HIPAA compliance is required, secure views must be used.
When using a non-secure view, certain policies may leak sensitive information. In addition to the concerns outlined here, there is also a risk of someone exploiting the query optimizer to discover that a row exists in a table that has been excluded by row-level policies. This attack is mentioned here in the Snowflake documentation.
Policies that will not leak sensitive information
Masking by making NULL, using a constant, or by rounding (date/numeric)
Minimization row-level policies
Date-based row-level policies
K-anonymization masking policies
Policies that could leak sensitive information
Masking using a regex will show the regex being applied. In general this should be safe, but if you have a regex policy that removes a specific selector to redact (e.g., a regex of /123-45-6789/g
to specifically remove a single SSN from a column), then someone would be able to identify columns with that value.
In conditional masking and custom WHERE clauses including “Right To Be Forgotten,” the custom SQL will be visible, so for a policy like "only show rows where COUNTRY NOT IN(‘UK’, ‘AUS’)," users will know that it’s possible there is data in that table containing those values.
Policies that will leak potentially sensitive information
These policies leak information sensitive to Immuta, but in most cases would require an attacker to reverse the algorithm. In general these policies should be used with secure views:
Masking using hashing will include the salt used.
Numeric and categorical randomized response will include the salt used.
Reversible masking will include both a key and an IV.
Format preserving masking will include a tweak, key, an alphabet range, prefix, pad to length, and checksum id if used.
The data sources themselves have all the Data policies included in the SQL through a series of CASE statements that determine which view of the data a user will see. Row-level policies are applied as top-level WHERE clauses, and usage policies (purpose-based or subscription-level) are applied as WHERE clauses against the user_profile
JOIN. The access_check
function allows Immuta to throw custom errors when a user lacks access to a data source because they are not subscribed to the data source, they are operating under the wrong project, or they cannot view any data because of policies enforced on the data source.
Migration troubleshooting
If multiple Snowflake integrations are enabled, they will all migrate together. If one fails, they will all revert to the Snowflake Standard integration.
If an error occurs during migration and the integration cannot be reverted, the integration must be disabled and re-enabled.
You can migrate from a Snowflake integration without governance features to a Snowflake integration with governance features on the app settings page. Once prompted, Immuta will migrate the integration, allowing users to seamlessly transition workloads from the legacy Immuta views to the direct Snowflake tables.
After the migration is complete, Immuta views will still exist for pre-existing Snowflake data sources to support existing workflows. However, disabling the Immuta data source will drop the Immuta view, and, if the data source is re-enabled, the view will not be recreated.
Certain interpolation functions can also block the creation of a view, specifically @interpolatedComparison()
and @iam
.
When configuring one Snowflake instance with multiple Immuta instances, the user or system account that enables the integration on the app settings page must be unique for each Immuta instance.
Snowflake Enterprise Edition required
This integration requires the Snowflake Enterprise Edition.
In this integration, Immuta manages access to Snowflake tables by administering Snowflake row access policies and column masking policies on those tables, allowing users to query tables directly in Snowflake while dynamic policies are enforced.
Like with all Immuta integrations, Immuta can inject its ABAC model into policy building and administration to remove policy management burden and significantly reduce role explosion.
When an administrator configures the Snowflake integration with Immuta, Immuta creates an IMMUTA
database and schemas (immuta_procedures
, immuta_policies
, and immuta_functions
) within Snowflake to contain policy definitions and user entitlements. Immuta then creates a system role and gives that system account the following privileges:
APPLY MASKING POLICY
APPLY ROW ACCESS POLICY
ALL PRIVILEGES ON DATABASE "IMMUTA" WITH GRANT OPTION
ALL PRIVILEGES ON ALL SCHEMAS IN DATABASE "IMMUTA" WITH GRANT OPTION
USAGE ON FUTURE PROCEDURES IN SCHEMA "IMMUTA".immuta_procedures WITH GRANT OPTION
USAGE ON WAREHOUSE
OWNERSHIP ON SCHEMA "IMMUTA".immuta_policies TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
OWNERSHIP ON SCHEMA "IMMUTA".immuta_procedures TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
OWNERSHIP ON SCHEMA "IMMUTA".immuta_functions TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
OWNERSHIP ON SCHEMA "IMMUTA".public TO ROLE "IMMUTA_SYSTEM" COPY CURRENT GRANTS
Optional features, like automatic object tagging and native query auditing, require additional permissions to be granted to the Immuta system account; these permissions are listed in the supported features section.
Snowflake is a policy push integration with Immuta. When Immuta users create policies, they are then pushed into the Immuta database within Snowflake; there, the Immuta system account applies Snowflake row access policies and column masking policies directly onto Snowflake tables. Changes in Immuta policies, user attributes, or data sources trigger webhooks that keep the Snowflake policies up-to-date.
For a user to query Immuta-protected data, they must meet two qualifications:
They must be subscribed to the Immuta data source.
They must be granted SELECT
access on the table by the Snowflake object owner or automatically via the Snowflake table grants feature.
After a user has met these qualifications they can query Snowflake tables directly.
When a user applies a masking policy to a Snowflake data source, Immuta truncates masked values to align with Snowflake column length (VARCHAR(X)
types) and precision (NUMBER (X,Y)
types) requirements.
Consider these columns in a data source that have the following masking policies applied:
Column A (VARCHAR(6)): Mask using hashing for everyone
Column B (VARCHAR(5)): Mask using a constant REDACTED
for everyone
Column C (VARCHAR(6)): Mask by making null for everyone
Column D (NUMBER(3, 0)): Mask by rounding to the nearest 10 for everyone
Querying this data source in Snowflake would return the following values:
Column A: 5w4502, 6e3611, 9s7934
Column B: REDAC, REDAC, REDAC
Column C: NULL, NULL, NULL
Column D: 990, 750, 380
Hashing collisions
Hashing collisions are more likely to occur across or within Snowflake columns restricted to short lengths, since Immuta truncates the hashed value to the limit of the column. (Hashed values truncated to 5 characters have a higher risk of collision than hashed values truncated to 20 characters.) Therefore, avoid applying hashing policies to Snowflake columns with such restrictions.
For more details about Snowflake column length and precision requirements, see the Snowflake behavior change release documentation.
When a policy is applied to a column, Immuta uses Snowflake memoizable functions to cache the result of the called function. Then, when a user queries a column that has that policy applied to it, Immuta uses that cached result to dramatically improve query performance.
Best practice
Use a dedicated Snowflake role to register Snowflake tables as Immuta data sources. Then, include this role in the excepted roles/users list.
Register Snowflake data sources using a dedicated Snowflake role. No policies will apply to that role, ensuring that your integration works with the following use cases:
Snowflake project workspaces: Snowflake workspaces generate static views with the credentials used to register the table as an Immuta data source. Those tables must be registered in Immuta by an excepted role so that policies applied to the backing tables are not applied to the project workspace views.
Using views and tables within Immuta: Because this integration uses Snowflake governance policies, users can register tables and views as Immuta data sources. However, if you want to register views and apply different policies to them than their backing tables, the owner of the view must be an excepted role; otherwise, the backing table’s policies will be applied to that view.
Private preview
This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Bulk data source creation is the more efficient process when loading more than 5000 data sources from Snowflake and allows for data sources to be registered in Immuta before running sensitive data discovery or applying policies.
To use this feature, see the Bulk create Snowflake data sources guide.
Based on performance tests that create 100,000 data sources, the following minimum resource allocations need to be applied to the appropriate pods in your Kubernetes environment for successful bulk data source creation.
Memory: 4Gi / 16Gi / 8Gi
CPU: 2 / 4 / 2
Storage: 8Gi / 24Gi / 16Gi
Performance gains are limited when enabling sensitive data discovery at the time of data source creation.
External catalog integrations are not recognized during bulk data source creation. Users must manually trigger a catalog sync for tags to appear on the data source through the data source's health check.
Excepted roles and users are assigned when the integration is installed, and no policies will apply to these users' queries, despite any Immuta policies enforced on the tables they are querying. Credentials used to register a data source in Immuta will be automatically added to this excepted list for that Snowflake table. Consequently, roles and users added to this list and used to register data sources in Immuta should be limited to service accounts.
Immuta excludes the listed roles and users from policies by wrapping all policies in a CASE statement that checks whether a user is acting under one of the listed usernames or roles. If the user is, the policy is not applied to the queried table; if the user is not, the policy is executed as normal. Immuta does not distinguish between role and username, so if you have a role and a user with the exact same name, both the user and any user acting under that role will have full access to the data sources, and no policies will be enforced for them.
An Immuta application administrator configures the Snowflake integration and registers Snowflake warehouse and databases with Immuta.
Immuta creates a database inside the configured Snowflake warehouse that contains Immuta policy definitions and user entitlements.
A data owner registers Snowflake tables in Immuta as data sources.
If Snowflake tag ingestion was enabled during the configuration, Immuta uses the host provided in the configuration and ingests internal tags on Snowflake tables registered as Immuta data sources.
A data owner, data governor, or administrator creates or changes a policy or a user's attributes change in Immuta.
The Immuta web service calls a stored procedure that modifies the user entitlements or policies.
Immuta manages and applies Snowflake governance column and row access policies to Snowflake tables that are registered as Immuta data sources.
If Snowflake table grants is not enabled, a Snowflake object owner or a user with the global MANAGE GRANTS privilege grants the SELECT privilege on relevant Snowflake tables to users. Note: Although they are granted access, if they are not subscribed to the table via Immuta-authored policies, they will not see data.
A Snowflake user who is subscribed to the data source in Immuta queries the corresponding table directly in Snowflake and sees policy-enforced data.
The Snowflake integration supports the following authentication methods to install the integration and create data sources:
Username and password: Users can authenticate with their Snowflake username and password.
Key pair: Users can authenticate with a Snowflake key pair authentication.
Snowflake External OAuth: Users can authenticate with Snowflake External OAuth when using Snowflake with governance features.
Immuta's OAuth authentication method uses the Client Credentials Flow to integrate with Snowflake External OAuth. When a user configures the Snowflake integration or connects a Snowflake data source, Immuta uses the token credentials (obtained using a certificate or passing a client secret) to craft an authenticated access token to connect with Snowflake. This allows organizations that already use Snowflake External OAuth to use that secure authentication with Immuta.
An Immuta application administrator configures the Snowflake integration or creates a data source.
Immuta creates a custom token and sends it to the authorization server.
The authorization server confirms the information sent from Immuta and issues an access token to Immuta.
Immuta sends the access token it received from the authorization server to Snowflake.
Snowflake authenticates the token and grants access to the requested resources from Immuta.
The integration is connected and users can query data.
The Immuta Snowflake integration supports Snowflake external tables. However, you cannot add a masking policy to an external table column while creating the external table in Snowflake because masking policies cannot be attached to virtual columns.
The Snowflake integration with Snowflake governance features supports the Immuta features outlined below. Click the links provided for more details.
Immuta project workspaces: Users can have additional write access in their integration using project workspaces.
Tag ingestion: Immuta automatically ingests Snowflake object tags from your Snowflake instance and adds them to the appropriate data sources.
User impersonation: Native impersonation allows users to natively query data as another Immuta user. To enable native user impersonation, see the Integration user impersonation page.
Native query audit: Immuta audits queries run natively in Snowflake against Snowflake data registered as Immuta data sources.
Snowflake low row access policy mode: The Snowflake low row access policy mode improves query performance in Immuta's Snowflake integration by decreasing the number of Snowflake row access policies Immuta creates.
Snowflake table grants: This feature allows Immuta to manage privileges on your Snowflake tables and views according to the subscription policies on the corresponding Immuta data sources.
Immuta system account required Snowflake privileges
CREATE [OR REPLACE] PROCEDURE
DROP ROLE
REVOKE ROLE
Users can have additional write access in their integration using project workspaces. For more details, see the Snowflake project workspaces page.
To use project workspaces with the Snowflake integration with governance features, the default role of the account used to create data sources in the project must be added to the "Excepted Roles/Users List." If the role is not added, you will not be able to query the equalized view using the project role in Snowflake.
Immuta system account required Snowflake privileges
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake
GRANT APPLY TAG ON ACCOUNT
When configuring a Snowflake integration, you can enable Snowflake tag ingestion as well. With this feature enabled, Immuta will automatically ingest Snowflake object tags from your Snowflake instance into Immuta and add them to the appropriate data sources.
The Snowflake tags' key and value pairs will be reflected in Immuta as two levels: the key will be the top level and the value the second. As Snowflake tags are hierarchical, Snowflake tags applied to a database will also be applied to all of the schemas in that database, all of the tables within those schemas, and all of the columns within those tables. For example: If a database is tagged PII
, all of the tables and columns in that database will also be tagged PII
.
To enable Snowflake tag ingestion, follow one of the tutorials below:
Manually enable Snowflake tag ingestion: This tutorial is intended for users who want Snowflake tags to be ingested into Immuta but do not want users to query data sources natively in Snowflake.
Automatically enable Snowflake tag ingestion: This tutorial illustrates how to enable Snowflake tag ingestion when configuring a Snowflake integration.
Snowflake has some natural data latency. If you manually refresh the governance page to see all tags created globally, users can experience a delay of up to two hours. However, if you run schema detection or a health check to find where those tags are applied, the delay will not occur because Immuta will only refresh tags for those specific tables.
Immuta system account required Snowflake privileges
IMPORTED PRIVILEGES ON DATABASE snowflake
Once this feature has been enabled with the Snowflake integration, Immuta will query Snowflake to retrieve user query histories. These histories provide audit records for queries against Snowflake data sources that are queried natively in Snowflake.
This process will happen automatically every hour by default but can be configured to a different frequency when configuring or editing the integration. Additionally, audit ingestion can be manually requested at any time from the Immuta audit page. When manually requested, it will only search for new queries that were created since the last native query that had been audited. The job is run in the background, so the new queries will not be immediately available.
For details about prompting these logs and the contents of these audit logs, see the Snowflake query audit logs page.
A user can configure multiple integrations of Snowflake to a single Immuta instance and use them dynamically or with workspaces.
There can only be one integration connection with Immuta per host.
The host of the data source must match the host of the integration for the view to be created.
Projects can only be configured to use one Snowflake host.
If there are errors in generating or applying policies natively in Snowflake, the data source will be locked and only users on the excepted roles/users list and the credentials used to create the data source will be able to access the data.
Once a Snowflake integration is disabled in Immuta, the user must remove the access that was granted in Snowflake. If that access is not revoked, users will be able to access the raw table in Snowflake.
Migration must be done using the credentials and credential method (automatic or bootstrap) used to install the integration.
When configuring one Snowflake instance with multiple Immuta instances, the user or system account that enables the integration on the app settings page must be unique for each Immuta instance.
A Snowflake table can only have one set of policies enforced at a given time, so creating multiple data sources pointing to the same table is not supported. If this is a use case you need to support, create views in Snowflake and expose those instead.
You cannot add a masking policy to an external table column while creating the external table because a masking policy cannot be attached to a virtual column.
If you create an Immuta data source from a Snowflake view created using a select * from query, Immuta column detection will not work as expected because Snowflake views are not automatically updated based on backing table changes. To remedy this, you can create views that have the specific columns you want, or you can CREATE OR REPLACE the view in Snowflake whenever the backing table is updated and manually run the column detection job on the data source page, as sketched below.
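As an illustrative sketch with hypothetical database, schema, and object names, recreating the view with explicit columns from the SnowSQL CLI might look like:

    snowsql -q "CREATE OR REPLACE VIEW analytics.public.customer_view AS SELECT id, name, created_at FROM analytics.public.customers;"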
If a user is created in Snowflake after that user is already registered in Immuta, Immuta does not grant usage on the per-user role automatically - meaning Immuta does not govern this user's access without manual intervention. If a Snowflake user is created after that user is registered in Immuta, the user account must be disabled and re-enabled to trigger a sync of Immuta policies to govern that user. Whenever possible, Snowflake users should be created before registering those users in Immuta.
Snowflake tables from imported databases are not supported. Instead, create a view of the table and register that view as a data source.
The Immuta Snowflake integration uses Snowflake governance features to let users query data natively in Snowflake. This means that Immuta also inherits some Snowflake limitations using correlated subqueries with row access policies and column-level security. These limitations appear when writing custom WHERE policies, but do not remove the utility of row-level policies.
All column names must be fully qualified:
Any column names that are unqualified (i.e., just the column name) will default to a column of the data source the policy is being applied to (if one matches the name).
The Immuta system account must have SELECT
privileges on all tables/views referenced in a subquery:
The Immuta system role name is specified by the user, and the role is created when the Snowflake instance is integrated.
Any subqueries that error in Snowflake will also error in Immuta.
Including one or more subqueries in the Immuta policy condition may cause errors in Snowflake. If an error occurs, it may happen during policy creation or at query-time. To avoid these errors, limit the number of subqueries, limit the number of JOIN operations, and simplify WHERE clause conditions.
For more information, see the Snowflake documentation on subquery limitations.
The Helm Chart includes components that make up your Immuta infrastructure, and you can change these values to tailor your Immuta infrastructure to suit your needs. The tables below include parameter descriptions and default values for all components in the Helm Chart.
When installing Immuta, download immuta-values.yaml
and update the values to your preferred settings.
See the Helm installation page for guidance and best practices.
immutaVersion: Version of Immuta. Default: <Current Immuta Version>
imageTag: Docker image tag. Default: <Current Version Tag>
imagePullPolicy: Image pull policy. Default: IfNotPresent
imagePullSecrets: List of image pull secrets to use. Default: [immuta-registry]
existingSecret: Existing Kubernetes Secret to use for Immuta passwords. Default: nil
externalHostname: External hostname assigned to this Immuta instance. Default: nil
podSecurityContext: Pod level security features on all pods. Default: {}
containerSecurityContext: Container level security features on all containers. Default: {}
global.imageRegistry: Global override for image registry. Default: registry.immuta.com
global.podAnnotations: Annotations to be set on all pods. Default: {}
global.podLabels: Labels that will be set on all pods. Default: {}
backup.enabled: Whether or not to turn on automatic backups. Default: true
backup.restore.enabled: Whether or not to restore from backups if present. Default: false
backup.type: Backup storage type. Must be defined if backup.enabled is true. Must be one of: s3, gs, or azblob. Default: nil
backup.cronJob.nodeSelector: Node selector for the backup CronJob. Default: {"kubernetes.io/os": "linux"}
backup.cronJob.resources: Resources for the backup CronJob. Default: {}
backup.cronJob.tolerations: Tolerations for the backup CronJob. Default: nil
backup.extraEnv: Mapping of key-value pairs to be set on backup Job containers. Default: {}
backup.failedJobsHistoryLimit: Number of failed backup Jobs to retain. Default: 1
backup.keepBackupVolumes: Whether or not to delete backup volumes when uninstalling Immuta. Default: false
backup.maxBackupCount: Max number of backups to exist at a given time. Default: 10
backup.podAnnotations: Annotations to add to all pods associated with backups. Default: nil
backup.podLabels: Labels to add to all pods associated with backups. Default: nil
backup.restore.databaseFile: Name of the file in the database backup folder to restore from. Default: nil
backup.restore.queryEngineFile: Name of the file in the query-engine backup folder to restore from. Default: nil
backup.schedule: Kubernetes CronJob schedule expression. Default: 0 0 * * *
backup.securityContext: SecurityContext for backup Pods. Default: {}
backup.serviceAccountAnnotations: Annotations to add to all ServiceAccounts associated with backups. Default: nil
backup.successfulJobsHistoryLimit: Number of successful backup Jobs to retain before cleanup. Default: 3
backup.podSecurityContext: Pod level security features. Default: {}
backup.containerSecurityContext: Container level security features. Default: {}
These values are used when backup.type=s3.
backup.s3.awsAccessKeyId: AWS Access Key ID. Default: nil
backup.s3.awsSecretAccessKey: AWS Secret Access Key. Default: nil
backup.s3.awsRegion: AWS Region. Default: nil
backup.s3.bucket: S3 bucket to store backups in. Default: nil
backup.s3.bucketPrefix: Prefix to append to all backups. Default: nil
backup.s3.endpoint: Endpoint URL of an S3-compatible server. Default: nil
backup.s3.caBundle: CA bundle in PEM format. Used to verify TLS certificates of a custom S3 endpoint. Default: nil
backup.s3.forcePathStyle: Set to "true" to force the use of path-style addressing. Default: nil
backup.s3.disableSSL: Set to "true" to disable SSL connections for the S3 endpoint. Default: nil
These values are used when backup.type=azblob.
backup.azblob.azStorageAccount: Azure Storage Account Name. Default: nil
backup.azblob.azStorageKey: Azure Storage Account Key. Default: nil
backup.azblob.azStorageSASToken: Azure Storage Account SAS Token. Default: nil
backup.azblob.container: Azure Storage Account Container Name. Default: nil
backup.azblob.containerPrefix: Prefix to append to all backups. Default: nil
These values are used when backup.type=gs.
backup.gs.gsKeySecretName: Kubernetes Secret containing key.json for a Google Service Account. Default: nil
backup.gs.bucket: Google Cloud Storage bucket. Default: nil
backup.gs.bucketPrefix: Prefix to append to all backups. Default: nil
tls.enabled: Whether or not to use TLS. Default: true
tls.create: Whether or not to generate TLS certificates. Default: true
tls.manageGeneratedSecret: When true, the generated TLS secret will be created as a resource of the Helm Chart. Default: false
tls.secretName: Secret name to use for internal and external communication (for self-provided certs only). Default: nil
tls.enabledInternal: Whether or not to use TLS for all internal communication. Default: true
tls.internalSecretName: Secret name to use for internal communication (for self-provided certs only). Default: nil
tls.enabledExternal: Whether or not to use TLS for all external communication. Default: true
tls.externalSecretName: Secret name to use for external communication (for self-provided certs only). Default: nil
tls.manageGeneratedSecret may cause issues with helm install.
In most cases, tls.manageGeneratedSecret
should only be set to true when Helm is not being used to install the release (i.e., Argo CD).
If tls.manageGeneratedSecret
is set to true when used with the default TLS generation hook configuration, you will encounter an error similar to the following.
Error: secrets "immuta-tls" already exists
You can work around this error by configuring the TLS generation hook to run as a post-install
hook.
However, this configuration is not compatible with helm install --wait. If the --wait flag is used, the command will time out and fail.
web.extraEnv: Mapping of key-value pairs to be set on web containers. Default: {}
web.extraVolumeMounts: List of extra volume mounts to be added to web containers. Default: []
web.extraVolumes: List of extra volumes to be added to web containers. Default: []
web.image.registry: Image registry for the Immuta service image. Default: value from global.imageRegistry
web.image.repository: Image repository for the Immuta service image. Default: immuta/immuta-service
web.image.tag: Image tag for the Immuta service image. Default: value from imageTag or immutaVersion
web.image.digest: Image digest for the Immuta service image, in the format sha256:<DIGEST>.
web.imagePullPolicy: ImagePullPolicy for the Immuta service container. Default: {{ .Values.imageTag }}
web.imageRepository (deprecated): Use web.image.registry and web.image.repository. Default: nil
web.imageTag (deprecated): Use web.image.tag. Default: nil
web.replicas: Number of replicas of the web service to deploy. Maximum: 3. Default: 1
web.workerCount: Number of web service worker processes to deploy. Default: 2
web.threadPoolSize: Number of threads to use for each NodeJS process. Default: nil
web.ingress.enabled: Controls the creation of an Ingress resource for the web service. Default: true
web.ingress.clientMaxBodySize: client_max_body_size passed through to nginx. Default: 1g
web.resources: Resources for web pods. Default: {}
web.podAnnotations: Additional annotations to apply to web pods. Default: {}
web.podLabels: Additional labels to apply to web pods. Default: {}
web.nodeSelector: Node selector for web pods. Default: {"kubernetes.io/os": "linux"}
web.serviceAccountAnnotations: Annotations for the web ServiceAccount. Default: {}
web.tolerations: Tolerations for web pods. Default: nil
web.podSecurityContext: Pod level security features. Default: {}
web.containerSecurityContext: Container level security features. Default: {}
fingerprint.image.registry
Image registry for the Immuta fingerprint image.
Value from global.imageRegistry
fingerprint.image.repository
Image repository for the Immuta fingerprint image.
immuta/immuta-fingerprint
fingerprint.image.tag
Image tag for the Immuta fingerprint image.
Value from imageTag
or immutaVersion
fingerprint.image.digest
Image digest for the Immuta fingerprint image in format of sha256:<DIGEST>
.
fingerprint.imagePullPolicy
ImagePullPolicy for the Immuta fingerprint container.
{{ .Values.imageTag }}
fingerprint.imageRepository
deprecated
Use fingerprint.image.registry
and fingerprint.image.repository
.
nil
fingerprint.imageTag
deprecated
Use fingerprint.image.tag
.
nil
fingerprint.replicas
Number of replicas of fingerprint service to deploy.
1
fingerprint.logLevel
Log level for the Fingerprint service.
WARNING
fingerprint.extraConfig
Object containing configuration options for the Immuta Fingerprint service.
{}
fingerprint.resources
{}
fingerprint.podAnnotations
Additional annotations to apply to fingerprint Pods.
{}
fingerprint.podLabels
Additional labels to apply to fingerprint Pods.
{}
fingerprint.nodeSelector
Node selector for fingerprint Pods.
{"kubernetes.io/os": "linux"}
fingerprint.serviceAccountAnnotations
Annotations for the fingerprint ServiceAccount.
{}
fingerprint.tolerations
Tolerations for fingerprint Pods.
nil
<component>.podSecurityContext
Pod level security features.
<component>.containerSecurityContext
Container level security features.
The Metadata Database component can be configured to use either the built-in Kubernetes deployment or an external PostgreSQL database.
The following Helm values are shared between both built-in and external databases.
database.enabled
Enabled flag. Used to disable the built-in database when an external database is used.
true
database.image.registry
Image registry for the Immuta database image.
Value from global.imageRegistry
database.image.repository
Image repository for the Immuta database image.
immuta/immuta-db
database.image.tag
Image tag for the Immuta database image.
Value from imageTag
or immutaVersion
database.image.digest
Image digest for the Immuta database image in format of sha256:<DIGEST>
.
database.imagePullPolicy
ImagePullPolicy for the Immuta database container.
{{ .Values.imageTag }}
database.imageRepository
deprecated
Use database.image.registry
and database.image.repository
.
nil
database.imageTag
deprecated
Use database.image.tag
.
nil
These values are used when database.enabled=true.
database.extraEnv
Mapping of key-value pairs to be set on database containers.
{}
database.extraVolumeMounts
List of extra volume mounts to be added to database containers.
[]
database.extraVolumes
List of extra volumes to be added to database containers.
[]
database.nodeSelector
Node selector for database pods.
{"kubernetes.io/os": "linux"}
database.password
Password for immuta metadata database
secret
database.patroniApiPassword
Password for Patroni REST API.
secret
database.patroniKubernetes
Patroni Kubernetes settings.
{"use_endpoints": true}
database.persistence.enabled
Set this to true to enable data persistence on all database pods. It should be set to true
for all non-testing environments.
false
database.podAnnotations
Additional annotations to apply to database pods.
{}
database.podLabels
Additional labels to apply to database pods.
{}
database.replicas
Number of database replicas.
1
database.replicationPassword
Password for replication user.
secret
database.resources
{}
database.sharedMemoryVolume.enabled
Enable the use of a memory-backed emptyDir
volume for /dev/shm
.
false
database.sharedMemoryVolume.sizeLimit
Size limit for the shared memory volume. Only available when the SizeMemoryBackedVolumes
feature gate is enabled.
nil
database.superuserPassword
Password for PostgreSQL superuser.
secret
database.tolerations
Tolerations for database pods.
nil
database.podSecurityContext
Pod level security features.
{}
database.containerSecurityContext
Container level security features.
{}
These values are used when database.enabled=false.
externalDatabase.host
required
Hostname of the external database instance.
nil
externalDatabase.port
Port for the external database instance.
5432
externalDatabase.sslmode
PostgreSQL sslmode
option for the external database connection. Behavior when unset is require
.
nil
externalDatabase.dbname
Immuta database name.
bometadata
externalDatabase.username
Immuta database username.
bometa
externalDatabase.password
required
Immuta database user password.
nil
externalDatabase.superuser.username
required
Username for the superuser used to initialize the database instance.
true
externalDatabase.superuser.password
required
Password for the superuser used to initialize the database instance.
true
externalDatabase.backup.enabled
(Deprecated) Enable flag for external database backups. Refer to backup.enabled=true
.
true
externalDatabase.restore.enabled
(Deprecated) Enable flag for the external database restore. Refer to backup.restore.enabled=true
.
true
queryEngine.extraEnv
Mapping of key-value pairs to be set on Query Engine containers.
{}
queryEngine.extraVolumeMounts
List of extra volume mounts to be added to Query Engine containers.
[]
queryEngine.extraVolumes
List of extra volumes to be added to Query Engine containers.
[]
queryEngine.image.registry
Image registry for the Immuta Query Engine image.
Value from global.imageRegistry
queryEngine.image.repository
Image repository for the Immuta Query Engine image.
immuta/immuta-db
queryEngine.image.tag
Image tag for the Immuta Query Engine image.
Value from imageTag
or immutaVersion
queryEngine.image.digest
Image digest for the Immuta Query Engine image in format of sha256:<DIGEST>
.
queryEngine.imagePullPolicy
ImagePullPolicy for the Immuta Query Engine container.
{{ .Values.imageTag }}
queryEngine.imageRepository
deprecated
Use queryEngine.image.registry
and queryEngine.image.repository
.
nil
queryEngine.imageTag
deprecated
Use queryEngine.image.tag
.
nil
queryEngine.replicas
Number of database replicas
1
queryEngine.password
Password for immuta feature store database
secret
queryEngine.superuserPassword
Password for PostgreSQL superuser.
secret
queryEngine.replicationPassword
Password for replication user.
secret
queryEngine.patroniApiPassword
Password for Patroni REST API.
secret
queryEngine.patroniKubernetes
Patroni Kubernetes settings.
{"use_endpoints": true}
queryEngine.persistence.enabled
This should be set to true
for all non-testing environments.
false
queryEngine.resources
{}
queryEngine.service
Service configuration for Query Engine service if not using an Ingress Controller.
queryEngine.podAnnotations
Additional annotations to apply to Query Engine pods.
{}
queryEngine.podLabels
Additional labels to apply to Query Engine pods.
{}
queryEngine.nodeSelector
Node selector for Query Engine pods.
{"kubernetes.io/os": "linux"}
queryEngine.sharedMemoryVolume.enabled
Enable the use of a memory-backed emptyDir
volume for /dev/shm
.
false
queryEngine.sharedMemoryVolume.sizeLimit
Size limit for the shared memory volume. Only available when the SizeMemoryBackedVolumes
feature gate is enabled.
nil
queryEngine.tolerations
Tolerations for Query Engine pods.
nil
queryEngine.podSecurityContext
Pod level security features.
{}
queryEngine.containerSecurityContext
Container level security features.
{}
If you will only use integrations, port 5432 is optional. When using the built-in Ingress Nginx Controller, you can disable it by setting the following value to false.
queryEngine.publishPort
Controls whether or not the Query Engine port (5432) is published on the built-in Ingress Controller service.
true
The Cleanup hook is a Helm post-delete hook that is responsible for cleaning up some resources that are not deleted by Helm.
hooks.cleanup.resources
{}
hooks.cleanup.serviceAccountAnnotations
Annotations for the cleanup hook ServiceAccount.
{}
hooks.cleanup.nodeSelector
Node selector for pods.
{"kubernetes.io/os": "linux"}
hooks.cleanup.tolerations
Tolerations for pods.
nil
hooks.cleanup.podSecurityContext
Pod level security features.
hooks.cleanup.containerSecurityContext
Container level security features.
The database initialize hook is used to initialize the external database when database.enabled=false.
hooks.databaseInitialize.resources
{}
hooks.databaseInitialize.serviceAccountAnnotations
Annotations for the database initialize hook ServiceAccount.
{}
hooks.databaseInitialize.verbose
Flag to enable or disable verbose logging in the database initialize hook.
true
hooks.databaseInitialize.nodeSelector
Node selector for pods.
{"kubernetes.io/os": "linux"}
hooks.databaseInitialize.tolerations
Tolerations for pods.
nil
hooks.databaseInitialize.podSecurityContext
Pod level security features.
hooks.databaseInitialize.containerSecurityContext
Container level security features.
The TLS generation hook is a Helm pre-install hook that is responsible for generating TLS certificates used for connections between the Immuta pods.
hooks.tlsGeneration.hookAnnotations."helm.sh/hook-delete-policy"
Delete policy for the TLS generation hook.
"before-hook-creation,hook-succeeded"
hooks.tlsGeneration.resources
{}
hooks.tlsGeneration.serviceAccountAnnotations
Annotations for the cleanup hook ServiceAccount.
{}
hooks.tlsGeneration.nodeSelector
Node selector for pods.
{"kubernetes.io/os": "linux"}
hooks.tlsGeneration.tolerations
Tolerations for pods.
nil
hooks.tlsGeneration.podSecurityContext
Pod level security features.
hooks.tlsGeneration.containerSecurityContext
Container level security features.
cache.type
Type to use for the cache. Valid values are memcached.
memcached
cache.replicas
Number of replicas.
1
cache.resources
{}
cache.nodeSelector
Node selector for pods.
{"kubernetes.io/os": "linux"}
cache.podSecurityContext
SecurityContext for cache Pods.
{"runAsUser": 65532}
cache.containerSecurityContext
Container level security features.
{}
cache.updateStrategy
UpdateStrategy Spec for cache workloads.
{}
cache.tolerations
Tolerations for pods.
nil
cache.memcached.image.registry
Image registry for Memcached image.
Value from global.imageRegistry
cache.memcached.image.repository
Image repository for Memcached image.
memcached
cache.memcached.image.tag
Image tag for Memcached image.
1.6-alpine
cache.memcached.image.digest
Image digest for the Immuta Memcached image in format of sha256:<DIGEST>
.
cache.memcached.imagePullPolicy
Image pull policy.
Value from imagePullPolicy
cache.memcached.maxItemMemory
Limit for max item memory in cache (in MB).
64
deployTools.image.registry
Image registry for Immuta deploy tools image.
Value from global.imageRegistry
deployTools.image.repository
Image repository for Immuta deploy tools image.
immuta/immuta-deploy-tools
deployTools.image.tag
Image tag for Immuta deploy tools image.
2.4.3
deployTools.image.digest
Image digest for the Immuta deploy tools image in format of sha256:<DIGEST>
.
deployTools.imagePullPolicy
Image pull policy.
Value from imagePullPolicy
nginxIngress.enabled
Enable nginx ingress deployment
true
nginxIngress.podSecurityContext
Pod level security features.
{}
nginxIngress.containerSecurityContext
Container level security features.
{capabilities: {drop: [ALL], add: [NET_BIND_SERVICE]}, runAsUser: 101}
nginxIngress.controller.image.registry
Image registry for the Nginx Ingress controller image.
Value from global.imageRegistry
nginxIngress.controller.image.repository
Image repository for the Nginx Ingress controller image.
ingress-nginx-controller
nginxIngress.controller.image.tag
Image tag for the Nginx Ingress controller image.
v1.1.0
nginxIngress.controller.image.digest
Image digest for the Immuta Nginx Ingress controller image in format of sha256:<DIGEST>
.
nginxIngress.controller.imagePullPolicy
ImagePullPolicy for the Nginx Ingress controller container.
{{ .Values.imageTag }}
nginxIngress.controller.imageRepository
deprecated
Use nginxIngress.controller.image.registry
and nginxIngress.controller.image.repository
.
nil
nginxIngress.controller.imageTag
deprecated
Use nginxIngress.controller.image.tag
.
nil
nginxIngress.controller.service.annotations
Used to set arbitrary annotations on the Nginx Ingress Service.
{}
nginxIngress.controller.service.type
Controller service type.
LoadBalancer
nginxIngress.controller.service.isInternal
Whether or not to use an internal ELB
false
nginxIngress.controller.service.acmCertArn
ARN for ACM certificate
nginxIngress.controller.replicas
Number of controller replicas
1
nginxIngress.controller.minReadySeconds
Minimum ready seconds
0
nginxIngress.controller.electionID
Election ID for nginx ingress controller
ingress-controller-leader
nginxIngress.controller.hostNetwork
Run nginx ingress controller on host network
false
nginxIngress.controller.config.proxy-read-timeout
Controller proxy read timeout.
300
nginxIngress.controller.config.proxy-send-timeout
Controller proxy send timeout.
300
nginxIngress.controller.podAnnotations
Additional annotations to apply to nginx ingress controller pods.
{}
nginxIngress.controller.podLabels
Additional labels to apply to nginx ingress controller pods.
{}
nginxIngress.controller.nodeSelector
Node selector for nginx ingress controller pods.
{"kubernetes.io/os": "linux"}
nginxIngress.controller.tolerations
Tolerations for nginx ingress controller pods.
nil
nginxIngress.controller.resources
{}
Deprecation Warning
The following values are deprecated. Values should be migrated to cache and cache.memcached. See Cache for replacement values.
memcached.pdbMinAvailable
Minimum pdb available.
1
memcached.maxItemMemory
Limit for max item memory in cache (in MB).
64
memcached.resources
{requests: {memory: 64Mi}}
memcached.podAnnotations
Additional annotations to apply to memcached pods.
{}
memcached.podLabels
Additional labels to apply to memcached pods.
{}
memcached.nodeSelector
Node selector for memcached pods.
{"kubernetes.io/os": "linux"}
memcached.tolerations
Tolerations for memcached pods.
nil
This section guides you through configuring your integrations. Once configuration is complete, data owners and governors can use tags to create policies.
Best practices for users, permissions, attributes, and tags
The best practices outlined below will also appear in callouts within relevant tutorials.
If sensitive data discovery has been enabled, then manually adding tags to columns in the data dictionary will be unnecessary in most cases. The data owner will need to verify that the Discovered tags are correct.
Turning on sensitive data discovery can improve your data's security with its automated tagging. Immuta highly recommends the use of this feature in tandem with vigilant verification of tags on all data sources.
Use an external IAM for authentication and to manage attributes.
Use the minimum number of tags possible to achieve the data privacy needed.
Start organizing attributes and groups in Immuta and transfer them to your IAM.
This section includes concept, reference, and how-to guides for configuring your integrations, connecting your IAM and external catalog, and enabling sensitive data discovery. Some of these guides are provided below. See the left navigation for a complete list of resources.
The Snowflake low row access policy mode improves query performance in Immuta's Snowflake integration by decreasing the number of Snowflake row access policies Immuta creates and by using table grants to manage user access.
Immuta manages access to Snowflake tables by administering Snowflake row access policies and column masking policies on those tables, allowing users to query them directly in Snowflake while policies are enforced.
Without Snowflake low row access policy mode enabled, row access policies are created and administered by Immuta in the following scenarios:
Table grants are disabled and a subscription policy that does not automatically subscribe everyone to the data source is applied. Immuta administers a Snowflake row access policy that filters out all rows, restricting access to the entire table when the user doesn't have privileges to query it. However, if table grants are disabled and a subscription policy is applied that grants everyone access to the data source automatically, Immuta does not create a row access policy in Snowflake. See the subscription policies documentation for details about these policy types.
A purpose-based policy is applied to a data source. A row access policy filters out all the rows of the table if users aren't acting under the purpose specified in the policy when they query the table.
A row-level policy is applied to a data source. A row access policy filters out the rows that querying users don't have access to.
User impersonation is enabled. A row access policy is created for every Snowflake table registered in Immuta.
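For context, a Snowflake row access policy is a schema-level object that returns a boolean for each row of the table it is attached to. The following is a generic, hand-written sketch of such a policy and is not the policy Immuta generates; the policy, table, and column names are hypothetical.
-- Generic Snowflake row access policy (illustrative only, not Immuta-generated)
CREATE ROW ACCESS POLICY region_filter AS (region VARCHAR) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ADMIN' OR region = 'US';
ALTER TABLE sales ADD ROW ACCESS POLICY region_filter ON (region);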
Deprecation notice
Support for using the Snowflake integration with low row access policy mode disabled has been deprecated. You must enable this feature for your integration to continue working in future releases; see the deprecation timeline for EOL dates.
Snowflake low row access policy mode is enabled by default to reduce the number of row access policies Immuta creates and improve query performance. Snowflake low row access policy mode requires
Snowflake table grants to be enabled.
user impersonation to be disabled. User impersonation diminishes the performance of interactive queries because of the number of row access policies Immuta creates when it's enabled.
Project workspaces are not compatible with this feature.
Impersonation is not supported when the Snowflake low row access policy mode is enabled.
When a project member acts under a project's purposes, any matching purpose exceptions on tables will be honored, even if those tables exist outside the project. Project managers cannot assume approving a purpose means that the purposes of that project are limited to the tables in the project.
Deprecation notice
Support for this integration has been deprecated.
This page details how to install the Snowflake integration for users on Snowflake Standard. If you currently use Snowflake Enterprise, see the Snowflake Enterprise installation guide.
Snowflake resource names
Use uppercase for the names of the Snowflake resources you create below.
Click the App Settings icon in the left sidebar.
Click the Integrations tab.
Click the +Add Native Integration button and select Snowflake from the dropdown menu.
Scroll down and uncheck the box for Snowflake Governance Features.
Scroll back up and complete the Host, Port, and Default Warehouse fields.
Opt to check the Enable Project Workspace box. This will allow for managed Write access within Snowflake.
Opt to check the Enable Impersonation box and customize the Impersonation Role name as needed. This will allow users to natively impersonate another user. Note you cannot edit this choice after you configure the integration.
Native query audit is enabled by default; you can disable it by clicking the Enable Native Query Audit checkbox.
Configure the audit sync schedule by scrolling to Integration Settings and finding the Snowflake Audit Sync Schedule section.
Enter how often, in hours, you want Immuta to ingest audit events from Snowflake as an integer between 1 and 24.
Continue with your integration configuration.
Opt to check the Automatically ingest Snowflake object tags box. This will enable Immuta to automatically import table and column tags from Snowflake. Note this feature requires an Enterprise Edition of Snowflake.
You have two options for installing the Snowflake and Snowflake Workspace access patterns: automatic or manual setup.
Known issue
To configure your Snowflake integration using password-only authentication in the automatic setup option, upgrade to Immuta v2024.2.7 or newer. Otherwise, Immuta will return an error.
Immuta requires temporary, one-time use of credentials with specific permissions.
When performing an automated installation, Immuta requires temporary, one-time use of credentials with the following permissions:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT
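For reference, a Snowflake administrator could grant these privileges to a temporary setup role with statements along the following lines; the role name IMMUTA_SETUP is hypothetical and not part of Immuta's setup.
-- Hypothetical temporary role used only for the one-time Immuta setup
CREATE ROLE IF NOT EXISTS IMMUTA_SETUP;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE IMMUTA_SETUP WITH GRANT OPTION;
GRANT CREATE ROLE ON ACCOUNT TO ROLE IMMUTA_SETUP WITH GRANT OPTION;
GRANT CREATE USER ON ACCOUNT TO ROLE IMMUTA_SETUP WITH GRANT OPTION;
GRANT MANAGE GRANTS ON ACCOUNT TO ROLE IMMUTA_SETUP;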
These permissions will be used to create and configure a new IMMUTA database within the specified Snowflake instance. The credentials are not stored or saved by Immuta, and Immuta doesn’t retain access to them after initial setup is complete.
You can create a new account for Immuta to use that has these permissions, or you can grant temporary use of a pre-existing account. By default, the pre-existing account with appropriate permissions is ACCOUNTADMIN. If you create a new account, it can be deleted after initial setup is complete.
Alternatively, you can create the IMMUTA database within the specified Snowflake instance manually using the Manual Setup option.
From the Select Authentication Method Dropdown, select either Username and Password or Key Pair Authentication:
Username and Password: Fill out the Username, Password, and Role fields.
Key Pair Authentication:
Complete the Username field.
When using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
Click Key Pair (Required), and upload a Snowflake key pair file.
Complete the Role field.
Best Practices: Account Creation
The account you create for Immuta should only be used for the integration and should NOT be used as the credentials when creating data sources within Immuta; doing so will cause issues.
Create a dedicated READ-ONLY account for creating and registering data sources within Immuta. This account should also not be the account used to configure the integration.
The specified role used to run the bootstrap needs to have the following privileges:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT
Warning: Different Accounts
Download and run the bootstrap script linked in the Setup section. Take note of the username and password used in the script.
Use the Dropdown Menu to select your Authentication Method:
Username and Password: Enter the Username and Password that were set in the bootstrap script for the Immuta System Account Credentials.
Key Pair Authentication: Upload the Key Pair file and when using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
If you enabled a Snowflake workspace, select Warehouses from the dropdown menu that will be available to project owners when creating native Snowflake workspaces. Select from a list of all the warehouses available to the privileged account entered above. Note that any warehouse accessible by the PUBLIC role does not need to be explicitly added.
Click Test Snowflake Connection.
Once the credentials are successfully tested, click Save.
Now that Snowflake has been enabled, all future Snowflake data sources will also be created natively within the immuta
database of the linked Snowflake instance. In addition to creating views, Immuta will also periodically sync user metadata to a system table within the Snowflake instance.
Snowflake table grants simplifies the management of privileges in Snowflake when using Immuta. Instead of having to manually grant users access to tables registered in Immuta, you allow Immuta to manage privileges on your Snowflake tables and views according to subscription policies. Then, users subscribed to a data source in Immuta can view and query the Snowflake table, while users who are not subscribed to the data source cannot view or query the Snowflake table.
Enabling Snowflake table grants gives the following privileges to the Immuta Snowflake role:
MANAGE GRANTS ON ACCOUNT
allows the Immuta Snowflake role to grant and revoke SELECT
privileges on Snowflake tables and views that have been added as data sources in Immuta.
CREATE ROLE ON ACCOUNT
allows for the creation of a Snowflake role for each user in Immuta, enabling fine-grained, attribute-based access controls to determine which tables are available to which individuals.
Since table privileges are granted to roles and not to users in Snowflake, Immuta's Snowflake table grants feature creates a new Snowflake role for each Immuta user. This design allows Immuta to manage table grants through fine-grained access controls that consider the individual attributes of users.
Each Snowflake user with an Immuta account will be granted a role that Immuta manages. The naming convention for this role is <IMMUTA>_USER_<username>
, where
<IMMUTA>
is the prefix you specified when enabling the feature on the Immuta app settings page.
<username>
is the user's Immuta username.
Users are granted access to each Snowflake table or view automatically when they are subscribed to the corresponding data source in Immuta.
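As an illustration only, and not Immuta's exact implementation, the kind of grant and revoke involved might look like the following, assuming the default IMMUTA role prefix, a hypothetical user alice, and a hypothetical table.
-- Illustrative grant when alice is subscribed to the data source
GRANT SELECT ON TABLE ANALYTICS.PUBLIC.ORDERS TO ROLE IMMUTA_USER_ALICE;
-- Illustrative revoke when alice is unsubscribed
REVOKE SELECT ON TABLE ANALYTICS.PUBLIC.ORDERS FROM ROLE IMMUTA_USER_ALICE;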
Users have two options for querying Snowflake tables that are managed by Immuta:
Use the Snowflake role that Immuta creates and manages. (For example, USE ROLE IMMUTA_USER_<username>. See the previous section for details about the role and name conventions.) If the current active primary role is used to query tables, USAGE on a Snowflake warehouse must be granted to the Immuta-managed Snowflake role for each user.
Use Snowflake secondary roles, which allow users to use the privileges from all roles that they have been granted, including IMMUTA_USER_<username>, in addition to the current active primary role. Users may also set a value for DEFAULT_SECONDARY_ROLES as an object property on a Snowflake user. To learn more about primary and secondary roles in Snowflake, see the Snowflake documentation.
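For example, a user could activate the Immuta-managed role directly or rely on secondary roles; the user and role names below are illustrative.
-- Option 1: use the Immuta-managed role as the primary role
USE ROLE IMMUTA_USER_ALICE;
-- Option 2: keep the current primary role and enable all granted roles as secondary roles
USE SECONDARY ROLES ALL;
-- Optionally, an administrator can make secondary roles the default for a user
ALTER USER ALICE SET DEFAULT_SECONDARY_ROLES = ('ALL');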
If an Immuta instance is connected to an external IAM and that external IAM has a username identical to another username in Immuta's built-in IAM, those users will have the same Snowflake role, leading both to see the same data.
This page details how to install the Snowflake integration for users on Snowflake Enterprise. If you currently use Snowflake Standard, see the Snowflake Standard installation guide.
Snowflake resource names
Use uppercase for the names of the Snowflake resources you create below.
Click the Integrations tab on the app settings page.
Click the +Add Native Integration button and select Snowflake from the dropdown menu.
Complete the Host, Port, and Default Warehouse fields.
Opt to check the Enable Project Workspace box. This will allow for managed write access within Snowflake. Note: Project workspaces still use Snowflake views, so the default role of the account used to create the data sources in the project must be added to the Excepted Roles List. This option is unavailable when Snowflake table grants is enabled.
Opt to check the Enable Impersonation box and customize the Impersonation Role to allow users to natively impersonate another user. You cannot edit this choice after you configure the integration.
Native query audit is enabled by default; you can disable it by clicking the Enable Native Query Audit checkbox.
Configure the audit sync schedule by scrolling to Integration Settings and finding the Snowflake Audit Sync Schedule section.
Enter how often, in hours, you want Immuta to ingest audit events from Snowflake as an integer between 1 and 24.
Continue with your integration configuration.
Opt to check the Automatically ingest Snowflake object tags box to allow Immuta to automatically import table and column tags from Snowflake.
Setting QUOTED_IDENTIFIERS_IGNORE_CASE in Snowflake at the account level may cause unexpected behavior of the Snowflake integration in Immuta.
The QUOTED_IDENTIFIERS_IGNORE_CASE parameter must be set to false (the default setting in Snowflake) at the account level. Changing this value to true causes unexpected behavior of the Snowflake integration.
You have two options for configuring your Snowflake environment:
Known issue
To configure your Snowflake integration using password-only authentication in the automatic setup option, upgrade to Immuta v2024.2.7 or newer. Otherwise, Immuta will return an error.
Immuta requires temporary, one-time use of credentials with specific permissions.
When performing an automated installation, Immuta requires temporary, one-time use of credentials with the following permissions:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
APPLY MASKING POLICY ON ACCOUNT WITH GRANT OPTION
APPLY ROW ACCESS POLICY ON ACCOUNT WITH GRANT OPTION
These permissions will be used to create and configure a new IMMUTA database within the specified Snowflake instance. The credentials are not stored or saved by Immuta, and Immuta doesn’t retain access to them after initial setup is complete.
You can create a new account for Immuta to use that has these permissions, or you can grant temporary use of a pre-existing account. By default, the pre-existing account with appropriate permissions is ACCOUNTADMIN. If you create a new account, it can be deleted after initial setup is complete.
From the Select Authentication Method Dropdown, select one of the following authentication methods:
Username and Password: Complete the Username, Password, and Role fields.
Key Pair Authentication:
Complete the Username field.
When using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
Click Key Pair (Required), and upload a Snowflake key pair file.
Complete the Role field.
Best practices: account creation
The account you create for Immuta should only be used for the integration and should not be used as the credentials for creating data sources in Immuta; doing so will cause issues. Instead, create a separate, dedicated READ-ONLY account for creating and registering data sources within Immuta.
The specified role used to run the bootstrap needs to have the following privileges:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
APPLY MASKING POLICY ON ACCOUNT WITH GRANT OPTION
APPLY ROW ACCESS POLICY ON ACCOUNT WITH GRANT OPTION
The bootstrap script will create a user called IMMUTA_SYSTEM_ACCOUNT
, and grant the following privileges to that user:
APPLY MASKING POLICY ON ACCOUNT
APPLY ROW ACCESS POLICY ON ACCOUNT
Additional grants associated with the IMMUTA database
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake
GRANT APPLY TAG ON ACCOUNT
Select Manual.
Use the Dropdown Menu to select your Authentication Method:
Username and password: Enter the Username and Password and set them in the bootstrap script for the Immuta system account credentials.
Key pair authentication: Upload the Key Pair file and when using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
Snowflake External OAuth:
Fill out the Token Endpoint. This is where the generated token is sent and is also known as aud
(Audience) and iss
(Issuer).
Fill out the Client ID. This is the subject of the generated token and is also known as sub
(Subject).
Select the method Immuta will use to obtain an access token:
Certificate:
Keep the Use Certificate checkbox enabled.
Opt to fill out the Resource field with a URI of the resource where the requested token will be used.
Enter the x509 Certificate Thumbprint. This identifies the corresponding key to the token and is often abbreviated as x5t
or is called sub
(Subject).
Upload the PEM Certificate, which is the client certificate that is used to sign the authorization request.
Client secret:
Uncheck the Use Certificate checkbox.
Enter the Client Secret (string). Immuta uses this secret to authenticate with the authorization server when it requests a token.
Download, fill out the appropriate fields, and run the bootstrap script linked in the Setup section.
Warning: different accounts
If you enabled a Snowflake workspace, select Warehouses from the dropdown menu that will be available to project owners when creating native Snowflake workspaces. Select from a list of all the warehouses available to the privileged account entered above. Note that any warehouse accessible by the PUBLIC role does not need to be explicitly added.
Enter the Excepted Roles/User List. Each role or username (both case-sensitive) in this list should be separated by a comma.
Excepted roles/users will have no policies applied to queries.
Any user with the username or acting under the role in this list will have no policies applied to them when querying Immuta protected Snowflake tables in Snowflake. Therefore, this list should be used for service or system accounts and the default role of the account used to create the data sources in the Immuta projects (if you have Snowflake workspace enabled).
Click Test Snowflake Connection.
Once the credentials are successfully tested, click Save and Confirm your changes.
Project owners cannot limit masked joins to a single project. Turning masked joins on in a single project in Immuta enables masked joins across all of a subscriber's data sources, regardless of which projects the data sources belong to.
On September 30, 2024, Snowflake released a change to transition away from allowing password-only authentication. To use username and password authentication when configuring a new Snowflake integration, you must use the manual setup option, which provides a script that permits password-only authentication by differentiating it as a legacy service with an additional parameter. Existing integrations will continue to function as-is.
The account used to enable the integration must be different from the account used to create data sources in Immuta. Otherwise, views won't be generated properly.
Project workspaces are not supported when Snowflake table grants is enabled.
Automatic setup: Grant Immuta one-time use of credentials to automatically configure your Snowflake environment and the integration.
Manual setup: Run the Immuta script in your Snowflake environment yourself to configure your Snowflake environment and the integration.
Alternatively, you can create the IMMUTA database within the specified Snowflake instance manually using the manual setup option.
If you have selected to automatically ingest Snowflake object tags, additional configuration applies; note that the steps differ if you have an existing security integration. The Immuta system role will be the Immuta database provided above with _SYSTEM appended. If you used the default database name, it will be IMMUTA_SYSTEM.
Enter the Scope (string). The scope limits the operations and roles allowed in Snowflake by the access token. See the documentation for details about scopes.
Immuta is compatible with Snowflake Secure Data Sharing. Using both Immuta and Snowflake, organizations can share the policy-protected data of their Snowflake database with other Snowflake accounts with Immuta policies enforced in real time. See below for instructions on using Snowflake Data Sharing with Immuta Users using Immuta's table grants feature and Snowflake Data Sharing with non-Immuta users using Immuta's project workspaces.
Prerequisites:
Required Permission: Immuta: GOVERNANCE
Build Immuta data policies to fit your organization's compliance requirements.
Required Permission: Immuta: USER_ADMIN
To register the Snowflake data consumer in Immuta,
Update the Immuta user's Snowflake username to match the account ID for the data consumer. This value is the output on the data consumer side when SELECT CURRENT_ACCOUNT()
is run in Snowflake.
Give the Immuta user the appropriate attributes and groups for your organization's policies.
Required Permission: Snowflake: ACCOUNTADMIN
To share the policy-protected data source,
Create a Snowflake Data Share of the Snowflake table that has been registered in Immuta.
Grant reference usage on the Immuta database to the share you created:
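A sketch of the statement, with placeholder names in angle brackets:
GRANT REFERENCE_USAGE ON DATABASE <IMMUTA_DATABASE> TO SHARE <SNOWFLAKE_DATA_SHARE>;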
Replace the content in angle brackets above with the name of your Immuta database and Snowflake data share.
Prerequisites:
Use Case
As you follow this tutorial, these callouts will have examples centered around the same use case and will further explain the steps necessary to meet the following compliance requirement:
Compliance Requirement: Users can only see data from their country.
Use Case: Create Policies
The Immuta user will create a global data policy that restricts the rows users can see based on their attributes, which identify their country. In the example below, users with the attribute Country.JP
would only see rows that have JP
as a value in the CREDIT POINT OF SALE
column.
Required Permission: Immuta: GOVERNANCE
Using an attribute based access control (ABAC) model, build Immuta data policies using Immuta attributes and groups to fit your organization's compliance requirements.
Use Case: Create Project
The Immuta user will create a project for the data share. In the example below, the user creates a Japan Data Share project that will only be shared with data consumers in Japan.
Required Permission: Immuta: CREATE_PROJECT
Create an Immuta project with the data sources that you will be sharing, a Snowflake workspace, and project equalization enabled.
Use Case
Because data consumers have the attribute "Country.JP", this will be the equalized entitlement added to the project. The Immuta user editing the equalized entitlement must also have the attribute "Country.JP" to ensure they have access to the data they will share.
Required Permission: Immuta: CREATE_PROJECT
or PROJECT_MANAGEMENT
A user with the same attributes or groups as the data consumer must edit the equalized entitlements to represent the appropriate attributes and groups of the data consumer.
Required Permission: Snowflake: ACCOUNTADMIN
Create the Snowflake Data Share pointing to the project workspace using the schema and role in the Native Snowflake Access section of the project information. Repeat this step for each data source you want to share.
The commands run in Snowflake should look similar to this:
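The following is a hedged sketch only; the placeholder names in angle brackets stand for the workspace database, schema, view, and share, and for the consumer's Snowflake account shown in the project's Native Snowflake Access section.
-- Create the share and expose the project workspace views to it
CREATE SHARE <PROJECT_SHARE>;
GRANT USAGE ON DATABASE <WORKSPACE_DATABASE> TO SHARE <PROJECT_SHARE>;
GRANT USAGE ON SCHEMA <WORKSPACE_DATABASE>.<WORKSPACE_SCHEMA> TO SHARE <PROJECT_SHARE>;
GRANT SELECT ON VIEW <WORKSPACE_DATABASE>.<WORKSPACE_SCHEMA>.<VIEW_NAME> TO SHARE <PROJECT_SHARE>;
-- Make the share available to the data consumer's Snowflake account
ALTER SHARE <PROJECT_SHARE> ADD ACCOUNTS = <CONSUMER_ACCOUNT>;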
To edit or remove a Snowflake integration, you have two options:
Automatic: Grant Immuta one-time use of credentials to automatically edit or remove the integration.
The credentials provided must have the following permissions:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
Manual: Run the Immuta script in your Snowflake environment yourself to edit or remove the integration.
The specified role used to run the bootstrap needs to have the following privileges:
CREATE DATABASE ON ACCOUNT WITH GRANT OPTION
CREATE ROLE ON ACCOUNT WITH GRANT OPTION
CREATE USER ON ACCOUNT WITH GRANT OPTION
MANAGE GRANTS ON ACCOUNT WITH GRANT OPTION
APPLY MASKING POLICY ON ACCOUNT WITH GRANT OPTION
APPLY ROW ACCESS POLICY ON ACCOUNT WITH GRANT OPTION
Select one of the following options for editing your integration:
Automatic: Grant Immuta one-time use of credentials to automatically edit the integration.
Manual: Run the Immuta script in your Snowflake environment yourself to edit the integration.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Edit the field you want to change or check a checkbox of a feature you would like to enable. Note any field shadowed is not editable, and the integration must be disabled and re-installed to change it.
From the Select Authentication Method Dropdown, select either Username and Password or Key Pair Authentication:
Username and Password option: Complete the Username, Password, and Role fields.
Key Pair Authentication option:
Complete the Username field.
When using a private key, enter the private key file password in the Additional Connection String Options. Use the following format: PRIV_KEY_FILE_PWD=<your_pw>
Click Key Pair (Required), and upload a Snowflake key pair file.
Complete the Role field.
Click Save.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Edit the field you want to change or check a checkbox of a feature you would like to enable. Note any field shadowed is not editable, and the integration must be disabled and re-installed to change it.
Download the Edit Script and run it in Snowflake.
Click Save.
Select one of the following options for deleting your integration:
Automatic: Grant Immuta one-time use of credentials to automatically remove the integration and Immuta-managed resources from your Snowflake environment.
Manual: Run the Immuta script in your Snowflake environment yourself to remove Immuta-managed resources and policies from Snowflake.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Click the checkbox to disable the integration.
Enter the Username, Password, and Role that were entered when the integration was configured.
Click Validate Credentials.
Click Save.
Click the App Settings icon in the left sidebar.
Click the Integrations tab and click the down arrow next to the Snowflake integration.
Click the checkbox to disable the integration.
Download the Cleanup Script.
Click Save.
Run the cleanup script in Snowflake.
If you have Snowflake low row access policy mode enabled in private preview and have impersonation enabled, see these upgrade instructions. Otherwise, query performance will be negatively affected.
Snowflake low row access policy mode is enabled by default. However, you can disable or re-enable the feature by following the steps below.
Click the App Settings icon in the sidebar and scroll to the Global Integration Settings section.
Click the Enable Snowflake Low Row Access Policy Mode checkbox to disable the feature.
Click Save and confirm your configuration changes.
If you already have a Snowflake governance features integration configured, you don't need to reconfigure your integration. Your Snowflake policies automatically refresh when you enable or disable Snowflake low row access policy mode.
Click Save and Confirm your changes.
Click the App Settings icon in the sidebar and scroll to the Global Integration Settings section.
Click the Enable Snowflake Low Row Access Policy Mode checkbox to re-enable the feature.
Confirm to allow Immuta to automatically disable impersonation for the Snowflake integration. If you do not confirm, you will not be able to enable Snowflake low row access policy mode.
Click Save and confirm your configuration changes.
If you already have a Snowflake governance features integration configured, you don't need to reconfigure your integration. Your Snowflake policies automatically refresh when you enable or disable Snowflake low row access policy mode.
Configure your Snowflake integration with governance features enabled. Note that you will not be able to enable project workspaces or user impersonation with Snowflake low row access policy mode enabled.
Click Save and Confirm your changes.
Navigate to the App Settings page.
Scroll to the Global Integration Settings section.
Ensure the Snowflake Governance Features checkbox is checked. It is enabled by default.
Ensure the Snowflake Table Grants checkbox is checked. It is enabled by default.
Opt to change the Role Prefix. Snowflake table grants creates a new Snowflake role for each Immuta user. To ensure these Snowflake role names do not collide with existing Snowflake roles, each Snowflake role created for Snowflake table grants requires a common prefix. When using multiple Immuta accounts within a single Snowflake account, the Snowflake table grants role prefix should be unique for each Immuta account. The prefix must adhere to Snowflake identifier requirements and be less than 50 characters. Once the configuration is saved, the prefix cannot be modified; however, the Snowflake table grants feature can be disabled and re-enabled to change the prefix.
Finish configuring your integration by following one of these guidelines:
New Snowflake integration: Set up a new Snowflake integration by following the configuration tutorial.
Existing Snowflake integration (automatic setup): You will be prompted to enter connection information for a Snowflake user. Immuta will execute the migration to Snowflake table grants using a connection established with this Snowflake user. The Snowflake user you provide here must have Snowflake privileges to run these privilege grants.
Existing Snowflake integration (manual setup): Immuta will display a link to a migration script you must run in Snowflake and a link to a rollback script for use in the event of a failed migration. Important: Execute the migration script in Snowflake before clicking Save on the app settings page.
Snowflake table grants private preview migration
To migrate from the private preview version of Snowflake table grants (available before September 2022) to the generally available version of Snowflake table grants, follow the steps in the migration guide.
The steps outlined on this page are necessary if you meet both of the following criteria:
You have the Snowflake low row access policy mode enabled in private preview.
You have user impersonation enabled.
If you do not meet these criteria, follow the instructions in the configuration guide.
To upgrade to the generally available version of the feature, either
disable your Snowflake integration on the app settings page and then re-enable it, OR
disable Snowflake low row access policy mode on the app settings page and re-enable it.
Audience: System Administrators
Content Summary: This guide details the simplified installation method for enabling native access to Databricks with Immuta policies enforced.
Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.
Databricks Unity Catalog
If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
Log in to Immuta and click the App Settings icon in the left sidebar.
Scroll to the System API Key subsection under HDFS and click Generate Key.
Click Save and then Confirm.
Scroll to the Integration Settings section, and click + Add a Native Integration.
Select Databricks Integration from the dropdown menu.
Complete the Hostname field.
Enter a Unique ID for the integration. By default, your Immuta instance URL populates this field. This ID is used to tie the set of cluster policies to your instance of Immuta and allows multiple instances of Immuta to access the same Databricks workspace without cluster policy conflicts.
Select your configured Immuta IAM from the dropdown menu.
Choose one of the following options for your data access model:
Protected until made available by policy: All tables are hidden until a user is permissioned through an Immuta policy. This is how most databases work; it assumes least-privileged access and means you will have to register all tables with Immuta.
Available until protected by policy: All tables are open until explicitly registered and protected by Immuta. This approach makes sense if most of your tables are non-sensitive and you can pick and choose which tables to protect.
Select the Storage Access Type from the dropdown menu.
Opt to add any Additional Hadoop Configuration Files.
Click Add Native Integration.
Several cluster policies are available on the App Settings page when configuring this integration:
Click a link above to read more about each of these cluster policies before continuing with the tutorial.
Click Configure Cluster Policies.
Select one or more cluster policies in the matrix by clicking the Select button(s).
Opt to make changes to these cluster policies by clicking Additional Policy Changes and editing the text field.
Use one of the two Installation Types described in the tabs below to apply the policies to your cluster:
Automatically Push Cluster Policies
This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.
Select the Automatically Push Cluster Policies radio button.
Enter your Admin Token. This token must be for a user who can create cluster policies in Databricks.
Click Apply Policies.
Manually Push Cluster Policies
Enabling this option will allow you to manually push the cluster policies to the configured Databricks workspace. There will be various files to download and manually push to the configured Databricks workspace.
Select the Manually Push Cluster Policies radio button.
Click Download Init Script.
Follow the steps in the Instructions to upload the init script to DBFS section.
Click Download Policies, and then manually add these Cluster Policies in Databricks.
Opt to click the Download the Benchmarking Suite to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Click Close, and then click Save and Confirm.
Create a cluster in Databricks by following the Databricks documentation.
In the Policy dropdown, select the Cluster Policies you pushed or manually added from Immuta.
Select the Custom Access mode.
Opt to adjust Autopilot Options and Worker Type settings: The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
Opt to configure the Instances tab in the Advanced Options section:
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
Click Create Cluster.
Register Databricks securables in Immuta.
When the Immuta-enabled Databricks cluster has been successfully started, Immuta will create an immuta
database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the immuta
database. Additionally, when configuring a Databricks cluster you can hide immuta
from any calls to SHOW DATABASES
so that users aren't misled or confused by its presence. For more details, see the Hiding the immuta
Database in Databricks page.
Before users can query an Immuta data source, an administrator must give the user Can Attach To
permissions on the cluster.
See the Databricks Data Source Creation guide for a detailed walkthrough of creating Databricks data sources in Immuta.
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the immuta
database.
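A sketch of the two query styles, assuming a hypothetical table my_db.my_table whose Immuta-qualified name is immuta.my_db_my_table (the actual qualified name depends on the data source name in Immuta):
-- First example: query the Immuta-qualified table name
SELECT * FROM immuta.my_db_my_table LIMIT 10;
-- Second example: query the raw table directly; policies are still enforced on the Immuta-enabled cluster
SELECT * FROM my_db.my_table LIMIT 10;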
When you enable Unity Catalog, Immuta automatically migrates your existing Databricks data sources in Immuta to reference the legacy hive_metastore
catalog to account for Unity Catalog's three-level hierarchy. New data sources will reference the Unity Catalog metastore you create and attach to your Databricks workspace.
Because the hive_metastore
catalog is not managed by Unity Catalog, existing data sources in the hive_metastore
cannot have Unity Catalog access controls applied to them. Data sources in the Hive Metastore must be managed by the Databricks Spark integration.
To allow Immuta to administer Unity Catalog access controls on that data, move the data to Unity Catalog and re-register those tables in Immuta by completing the steps below. If you don't move all data before configuring the integration, metastore magic will protect your existing data sources throughout the migration process.
Disable all existing Databricks Spark integrations with Unity Catalog support or Databricks SQL integrations. Note: Immuta supports running the Databricks Spark integration with the Unity Catalog integration concurrently, so Databricks Spark integrations do not have to be disabled before migrating to Unity Catalog.
Ensure that all Databricks clusters that have Immuta installed are stopped and the Immuta configuration is removed from the cluster. Immuta-specific cluster configuration is no longer needed with the Databricks Unity Catalog integration.
Move all data into Unity Catalog before configuring Immuta with Unity Catalog. Existing data sources will need to be re-created after they are moved to Unity Catalog and the Unity Catalog integration is configured.
Databricks Unity Catalog allows you to manage and access data in your Databricks account across all of your workspaces. With Immuta’s Databricks Unity Catalog integration, you can write your policies in Immuta and have them enforced automatically by Databricks across data in your Unity Catalog metastore.
APPLICATION_ADMIN
Immuta permission for the user configuring the integration in Immuta.
Databricks privileges:
An account with the CREATE CATALOG
privilege on the Unity Catalog metastore to create an Immuta-owned catalog and tables. For automatic setups, this privilege must be granted to the Immuta system account user. For manual setups, the user running the Immuta script must have this privilege.
An Immuta system account user requires the following Databricks privileges:
OWNER
permission on the Immuta catalog you configure.
OWNER
permission on catalogs with schemas and tables registered as Immuta data sources so that Immuta can administer Unity Catalog row-level and column-level security controls. This permission can be applied by granting OWNER
on a catalog to a Databricks group that includes the Immuta system account user to allow for multiple owners. If the OWNER
permission cannot be applied at the catalog- or schema-level, each table registered as an Immuta data source must individually have the OWNER
permission granted to the Immuta system account user.
USE CATALOG
and USE SCHEMA
on parent catalogs and schemas of tables registered as Immuta data sources so that the Immuta system account user can interact with those tables.
SELECT
and MODIFY
on all tables registered as Immuta data sources so that the system account user can grant and revoke access to tables and apply Unity Catalog row- and column-level security controls.
USE CATALOG
on the system
catalog for native query audit.
USE SCHEMA
on the system.access
schema for native query audit.
SELECT
on the following system tables for native query audit:
system.access.audit
system.access.table_lineage
system.access.column_lineage
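Assuming the Immuta system account is a hypothetical principal named immuta_system@example.com and the data sources live in a hypothetical analytics catalog, the privileges above could be granted with statements along these lines:
-- Ownership so Immuta can administer row- and column-level controls
ALTER CATALOG analytics OWNER TO `immuta_system@example.com`;
-- Access to the securables registered as Immuta data sources
GRANT USE CATALOG ON CATALOG analytics TO `immuta_system@example.com`;
GRANT USE SCHEMA ON SCHEMA analytics.sales TO `immuta_system@example.com`;
GRANT SELECT, MODIFY ON TABLE analytics.sales.orders TO `immuta_system@example.com`;
-- Access needed for native query audit
GRANT USE CATALOG ON CATALOG system TO `immuta_system@example.com`;
GRANT USE SCHEMA ON SCHEMA system.access TO `immuta_system@example.com`;
GRANT SELECT ON TABLE system.access.audit TO `immuta_system@example.com`;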
Before you configure the Databricks Unity Catalog integration, ensure that you have fulfilled the following requirements:
Unity Catalog metastore created and attached to a Databricks workspace. Immuta supports configuring a single metastore for each configured integration, and that metastore may be attached to multiple Databricks workspaces.
Unity Catalog enabled on your Databricks cluster or SQL warehouse. All SQL warehouses have Unity Catalog enabled if your workspace is attached to a Unity Catalog metastore. Immuta recommends linking a SQL warehouse to your Immuta instance rather than a cluster for both performance and availability reasons.
Personal access token generated for the user that Immuta will use to manage policies in Unity Catalog.
No Databricks SQL integrations are configured in your Immuta instance. The Databricks Unity Catalog integration replaces the Databricks SQL integration entirely and cannot coexist with it. If there are configured Databricks SQL integrations, remove them and add a Databricks Unity Catalog integration in its place. Databricks data sources will also need to be migrated if they are defined in the hive_metastore
catalog.
No Databricks Spark integrations with Unity Catalog support are configured in your Immuta instance. Immuta does not support that integration and the Databricks Unity Catalog integration concurrently. See the Unity Catalog overview for supported cluster configurations.
Unity Catalog system tables enabled for native query audit.
Best practices
Ensure your integration with Unity Catalog goes smoothly by following these guidelines:
Use a Databricks SQL warehouse to configure the integration. Databricks SQL warehouses are faster to start than traditional clusters, require less management, and can run all the SQL that Immuta requires for policy administration. A serverless warehouse provides nearly instant startup time and is the preferred option for connecting to Immuta.
Move all data into Unity Catalog before configuring Immuta with Unity Catalog. The default catalog used once Unity Catalog support is enabled in Immuta is the hive_metastore, which is not supported by the Unity Catalog native integration. Data sources in the Hive Metastore must be managed by the Databricks Spark integration. Existing data sources will need to be re-created after they are moved to Unity Catalog and the Unity Catalog integration is configured.
Disable existing Databricks SQL and Databricks Spark with Unity Catalog Support integrations.
Ensure that all Databricks clusters that have Immuta installed are stopped and the Immuta configuration is removed from the cluster. Immuta-specific cluster configuration is no longer needed with the Databricks Unity Catalog integration.
Move all data into Unity Catalog before configuring Immuta with Unity Catalog. Existing data sources will need to be re-created after they are moved to Unity Catalog and the Unity Catalog integration is configured. If you don't move all data before configuring the integration, metastore magic will protect your existing data sources throughout the migration process.
Existing data source migration
If you have existing Databricks data sources, complete these migration steps before proceeding.
You have two options for configuring your Databricks Unity Catalog integration:
Automatic setup: Immuta creates the catalogs, schemas, tables, and functions using the integration's configured personal access token.
Manual setup: Run the Immuta script in Databricks yourself to create the catalog. You can also modify the script to customize your storage location for tables, schemas, or catalogs.
Required permissions
When performing an automatic setup, the Databricks personal access token you configure below must be attached to an account with the following permissions for the metastore associated with the specified Databricks workspace:
USE CATALOG and USE SCHEMA on parent catalogs and schemas of tables registered as Immuta data sources so that the Immuta system account user can interact with those tables.
SELECT and MODIFY on all tables registered as Immuta data sources so that the system account user can grant and revoke access to tables and apply Unity Catalog row- and column-level security controls.
OWNER permission on the Immuta catalog created below.
OWNER permission on catalogs with schemas and tables registered as Immuta data sources so that Immuta can administer Unity Catalog row-level and column-level security controls. This permission can be applied by granting OWNER on a catalog to a Databricks group that includes the Immuta system account user to allow for multiple owners. If the OWNER permission cannot be applied at the catalog- or schema-level, each table registered as an Immuta data source must individually have the OWNER permission granted to the Immuta system account user.
CREATE CATALOG on the workspace metastore.
USE CATALOG on the system catalog for native query audit.
USE SCHEMA on the system.access schema for native query audit.
SELECT on the following system tables for native query audit:
system.access.audit
system.access.table_lineage
system.access.column_lineage
Click the App Settings icon in the left sidebar.
Scroll to the Global Integration Settings section and check the Enable Databricks Unity Catalog support in Immuta checkbox. The additional settings in this section are only relevant to the Databricks Spark with Unity Catalog integration and will not have any effect on the Unity Catalog integration. These can be left with their default values.
Click the Integrations tab.
Click + Add Native Integration and select Databricks Unity Catalog from the dropdown menu.
Complete the following fields:
Server Hostname is the hostname of your Databricks workspace.
HTTP Path is the HTTP path of your Databricks cluster or SQL warehouse.
Immuta Catalog is the name of the catalog Immuta will create to store internal entitlements and other user data specific to Immuta. This catalog will only be readable for the Immuta service principal and should not be granted to other users. The catalog name may only contain letters, numbers, and underscores and cannot start with a number.
If using a proxy server with Databricks Unity Catalog, click the Enable Proxy Support checkbox and complete the Proxy Host and Proxy Port fields. The username and password fields are optional.
Opt to fill out the Exemption Group field with the name of a group in Databricks that will be excluded from having data policies applied; this value must not be changed from the default once configured. Create this account-level group for privileged users and service accounts that require an unmasked view of data before configuring the integration in Immuta.
Unity Catalog query audit is enabled by default; you can disable it by clicking the Enable Native Query Audit checkbox. Ensure you have enabled system tables in Unity Catalog and provided the required access to the Immuta system account.
Configure the audit frequency by scrolling to Integrations Settings and finding the Unity Catalog Audit Sync Schedule section.
Enter how often, in hours, you want Immuta to ingest audit events from Unity Catalog as an integer between 1 and 24.
Continue with your integration configuration.
Enter a Databricks Personal Access Token. This is the access token for the Immuta service principal. This service principal must have the metastore privileges listed above for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
Click Test Databricks Unity Catalog Connection.
Save and Confirm your changes.
Required permissions
When performing a manual setup, the following Databricks permissions are required:
The user running the script must have the CREATE CATALOG permission on the workspace metastore.
The Databricks personal access token you configure below must be attached to an account with the following permissions:
USE CATALOG and USE SCHEMA on parent catalogs and schemas of tables registered as Immuta data sources so that the Immuta system account user can interact with those tables.
SELECT and MODIFY on all tables registered as Immuta data sources so that the system account user can grant and revoke access to tables and apply Unity Catalog row- and column-level security controls.
OWNER permission on the Immuta catalog created below.
OWNER permission on catalogs with schemas and tables registered as Immuta data sources so that Immuta can administer Unity Catalog row-level and column-level security controls. This permission can be applied by granting OWNER on a catalog to a Databricks group that includes the Immuta system account user to allow for multiple owners. If the OWNER permission cannot be applied at the catalog- or schema-level, each table registered as an Immuta data source must individually have the OWNER permission granted to the Immuta system account user.
USE CATALOG on the system catalog for native query audit.
USE SCHEMA on the system.access schema for native query audit.
SELECT on the following system tables for native query audit:
system.access.audit
system.access.table_lineage
system.access.column_lineage
Click the App Settings icon in the left sidebar.
Scroll to the Global Integration Settings section and check the Enable Databricks Unity Catalog support in Immuta checkbox. The additional settings in this section are only relevant to the Databricks Spark with Unity Catalog integration and will not have any effect on the Unity Catalog integration. These can be left with their default values.
Click the Integrations tab.
Click + Add Native Integration and select Databricks Unity Catalog from the dropdown menu.
Complete the following fields:
Server Hostname is the hostname of your Databricks workspace.
HTTP Path is the HTTP path of your Databricks cluster or SQL warehouse.
Immuta Catalog is the name of the catalog Immuta will create to store internal entitlements and other user data specific to Immuta. This catalog will only be readable for the Immuta service principal and should not be granted to other users. The catalog name may only contain letters, numbers, and underscores and cannot start with a number.
If using a proxy server with Databricks Unity Catalog, click the Enable Proxy Support checkbox and complete the Proxy Host and Proxy Port fields. The username and password fields are optional.
Opt to fill out the Exemption Group field with the name of a group in Databricks that will be excluded from having data policies applied; this value must not be changed from the default once configured. Create this account-level group for privileged users and service accounts that require an unmasked view of data before configuring the integration in Immuta.
Unity Catalog query audit is enabled by default; you can disable it by clicking the Enable Native Query Audit checkbox. Ensure you have enabled system tables in Unity Catalog and provided the required access to the Immuta system account.
Configure the audit frequency by scrolling to Integrations Settings and finding the Unity Catalog Audit Sync Schedule section.
Enter how often, in hours, you want Immuta to ingest audit events from Unity Catalog as an integer between 1 and 24.
Continue with your integration configuration.
Enter a Databricks Personal Access Token. This is the access token for the Immuta service principal. This service principal must have the metastore privileges listed above for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
Select the Manual toggle and copy or download the script. You can modify the script to customize your storage location for tables, schemas, or catalogs.
Run the script in Databricks.
Click Test Databricks Unity Catalog Connection.
Save and Confirm your changes.
To enable native query audit for Unity Catalog, complete the following steps before configuring the integration:
Grant your Immuta system account user access to the Databricks Unity Catalog system tables. For Databricks Unity Catalog audit to work, Immuta must have, at minimum, the following access (a sketch of these grants appears below):
USE CATALOG on the system catalog
USE SCHEMA on the system.access schema
SELECT on the following system tables:
system.access.audit
system.access.table_lineage
system.access.column_lineage
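These grants can be issued with Databricks SQL. A minimal sketch, assuming a hypothetical system account principal named immuta_system_account:

```sql
-- Replace `immuta_system_account` with your Immuta system account user or service principal.
GRANT USE CATALOG ON CATALOG system TO `immuta_system_account`;
GRANT USE SCHEMA ON SCHEMA system.access TO `immuta_system_account`;
GRANT SELECT ON TABLE system.access.audit TO `immuta_system_account`;
GRANT SELECT ON TABLE system.access.table_lineage TO `immuta_system_account`;
GRANT SELECT ON TABLE system.access.column_lineage TO `immuta_system_account`;
```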
Use the Databricks Personal Access Token in the configuration above for the account you just granted system table access. This account will be the Immuta system account user.
Register Unity Catalog securables as Immuta data sources.
External data connectors and query-federated tables are preview features in Databricks. See the Databricks documentation for details about the support and limitations of these features before registering them as data sources in the Unity Catalog integration.
Map Databricks usernames to Immuta to ensure Immuta properly enforces policies and audits user queries.
Build global policies in Immuta to enforce table-, column-, and row-level security.
Audience: Data Owners and Data Users
Content Summary: This page provides an overview of the Databricks integration. For installation instructions, see the Databricks Installation Introduction and the Databricks Quick Integration Guide.
Databricks is a plugin integration with Immuta. This integration allows you to protect access to tables and manage row-, column-, and cell-level controls without enabling table ACLs or credential passthrough. Policies are applied to the plan that Spark builds for a user's query and enforced live on-cluster.
An Application Admin will configure Databricks with either of the following:
Simplified Databricks Configuration on the Immuta App Settings page
Manual Databricks Configuration, where Immuta artifacts must be downloaded and staged to your Databricks clusters
In both configuration options, the Immuta init script adds the Immuta plugin in Databricks: the Immuta Security Manager, wrappers, and Immuta analysis hook plan rewrite. Once an administrator gives users Can Attach To entitlements on the cluster, they can query Immuta-registered data sources directly in their Databricks notebooks.
Simplified Databricks Configuration Additional Entitlements
The credentials used to do the Simplified Databricks configuration with automatic cluster policy push must have the following entitlement:
Allow cluster creation
This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.
Immuta Best Practices: Test User
Test the integration on an Immuta-enabled cluster with a user that is not a Databricks administrator.
You should register entire databases with Immuta and run Schema Monitoring jobs through the Python script provided during data source registration. Additionally, you should use a Databricks administrator account to register data sources with Immuta using the UI or API; however, you should not test Immuta policies using a Databricks administrator account, as they are able to bypass controls. See the Pre-Configuration page for more details.
A Databricks administrator can control who has access to specific tables in Databricks through Immuta Subscription Policies or by manually adding users to the data source. Data users will only see the immuta database with no tables until they are granted access to those tables as Immuta data sources.
immuta Database
When a table is registered in Immuta as a data source, users can see that table in the native Databricks database and in the immuta database. This allows for an option to use a single database (immuta) for all tables.
After data users have subscribed to data sources, administrators can apply fine-grained access controls, such as restricting rows or masking columns with advanced anonymization techniques, to manage what the users can see in each table. More details on the types of data policies can be found on the Data Policies page, including an overview of masking struct and array columns in Databricks.
Note: Immuta recommends building Global Policies rather than Local Policies, as they allow organizations to easily manage policies as a whole and capture system state in a more deterministic manner.
All access controls must go through SQL.
Note: With R, you must load the SparkR library in a cell before accessing the data.
Usernames in Immuta must match usernames in Databricks. It is best practice to use the same identity manager for Immuta that you use for Databricks (Immuta supports these identity manager protocols and providers); however, for Immuta SaaS users, it's easiest to ensure usernames match between systems.
An Immuta Application Administrator configures the Databricks integration and registers available cluster policies Immuta generates.
The Immuta init script adds the immuta plugin in Databricks: the Immuta SecurityManager, wrappers, and Immuta analysis hook plan rewrite.
A Data Owner registers Databricks tables in Immuta as data sources. A Data Owner, Data Governor, or Administrator creates or changes a policy or user in Immuta.
Data source metadata, tags, user metadata, and policy definitions are stored in Immuta's Metadata Database.
A Databricks user who is subscribed to the data source in Immuta queries the corresponding table directly in their notebook or workspace.
During Spark Analysis, Spark calls down to the Metastore to get table metadata.
Immuta intercepts the call to retrieve table metadata from the Metastore.
Immuta modifies the Logical Plan to enforce policies that apply to that user.
Immuta wraps the Physical Plan with specific Java classes to signal to the SecurityManager that it is a trusted node and is allowed to scan raw data. Immuta blocks direct access to S3 unless it backs a registered table in Immuta.
The Physical Plan is applied and filters out and transforms raw data coming back to the user.
The user sees policy-enforced data.
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
Databricks instance: Premium tier workspace and Cluster access control enabled
Databricks instance has network level access to Immuta instance
Access to Immuta archives
Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Note: Azure Databricks authenticates users with Microsoft Entra ID. Be sure to configure your Immuta instance with an IAM that uses the same user ID as does Microsoft Entra ID. Immuta's Spark security plugin will look to match this user ID between the two systems. See this Microsoft Entra ID page for details.
Use the table below to determine which version of Immuta supports your Databricks Runtime version:
Databricks Runtime 11.3 LTS: Immuta 2023.1 and newer
Databricks Runtime 10.4 LTS: Immuta 2022.2.x and newer
Databricks Runtime 7.3 LTS and 9.1 LTS: Immuta 2021.5.x and newer
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Cluster 1: Databricks Runtime 9.1; Unavailable; Unavailable
Cluster 2: Databricks Runtime 10.4; Unavailable; Unavailable
Cluster 3: Databricks Runtime 11.3; Unavailable
Cluster 4: Databricks Runtime 11.3
Cluster 5: Databricks Runtime 11.3
Immuta supports the Custom access mode.
Supported Languages:
Python
SQL
R (requires advanced configuration; work with your Immuta support professional to use R)
Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
Users Who Can Read Raw Tables On-Cluster
If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.
If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.
The Immuta Databricks integration injects an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin creates an "immuta" database that is available for querying and intercepts all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta instance and applied before returning the results to the user.
The Databricks cluster init script provided by Immuta downloads the Immuta artifacts onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for policy enforcement.
The cluster init script uses environment variables in order to
Determine the location of the required artifacts for downloading.
Authenticate with the service/storage containing the artifacts.
Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.
See the Databricks Pre-Configuration Details page for known limitations.
There are two installation options for Databricks. Click a link below to navigate to a tutorial for your chosen method:
Simplified Configuration: The steps to enable the integration with this method include
Adding the integration on the App Settings page.
Downloading or automatically pushing cluster policies to your Databricks workspace.
Creating or restarting your cluster.
Manual Configuration: The steps to enable the integration with this method include
Downloading and configuring Immuta artifacts.
Staging Immuta artifacts somewhere the cluster can read from during its startup procedures.
Protecting Immuta environment variables with Databricks Secrets.
Creating and configuring the cluster to start with the init script and load Immuta into its SparkSQL environment.
For easier debugging of the Immuta Databricks installation, enable cluster init script logging. In the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.
For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
The Validation and Debugging Notebook (immuta-validation.ipynb) is packaged with other Databricks release artifacts (for manual installations), or it can be downloaded from the App Settings page when configuring native Databricks through the Immuta UI. This notebook is designed to be used by or under the guidance of an Immuta Support Professional.
Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
Click the arrow next to your name and select Import.
Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.
Databricks Unity Catalog allows you to manage and access data in your Databricks account across all of your workspaces and introduces fine-grained access controls in Databricks.
Immuta’s integration with Unity Catalog allows you to manage multiple Databricks workspaces through Unity Catalog while protecting your data with Immuta policies. Instead of manually creating UDFs or granting access to each table in Databricks, you can author your policies in Immuta and have Immuta manage and enforce Unity Catalog access-control policies on your data in Databricks clusters or SQL warehouses:
Subscription policies: Immuta subscription policies automatically grant and revoke access to Databricks tables.
Data policies: Immuta data policies enforce row- and column-level security without creating views, so users can query tables as they always have without their workflows being disrupted.
Unity Catalog uses the following hierarchy of data objects:
Metastore: Created at the account level and is attached to one or more Databricks workspaces. The metastore contains metadata of all the catalogs, schemas, and tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those objects.
Catalog: A catalog sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas.
Schema: Organizes tables and views.
Table: Tables can be managed or external tables.
For details about the Unity Catalog object model, see the Databricks Unity Catalog documentation.
The Databricks Unity Catalog integration supports
applying column masking and row-redaction policies on tables
applying subscription polices on tables and views
enforcing Unity Catalog access controls, even if Immuta becomes disconnected
Delta and Parquet files
allowing non-Immuta reads and writes
using Photon
using a proxy server
Unity Catalog supports managing permissions at the Databricks account level through controls applied directly to objects in the metastore. To interact with the metastore and apply controls to any table, Immuta requires a personal access token (PAT) for an Immuta system account user with permissions to manage all data protected by Immuta. See the permissions requirements section for a list of specific Databricks privileges.
Immuta uses this Immuta system account user to run queries that set up all the tables, user-defined functions (UDFs), and other data necessary for policy enforcement. Upon enabling the native integration, Immuta will create a catalog named after your provided workspaceName that contains two schemas:
immuta_system: Contains internal Immuta data.
immuta_policies: Contains policy UDFs.
When policies require changes to be pushed to Unity Catalog, Immuta updates the internal tables in the immuta_system schema with the updated policy information. If necessary, new UDFs are pushed to replace any out-of-date policies in the immuta_policies schema and any row filters or column masks are updated to point at the new policies. Many of these operations require compute on the configured Databricks cluster or SQL endpoint, so compute must be available for these policies to succeed.
Immuta’s Unity Catalog integration applies Databricks table-, row-, and column-level security controls that are enforced natively within Databricks. Immuta's management of these Databricks security controls is automated and ensures that they synchronize with Immuta policy or user entitlement changes.
Table-level security: Immuta manages REVOKE and GRANT privileges on securable objects in Databricks through subscription policies. When you create a subscription policy in Immuta, Immuta uses the Unity Catalog API to issue GRANTS or REVOKES against the catalog, schema, or table in Databricks for every user affected by that subscription policy.
Row-level security: Immuta applies SQL UDFs to restrict access to rows for querying users.
Column-level security: Immuta applies column-mask SQL UDFs to tables for querying users. These column-mask UDFs run for any column that requires masking; a simplified sketch of these row filter and column mask primitives appears below.
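Immuta generates and manages these objects automatically, so you never author them yourself. Purely as an illustration of the underlying Unity Catalog primitives, a hand-written equivalent of a row filter and a column mask might look like the following sketch (all catalog, schema, table, column, and group names are hypothetical):

```sql
-- Hypothetical row filter: expose only US rows unless the querying user is in an exemption group.
CREATE OR REPLACE FUNCTION example_catalog.policies.us_rows_only(region STRING)
RETURNS BOOLEAN
RETURN region = 'US' OR is_account_group_member('exemption_group');

ALTER TABLE example_catalog.sales.orders
SET ROW FILTER example_catalog.policies.us_rows_only ON (region);

-- Hypothetical column mask: redact the ssn column for users outside the exemption group.
CREATE OR REPLACE FUNCTION example_catalog.policies.mask_ssn(ssn STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('exemption_group') THEN ssn ELSE '***-**-****' END;

ALTER TABLE example_catalog.sales.orders
ALTER COLUMN ssn SET MASK example_catalog.policies.mask_ssn;
```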
The Unity Catalog integration supports the following policy types:
Conditional masking
Constant
Custom masking
Hashing
Null
Regex: You must use the global regex flag (g) when creating a regex masking policy in this integration. You cannot use the case insensitive regex flag (i) when creating a regex masking policy in this integration. See the limitations section for examples.
Rounding (date and numeric rounding)
Matching (only show rows where)
Custom WHERE
Never
Where user
Where value in column
Minimization
Time-based restrictions
Some users may need to be exempt from masking and row-level policy enforcement. When you add user accounts to the configured exemption group in Databricks, Immuta will not enforce policies for those users. Exemption groups are created when the Unity Catalog integration is configured, and no policies will apply to these users' queries, despite any policies enforced on the tables they query.
The principal used to register data sources in Immuta will be automatically added to this exemption group for that Databricks table. Consequently, users added to this list and used to register data sources in Immuta should be limited to service accounts.
hive_metastore
When enabling Unity Catalog support in Immuta, the catalog for all Databricks data sources will be updated to point at the default hive_metastore catalog. Internally, Databricks exposes this catalog as a proxy to the workspace-level Hive metastore that schemas and tables were kept in before Unity Catalog. Since this catalog is not a real Unity Catalog catalog, it does not support any Unity Catalog policies. Therefore, Immuta will ignore any data sources in the hive_metastore in any Databricks Unity Catalog integration, and policies will not be applied to tables there.
However, with Databricks metastore magic you can use hive_metastore and enforce subscription and data policies with the Databricks Spark integration.
The Databricks Unity Catalog integration supports the access token method to configure the integration and create data sources in Immuta. This is the access token for the Immuta service principal. This service principal must have the metastore privileges listed in the permissions section for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
The Unity Catalog data object model introduces a 3-tiered namespace, as outlined above. Consequently, your Databricks tables registered as data sources in Immuta will reference the catalog, schema (also called a database), and table.
External data connectors and query-federated tables are preview features in Databricks. See the Databricks documentation for details about the support and limitations of these features before registering them as data sources in the Unity Catalog integration.
Access requirements
For Databricks Unity Catalog audit to work, Immuta must have, at minimum, the following access.
USE CATALOG on the system catalog
USE SCHEMA on the system.access schema
SELECT on the following system tables:
system.access.audit
system.access.table_lineage
system.access.column_lineage
The Databricks Unity Catalog integration audits user queries run in clusters or SQL warehouses for deployments configured with the Databricks Unity Catalog integration. The audit ingest is set when configuring the integration and the audit logs can be scoped to only ingest specific workspaces if needed.
See the Unity Catalog native audit page for details about manually prompting ingest of audit logs and the contents of the logs.
See the Enable Unity Catalog guide for a list of requirements.
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Cluster 1: Databricks Runtime 9.1; Unavailable; Unavailable
Cluster 2: Databricks Runtime 10.4; Unavailable; Unavailable
Cluster 3: Databricks Runtime 11.3; Unavailable
Cluster 4: Databricks Runtime 11.3
Cluster 5: Databricks Runtime 11.3
Unity Catalog row- and column-level security controls are unsupported for single-user clusters. See the Databricks documentation for details about this limitation.
Row access policies with more than 1023 columns are unsupported. This is an underlying limitation of UDFs in Databricks. Immuta will only create row access policies with the minimum number of referenced columns. This limit will therefore apply to the number of columns referenced in the policy and not the total number in the table.
If you disable table grants, Immuta revokes the grants. Therefore, if users had access to a table before enabling Immuta, they’ll lose access.
You must use the global regex flag (g) when creating a regex masking policy in this integration, and you cannot use the case insensitive regex flag (i) when creating a regex masking policy in this integration. See the examples below for guidance:
regex with a global flag (supported): /^ssn|social ?security$/g
regex without a global flag (unsupported): /^ssn|social ?security$/
regex with a case insensitive flag (unsupported): /^ssn|social ?security$/gi
regex without a case insensitive flag (supported): /^ssn|social ?security$/g
If a registered data source is owned by a Databricks group at the table level, then the Unity Catalog integration cannot apply data masking policies to that table in Unity Catalog.
Therefore, set all table-level ownership on your Unity Catalog data sources to an individual user or service principal instead of a Databricks group. Catalogs and schemas can still be owned by a Databricks group, as ownership at that level doesn't interfere with the integration.
The following features are currently unsupported:
Databricks change data feed support
Immuta projects
Multiple IAMs on a single cluster
Column masking policies on views
Mixing masking policies on the same column
Row-redaction policies on views
R and Scala cluster support
Scratch paths
User impersonation
Policy enforcement on raw Spark reads
Python UDFs for advanced masking functions
Direct file-to-SQL reads
Data policies on ARRAY, MAP, or STRUCT type columns
Snippets for Databricks data sources may be empty in the Immuta UI.
This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the Use Project UDFs in Databricks page.
Use Project UDFs in Databricks
Currently, caches are not all invalidated outside of Databricks because Immuta caches information pertaining to a user's current project in the NameNode plugin and in Vulcan. Consequently, this feature should only be used in Databricks.
Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project.
To allow use of project UDFs in Spark jobs, raise the caching on-cluster and lower the cache timeouts for the Immuta Web Service. Otherwise, caching could cause dissonance among the requests and calls to multiple web workers when users try to change their project contexts.
Click the App Settings icon in the left sidebar and scroll to the HDFS Cache Settings section.
Lower the Cache TTL of HDFS user names (ms) to 0.
Click Save.
In the Spark environment variables section, set the IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS variables to high values (like 10000), as in the example below.
Note: These caches will be invalidated on cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on cluster to avoid periodically reaching out to the web service.
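For example, the Spark environment variables might look like the following (10000 is the illustrative value from the step above; tune it for your environment):

```
IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS=10000
IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS=10000
```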
Audience: System Administrators
Content Summary: This guide details the manual installation method for enabling native access to Databricks with Immuta policies enforced.
Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the .
Databricks Unity Catalog
If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
The immuta_conf.xml file is no longer required.
The immuta_conf.xml file that was previously used to configure the native Databricks integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. However, you can still deploy an immuta_conf.xml file to set properties if you wish.
The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional immuta_conf.xml file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.
If you have an existing immuta_conf.xml file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method will ensure that any property defaults changed in upcoming Immuta releases are propagated to your environment.
Spark Version
Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
Navigate to the Immuta archives page. If you are prompted to log in and need basic authentication credentials, contact your Immuta support professional.
Navigate to the Databricks folder for your Immuta version. Ex: https://archives.immuta.com/hadoop/databricks/2024.1.13/.
Download the .jar file (Immuta plugin) as well as the other scripts listed below, which will load the plugin at cluster startup.
The immuta-benchmark-suite.dbc is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Specify the following properties as Spark environment variables or in the optional immuta_conf.xml file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the config names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
immuta.system.api.key: Obtain this value from the Immuta App Settings page under HDFS > System API Key. You will need to be a user with the APPLICATION_ADMIN role to complete this action. Warning: Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
immuta.base.url: The full URL for the target Immuta instance. Ex: https://immuta.mycompany.com.
immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta App Settings page within the Identity Management section. An example environment-variable configuration for these properties is sketched below.
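For example, these three properties might be set in the cluster's Environment Variables section as follows. All values are placeholders, and the API key is shown referencing a hypothetical Databricks secret in line with the secrets best practice described later on this page:

```
IMMUTA_SYSTEM_API_KEY={{secrets/immuta/system_api_key}}
IMMUTA_BASE_URL=https://immuta.mycompany.com
IMMUTA_USER_MAPPING_IAMID=bim
```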
Environment Variables with Google Cloud Platform
Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml.
When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. To do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:
These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. In order for the init script to find these files, a URI will have to be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.
URI Structure: s3://[bucket]/[path]
Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.
If you wish to authenticate using access keys, add the following items to the cluster's environment variables:
If you've assumed a role and received a session token, that can be added here as well:
URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]
Environment Variables:
If you want to authenticate using an account key, add the following to your cluster's environment variables:
If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:
URI Structure: adl://[account].azuredatalakestore.net/[path]
Environment Variables:
If authenticating as a Microsoft Entra ID user,
If authenticating using a service principal,
URI Structure: http(s)://[host](:port)/[path]
Artifacts are available for download from Immuta using basic authentication. Your basic authentication credentials can be obtained from your Immuta support professional.
DBFS does not support access control. Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.
URI Structure: dbfs:/[path]
Since any user has access to everything in DBFS:
The artifacts can be stored anywhere in DBFS.
It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to keep the value of an environment variable secret, they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section as shown in the example below. Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
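A sketch of the Environment Variables entry, using a hypothetical variable name with the scope and key described above:

```
MY_SENSITIVE_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}
```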
Best Practice: Replace Sensitive Variables with Secrets
Immuta recommends that ANY SENSITIVE environment variables listed below in the various artifact deployment instructions be replaced with secrets.
Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta enabled clusters.
Select the Custom Access mode.
Opt to adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
In the Advanced Options section, click the Instances tab.
Click the Spark tab. In Spark Config field, add your configuration.
Cluster Configuration Requirements:
Click the Init Scripts tab and set the following configurations:
Destination: Specify the service you used to host the Immuta artifacts.
File Path: Specify the full URI to the immuta_cluster_init_script.sh.
Add the new key/value to the configuration.
Click the Permissions tab and configure the following setting:
Who has access: Users or groups will need to have the permission Can Attach To to execute queries against Immuta configured data sources.
(Re)start the cluster.
As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration in order to read the data composing Immuta data sources.
As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.
The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.
IAM Role for S3 Access
ADL Prefix
Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.
When the Immuta enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.
Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the immuta database.
The following SQL query can be run as an administrator within a notebook to give the user access to "Immuta":
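A minimal sketch of such a grant, assuming legacy table access control is enabled and using a placeholder principal:

```sql
-- Placeholder principal; replace with the Databricks user or group that should query Immuta data sources.
GRANT USAGE ON DATABASE immuta TO `user@example.com`;
GRANT SELECT ON DATABASE immuta TO `user@example.com`;
```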
By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the users email address in Databricks and the connected Immuta instance must match.
Audience: System Administrators
Content Summary: This page outlines how to install and configure trusted third-party libraries for Databricks.
Specifying More than One Trusted Library
To specify more than one trusted library, comma delimit the URIs:
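For example (both URIs below are hypothetical):

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.example:example-lib:1.0.0,dbfs:/FileStore/jars/another-example-lib.jar
```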
In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks libraries API.
In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI:
Maven Artifacts
For Maven artifacts, the URI is maven:/<maven_coordinates>, where <maven_coordinates> is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. Here's an example of an installed artifact:
In this example, you would add the following Spark environment variable:
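Assuming the installed artifact's Coordinates field were com.example:example-lib:1.0.0 (a hypothetical coordinate), the variable would be:

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.example:example-lib:1.0.0
```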
.jar Artifacts
For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. Here's an example of an installed artifact:
In this example, you would add the following Spark environment variable:
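Assuming the installed artifact's Source field were dbfs:/FileStore/jars/example-lib.jar (a hypothetical path), the variable would be:

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/FileStore/jars/example-lib.jar
```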
Restart the cluster.
Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:
Audience: System Administrators
Content Summary: This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.
Language Support
R and Scala are supported, but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit jobs are not supported by the Databricks Spark integration.
Using R in a Notebook
Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions.
R spark-submit Jobs
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
Initialize the Spark session by entering these settings into the R submit script: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
This will enable the R script to access Immuta data sources, scratch paths, and workspace tables.
Once the script is written, upload the script to a location in dbfs/S3/ABFS to give the Databricks cluster access to it.
R spark submit Job
To create the R spark-submit job:
Go to the Databricks jobs page.
Create a new job, and select Configure spark-submit.
Set up the parameters:
Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
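A minimal sketch of the spark-submit parameters field for the R job, using the script path from the note above (your script may require additional arguments):

```
["dbfs:/path/to/script.R"]
```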
Edit the cluster configuration, and change the Databricks Runtime to be a supported version (5.5, 6.4, 7.3, or 7.4).
Configure the Environment Variables section as you normally would for an Immuta-enabled cluster.
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.
The spark submit job needs to be launched using a different classloader which will point at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:
Scala spark-submit Job
To create the Scala spark-submit job:
Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.
Select Configure spark-submit, and configure the parameters:
Note: Pass the fully-qualified class name of the class whose main function will be used as the entry point for your code in the --class parameter.
Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
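A minimal sketch of the spark-submit parameters field, using a hypothetical entry-point class and the jar path from the note above:

```
["--class", "com.example.Main", "dbfs:/path/to/code.jar"]
```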
Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" (where dbfs:/path/to/code.jar is the path to your jar) so that the jar is uploaded to all the cluster nodes.
The user mapping works differently from notebooks because spark-submit
clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.
Currently when an API key is generated it invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set the immuta.api.key on all the clusters or use a specified job user for the submit job.
Audience: System Administrators
Content Summary: This page describes the Scala cluster policy.
Scala Clusters
This configuration is for Scala-only clusters.
Where Scala language support is needed, this configuration can be used in the Custom .
According to Databricks’ cluster type support documentation, Scala clusters are intended for single users. However, nothing inherently prevents a Scala cluster from being configured for multiple users. Even with the Immuta SecurityManager enabled, there are limitations to user isolation within a Scala job.
For a secure configuration, it is recommended that clusters intended for Scala workloads are limited to Scala jobs only and are made homogeneous through the use of or externally via convention/cluster ACLs. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
For full details on Databricks’ best practices in configuring clusters, please read their documentation.
Audience: System Administrators
Content Summary: This document describes how to use an existing Hive external metastore instead of the built-in metastore.
Immuta supports the use of external metastores in , following the same configuration detailed in the .
Download the metastore jars and point to them as specified in the Databricks documentation. Metastore jars must end up on the cluster's local disk at this explicit path: /databricks/hive_metastore_jars.
If using DBR 7.x with Hive 2.3.x, either
Set spark.sql.hive.metastore.version to 2.3.7 and spark.sql.hive.metastore.jars to builtin (see the example Spark Config below), or
Download the metastore jars and set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/* as before.
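For reference, the first option would appear in the cluster's Spark Config section as:

```
spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin
```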
To use AWS Glue Data Catalog as the metastore for Databricks, see the .
If your compliance requirements restrict users from changing projects within a session, you can block the use of Immuta's project UDFs on a Databricks Spark cluster. To do so, configure the immuta.spark.databricks.disabled.udfs option as described on the .
Host files in S3 and provide access by the cluster
Host files in Azure Data Lake Storage Gen 1 or Gen 2 and provide access by the cluster
Host files on an HTTPS server accessible by the cluster
Host files in DBFS (Not recommended for production)
Create an instance profile for clusters by following .
Upload the configuration file, JSON file, and JAR file to an S3 bucket.
Upload the configuration file, JSON file, and JAR file to .
Upload the artifacts directly to using the .
It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the immuta-spark-hive.jar file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.
Create a cluster in Databricks by following the .
IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the section.)
In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be replaced with Databricks secrets, as mentioned above.
To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI environment variable referenced in the section to be the full URI to this file.
S3 can also be accessed using an IAM role attached to the cluster. See the for more details.
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries, as in the first example. Instead, you can run queries like the second example, which does not reference the immuta database.
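For instance, assuming a data source registered from a hypothetical database example_db with a table named example_table (the exact name exposed in the immuta database may differ depending on how the data source was registered), either of the following returns policy-enforced results:

```sql
-- Query through the immuta database
SELECT * FROM immuta.example_table LIMIT 10;

-- Query the raw Databricks table directly; Immuta policies are still applied
SELECT * FROM example_db.example_table LIMIT 10;
```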
See the for a detailed walkthrough.
It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of immuta.user.mapping.iamid created and hosted in the previous steps must be updated to be the targeted IAM ID configured within the Immuta instance. The IAM ID can be found on the Immuta App Settings page. Each Databricks cluster can only be mapped to one IAM.
There is an option of using the immuta.api.key setting with an Immuta API key generated on the .
Audience: Databricks Administrators
Content Summary: This page provides an overview of Immuta's Databricks Trusted Libraries feature and support of Notebook-Scoped Libraries on Machine Learning Clusters.
The Immuta security manager blocks users from executing code that could allow them to gain access to sensitive data by only allowing select code paths to access sensitive files and methods. These select code paths provide Immuta's code access to sensitive resources while blocking end users from these sensitive resources directly.
Similarly, when users install third-party libraries those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. An administrator can specify an installed library as "trusted," which will enable that library's code to bypass the Immuta security manager. Contact your Immuta support professional for custom security configurations for your libraries.
This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through what previously would have been blocked by the security manager.
Security Vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile
that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine if the risk does not apply to your environment or use case.
Databricks Libraries API
Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
Library source is Upload, DBFS, or DBFS/S3 and the Library Type is Jar.
Library source is Maven.
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
waiting for library installation to complete before running any third-party library commands or
executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS
environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util
library, the user will not be able to directly use the file-util
library to read a sensitive file that is normally protected by the Immuta security manager.
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
Add the transitive dependency jar paths to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS
environment variable. In the driver log4j
logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for a log message similar to the following:
In the above example, where slf4j
is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar
to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS
environment variable and restart your cluster.
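A sketch of the resulting environment variable, combining a trusted Maven artifact (the group, artifact, and version shown are hypothetical; the format follows maven:/group.id:artifact-id:version) with the transitive-dependency jar path from the example above, assuming a comma-separated list as with other multi-value settings on this page:

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.example:my-library:1.0.0,dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar
```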
In case of failure, check the driver logs for details. Some possible causes of failure include
One of the Immuta configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.
For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version
.
Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.
For details about configuring trusted libraries, navigate to the installation guide.
Users on Databricks runtimes 8+ can manage notebook-scoped libraries with %pip
commands.
However, this functionality differs from Immuta's trusted libraries feature, and Python libraries are still not supported as trusted libraries. The Immuta Security Manager will deny the code of libraries installed with %pip
access to sensitive resources.
No additional configuration is needed to enable this feature. Users only need to be running on clusters with DBR 8+.
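For example, in a notebook cell on a DBR 8+ cluster (the library name is illustrative), the library installs for the notebook session but is still not trusted by Immuta:

```python
%pip install beautifulsoup4
```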
When using Delta Lake, the API does not go through the normal Spark execution path. This means that Immuta's Spark extensions do not provide protection for the API. To solve this issue and ensure that Immuta has control over what a user can access, the Delta Lake API is blocked.
Spark SQL can be used instead to give the same functionality with all of Immuta's data protections.
Below is a table of the Delta Lake API with the Spark SQL that may be used instead.
DeltaTable.convertToDelta: CONVERT TO DELTA parquet.`/path/to/parquet/`
DeltaTable.delete: DELETE FROM [table_identifier delta.`/path/to/delta/`] WHERE condition
DeltaTable.generate: GENERATE symlink_format_manifest FOR TABLE [table_identifier delta.`/path/to/delta`]
DeltaTable.history: DESCRIBE HISTORY [table_identifier delta.`/path/to/delta`] (LIMIT x)
DeltaTable.merge: MERGE INTO
DeltaTable.update: UPDATE [table_identifier delta.`/path/to/delta/`] SET column = value WHERE (condition)
DeltaTable.vacuum: VACUUM [table_identifier delta.`/path/to/delta`]
See here for a complete list of the Delta SQL Commands.
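As a sketch of the substitution, the Spark SQL equivalents can be run through spark.sql on an Immuta-enabled cluster so that policies are enforced (the paths and condition below are hypothetical):

```python
# Instead of DeltaTable.forPath(spark, "/path/to/delta/").delete("event_date < '2020-01-01'"):
spark.sql("DELETE FROM delta.`/path/to/delta/` WHERE event_date < '2020-01-01'")

# Instead of DeltaTable.forPath(spark, "/path/to/delta").vacuum():
spark.sql("VACUUM delta.`/path/to/delta`")
```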
When a table is created in a native workspace, you can merge a different Immuta data source from that workspace into that table you created.
Create a table in the native workspace.
Create a temporary view of the Immuta data source you want to merge into that table.
Use that temporary view as the data source you add to the project workspace.
Run the following command:
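The exact command depends on your tables; a minimal sketch, assuming the target table native_ws_table was created in the native workspace and immuta_source_view is the temporary view created from the Immuta data source:

```python
spark.sql("""
  MERGE INTO native_ws_table AS target
  USING immuta_source_view AS source
  ON target.id = source.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```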
Audience: System Administrators
Content Summary: This page describes the Python & SQL & R with Library Support cluster policy.
Py4j Security Disabled
In addition to support for Python, SQL, and R, this configuration adds support for additional Python libraries and utilities by disabling Databricks-native Py4j security.
This configuration does not rely on Databricks-native Py4j security to secure the cluster, while process isolation is still enabled to secure filesystem and network access from within Python processes. On an Immuta-enabled cluster, once Py4J security is disabled the Immuta SecurityManager is installed to prevent nefarious actions from Python in the JVM. Disabling Py4J security also allows for expanded Python library support, including many Python ML classes (such as LogisticRegression
, StringIndexer
, and DecisionTreeClassifier
) and dbutils.fs.
By default, all actions in R will execute as the root user. Among other things, this permits access to the entire filesystem (including sensitive configuration data). And without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To properly support the use of the R language, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user. This user has limited filesystem and network access. The Immuta SecurityManager is also installed to prevent users from bypassing policies and protects against the above vulnerabilities from within the JVM.
The SecurityManager will incur a small increase in performance overhead; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
A homogeneous cluster is recommended for configurations where Py4J security is disabled. If all users have the same level of authorization, there would not be any data leakage, even if a nefarious action was taken.
For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.
Audience: Databricks Users
Content Summary: This page describes Immuta's support of Databricks Change Data Feed (CDF).
CDF shows the row-level changes between versions of a Delta table. The changes displayed include row data and metadata that indicates whether the row was inserted, deleted, or updated.
Immuta does not support applying policies to the changed data, and the CDF cannot be read for data source tables if the user does not have access to the raw data in Databricks. However, the CDF can be read if the querying user is allowed to read the raw data and one of the following statements is true:
the table is in the current workspace,
the table is in a scratch path,
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting, or
non-Immuta reads are enabled AND the table is not part of an Immuta data source.
There are no configuration changes necessary to use this feature.
Immuta does not support reading changes in streaming queries.
Audience: System Administrators
Content Summary: This page describes the Python & SQL cluster policy.
Performance
This is the most performant policy configuration.
In this configuration, Immuta is able to rely on Databricks-native security controls, reducing overhead. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes. This Immuta cluster configuration relies on Py4J security being enabled.
Many Python ML classes (such as LogisticRegression
, StringIndexer
, and DecisionTreeClassifier
) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library. Additionally, only Python and SQL are available as supported languages.
For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.
This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or immuta_conf.xml
(not recommended).
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
Environment Variable Overrides
Properties in the config file can be overridden during installation using environment variables. The variable names are the config names in all upper case with _
instead of .
. For example, to set the value of immuta.base.url
via an environment variable, you would set the following in the Environment Variables
section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
immuta.ephemeral.host.override
Default: true
Description: Set this to false
if ephemeral overrides should not be enabled for Spark. When true
, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.
immuta.ephemeral.host.override.httpPath
Description: This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
immuta.ephemeral.table.path.check.enabled
Default: true
Description: When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.
immuta.spark.acl.enabled
Default: true
Description: Immuta Access Control List (ACL). Controls whether Databricks users are blocked from accessing non-Immuta tables. Ignored if Databricks Table ACLs are enabled (i.e., spark.databricks.acl.dfAclsEnabled=true
).
immuta.spark.acl.whitelist
Description: Comma-separated list of Databricks usernames who may access raw tables when the Immuta ACL is in use.
immuta.spark.acl.privileged.timeout.seconds
Default: 3600
Description: The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is whitelisted in immuta.spark.acl.whitelist
.
immuta.spark.acl.assume.not.privileged
Default: false
Description: Session property that overrides privileged user status when the Immuta ACL is in use. This should only be used in R scripts associated with spark-submit jobs.
immuta.spark.audit.all.queries
Default: false
Description: Enables auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.
immuta.spark.databricks.allow.non.immuta.reads
Default: false
Description: Allows non-privileged users to SELECT
from tables that are not protected by Immuta. See Limited Enforcement in Databricks for details about this feature.
immuta.spark.databricks.allow.non.immuta.writes
Default: false
Description: Allows non-privileged users to run DDL commands and data-modifying commands against tables or spaces that are not protected by Immuta. See Limited Enforcement in Databricks for details about this feature.
immuta.spark.databricks.allowed.impersonation.users
Description: This configuration is a comma-separated list of Databricks users who are allowed to impersonate Immuta users.
immuta.spark.databricks.dbfs.mount.enabled
Default: false
Description: Exposes the DBFS FUSE mount located at /dbfs
. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.
immuta.spark.databricks.disabled.udfs
Description: Block one or more Immuta user-defined functions (UDFs) from being used on an Immuta cluster. This should be a Java regular expression that matches the set of UDFs to block by name (excluding the immuta
database). For example to block all project UDFs, you may configure this to be ^.*_projects?$
. For a list of functions, see the project UDFs page.
immuta.spark.databricks.filesystem.blacklist
Default: hdfs
Description: A list of filesystem protocols that this instance of Immuta will not support for workspaces. This is useful in cases where a filesystem is available to a cluster but should not be used on that cluster.
immuta.spark.databricks.filesystem.is3a.path.style.access.config
Default: false
Description: Enables the is3a
filesystem that retrieves your API key and communicates with Immuta as if it were talking directly to S3, allowing users to access object-backed data sources through Immuta's s3p
endpoint. This setting is only available on Databricks 7+ clusters.
immuta.spark.databricks.jar.uri
Default: file:///databricks/jars/immuta-spark-hive.jar
Description: The location of immuta-spark-hive.jar
on the filesystem for Databricks. This should not need to change unless a custom initialization script that places immuta-spark-hive in a non-standard location is necessary.
immuta.spark.databricks.local.scratch.dir.enabled
Default: true
Description: Creates a world-readable/writable scratch directory on local disk to facilitate the use of dbutils
and 3rd party libraries that may write to local disk. Its location is non-configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR
. Note: Sensitive data should not be stored at this location.
immuta.spark.databricks.log.level
Default: INFO
Description: The SLF4J log level to apply to Immuta's Spark plugins.
immuta.spark.databricks.log.stdout.enabled
Default: false
Description: If true, writes logging output to stdout/the console as well as the log4j-active.txt
file (default in Databricks).
immuta.spark.databricks.py4j.strict.enabled
Default: true
Description: Disable to allow the use of the dbutils
API in Python. Note: This setting should only be disabled for customers who employ a homogeneous integration (i.e., all users have the same level of data access).
immuta.spark.databricks.scratch.database
Description: This configuration is a comma-separated list of additional databases that will appear as scratch databases when running a SHOW DATABASES query. This configuration increases performance by circumventing the Metastore when gathering the metadata for all databases to determine what to display for a SHOW DATABASES query; it won't affect access to the scratch databases. Instead, use immuta.spark.databricks.scratch.paths to control read and write access to the underlying database paths.
Additionally, this configuration will only display the scratch databases that are configured and will not validate that the configured databases exist in the Metastore. Therefore, it is up to the Databricks administrator to properly set this value and keep it current.
immuta.spark.databricks.scratch.paths
Description: Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db
for the default location).
To create a scratch path to a location (or to a database stored at that location), or to a database created using the default location, configure immuta.spark.databricks.scratch.paths as in the sketch below.
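A sketch of both cases as a single environment-variable setting (the bucket and database names are hypothetical; the dbfs path follows the default-location pattern mentioned above):

```
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=s3://my-bucket/scratch,dbfs:/user/hive/warehouse/my_scratch_db.db
```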
immuta.spark.databricks.scratch.paths.create.db.enabled
Default: false
Description: Enables non-privileged users to create or drop scratch databases.
immuta.spark.databricks.single.impersonation.user
Default: false
Description: When true
, this configuration prevents users from changing their impersonation user once it has been set for a given Spark session. This configuration should be set when the BI tool or other service allows users to submit arbitrary SQL or issue SET commands.
immuta.spark.databricks.submit.tag.job
Default: true
Description: Denotes whether to run the Spark job that "tags" a Databricks cluster as being associated with Immuta.
immuta.spark.databricks.trusted.lib.uris
Description: Databricks Trusted Libraries
immuta.spark.non.immuta.table.cache.seconds
Default: 3600
Description: The number of seconds Immuta caches whether a table has been exposed as a source in Immuta. This setting only applies when immuta.spark.databricks.allow.non.immuta.writes
or immuta.spark.databricks.allow.non.immuta.reads
is enabled.
immuta.spark.require.equalization
Default: false
Description: Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages
.
immuta.spark.resolve.raw.tables.enabled
Default: true
Description: Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or whitelisted users can set immuta.spark.session.resolve.raw.tables.enabled
to false
to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.
immuta.spark.session.resolve.raw.tables.enabled
Default: true
Description: Same as above, but a session property that allows users to toggle this functionality. If users run set immuta.spark.session.resolve.raw.tables.enabled=false
, they will see raw data only (not Immuta data policy-enforced data). Note: This property is not set in immuta_conf.xml
.
immuta.spark.show.immuta.database
Default: true
Description: This shows the immuta
database in the configured Databricks cluster. When set to false
Immuta will no longer show this database when a SHOW DATABASES
query is performed. However, queries can still be performed against tables in the immuta
database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table
) regardless of whether or not this feature is enabled.
immuta.spark.version.validate.enabled
Default: true
Description: Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to true
, if versions are incompatible, that information will be logged to the Databricks driver logs and the cluster will not be usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), this check can be set to false
so that the versions don't get logged as incompatible and make the cluster unusable.
immuta.user.context.class
Default: com.immuta.spark.OSUserContext
Description: The class name of the UserContext that will be used to determine the current user in immuta-spark-hive
. The default implementation gets the OS user running the JVM for the Spark application.
immuta.user.mapping.iamid
Default: bim
Description: Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (bim
) but should be updated to reflect an actual production IAM.
Audience: System Administrators
Content Summary: It is most secure to leverage an equalized project when working in a Scala cluster; however, it is not required to limit Scala to equalized projects. This document outlines security recommendations for Scala clusters and discusses the security risks involved when equalized projects are not used.
Language Support
R and Scala are both supported, but require advanced configuration; work with your Immuta support professional to use these languages.
There are limitations to isolation among users in Scala jobs on a Databricks cluster, even when using Immuta’s SecurityManager. When data is broadcast, cached (spilled to disk), or otherwise saved to SPARK_LOCAL_DIR
, it's impossible to distinguish between which user’s data is composed in each file/block. If you are concerned about this vulnerability, Immuta suggests that Scala clusters
be limited to Scala jobs only.
use project equalization, which forces all users to act under the same set of attributes, groups, and purposes with respect to their data access.
When data is read in Spark using an Immuta policy-enforced plan, the masking and redaction of rows is performed at the leaf level of the physical Spark plan, so a policy such as "Mask using hashing the column social_security_number
for everyone" would be implemented as an expression on a project node right above the FileSourceScanExec/LeafExec
node at the bottom of the plan. This process prevents raw data from being shuffled in a Spark application and, consequently, from ending up in SPARK_LOCAL_DIR
.
This policy implementation coupled with an equalized project guarantees that data being dropped into SPARK_LOCAL_DIR
will have policies enforced and that those policies will be homogeneous for all users on the cluster. Since each user will have access to the same data, if they attempt to manually access other users' cached/spilled data, they will only see what they have access to via equalized permissions on the cluster. If project equalization is not turned on, users could dig through that directory and find data from another user with heightened access, which would result in a data leak.
To require that Scala clusters be used in equalized projects and avoid the risk described above, change the immuta.spark.require.equalization
value to true
in your Immuta configuration file when you spin up Scala clusters:
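A minimal sketch, using the environment-variable form of the setting (per the naming convention described on the configuration page); the equivalent property can instead be placed in the Immuta configuration file:

```
IMMUTA_SPARK_REQUIRE_EQUALIZATION=true
```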
Once this configuration is complete, users on the cluster will need to switch to an Immuta equalized project before running a job. (Remember that when working under an Immuta Project, only tables within that project can be seen.) Once the first job is run using that equalized project, all subsequent jobs, no matter the user, must also be run under that same equalized project. If you need to change a cluster's project, you must restart the cluster.
Audience: System Administrators
Content Summary: This page describes ephemeral overrides for Databricks data sources.
Best Practices: Ephemeral Overrides
Disable ephemeral overrides for clusters when using multiple workspaces and dedicate a single cluster to serve queries from Immuta in a single workspace.
If you use multiple E2 workspaces without disabling ephemeral overrides, avoid applying the where user row-level policy to data sources.
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations and queries that the user runs through the Query Editor.
When a user runs a Spark job in Databricks, Immuta plugins automatically submit ephemeral overrides for that user to Immuta for all applicable data sources to use the current cluster as compute for all subsequent metadata operations for that user against the applicable data sources.
A user runs a query on cluster B.
The Immuta plugins on the cluster check if there is a source in the Metastore with a matching database, table name, and location for its underlying data. Note: If tables are dynamic or change over time, users can disable the comparison of the location of the underlying data by setting immuta.ephemeral.table.path.check.enabled
to false
; disabling this configuration allows users to avoid keeping the relevant data sources in Immuta up-to-date (which would require API calls and automation).
The Immuta plugins on the cluster detect that the user is subscribed to data sources 1, 2, and 3 and that data sources 1 and 3 are both present in the Metastore for cluster B, so the plugins submit ephemeral override requests for data sources 1 and 3 to override their connections with the HTTP path from cluster B.
Since data source 2 is not present in the Metastore, it is marked as a JDBC source.
If the user attempts to query data source 2 and they have not enabled JDBC sources, they will be presented with an error message telling them to do so:
com.immuta.spark.exceptions.ImmutaConfigurationException: This query plan will cause data to be pulled over JDBC. This spark context is not configured to allow this. To enable JDBC, set immuta.enable.jdbc=true in the spark context hadoop configuration.
Ephemeral overrides are enabled by default because Immuta must be aware of a cluster that is running to serve metadata queries. The operations that use the ephemeral overrides include
Visibility checks on the data source for a particular user. These checks assess how to apply row-level policies for specific users.
Stats collection triggered by a specific user.
Validating a custom WHERE clause policy against a data source. When owners or governors create custom WHERE clause policies, Immuta uses compute resources to validate the SQL in the policy. In this case, the ephemeral overrides for the user writing the policy are used to contact a cluster for SQL validation.
High Cardinality Column detection. Certain advanced policy types (e.g., minimization and randomized response) in Immuta require a High Cardinality Column, and that column is computed on data source creation. It can be recomputed on demand and, if so, will use the ephemeral overrides for the user requesting computation.
However, ephemeral overrides can be problematic in environments that have a dedicated cluster to handle maintenance activities, since ephemeral overrides can cause these operations to execute on a different cluster than the dedicated one.
To reduce the risk that a user has overrides set to a cluster (or multiple clusters) that aren't currently up,
direct all clusters' HTTP paths for overrides to a cluster dedicated for metadata queries or
disable overrides completely.
To disable ephemeral overrides, set immuta.ephemeral.host.override
in spark-defaults.conf
to false.
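A minimal sketch of the spark-defaults.conf entry (depending on how your cluster is configured, the equivalent environment variable IMMUTA_EPHEMERAL_HOST_OVERRIDE=false may be used instead):

```
immuta.ephemeral.host.override false
```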
Databricks Unity Catalog is a shared metastore at the Databricks account level that streamlines management of multiple Databricks workspaces for users.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. This integration provides a pathway for you to add your tables to the Unity Catalog metastore so that you can use the metastore from any workspace while protecting your data with Immuta policies.
Databricks Runtime 11.3.
Unity Catalog enabled on your Databricks cluster.
Unity Catalog metastore created and attached to a Databricks workspace.
The metastore owner you are using to manage permissions has been granted access to all catalogs, schemas, and tables that will be protected by Immuta. Data protected by Immuta should only be granted to privileged users in Unity Catalog so that the only view of that data is through an Immuta-enabled cluster.
You have generated a personal access token for the metastore owner that Immuta can use to read data in Unity Catalog.
You do not plan to use non-Unity Catalog enabled clusters with Immuta data sources. Once enabled, all access to data source tables must be on Databricks clusters with Unity Catalog enabled on runtime 11.3.
Configure your cluster to register data in Immuta.
Register Unity Catalog tables as Immuta data sources.
Build policies in Immuta to restrict access to data.
Deprecation notice
Support for this feature has been deprecated.
This page outlines the basic features of the Query Editor, which contains three main components: Table List and Schema View, the Query Editor, and the Query Results View. For a tutorial that details how to use the Query Editor, navigate to the Data Source User Guide.
The Query Editor allows users who are subscribed to a data source to preview data and write and execute queries directly in the Immuta UI for any data sources they are subscribed to. Additionally, Data Owners can examine how their policies impact the underlying data.
This panel contains a list of tables (grouped by schema) the user is subscribed to, and this list will automatically update when users switch their current project. Clicking a table in the list displays the schema view, which shows all columns with their respective data types.
Users can enter, modify, and execute their own queries in this panel. After users click Run Query, results will appear in the Query Results panel.
In the top right corner of the Query Editor is a dropdown to select a schema. Any tables in SELECT
statements that are not schema-qualified will use the schema chosen from the dropdown.
This panel displays the data returned by the query. Table columns can be resized or re-arranged by clicking and dragging, and results can be filtered. Currently displayed results can also be exported to .csv (limited to 1000 rows.)
Application Administrators can turn off the Query Engine to ensure data does not leave a data localization zone when authorized users access the Immuta application outside the data jurisdiction.
When the Query Engine is disabled, the SQL Credentials tab on a user profile page is removed. The associated SQL accounts are also deleted, so if an Administrator re-enables the Query Engine those SQL accounts must be recreated.
For a tutorial that details how to disable the Query Engine, navigate to the App Settings Tutorial.
Audience: System Administrators
Content Summary: This page describes how the Security Manager is disabled for Databricks clusters that do not allow R or Scala code to be executed. Databricks Administrators should place the desired configuration in the
immuta_conf.xml
file.
The Immuta Security Manager is an essential element of the Databricks deployment that ensures users can't perform unauthorized actions when using Scala and R, since those languages have features that allow users to circumvent policies without the Security Manager enabled. However, the Security Manager must inspect the call stack every time a permission check is triggered, which adds overhead to queries. To improve Immuta's query performance on Databricks, Immuta disables the Security Manager when Scala and R are not being used.
The cluster init script checks the cluster’s configuration and automatically removes the Security Manager configuration when
spark.databricks.repl.allowedlanguages
is a subset of {python, sql}
IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED
is true
When the cluster is configured this way, Immuta can rely on Databricks' process isolation and Py4J security to prevent user code from performing unauthorized actions.
Note: Immuta still expects the spark.driver.extraJavaOptions
and spark.executor.extraJavaOptions
to be set and pointing at the Security Manager.
Beyond disabling the Security Manager, Immuta will skip several startup tasks that are required to secure the cluster when Scala and R are configured, and fewer permission checks will occur on the Driver and Executors in the Databricks cluster, reducing overhead and improving performance.
There are still cases that require the Security Manager; in those instances, Immuta creates a fallback Security Manager to check the code path, so the IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI
environment variable must always point to a valid calling class file.
Databricks' dbutils.fs is blocked by their Py4J security; therefore, it can't be used to access scratch paths.
The Immuta platform solves two of the largest issues facing data-driven organizations: access and control. In large organizations, it can be difficult, if not impossible, for data scientists to access all the data they need. Once they do get access, it’s often difficult to make sure they use the data in ways that are compliant with regulations.
The Immuta platform solves both problems by providing a consistent point of access for all data analysis and by dynamically protecting your data with complex policies, enforced based on the user accessing the data and the logic of the policy. This creates efficient digital data exchanges that comply with organizations' regulations and provide complete visibility into policy enforcement. Benefits include
Scalability and Evolvability: A scalable and evolvable data management system allows you to make changes that impact thousands of tables at once, accurately. It also allows you to evolve your policies over time with minor changes (or no changes at all) through policy logic.
Understandability: Immuta can present policies in a natural language form that is easily understood and provide an audit history of change to create a trust and verify environment. This allows you to prove policy is being implemented correctly to business leaders concerned with compliance and risk, and your business can meet audit obligations to external parties or customers.
Stability and Repeatability: Immuta was built with the “as-code” movement in mind, allowing you to treat Immuta as ephemeral and represent state in source control. You can merge data policy management into your existing engineering paradigms and toolchains, allowing full automation of every component of Immuta. Additionally, time-to-data is reduced across the organization because policy management is stable and time can be spent on other complex initiatives.
Distributed Stewardship: Immuta enables fine-grained data ownership and controls over organizational domains, allowing a data mesh environment for sharing data - embracing the ubiquity of your organization. You can enable different parts of your organization to manage their data policies in a self-serve manner without involving you in every step, and you can make data available across the organization without the need to centralize both the data and authority over the data. This frees your organization to share more data more quickly.
Consistency: With inconsistency comes complexity, both for your team and the downstream analysts trying to read data. That complexity from inconsistency removes all value of separating policy from compute. Immuta provides complete consistency so that you can build a policy once, in a single location, and have it enforced scalably and consistently across all your data warehouses.
Availability: Making these highly granular decisions at the access control level can increase data access by over 50% in some cases when using Immuta, because friction between compliance and data access is reduced.
Performance: Performance is tied to how Immuta implements policy enforcement. Rather than requiring a copy of data to be created, Immuta enforces policies live.
Data Sources
Policies
Projects
Audit Logs and Immuta Reports
Application Admins: Application Admins manage the configuration of Immuta for their organization. These users can configure Immuta to use external identity managers and catalogs, enable or disable data handlers, adjust email and cache settings, generate system API keys, and manage various other advanced settings.
Data Owners: In order for data to be available in the Immuta platform, a Data Owner (the individual or team responsible for the data) needs to connect their data to Immuta. Once data is connected to Immuta, that data is called a data source. In the process of creating a data source, Data Owners are able to set policies on their data source that restrict which users can access it, which rows within the data a user can access, and which columns within the data source are visible or masked. Data Owners can also decide whether to make their data source public, which makes it available for discovery to all users in the Immuta Web UI, or private, which means only the Data Owner and its assigned subscribers know it exists.
Data Users: Data Users consume the data that’s been made available through Immuta. Data Users can browse the Immuta Web UI seeking access to data and easily connect their third-party data science tools to Immuta.
Project Owners: These users can create their own project to restrict how their data will be utilized using purpose-based restrictions or to efficiently organize their data sources.
Governors: Governors set Global Policies within Immuta, meaning they can restrict the ways that data is used within Immuta across multiple projects and data sources. Governors can also set purpose-based usage restrictions on projects, which can help limit the ways that data is used within Immuta. By default, Governors can subscribe to data sources; however, this setting can be disabled on the App Settings page. Additionally, users can be a Governor and Admin simultaneously by default, but this setting can also be changed on the App Settings page to render the Governor and Admin roles mutually exclusive.
Project Managers: These users inspect, manage, approve, and deny various project changes, including purpose requests and project data sources.
User Admins: Another type of System Administrator is the User Admin, who is able to manage the permissions, attributes, and groups that attach to each user. Permissions are only managed locally within Immuta, but groups and attributes can be managed locally or derived from user management frameworks such as LDAP or Active Directory that are external to Immuta. By default, Admins can subscribe to data sources; however, this setting can be disabled on the App Settings page to remove the Admin's ability to create or subscribe to data sources. Additionally, users can be an Admin and Governor simultaneously by default, but this setting can also be changed on the App Settings page to render the Admin and Governor roles mutually exclusive.
Application Admin: APPLICATION_ADMIN
Data Owner: CREATE_DATA_SOURCE, CREATE_DATA_SOURCE_IN_PROJECT, CREATE_PROJECT
Data User: -
Data Governor: GOVERNANCE
Project Manager: PROJECT_MANAGEMENT
User Admin: USER_ADMIN
Permissions are a system-level mechanism that control what actions a user is allowed to take. These are applied to both the API and UI actions. Permissions can be added to any user by a System Administrator (any user with the USER_ADMIN
permission), but the permissions themselves are managed by Immuta and cannot be added or removed in the Immuta UI; however, custom permissions can be created on the App Settings page.
APPLICATION_ADMIN: Gives the user access to administrative actions for the configuration of Immuta. These actions include
Adding external IAMs.
Adding ODBC drivers.
Adding external catalogs.
Configuring email settings.
AUDIT: Gives the user access to the audit logs.
CREATE_DATA_SOURCE: Gives the user the ability to create data sources.
CREATE_DATA_SOURCE_IN_PROJECT: Gives the user the ability to create data sources within a project.
CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE: When creating an S3 data source, this allows the user to have the handler assume an AWS Role when ingesting data.
CREATE_FILTER: Gives the user the ability to create and save a search filter.
CREATE_PROJECT: Gives the user the ability to create projects.
FETCH_POLICY_INFO: Gives the user access to an endpoint that returns visibilities, masking information, and filters for a given data source.
GOVERNANCE: Gives the user the ability to set Global Policies, create purpose-based usage restrictions on projects, and manage tags.
IMPERSONATE_USER: Allows user to impersonate other Immuta users by entering their own SQL credentials to authenticate with the Immuta Query Engine and then specifying which user they would like to impersonate.
IMPERSONATE_HDFS_USER: When creating an HDFS data source, this allows the user to enter any HDFS username to use when accessing data.
PROJECT_MANAGEMENT: Allows users to create purposes, approve and deny purpose requests, and manage project data sources.
USER_ADMIN: Gives the user access to administrative actions for managing users in Immuta. These include
Creating and managing users and groups.
Adding and removing user permissions.
Creating and managing user attributes.
SaaS: This deployment option provides data access control through Immuta's native integrations, with automatic software updates and no infrastructure or maintenance costs.
Self-Managed: Immuta supports self-managed deployments for users who store their data on-premises or in private clouds, such as VPC. Users can connect to on-premises data sources and cloud data platforms that run on Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
This section illustrates how to install Immuta using Kubernetes, which allows Immuta to easily scale to meet all your future growth needs.
See the Helm installation prerequisites guide for details about system requirements.
Immuta Query Engine Port
The required firewall rules depend on whether you will use the Immuta Query Engine or exclusively use integrations. If you only use integrations, port 5432 is optional.
The following firewall rules must be opened to any host or network that needs access to the Immuta service:
443/TCP: Web Service
5432/TCP: PostgreSQL (Query Engine only)
Immuta has a Helm chart available for installation on Kubernetes:
Specific guides are available for the following Kubernetes cloud providers:
Immuta supports the Kubernetes distributions and versions outlined below.
AWS EKS: 1.25, 1.26, 1.27, 1.28, 1.29
Azure AKS: 1.25, 1.26, 1.27, 1.28, 1.29
Google GKE: 1.24, 1.25, 1.26, 1.27, 1.28
Red Hat OpenShift: 4.12, 4.13, 4.14
Rancher RKE2: 1.24, 1.25, 1.26, 1.27, 1.28
Ingress Controller
The Immuta Helm Chart's built-in ingress controller is enabled by default, but will be disabled by default in future versions. If you have production workloads, consider moving away from using the built-in ingress controller.
AWS EKS: AWS CloudWatch or third-party logging solution; built-in ingress controller or third-party ingress controller; AWS EBS (default storage class in EKS); AWS S3 for object storage.
Azure AKS: Third-party logging solution; built-in ingress controller or third-party ingress controller; Azure managed disks (default storage class in AKS); Azure Blob Storage for object storage.
Google GKE: Third-party logging solution; built-in ingress controller or third-party ingress controller; Google Cloud Persistent Disks (default storage class in GKE); Google Cloud Storage for object storage.
Red Hat OpenShift: Third-party logging solution; built-in ingress controller or third-party ingress controller; cloud disks (AWS EBS, Azure managed disks, or Google Cloud Persistent Disks); cloud storage (S3, Azure Blob, Google Cloud Storage) or self-hosted object storage (such as MinIO).
Rancher RKE2: Third-party logging solution; built-in ingress controller or third-party ingress controller; cloud disks (AWS EBS, Azure managed disks, or Google Cloud Persistent Disks); cloud storage (S3, Azure Blob, Google Cloud Storage) or self-hosted object storage (such as MinIO).
Immuta depends on the Helm functionality outlined below.
templates and functions
Helm hooks:
pre-install
pre-upgrade
post-upgrade
post-delete: This hook is not strictly necessary and is only used to clean up some resources that are not deleted by Helm itself. If the post-delete hook is not supported, some resources may be left on the cluster after running helm delete
.
Immuta support ends at our Helm implementation; wrapping Helm in another orchestration tool falls outside the Immuta support window.
Identify a team to set up, configure, and maintain your Kubernetes environment. Immuta will help you with the installation of our platform, but the Kubernetes environment is your company's responsibility. Review Kubernetes best practices here.
Only use the Immuta-provided default Nginx Ingress Controller if you are using the Immuta query engine. Otherwise, opt to use your own ingress controller or no controller at all.
Test your backups at least once a month.
Create the proper IAM roles and IAM permissions (if using IAM roles). Your backups will fail if this is not configured correctly.
Implementing infrastructure monitoring for the systems hosting your Immuta application is critical to ensuring its optimal performance, availability, and security. With today's complex IT environments, any disruption or delay in the underlying infrastructure can significantly impact your Immuta operations, affecting data governance processes and business outcomes. Infrastructure monitoring
allows you to proactively oversee your servers, networks, and other hardware components in real time.
identifies potential bottlenecks, hardware failures, or performance anomalies before they lead to significant issues or downtime.
can alert you to unusual activities that might indicate security threats, allowing for swift mitigation.
By monitoring your hosting infrastructure, you ensure that your Immuta application continues to run smoothly, securely, and effectively.
Use any monitoring tool that is already deployed. If you're not using any monitoring tools yet, consider some of the following options:
CloudTrail (if using AWS EKS or other cloud technologies)
DataDog (generally platform agnostic)
Prometheus (free and open-source software)
Using a log aggregation tool for your Immuta application is vital to maintaining operational efficiency and security. Modern applications' complex ecosystems generate vast amounts of log data that can be challenging to manage and analyze manually. A log aggregation tool centralizes these logs, making it easier to monitor the application's performance and health in real time. It can help detect anomalies, identify patterns, and troubleshoot issues more efficiently, thereby reducing downtime. Moreover, in the context of security, these tools can help detect suspicious activities or potential breaches by analyzing log data, contributing significantly to your overall data governance and risk mitigation strategy.
The logs in Kubernetes pods are rotated frequently, preventing Immuta support from viewing log history that is days or weeks old. Because these logs are necessary when investigating pod behavior and troubleshooting deployment-related issues, enable log aggregation to capture the log history.
Use any logging tool that is already deployed. If you're not using a log aggregation tool yet, consider one of the following options:
Splunk
DataDog
Grafana Loki (free and open-source software)
Once your log aggregation tool is deployed, follow these general best-practice guidelines:
Pull logs from Immuta on a daily basis. These logs contain all of the information you will need to support auditing and compliance for access, queries, and changes in your environment.
Store logs for at least 30 days in a log aggregator for monitoring and compliance.
Discuss with your compliance group or lines of business which fields you want to monitor or report on from the Immuta logs. Immuta captures a wealth of information each time a user logs in, changes a policy, or runs a query, so work with your team to determine which items to capture in a log aggregation tool or store long-term.
To ensure top performance, audit records should not be stored in Immuta longer than the default of 60 days. For long-term audit records, use an external audit storage solution to ensure long-term data retention, data preservation, centralized monitoring, enhanced security, and scalability. Using an external audit storage solution also empowers your organization to meet compliance requirements and derive valuable insights from audit data for informed decision-making and analysis.
By default, most Immuta audit records expire after 60 days, but there are some audit record types that do not expire after 60 days. See the Immuta system audit logs page for details.
Backup frequency and retention settings directly impact your data protection and disaster recovery capabilities. While a daily backup is the default frequency and provides a standard level of data security, it's essential to evaluate your specific needs and the sensitivity of your data. For organizations dealing with more sensitive or critical information, increasing the backup frequency beyond daily backups can help minimize the risk of data loss and potential downtime. However, balancing the backup frequency with resource use is vital to avoid impacting performance: longer retention periods enable historical data recovery, while shorter periods optimize storage usage. It is crucial to assess regulatory requirements, data validation practices, and your organization's tolerance for data loss to set an effective retention policy.
Configuring backup settings that align with your desired recovery capabilities and data validation frequency ensures a resilient and reliable application deployment. With the flexibility provided by Helm values, you can fine-tune these settings to match your unique business needs and data protection goals effectively:
Backup frequency: By default, backups are taken once a day at midnight. This can be changed via the backup.schedule parameter in the Helm values file, which uses CronJob syntax to specify the frequency of backups. Daily backups are standard, but for more sensitive or critical data you can schedule multiple backups per day, or back up less frequently if that fits your needs (see the example values snippet after this list).
Backup file retention: Additionally, the number of backup files retained also matters. By default, 10 backup files are stored at all times in your storage of choice. Every time a new backup is taken, the oldest file is removed from the storage. This can be changed in the backup.maxBackupCount
parameter in the Helm values file.
For smaller deployments, 10 backup files is acceptable, assuming the backups are taken once a day.
For production deployment, work with your Immuta representative to determine the right number of backup files for your environment.
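A sketch of the corresponding Helm values, assuming the backup.schedule and backup.maxBackupCount parameters map to a values file as shown (adjust the CronJob expression and retention count to your needs):

```yaml
backup:
  schedule: "0 */12 * * *"   # run a backup every 12 hours instead of the default daily run at midnight
  maxBackupCount: 14         # retain 14 backup files instead of the default 10
```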
Audience: System Administrators, Data Owners, and Data Users
Content Summary: This page describes the Databricks integration, configuration options, and features.
See the Databricks integration page for a tutorial on enabling Databricks and these features through the App Settings page.
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Cluster 1
9.1
Unavailable
Unavailable
Cluster 2
10.4
Unavailable
Unavailable
Cluster 3
11.3
Unavailable
Cluster 4
11.3
Cluster 5
11.3
Legend:
The feature or integration is enabled.
The feature or integration is disabled.
Databricks instance: Premium tier workspace and Cluster access control enabled
Databricks instance has network level access to Immuta instance
Access to Immuta archives
Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Note: Azure Databricks authenticates users with Microsoft Entra ID. Be sure to configure your Immuta instance with an IAM that uses the same user ID as Microsoft Entra ID; Immuta's Spark security plugin will look to match this user ID between the two systems. See this Microsoft Entra ID page for details.
See this page for a list of Databricks Runtimes Immuta supports.
Immuta supports the Custom access mode.
Supported Languages:
Python
SQL
R (requires advanced configuration; work with your Immuta support professional to use R)
Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
The Immuta Databricks integration supports the following Databricks features:
Change Data Feed: Databricks users can see the Databricks Change Data Feed on queried tables if they are allowed to read raw data and meet specific qualifications.
Databricks Libraries: Users can register their Databricks Libraries with Immuta as trusted libraries, allowing Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries.
External Metastores: Immuta supports the use of external metastores in local or remote mode.
Spark Direct File Reads: In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths.
Users can have additional write access in their integration using project workspaces, and a single Immuta instance can be integrated with one or multiple workspaces. For more details, see the Databricks Project Workspaces page.
The Immuta Databricks integration cannot ingest tags from Databricks, but you can connect any of these supported external catalogs to work with your integration.
Native impersonation allows users to natively query data as another Immuta user. To enable native user impersonation, see the Integration User Impersonation page.
Audit Limitations
Immuta will audit queries that come from interactive notebooks, notebook jobs, and JDBC connections, but will not audit Scala or R submit jobs. Furthermore, Immuta only audits Spark jobs that are associated with Immuta tables. Consequently, Immuta will not audit a query in a notebook cell that does not trigger a Spark job, unless immuta.spark.audit.all.queries
is set to true
; for more details about this configuration and auditing all queries in Databricks, see Limited Enforcement in Databricks.
Capturing the code or query that triggers the Spark plan makes audit records more useful in assessing what users are doing.
To audit the code or query that triggers the Spark plan, Immuta hooks into Databricks where notebook cells and JDBC queries execute and saves the cell or query text. Then, Immuta pulls this information into the audits of the resulting Spark jobs. Examples of a saved cell/query and the resulting audit record are provided on the Databricks JDBC and Notebook Cell Query Audit Logs page.
A user can configure multiple integrations of Databricks to a single Immuta instance and use them dynamically or with workspaces.
Immuta does not support Databricks clusters with Photon acceleration enabled.
You can use a library (like Boto 3 in Python) to access standard Amazon S3 and point it at Immuta to access your data. The integration with Databricks uses a file system (is3a
) that retrieves your API key and communicates with Immuta as if it were talking directly to S3, allowing users to access S3 and Azure Blob data sources through Immuta's s3p
endpoint.
This mechanism would never go to S3 directly. To access S3 directly, you will need to expose an S3-backed table or view in the Databricks Metastore as a source or use native workspaces/scratch paths.
To use the is3a
filesystem, add the following snippet to your cluster configuration:
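The original snippet is not reproduced here; a minimal sketch, assuming the setting to enable is the is3a path-style access configuration documented in the cluster configuration reference on this page (shown in its environment-variable form):

```
IMMUTA_SPARK_DATABRICKS_FILESYSTEM_IS3A_PATH_STYLE_ACCESS_CONFIG=true
```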
This configuration is needed to allow any access to is3a
on Databricks 7+.
In Databricks or Spark, write queries that access this data by referencing the S3 path (shown in the Basic Information section of the Upload Files modal above), but using the URL scheme is3a
:
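For example (the bucket and object path are hypothetical), a sketch of reading an object-backed data source through the is3a scheme:

```python
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("is3a://my-bucket/path/to/my_file.csv"))
df.show()
```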
This integration is only available for object-backed data sources. Consequently, all the standard limitations that apply to object-backed data sources in Immuta apply here.
Additional configuration is necessary to allow is3a
paths to function as scratch paths. Contact your Immuta support professional for guidance.
In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to interact with their data using file paths or who have existing workflows revolving around file paths can continue to use these workflows without rewriting those queries for Immuta.
When reading from a path in Spark, the Immuta Databricks plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.
Spark Direct File Reads in EMR
EMR uses the same integration as Databricks, but you will need to use the immuta variable in place of spark to interact with Immuta data sources.
For example, instead of spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
, use immuta.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
.
Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where
predicate). Use the tabs below to view examples of reading data using these methods.
Read Data from an Individual Parquet File
To read from an individual file, load a partition file from a sub-directory:
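A sketch of such a read, using the placeholder path from the earlier example (spark is the active SparkSession):

```python
# Load a single parquet partition file from a sub-directory (path is a placeholder).
df = spark.read.format("parquet").load(
    "s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet"
)
```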
Read Partitioned Data from a Sub-Directory
To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:
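For example (path is a placeholder):

```python
# Load a parquet partition from a sub-directory.
df = spark.read.format("parquet").load(
    "s3:/my_bucket/path/to/my_parquet_table/partition_column=01"
)
```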
Alternatively, load a parquet partition using a where
predicate:
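For example (path and column name are placeholders):

```python
# Load the table root and push the partition filter down with a where predicate.
df = (
    spark.read.format("parquet")
    .load("s3:/my_bucket/path/to/my_parquet_table")
    .where("partition_column = '01'")
)
```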
Direct file reads in Spark are also supported for object-backed Immuta data sources (such as S3 or Azure Blob data sources) using the :
Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
On Databricks, multiple input paths are supported as long as they belong to the same data source. However, for EMR only a single input path is supported.
CSV-backed tables are not currently supported.
Loading a delta
partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where
predicate:
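For example (path and column name are placeholders):

```python
# For Delta, load the table root and filter with a where predicate instead of
# loading the partition sub-directory directly.
df = (
    spark.read.format("delta")
    .load("s3:/my_bucket/path/to/my_delta_table")
    .where("partition_column = '01'")
)
```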
Audience: Data Users and System Administrators
Content Summary: This page provides an explanation and solution for this common Databricks error.
Error Message: py4j.security.Py4JSecurityException: Constructor <> is not whitelisted
Explanation: This error indicates you are being blocked by Py4j security rather than the Immuta Security Manager. Py4j security is strict and generally ends up blocking many ML libraries.
Solution: Turn off Py4j security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false
in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4j security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.
Unity Catalog is a shared metastore at the Databricks account level that streamlines management of multiple Databricks workspaces for users.
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. This integration allows you to add your tables to the Unity Catalog metastore so that you can use the metastore from any workspace while protecting your data with Immuta policies.
Databricks clusters with Unity Catalog use the following hierarchy of data objects:
Metastore: Created at the account level and is attached to one or more Databricks workspaces. The metastore contains metadata of the configured tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those tables.
Catalog: A catalog sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas.
Schema: Organizes tables and views.
Table: Tables can be managed or external tables.
For details about the Unity Catalog object model, search for .
Immuta’s Databricks Spark integration with Unity Catalog support uses a custom Databricks plugin to enforce Immuta policies on a Databricks cluster with Unity Catalog enabled. For Immuta to see all relevant tables that have a data source mapped to them, Immuta requires a privileged metastore owner’s personal access token (PAT) from Databricks, and that metastore owner must have been . This token is stored encrypted to provide an Immuta-enabled Databricks cluster access to more data than a specific user on that cluster might otherwise have.
You must use an Immuta-provided cluster policy to start your Databricks cluster, as these cluster policies explicitly set the data security mode to the Custom setting that allows Immuta to enforce policies on top of Unity Catalog and add Unity Catalog support to the cluster. Once your configuration is complete, policy enforcement will be the same as the policy enforcement for the .
For configuration instructions, see the .
External locations and storage credentials must be configured correctly on Immuta-enabled clusters to allow tables to be created in a non-managed path. Immuta does not control access to storage credentials or external locations, and a user will have the same level of access to these on an Immuta-enabled cluster as they do on a non-Immuta enabled cluster.
Scratch paths are locations in storage that users can read and write to without Immuta policies applied. Immuta's support for scratch paths in Unity Catalog is designed to work with external locations.
You must configure external locations for any scratch path and grant those locations to the metastore owner user being used to connect Immuta. Creating a database in a scratch location in an Immuta-enabled cluster with Unity Catalog differs from how it is supported on a non-Immuta cluster with Unity Catalog; on a non-Immuta cluster, a database will not have a location if it is created against a catalog other than the legacy hive_metastore
.
Immuta requires the database location to be specified in the create database call on an Immuta-enabled cluster so that Immuta can validate whether the read or write is permitted, as illustrated in the example below:
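A sketch of such a call, assuming a placeholder database name and scratch path (the exact DDL form may differ in your environment; spark is the active SparkSession):

```python
# Create a database in a scratch location; the LOCATION clause is required on an
# Immuta-enabled cluster so Immuta can validate the path being read or written.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS my_scratch_db
    LOCATION 's3://my-scratch-bucket/scratch/my_scratch_db'
""")
```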
The only change is that Databricks metadata is saved in Unity Catalog at the account level, not the workspace level.
Databricks Runtime 11.3.
Unity Catalog enabled on your Databricks cluster.
Unity Catalog metastore created and attached to a Databricks workspace.
The metastore owner you are using to manage permissions has been granted access to all catalogs, schemas, and tables that will be protected by Immuta. Data protected by Immuta should only be granted to privileged users in Unity Catalog so that the only view of that data is through an Immuta-enabled cluster.
You have generated a personal access token for the metastore owner that Immuta can use to read data in Unity Catalog.
You do not plan to use non-Unity Catalog enabled clusters with Immuta data sources. Once enabled, all access to data source tables must be on Databricks clusters with Unity Catalog enabled on runtime 11.3.
For details about the supported features listed in the table above, see the pre-configuration details page for .
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Legend:
Databricks metastore magic allows you to migrate your data from the Databricks legacy Hive metastore to the Unity Catalog metastore while protecting data and maintaining your current processes in a single Immuta instance.
Native workspaces are not supported. Creating a native workspace on a Unity Catalog enabled host is undefined behavior and may cause data loss or crashes.
Table access must be GRANTed to the Databricks metastore owner configured for the integration. For a table to be accessible to the user, the full chain of catalog, schema, and table must all have the appropriate grants to this administrator user to allow them to SELECT from the table.
R notebooks may have path-related errors accessing tables.
Databricks on Azure will return errors when creating a database in a scratch location when Unity Catalog is enabled.
Databricks accounts deployed on Google Cloud Platform are not supported.
Audience: All users
Content Summary: The Policies page allows all users to view and search all policies and the data sources they apply to. Additionally, Governors and Data Owners can manage Global Policies and Restricted Global Policies on this page.
This document illustrates the basic features of the Policies page. For a tutorial, navigate to the , the or the .
These tabs list all policies and detail the tags, purposes, and policy type; the scope and state of the policy; and when and by whom the policy was created.
The Advanced Search allows users to search for policies based on specific facets, such as policy type, rule type, purposes, conflicts, and creator.
Audience: All Immuta Users
Content Summary: This page details the major components, installation, scalability, availability, and security of the Immuta platform.
Immuta's server-side software comprises the following major components:
Fingerprint Service: When enabled, additional statistical queries made during the health check are distilled into summary statistics, called fingerprints. During this process, statistical query results and data samples (which may contain PII) are temporarily held in memory by the Fingerprint Service.
Immuta Metadata Database: The database that contains instance metadata that powers the core functionality of Immuta, including policy data and attributes about data sources (tags, audit data, etc.).
Immuta Web Service: This component includes the Immuta UI and API and is responsible for all web-based user interaction with Immuta, metadata ingest, and the data fingerprinting process. Notionally a single web service, the fingerprinting functionality runs as a separate service internally and can be independently scaled.
Immuta's standard installation is a Kubernetes deployment. This could be a Kubernetes cluster you manage or a hosted solution such as AKS, EKS, or GKE. This is the preferred deployment because of the minimal administration needed to achieve scale and availability.
Immuta is designed to be scalable in several dimensions. For the standard Immuta deployment, minimal administrative effort is required to manage scaling beyond the addition of nodes to the Immuta system. Scalability can also be achieved in non-standard deployments, but requires the time of skilled systems administrator resources.
The Immuta Web Service is stateless and horizontally scalable.
By keeping a metadata catalog rather than maintaining separate copies of data, Immuta's database is designed to remain small and responsive. By running replicated instances of this internal database, the catalog can scale in support of the web service.
Because each component of Immuta is designed to be horizontally scalable, Immuta can be configured for high availability. Upgrades and major configuration changes may require scheduled downtime, but even if Immuta's master internal database fails, recovery happens within seconds. With the addition of an external load balancer, Immuta's standard deployment comes preconfigured with these availability features.
Immuta’s core function of policy enforcement and management is designed to improve your data security. Beyond this primary feature, Immuta protects your data in several other ways.
Immuta is designed to leverage your existing identity management system when desired. This design allows Immuta to benefit from the work your security team has already done to validate users, protect credentials, and define roles and attributes.
By default, all network communications with Immuta and within Immuta are encrypted via TLS. This practice ensures your data is protected while in transit.
Immuta does not make any persistent copies of data.
Immuta does not store raw customer data. However, it may temporarily cache samples of customer data for sensitive data discovery (SDD) and fingerprinting. These samples are stored in the metadata database and cache containers.
Audience: System Administrators
Content Summary: This page outlines how to deploy Immuta on OpenShift.
Immuta OpenShift Support
Immuta officially supports OpenShift 4 and does not support OpenShift 3.
Run the following command in your terminal:
runAsUser
and fsGroup
The Immuta Helm Chart must be configured to set two values within the approved ranges for the OpenShift project Immuta is being deployed into: runAsUser
and fsGroup
.
runAsUser
: On a Pod SecurityContext, this field defines the user ID that will run the processes within the pod. In the next step, this can be set to any value within the range defined in sa.scc.uid-range
. See details below.
fsGroup
: This field defines a group ID that will be added as a supplemental group to the Pod. Files in PersistentVolumes
will be writable by this group ID. In the next step, this must be set to the minimum value in the range defined in sa.scc.supplemental-groups
. See details below.
View the approved ranges in OpenShift using one of the two methods below:
OpenShift Console
Navigate to the Project Details page and click the link under Annotations.
Take note of the values for openshift.io/sa.scc.uid-range
and openshift.io/sa.scc.supplemental-groups
.
OpenShift CLI
Alternatively, use the OpenShift CLI to inspect the relevant values directly. For example,
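A sketch of such a command (the project name is a placeholder):

```
# Inspect the SCC annotations on the project's namespace.
oc get namespace my-immuta-project -o yaml | grep 'sa.scc'
```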
In both illustrations above, the first part of the value (leading up to the /) is the start of the assigned user ID/group ID range, and the second part (trailing the /) is the length of the range.
For example, for sa.scc.uid-range=1000620000/10000, the minimum UID is 1000620000 and the maximum is 1000629999 (1000620000 + 10000 - 1).
For the examples throughout the rest of this tutorial, 1000620000
will be set as the value for both runAsUser
and fsGroup
.
Set these OpenShift-specific Helm values in a YAML file that will be passed to helm install in the next step; a sketch of a complete values file follows the list of values below:
externalHostname
: Set to a subdomain of the domain configured for the OpenShift Ingress controller. Contact your OpenShift administrator to get the configured domain if it is unknown.
securityContext.runAsUser
: Set this to a user ID in the range specified by the annotation openshift.io/sa.scc.uid-range
in the OpenShift project for the following components:
backup.securityContext.runAsUser
cache.securityContext.runAsUser
database.securityContext.runAsUser
fingerprint.securityContext.runAsUser
queryEngine.securityContext.runAsUser
web.securityContext.runAsUser
securityContext.fsGroup
: Set this to the minimum value in the range defined in sa.scc.supplemental-groups
in the OpenShift project for the following components:
backup.securityContext.fsGroup
database.securityContext.fsGroup
queryEngine.securityContext.fsGroup
web.securityContext.fsGroup
patroniKubernetes.use_endpoints
: Set to false
for the components below. This change is required for Patroni to be able to successfully elect a leader.
database.patroniKubernetes.use_endpoints
queryEngine.patroniKubernetes.use_endpoints
web.ingress.enabled
: Set to false
to disable creation of Ingress resources for the Immuta Web Service. OpenShift provides its own Ingress controller for handling HTTP ingress, and this is configured by creating Routes.
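Putting these values together, a values file might look like the following sketch. The hostname and IDs are placeholders (use the values from your project's annotations), and the key nesting mirrors the keys listed above; confirm it against the Immuta Helm Chart's values reference:

```yaml
externalHostname: immuta.apps.example.openshift.com

web:
  ingress:
    enabled: false
  securityContext:
    runAsUser: 1000620000
    fsGroup: 1000620000

database:
  securityContext:
    runAsUser: 1000620000
    fsGroup: 1000620000
  patroniKubernetes:
    use_endpoints: false

queryEngine:
  securityContext:
    runAsUser: 1000620000
    fsGroup: 1000620000
  patroniKubernetes:
    use_endpoints: false
  clientService:
    type: LoadBalancer   # omit if the Query Engine is not being used

backup:
  securityContext:
    runAsUser: 1000620000
    fsGroup: 1000620000

cache:
  securityContext:
    runAsUser: 1000620000

fingerprint:
  securityContext:
    runAsUser: 1000620000
```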
To set up ingress for Immuta using the OpenShift Ingress controller, get the CA certificate used by Immuta for internal TLS. This will be used by the OpenShift Ingress controller to validate the upstream TLS connection to Immuta.
Create a Route using the OpenShift CLI. The hostname flag should be set to match the value configured for externalHostname
in the Helm values file, and it should be a subdomain of the domain that the OpenShift Ingress controller is configured for.
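A sketch of such a command (the route name, service name, hostname, and CA certificate file are placeholders):

```
# Create a reencrypt route that terminates TLS at the OpenShift router and
# validates Immuta's internal TLS certificate upstream.
oc create route reencrypt immuta \
  --service=immuta-web \
  --hostname=immuta.apps.example.openshift.com \
  --dest-ca-cert=immuta-internal-ca.crt
```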
This will create a route to be served by the OpenShift Ingress controller. At this point, Immuta is installed and should be accessible at the configured hostname.
Run kubectl get svc immuta-query-engine-clients
to inspect the Query Engine client's service in Kubernetes to get the assigned External IP address. For example,
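The output will look similar to the following (names, addresses, and ports are illustrative):

```
NAME                          TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
immuta-query-engine-clients   LoadBalancer   172.30.118.42   203.0.113.25    5432:31021/TCP   4m
```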
Copy the External-IP address. You will paste this value in the Immuta App Settings page to update the Public Query Engine Hostname.
In the Immuta UI, click the App Settings icon in the left sidebar and scroll to the Public URLs section.
Enter the value you copied from the EXTERNAL-IP
column in the Public Query Engine Hostname field.
Click Save to update the configuration.
Immuta's built-in Nginx Ingress controller will not run with the restricted SCC and must be disabled in this configuration. OpenShift provides its own Ingress controller that can be used for HTTP traffic to the Immuta Web Service. However, because the OpenShift Ingress controller does not support TCP traffic, a separate LoadBalancer service must be used for the Query Engine, and the Public Query Engine Hostname must be updated accordingly.
A data source is how users virtually expose data that lives in remote data platforms across their enterprise to other users. When you expose a data source, you are not copying the data; you are using metadata to tell Immuta how to expose it. Once exposed and subscribed to, the data will be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and sharing. For more information and tutorials about data sources, see .
Policies are fine-grained security controls applied to data sources by Data Owners or Data Governors, who determine the logic behind what is hidden from whom. Immuta offers two policy types: , which determine who can access a data source, and , which determine what data the user sees once they get access to a data source. Through these policies, data is hidden, masked, redacted, and anonymized in the control plane based on the attributes of the users accessing the data and the purpose under which they are acting. For more information and tutorials about policies, see .
Projects allow users to logically group work by linking data sources and can be created to efficiently organize work or to provide special access to data to specific users. The same security restrictions regarding data sources are applied to projects; project members still need to be subscribed to data sources in order to access data, and only users with appropriate attributes and credentials will be able to see the data if it contains any row-level or masking security. However, Project Owners can enable , which improves collaboration by ensuring that the data in the project looks identical to all members, regardless of their level of access to data. When enabled, this feature automatically equalizes all permissions so that no project member has more access to data than the member with the least access. For more detailed discussion and tutorials about projects, see .
All activity in Immuta is audited, and Data Owners and users with the AUDIT
permission can access audit logs that detail who subscribes to each data source, why they subscribe, when they access data, and which files they access. These logs can be used for a number of purposes, including insider threat surveillance and data access monitoring for billing. Audit logs can also be shipped to your enterprise auditing capability, if desired. Similarly, Governors can build Immuta Reports to analyze how data is being used and accessed across Immuta using the Immuta Report Builder. Reports can be based on users, groups, projects, data sources, tags, purposes, policies, and connections within Immuta. For more information and tutorials about audit logs and Immuta Reports, see the and the , respectively.
without any permissions
AWS RDS Postgres (Use the supported version identified in the .)
Azure Database for PostgreSQL (Use the supported version identified in the .)
Google Cloud SQL for PostgreSQL (Use the supported version identified in the .)
Cloud-managed PostgreSQL, such as AWS RDS Postgres, Azure Database for PostgreSQL, or Google Cloud SQL for PostgreSQL (Use the supported version identified in the .)
The Unity Catalog data object model introduces a 3-tiered namespace, as . Consequently, your Databricks tables registered as data sources in Immuta will now reference the catalog, schema (also called a database), and the table.
If a Databricks table is not a Delta table (if it is an ORC, Parquet, or other file format), it must be an external table. This is a Databricks Unity Catalog restriction and is not related to Immuta. See the for details about creating these objects to allow external locations to be used.
For configuration instructions, see the .
The data flow for Unity Catalog is the same as the data flow for the integration.
The feature or integration is enabled.
The feature or integration is disabled.
No configuration is necessary to enable this feature. For more details, see the .
to Immuta data sources is not supported.
(called available until protected by policy on the App Settings page), which makes Immuta clusters available to all Immuta users until protected by a policy, is not supported. You must set IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS
and IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES
to false
in your cluster policies manually or by selecting Protected until made available by policy in the .
For more details on security context restraints and how the user and group ID ranges are allocated, see the .
queryEngine.clientService.type
: Set to LoadBalancer
so that a LoadBalancer will be created to handle the TCP traffic for the Query Engine. The LoadBalancer that OpenShift creates will have its own hostname/IP address, and you must update the Public Query Engine Hostname on the App Settings page accordingly. This step can be omitted if the Query Engine is not being used.
Follow the , but supply the additional values file using the --values
flag in the .
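For example, a sketch of the install command (the release name, chart reference, and file names are placeholders):

```
helm install immuta immuta/immuta \
  --values immuta-values.yaml \
  --values openshift-values.yaml
```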
Audience: System Administrators
Content Summary: The Immuta Helm installation integrates well with Kubernetes on AWS. This guide walks through the various components that can be set up.
Prerequisite: An Amazon EKS cluster with a recommended minimum of 3 m5.xlarge worker nodes.
Using a Kubernetes namespace
If deploying Immuta into a Kubernetes namespace other than the default, you must include the --namespace
option into all helm
and kubectl
commands provided throughout this section.
As of Kubernetes 1.23+ on EKS, you have to configure the EBS CSI driver in order for the Immuta Helm deployment to be able to request volumes for storage. Follow these instructions:
Upon cluster creation, create an IAM policy and role and associate it with the cluster. See AWS documentation for details: Creating the Amazon EBS CSI driver IAM role for service accounts.
Upon cluster creation, add the EBS CSI driver as an add-on to the cluster. See AWS documentation for details: Managing the Amazon EBS CSI driver as an Amazon EKS add-on.
For deploying Immuta on a Kubernetes cluster using the AWS cloud provider, you can mostly follow the Kubernetes Helm installation guide.
The only deviations from that guide are in the custom values file(s) you create. You will want to incorporate any changes referenced throughout this guide, particularly in the Backups and Load Balancing sections below.
Best Practice: Use S3 for Shared Storage
On AWS, Immuta recommends that you use S3 for shared storage.
AWS IAM Best Practices
When using AWS IAM, make sure that you are following the best practices outlined here: AWS IAM Best Practices.
Best Practice: Production and Persistence
If deploying Immuta to a production environment using the built-in metadata database, it is recommended to resize the /
partition on each node to at least 50 GB. The default size for many cloud providers is 20 GB.
To begin, you will need an IAM role that Immuta can use to access the S3 bucket from your Kubernetes cluster. There are four options for role assumption:
IAM Roles for Service Accounts: recommended for EKS.
Kube2iam or kiam: recommended if you have other workloads running in the cluster.
Instance profile: recommended if only Immuta is running in the cluster.
AWS secret access keys: simplest set-up if access keys and secrets are allowed in your environment.
The role you choose above must have at least the following IAM permissions:
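As a rough sketch, a policy granting Immuta access to a backup bucket might look like the following; the bucket name is a placeholder, and the exact action list is an assumption that should be confirmed against the official installation documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-immuta-backup-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-immuta-backup-bucket/*"
    }
  ]
}
```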
The easiest way to expose your Immuta deployment running on Kubernetes with the AWS cloud provider is to set up nginx ingress as serviceType: LoadBalancer
and let the chart handle creation of an ELB.
Best Practices: ELB Listeners Configured to Use SSL
For best performance and to avoid any issues with web sockets, the ELB listeners need to be configured to use SSL instead of HTTPS.
If you are using the included ingress controller, it will create a Kubernetes LoadBalancer Service to expose Immuta outside of your cluster. The following options are available for configuring the LoadBalancer Service:
nginxIngress.controller.service.annotations
: Useful for setting options such as creating an internal load balancer or configuring TLS termination at the load balancer.
nginxIngress.controller.service.loadBalancerSourceRanges
: Used to limit which client IP addresses can access the load balancer.
nginxIngress.controller.service.externalTrafficPolicy
: Useful when working with Network Load Balancers on AWS. It can be set to “Local” to allow the client IP address to be propagated to the Pods.
Possible values for these various settings can be found in the Kubernetes Documentation.
If you would like to use automatic ELB provisioning, you can use the following values:
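A sketch of such values is shown below. The chart keys beyond those cited in this guide are assumptions, while the annotations shown are the standard Kubernetes AWS load balancer annotations for SSL listeners:

```yaml
nginxIngress:
  controller:
    service:
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: ssl
        service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
```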
You can then manually edit the ELB configuration in the AWS console to use ACM TLS certificates to ensure your HTTPS traffic is secured by a trusted certificate. For instructions on doing this, please see Amazon's guide on how to Configure an HTTPS Listener for Your Classic Load Balancer.
Another option is to set up nginx ingress with serviceType: NodePort
and configure load balancers outside of the cluster.
For example,
In order to determine the ports to configure the load balancer for, examine the Service configuration:
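For example, a sketch using kubectl (the service name is a placeholder for the nginx ingress controller service in your release):

```
kubectl get svc immuta-nginx-ingress-controller \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.nodePort}{"\n"}{end}'
```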
This will print out the port name and port. For example,
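Illustrative output (the node ports are placeholders assigned by Kubernetes):

```
http    30080
https   30443
```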
The Immuta deployment to EKS has a very low maintenance burden.
Best Practices: Installation Maintenance
Immuta recommends the following basic procedures for monitoring and periodic maintenance of your installation:
Periodically examine the contents of S3 to ensure database backups exist for the expected time range.
Ensure your Immuta installation is current and update it if it is not per the update instructions.
Be aware of the solutions to common management tasks for Kubernetes deployments.
If kubectl
does not meet your monitoring needs, we recommend installing the Kubernetes Dashboard using the AWS provided instructions.
Ensure that your Immuta deployment is taking regular backups to AWS S3.
Your Immuta deployment is highly available and resilient to failure. For some catastrophic failures, recovery from backup may be required. Below is a list of failure conditions and the steps necessary to ensure Immuta is operational.
Internal Immuta Service Failure: Because Immuta is running in a Kubernetes deployment, no action should be necessary. Should a failure occur that is not automatically resolved, follow Immuta backup restoration procedures.
EKS Cluster Failure: Should your EKS cluster experience a failure, simply create a new cluster and follow Immuta backup restoration procedures.
Availability Zone Failure: Because EKS and ELB as well as the Immuta installation within EKS are designed to tolerate the failure of an availability zone, there are no steps needed to address the failure of an availability zone.
Region Failure: To provide recovery capability in the unlikely event of an AWS Region failure, Immuta recommends periodically copying database backups into an S3 bucket in a different AWS region. Should you experience a region failure, simply create a new cluster in a working region and follow Immuta backup restoration procedures.
See the AWS Documentation for more information on managing service limits to allow for proper disaster recovery.
Deprecation notice
Support for this integration has been deprecated. This integration will be removed in the 2024.2 LTS release.
Enabling Unity Catalog
The integration cannot be disabled once enabled, as it will permanently migrate all data sources to support the additional Unity Catalog controls and hierarchy. Unity Catalog support in Immuta is enabled globally across all Databricks data sources and integrations.
Databricks Runtime 11.3.
Unity Catalog enabled on your Databricks cluster.
Unity Catalog metastore created and attached to a Databricks workspace.
The metastore owner you are using to manage permissions has been granted access to all catalogs, schemas, and tables that will be protected by Immuta. Data protected by Immuta should only be granted to privileged users in Unity Catalog so that the only view of that data is through an Immuta-enabled cluster.
You have generated a personal access token for the metastore owner that Immuta can use to read data in Unity Catalog.
You do not plan to use non-Unity Catalog enabled clusters with Immuta data sources. Once enabled, all access to data source tables must be on Databricks clusters with Unity Catalog enabled on runtime 11.3.
In Unity Catalog, catalogs manage permissions across a set of databases.
Create a new catalog on a non-Immuta cluster as the metastore admin, who is tied to a specific metastore attached to one or more Databricks workspaces. That way, the catalog will be owned by the metastore admin, which provides broad permissions to grant or revoke access to objects in the catalog for other users. If this catalog is intended to be protected by Immuta, the data should not be granted to other users besides the metastore admin.
You can opt to set the default catalog for queries run without explicitly specifying the catalog for a table by adding the following Spark configuration to your Databricks cluster:
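The exact configuration key should be confirmed against the Databricks documentation for your runtime; as an assumed sketch:

```
# Assumed configuration key; verify against Databricks documentation. The catalog
# name is a placeholder.
spark.databricks.sql.initial.catalog.name my_catalog
```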
This configuration does not limit the cluster to only using this catalog; it merely sets the default for queries run without explicitly specifying the catalog for a table.
Click the App Settings icon in the left sidebar.
Scroll to the Global Integration Settings section and check the Enable Databricks Unity Catalog support in Immuta checkbox.
Complete the following fields:
Workspace Host Name: The hostname (also known as the instance name) of a Databricks workspace instance on an account you want to connect to Immuta. This Databricks workspace is used to run short duration Databricks jobs so that Immuta can pull a token for the metastore owner.
Databricks Account Administrator Personal Access Token: Immuta requires you to provide a personal access token of a Databricks metastore administrator so that Immuta can protect all the data sources available. Databricks metastore administrators are set by changing the owner of a metastore in the account console (or using DDL statements by an account-level administrator). Metastores can be owned by a group, which enables more than one user to act as an owner.
Schedule: Immuta uses the administrator token to keep the Immuta-enabled clusters synchronized and needs to periodically refresh it to ensure that the cluster does not use an expired token. This schedule is in cron syntax and will be used to launch the synchronization job.
The default value runs the token sync job at midnight daily. This cadence should be sufficient for most Unity Catalog configurations; however, if the timing of the job is problematic, you can adjust the schedule to run at a more convenient time of day.
Token Sync Retries: The number of attempts Immuta will perform to re-request the token. The default value should work for most systems, but in environments with networking or load issues consider increasing this number.
Save the configuration.
After saving the configuration, Immuta will be configured to use Unity Catalog data sources and will automatically sync the Databricks metastore administrator API token, which is required for the integration to correctly view and apply policies to any data source in Databricks.
Check that your token sync job was correctly run in Databricks. Navigate to Workflows and click the Job runs tab. Search for a job that starts with Immuta Unity Token Sync.
If the token sync fails, log messages will appear in the Web Service logs; these should help diagnose cases where the connection to Databricks is not functioning. If the token is not synchronized correctly, the following error will appear when performing actions in Databricks:
If the token expires, the following error will appear when performing actions on any Immuta-enabled Databricks cluster: ImmutaException: 403: Invalid access token.
In this case, you can re-run the token sync job by modifying the schedule for token synchronization on the App Settings page. When the configuration is saved, the token synchronization job will run again immediately (regardless of schedule) and will refresh the token. Consider shortening the window between token synchronization jobs by editing the schedule if you see this error.
If you already have a Databricks Spark integration configured, follow the Enable Unity Catalog Support for an Existing Databricks Spark Integration guide.
Existing Data Sources
Existing data sources will reference the default catalog, hive_metastore
, once Unity Catalog is enabled. However, this default catalog will not be used when you create new data sources.
If you already have an Immuta Databricks Spark integration configured, follow the steps below to enable Unity Catalog support in Immuta.
Enable Unity Catalog support on the App Settings page.
Re-push cluster policies to your Databricks cluster. Note that you must set IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS
and IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES
to false
in your cluster policies manually or by selecting Protected until made available by policy in the Databricks integration section of the App Settings page. See the Databricks Spark integration with Unity Catalog support limitations for details.
Re-start your Databricks cluster with the new cluster policy applied.
Migration troubleshooting
If multiple Snowflake integrations are enabled, they will all migrate together. If one fails, they will all revert to the Snowflake Standard integration.
If an error occurs during migration and the integration cannot be reverted, the integration must be disabled and re-enabled.
Click the App Settings icon in the left sidebar.
Click Preview Features in the left panel.
Scroll to the Native Snowflake Governance Controls modal and check the checkbox.
Using the credentials entered to enable the Snowflake integration, fill out the Username and Password or Key Pair.
Click Save.
Click Confirm.
Cluster 1 (Databricks Runtime 9.1): Unavailable, Unavailable
Cluster 2 (Databricks Runtime 10.4): Unavailable, Unavailable
Cluster 3 (Databricks Runtime 11.3): Unavailable
Cluster 4 (Databricks Runtime 11.3)
Cluster 5 (Databricks Runtime 11.3)
Audience: System Administrators
Content Summary: This guide illustrates the deployment of an Immuta cluster on Microsoft Azure Kubernetes Service. Requirements may vary depending on the Azure Cloud environment and/or region. For comprehensive assistance, please contact an Immuta Support Professional.
This guide is intended to supplement the main Helm installation guide, which is referred to often throughout this page.
Prerequisites:
Node Size: Immuta's suggested minimum Azure VM size for Azure Kubernetes Service deployments is Standard_D3_v2 (4 vCPU, 14 GB RAM, 200 GB SSD) or equivalent. The Immuta Helm installation requires a minimum of 3 nodes. Additional nodes can be added on demand.
TLS Certificates: See the main Helm installation guide for TLS certificate requirements.
To install Azure CLI 2.0, please visit Microsoft's documentation and follow the instructions for your chosen platform. You can also use the Azure Cloud Shell.
For more information on nodes, see the Azure VM sizing documentation.
Before installing Immuta, you will need to spin up your AKS cluster. If you would like to install Immuta on an existing AKS cluster, you can skip this step. If you wish to deploy a dedicated cluster for Immuta, please visit Deploying Immuta Cluster Infrastructure on AKS.
Navigate to the installation method of your choice:
Please see the main helm installation guide for the full walkthrough of installing Immuta via our Helm Chart. This section will focus on the specific requirements for the helm installation on AKS.
Since you are deploying Immuta as an Azure cloud application in AKS, you can easily configure the Nginx Ingress Controller that is bundled with the Immuta Helm deployment as a load balancer using the generated hostname from Azure.
Confirm that you have the following configurations in your values.yaml
file before deploying:
If you are using the included ingress controller, it will create a Kubernetes LoadBalancer Service to expose Immuta outside of your cluster. The following options are available for configuring the LoadBalancer Service:
nginxIngress.controller.service.annotations
: Useful for setting options such as creating an internal load balancer or configuring TLS termination at the load balancer.
nginxIngress.controller.service.loadBalancerSourceRanges
: Used to limit which client IP addresses can access the load balancer.
nginxIngress.controller.service.externalTrafficPolicy
: Useful when working with Network Load Balancers on AWS. It can be set to “Local” to allow the client IP address to be propagated to the Pods.
Possible values for these various settings can be found in the Kubernetes Documentation.
After running helm install
, you can find the public IP address of the nginx controller by running
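For example, list the services and read the EXTERNAL-IP column of the ingress controller's LoadBalancer service (add --namespace if you deployed to a non-default namespace):

```
kubectl get svc
```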
If the public IP address shows up as <pending>
, wait a few moments and check again. Once you have the IP address, run the following commands to configure the Immuta Azure Cloud Application to use your ingress controller:
Shortly after running these commands, you should be able to reach the Immuta console in your web browser at the configured externalHostName
.
Best Practice: Network Security Group
Immuta recommends that you set up the network security group for the Immuta cluster to be closed to public traffic outside of your organization. If your organization already has rules and guidelines for your Azure Cloud Application security groups, then you should adhere to those. Otherwise, we recommend visiting Microsoft's documentation page for configuring Network security groups to find a solution that fits your environment.
To configure backups with Azure, see the backup section in the Immuta Helm Chart.
If you've previously provisioned an AKS cluster (see Deploying Immuta Cluster Infrastructure on AKS) and have installed the Installation Prerequisites, you can run an automated script that will
Prepare the Helm values file,
Register the required secrets to pull Immuta's Docker images,
Run the Helm installation, and
Create the mapping between the external IP address Ingress Controller (the cluster's load balancer) and the cluster's public DNS name.
Please Note
Running the automated deployment script will make a series of decisions for you:
The TLS certificates will be generated on-the-fly and will be self-signed. You can easily change this later by following the instructions in the main Helm installation guide.
The number of replicas from each component will be automatically derived from your AKS cluster's node count. This can be easily modified by overriding the replicas parameter.
The installation will set up backup volumes by default. Set the BACKUPS
value to 0
to disable Immuta backups.
Download the script:
Make it executable by running:
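For example (the script file name is a placeholder for the file you downloaded):

```
chmod +x ./<downloaded-script>.sh
```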
Below is the list of the parameters that the script accepts. These parameters are environment variables that are prepended to the execution command.
CLUSTER_NAME: The name of your AKS cluster. (Required)
SUBSCRIPTION_ID: The Azure Subscription ID. (Required)
CLUSTER_RESOURCE_GROUP: The resource group that contains the cluster. (Required)
DOCKER_USERNAME: Obtain from your Immuta support professional. (Required)
DOCKER_PASSWORD: Obtain from your Immuta support professional. (Required)
DB_PASSWORD: An arbitrary metadata database password. (Required)
DB_SUPERUSER_PASSWORD: An arbitrary metadata database super-user password. (Required)
DB_REPLICATION_PASSWORD: An arbitrary metadata database replication password. (Required)
DB_PATRONI_API_PASSWORD: An arbitrary metadata database Patroni API password. (Required)
QE_PASSWORD: An arbitrary Query Engine password. (Required)
QE_SUPERUSER_PASSWORD: An arbitrary Query Engine super-user password. (Required)
QE_REPLICATION_PASSWORD: An arbitrary Query Engine replication password. (Required)
QE_PATRONI_API_PASSWORD: An arbitrary Query Engine Patroni API password. (Required)
IMMUTA_VERSION: The version tag of the desired Immuta installation. (Optional; default: the current Immuta version)
IMMUTA_K8S_NAMESPACE: The Kubernetes namespace to create and deploy Immuta to. (Optional; default: default)
REPLICAS: The number of replicas of each main component in the cluster. (Optional; default: 1)
BACKUPS: Whether or not backups should be enabled. (Optional; default: 1)
SA_RESOURCE_GROUP: Backup Storage Account resource group. (Optional; default: same as CLUSTER_RESOURCE_GROUP)
To run the script and deploy, you can simply prepend the above-mentioned parameters to the execution command, with the action deploy
. For example,
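A sketch of such an invocation is shown below; the script file name and all values are placeholders, and the exact way the action is passed may differ in your copy of the script:

```
CLUSTER_NAME=my-aks-cluster \
SUBSCRIPTION_ID=00000000-0000-0000-0000-000000000000 \
CLUSTER_RESOURCE_GROUP=my-resource-group \
DOCKER_USERNAME=<from Immuta support> DOCKER_PASSWORD=<from Immuta support> \
DB_PASSWORD=<arbitrary> DB_SUPERUSER_PASSWORD=<arbitrary> \
DB_REPLICATION_PASSWORD=<arbitrary> DB_PATRONI_API_PASSWORD=<arbitrary> \
QE_PASSWORD=<arbitrary> QE_SUPERUSER_PASSWORD=<arbitrary> \
QE_REPLICATION_PASSWORD=<arbitrary> QE_PATRONI_API_PASSWORD=<arbitrary> \
./<downloaded-script>.sh deploy
```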
You can use the same script to destroy a deployment you had previously run with this script, by running the following command:
The value of CLUSTER_NAME should be identical to the CLUSTER_NAME value you used to deploy Immuta.
Private preview
This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Snowflake column lineage specifies how data flows from source tables or columns to the target tables in write operations. When Snowflake lineage tag propagation is enabled in Immuta, Immuta automatically applies tags added to a Snowflake table to its descendant data source columns in Immuta so you can build policies using those tags to restrict access to sensitive data.
Snowflake Access History tracks user read and write operations. Snowflake column lineage extends this Access History to specify how data flows from source columns to the target columns in write operations, allowing data stewards to understand how sensitive data moves from ancestor tables to target tables so that they can
trace data back to its source to validate the integrity of dashboards and reports,
identify who performed write operations to meet compliance requirements,
evaluate data quality and pinpoint points of failure, and
tag sensitive data on source tables without having to tag columns on their descendant tables.
However, tagging sensitive data doesn’t innately protect that data in Snowflake; users need Immuta to disseminate these lineage tags automatically to descendant tables registered in Immuta so data stewards can build policies using the semantic and business context captured by those tags to restrict access to sensitive data. When Snowflake lineage tag propagation is enabled, Immuta propagates tags applied to a data source to its descendant data source columns in Immuta, which keeps your data inventory in Immuta up-to-date and allows you to protect your data with policies without having to manually tag every new Snowflake data source you register in Immuta.
An application administrator enables the feature on the Immuta app settings page.
Snowflake lineage metadata (column names and tags) for the Snowflake tables is stored in the metadata database.
A data owner creates a new data source (or adds a new column to a Snowflake table) that initiates a job that applies all tags for each column from its ancestor columns.
A data owner or governor adds a tag to a column in Immuta that has descendants, which initiates a job that propagates the tag to all descendants.
An audit record is created that includes which tags were applied and from which columns those tags originated.
The Snowflake Account Usage ACCESS_HISTORY
view contains column lineage information.
To appropriately propagate tags to descendant data sources, Immuta fetches Access History metadata to determine what column tags have been updated, stores this metadata in the Immuta metadata database, and then applies those tags to relevant descendant columns of tables registered in Immuta.
Consider the following example using the Customer, Customer 2, and Customer 3 tables that were all registered in Immuta as data sources.
Customer: source table
Customer 2: descendant of Customer
Customer 3: descendant of Customer 2
If the Discovered.Electronic Mail Address
tag is added to the Customer data source in Immuta, that tag will propagate through lineage to the Customer 2 and Customer 3 data sources.
After an application administrator has enabled Snowflake lineage tag propagation, data owners can register data in Immuta and have tags in Snowflake propagated from ancestor tables to descendant data sources. Whenever new tags are added to those tables in Immuta, those upstream tags will propagate to descendant data sources.
By default all tags are propagated, but these tags can be filtered on the app settings page or using the Immuta API.
Lineage tag propagation works with any tag added to the data dictionary. Tags can be manually added, synced from an external catalog, or discovered by SDD. Consider the following example using the Customer, Customer 2, and Customer 3 tables that were all registered in Immuta as data sources.
Customer: source table
Customer 2: descendant of Customer
Customer 3: descendant of Customer 2
Immuta added the Discovered.Electronic Mail Address
tag to the Customer data source, and that tag propagated through lineage to the Customer 2 and Customer 3 data sources.
Removing the tag from the Customer 2 table soft deletes it from the Customer 2 data source. When a tag is deleted, downstream lineage tags are removed, unless another parent data source still has that tag. The tag remains visible, but it will not be re-added if a future propagation event specifies the same tag again. Immuta prevents you from removing Snowflake object tags from data sources. You can only remove Immuta-managed tags. To remove Snowflake object tags from tables, you must remove them in Snowflake.
However the Discovered.Electronic Mail Address
tag still applies to the Customer 3 data source because Customer still has the tag applied. The only way a tag will be removed from descendant data sources is if no other ancestor of the descendant still prescribes the tag.
If the Snowflake lineage tag propagation feature is disabled, tags will remain on Immuta data sources.
Sensitive data discovery will still run on data sources and can be manually triggered. Tags applied through sensitive data discovery will propagate as tags added through lineage to descendant Immuta data sources.
Immuta audit records include Snowflake lineage tag events when a tag is added or removed.
The example audit record below illustrates the SNOWFLAKE_TAGS.pii
tag successfully propagating from the Customer table to Customer 2:
Without tableFilter
set, Immuta will ingest lineage for every table on the Snowflake instance.
Tag propagation based on lineage is not retroactive. For example, if you add a table, add tags to that table, and then run the lineage ingestion job, tags will not get propagated. However, if you add a table, run the lineage ingestion job, and then add tags to the table, the tags will get propagated.
The native lineage job needs to pull in lineage data before any tag is applied in Immuta. When Immuta gets new lineage information from Snowflake, Immuta does not update existing tags in Immuta.
There can be up to a 3-hour delay in Snowflake for a lineage event to make it into the ACCESS_HISTORY
view.
Immuta does not ingest lineage information for views.
Snowflake only captures lineage events for CTAS
, CLONE
, MERGE
, and INSERT
write operations. Snowflake does not capture lineage events for DROP
, RENAME
, ADD
, or SWAP
. Instead of using these latter operations, you need to recreate a table with the same name if you need to make changes.
Immuta cannot enforce coherence of your Snowflake lineage. If a column, table, or schema in the middle of the lineage graph gets dropped, Immuta will not do anything unless a table with that same name gets recreated. This means a table that gets dropped but not recreated could live in Immuta’s system indefinitely.
To migrate from the private preview version of table grants (available before September 2022) to the GA version, complete the steps below.
Navigate to the App Settings page.
Click Integration Settings in the left panel, and scroll to the Global Integration Settings section.
Uncheck the Snowflake Table Grants checkbox to disable the feature.
Click Save. Wait for about 1 minute per 1000 users. This gives time for Immuta to drop all the previously created user roles.
Use the to re-enable the feature.
Private preview
This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
Snowflake Enterprise Edition
Contact your Immuta representative to enable this feature in your Immuta tenant.
Navigate to the App Settings page and click the Integration tab.
Click +Add Native Integration and select Snowflake from the dropdown menu.
Complete the Host, Port, and Default Warehouse fields.
Enable Native Query Audit.
Enable Native Lineage and complete the following fields:
Ingest Batch Sizes: This setting configures the number of rows Immuta ingests per batch when streaming Access History data from your Snowflake instance.
Table Filter: This filter determines which tables Immuta will ingest lineage for. Enter a regular expression that excludes /
from the beginning and end to filter tables. Without this filter, Immuta will attempt to ingest lineage for every table on your Snowflake instance.
Tag Filter: This filter determines which tags to propagate using lineage. Enter a regular expression that excludes /
from the beginning and end to filter tags. Without this filter, Immuta will ingest lineage for every tag on your Snowflake instance.
Opt to enable Automatically ingest Snowflake object tags.
Select Manual or Automatic Setup and
The Snowflake lineage sync endpoint triggers the native lineage ingestion job that allows Immuta to propagate Snowflake tags added through lineage to Immuta data sources.
Copy the example request (a sketch follows the parameter list below) and replace the Immuta URL and API key with your own.
Change the payload attribute values to your own, where
tableFilter
(string): This regular expression determines which tables Immuta will ingest lineage for. Enter a regular expression that excludes /
from the beginning and end to filter tables. Without this filter, Immuta will attempt to ingest lineage for every table on your Snowflake instance.
batchSize
(integer): This parameter configures the number of rows Immuta ingests per batch when streaming Access History data from your Snowflake instance. Minimum 1.
lastTimestamp
(string): Setting this parameter will only return lineage events later than the value provided. Use a format like 2022-06-29T09:47:06.012-07:00.
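A sketch of such a request is shown below. The endpoint path and authorization header format are placeholders; only the payload keys (tableFilter, batchSize, lastTimestamp) are taken from the parameters described above:

```
curl -X POST "https://<your-immuta-host>/<snowflake-lineage-sync-endpoint>" \
  -H "Authorization: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
        "tableFilter": "MY_DATABASE\\.MY_SCHEMA\\..*",
        "batchSize": 1000,
        "lastTimestamp": "2022-06-29T09:47:06.012-07:00"
      }'
```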
Once the sync job is complete, you can complete the following steps:
Audience: System Administrators
Content Summary: This guide details how to manually update your Databricks cluster after changes to the Immuta init script or cluster policies are made.
If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.
Log in to Immuta as an Application Admin.
Click the App Settings icon in the left sidebar and click the Integrations tab.
Your existing Databricks integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.
Click Add Native Integration and select Databricks Integration to add a new integration.
Enter your Databricks integration settings again as configured previously.
Click Add Native Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.
Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.
Use the tabs below to view instructions for automatically pushing cluster policies and the init script (recommended) or manually updating your cluster policies.
Automatically Push Cluster Policies
Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.
Select Apply Policies to push the cluster policies and init script again.
Click Save and Confirm to deploy your changes.
Manually Update Cluster Policies
Download the init script and the new cluster policies to your local computer.
Click Save and Confirm to save your changes in Immuta.
Log in to your Databricks workspace with your administrator account to set up cluster policies.
Get the path you will upload the init script (immuta_cluster_init_script_proxy.sh
) to by opening one of the cluster policy .json
files and looking for the defaultValue
of the field init_scripts.0.dbfs.destination
. This should be a DBFS path in the form of dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh
.
Click Data in the left pane to upload your init script to DBFS to the path you found above.
To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.
Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.
Restart any Databricks clusters using these updated policies for the changes to take effect.
Audience: System Administrators
Content Summary: This page describes how to hide the
immuta
database in Databricks.
Hiding the database does not disable access to it
Queries can still be performed against tables in the immuta
database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table
) regardless of whether or not this feature is enabled.
The immuta
database on Immuta-enabled clusters allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, Immuta supports raw tables in Databricks, so table-backed queries do not need to reference this database. When configuring a Databricks cluster, you can hide immuta
from any calls to SHOW DATABASES
so that users are not confused or misled by that database.
Hiding the immuta Database
When configuring a Databricks cluster, hide the immuta database by using the following environment variable in the Spark cluster configuration:
Then, Immuta will not show this database when a SHOW DATABASES
query is performed.
Audience: System Administrators
Content Summary: This page describes the Python & SQL & R cluster policy.
Additional Overhead
Compared to the Python & SQL cluster policy, this configuration trades some additional overhead for added support of the R language.
In this configuration, you are able to rely on the Databricks-native security controls. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes.
Like the Python & SQL configuration, Py4j security is enabled for the Python & SQL & R configuration. However, because R has been added, Immuta enables the SecurityManager, in addition to Py4j security, to provide more security guarantees. For example, by default all actions in R execute as the root user; among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To address these security issues, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user with limited filesystem and network access and installs the Immuta SecurityManager, which prevents users from bypassing policies and protects against the above vulnerabilities from within the JVM.
Consequently, the cost of introducing R is that the SecurityManager incurs a small increase in performance overhead; however, average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
Many Python ML classes (such as LogisticRegression
, StringIndexer
, and DecisionTreeClassifier
) and dbutils.fs are not supported with Py4j security enabled. Users will also be unable to use the Databricks Connect client library.
When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases. However, in some native patterns (such as Snowflake), Immuta adds views to allow users access to Immuta sources but does not impede access to preexisting sources in the underlying database. Therefore, if a user had access in Snowflake to a table before Immuta was installed, they would still have access to that table after.
Unlike the example above, Databricks non-admin users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The Limited Enforcement Scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
This feature is composed of two configurations:
Allowing non-Immuta reads: Immuta users with regular (unprivileged) Databricks roles may SELECT from tables that are not registered in Immuta.
Allowing non-Immuta writes: Immuta users with regular (unprivileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta.
Additionally, Immuta supports auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not. To configure Immuta to do so, navigate to the Enable Auditing of All Queries in Databricks section.
Non-Immuta Reads
This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL.
When non-Immuta reads are enabled, Immuta users will see all databases and tables when they run show databases and/or show tables. However, this does not mean they will be able to query all of them.
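A minimal sketch of the distinction, assuming a hypothetical table other_db.events that is not registered in Immuta:

```python
# Allowed with non-Immuta reads enabled: querying an unregistered table through Spark SQL.
df = spark.sql("SELECT * FROM other_db.events")

# Still not allowed under this setting: reading the underlying files directly.
# spark.read.format("parquet").load("s3://example-bucket/events/")
```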
Enable non-Immuta Reads by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
Opt to adjust the cache duration by changing the default value in the Spark environment variables (recommended) or immuta_conf.xml (not recommended). (Immuta caches whether a table has been exposed as an Immuta source to improve performance. The default caching duration is 1 hour.)
Non-Immuta Writes
These non-protected tables/spaces have the same exposure as detailed in the read section, but with the distinction that users can write data directly to these paths.
With non-Immuta writes enabled, it will be possible for users on the cluster to mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data, and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s native workspaces.
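For example, with non-Immuta writes enabled a user could write derived results to an unregistered location; all database and table names below are hypothetical:

```python
# Create and populate a table that is not registered in Immuta.
spark.sql("CREATE DATABASE IF NOT EXISTS sandbox")
spark.sql("""
    CREATE TABLE sandbox.daily_counts AS
    SELECT event_date, COUNT(*) AS n
    FROM other_db.events
    GROUP BY event_date
""")
```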
Enable non-Immuta Writes by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
Opt to adjust the cache duration by changing the default value in the Spark environment variables (recommended) or immuta_conf.xml (not recommended). (Immuta caches whether a table has been exposed as an Immuta source to improve performance. The default caching duration is 1 hour.)
Enable support for auditing all queries run on a Databricks cluster (regardless of whether users touch Immuta-protected data or not) by setting this configuration in the Spark environment variables (recommended) or immuta_conf.xml (not recommended):
The controls and default values associated with non-Immuta reads, non-Immuta writes, and audit functionality are outlined below.
Audience: System Administrators
Content Summary: This page describes the sparklyr cluster policy.
Single-User Clusters Recommended
Like Databricks, Immuta recommends single-user clusters for sparklyr when user isolation is required. A single-user cluster can either be a job cluster or a cluster with credential passthrough enabled. Note: spark-submit jobs are not currently supported.
Two cluster types can be configured with sparklyr: Single-User Clusters (recommended) and Multi-User Clusters (discouraged).
Single-User Clusters: Credential Passthrough (required on Databricks) allows a single-user cluster to be created. This setting automatically configures the cluster to assume the role of the attached user when reading from storage (S3). Because Immuta requires that raw data is readable by the cluster, the instance profile associated with the cluster should be used rather than a role assigned to the attached user.
Multi-User Clusters: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, it is not recommended to deploy a multi-user cluster. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).
In addition to the configuration for an Immuta cluster with R, add this environment variable to the Environment Variables section of the cluster:
This configuration makes changes to the iptables rules on the cluster to allow the sparklyr client to connect to the required ports on the JVM used by the sparklyr backend service.
Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so library(sparklyr) in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from CRAN. Additionally, loading library(DBI) will allow you to execute SQL queries.
Set up a sparklyr connection:
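A minimal sketch using standard sparklyr on Databricks (no Immuta-specific options are involved):

```r
library(sparklyr)

# Connect to the cluster's existing Spark session.
sc <- spark_connect(method = "databricks")
```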
Pass the connection object to execute queries:
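For example, via the DBI interface that sparklyr connections support; the table name follows the Immuta-qualified pattern described earlier:

```r
library(DBI)

# Run a SQL query through the sparklyr connection.
results <- dbGetQuery(sc, "SELECT * FROM immuta.my_schema_my_table LIMIT 100")
head(results)
```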
Add the following items to the Spark Config section of the cluster:
The trustedFileSystems setting is required to allow Immuta’s wrapper FileSystem (used in conjunction with the ImmutaSecurityManager for data security purposes) to be used with credential passthrough. Additionally, the InstanceProfileCredentialsProvider must be configured to continue using the cluster’s instance profile for data access, rather than a role associated with the attached user.
Immuta Discourages Deploying Multi-User Clusters with sparklyr Configuration
It is possible, but not recommended, to deploy sparklyr on a multi-user cluster. Immuta cannot guarantee user isolation in a multi-user sparklyr configuration.
The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta native workspaces.
Add the following environment variables to the Environment Variables section of your cluster configuration:
Add the following items to the Spark Config section:
Immuta’s integration with sparklyr does not currently support
spark-submit jobs,
UDFs, or
Databricks Runtimes 5, 6, or 7.
Audience: System Administrators
Content Summary: This page outlines how to access DBFS in Databricks for non-sensitive data. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or the immuta_conf.xml file (not recommended).
DBFS FUSE Mount Limitation
This feature cannot be used in environments with E2 Private Link enabled.
This feature (provided by Databricks) mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.
For example, in Python:
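A short sketch; the path is only an example, and any location under /dbfs behaves the same way once the mount is enabled:

```python
import os

# With the FUSE mount enabled, DBFS paths can be used like local files.
path = "/dbfs/tmp/example/notes.txt"
os.makedirs(os.path.dirname(path), exist_ok=True)

with open(path, "w") as f:
    f.write("written through the DBFS FUSE mount\n")

with open(path) as f:
    print(f.read())
```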
Note: This solution also works in R and Scala.
To enable the DBFS FUSE mount, set this configuration: immuta.spark.databricks.dbfs.mount.enabled=true.
Mounting a Bucket
Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount.
Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.
Mounting must be performed from a non-Immuta cluster.
Scratch paths will work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example:
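A notebook cell listing a configured scratch path with %fs magic (the bucket and prefix are hypothetical):

```
%fs ls s3://your-scratch-bucket/analysis/
```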
To support %fs magic and Scala DBUtils with scratch paths, configure
To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.
This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.
Get the file from remote storage:
Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:
Write the new file back to remote storage:
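A combined sketch of the three steps above, assuming hypothetical paths; dbutils is the utility object Databricks provides in notebooks:

```python
import shutil

# 1. Get the file from remote storage onto the driver's local filesystem.
remote_path = "s3://your-scratch-bucket/config/settings.json"
dbutils.fs.cp(remote_path, "file:/tmp/settings.json")

# 2. The fetched copy is read-only and owned by root, so copy its contents
#    before editing (copyfile creates a new, writable file).
shutil.copyfile("/tmp/settings.json", "/tmp/settings_editable.json")
with open("/tmp/settings_editable.json", "a") as f:
    f.write("\n# edited locally\n")

# 3. Write the new file back to remote storage.
dbutils.fs.cp("file:/tmp/settings_editable.json", remote_path)
```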
Databricks metastore magic allows you to migrate your data from the Databricks legacy Hive metastore to the Unity Catalog metastore while protecting data and maintaining your current processes in a single Immuta instance.
Databricks metastore magic is for customers who intend to use either the Databricks Spark integration with Unity Catalog support or the Databricks Unity Catalog integration, but who would like to protect tables in the Hive metastore.
Databricks metastore magic requires that the Databricks Spark integration is enabled in Immuta.
Databricks has two built-in metastores that contain metadata about your tables, views, and storage credentials:
Legacy Hive metastore: Created at the workspace level. This metastore contains metadata of the configured tables in that workspace available to query.
Unity Catalog metastore: Created at the account level and is attached to one or more Databricks workspaces. This metastore contains metadata of the configured tables available to query. All clusters on that workspace use the configured metastore and all workspaces that are configured to use a single metastore share those tables.
Databricks allows you to use the legacy Hive metastore and the Unity Catalog metastore simultaneously. However, Unity Catalog does not support controls on the Hive metastore, so you must attach a Unity Catalog metastore to your workspace and move existing databases and tables to the attached Unity Catalog metastore to use the governance capabilities of Unity Catalog.
Immuta's Databricks Spark integration and Unity Catalog integration enforce access controls on the Hive and Unity Catalog metastores, respectively. However, because these metastores have two distinct security models, users were discouraged from using both in a single Immuta instance before metastore magic; the Databricks Spark integration and Unity Catalog integration were unaware of each other, so using both concurrently caused undefined behavior.
Metastore magic reconciles the distinct security models of the legacy Hive metastore and the Unity Catalog metastore, allowing you to use multiple metastores (specifically, the Hive metastore alongside Unity Catalog metastores) within a Databricks workspace and a single Immuta instance and keep policies enforced on all your tables as you migrate them. The diagram below shows Immuta enforcing policies on registered tables across workspaces.
In clusters A and D, Immuta enforces policies on data sources in each workspace's Hive metastore and in the Unity Catalog metastore shared by those workspaces. In clusters B, C, and E (which don't have Unity Catalog enabled in Databricks), Immuta enforces policies on data sources in the Hive metastores for each workspace.
With metastore magic, the Databricks Spark integration enforces policies only on data in the Hive metastore, while the Databricks Spark integration with Unity Catalog support or the Unity Catalog integration enforces policies on tables in the Unity Catalog metastore. The table below illustrates this policy enforcement.
Databricks Spark integration with Unity Catalog support and Databricks Unity Catalog integration
Enabling the Databricks Spark integration with Unity Catalog support and the Databricks Unity Catalog integration is not supported. Do not use both integrations to enforce policies on your table.
Databricks SQL cannot run the Databricks Spark plugin to protect tables, so Hive metastore data sources will not be policy enforced in Databricks SQL.
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Essentially, you have two options to enforce policies on all your tables as you migrate, after you have attached a Unity Catalog metastore to your workspace:
Enforce plugin-based policies on all tables: Enable the Databricks Spark integration with Unity Catalog support. For details about plugin-based policies, see the Immuta policy documentation.
Enforce plugin-based policies on Hive metastore tables and Unity Catalog native controls on Unity Catalog metastore tables: Enable the Databricks Spark integration and the Databricks Unity Catalog integration. Some Immuta policies are not supported in the Databricks Unity Catalog integration; reach out to your Immuta representative for documentation of these limitations.
To enforce policies on data sources in Databricks SQL, use Databricks table access controls to manually lock down Hive metastore data sources and the Databricks Unity Catalog integration to protect tables in the Unity Catalog metastore. Table access control is enabled by default on SQL warehouses, and any Databricks cluster without the Immuta plugin must have table access control enabled.
Legend: each cell in the table indicates whether the corresponding feature or integration is enabled, disabled, or unavailable for that cluster configuration.
[Table: integration support by example cluster configuration, covering policy enforcement on the Hive metastore and the Unity Catalog metastore — Cluster 1 (Databricks Runtime 9.1), Cluster 2 (Runtime 10.4), Cluster 3 (Runtime 11.3), Cluster 4 (Runtime 11.3), and Cluster 5 (Runtime 11.3). Unity Catalog, and the integrations that depend on it, are unavailable on Runtimes 9.1 and 10.4.]