Immuta can scan your data sources and apply relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
Registered Snowflake, Databricks, Redshift, or Starburst (Trino) data sources
Immuta permission GOVERNANCE
Sensitive data discovery (SDD) is an Immuta Discover feature that identifies your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
To learn more, see the Data discovery page.
Enable sensitive data discovery on your tenant. Opt to have SDD run automatically for new data sources by setting a global framework, or run SDD granularly by applying data sources to specific frameworks.
For additional control, create your own identifiers to recognize the data that matters to you. Add these identifiers to new frameworks and specify the data sources that need this framework. This fine-level control creates automatic tagging that is relevant and accurate to your data, requiring fewer manual adjustments to the resulting tags.
Customize SDD for your data:
If you have any tags that are applied to your data sources by SDD that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again if identification is re-run.
Reference pages:
Immuta comes with a default framework containing built-in Discovered tags and built-in identifiers. These identifiers and tags can be used in your own frameworks.
Classification is an Immuta Discover feature that categorizes your data based on the content and the associated risk the data poses. This increases your understanding of your data and allows you to make faster decisions about it.
To create or manage a framework using the Immuta API, see the Frameworks API reference page.
If you have any tags that are applied to your data sources by classification that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again when classification is re-run.
Private preview: This feature is only available to select accounts.
Identifiers in domains allows you to use the same domains you already organize your data in to hold identifiers and run sensitive data discovery (SDD) without having to use identification frameworks. See the for more information about the feature and limitations.
Identifiers can be added and SDD can be run in any of your current domains. However, if you are not already using domains, set up a domain specifically to run SDD:
.
.
.
.
Navigate to the Identifiers tab of your domain.
Click Get Started.
Add reference identifiers to your domain that are relevant to your data by clicking the checkboxes. Note: When added to your domain, the identifier is a point-in-time copy of the reference identifier. It has the same name, pattern, and tags.
Click Add Identifiers.
This can be done within a domain from the Identifiers tab to create a domain-specific identifier, or it can be done from the Discover Identifiers page to create a reference identifier.
Click Create New.
Enter a name and description for your identifier.
Click Next.
For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary matching to be case insensitive.
Click Next.
Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered.Entity" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.
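Because the regex must use RE2 syntax, it can help to check a pattern for constructs RE2 does not support (lookahead, lookbehind, and backreferences) before entering it. The following is a minimal sketch, not part of Immuta: the function name re2_warnings and the list of checks are illustrative assumptions, and the check is a heuristic rather than a full RE2 parser.

```python
import re

# Constructs that Python's `re` accepts but RE2 rejects.
# Heuristic only: this is not a complete RE2 syntax check.
_UNSUPPORTED = {
    "lookahead": re.compile(r"\(\?[=!]"),
    "lookbehind": re.compile(r"\(\?<[=!]"),
    "backreference": re.compile(r"\\[1-9]"),
}


def re2_warnings(pattern: str) -> list[str]:
    """Return the names of RE2-unsupported constructs found in `pattern`."""
    return [name for name, rx in _UNSUPPORTED.items() if rx.search(pattern)]


if __name__ == "__main__":
    for candidate in (r"^\d{3}-\d{2}-\d{4}$", r"(?=foo)bar", r"(a)\1"):
        print(candidate, re2_warnings(candidate))
```

An empty list means none of the checked constructs were found; it does not guarantee that the pattern is valid RE2.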
Requirements:
Immuta permission GOVERNANCE
Click the Discover icon in the navigation menu and select the Identification tab.
Click Create New.
Enter a Name and Description for the identification framework.
Select the option to Create empty framework.
Click Create.
After you create the identification framework, you can .
Click the Discover icon in the navigation menu and select the Identification tab.
Click Create New.
Enter a Name and Description for the identification framework.
Select the option to Create identifiers from an existing framework.
Select the checkbox for the framework you want to copy. You can only copy a single framework. For more information about a framework, click the framework name to open a new tab with details about the framework.
Click Create.
To add an identifier to a framework,
Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click Add Identifier.
Choose in the dropdown to add an identifier from those already in Immuta or create a new identifier for the framework.
For existing identifiers: Opt to edit the tags. Then click Add Identifier.
For new identifiers:
Fill out a Name and Description.
For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For column name regex, enter a regex to be matched against column names. The default criteria encoding is not case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary matching to be case insensitive.
Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.
To edit the tags applied by an identifier for a framework,
Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click the more actions icon for an identifier and select Edit tags.
Remove the tags or type a tag name to add tags.
Click Save.
Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click the more actions icon for an identifier and select Delete.
Click Delete again in the modal.
To assign a framework to run on specific data sources,
Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework you want to assign and navigate to the Data Sources tab.
Click Add Data Sources.
Select the checkbox for the data source you want this framework to run on. You may select more than one.
Click Add Data Source(s).
After a data source is removed from a framework, it will use the global framework for any SDD scans and the tags applied by the removed framework will be replaced. The global framework is signified by the globe icon.
To remove data sources from a framework,
Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework you want to remove data sources from and navigate to the Data Sources tab.
Select the checkbox for the data source you want to remove from the framework. You may select more than one.
Select Remove and click Remove again in the modal.
Requirement: No data sources assigned to the framework
To delete a framework,
Click the Discover icon in the navigation menu and select the Identification tab.
Select Delete and click Delete again in the modal.
Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identification frameworks and identifiers, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
Sensitive data discovery is supported for data sources from the following technologies:
or
Starburst (Trino): Sensitive data discovery for Starburst (Trino) is currently in public preview and available to all accounts. Reach out to your Immuta representative to enable it on your tenant.
Redshift: Sensitive data discovery for Redshift is currently in private preview and available to all accounts. Reach out to your Immuta representative to enable it on your tenant.
To evaluate your data, SDD generates a SQL query using the identification framework's identifiers; the Immuta system account then executes that query in the native technology. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. These results are then used to apply the resulting tags to the appropriate columns.
This evaluation and tagging process occurs whenever identification runs. If a global framework is set, identification runs automatically on the following events:
A new data source is created.
Schema monitoring is enabled, and a new data source is detected.
The following actions will also trigger identification:
Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed. Note, this will use the identification framework that already ran on the data source.
A user manually triggers it from the data source health check menu. Note, this will use the identification framework that already applies to the data source or the global framework, if set.
A user manually triggers it from the identification frameworks page.
A user manually triggers it through the API.
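The key property of the evaluation described above is that the result carries only column names and matching identifier names, never the sampled values. The sketch below illustrates that result shape in Python. It is not Immuta's implementation: in Immuta the evaluation runs as a SQL query inside the data platform, and the identifier names, patterns, scan function, and threshold parameter here are hypothetical.

```python
import re

# Hypothetical identifiers for illustration: name -> regex for column values.
EXAMPLE_IDENTIFIERS = {
    "EMAIL": r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$",
    "US_ZIP": r"^\d{5}(?:-\d{4})?$",
}


def scan(
    sampled_columns: dict[str, list[str]],
    identifiers: dict[str, str],
    threshold: float = 0.9,  # assumed share of sampled values that must match
) -> dict[str, list[str]]:
    """Return column name -> matching identifier names (no raw values)."""
    results: dict[str, list[str]] = {}
    for column, values in sampled_columns.items():
        hits = []
        for name, pattern in identifiers.items():
            rx = re.compile(pattern, re.IGNORECASE)
            matched = sum(1 for value in values if rx.search(value))
            if values and matched / len(values) >= threshold:
                hits.append(name)
        if hits:
            results[column] = hits
    return results


if __name__ == "__main__":
    sample = {
        "contact": ["a@example.com", "b@example.org"],
        "zip": ["12345", "12345-6789"],
        "notes": ["hello"],
    }
    print(scan(sample, EXAMPLE_IDENTIFIERS))
```

Only the returned mapping would be used for tagging; the sampled values stay inside the function.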
An identification framework is a group of identifiers that will look for particular criteria and tag any columns where those conditions are met.
Each organization can set a global framework to apply to all the data sources in Immuta by default unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. If a global framework is set, identification will run on all new data sources. If a global framework is not set, identification will only run on data sources manually applied to an identification framework.
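The rule for which framework runs on a data source can be summarized as a lookup with a fallback. This is a minimal sketch of that rule only; the function name framework_for and the framework and data source names are hypothetical.

```python
from typing import Optional


def framework_for(
    data_source: str,
    assignments: dict[str, str],
    global_framework: Optional[str],
) -> Optional[str]:
    """Pick the framework for a data source.

    A framework assigned to the data source wins. Otherwise the global
    framework is used. None means identification will not run.
    """
    return assignments.get(data_source, global_framework)


if __name__ == "__main__":
    assigned = {"orders": "Payments Framework"}
    print(framework_for("orders", assigned, "Default Framework"))
    print(framework_for("users", assigned, "Default Framework"))
    print(framework_for("users", assigned, None))
```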
An identifier pairs a criteria with the tags to apply to data that matches it. When Immuta recognizes a match for that criteria, it can tag the data to describe its type.
Improved identifiers
If you are interested in these improved identifiers, reach out to your Immuta support professional.
Criteria are the conditions that need to be met for resulting tags to be applied to data.
SDD only supports regular expressions (regex) written in RE2 syntax.
Regex: This criteria contains a case-insensitive regular expression that searches for matches against column values.
Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied.
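The three criteria types differ in what they inspect: regex and dictionary criteria look at column values, while a column name criteria looks only at the name. The sketch below shows that difference with Python's re module standing in for RE2. The function names are hypothetical, the use of a match fraction is an assumption, and the case-sensitivity defaults are exposed as parameters rather than taken as Immuta's behavior.

```python
import re


def regex_criterion(pattern: str, values: list[str], case_insensitive: bool = True) -> float:
    """Fraction of column values that the regex matches."""
    rx = re.compile(pattern, re.IGNORECASE if case_insensitive else 0)
    return sum(1 for v in values if rx.search(v)) / len(values) if values else 0.0


def dictionary_criterion(words: list[str], values: list[str], case_insensitive: bool = False) -> float:
    """Fraction of column values that appear in the dictionary."""
    if case_insensitive:
        lookup = {w.lower() for w in words}
        hits = sum(1 for v in values if v.lower() in lookup)
    else:
        lookup = set(words)
        hits = sum(1 for v in values if v in lookup)
    return hits / len(values) if values else 0.0


def column_name_criterion(pattern: str, column_name: str) -> bool:
    """True if the case-insensitive regex is found in the column name."""
    return re.search(pattern, column_name, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(regex_criterion(r"^\d{3}$", ["123", "12a"]))
    print(dictionary_criterion(["red", "green"], ["red", "Green", "blue"], True))
    print(column_name_criterion(r"e-?mail", "Customer_EMAIL"))
```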
The amount of time it takes to run identification on a data source depends on several factors:
Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.
*Two built-in patterns support and match based on additional data types:
DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
Immuta compiles dictionary patterns into a regex that is sent in the body of a query.
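One way to picture compiling a dictionary into a regex is an alternation of escaped entries. The sketch below is an assumption about the general technique, not Immuta's actual compilation: the function name dictionary_to_regex, the whole-value anchoring, and the longest-first ordering are illustrative choices.

```python
import re


def dictionary_to_regex(words: list[str], case_insensitive: bool = False) -> str:
    """Build one regex that matches any dictionary entry as a whole value."""
    cleaned = [w.strip() for w in words if w.strip()]
    # Longest entries first so a long phrase is tried before its prefixes.
    ordered = sorted(cleaned, key=len, reverse=True)
    body = "|".join(re.escape(w) for w in ordered)
    prefix = "(?i)" if case_insensitive else ""
    return f"{prefix}^(?:{body})$"


if __name__ == "__main__":
    pattern = dictionary_to_regex(["Ohio", "New York"], case_insensitive=True)
    print(pattern)
    print(bool(re.search(pattern, "new york")))
    print(bool(re.search(pattern, "New Yorker")))
```

Because every entry becomes part of a single pattern, a large dictionary produces a long regex, which is why the size of the query body matters.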
Redshift Spectrum is not supported with SDD.
The Redshift cluster must be up and running for SDD to successfully run.
The username and password auth method is fully supported with SDD.
AWS access key is supported with limitations with SDD:
redshift-data:BatchExecuteStatement
redshift-data:CancelStatement
redshift-data:DescribeStatement
redshift-data:ExecuteStatement
redshift-data:GetStatementResult
redshift-data:ListStatements
The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that the data sources are onboarded with.
Redshift Serverless data sources are not supported for native SDD with the AWS access key authentication method.
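The redshift-data actions and the redshift:GetClusterCredentials permission listed above can be collected into a single IAM policy document. The sketch below builds one as a Python dictionary and prints it as JSON. The statement layout, the "*" resource for the redshift-data actions, and the example ARN are assumptions for illustration; scope the resources to your own cluster, user, and database.

```python
import json

# Actions taken from the list above.
REDSHIFT_DATA_ACTIONS = [
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
]


def sdd_policy(credential_resources: list[str]) -> dict:
    """Build an IAM policy document covering the permissions listed above.

    `credential_resources` holds the ARNs (cluster, user, database) that
    redshift:GetClusterCredentials should be allowed on.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": REDSHIFT_DATA_ACTIONS,
                "Resource": "*",  # assumption: narrow this where possible
            },
            {
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": credential_resources,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN; replace region, account, cluster, and user.
    arns = ["arn:aws:redshift:us-east-1:111122223333:dbuser:example-cluster/example_user"]
    print(json.dumps(sdd_policy(arns), indent=2))
```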
These limitations are only relevant to users who have previously enabled and run Immuta SDD.
Immuta has improved the performance and behavior of sensitive data discovery (SDD), so references to two types of SDD can be found in the product:
Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags.
Native SDD was released to Snowflake and Databricks in May 2023. It was released to Starburst (Trino) and Redshift in April 2024. Native SDD is the only type of SDD available. It is often just referred to as SDD.
If you had legacy SDD enabled, running native SDD can result in different tags being applied because native SDD is more accurate and has fewer false positives than legacy SDD. Running a new SDD scan against a table will change the context of the resulting tags, but no Discovered tags previously applied by legacy SDD will be removed.
Enter criteria: Select the .
Only tags can be edited within a framework. Edits made to an identifier within a framework will only impact that specific identifier. To fully edit an identifier (including the name, description, or criteria) for all frameworks, use the .
Click the more actions icon in the Action column for the framework you want to delete. The global framework cannot be deleted. If you want to delete it, .
Users can or the .
Sensitive data discovery (SDD) runs to discover data. These frameworks are a collection of . These identifiers contain a single and the tags that will be applied when the criteria's conditions have been met. See the sections below for more information on each component.
While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in "Default Framework," which contains all the and assigns the .
For a how-to on the framework actions users can take, see the .
Users can or leave the global framework field blank.
Immuta comes with to discover common categories of data. These identifiers cannot be modified or deleted. to find their specific data.
A was released October 2024.
For a how-to on the identifier actions users can take, see the .
Competitive criteria analysis: This criteria is a process that will review all the regex and dictionary criteria within the identifiers of the framework and search for the identifier with the best fit. In this review, each competitive criteria analysis identifier in the framework competes against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier will apply per column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about the competitive nature, see the .
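The competition described above can be pictured as scoring every candidate identifier against the sampled values and keeping at most one winner that clears the 90% threshold. The sketch below is a simplified illustration, not Immuta's algorithm: it ranks only by match fraction, breaks ties by keeping the first candidate, and does not model how "most specific" is decided. The function name and example identifiers are hypothetical.

```python
import re
from typing import Optional


def best_identifier(
    values: list[str],
    identifiers: dict[str, str],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the single best-matching identifier name, or None."""
    if not values:
        return None
    best_name: Optional[str] = None
    best_fraction = 0.0
    for name, pattern in identifiers.items():
        rx = re.compile(pattern)
        fraction = sum(1 for v in values if rx.search(v)) / len(values)
        # A candidate must clear the threshold and beat the current best.
        if fraction >= threshold and fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name


if __name__ == "__main__":
    candidates = {
        "PHONE": r"^\d{3}-\d{3}-\d{4}$",
        "SSN": r"^\d{3}-\d{2}-\d{4}$",
    }
    print(best_identifier(["123-45-6789"] * 10, candidates))
    print(best_identifier(["123-45-6789"] * 8 + ["n/a", "unknown"], candidates))
```

In the second call only 80% of the values match, so no identifier is returned and the column would receive no tag from this competition.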
Create a new identifier in the or with the .
Only application admins can on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.
When SDD is manually triggered by a data owner, all column tags previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can to prevent them from being used and auto-tagged on that data source in the future.
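The two tagging behaviors described above (a manual run replaces SDD-applied tags, while a new-column run only adds tags for columns not seen before) can be sketched as follows. The function name apply_scan and the tag names are hypothetical, and the sketch tracks only tags applied by SDD, not manually applied or disabled tags.

```python
def apply_scan(
    existing: dict[str, set[str]],
    scan_result: dict[str, set[str]],
    manual: bool,
) -> dict[str, set[str]]:
    """Return the SDD-applied tags per column after a scan.

    existing:    column -> tags applied by earlier SDD runs
    scan_result: column -> tags prescribed by the latest run
    manual:      True for a manually triggered run, False when the run
                 was triggered by newly detected columns
    """
    if manual:
        # Manual run: earlier SDD tags are removed and replaced.
        return {col: set(tags) for col, tags in scan_result.items()}
    # New-column run: existing columns keep their tags unchanged.
    merged = {col: set(tags) for col, tags in existing.items()}
    for col, tags in scan_result.items():
        if col not in merged:
            merged[col] = set(tags)
    return merged


if __name__ == "__main__":
    before = {"email": {"Discovered.ExampleA"}}
    latest = {"email": {"Discovered.ExampleB"}, "zip": {"Discovered.ExampleC"}}
    print(apply_scan(before, latest, manual=True))
    print(apply_scan(before, latest, manual=False))
```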
Identifiers: The number of identifiers being used affects the time to run identification.
The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by SDD performance but by the execution of background jobs in Immuta. when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.
For users interested in testing SDD, note that the built-in identifiers by Immuta require a 90% match to data to be assigned to a column. This means that with synthetic data, there may be situations where the data is not real enough to fit the confidence needed to match identifiers. To test SDD, use a dev environment, create copies of your tables, or and see the tags that would be applied to your data by SDD.
Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the Default Framework, then when the identifier is matched the column will not be tagged. As an alternative, tags can be disabled on a , or SDD can be turned off on a data-source-by-data-source basis when creating a data source.
For Snowflake, the size of the dictionary is limited by the .
For Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow to automatically terminate inactive clusters after a set period of time.
SDD will only work on Starburst (Trino) data sources authenticated with username and password. is not supported with SDD.
is not supported with SDD.
The AWS access key used to register the data source can do a minimum of the following :
If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the formatted like the following:
See the page for more information.