Data Discovery

Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identification frameworks and identifiers, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.

Supported technologies

Sensitive data discovery is supported for data sources from the following technologies:

Snowflake
Databricks Spark or Databricks Unity Catalog
Starburst (Trino): Sensitive data discovery for Starburst (Trino) is currently in public preview and available to all accounts. Reach out to your Immuta representative to enable it on your tenant.
Redshift: Sensitive data discovery for Redshift is currently in private preview and available to all accounts. Reach out to your Immuta representative to enable it on your tenant.

Architecture

To evaluate your data, SDD generates a SQL query using the identification framework's identifiers; the Immuta system account then executes that query in the native technology. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. These results are then used to apply the resulting tags to the appropriate columns.
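
As an illustration of the flow described above, the following minimal sketch uses Python's built-in sqlite3 module as a stand-in for the native technology. The table, the columns, the identifier names, and their regexes are all hypothetical, and the query shape is an assumption rather than the SQL that Immuta actually generates. It shows only the key property: the result carries column names, identifier names, and match counts, never raw values.

```python
import re
import sqlite3

# Hypothetical identifiers: name -> regex (illustrative only, not Immuta's built-in patterns).
IDENTIFIERS = {
    "email_like": r"^[^@\s]+@[^@\s]+\.[a-z]+$",
    "us_zip_like": r"^\d{5}$",
}


def regexp(pattern, value):
    """SQLite calls this for `value REGEXP pattern`; NULL values never match."""
    return value is not None and re.search(pattern, str(value)) is not None


def run_discovery_query(conn, table, columns):
    """Return (column, identifier, hits, non_null_rows) tuples; no raw values are returned."""
    results = []
    for column in columns:
        for name, pattern in IDENTIFIERS.items():
            sql = (
                f'SELECT SUM(CASE WHEN "{column}" REGEXP ? THEN 1 ELSE 0 END), '
                f'COUNT("{column}") FROM "{table}"'
            )
            hits, total = conn.execute(sql, (pattern,)).fetchone()
            results.append((column, name, hits or 0, total))
    return results


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)  # SQLite has no built-in REGEXP implementation
conn.execute("CREATE TABLE customers (contact TEXT, postal TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("a@example.com", "43004"), ("b@example.org", "10001"), ("c@example.net", None)],
)
summary = run_discovery_query(conn, "customers", ["contact", "postal"])
for row in summary:
    print(row)
```

Only the aggregated counts in `summary` would cross the boundary back to the caller in this model; the tagging decision is then made from those counts.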

This evaluation and tagging process occurs whenever identification runs, which is triggered by any of the following events:

A new data source is created.
Schema monitoring is enabled, and a new data source is detected.
Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed.
A user manually triggers it from the data source health check menu.
A user manually triggers it from the identification frameworks page.
A user manually triggers it through the API.

Users can manually run identification from a data source's overview page or the identification frameworks page.

Components

Sensitive data discovery (SDD) runs frameworks to discover data. Each framework is a collection of identifiers. Each identifier contains a single criteria and the tags that will be applied when that criteria's conditions are met. See the sections below for more information on each component.

Identification framework

An identification framework is a group of identifiers that will look for particular criteria and tag any columns where those conditions are met.

While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in "Default Framework," which contains all the built-in identifiers and assigns the built-in Discovered tags.

For a how-to on the framework actions users can take, see the Manage frameworks page.

Global framework

Each organization has a single global framework that will apply to all the data sources in Immuta by default unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. Users can bypass this global framework by applying a specific framework to data sources.
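
The precedence described above can be modeled in a few lines. This is a conceptual sketch only; the function, the data source names, and the framework names are hypothetical and do not correspond to an Immuta API.

```python
from typing import Dict, Optional


def resolve_framework(
    data_source: str,
    assignments: Dict[str, str],
    global_framework: str,
) -> str:
    """Return the framework used for a data source: its own assignment, else the global one."""
    assigned: Optional[str] = assignments.get(data_source)
    return assigned if assigned is not None else global_framework


# Hypothetical names for illustration.
assignments = {"finance.payments": "us-financial-framework"}
print(resolve_framework("finance.payments", assignments, "default-framework"))
print(resolve_framework("hr.employees", assignments, "default-framework"))
```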

Identifier

An identifier is a criteria and the tags to apply to data that matches the criteria. When Immuta recognizes that criteria, it can tag the data to describe the type.

Immuta comes with built-in identifiers to discover common categories of data. These identifiers cannot be modified or deleted. Users can also create their own unique identifiers to find their specific data.

Improved identifiers

A new and improved pack of the built-in identifiers was released in October 2024.

If you are interested in these improved identifiers, reach out to your Immuta support professional.

For a how-to on the identifier actions users can take, see the Create an identifier page.

Criteria

Criteria are the conditions that need to be met for resulting tags to be applied to data.

SDD only supports regular expressions (regex) written in RE2 syntax.
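
RE2 does not support lookaround assertions or backreferences, so patterns written for other regex engines may need to be rewritten. The sketch below is a rough, heuristic pre-check written for this guide (the function name is hypothetical and it is not part of Immuta); it flags those constructs by inspecting the pattern text and can produce false positives, for example on an escaped backslash followed by a digit.

```python
import re

# Constructs that RE2 does not support (lookaround and backreferences).
_UNSUPPORTED = {
    "lookahead": re.compile(r"\(\?[=!]"),
    "lookbehind": re.compile(r"\(\?<[=!]"),
    "backreference": re.compile(r"\\[1-9]"),
}


def re2_warnings(pattern: str) -> list:
    """Return the names of unsupported constructs found in `pattern` (heuristic only)."""
    return [name for name, rx in _UNSUPPORTED.items() if rx.search(pattern)]


print(re2_warnings(r"^\d{3}-\d{2}-\d{4}$"))   # no unsupported constructs
print(re2_warnings(r"(?<!\d)\d{4}(?!\d)"))    # uses lookbehind and lookahead
print(re2_warnings(r"(\w+) \1"))              # uses a backreference
```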

Supported criteria types for identifiers

Competitive criteria analysis: This criteria is a process that reviews all the regex and dictionary criteria within the identifiers of the framework and searches for the identifier with the best fit. In this review, the competitive criteria analysis identifiers in the framework compete against one another to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier will apply per column. To learn more about the competitive nature, see the How competitive criteria analysis works guide.
- Regex: This criteria contains a case-insensitive regular expression that searches for matches against column values.
- Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied.
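
To make the difference between the criteria types concrete, here is a minimal sketch of the three kinds of matching. It uses Python's `re` module, which is not RE2, and it assumes that dictionary entries are compared against whole column values while ignoring case; both are simplifications for illustration, and all function names and sample values are hypothetical.

```python
import re
from typing import Iterable, List


def regex_hits(pattern: str, values: Iterable[str]) -> int:
    """Count column values matched by a case-insensitive regex criteria."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for v in values if rx.search(v))


def dictionary_hits(words: List[str], values: Iterable[str]) -> int:
    """Count column values equal to a dictionary word or phrase (assumed: whole value, case ignored)."""
    lookup = {w.lower() for w in words}
    return sum(1 for v in values if v.lower() in lookup)


def column_name_matches(pattern: str, column_name: str) -> bool:
    """Column name criteria inspect the column name only, never the values."""
    return re.search(pattern, column_name, re.IGNORECASE) is not None


print(regex_hits(r"^\d{5}$", ["43004", "1234", "10001"]))
print(dictionary_hits(["Ohio", "Texas"], ["Ohio", "texas", "N/A"]))
print(column_name_matches(r"e[-_]?mail", "Customer_Email"))
```

Regex and dictionary hits feed the competitive analysis, while a column name match applies its tags directly, which is why several column name identifiers can tag the same column.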

Create a new identifier in the Immuta UI or with the sdd/classifier endpoint.

Configuration

Only application admins can enable sensitive data discovery (SDD) globally on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.

Tag mutability

When SDD is manually triggered by a data owner, all column tags previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted Discovered tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
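
The rules above can be summarized as a small model. This sketch is conceptual only: the function, the data structures, and the tag names are hypothetical and exist to show how a manual run differs from a run triggered by a newly detected column, and how disabled tags are skipped.

```python
from typing import Dict, Optional, Set

Tags = Dict[str, Set[str]]  # column name -> Discovered tags applied by SDD


def apply_identification(
    current: Tags,
    latest: Tags,
    disabled: Tags,
    new_columns: Optional[Set[str]] = None,
) -> Tags:
    """Model the tagging rules.

    new_columns=None models a manual run: previous SDD tags are replaced by the latest results.
    Otherwise only the listed new columns are tagged and existing columns are left untouched.
    Tags disabled on a column are never re-applied.
    """
    result: Tags = {}
    for column in sorted(set(current) | set(latest)):
        if new_columns is None or column in new_columns:
            result[column] = set(latest.get(column, set())) - disabled.get(column, set())
        else:
            result[column] = set(current.get(column, set()))
    return result


# Hypothetical tag names for illustration.
current = {"ssn": {"Discovered.Example.Old"}}
latest = {"ssn": {"Discovered.Example.SSN"}, "email": {"Discovered.Example.Email"}}

manual = apply_identification(current, latest, disabled={})
incremental = apply_identification(current, latest, disabled={}, new_columns={"email"})
print(manual)
print(incremental)
```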

Performance

The amount of time it takes to run identification on a data source depends on several factors:

Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.

The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by SDD performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.

Testing

For users interested in testing SDD, note that the built-in identifiers require a 90% match against the data for their tags to be assigned to a column. With synthetic data, the values may not be realistic enough to reach that match threshold. To test SDD, use a dev environment, create copies of your tables, or use the API to run a dryRun and see the tags that SDD would apply to your data.

Considerations

Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the Default Framework, then when the identifier is matched the column will not be tagged. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.

Supported data types and casing

Two built-in patterns also match columns based on additional data types:

DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
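
The two rules above can be expressed as simple predicates. In this sketch the regexes are placeholders chosen for illustration (the built-in identifiers' actual patterns are not shown in this guide), and the type names are simplified labels rather than the exact type names of any one platform.

```python
import re

# Placeholder regexes; assumptions for illustration only.
DATE_REGEX = re.compile(r"^\d{4}-\d{2}-\d{2}$")
TIME_REGEX = re.compile(r"^\d{2}:\d{2}(:\d{2})?$")

DATE_TYPES = {"date", "date+time", "timestamp"}
TIME_TYPES = {"time"}


def matches_date(data_type: str, sample_value: str = "") -> bool:
    """DATE: a date-like native type, or a string column whose value matches the regex."""
    data_type = data_type.lower()
    if data_type in DATE_TYPES:
        return True
    return data_type == "string" and DATE_REGEX.search(sample_value) is not None


def matches_time(data_type: str, sample_value: str = "") -> bool:
    """TIME: the time type, or a string column whose value is a time with no date part."""
    data_type = data_type.lower()
    if data_type in TIME_TYPES:
        return True
    return data_type == "string" and TIME_REGEX.search(sample_value) is not None


print(matches_date("timestamp"))
print(matches_time("string", "12:30:00"))
print(matches_time("string", "2024-10-01 12:30:00"))  # date included, so not a TIME match
```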

Limitations with dictionary patterns

Immuta compiles dictionary patterns into a regex that is sent in the body of a query.

For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.
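
To estimate whether a large dictionary is likely to fit, you can approximate the compiled pattern yourself. The sketch below is an assumption about the general technique (an escaped alternation anchored to the whole value), not Immuta's actual compilation; the function names are hypothetical, the escaping follows Python's `re.escape`, and the limit is taken as 1,000,000 bytes for simplicity.

```python
import re

# "1 MB" from the text above; the exact byte count used here is an assumption.
SNOWFLAKE_QUERY_TEXT_LIMIT_BYTES = 1_000_000


def dictionary_to_regex(words):
    """Compile dictionary entries into one anchored alternation, escaping regex metacharacters."""
    # Longest entries first so a longer phrase is tried before a shorter one.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    return "^(?:" + "|".join(re.escape(w) for w in ordered) + ")$"


def fits_in_query(pattern, overhead_bytes=0, limit=SNOWFLAKE_QUERY_TEXT_LIMIT_BYTES):
    """Compare the UTF-8 size of the pattern plus the rest of the query text with the limit."""
    return len(pattern.encode("utf-8")) + overhead_bytes <= limit


pattern = dictionary_to_regex(["Ohio", "New York", "N.Y."])
print(pattern)
print(fits_in_query(pattern))
```

Because the rest of the query also counts toward the limit, `overhead_bytes` lets you reserve room for it; the usable dictionary size is therefore somewhat below the full limit.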

Databricks limitation

For Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
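
As a sketch of the auto-termination setting mentioned above, the fragment below shows the `autotermination_minutes` field of a Databricks cluster definition. The cluster name and the 30-minute value are placeholders; choose a value that suits your workloads and consult the Databricks documentation for the allowed range.

```python
import json

# Fragment of a Databricks cluster definition (not a complete, deployable spec).
# `autotermination_minutes` terminates the cluster after that many minutes of inactivity.
cluster_settings = {
    "cluster_name": "immuta-sdd-example",  # placeholder name
    "autotermination_minutes": 30,         # placeholder value
}
print(json.dumps(cluster_settings, indent=2))
```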

Starburst (Trino) limitation

SDD will only work on Starburst (Trino) data sources authenticated with username and password. OAuth 2.0 is not supported with SDD.

Redshift limitations

Redshift Spectrum is not supported with SDD.
The Redshift cluster must be up and running for SDD to successfully run.

Redshift supported authentication methods

The username and password auth method is fully supported with SDD.
Okta is not supported with SDD.
AWS access key is supported with limitations with SDD:
- The AWS access key used to register the data source must be able to perform at least the following redshift-data API actions:
  - redshift-data:BatchExecuteStatement
  - redshift-data:CancelStatement
  - redshift-data:DescribeStatement
  - redshift-data:ExecuteStatement
  - redshift-data:GetStatementResult
  - redshift-data:ListStatements
- The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that they onboard their data sources with.
- If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options formatted like the following:
  region=us-east-2;clusterid=12345
- Redshift Serverless data sources are not supported for native SDD with the AWS access key authentication method.
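
The requirements above lend themselves to a quick pre-flight check before registering a data source. The following sketch is a local helper written for this guide; the function names are hypothetical, and it only inspects strings you pass in rather than calling AWS or Immuta.

```python
REQUIRED_DATA_API_ACTIONS = {
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
}
REQUIRED_ACTIONS = REQUIRED_DATA_API_ACTIONS | {"redshift:GetClusterCredentials"}


def parse_connection_options(options: str) -> dict:
    """Parse 'key=value;key=value' additional connection string options."""
    parsed = {}
    for part in options.split(";"):
        if not part.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        parsed[key.strip().lower()] = value.strip()
    return parsed


def missing_sdd_options(options: str) -> list:
    """Return the options required for custom URLs that are absent or empty."""
    parsed = parse_connection_options(options)
    return [key for key in ("region", "clusterid") if not parsed.get(key)]


def missing_actions(allowed_actions) -> list:
    """Return the required actions that are not in the allowed set."""
    return sorted(REQUIRED_ACTIONS - set(allowed_actions))


print(parse_connection_options("region=us-east-2;clusterid=12345"))
print(missing_sdd_options("region=us-east-2"))
print(missing_actions(REQUIRED_DATA_API_ACTIONS))
```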

Migrating from legacy to native SDD

These limitations are only relevant to users who have previously enabled and run Immuta SDD.

Immuta has improved the performance and behavior of sensitive data discovery (SDD), so references to two types of SDD can be found in the product:

Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags.
Native SDD was released to Snowflake and Databricks in May 2023. It was released to Starburst (Trino) and Redshift in April 2024. Native SDD is the only type of SDD available. It is often just referred to as SDD.

If you had legacy SDD enabled, running native SDD can result in different tags being applied because native SDD is more accurate and has fewer false positives than legacy SDD. Running a new SDD scan against a table will change the context of the resulting tags, but no Discovered tags previously applied by legacy SDD will be removed.

See the Migrate from legacy to native SDD page for more information.

How-to Guides

Enable Sensitive Data Discovery (SDD)

Requirements:

Immuta permission GOVERNANCE
Registered Snowflake, Databricks, Redshift, or Starburst (Trino) data sources

This how-to guide is for enabling sensitive data discovery (SDD) for the first time. For additional information on sensitive data discovery, see the Data discovery page.

Turn on SDD

Navigate to the App Settings page and scroll to the Sensitive Data Discovery section.
Select the Enable Sensitive Data Discovery (SDD) checkbox to enable SDD.
Click Save and then click Confirm to apply your changes. Note that the Immuta tenant will have a system restart.

Create a new framework with identifiers

Once SDD is enabled on your tenant, SDD will automatically run when new data sources are added, but it must be manually run for all existing data sources. This allows you to test out SDD with a select few data sources without worrying that it will add tags throughout all your data sources.

For this step, you will pick the identifiers to match the data that matters to your organization. For example, for international data, you may want to enable many different identifiers for many countries, like the "Australia Passport" identifier and the "Finland National ID Number" identifier. However, if you are dealing with United States domestic financial data, those identifiers would be irrelevant. In that case, it would be better to identify the data likely to appear, like Bitcoin or US Bank Routing MICR.

First, create an empty framework:

Navigate to Discover and Identification.
Select Create New.
Enter a Name and Description for your new identification framework.
Select Create empty framework.

Then, add a new identifier to that framework:

Navigate to Discover and Identifiers.
Use the checkboxes to select all the identifiers relevant to your data. Tip: From the overview page you can see the name and the tags that will be applied by the identifier. To better understand the data it will match, click the name to read the description.
Once you have checked the identifiers you want in your framework, click Add to Framework.
Type the framework name in the text box.
Click Add to Framework.

Run identification on your data sources

Once you have created a framework relevant to your data, it is time to test it on your data and customize it. Run identification on a select number of data sources where you understand the data to assess and adjust the tags to reflect what you expect to see.

Add those select data sources to your new framework:

Navigate to Discover and Identification.
Click your new framework name.
Navigate to the Data Sources tab.
Click Add Data Sources.
Check the checkboxes for the select data sources you want to try SDD on.
Click Add Data Source(s).

Then, run identification on those data sources:

Navigate to Discover and Identification.
Click the action menu for your new framework.
Click Run Identification.

View the identification results

After identification runs, you will receive a notification that the job is complete. Then, you can view the results from the data source dictionary.

Navigate to the data source overview page of the data source you added to the framework.
Click the Data Dictionary tab.
Assess whether the Discovered tags are applied as expected.
If you are happy with the Discovered tags, follow the Assign data sources to frameworks guide to add the rest of your data sources to the framework and follow the Run identification guide to run identification on all your data sources.
If you want additional tags, follow the Create an identifier guide to create identifiers that matter to your data.

Manage Identification Frameworks

Requirements:

Sensitive data discovery (SDD) enabled
Immuta permission GOVERNANCE

Create an identification framework

Create an identification framework with no identifiers

Click the Discover icon in the navigation menu and select the Identification tab.
Click Create New.
Enter a Name and Description for the identification framework.
Select the option to Create empty framework.
Click Create.

After you create the identification framework, you can create new identifiers.

Copy an existing identification framework and its identifiers

Click the Discover icon in the navigation menu and select the Identification tab.
Click Create New.
Enter a Name and Description for the identification framework.
Select the option to Create identifiers from an existing framework.
Select the checkbox for the framework you want to copy. You can only copy a single framework. For more information about a framework, click the framework name to open a new tab with details about the framework.
Click Create.

Manage an identification framework's identifiers

Add an identifier to a framework

To add an identifier to a framework:

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click Add Identifier.
Choose in the dropdown to add an identifier from those already in Immuta or create a new identifier for the framework.
- For existing identifiers: Opt to edit the tags. Then click Add Identifier.
- For new identifiers:
  1. Fill out a Name and Description.
  2. Enter criteria: Select the Type of criteria.
    For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
    For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
    For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary to be case insensitive.
  3. Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered" hierarchy to apply to columns that match your identifier.
  4. Click Next to review your new identifier and click Create Identifier to create it.

Edit an identifier in a framework

Only tags can be edited within a framework. Edits made to an identifier within a framework will only impact that specific identifier. To fully edit an identifier (including the name, description, or criteria) for all frameworks, use the Edit an identifier how-to guide.

To edit the tags applied by an identifier for a framework:

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click the more actions icon for an identifier and select Edit tags.
Remove the tags or type a tag name to add tags.
Click Save.

Delete an identifier from a framework

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click the more actions icon for an identifier and select Delete.
Click Delete again in the modal.

Manage an identification framework's data sources

Assign an identification framework to data sources

To assign a framework to run on specific data sources:

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework you want to assign and navigate to the Data Sources tab.
Click Add Data Sources.
Select the checkbox for the data source you want this framework to run on. You may select more than one.
Click Add Data Source(s).

Remove data sources from an identification framework

After a data source is removed from a framework, it will use the global framework for any SDD scans and the tags applied by the removed framework will be replaced. The global framework is signified by the globe icon.

To remove data sources from a framework:

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework you want to remove data sources from and navigate to the Data Sources tab.
Select the checkbox for the data source you want to remove from the framework. You may select more than one.
Select Remove and click Remove again in the modal.

Delete an identification framework

Requirement: No data sources assigned to the framework

To delete a framework:

Click the Discover icon in the navigation menu and select the Identification tab.
Click the more actions icon in the Action column for the framework you want to delete. The global framework cannot be deleted. If you want to delete it, first configure a different framework as the global framework.
Select Delete and click Delete again in the modal.

Manage Identifiers

Requirements:

Sensitive data discovery (SDD) enabled
Immuta permission GOVERNANCE

Create an identifier

Click the Discover icon in the navigation menu and select the Identifiers tab.
Click Create New.
Enter a Name and Description for the new identifier.
Enter criteria: Select the Type of criteria.
1. For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
2. For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
3. For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary to be case insensitive.
Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.
See the Manage identification frameworks page to add your new identifier to a framework.

Note that all user-created identifiers must be a 90% match or greater for the contents of the column to be tagged.

Edit an identifier

Editing the details or criteria of an identifier from the identifiers menu will affect any framework with that identifier throughout Immuta. Editing the tags will only affect new frameworks the identifier is added to.

To edit an identifier:

Click the Discover icon in the navigation menu and select the Identifiers tab.
Click the name of the identifier you want to edit.
Click Edit.
Edit the field you want to change.
Click Save.

Built-in identifiers cannot be edited.

Delete an identifier

Deleting an identifier will remove it from all the frameworks it is in throughout Immuta.

To delete an identifier:

Click the Discover icon in the navigation menu and select the Identifiers tab.
Click the more actions icon in the Action column for the identifier you want to delete.
Select Delete and click Delete again in the modal.

Built-in identifiers cannot be deleted.

Run and Manage Sensitive Data Discovery on Data Sources

Requirements:

Registered Snowflake, Databricks, Redshift, or Starburst (Trino) data sources
Immuta permission GOVERNANCE

Identification (or sensitive data discovery (SDD)) runs automatically. If you want to re-run identification when a new global framework is set or when new identifiers have been added to a framework, you can run it through the API or from the UI by following a how-to below.

Run identification using a specific framework

Click the Discover icon in the navigation menu and select the Identification tab.
Select the more actions icon for the framework you want to run.
Select Run Identification and then select it again in the modal.

Run identification on a data source

Navigate to the data source overview page.
Click the health status.
Select Re-run next to Sensitive Data Discovery (SDD).

Verify discovered tags

If sensitive data discovery has been enabled, then manually adding tags to columns in the data dictionary will be unnecessary in most cases. The data owner will just need to verify that the Discovered tags are correct.

Disable Discovered tags from the data dictionary

If a governor, data owner, or data source expert disables a Discovered tag from the data dictionary, the column will not be re-tagged next time identification (or SDD) runs. When a Discovered tag is disabled, it will not completely disappear, and it can be manually enabled through the tag side sheet.

To disable a Discovered tag:

Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click the tag you want to remove.
Click Disable in the side sheet and then click Confirm.

Manage Sensitive Data Discovery Settings

Requirement: Immuta permission APPLICATION_ADMIN

Configure the global framework

Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Enter the request-friendly name of your global identification framework in the Global SDD Template Name field. This name can be found in the URL when you navigate to the identification framework's page.
Click Save, and then Confirm your changes.

Migrate From Legacy to Native SDD

This guide provides information and best practices for migrating from the deprecated legacy sensitive data discovery (SDD) option to the improved native SDD. This guide is for users who have already enabled SDD on their tenant and have Discovered tags on their data sources.

Before you begin

Native vs legacy SDD

Legacy SDD is deprecated. It will be removed and replaced by native SDD. Native SDD is significantly improved from legacy SDD for discovering and tagging your data with upgrades to the built-in identifiers. Additionally, the greatest benefit is the respect for data residency. Native SDD doesn't move any of your data when running. The discovery is done right in your data platform, and the platform only returns the matching identifiers and column names to Immuta.

See the Data discovery page for more information on native SDD.

Requirements

Native SDD requires Snowflake, Databricks, Starburst (Trino), or Redshift data sources
Legacy SDD enabled on your tenant
Legacy SDD tags applied to your data sources: To find out if you have legacy SDD tags applied, create a governance report as described in the Understand the context of your tags section.

Enable native SDD

Contact your Immuta representative to enable native SDD on your Immuta tenant. Many users already have native SDD enabled, so proceed to the Understand the context of your tags section if you want to check for yourself whether native SDD is already running and tagging your data before you reach out to the representative.

This action will not change anything immediately on your tenant; however, anytime identification runs in the future, it will be native SDD instead of the legacy version.

To assess native SDD for your data, proceed with the steps below. If you do not review native SDD, the legacy SDD tags will all remain on your data source columns. However, when identification runs on new data sources and columns, it will apply native SDD tags, and because of the improvements to SDD, it may tag different data than legacy SDD.

Understand the context of your tags

Requirement: Immuta permission GOVERNANCE

To check the tags on an individual data source, navigate to the data source data dictionary and select a Discovered tag. On the tag side sheet, you can determine the context of the tag. When identifiers match data, native SDD will apply tags, and their tag context will be Sensitive Data Discovery. Any tags with the context Legacy Sensitive Data Discovery were not matched by native SDD but will remain on the data source.
To check your tags globally, navigate to the governance reports page and build a report for sensitive data discovery. This report will present the legacy tags on your data sources' columns and native SDD tags that are also on those columns. Use this report to assess the context of the Discovered tags and understand if native SDD is matching the data you want it to.

These actions will allow you to understand the differences between how native SDD and legacy SDD tag your data and whether your data is recognized as expected by native SDD or if legacy SDD was over-tagging your data. This way you can better tune SDD to your data.

If there are any legacy SDD tags that you want native SDD to catch, you need to tune native SDD so that this type of data is discovered in future tables and columns; see guidance on that in the next section.
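
If you export the report data, a short script can list the columns that still carry legacy-context tags next to whatever native SDD applied to the same column. This is a sketch under stated assumptions: the row layout (data source, column, tag, context), the function name, and the tag names are hypothetical, while the two context labels are the ones described above.

```python
from collections import defaultdict

NATIVE = "Sensitive Data Discovery"
LEGACY = "Legacy Sensitive Data Discovery"


def columns_to_review(rows):
    """Map each column that has legacy-context tags to its legacy and native tag sets."""
    native, legacy = defaultdict(set), defaultdict(set)
    for data_source, column, tag, context in rows:
        key = (data_source, column)
        if context == NATIVE:
            native[key].add(tag)
        elif context == LEGACY:
            legacy[key].add(tag)
    return {
        key: {"legacy": tags, "native": native.get(key, set())}
        for key, tags in sorted(legacy.items())
    }


# Hypothetical report rows for illustration.
rows = [
    ("crm.customers", "email", "Discovered.Example.Email", NATIVE),
    ("crm.customers", "notes", "Discovered.Example.Name", LEGACY),
    ("crm.customers", "notes", "Discovered.Example.Text", NATIVE),
    ("crm.orders", "ref", "Discovered.Example.Id", LEGACY),
]
review = columns_to_review(rows)
for key, tags in review.items():
    print(key, tags)
```

Columns with an empty native set are the ones where native SDD found no match at all, which makes them the most likely candidates for a new identifier.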

Tune SDD

Requirement: Immuta permission GOVERNANCE

Using the report you built above, complete these actions to tune SDD:

Focus on a legacy SDD tag properly applied to your data. Assess whether the native SDD tag on the column instead was applied more accurately than the legacy tag. If it is applied incorrectly, proceed to the next step.
Complete the steps above for all legacy SDD tags.

Completing the actions above will bring the tags that native SDD applies in the future into parity with the tags that legacy SDD applied to your data.

Reference Guide

Built-in Identifier Reference

Immuta comes with a set of built-in identifiers that look for common data types. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own frameworks and edit the tags applied by them.

When using SDD with classification frameworks, it is recommended to use the default resulting tags listed in the table below for these built-in identifiers. This ensures that the framework rules apply sensitivity tags as intended.

Identifiers must match at least 90% of the sampled data to be tagged, with two exceptions noted below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.

Deprecation notice

The following Discovered tags have been deprecated:

Discovered.Identifier Direct
Discovered.Identifier Indirect
Discovered.Identifier Undetermined
Discovered.PCI
Discovered.PHI
Discovered.PII

New SaaS tenants will not see these tags applied by SDD. Current tenants relying on these tags for policies should contact their Immuta representative for support before these tags are removed from the product.

Identifier descriptions and default resulting tags

Identifier | Description | Resulting tags from the default identifier

Improved Pack: Built-in Identifier Reference

Public preview

This feature is available to all tenants. Reach out to your Immuta support professional to use this feature.

Immuta comes with a pack of built-in identifiers that look for common data types. Since the first pack was released, Immuta has made improvements that are now available in this improved pack, which includes some unchanged identifiers along with many new and improved versions of legacy identifiers. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own frameworks and edit the tags applied by them.

Identifiers must match at least 90% of the sampled data to be tagged, with three exceptions noted below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.

Identifier descriptions and default resulting tags

Identifier | Description | Resulting tags from the default identifier

Built-in Discovered Tags Reference

Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See a list of the built-in Discovered tags below and the Built-in identifier reference page for information about where these tags will be applied by the built-in identifiers.

Country tags

All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.

Child tag name | Description

Entity tags

All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.

Identifier tags

Deprecation notice

The following identifier tags have been deprecated. New SaaS tenants will not see these tags applied by SDD. Current tenants relying on these tags for policies should contact their Immuta representative for support before these tags are removed from the product.

None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . Identifier Direct.

Personal information tags

Deprecation notice

None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . PCI.

How Competitive Pattern Analysis Works

Of sensitive data discovery's three criteria options, regex and dictionary are competitive. This means that when assessing your data, if multiple identifiers could match, only one with competitive criteria will be chosen to tag the data. To better understand how Immuta executes this competition, read further.

Discover employs a three-phased competitive criteria analysis approach for sensitive data discovery (SDD):

Sampling: No data is moved, and Immuta checks the identifiers against a sample of data from the table.
Qualifying: Identifiers whose criteria match less than 90% of the sample are filtered out.
Scoring: The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.

In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.

Sampling

In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.

The sample size is decided based on the number of identifiers and the data size, when available. In the simplest case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the framework, not the data size. The sample size depends only weakly on the number of identifiers and will not exceed 13,000 rows.

Number of identifiers

Sample size

Sampling considerations

In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer than the requested number of rows, when many of the column values are null, or because of technology-specific limitations.

Snowflake and Starburst (Trino): Discover implements native table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
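
The arithmetic behind the Databricks and Redshift strategy can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above (a flat 10% sample capped at 10,000 rows); the function name and parameters are hypothetical, and real samples are drawn by the remote database, so actual counts will vary around these expected values.

```python
def expected_best_effort_sample(table_rows: int,
                                sample_percent: int = 10,
                                cap: int = 10_000) -> int:
    """Expected number of sampled rows for a flat percentage sample with a row cap."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return min(table_rows * sample_percent // 100, cap)


# Tables with fewer than 100,000 rows yield fewer than 10,000 sampled rows.
for rows in (5_000, 50_000, 100_000, 5_000_000):
    print(f"{rows:>9,} rows -> about {expected_best_effort_sample(rows):,} sampled")
```

A table needs at least 100,000 rows before the 10% sample reaches the 10,000-row cap, which is why smaller tables may be under-sampled.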

Qualifying

During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if the hit rate on the remote sample exceeds the predefined threshold. This threshold is a 90% match for most built-in identifiers; however, a few built-in identifiers have a lower threshold. The 90% threshold is standard for all custom identifiers as well to ensure the criteria matches the data within the column and avoid false positives. If no identifiers qualify, then no identifier is assessed for scoring and the column is not tagged.

Scoring

During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers since the ability to satisfy the more difficult-to-satisfy identifier itself serves as evidence that it is more likely. This phase ends by returning a single most likely identifier per the inference process.

Example

Here is a set of regex identifiers and a sample of data:

Identifiers:

[a-zA-Z0-9]{3} - This regex will match 3 character strings with the characters a-z, lowercase or uppercase, or digits 0-9.
[a-c]{3} - This regex will match 3 character strings with the characters a-c, lowercase.
(a|b|d){3} - This regex will match 3 character strings with the characters a, b, or d, lowercase.

When qualifying the identifiers, Identifier 1 and Identifier 3 both match 90% or more of the data. Identifier 2 does not, and is disqualified.

Then the qualified identifiers are scored. Here, Identifier 1, despite matching 100% of the data, is unspecific and could match over 200,000 values. On the other hand, Identifier 3 matches only 90% of the data but is very specific, with only 27 possible values.

Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
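
The example can be reproduced with a short Python sketch. Two things here are assumptions rather than Immuta behavior: the sample values are hypothetical, chosen so that the hit rates agree with the text, and the scoring step is a deliberate simplification that picks the qualified identifier accepting the fewest strings. It stands in for Immuta's inference process and is not a description of it. Python's re module is used in place of RE2; these three patterns behave the same in both.

```python
import re

THRESHOLD = 0.90  # qualification threshold used in the example

# name -> (regex, number of distinct strings the regex accepts)
identifiers = {
    "Identifier 1": (r"[a-zA-Z0-9]{3}", 62 ** 3),  # 238,328 possible values
    "Identifier 2": (r"[a-c]{3}", 3 ** 3),         # 27 possible values
    "Identifier 3": (r"(a|b|d){3}", 3 ** 3),       # 27 possible values
}

# Hypothetical column sample: nine values over {a, b, d} plus one outlier.
sample = ["abd", "bad", "dab", "add", "bdb", "dad", "aba", "ddd", "bda", "x7Q"]


def hit_rate(pattern: str, values: list[str]) -> float:
    """Fraction of values that the pattern matches in full."""
    hits = sum(1 for value in values if re.fullmatch(pattern, value))
    return hits / len(values)


# Qualifying: keep identifiers that match 90% or more of the sample.
rates = {name: hit_rate(regex, sample) for name, (regex, _) in identifiers.items()}
qualified = [name for name, rate in rates.items() if rate >= THRESHOLD]

# Scoring (simplified stand-in): prefer the most restrictive qualified identifier.
winner = min(qualified, key=lambda name: identifiers[name][1])

print(rates)
print(qualified, "->", winner)
```

Identifier 2 is disqualified because most of the hypothetical values contain the letter d, and Identifier 3 wins over Identifier 1 because it accepts far fewer strings.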

Important notes

Dictionaries are part of the competitive process, while column-name regex are not.
Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.
Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.

Data Discovery

Components

Identification framework

An identification framework is a group of identifiers that will look for particular criteria and tag any columns where those conditions are met.

For a how-to on the framework actions users can take, see the Manage frameworks page.

Global framework

Identifier

An identifier consists of criteria and the tags to apply to data that matches those criteria. When Immuta recognizes a match, it can tag the data to describe its type.

Improved identifiers

A new and improved pack of the built-in identifiers was released in October 2024.

If you are interested in these improved identifiers, reach out to your Immuta support professional.

For a how-to on the identifier actions users can take, see the Create an identifier page.

Criteria

Criteria are the conditions that need to be met for resulting tags to be applied to data.

SDD only supports regular expressions (regex) written in RE2 syntax.

Supported criteria types for identifiers

Competitive criteria analysis: This process reviews all the regex and dictionary criteria within the identifiers of the framework and searches for the identifier with the best fit. In this review, the competitive criteria analysis identifiers in the framework compete against one another to find the most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier will apply per column. To learn more about how this competition works, see the How competitive pattern analysis works guide.
- Regex: This criteria contains a case-insensitive regular expression that searches for matches against column values.
- Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied.
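
The difference between column name criteria and competitive criteria can be illustrated with a small Python sketch. The identifiers, regexes, and tag names below are hypothetical, and the use of a substring search (rather than a full match) is an assumption. The point is only that the regex is evaluated against the column name, ignoring case, and that every matching identifier contributes its tag.

```python
import re

# Hypothetical column name identifiers: regex -> tag to apply (illustrative names only).
column_name_identifiers = {
    r"e-?mail": "Contact . Email",
    r"^(cust|customer)_": "Customer . Data",
}


def tags_for_column(column_name: str) -> list[str]:
    """Return the tag of every identifier whose regex is found in the column name."""
    return [
        tag
        for pattern, tag in column_name_identifiers.items()
        if re.search(pattern, column_name, flags=re.IGNORECASE)
    ]


print(tags_for_column("CUSTOMER_EMAIL"))  # both identifiers match
print(tags_for_column("order_total"))     # no identifier matches
```

Unlike the competitive criteria, nothing is scored here: both hypothetical identifiers apply to CUSTOMER_EMAIL at the same time.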

Create a new identifier in the Immuta UI or with the sdd/classifier endpoint.

Configuration

Only application admins can enable sensitive data discovery (SDD) globally on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.

Tag mutability

Performance

The amount of time it takes to run identification on a data source depends on several factors:

Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.

Testing

Considerations

Supported data types and casing

Type of identifier

Supported data types

Case sensitivity

*Two built-in patterns support and match based on additional data types:

DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.

Limitations with dictionary patterns

Immuta compiles dictionary patterns into a regex that is sent in the body of a query.

For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.
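
To get a feel for how a dictionary's size relates to the query text limit, the following Python sketch compiles a word list into a single alternation regex and measures its size in bytes. The compilation strategy, function names, and the exact byte value used for 1 MB are assumptions for illustration; the source does not describe how Immuta builds the regex, and the rest of the query also counts toward the limit.

```python
import re

QUERY_TEXT_LIMIT_BYTES = 1024 * 1024  # assumed byte value for the 1 MB limit


def dictionary_to_regex(entries: list[str]) -> str:
    """Compile dictionary entries into one alternation regex (hypothetical strategy)."""
    # Escape each entry so characters such as "." are matched literally;
    # put longer entries first so they are preferred over their own prefixes.
    escaped = sorted((re.escape(entry) for entry in entries), key=len, reverse=True)
    return "(?:" + "|".join(escaped) + ")"


def fits_query_limit(entries: list[str], limit: int = QUERY_TEXT_LIMIT_BYTES) -> bool:
    """Check whether the compiled regex alone stays under the query text limit."""
    return len(dictionary_to_regex(entries).encode("utf-8")) < limit


colors = ["red", "green", "blue"]
print(dictionary_to_regex(colors))
print(fits_query_limit(colors))
```

Because every entry is embedded in the query text, the limit constrains the total length of all entries combined, not just the number of entries.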

Databricks limitation

Starburst (Trino) limitation

SDD will only work on Starburst (Trino) data sources authenticated with username and password. OAuth 2.0 is not supported with SDD.

Redshift limitations

Redshift Spectrum is not supported with SDD.
The Redshift cluster must be up and running for SDD to successfully run.

Redshift supported authentication methods

The username and password auth method is fully supported with SDD.
Okta is not supported with SDD.
AWS access key is supported with SDD, with the following limitations:
- The AWS access key used to register the data source must be able to perform at least the following redshift-data API actions:
  - redshift-data:BatchExecuteStatement
  - redshift-data:CancelStatement
  - redshift-data:DescribeStatement
  - redshift-data:ExecuteStatement
  - redshift-data:GetStatementResult
  - redshift-data:ListStatements
- The AWS access key used to register the data source must have the redshift:GetClusterCredentials permission for the cluster, user, and database used to onboard the data sources.
- If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options formatted like the following:
  region=us-east-2;clusterid=12345
- Redshift Serverless data sources are not supported for native SDD with the AWS access key authentication method.
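
For the custom URL case above, a quick way to check that the additional connection string options carry both required keys is sketched below in Python. The helper names are hypothetical and the parsing assumes the semicolon-separated key=value format shown in the example; it is a convenience check, not part of Immuta.

```python
REQUIRED_KEYS = {"region", "clusterid"}


def parse_connection_options(options: str) -> dict[str, str]:
    """Parse 'key=value;key=value' options into a dictionary."""
    parsed = {}
    for item in options.split(";"):
        if "=" in item:  # skip empty or malformed segments
            key, value = item.split("=", 1)
            parsed[key.strip()] = value.strip()
    return parsed


def missing_keys(options: str) -> list[str]:
    """Return the required keys that are absent from the options string."""
    return sorted(REQUIRED_KEYS - set(parse_connection_options(options)))


print(missing_keys("region=us-east-2;clusterid=12345"))  # nothing missing
print(missing_keys("region=us-east-2"))                  # clusterid is missing
```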

Migrating from legacy to native SDD

These limitations are only relevant to users who have previously enabled and run Immuta SDD.

Immuta has improved the performance and behavior of sensitive data discovery (SDD), so references to two types of SDD can be found in the product:

Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags.
Native SDD was released to Snowflake and Databricks in May 2023. It was released to Starburst (Trino) and Redshift in April 2024. Native SDD is the only type of SDD available. It is often just referred to as SDD.

See the Migrate from legacy to native SDD page for more information.