Data Identification

Identification is an Immuta feature that uses data patterns to determine what type of data each column contains. Using identifiers within domains, Immuta evaluates your data and can assign the appropriate tags in your data dictionary based on what it finds. This saves you the time of identifying your data manually and gives you a standard taxonomy across all your data sources in Immuta.

Architecture

To evaluate your data, Immuta generates a SQL query using a domain's identifiers. The Immuta system account then executes that query in the remote technology to match any regex and dictionary identifiers. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. Column name identifiers are all matched within Immuta and don't require any query to the remote technology. These results are then used to apply the resulting tags to the appropriate columns.
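As a rough illustration of the column name path described above, the sketch below matches identifiers against column names only, so no data values are read. The identifier names, patterns, and tag names are hypothetical, and Python's `re` module stands in for RE2. Using a substring search rather than a full match is also an assumption.

```python
import re

# Hypothetical identifiers: name -> (column-name regex, tags to apply).
COLUMN_NAME_IDENTIFIERS = {
    "email_column": (r"e[-_]?mail", ["Discovered.Email"]),
    "phone_column": (r"phone|mobile", ["Discovered.Phone"]),
}


def tag_by_column_name(column_names):
    """Return {column: sorted tags} using only column names, never values."""
    results = {}
    for column in column_names:
        tags = set()
        for pattern, identifier_tags in COLUMN_NAME_IDENTIFIERS.values():
            # Case-insensitive search; several identifiers may match one column.
            if re.search(pattern, column, flags=re.IGNORECASE):
                tags.update(identifier_tags)
        if tags:
            results[column] = sorted(tags)
    return results


tagged = tag_by_column_name(["Customer_Email", "MOBILE_PHONE", "order_total"])
```

Here `order_total` matches no identifier, so it receives no tags.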

This evaluating and tagging process occurs when identification runs and happens automatically from the following event:

The following actions will also trigger identification:

Identifiers

Identification runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criteria and the tags that will be applied when the criteria's conditions have been met.

There are two types of identifiers in Immuta:

  1. Reference identifiers: This is a library of the identifiers that can be added to domains. When added to a domain, a copy of the reference identifier is made as the domain-specific identifier.

    1. Immuta comes with built-in identifiers to discover common categories of data. These cannot be modified or deleted.

    2. Data governors can create their own reference identifiers for use within your organization.

  2. Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when identification runs.

    1. Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.

    2. If a domain-specific identifier was copied from a reference identifier, no lineage is kept between the two, so edits to the reference identifier are not reflected in the domain-specific copy.

Criteria

Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.

Supported criteria types for identifiers

  • Competitive criteria analysis: This criteria type reviews all the regex and dictionary criteria within the domain's identifiers and searches for the identifier with the best fit. All competitive criteria analysis identifiers in the domain compete against one another to find the best and most specific identifier that fits the data, and the resulting tags of the winning identifier are applied to the column. Only one competitive criteria analysis identifier per domain will apply to each column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about how identifiers compete, see the How competitive criteria analysis works guide.

    • Regex: This criteria contains a case-insensitive regular expression (regex) that searches for matches against column values. Immuta only supports regular expressions written in RE2 syntax.

    • Dictionary: This criteria contains a list of words and phrases to match against column values.

  • Column name: This criteria includes a case-insensitive regular expression (regex) matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied. Immuta only supports regular expressions written in RE2 syntax.
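The 90% threshold for competitive criteria can be sketched as follows. This is a minimal illustration, not Immuta's implementation: the identifier names and patterns are hypothetical, full-value matching is an assumption, and the step that picks the single "best and most specific" identifier among those that qualify is deliberately not modeled.

```python
import re

THRESHOLD_PERCENT = 90  # minimum share of sampled values that must match

# Hypothetical competitive identifiers: name -> case-insensitive regex.
IDENTIFIERS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[a-z]+", re.IGNORECASE),
    "us_zip": re.compile(r"\d{5}", re.IGNORECASE),
}


def qualifying_identifiers(sampled_values):
    """Return the names of identifiers matching at least 90% of the sample."""
    total = len(sampled_values)
    if total == 0:
        return []
    qualifying = []
    for name, pattern in IDENTIFIERS.items():
        matched = sum(1 for value in sampled_values if pattern.fullmatch(value))
        # Integer arithmetic avoids floating-point edge cases at exactly 90%.
        if matched * 100 >= THRESHOLD_PERCENT * total:
            qualifying.append(name)
    return sorted(qualifying)


# 9 of 10 values are emails, so the "email" identifier qualifies.
mostly_emails = [f"user{i}@example.com" for i in range(9)] + [""]
# Only 8 of 10 match, which falls below the threshold.
too_noisy = [f"user{i}@example.com" for i in range(8)] + ["unknown", "n/a"]
```

With these samples, `qualifying_identifiers(mostly_emails)` returns `["email"]` and `qualifying_identifiers(too_noisy)` returns an empty list.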

Create a new identifier in the Immuta UI or with the sdd/identifier endpoint.

Identification framework

An identification framework is a group of identifiers that will look for particular criteria and tag any columns where those conditions are met.

While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in "Default Framework," which contains all the built-in identifiers and assigns the built-in Discovered tags.

For a how-to on the framework actions users can take, see the Manage frameworks page.

Global framework

Each organization can set a global framework to apply to all the data sources in Immuta by default unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. If a global framework is set, identification will run on all new data sources. If a global framework is not set, identification will only run on data sources manually applied to an identification framework.

Users can set any framework as the global framework or leave the global framework field blank.

Supported technologies

Identification has varied support for data sources from different technologies based on the identifier type.

| Technology | Regex | Dictionary | Column name regex |
|---|---|---|---|
| Amazon S3 | Not supported | Not supported | Supported |
| AWS Lake Formation | Not supported | Not supported | Supported |
| Azure Synapse Analytics | Not supported | Not supported | Supported |
| Databricks | Supported | Supported | Supported |
| Google BigQuery | Not supported | Not supported | Supported |
| Redshift | Supported | Supported | Supported |
| Snowflake | Supported | Supported | Supported |
| Starburst (Trino) | Supported | Supported | Supported |

What has changed with identification in 2025.1?

If you used identification (previously SDD) prior to this feature release in 2025.1, there are some differences:

  1. There are now two types of identifiers:

    1. Reference identifiers

    2. Domain-specific identifiers

    See the Identifiers section for information about these.

  2. There is a new permission to manage identifiers within domains: Manage Identifiers. The permission allows you to do the following:

    1. Create an identifier within your domain

    2. View the reference identifiers in Immuta

    3. Add, edit, and delete identifiers within your domain

  3. Previously, tags applied by identification had to be from the parent Discovered tag. However, with identifiers in domains, any tag can be used in an identifier.

  4. The following have been removed:

    1. Identification frameworks: Previously, all identifiers had to be contained within a framework and that framework had to be assigned to a data source to run. Now, identifiers are added to domains with data sources.

    2. Global framework: Previously, a global framework could be set to run identification automatically on all new data sources not currently in a framework. This behavior can be similarly obtained if you are using connections by creating a domain with dynamic assignment based on the Immuta Connections tag.

Before and after comparison

The table below compares when identification ran before this release (with the SDD feature) and when it runs after it (with identifiers in domains).

| Event | Before | After |
|---|---|---|
| Identification runs automatically on all new data sources | Yes, if a global framework is set | |
| Identification runs automatically on new data sources found from schema monitoring | Yes, if a global framework is set | |
| Identification runs automatically on new columns found from column detection in a data source where identification has already run | Yes | Yes |
| Identification runs automatically when a data source is added to a domain with identifiers | No | Yes |
| Identification runs when a user manually triggers it from the data source health check menu | Yes | Yes |
| Identification runs when a user manually triggers it from the domain's page | No | Yes |
| Identification runs when a user manually triggers it from the identification framework page | Yes | No |
| Identification runs when a user manually triggers it through the API | Yes | |

Tag mutability

When identification is manually triggered by a data owner, all column tags previously applied by identification are removed and the tags prescribed by the latest run are applied. However, if identification is triggered because a new column is detected by schema monitoring or object sync, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
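The two behaviors above can be sketched as follows. This is a simplified model, not Immuta's implementation: it tracks only tags applied by identification, keyed by column name, and the column and tag names are hypothetical.

```python
def rerun_manually(identification_tags, latest_result):
    """Manual trigger: previous identification tags are discarded entirely
    and replaced by whatever the latest run prescribes."""
    return {column: set(tags) for column, tags in latest_result.items() if tags}


def on_new_columns(identification_tags, latest_result, new_columns):
    """New-column trigger: only the new columns are tagged; tags on
    existing columns are left exactly as they were."""
    updated = {column: set(tags) for column, tags in identification_tags.items()}
    for column in new_columns:
        if latest_result.get(column):
            updated[column] = set(latest_result[column])
    return updated


# Hypothetical state: "notes" carries a stale tag from an earlier run.
current = {"email": {"Discovered.Email"}, "notes": {"Discovered.Phone"}}
latest = {"email": {"Discovered.Email"}, "ssn": {"Discovered.SSN"}}

after_manual = rerun_manually(current, latest)
after_new_column = on_new_columns(current, latest, ["ssn"])
```

In this example the manual rerun drops the stale tag on `notes`, while the new-column run keeps it and only adds tags for `ssn`.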

Performance

The amount of time it takes to run identification on a data source depends on several factors:

  • Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.

  • Identifiers: The number of identifiers being used weakly impacts the time to run identification.

  • Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.

  • Views: Performance on views is limited by the performance of the query that defines the view. Running identification on complex views with large amounts of data is more likely to result in timeouts. Immuta recommends running identification on the underlying base tables.

The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by identification's performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.

Testing

If you want to test identification, note that Immuta's built-in identifiers must match 90% of the sampled data before they are assigned to a column. Synthetic data may therefore not be realistic enough to reach the match rate an identifier requires. To test identification, use a dev environment and create copies of your tables.

Audit

The following identification-related events are audited and can be found on the audit page in the UI:

  • SDDClassifierCreated: An identifier is created.

  • SDDClassifierDeleted: An identifier is deleted.

  • SDDClassifierUpdated: An identifier's criteria, description, name, or tag is updated.

  • TagApplied: A tag is applied to a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.

  • TagRemoved: A tag is removed from a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.
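If audit events are exported for analysis, tag events produced by identification could be separated from user-initiated ones with a filter like the sketch below. The event shape is an assumption: it treats each event as a dictionary with a hypothetical `action` key, a nested `actor.name`, and a `relatedResources` list whose items carry a `type`. Adjust the keys to match the actual audit payload.

```python
IDENTIFICATION_ACTOR = "Immuta System Account"
TAG_ACTIONS = {"TagApplied", "TagRemoved"}


def is_identification_tag_event(event):
    """True when a tag event was made by the system account and references
    at least one identifier (related resource of type CLASSIFIERS)."""
    if event.get("action") not in TAG_ACTIONS:
        return False
    if event.get("actor", {}).get("name") != IDENTIFICATION_ACTOR:
        return False
    related = event.get("relatedResources", [])
    return any(resource.get("type") == "CLASSIFIERS" for resource in related)


# Hypothetical sample events.
events = [
    {
        "action": "TagApplied",
        "actor": {"name": "Immuta System Account"},
        "relatedResources": [{"type": "CLASSIFIERS", "name": "email"}],
    },
    {
        "action": "TagApplied",
        "actor": {"name": "jane.doe@example.com"},
        "relatedResources": [],
    },
    {"action": "SDDClassifierCreated", "actor": {"name": "jane.doe@example.com"}},
]

from_identification = [e for e in events if is_identification_tag_event(e)]
```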

Considerations

Deleting the built-in Discovered tags is not recommended: If you delete built-in Discovered tags and use the built-in identifiers without editing their tags, columns that match those identifiers will not be tagged. As an alternative, you can disable tags on a column-by-column basis from the data dictionary, or you can avoid adding identifiers to domains, in which case identification will not run.

Supported data types and casing

| Type of identifier | Supported data types | Case sensitivity |
|---|---|---|
| Data regex* | Text string columns | Case-sensitive |
| Column name regex | Any column | Not case-sensitive |
| Dictionary | Text string columns | Can be toggled in the identifier definition |

*Two built-in patterns support and match based on additional data types:

  • DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.

  • TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
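The DATE and TIME rules above can be sketched as follows. The regexes and type names are hypothetical stand-ins (the built-in patterns are not shown here), and requiring every sampled value to match is a simplifying assumption.

```python
import re

# Hypothetical stand-ins for the built-in patterns.
DATE_REGEX = re.compile(r"\d{4}-\d{2}-\d{2}")
TIME_REGEX = re.compile(r"\d{2}:\d{2}(:\d{2})?")
NATIVE_DATE_TYPES = {"date", "datetime", "timestamp"}


def _all_match(pattern, values):
    """True when there is at least one value and every value fully matches."""
    return bool(values) and all(pattern.fullmatch(value) for value in values)


def matches_date(column_type, values=()):
    kind = column_type.lower()
    if kind in NATIVE_DATE_TYPES:
        return True  # native date types match without inspecting values
    return kind == "string" and _all_match(DATE_REGEX, values)


def matches_time(column_type, values=()):
    kind = column_type.lower()
    if kind == "time":
        return True
    # A value that also contains a date fails the full match on purpose.
    return kind == "string" and _all_match(TIME_REGEX, values)
```

For example, `matches_time("string", ["12:30:00"])` is true, while `matches_time("string", ["2024-01-31 12:30:00"])` is false because the date is included in the value.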

Limitations with dictionary patterns

Immuta compiles dictionary patterns into a regex that is sent in the body of a query.

For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.
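To estimate whether a dictionary is likely to fit, you could compile it into a single alternation and measure its size, as in the sketch below. How Immuta actually compiles dictionaries is not documented here, so the alternation format is an assumption, as are the 1,000,000-byte reading of "1 MB" and the `overhead_bytes` allowance for the rest of the query text.

```python
import re

QUERY_TEXT_LIMIT_BYTES = 1_000_000  # assumed conservative reading of "1 MB"


def dictionary_to_regex(terms):
    """Join escaped terms into one non-capturing alternation."""
    return "(?:" + "|".join(re.escape(term) for term in terms) + ")"


def fits_in_query(terms, overhead_bytes=0):
    """Check the compiled regex plus any query overhead against the limit."""
    size = len(dictionary_to_regex(terms).encode("utf-8"))
    return size + overhead_bytes <= QUERY_TEXT_LIMIT_BYTES


small_ok = fits_in_query(["alpha", "beta", "gamma"])
large_ok = fits_in_query(["x" * 2_000_000])
```

Measuring the UTF-8 encoded length matters for dictionaries with non-ASCII terms, since those take more than one byte per character.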

Databricks limitations

  • Immuta will start up a Databricks cluster to complete the identification job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.

  • The following Databricks Unity Catalog securable objects are supported with Immuta, but cannot be used with identification:

    • Volumes

    • Models

    • Functions

Redshift limitations

  • The Redshift cluster must be up and running for identification to successfully run.

AWS access key limitations

To use AWS access key authentication on a Redshift data source with competitive criteria analysis identifiers, the following requirements must be met:

  • The AWS access key used to register the data source must be able to do a minimum of the following redshift-data API actions:

    • redshift-data:BatchExecuteStatement

    • redshift-data:CancelStatement

    • redshift-data:DescribeStatement

    • redshift-data:ExecuteStatement

    • redshift-data:GetStatementResult

    • redshift-data:ListStatements

  • The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database used to onboard the data sources.

  • If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options formatted like the following example:

      region=us-east-2;clusterid=12345
  • Redshift Serverless data sources are not supported for competitive criteria analysis identifiers with the AWS access key authentication method.
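The permissions listed above could be expressed as an IAM policy document along the lines of the sketch below, built here as a Python dictionary and serialized to JSON. The account ID, cluster, user, and database names are hypothetical placeholders, and the `"Resource": "*"` scope on the redshift-data actions is an assumption that you may want to tighten.

```python
import json

REDSHIFT_DATA_ACTIONS = [
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
]


def build_policy(credential_resources):
    """Build a policy granting the minimum actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": REDSHIFT_DATA_ACTIONS,
                "Resource": "*",  # assumption: scope down where possible
            },
            {
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": credential_resources,
            },
        ],
    }


# Hypothetical placeholder ARNs for the database user and database.
policy = build_policy(
    [
        "arn:aws:redshift:us-east-2:123456789012:dbuser:example-cluster/immuta_user",
        "arn:aws:redshift:us-east-2:123456789012:dbname:example-cluster/example_db",
    ]
)
policy_json = json.dumps(policy, indent=2)
```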

Legacy SDD

Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags applied to specific data sources. These tags will be removed the next time identification runs.
