Data Identification
Identification is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identifiers within domains, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
To evaluate your data, Immuta generates a SQL query using a domain's identifiers. The Immuta system account then executes that query in the remote technology to match any regex and dictionary identifiers. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. Column name identifiers are all matched within Immuta and don't require any query to the remote technology. These results are then used to apply the resulting tags to the appropriate columns.
This evaluating and tagging process occurs when identification runs and happens automatically from the following event:
A new data source is added to a domain with identifiers (either manually or )
The following actions will also trigger identification:
Column detection is enabled, and new columns are detected on data sources within a domain with identifiers. Here, identification will only run on new columns, and no existing tags will be removed or changed.
. Note that this will use the identifiers that are already applied to the data source.
.
A user manually triggers it through the API.
Identification runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criteria and the tags that will be applied when the criteria's conditions have been met.
There are two types of identifiers in Immuta:
Reference identifiers: This is a library of the identifiers that can be added to domains. When added to a domain, a copy of the reference identifier is made as the domain-specific identifier.
Immuta comes with to discover common categories of data. These cannot be modified or deleted.
Data governors can create their own reference identifiers for use within your organization.
Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when identification runs.
Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.
If a domain-specific identifier was copied over from a reference identifier, there is no lineage and any edits to the reference identifier will not be reflected in the domain-specific copy.
Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.
Regex: This criteria contains a case-insensitive regular expression (regex) that searches for matches against column values. Immuta only supports regular expressions written in RE2 syntax.
Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression (regex) matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied. Immuta only supports regular expressions written in RE2 syntax.
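As a rough illustration of the column name criteria described above, the sketch below checks case-insensitive patterns against column names only and collects the tags of every identifier that matches. The identifier names, patterns, and tags are hypothetical, the use of a substring search is an assumption, and Python's re module stands in for RE2 (the two syntaxes agree on simple patterns like these but are not identical).

```python
import re

# Hypothetical column name identifiers: name -> (pattern, tags to apply).
COLUMN_NAME_IDENTIFIERS = {
    "email_column": (r"e[-_]?mail", ["Example.Email"]),
    "contact_column": (r"(email|phone|contact)", ["Example.Contact"]),
}


def tags_for_column(column_name):
    """Return the tags of every identifier whose pattern matches the column name."""
    tags = []
    for pattern, identifier_tags in COLUMN_NAME_IDENTIFIERS.values():
        # Only the column name is inspected, never the values in the column.
        if re.search(pattern, column_name, flags=re.IGNORECASE):
            tags.extend(identifier_tags)
    return tags


# More than one column name identifier can match the same column.
print(tags_for_column("CUSTOMER_EMAIL"))
print(tags_for_column("order_total"))
```

Because the match runs on names alone, a check like this needs no query against the remote technology.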
An identification framework is a group of identifiers that will look for particular criteria and tag any columns where those conditions are met.
While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in "Default Framework," which contains all the built-in identifiers and assigns the built-in Discovered tags.
Each organization can set a global framework to apply to all the data sources in Immuta by default unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. If a global framework is set, identification will run on all new data sources. If a global framework is not set, identification will only run on data sources manually applied to an identification framework.
Amazon S3: Not supported | Not supported | Supported
AWS Lake Formation: Not supported | Not supported | Supported
Azure Synapse Analytics: Not supported | Not supported | Supported
Databricks: Supported | Supported | Supported
Google BigQuery: Not supported | Not supported | Supported
Redshift: Supported
Snowflake: Supported | Supported | Supported
Starburst (Trino): Supported
If you used identification (previously SDD) prior to this feature release in 2025.1, there are some differences:
There are now two types of identifiers:
Reference identifiers
Domain identifiers
There is a new permission to manage identifiers within domains: Manage Identifiers. The permission allows you to do the following:
Create an identifier within your domain
View the reference identifiers in Immuta
Add, edit, and delete identifiers within your domain
Previously, tags applied by identification had to be from the parent Discovered tag. However, with identifiers in domains, any tag can be used in an identifier.
The following have been removed:
Identification frameworks: Previously, all identifiers had to be contained within a framework and that framework had to be assigned to a data source to run. Now, identifiers are added to domains with data sources.
The amount of time it takes to run identification on a data source depends on several factors:
Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view. Running identification on complex views with large amounts of data is more likely to result in timeouts. Immuta recommends running identification on the underlying base tables.
Default 15-minute timeout
For users interested in testing identification, note that Immuta's built-in identifiers must match at least 90% of the data to be assigned to a column. With synthetic data, the values may not be realistic enough to reach the confidence needed to match identifiers. To test identification, use a dev environment and create copies of your tables.
Data regex*: Text string columns. Case-sensitive.
Column name regex: Any column. Not case-sensitive.
Dictionary: Text string columns. Case sensitivity can be toggled in the identifier definition.
*Two built-in patterns support and match based on additional data types:
DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
Immuta compiles dictionary patterns into a regex that is sent in the body of a query.
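To make the idea of compiling a dictionary into a regex concrete, here is a minimal sketch. It is not Immuta's implementation: the function name, the anchoring of the pattern to the whole value, and the longest-first ordering are all assumptions made for illustration, and Python's re module is used in place of RE2.

```python
import re


def dictionary_to_regex(entries, case_sensitive=False):
    """Compile dictionary words and phrases into a single alternation pattern."""
    # Escape every entry so characters such as "." are matched literally,
    # and list longer entries first (an assumption, not documented behavior).
    escaped = [re.escape(entry) for entry in sorted(entries, key=len, reverse=True)]
    flags = 0 if case_sensitive else re.IGNORECASE
    # Assumption: the whole column value must equal one dictionary entry.
    return re.compile("^(?:" + "|".join(escaped) + ")$", flags)


pattern = dictionary_to_regex(["New York", "Los Angeles", "St. Louis"])
print(bool(pattern.match("new york")))    # case-insensitive by default
print(bool(pattern.match("New Yorker")))  # not a dictionary entry
```

The case_sensitive flag mirrors the toggle that a dictionary identifier exposes in its definition.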
The following Databricks Unity Catalog securable objects are supported with Immuta, but cannot be used with identification:
Volumes
Models
Functions
Username and password
Supported
Supported
Supported
Not supported
The Redshift cluster must be up and running for identification to successfully run.
Redshift Spectrum is only supported with column name regex identifiers.
Username and password
Supported
Supported
AWS access key
Supported
Supported
Not supported
To use AWS access key authentication on a Redshift data source and have competitive criteria analysis identifiers supported,
redshift-data:BatchExecuteStatement
redshift-data:CancelStatement
redshift-data:DescribeStatement
redshift-data:ExecuteStatement
redshift-data:GetStatementResult
redshift-data:ListStatements
The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that they onboard their data sources with.
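The permissions listed above could be collected into an IAM policy document along the following lines. This is a sketch that only assembles the JSON: the "Resource" values are placeholders chosen for illustration and should be narrowed to the specific cluster, database user, and database in a real policy.

```python
import json

# Redshift Data API actions taken from the list above.
REDSHIFT_DATA_ACTIONS = [
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
]


def build_policy(data_api_resource="*", credentials_resource="*"):
    """Build an IAM policy document granting the listed actions.

    Both resource arguments default to "*" purely as placeholders (an assumption).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": REDSHIFT_DATA_ACTIONS,
                "Resource": data_api_resource,
            },
            {
                "Effect": "Allow",
                "Action": ["redshift:GetClusterCredentials"],
                "Resource": credentials_resource,
            },
        ],
    }


policy = build_policy()
print(json.dumps(policy, indent=2))
```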
Redshift Serverless data sources are not supported for competitive criteria analysis identifiers with the AWS access key authentication method.
This is only relevant to users who enabled and ran Immuta SDD prior to October 2023.
Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags applied to specific data sources. These tags will be removed the next time identification runs.
Competitive criteria analysis: This criteria is a process that reviews all the regex and dictionary criteria within the identifiers of the domain and searches for the identifier with the best fit. In this review, the competitive criteria analysis identifiers in the domain compete against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier per domain will apply to each column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about the competitive nature, see the .
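The competition can be pictured with the small sketch below. Only the 90% threshold and the "one winner per column" rule come from the description above. How Immuta actually ranks specificity is not stated here, so the numeric specificity rank, the identifier names, and the patterns are hypothetical, and Python's re module stands in for RE2.

```python
import re

MATCH_THRESHOLD = 0.90  # at least 90% of the sampled values must match

# Hypothetical identifiers: name -> (pattern, specificity rank).
# The rank is an assumption used only to break ties in this sketch.
IDENTIFIERS = {
    "ANY_DIGITS": (r"\d+", 1),
    "US_ZIP_CODE": (r"\d{5}", 2),
}


def match_fraction(pattern, values):
    """Fraction of sampled values that the pattern matches in full."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if re.fullmatch(pattern, value))
    return hits / len(values)


def best_identifier(values):
    """Return the single winning identifier name, or None if none qualifies."""
    candidates = []
    for name, (pattern, specificity) in IDENTIFIERS.items():
        fraction = match_fraction(pattern, values)
        if fraction >= MATCH_THRESHOLD:
            candidates.append((specificity, fraction, name))
    if not candidates:
        return None
    # Prefer the most specific qualifying identifier, then the better fit.
    return max(candidates)[2]


sample = ["90210", "10001", "60614", "73301", "02139",
          "94105", "30301", "98101", "33101", "abcde"]
print(best_identifier(sample))
```

In this sample both patterns match nine of the ten values, so both clear the threshold and the more specific one wins.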
Create a new identifier in the or with the .
End-of-life (EOL) notice: Support for identification frameworks has reached EOL. Please see the .
For a how-to on the framework actions users can take, see the .
Users can or leave the global framework field blank.
Identification has varied support for from different technologies based on the identifier type.
Supported in private preview (see )
Supported in private preview (see )
Supported in public preview (see )
Supported in public preview (see )
See information about these in the .
Global framework: Previously, a global framework could be set to run identification automatically on all new data sources not currently in a framework. This behavior can be similarly obtained if you are using connections by .
When identification is manually triggered by a data owner, all column tags previously applied by identification are removed and the tags prescribed by the latest run are applied. However, if identification is triggered because a new column is detected by schema monitoring or object sync, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can to prevent them from being used and auto-tagged on that data source in the future.
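The two tagging behaviors described above can be sketched as a small reconciliation function. It models only tags applied by identification. The function name, the trigger labels, and the tag names are hypothetical.

```python
def reconcile_tags(existing, latest, trigger, new_columns=frozenset()):
    """Return identification-applied tags per column after a run.

    existing: {column: set of tags previously applied by identification}
    latest:   {column: set of tags prescribed by the latest run}
    trigger:  "manual" (data owner re-run) or "new_column" (column detection)
    """
    if trigger == "manual":
        # Every earlier identification tag is removed, then the latest
        # run's tags are applied.
        result = {column: set() for column in existing}
        for column, tags in latest.items():
            result[column] = set(tags)
        return result
    if trigger == "new_column":
        # Existing columns keep their tags untouched; only new columns are tagged.
        result = {column: set(tags) for column, tags in existing.items()}
        for column in new_columns:
            result[column] = set(latest.get(column, set()))
        return result
    raise ValueError(f"unknown trigger: {trigger}")


existing = {"email": {"Example.Email"}, "notes": {"Example.Name"}}
latest = {"email": {"Example.Email"}, "phone": {"Example.Phone"}}

print(reconcile_tags(existing, latest, "manual"))
print(reconcile_tags(existing, latest, "new_column", new_columns={"phone"}))
```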
Identifiers: The number of identifiers being used affects the time to run identification.
The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by identification's performance but by the execution of background jobs in Immuta. when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.
Identification queries will time out after 15 minutes to avoid overconsumption of resources and reduce the cost of running identification. If your identification run was not completed because of this timeout, to change the default setting.
The following identification-related events are audited and can be found on the audit page in the UI:
: An identifier is created.
: An identifier is deleted.
: An identifier's criteria, description, name, or tag is updated.
: A tag is applied to a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.
: A tag is removed from a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.
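When reviewing exported audit records, the two markers mentioned above can be used to pick out tag events that came from identification. The sketch below assumes each event is a dictionary with a nested actor object and a list of relatedResources. That shape, the id field, and the sample records are assumptions for illustration, not a documented schema.

```python
SYSTEM_ACTOR = "Immuta System Account"


def identification_tag_events(events):
    """Keep events made by the system account that reference identifiers."""
    selected = []
    for event in events:
        actor_name = event.get("actor", {}).get("name")
        related = event.get("relatedResources", [])
        has_identifier = any(r.get("type") == "CLASSIFIERS" for r in related)
        if actor_name == SYSTEM_ACTOR and has_identifier:
            selected.append(event)
    return selected


# Hypothetical sample records.
events = [
    {"id": 1, "actor": {"name": "Immuta System Account"},
     "relatedResources": [{"type": "CLASSIFIERS"}]},
    {"id": 2, "actor": {"name": "jane.doe"},
     "relatedResources": [{"type": "CLASSIFIERS"}]},
    {"id": 3, "actor": {"name": "Immuta System Account"},
     "relatedResources": []},
]

print([event["id"] for event in identification_tag_events(events)])
```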
Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the built-in identifiers without editing the tags, then when the identifier is matched the column will not be tagged. As an alternative, tags can be disabled on a , or identification won't run if you do not add identifiers to domains.
For Snowflake, the size of the dictionary is limited by the .
Immuta will start up a Databricks cluster to complete the identification job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow to automatically terminate inactive clusters after a set period of time.
Supported (see )
The AWS access key used to register the data source must be able to do a minimum of the following :
If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the formatted like the following example: