Data Identification

Identification is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identifiers within domains, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.

Architecture

To evaluate your data, Immuta generates a SQL query using a domain's identifiers. The Immuta system account then executes that query in the remote technology to match any regex and dictionary identifiers. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. Column name identifiers are all matched within Immuta and don't require any query to the remote technology. These results are then used to apply the resulting tags to the appropriate columns.
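
As a rough illustration of the part of this flow that stays inside Immuta, the sketch below matches column name identifiers using nothing but the column names. The identifier names, patterns, and tags are hypothetical, and Python's `re` module stands in for RE2, so treat it as a sketch of the idea rather than Immuta's implementation.

```python
import re

# Hypothetical column name identifiers: name -> (regex, tags to apply).
# These are illustrative only and are not Immuta's built-in identifiers.
COLUMN_NAME_IDENTIFIERS = {
    "email_column": (r"e[-_]?mail", ["Example.Email"]),
    "phone_column": (r"phone|mobile", ["Example.Phone"]),
}


def match_column_names(column_names, identifiers):
    """Return {column: sorted tags} using column names only, never column values."""
    results = {}
    for column in column_names:
        tags = set()
        for pattern, identifier_tags in identifiers.values():
            # Case-insensitive search; several identifiers may match one column.
            # Whether matching is anchored or unanchored is an assumption here.
            if re.search(pattern, column, flags=re.IGNORECASE):
                tags.update(identifier_tags)
        if tags:
            results[column] = sorted(tags)
    return results


columns = ["CUSTOMER_EMAIL", "mobile_phone", "order_total"]
print(match_column_names(columns, COLUMN_NAME_IDENTIFIERS))
```

Because only column names are inspected, this step needs no query against the remote technology, which is the distinction the paragraph above draws.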

This evaluation and tagging process occurs whenever identification runs. Identification runs automatically on the following event:

A new data source is added to a domain with identifiers (either manually or automatically via tags)

The following actions will also trigger identification:

Column detection is enabled, and new columns are detected on data sources within a domain with identifiers. Here, identification will only run on new columns, and no existing tags will be removed or changed.
A user manually triggers it from the data source health check menu. Note that this uses the identifiers already applied to the data source.
A user manually triggers it from the domain page.
A user manually triggers it through the API.

Identifiers

Identification runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criteria and the tags that will be applied when the criteria's conditions have been met.

There are two types of identifiers in Immuta:

- Reference identifiers: This is a library of the identifiers that can be added to domains. When a reference identifier is added to a domain, a copy of it is made as the domain-specific identifier.
  1. Immuta comes with built-in identifiers to discover common categories of data. These cannot be modified or deleted.
  2. Data governors can create their own reference identifiers for use within your organization.
- Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when identification runs.
  1. Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.
  2. If a domain-specific identifier was copied from a reference identifier, there is no lineage: edits to the reference identifier are not reflected in the domain-specific copy.
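
The copy-without-lineage behavior can be pictured with a small sketch. The `Identifier` and `Domain` classes and the `add_from_reference` method below are hypothetical names invented for illustration; they are not part of any Immuta API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Identifier:
    name: str
    criteria: str
    tags: list = field(default_factory=list)


@dataclass
class Domain:
    name: str
    identifiers: dict = field(default_factory=dict)

    def add_from_reference(self, reference: Identifier) -> Identifier:
        # Deep copy: the domain-specific identifier keeps no link to the reference.
        domain_copy = copy.deepcopy(reference)
        self.identifiers[domain_copy.name] = domain_copy
        return domain_copy


reference = Identifier("account_id", r"ACC-\d{6}", ["Example.AccountId"])
finance = Domain("finance")
finance.add_from_reference(reference)

# Editing the reference afterwards does not change the domain's copy.
reference.criteria = r"ACCT-\d{8}"
print(finance.identifiers["account_id"].criteria)  # ACC-\d{6}
```

In practice this means a change to a reference identifier has to be repeated in each domain that already holds a copy of it.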

Criteria

Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.

- Competitive criteria analysis: This criteria is a process that reviews all the regex and dictionary criteria within the identifiers of the domain and searches for the identifier with the best fit. In this review, the competitive criteria analysis identifiers in the domain compete against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier for each domain will apply per column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about how identifiers compete, see the How competitive criteria analysis works guide.
  - Regex: This criteria contains a case-insensitive regular expression (regex) that searches for matches against column values. Immuta only supports regular expressions written in RE2 syntax.
  - Dictionary: This criteria contains a list of words and phrases to match against column values.
- Column name: This criteria includes a case-insensitive regular expression (regex) matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied. Immuta only supports regular expressions written in RE2 syntax.
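
The 90% rule for competitive criteria analysis can be sketched as follows. This is a simplified model under stated assumptions: the highest match ratio stands in for "best fit" (how Immuta ranks specificity is not described here), ties go to the first identifier listed, values are matched in full, and Python's `re` stands in for RE2. The identifier names and patterns are hypothetical.

```python
import re

MATCH_THRESHOLD = 0.90  # from the text: at least 90% of the sampled data must match


def regex_criteria(pattern):
    """Case-insensitive regex criteria; full-value matching is an assumption."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda value: compiled.fullmatch(value) is not None


def dictionary_criteria(words, case_sensitive=False):
    """Dictionary criteria: the value must equal one of the listed words or phrases."""
    if case_sensitive:
        allowed = set(words)
        return lambda value: value in allowed
    allowed = {word.lower() for word in words}
    return lambda value: value.lower() in allowed


def match_ratio(matches, sample):
    """Fraction of sampled values that satisfy the criteria."""
    if not sample:
        return 0.0
    return sum(1 for value in sample if matches(value)) / len(sample)


def best_identifier(identifiers, sample):
    """Return the winning identifier name, or None if none reaches the threshold."""
    scored = [(match_ratio(matches, sample), name) for name, matches in identifiers.items()]
    qualified = [pair for pair in scored if pair[0] >= MATCH_THRESHOLD]
    if not qualified:
        return None
    # Highest ratio wins; max() keeps the first of any tied entries.
    return max(qualified, key=lambda pair: pair[0])[1]


identifiers = {
    "us_zip": regex_criteria(r"\d{5}(-\d{4})?"),
    "country_code": dictionary_criteria(["US", "CA", "MX"]),
}
sample = ["30301", "94105-1234", "10001", "60614", "73301",
          "98101", "02139", "33101", "85001", "n/a"]
print(best_identifier(identifiers, sample))  # us_zip
```

Nine of the ten sampled values match the hypothetical `us_zip` pattern, so it just reaches the threshold. With one more non-matching value, no identifier would apply to the column.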

Create a new identifier in the Immuta UI or with the sdd/identifier endpoint.

Supported technologies

Identification has varied support for data sources from different technologies based on the identifier type.

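| Technology | Regex | Dictionary | Column name regex |
| --- | --- | --- | --- |
| Amazon S3 | Not supported | Not supported | Supported |
| Azure Synapse Analytics | Not supported | Not supported | Supported |
| Databricks | Supported | Supported | Supported |
| Google BigQuery | Not supported | Not supported | Supported |
| Redshift | Supported | Supported | Supported |
| Snowflake | Supported | Supported | Supported |
| Starburst (Trino) | Supported | Supported | Supported |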

Tag mutability

When identification is manually triggered by a data owner, all column tags previously applied by identification are removed and the tags prescribed by the latest run are applied. However, if identification is triggered because a new column is detected by schema monitoring or object sync, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
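
The difference between the two triggers can be modeled with a short sketch. The function name, the trigger labels, and the way tags are stored (a tag mapped to the source that applied it) are assumptions made for illustration, not Immuta's data model.

```python
import copy

IDENTIFICATION = "identification"  # marks tags that an identification run applied


def run_identification(existing, detected, trigger, new_columns=frozenset()):
    """Return the column tags after a run.

    existing: {column: {tag: source}}, where source is "identification" or "manual".
    detected: {column: set of tags} prescribed by the latest run.
    trigger:  "manual" or "new_column".
    """
    result = copy.deepcopy(existing)
    if trigger == "manual":
        # Drop every tag a previous identification run applied; keep all others.
        for column in list(result):
            result[column] = {
                tag: source
                for tag, source in result[column].items()
                if source != IDENTIFICATION
            }
        targets = list(detected)
    elif trigger == "new_column":
        # Only newly detected columns are tagged; existing columns are untouched.
        targets = [column for column in detected if column in new_columns]
    else:
        raise ValueError(f"unknown trigger: {trigger!r}")
    for column in targets:
        for tag in detected[column]:
            result.setdefault(column, {}).setdefault(tag, IDENTIFICATION)
    return result


existing = {
    "email": {"Example.Email": IDENTIFICATION, "Reviewed": "manual"},
    "zip": {"Example.Phone": IDENTIFICATION},
}
detected = {
    "email": {"Example.Email"},
    "zip": {"Example.PostalCode"},
    "ssn": {"Example.SSN"},
}
print(run_identification(existing, detected, "manual"))
print(run_identification(existing, detected, "new_column", new_columns={"ssn"}))
```

In the manual run, the stale `Example.Phone` tag on `zip` is replaced. In the new-column run it stays, because only `ssn` is evaluated.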

Performance

The amount of time it takes to run identification on a data source depends on several factors:

Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view. Running identification on complex views with large amounts of data is more likely to result in timeouts. Immuta recommends running identification on the underlying base tables.

The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by identification's performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.

Default 15-minute timeout

Identification queries time out after 15 minutes to avoid overconsumption of resources and to reduce the cost of running identification. If your identification run did not complete because of this timeout, submit a support ticket to change the default setting.

Testing

For users interested in testing identification, note that Immuta's built-in identifiers must match at least 90% of the sampled data to be assigned to a column. As a result, synthetic data may not be realistic enough to reach the match rate an identifier requires. To test identification, use a dev environment and create copies of your tables.

Audit

The following identification-related events are audited and can be found on the audit page in the UI:

SDDClassifierCreated: An identifier is created.
SDDClassifierDeleted: An identifier is deleted.
SDDClassifierUpdated: An identifier's criteria, description, name, or tag is updated.
TagApplied: A tag is applied to a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.
TagRemoved: A tag is removed from a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.
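
If audit events are exported for analysis, tag events that came from identification could be separated from user-driven ones along the lines of the sketch below. The event shape is an assumption: the dotted names above are read as nested fields (`actor.name`, `relatedResources[].type`), and the `action` field name is hypothetical. Check an actual exported event before relying on it.

```python
SYSTEM_ACCOUNT = "Immuta System Account"
TAG_ACTIONS = {"TagApplied", "TagRemoved"}


def is_identification_tag_event(event):
    """True for tag events that carry both identification markers described above."""
    # "action" is an assumed field name for the event type.
    if event.get("action") not in TAG_ACTIONS:
        return False
    actor_name = event.get("actor", {}).get("name")
    related_types = {resource.get("type") for resource in event.get("relatedResources", [])}
    return actor_name == SYSTEM_ACCOUNT and "CLASSIFIERS" in related_types


# Hypothetical sample events using the assumed shape.
events = [
    {"action": "TagApplied", "actor": {"name": "Immuta System Account"},
     "relatedResources": [{"type": "CLASSIFIERS"}]},
    {"action": "TagApplied", "actor": {"name": "jane.doe"}, "relatedResources": []},
    {"action": "SDDClassifierCreated", "actor": {"name": "jane.doe"}},
]
identification_events = [event for event in events if is_identification_tag_event(event)]
print(len(identification_events))  # 1
```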

Considerations

Deleting the built-in Discovered tags is not recommended: If you delete built-in Discovered tags and use the built-in identifiers without editing their tags, columns will not be tagged when those identifiers match. As an alternative, disable tags on a column-by-column basis from the data dictionary, or leave domains without identifiers so that identification does not run.

Supported data types and casing

| Type of identifier | Supported data types | Case sensitivity |
| --- | --- | --- |
| Data regex* | Text string columns | Case-sensitive |
| Column name regex | Any column | Not case-sensitive |
| Dictionary | Text string columns | Can be toggled in the identifier definition |

*Two built-in patterns also match based on additional data types:

DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
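
The two rules above reduce to a small decision function, sketched here. The type names and both regexes are simplified placeholders (real type names vary by technology, and the built-in DATE and TIME patterns are not shown in this document). Requiring every sampled value to match is also a simplification.

```python
import re

# Assumed, simplified type names; real names differ between technologies.
STRING_TYPES = {"string", "varchar", "text"}
DATE_LIKE_TYPES = {"date", "datetime", "timestamp"}
TIME_TYPES = {"time"}

# Placeholder patterns, not Immuta's built-in regexes.
DATE_REGEX = re.compile(r"\d{4}-\d{2}-\d{2}")
TIME_REGEX = re.compile(r"\d{2}:\d{2}(:\d{2})?")


def _all_match(regex, sample):
    """True when the sample is non-empty and every value matches in full."""
    return bool(sample) and all(regex.fullmatch(value) for value in sample)


def matches_date(data_type, sample=()):
    # Date-like column types match regardless of the values.
    if data_type in DATE_LIKE_TYPES:
        return True
    return data_type in STRING_TYPES and _all_match(DATE_REGEX, sample)


def matches_time(data_type, sample=()):
    # Only a pure time type matches by type; a timestamp does not.
    if data_type in TIME_TYPES:
        return True
    return data_type in STRING_TYPES and _all_match(TIME_REGEX, sample)


print(matches_date("timestamp"))                        # True
print(matches_time("string", ["12:30:00", "08:15"]))    # True
print(matches_time("string", ["2024-01-31 12:30:00"]))  # False
```

The last call shows the note on TIME: a string value that also carries a date does not match the time pattern.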

Limitations with query size

The size of the identification query for regex patterns and for dictionary patterns (which are compiled into a regex) is limited by the backing technology:

For Snowflake, the overall query text size limit is 1 MB.
For Starburst (Trino), the default query character limit is 1,000,000 characters. However, this limit can be increased if your identifiers require it.
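
To get a feel for when a large dictionary might approach these limits, the sketch below compiles a word list into a single alternation and measures it. How Immuta actually compiles dictionaries is not documented here, so the alternation format is an assumption. The Snowflake limit is interpreted conservatively as 1,000,000 bytes. Note that the limits apply to the whole query text, so the pattern size is only a lower bound.

```python
import re

# (unit, limit) per technology, taken from the limits listed above.
# Reading "1 MB" as 1,000,000 bytes is a conservative assumption.
QUERY_TEXT_LIMITS = {
    "snowflake": ("bytes", 1_000_000),
    "starburst": ("characters", 1_000_000),  # default; can be raised
}


def dictionary_to_regex(words):
    """Compile words and phrases into one alternation (assumed format)."""
    # Longest entries first so longer phrases are tried before their prefixes.
    escaped = sorted((re.escape(word) for word in words), key=len, reverse=True)
    return "(?:" + "|".join(escaped) + ")"


def pattern_size(pattern, unit):
    """Size of the pattern in UTF-8 bytes or in characters."""
    return len(pattern.encode("utf-8")) if unit == "bytes" else len(pattern)


def fits(pattern, technology):
    """True if the pattern alone is within the technology's query text limit."""
    unit, limit = QUERY_TEXT_LIMITS[technology]
    return pattern_size(pattern, unit) <= limit


cities = dictionary_to_regex(["New York", "São Paulo", "Zürich"])
print(fits(cities, "snowflake"), fits(cities, "starburst"))

# A hypothetical dictionary of 100,000 entries exceeds both limits.
large = dictionary_to_regex([f"word{i:07d}" for i in range(100_000)])
print(fits(large, "snowflake"), fits(large, "starburst"))
```

Measuring bytes for one limit and characters for the other matters for non-ASCII dictionaries, where a single character can occupy several bytes.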

Databricks limitations

- Immuta will start up a Databricks cluster to complete the identification job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
- The following Databricks Unity Catalog securable objects are supported with Immuta, but cannot be used with identification:
  - Volumes (external and managed)
  - Models
  - Functions
- Using a large number of files to store the data in a table with a large number of rows may result in the Databricks planner scanning the entire table, resulting in a slow-performing query.

Redshift limitations

The Redshift cluster must be up and running for identification to successfully run.

AWS access key limitations

To use AWS access key authentication on a Redshift data source with competitive criteria analysis identifiers, the following requirements and limitations apply:

- The AWS access key used to register the data source must be able to perform at least the following redshift-data API actions:
  - redshift-data:BatchExecuteStatement
  - redshift-data:CancelStatement
  - redshift-data:DescribeStatement
  - redshift-data:ExecuteStatement
  - redshift-data:GetStatementResult
  - redshift-data:ListStatements
- The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database used to onboard the data sources.
- If using a custom URL, the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options, formatted like the following example:
  ```
  region=us-east-2;clusterid=12345
  ```
- Redshift Serverless data sources are not supported for competitive criteria analysis identifiers with the AWS access key authentication method.
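
The permissions listed above could be expressed as an IAM policy document along the lines of this sketch, which builds the JSON in Python. The function name, the statement IDs, the account ID, and the user and database names are placeholders. Granting the redshift-data actions on `*` and scoping `redshift:GetClusterCredentials` to `dbuser` and `dbname` ARNs are assumptions about how you might scope the policy. Verify them against the AWS documentation for your setup.

```python
import json

# The six redshift-data actions listed above.
REDSHIFT_DATA_ACTIONS = [
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
]


def build_policy(region, account_id, cluster_id, db_user, db_name):
    """Build an example IAM policy document; resource scoping is an assumption."""
    prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RedshiftDataApi",
                "Effect": "Allow",
                "Action": REDSHIFT_DATA_ACTIONS,
                "Resource": "*",
            },
            {
                "Sid": "ClusterCredentials",
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    f"{prefix}:dbuser:{cluster_id}/{db_user}",
                    f"{prefix}:dbname:{cluster_id}/{db_name}",
                ],
            },
        ],
    }


# Placeholder values; the region and cluster ID mirror the example above.
policy = build_policy("us-east-2", "111122223333", "12345", "immuta_user", "analytics")
print(json.dumps(policy, indent=2))
```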

Legacy SDD

This is only relevant to users who enabled and ran Immuta SDD prior to October 2023.

Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags applied to specific data sources. These tags will be removed the next time identification runs.


