
Reference Guides

How Competitive Pattern Analysis Works

Of identification's three criteria options, regex and dictionary are competitive. This means that when assessing your data, if multiple identifiers could match, only one with competitive criteria will be chosen to tag the data. To better understand how Immuta executes this competition, read further.

Immuta employs a three-phased competitive criteria analysis approach for identification:

Sampling: No data is moved, and Immuta checks the identifiers against a sample of data from the table.
Qualifying: Identifiers whose criteria match less than 90% of the sampled values are filtered out.
Scoring: The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.

In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.

Sampling

In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Immuta instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.

The sample size is decided based on the number of identifiers and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the domain, not the data size. The sample size dependence on the number of identifiers is weak and will not exceed 13,000 rows.

Number of identifiers | Sample size

Sampling considerations

In practice, the number of sampled values for each column may be less than the requested number of rows because columns are not independently sampled but rather projected from a row-wise sample. This can reduce the sample when the target table has fewer rows than requested, when some of the column values are null, or because of technology-specific limitations.

Snowflake and Starburst (Trino): Immuta implements table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Immuta implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows, because 10% of such a table is fewer than 10,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
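The arithmetic behind the Databricks and Redshift note can be sketched as follows. This is a minimal illustration of a flat 10% sample with a 10,000-row cap, not Immuta's implementation; the function name and parameters are hypothetical, and a real random sample would only approximate these sizes.

```python
def best_effort_sample_size(table_rows: int, percent: int = 10, cap: int = 10_000) -> int:
    """Expected row count of a flat percentage sample that is capped.

    Hypothetical helper: it models only the sizes described in the text
    (10% of the table, at most 10,000 rows), not how rows are selected.
    """
    expected = table_rows * percent // 100  # integer math avoids float rounding
    return min(expected, cap)


if __name__ == "__main__":
    for rows in (5_000, 50_000, 100_000, 2_000_000):
        print(f"{rows:>9} rows in table -> about {best_effort_sample_size(rows)} sampled rows")
```

Under these assumptions, a table needs at least 100,000 rows before the sample reaches the 10,000-row cap; a 5,000-row table would yield only about 500 sampled rows.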

Qualifying

During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if its hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is a 90% match for most built-in identifiers; however, a few built-in identifiers have lower threshold requirements. The 90% threshold is also standard for all custom identifiers, to ensure the criteria matches the data within the column and to avoid false positives. Note that threshold calculations are relative to the number of non-null entries for each column.

If no identifiers qualify, then no identifier is assessed for scoring and the column is not tagged.
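The qualification rule can be sketched locally as shown here. This is an illustration only: in Immuta the hit rates are measured by the remote database, and the function names, the whole-value matching, and the sample column are assumptions made for this example.

```python
import re
from typing import Optional, Sequence


def hit_rate(pattern: str, values: Sequence[Optional[str]]) -> float:
    """Fraction of non-null values that the regex matches in full.

    Nulls are excluded from the denominator, mirroring the note that
    thresholds are relative to the number of non-null entries.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    compiled = re.compile(pattern)
    hits = sum(1 for v in non_null if compiled.fullmatch(v))
    return hits / len(non_null)


def qualifies(pattern: str, values: Sequence[Optional[str]], threshold: float = 0.90) -> bool:
    """True when the hit rate meets or exceeds the threshold (assumed inclusive)."""
    return hit_rate(pattern, values) >= threshold


# Hypothetical column sample: ten non-null values and two nulls.
sample = ["abd", "bad", "dab", "aaa", "bbb", "ddd", "abb", "bda", "dad", "xyz", None, None]

if __name__ == "__main__":
    for pattern in (r"[a-zA-Z0-9]{3}", r"[a-c]{3}", r"(a|b|d){3}"):
        print(pattern, hit_rate(pattern, sample), qualifies(pattern, sample))
```

With this hypothetical sample, the two nulls do not count against any identifier: the hit rates are computed over the ten non-null values only.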

Scoring

During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers since the ability to satisfy the more difficult-to-satisfy identifier itself serves as evidence that it is more likely. This phase ends by returning a single most likely identifier per the inference process.
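The source does not specify the inference itself, so the following is only a toy heuristic that captures the stated preference: reward a high hit rate and penalize criteria that accept many strings. The `Candidate` type, the log-based score, and the tie handling are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    hit_rate: float      # fraction of non-null sampled values matched (must be > 0)
    accepted_count: int  # number of distinct strings the criteria accepts


def toy_score(c: Candidate) -> float:
    """Higher is better: log(hit rate) minus log(size of the accepted set)."""
    return math.log(c.hit_rate) - math.log(c.accepted_count)


def pick_identifier(qualified: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the best-scoring candidate, or None when there is no clear winner."""
    if not qualified:
        return None
    ranked = sorted(qualified, key=toy_score, reverse=True)
    if len(ranked) > 1 and math.isclose(toy_score(ranked[0]), toy_score(ranked[1])):
        return None  # a tie is treated as inconclusive
    return ranked[0]


if __name__ == "__main__":
    broad = Candidate("BROAD", hit_rate=1.0, accepted_count=62 ** 3)
    narrow = Candidate("NARROW", hit_rate=0.9, accepted_count=27)
    print(pick_identifier([broad, narrow]))
```

In this sketch the narrow candidate wins despite its lower hit rate, because its accepted set is several thousand times smaller than the broad one.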

Example

Here is a set of regex identifiers and a sample of data:

Identifiers:

Identifier 1: [a-zA-Z0-9]{3} - This regex will match 3-character strings with the characters a-z, lowercase or uppercase, or digits 0-9.
Identifier 2: [a-c]{3} - This regex will match 3-character strings with the characters a-c, lowercase.
Identifier 3: (a|b|d){3} - This regex will match 3-character strings with the characters a, b, or d, lowercase.

Sample data | Matches Identifier 1 | Matches Identifier 2 | Matches Identifier 3

When qualifying the identifiers, Identifier 1 and Identifier 3 both match 90% or more of the data. Identifier 2 does not, and is disqualified.

Then the qualified identifiers are scored. Here, Identifier 1, despite matching 100% of the data, is unspecific: it accepts 62 possible characters in each of 3 positions, so it could match 238,328 distinct values. On the other hand, Identifier 3 matches only 90% of the data but is very specific, with just 27 possible values (3 characters in each of 3 positions).

Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
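The accepted-value counts in this example can be checked by brute force. The sketch below assumes Python's `re` syntax and whole-string matching, and enumerates every 3-character alphanumeric string, which covers everything these three patterns can accept.

```python
import re
import string
from itertools import product

ALPHABET = string.ascii_letters + string.digits  # 62 characters

PATTERNS = {
    "Identifier 1": r"[a-zA-Z0-9]{3}",
    "Identifier 2": r"[a-c]{3}",
    "Identifier 3": r"(a|b|d){3}",
}


def count_accepted(pattern: str) -> int:
    """Count the 3-character alphanumeric strings that the pattern matches in full."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in product(ALPHABET, repeat=3)
        if compiled.fullmatch("".join(chars))
    )


counts = {name: count_accepted(pattern) for name, pattern in PATTERNS.items()}

if __name__ == "__main__":
    print(counts)
    # {'Identifier 1': 238328, 'Identifier 2': 27, 'Identifier 3': 27}
```

Identifiers 2 and 3 are equally specific (27 values each), so in this example it is the qualification phase, not specificity, that separates them.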

Important notes

Dictionaries are part of the competitive process, while column name regex criteria are not.
Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.
Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.

Built-in Identifier Reference

Immuta comes with a pack of built-in identifiers that look for common data types. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own domains and edit the tags applied by them.

Identifiers must match at least 90% of the sampled data to be tagged, with three exceptions noted below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.

Identifier descriptions and default resulting tags

Identifier | Description | Resulting tags from the default identifier

Built-in Identifier Changelog

May 21, 2025

Identifiers in domains is released as generally available (GA), and these identifier updates are coupled with that release.

Improvements

The following identifiers have been improved to better match their intended data patterns. These updates have only been made to the built-in reference identifiers. If these are already in your domains, they will remain there as domain-specific identifiers with the previous pattern. If you want to add these improved identifiers to your domain, edit the name because identifier names must be unique within each domain.

To see more about the specific changes made, see the annotations on the .

AUSTRALIA_MEDICARE_NUMBER
AUSTRALIA_PASSPORT
BRAZIL_CPF_NUMBER
CANADA_PASSPORT

Deprecations

The following identifiers are deprecated and no longer included in the reference identifiers. If these are already in your domains, they will remain there as domain-specific identifiers.

AGE
DENMARK_CPR_NUMBER
FINLAND_NATIONAL_ID_NUMBER
FRANCE_CNI

New

The following identifiers are newly created to identify common data patterns. Copy these new reference identifiers to any of your domains.

BELGIUM_NATIONAL_REGISTRATION_NUMBER: Detects numeric strings consistent with Belgium's National Registration Number. Requires 11 digits in the form YY.MM.DD-NNN-XX, where YY.MM.DD corresponds to the birth date, NNN is a number, and XX is a two-digit checksum.
COUNTRY: Detects strings consistent with the names of all countries in the world. This identifier is case-insensitive.
FINANCIAL_INSTITUTIONS: Matches strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, including alternative names.
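As an illustration of the BELGIUM_NATIONAL_REGISTRATION_NUMBER format described above, the following sketch checks the YY.MM.DD-NNN-XX shape and the publicly documented modulo-97 check number. It is not Immuta's pattern, which this page does not show; the function name and the strict separator handling are assumptions.

```python
import re

# Shape only: YY.MM.DD-NNN-XX, using the separators given in the description above.
NRN_FORMAT = re.compile(r"(\d{2})\.(\d{2})\.(\d{2})-(\d{3})-(\d{2})")


def looks_like_belgian_nrn(value: str) -> bool:
    """Check the shape and the modulo-97 check number of a candidate value.

    The check number is 97 minus the first nine digits modulo 97. For birth
    years from 2000 on, a 2 is prefixed to the nine digits before the
    calculation, so both variants are accepted here.
    """
    match = NRN_FORMAT.fullmatch(value)
    if match is None:
        return False
    base = int("".join(match.groups()[:4]))  # the nine digits YYMMDDNNN
    check = int(match.group(5))
    check_1900s = 97 - (base % 97)
    check_2000s = 97 - ((2_000_000_000 + base) % 97)  # same as prefixing a 2
    return check in (check_1900s, check_2000s)


if __name__ == "__main__":
    print(looks_like_belgian_nrn("85.07.30-033-28"))
    print(looks_like_belgian_nrn("85.07.30-033-29"))
```

This sketch does not validate that YY.MM.DD is a real calendar date; it only demonstrates why a checksum makes the pattern far more specific than 11 arbitrary digits.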

First identifier pack released

62 built-in identifiers are released for use with identification.

Built-in Discovered Tags Reference

Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See a list of the built-in Discovered tags below and the Built-in identifier reference page for information about where these tags will be applied by the built-in identifiers.

Country tags

All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.

Child tag name | Description

Entity tags

All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.

Child tag name | Description

