Of the three criteria options in sensitive data discovery, regex and dictionary are competitive. This means that when several identifiers with competitive criteria could match a column, only one of them is chosen to tag the data. The rest of this guide explains how Immuta runs this competition.
Discover employs a three-phased competitive criteria analysis approach for sensitive data discovery (SDD):
Sampling: No data is moved, and Immuta checks the identifiers against a sample of data from the table.
Qualifying: Identifiers whose criteria match less than 90% of the sampled values are filtered out.
Scoring: The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.
In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.
In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.
The sample size is decided based on the number of identifiers and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the framework, not the data size. The sample size dependence on the number of identifiers is weak and will not exceed 13,000 rows.
| Number of identifiers | Requested sample size |
| --- | --- |
| 5 | 7369 rows |
| 50 | 9211 rows |
| 500 | 11053 rows |
| 5000 | 12895 rows |
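The four sample sizes listed above grow by about 1,842 rows for every tenfold increase in identifiers, which is logarithmic growth. As an observation only, a Hoeffding-style bound with a union over identifiers happens to reproduce those four values when the tolerance `eps` is 0.025 and the failure probability `delta` is 0.001. The function name and both parameters are assumptions fitted to the listed numbers. This is not a documented Immuta formula, and it does not model the 13,000-row ceiling or any dependence on data size.

```python
import math


def fitted_sample_rows(num_identifiers: int, eps: float = 0.025, delta: float = 0.001) -> int:
    """Curve fit to the listed sample sizes (assumed form, not Immuta's documented method).

    Computes ceil(ln(2 * N / delta) / (2 * eps^2)), a Hoeffding-style sample
    size for estimating N hit rates to within eps with probability 1 - delta.
    """
    return math.ceil(math.log(2 * num_identifiers / delta) / (2 * eps ** 2))


for n in (5, 50, 500, 5000):
    print(n, fitted_sample_rows(n))
# 5 7369
# 50 9211
# 500 11053
# 5000 12895
```

The logarithm is why the dependence on the number of identifiers is weak: multiplying the identifier count by 1,000 raises the requested sample by only about 75%.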
In practice, the number of sampled values for each column may be less than the requested number of rows because columns are not independently sampled but rather projected from a row-wise sample. This can impact the sample when the target table has fewer than the requested number of rows, when some of the column values are null, or because of technology-specific limitations.
Snowflake and Starburst (Trino): Discover implements table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy: a flat 10% row sample, capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
All platforms: Any null values in the sample do not count towards qualification or scoring. However, they do reduce the number of values available to match against the patterns, because the sample size is not adjusted to compensate for the ignored null values.
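The effect of the best-effort strategy and of null values on the usable sample can be sketched with simple arithmetic. The function names below are hypothetical, and the 10% figure is treated as an exact expected count even though a real random sample would vary around it.

```python
def best_effort_sample_rows(table_rows: int, percent: int = 10, cap: int = 10_000) -> int:
    """Expected rows from a flat percentage sample that is cut off at `cap` rows."""
    return min(table_rows * percent // 100, cap)


def usable_values(sampled_rows: int, null_values: int) -> int:
    """Values left for matching once nulls are ignored (the sample is not topped up)."""
    return max(sampled_rows - null_values, 0)


# A 50,000-row table yields about 5,000 sampled rows, well under the 10,000 cap.
print(best_effort_sample_rows(50_000))      # 5000
# At 100,000 rows the 10% sample reaches the cap exactly.
print(best_effort_sample_rows(100_000))     # 10000
# Larger tables are still cut off at the cap.
print(best_effort_sample_rows(2_000_000))   # 10000
# If 1,500 of 5,000 sampled values in a column are null, 3,500 remain.
print(usable_values(5_000, 1_500))          # 3500
```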
During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if its hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is a 90% match for most built-in identifiers; however, a few built-in identifiers have lower thresholds. The 90% threshold also applies to all custom identifiers, to ensure the criteria matches the data within the column and to avoid false positives. Note that threshold calculations are relative to the number of non-null entries for each column.
If no identifiers qualify, then no identifier is assessed for scoring and the column is not tagged.
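A minimal sketch of the qualification check follows, assuming Immuta has received only a hit count and a non-null count per column and identifier. The `ColumnHits` type and `qualifies` function are hypothetical names for illustration; the comparison uses integers so that exactly 90% qualifies without floating-point surprises.

```python
from dataclasses import dataclass


@dataclass
class ColumnHits:
    identifier: str        # name of the identifier that was measured
    hits: int              # sampled values the criteria matched
    non_null_values: int   # sampled values that were not null


def qualifies(stats: ColumnHits, threshold_percent: int = 90) -> bool:
    """True when hits make up at least threshold_percent of the non-null values."""
    if stats.non_null_values == 0:
        return False  # nothing to match against, so nothing can qualify
    return stats.hits * 100 >= threshold_percent * stats.non_null_values


# 1,000 sampled rows, 200 of them null, 740 matches.
stats = ColumnHits("example identifier", hits=740, non_null_values=800)
print(qualifies(stats))  # True: 740 / 800 is 92.5%
```

Measured against all 1,000 sampled rows, the same 740 hits would be only 74%, which is why the non-null denominator matters.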
During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers since the ability to satisfy the more difficult-to-satisfy identifier itself serves as evidence that it is more likely. This phase ends by returning a single most likely identifier per the inference process.
Here is a set of regex identifiers and a sample of data:
Identifiers:
Identifier 1: [a-zA-Z0-9]{3} - This regex matches 3-character strings made up of the letters a-z (lowercase or uppercase) or the digits 0-9.
Identifier 2: [a-c]{3} - This regex matches 3-character strings made up of the lowercase letters a-c.
Identifier 3: (a|b|d){3} - This regex matches 3-character strings made up of the lowercase letters a, b, or d.
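| Sample data | Identifier 1 | Identifier 2 | Identifier 3 |
| --- | --- | --- | --- |
| dad | Yes | No | Yes |
| baa | Yes | Yes | Yes |
| add | Yes | No | Yes |
| add | Yes | No | Yes |
| cab | Yes | Yes | No |
| bad | Yes | No | Yes |
| aba | Yes | Yes | Yes |
| baa | Yes | Yes | Yes |
| dad | Yes | No | Yes |
| baa | Yes | Yes | Yes |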
When qualifying the identifiers, Identifier 1 and Identifier 3 both match 90% or more of the data. Identifier 2 does not, and is disqualified.
Then the qualified identifiers are scored. Here, Identifier 1, despite matching 100% of the data, is unspecific and could match over 200,000 values (62^3 = 238,328). On the other hand, Identifier 3 matches exactly 90% of the data but is very specific, with only 27 possible values (3^3).
Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
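The example can be reproduced with a short script. This is a simplified sketch, not Immuta's inference process: it assumes whole-value matching, uses Python's `re` module in place of RE2 (the three patterns behave the same in both), and scores qualifiers only by how many strings each regex accepts, with those counts hard-coded. The names `IDENTIFIERS`, `hit_counts`, and `pick_identifier` are hypothetical.

```python
import re

# Pattern and number of distinct strings each pattern accepts.
IDENTIFIERS = {
    "Identifier 1": (r"[a-zA-Z0-9]{3}", 62 ** 3),  # 238,328 strings
    "Identifier 2": (r"[a-c]{3}", 3 ** 3),          # 27 strings
    "Identifier 3": (r"(a|b|d){3}", 3 ** 3),        # 27 strings
}

SAMPLE = ["dad", "baa", "add", "add", "cab", "bad", "aba", "baa", "dad", "baa"]


def hit_counts(values):
    """Count, per identifier, how many values the pattern matches in full."""
    return {
        name: sum(1 for value in values if re.fullmatch(pattern, value))
        for name, (pattern, _) in IDENTIFIERS.items()
    }


def pick_identifier(sample, threshold_percent=90):
    """Qualify at the threshold, then prefer the most restrictive qualifier."""
    values = [v for v in sample if v is not None]  # nulls are ignored
    if not values:
        return None
    hits = hit_counts(values)
    qualified = [n for n, h in hits.items() if h * 100 >= threshold_percent * len(values)]
    if not qualified:
        return None  # no identifier qualifies, so the column is not tagged
    smallest = min(IDENTIFIERS[n][1] for n in qualified)
    winners = [n for n in qualified if IDENTIFIERS[n][1] == smallest]
    return winners[0] if len(winners) == 1 else None  # a tie is inconclusive


print(hit_counts(SAMPLE))       # {'Identifier 1': 10, 'Identifier 2': 5, 'Identifier 3': 9}
print(pick_identifier(SAMPLE))  # Identifier 3
```

Identifier 2 is as restrictive as Identifier 3 but matches only 5 of the 10 values, so it never reaches scoring.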
Dictionaries are part of the competitive process, while column name regexes are not.
Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.
Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.
Private preview: This feature is only available to select accounts.
Sensitive data discovery (SDD) runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criteria and the tags that will be applied when the criteria's conditions have been met.
There are two types of identifiers in Immuta:
Reference identifiers: These identifiers are a library of the identifiers that can be added to domains. When added to a domain, reference identifiers are copied over and become domain-specific identifiers.
Immuta comes with built-in identifiers to discover common categories of data. These cannot be modified or deleted.
Data governors can create their own reference identifiers for use within your organization.
Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when SDD runs.
Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.
If a domain-specific identifier was copied over from a reference identifier, there is no lineage and any edits to the reference identifier will not be reflected in the domain-specific copy.
Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.
SDD only supports regular expressions (regex) written in RE2 syntax.
Competitive criteria analysis: This is a process that reviews all the regex and dictionary criteria within the identifiers of the domain and searches for the identifier with the best fit. In this review, the competitive identifiers in the domain compete against one another to find the most specific identifier that fits the data. The resulting tags of the best identifier are then applied to the column. For each column, only one competitive identifier in the domain will apply. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about the competitive nature, see the How competitive criteria analysis works guide.
Regex: This criteria contains a case-insensitive regular expression that searches for matches against column values.
Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied.
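Column name criteria differ from the competitive criteria in that every matching identifier applies its tags. The sketch below illustrates that behavior under stated assumptions: the identifiers and the `Example . Contact` tag are made up, Python's `re` stands in for RE2, and an unanchored search is assumed because the source does not say whether the whole column name must match.

```python
import re

# Hypothetical column name identifiers: pattern plus the tags each one applies.
COLUMN_NAME_IDENTIFIERS = {
    "email column": (r"e[-_]?mail", ["Discovered . Entity . Electronic Mail Address"]),
    "contact column": (r"contact|phone|e[-_]?mail", ["Example . Contact"]),
}


def tags_for_column(column_name: str) -> list[str]:
    """Collect tags from every identifier whose regex is found in the column name."""
    tags = []
    for pattern, identifier_tags in COLUMN_NAME_IDENTIFIERS.values():
        # Case-insensitive, matched against the column name rather than its values.
        if re.search(pattern, column_name, flags=re.IGNORECASE):
            tags.extend(identifier_tags)
    return tags


print(tags_for_column("Customer_EMAIL"))
# ['Discovered . Entity . Electronic Mail Address', 'Example . Contact']
print(tags_for_column("order_total"))
# []
```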
Create a new identifier in the Immuta UI or with the sdd/classifier endpoint.
If you used SDD prior to this feature release in January 2025, there are some differences:
There are now two types of identifiers:
Reference identifiers
Domain identifiers
See information about these in the Identifiers section.
There is a new permission to manage identifiers within domains: Manage Identifiers. The permission allows you to do the following:
Create an identifier within your domain
View the reference identifiers in Immuta
Add, edit, and delete identifiers within your domain
The following have been removed:
Identification frameworks: Previously, all identifiers had to be contained within a framework and that framework had to be assigned to a data source to run. Now, identifiers are added to domains with data sources.
Global framework: Previously, a global framework could be set to run SDD automatically on all new data sources. This behavior cannot be achieved with identifiers in domains.
The table below compares when SDD runs before this change (with identification frameworks) and after it (with identifiers in domains).
| Scenario | Before (identification frameworks) | After (identifiers in domains) |
| --- | --- | --- |
| SDD runs automatically on all new data sources | Yes, if a global framework is set | No |
| SDD runs automatically on new data sources found from schema monitoring | Yes, if a global framework is set | No |
| SDD runs automatically on new columns found from column detection in a data source where SDD has already run | Yes | Yes |
| SDD runs when a user manually triggers it from the data source health check menu | Yes | Yes |
| SDD runs when a user manually triggers it from the domain's page | No | Yes |
| SDD runs when a user manually triggers it from the identification framework page | Yes | No |
| SDD runs when a user manually triggers it through the API | Yes | Yes |
Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See a list of the built-in Discovered tags below and the Built-in identifier reference page for information about where these tags will be applied by the built-in identifiers.
All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.
Argentina
This tag is applied to data recognized as specific to Argentina (e.g., an Argentina National Identity Number).
Australia
This tag is applied to data recognized as specific to Australia (e.g., an Australian Medicare number or Australian passport number).
Belgium
This tag is applied to data recognized as specific to Belgium (e.g., a Belgium National ID card).
Brazil
This tag is applied to data recognized as specific to Brazil (e.g., a Brazil CPF number).
Canada
This tag is applied to data recognized as specific to Canada (e.g., a British Columbia PHN, OHIP string, Canadian passport number, or Quebec's HIN).
Chile
This tag is for data specific to Chile.
China
This tag is for data specific to China.
Colombia
This tag is for data specific to Colombia.
Denmark
This tag is applied to data recognized as specific to Denmark (e.g., a Denmark CPR or Person-number).
Finland
This tag is applied to data recognized as specific to Finland (e.g., a Finland National ID number).
France
This tag is applied to data recognized as specific to France (e.g., a French National ID card number, France National ID number, or French passport number).
Germany
This tag is applied to data recognized as specific to Germany (e.g., a German driver's license number or a Germany Identity Card number).
Hong Kong
This tag is for data specific to Hong Kong.
India
This tag is for data specific to India.
Indonesia
This tag is for data specific to Indonesia.
Japan
This tag is for data specific to Japan.
Korea
This tag is for data specific to Korea.
Mexico
This tag is for data specific to Mexico.
Netherlands
This tag is for data specific to the Netherlands.
Norway
This tag is for data specific to Norway.
Paraguay
This tag is for data specific to Paraguay.
Peru
This tag is for data specific to Peru.
Poland
This tag is for data specific to Poland.
Singapore
This tag is for data specific to Singapore.
Spain
This tag is applied to data recognized as specific to Spain (e.g., Spain Foreigner Identification number, Spain Tax Identification number, or Spanish passport number).
Sweden
This tag is applied to data recognized as specific to Sweden (e.g., a Sweden National ID number or Swedish passport number).
Taiwan
This tag is for data specific to Taiwan.
Thailand
This tag is applied to data recognized as specific to Thailand (e.g., a Thailand National ID number).
Turkey
This tag is for data specific to Turkey.
UK
This tag is applied to data recognized as specific to the United Kingdom (e.g., a United Kingdom driver's license number, United Kingdom National Insurance number, or United Kingdom Taxpayer Reference number).
Uruguay
This tag is for data specific to Uruguay.
US
This tag is applied to data recognized as specific to the U.S. (e.g., an FDA code, United States ATIN, ABA routing number, DEA number, United States EIN, United States NPI number, United States ITIN, United States passport number, United States Preparer Taxpayer ID number, United States SSN, United States territory or state, or United States toll-free phone number).
Venezuela
This tag is for data specific to Venezuela.
All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.
Aadhaar Individual
This tag is for Aadhaar Individual numbers.
Adoption Taxpayer ID Number
This tag is applied to data recognized as a United States Adoption Taxpayer Identification number.
Age
This tag is applied to data recognized as an age.
Bank Account
This tag is for bank account numbers.
Bank Routing MICR
This tag is applied to data recognized as an American Bankers Association routing number.
Bankers CUSIP ID
This tag is for CUSIP identification numbers for stocks and bonds.
British Columbia Health Network Number
This tag is applied to data recognized as British Columbia's Personal Health Number.
BSN Number
This tag is for Netherlands citizen service numbers.
CDC Number
This tag is for CDC numbers.
CDI Number
This tag is for CDI numbers.
CIC Number
This tag is for CIC numbers.
CNI
This tag is applied to data recognized as a French National ID card number.
CPF Number
This tag is applied to data recognized as Brazil's CPF number.
CPR Number
This tag is applied to data recognized as Denmark's Personal Identification number.
Credit Card Number
This tag is applied to data recognized as a credit card number.
CURP Number
This tag is for Mexican CURP numbers.
CRYPTO
This tag is applied to data recognized as a Bitcoin Invoice Address.
Date
This tag is applied to data recognized as a date.
Date of Birth
This tag is applied to data recognized as a date of birth.
DEA Number
This tag is applied to data recognized as the DEA number of a healthcare provider.
DNI Number
This tag is applied to data recognized as an Argentina National Identity number.
Domain Name
This tag is applied to data recognized as a domain.
Driver's License Number
This tag is applied to data recognized as driver's license numbers from Germany or the United Kingdom.
Electronic Mail Address
This tag is applied to data recognized as an email address.
Employer ID Number
This tag is applied to data recognized as an Employer Identification number from the United States.
Ethnic Group
This tag is applied to data recognized as an ethnic group.
FDA Code
This tag is applied to data recognized as the code of a drug or ingredient registered with the FDA.
Gender
This tag is applied to data recognized as a gender.
GST Individual
This tag is for Indian GST individual numbers.
Healthcare NPI
This tag is applied to data recognized as a United States National Provider Identifier number.
IBAN Code
This tag is applied to data recognized as an International Bank Account number.
ICD10 Code
This tag is applied to data recognized as an ICD10 code from the International Statistical Classification of Diseases and Related Health Problems.
ICD9 Code
This tag is for ICD9 codes from the International Statistical Classification of Diseases and Related Health Problems.
ID Number
This tag is for any ID number.
Identity Card Number
This tag is applied to data recognized as an identity card number from Germany.
IMEI
This tag is applied to data recognized as an International Mobile Equipment Identity number.
Individual Number
This tag is for any individual number.
Individual Taxpayer ID Number
This tag is applied to data recognized as a United States Individual Taxpayer Identification Number.
IP Address
This tag is applied to data recognized as an IP address.
Location
This tag is applied to data recognized as a country, state, address, or municipality.
MAC Address
This tag is applied to data recognized as a Media Access Control address.
MAC Address Local
This tag is applied to data recognized as a local Media Access Control address.
Medicare Number
This tag is applied to data recognized as a Medicare number from Australia.
National Health Service Number
This tag is for national health service numbers.
National ID Card Number
This tag is applied to data recognized as a national ID card number from Belgium.
National ID Number
This tag is applied to data recognized as a national ID number from Finland, Sweden, and Thailand.
National Insurance Number
This tag is applied to data recognized as a United Kingdom national insurance number.
National Registration ID Number
This tag is for national registration ID numbers.
NI Number
This tag is for Norway NI numbers.
NIE Number
This tag is applied to data recognized as a Spanish Foreigner Identification number.
NIF Number
This tag is applied to data recognized as a Spanish Tax Identification number.
NIK Number
This tag is applied to data recognized as an Indonesian personal identification number (NIK).
NIR
This tag is applied to data recognized as France's National ID number.
Ontario Health Insurance Number
This tag is applied to data recognized as part of an Ontario Health Insurance Plan string.
PAN Individual
This tag is for PAN Individual numbers.
Passport
This tag is applied to data recognized as a passport number from Australia, Canada, France, Spain, Sweden, and the United States.
Person Name
This tag is applied to data recognized as people's names.
PESEL Number
This tag is for Poland PESEL numbers.
Postal Code
This tag is applied to data recognized as a United States zip code.
Preparer Taxpayer ID Number
This tag is applied to data recognized as a Preparer Taxpayer ID number.
Quebec Health Insurance Number
This tag is applied to data recognized as a Quebec Health Insurance Number.
Resident ID Number
This tag is for China Resident ID numbers.
RRN
This tag is for Korea Resident Registration numbers.
Social Insurance Number
This tag is applied to data recognized as a social insurance number.
Social Security Number
This tag is applied to data recognized as a United States Social Security Number.
State
This tag is applied to data recognized as a state of the United States.
Swift Code
This tag is applied to data recognized as a SWIFT code.
Tax File Number
This tag is applied to data recognized as a tax file number.
Taxpayer ID Number
This tag is applied to data recognized as Taxpayer ID numbers from the United States.
Taxpayer Reference
This tag is applied to data recognized as United Kingdom Taxpayer Reference numbers.
Telephone Number
This tag is applied to data recognized as a phone number.
Tollfree Telephone Number
This tag is applied to data recognized as a United States toll-free phone number.
URL
This tag is applied to data recognized as a URL.
Vehicle Identifier or Serial Number
This tag is applied to data recognized as a VIN.
Deprecation notice
The following identifier tags have been deprecated. New SaaS tenants will not see these tags applied by SDD. Current tenants relying on these tags for policies should contact their Immuta representative for support before these tags are removed from the product.
None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . Identifier Direct.
Identifier Direct
This tag is applied to data recognized as a direct identifier that can be uniquely associated with an individual. Examples of direct identifiers include: name, username, email, official individual identification numbers such as passport or identity card numbers, or privately issued individual identification numbers such as a student ID.
Identifier Indirect
This tag is applied to data recognized as an indirect identifier that is not uniquely associated with an individual. However, this indirect identifier could become distinguishing when combined with other attributes. Examples of indirect identifiers include age and affinity.
Identifier Undetermined
This tag is applied to data which could be an identifier associated with an individual.
Deprecation notice
The following identifier tags have been deprecated. New SaaS tenants will not see these tags applied by SDD. Current tenants relying on these tags for policies should contact their Immuta representative for support before these tags are removed from the product.
None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . PCI.
PCI
This tag is applied to data recognized as payment card information.
PHI
This tag is applied to data recognized as personal health data.
PII
This tag is applied to data recognized as personally identifiable information.