Immuta allows you to automate discovering and tagging data across your data platform. Tagging is critical for two reasons:
It allows you to define data sensitivity, which in turn allows you to monitor where you have potential data security issues and gaps in your security posture.
It allows you to abstract your physical structure from your access policy logic. For example, you can build access policies like mask all columns tagged Person Name (where Person Name was auto-tagged by Immuta) rather than much less scalable policies that must be knowledgeable of your physical layers like mask column x in database y in data platform z.
Today’s sensitive data discovery tools give you a shallow overview of your data corpus across a long list of platforms. They give you pointers on where you have sensitive data without the granularity to drive your column- or row-level access controls. They help you understand what data you possess according to a regulatory framework, like HIPAA or PCI, but without the details needed to automate your audit or compliance reporting. Knowing that you need to drive east to west on a road map from New York to California is helpful but ultimately insufficient to get you from one specific location to another.
Existing tools promise a high degree of automation, yet their many false positives result in painful manual work that never stops. Although data gets scanned automatically, performance breaks down at scale, or you manually need to fine-tune the computing resources of the scanners. Last but not least, your security team objects to the agent-based processing that requires taking data out of your data platform, and the associated data residency concerns may give you pause.
At Immuta, we believe that data security should not be painful. We believe that you can innovate and move quickly, while at the same time protecting your data and adhering to your internal policies and external regulations. Technology and automation allow you to make the right trade-off decisions quickly. It all starts with highly accurate and actionable metadata. If you trust your metadata and if it’s actionable, you can leverage it to automatically grant access to data, mask sensitive information, and automate your audit reporting.
Immuta was built to tackle those challenges and address them through a unique architecture that was designed in collaboration with the largest financial institutions, healthcare companies, and government agencies in the world. The cloud and AI paradigm requires a fundamentally different approach. You must assume that your data is dynamic, unique, and collected in a multitude of different geographies and legal jurisdictions. Immuta is built for this new world and its specific demands.
Identifying and classifying data requires analyzing and looking at the data - there’s no way around it. Immuta does all the analysis and processing inside the remote technology. It takes advantage of those platforms’ inherent scalability to enable you to analyze large amounts of data quickly, efficiently, and without the need for separate resource optimization for containers or virtual machines.
By processing data directly inside the data platform, Immuta automatically adheres to data residency and locality requirements. If you run your data warehouse or lake globally - across North America, the European Union, and Asia - Immuta processes the data in the region where your data is stored. No data ever leaves the data platform, and it will never move across different cloud regions.
In-platform processing greatly reduces risk and improves your data security posture. Provisioning agents, whether they’re in a container, virtual machine, or Amazon Machine Image (AMI), create complexity and an unnecessary security risk. Not only can those agents become compromised, but their misconfiguration might lead to data leaks to other parts of your cloud infrastructure. An agentless approach can better leverage data platform optimizations to process data instead of transferring it out to re-optimize and analyze. This simplifies operations and increases efficiency for your infrastructure teams.
The advantages of in-platform processing are abundant, but implementing it across a multitude of platforms is challenging. Immuta helps bypass the obstacles by doing all the heavy lifting for you and building in specific implementations for each technology. Although all those implementations are ultimately different, Immuta abstracts the results to one standardized taxonomy, so you can have consistently accurate and granular metadata across all your data stores.
Immuta classifies data on a column level and instantaneously identifies schema changes. Only with that level of granularity and automation can you adhere to your audit requirements and understand what actions have been taken on your data. For example, if non-sensitive data is joined with sensitive data at query time, Immuta will monitor and record that for your review. Continuous object sync ensures schema changes never result in holes in your access controls and data security posture.
Trust in your metadata is critical for data security.
To unblock your data consumers, you need to automate your data access controls; this requires trusting that your classification and metadata are accurate and actionable. Immuta's identification provides you with highly accurate metadata and tags out-of-the-box and assists you in fine-tuning the classification mechanism to deal with false positives quickly. That enables you to build policies that dynamically grant or restrict access to protected data (like PHI or PII) depending on who is accessing it and what protections you want to apply.
Immuta works in three phases to identify, categorize, and classify your data:
Identification: In this first phase, data is identified by its kind – for example, a name or an age. This identification can be manually performed, externally provided by a catalog, or automatically determined through column-level analysis of patterns.
Categorization: In the second phase, data is categorized in the context of where it appears, subject to your active frameworks. For example, a record occurring in a clinical context containing both a name and individual health data is protected health information (PHI) under HIPAA.
This categorization of data helps to understand the context it is in, including information like whether or not a record pertains to an individual, the composition and kinds of identifiers present, the data subject, whether the data belongs to any controlled data categories under certain legislation, etc.
Classification: In the third and final phase, data is classified according to its sensitivity level (e.g., Customer Financial Data is Highly Sensitive) and the risk associated to the data subject. Audit dashboards support 3 sensitivity levels. However, organizations are free to customize the sensitivity names for the tags as needed.
Identification is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identifiers within domains, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
To evaluate your data, Immuta generates a SQL query using a domain's identifiers. The Immuta system account then executes that query in the remote technology to match any regex and dictionary identifiers. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. Column name identifiers are all matched within Immuta and don't require any query to the remote technology. These results are then used to apply the resulting tags to the appropriate columns.
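As a rough illustration of why only aggregate results need to leave the platform, the sketch below builds one aggregate query for a table and runs it against an in-memory SQLite database standing in for the remote technology. The table, column, and identifier names are hypothetical, the SQL is not the query Immuta actually generates, and Python's re module is used in place of RE2.

```python
import re
import sqlite3

# Hypothetical identifier: name -> regex (Python `re` stands in for RE2 here).
IDENTIFIERS = {"SSN_LIKE": r"\d{3}-\d{2}-\d{4}"}


def build_hit_count_query(table, columns, identifiers):
    """Build one aggregate query: non-null counts plus per-identifier hit counts."""
    selects = []
    for col in columns:
        selects.append(f'COUNT("{col}") AS "{col}__non_null"')
        for name in identifiers:
            selects.append(
                f'SUM(CASE WHEN "{col}" REGEXP :{name} THEN 1 ELSE 0 END) '
                f'AS "{col}__{name}"'
            )
    return f'SELECT {", ".join(selects)} FROM "{table}"'


def regexp(pattern, value):
    # SQLite calls this for `value REGEXP pattern`; NULLs never match.
    return int(value is not None and re.fullmatch(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE patients (ssn TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("123-45-6789", "follow-up"), ("987-65-4321", "intake"), (None, "n/a")],
)

query = build_hit_count_query("patients", ["ssn", "note"], IDENTIFIERS)
# The result holds column names and match counts only, never raw values.
hit_counts = dict(conn.execute(query, IDENTIFIERS).fetchone())
print(hit_counts)
```

The single result row carries counts keyed by column and identifier, which is enough to decide which tags to apply without returning any cell values.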
This evaluating and tagging process occurs when identification runs and happens automatically from the following event:
A new data source is added to a domain with identifiers (either manually or automatically via tags)
The following actions will also trigger identification:
Column detection is enabled, and new columns are detected on data sources within a domain with identifiers. Here, identification will only run on new columns, and no existing tags will be removed or changed.
A user manually triggers it from the data source health check menu. Note that this will use the identifiers that are already applied to the data source.
Identification runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criteria and the tags that will be applied when the criteria's conditions have been met.
There are two types of identifiers in Immuta:
Reference identifiers: This is a library of the identifiers that can be added to domains. When added to a domain, a copy of the reference identifier is made as the domain-specific identifier.
Immuta comes with built-in identifiers to discover common categories of data. These cannot be modified or deleted.
Data governors can create their own reference identifiers for use within your organization.
Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when identification runs.
Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.
If a domain-specific identifier was copied over from a reference identifier, there is no lineage and any edits to the reference identifier will not be reflected in the domain-specific copy.
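The copy-without-lineage behavior described above can be pictured with a minimal sketch. The Identifier and Domain classes, the identifier name, and the tag names are hypothetical and only model the relationship; they are not Immuta's data model.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Identifier:
    name: str
    criteria: str  # for example, a regex
    tags: list = field(default_factory=list)


@dataclass
class Domain:
    name: str
    identifiers: dict = field(default_factory=dict)

    def add_from_reference(self, reference):
        # Deep copy: the domain-specific identifier keeps no link to its source.
        self.identifiers[reference.name] = copy.deepcopy(reference)


reference = Identifier("ACCOUNT_ID", r"ACC-\d{6}", ["Discovered.Entity.Account ID"])
finance = Domain("Finance")
finance.add_from_reference(reference)

# A later edit to the reference identifier does not reach the domain copy.
reference.criteria = r"ACC-\d{8}"
reference.tags.append("Sensitive")
domain_copy = finance.identifiers["ACCOUNT_ID"]
print(domain_copy.criteria, domain_copy.tags)
```

Picking up an improved reference identifier therefore means adding it to the domain again rather than waiting for the existing copy to change.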
Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.
Competitive criteria analysis: This criteria is a process that will review all the regex and dictionary criteria within the identifiers of the domain and search for the identifier with the best fit. In this review, each competitive criteria analysis identifier in the domain competes against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier for each domain will apply per column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about the competitive nature, see the How competitive criteria analysis works guide.
Regex: This criteria contains a case-insensitive regular expression (regex) that searches for matches against column values. Immuta only supports regular expressions written in RE2 syntax.
Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression (regex) matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied. Immuta only supports regular expressions written in RE2 syntax.
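Unlike the competitive criteria, several column name identifiers can apply to the same column. The sketch below illustrates that behavior under stated assumptions: the identifiers and tags are hypothetical, Python's re module stands in for RE2, and a partial match (re.search) is assumed to count as a match.

```python
import re

# Hypothetical column name identifiers: (identifier name, regex, tags to apply).
COLUMN_NAME_IDENTIFIERS = [
    ("EMAIL_COLUMN", r"e-?mail", ["Discovered.Entity.Email"]),
    ("CONTACT_COLUMN", r"^contact_", ["Contact Info"]),
]


def tags_for_column_name(column_name, identifiers=COLUMN_NAME_IDENTIFIERS):
    """Collect tags from every identifier whose regex matches the column name."""
    tags = []
    for _name, pattern, identifier_tags in identifiers:
        # Case-insensitive, matched against the name only, never the values.
        if re.search(pattern, column_name, flags=re.IGNORECASE):
            tags.extend(identifier_tags)
    return tags


both = tags_for_column_name("CONTACT_EMAIL")
one = tags_for_column_name("Email_Address")
none = tags_for_column_name("order_total")
print(both, one, none)
```

Because only the column name is inspected, this kind of check needs no query against the data platform.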
Create a new identifier in the Immuta UI or with the sdd/identifier endpoint.
Identification has varied support for data sources from different technologies based on the identifier type.
Technology | Regex | Dictionary | Column name
Amazon Redshift view-based integration | Supported | Supported | Supported
Amazon Redshift viewless integration | Not supported | Not supported | Not supported
Amazon S3 | Not supported | Not supported | Supported
AWS Lake Formation | Not supported | Not supported | Supported
Azure Synapse Analytics | Not supported | Not supported | Supported
Databricks | Supported | Supported | Supported
Google BigQuery | Not supported | Not supported | Supported
MariaDB | Not supported | Not supported | Not supported
Oracle | Not supported | Not supported | Supported
PostgreSQL | Not supported | Not supported | Supported
Snowflake | Supported | Supported | Supported
SQL Server | Not supported | Not supported | Supported
Starburst (Trino) | Supported | Supported | Supported
Teradata | Not supported | Not supported | Supported
If you used SDD prior to this feature release in January 2025, there are some differences:
There are now two types of identifiers:
Reference identifiers
Domain identifiers
See information about these in the Identifiers section.
There is a new permission to manage identifiers within domains: Manage Identifiers. The permission allows you to do the following:
Create an identifier within your domain
View the reference identifiers in Immuta
Add, edit, and delete identifiers within your domain
Previously, tags applied by SDD had to be from the parent Discovered tag. However, with identifiers in domains, any tag can be used in an identifier.
The following have been removed:
Identification frameworks: Previously, all identifiers had to be contained within a framework and that framework had to be assigned to a data source to run. Now, identifiers are added to domains with data sources.
Global framework: Previously, a global framework could be set to run SDD automatically on all new data sources. This behavior can be similarly obtained if you are using connections by creating a domain with dynamic assignment based on the Immuta Connections tag.
When identification is manually triggered by a data owner, all column tags previously applied by identification are removed and the tags prescribed by the latest run are applied. However, if identification is triggered because a new column is detected by schema monitoring or object sync, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
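The tag behavior for the two triggers can be sketched as follows. This is a simplified model, not Immuta's implementation: it tracks only tags applied by identification (manually applied tags are out of scope), and the column and tag names are hypothetical.

```python
def reconcile_tags(current, latest, trigger, disabled=frozenset(), new_columns=()):
    """Return the identification-applied tags per column after a run.

    current / latest: dict of column name -> set of tags applied by identification.
    trigger: "manual" or "new_columns".
    disabled: set of (column, tag) pairs disabled in the data dictionary.
    """
    if trigger == "manual":
        # Previous identification tags are dropped; the latest run replaces them.
        result = {col: set(tags) for col, tags in latest.items()}
    elif trigger == "new_columns":
        # Existing columns are untouched; only newly detected columns get tags.
        result = {col: set(tags) for col, tags in current.items()}
        for col in new_columns:
            result[col] = set(latest.get(col, ()))
    else:
        raise ValueError(f"unknown trigger: {trigger}")
    # Disabled tags are never re-applied to that column.
    return {
        col: {t for t in tags if (col, t) not in disabled}
        for col, tags in result.items()
    }


current = {"name": {"PERSON_NAME"}, "zip": {"POSTAL_CODE"}}
latest = {"name": {"US_PERSON_FULL_NAME"}, "zip": {"POSTAL_CODE"}, "email": {"EMAIL"}}

manual = reconcile_tags(current, latest, "manual", disabled={("zip", "POSTAL_CODE")})
incremental = reconcile_tags(current, latest, "new_columns", new_columns=["email"])
print(manual)
print(incremental)
```

In the manual run the name column changes tags and the disabled tag stays off, while the new-column run leaves both existing columns exactly as they were.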
The amount of time it takes to run identification on a data source depends on several factors:
Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view. Running identification on complex views with large amounts of data is more likely to result in timeouts. Immuta recommends running identification on the underlying base tables.
The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by identification performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.
Default 15-minute timeout
Identification queries will time out after 15 minutes to avoid overconsumption of resources and reduce the cost of running identification. If your identification run was not completed because of this timeout, submit a support ticket to change the default setting.
For users interested in testing identification, note that the built-in identifiers by Immuta require a 90% match to data to be assigned to a column. This means that with synthetic data, there may be situations where the data is not real enough to fit the confidence needed to match identifiers. To test identification, use a dev environment and create copies of your tables.
The following identification-related events are audited and can be found on the audit page in the UI:
SDDClassifierCreated: An identifier is created.
SDDClassifierDeleted: An identifier is deleted.
SDDClassifierUpdated: An identifier's criteria, description, name, or tag is updated.
TagApplied: A tag is applied to a data source or column. Tag events from identification will have actor.name set to Immuta System Account and will include the related identifiers in the event under relatedResources with the type CLASSIFIERS.
TagRemoved: A tag is removed from a data source or column. Tag events from identification will have actor.name set to Immuta System Account and will include the related identifiers in the event under relatedResources with the type CLASSIFIERS.
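If audit events are exported for review, tag events that came from identification can be separated from manual tagging with a filter like the sketch below. The event dictionaries are hypothetical: only the actor.name and relatedResources.type fields and their values come from the description above, while the nesting and the "action" key are assumptions about the event shape.

```python
SYSTEM_ACTOR = "Immuta System Account"


def is_identification_tag_event(event):
    """True for TagApplied/TagRemoved events that came from identification."""
    # "action" is an assumed key for the event name.
    if event.get("action") not in {"TagApplied", "TagRemoved"}:
        return False
    actor_name = event.get("actor", {}).get("name")
    related_types = {r.get("type") for r in event.get("relatedResources", [])}
    return actor_name == SYSTEM_ACTOR and "CLASSIFIERS" in related_types


# Hypothetical sample events for illustration only.
events = [
    {"action": "TagApplied", "actor": {"name": "Immuta System Account"},
     "relatedResources": [{"type": "CLASSIFIERS", "name": "US_SOCIAL_SECURITY_NUMBER"}]},
    {"action": "TagApplied", "actor": {"name": "jane.doe@example.com"},
     "relatedResources": [{"type": "DATASOURCE", "name": "patients"}]},
    {"action": "SDDClassifierUpdated", "actor": {"name": "jane.doe@example.com"},
     "relatedResources": []},
]
identification_events = [e for e in events if is_identification_tag_event(e)]
print(len(identification_events))
```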
Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the built-in identifiers without editing the tags, then when the identifier is matched the column will not be tagged. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or identification won't run if you do not add identifiers to domains.
Criteria type | Supported column types | Case sensitivity
Data regex* | Text string columns | Case-sensitive
Column name regex | Any column | Not case-sensitive
Dictionary | Text string columns | Can be toggled in the identifier definition
*Two built-in patterns support and match based on additional data types:
DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
The size of the identification query for regex patterns and dictionary patterns (which are compiled into a regex) is limited by the backing technology:
For Snowflake, the overall query text size limit is 1 MB.
For Starburst (Trino), the default query character limit is 1,000,000 characters. However, this limit can be increased if your identifiers require it.
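To get a feel for when a large dictionary approaches these limits, the sketch below compiles a word list into a single alternation and compares its size with the limits quoted above. How Immuta compiles dictionaries is not described here, so the alternation format is an assumption, as is treating Snowflake's 1 MB as 1,000,000 bytes. The check covers the pattern alone, and the full query text is larger.

```python
import re

# Limits quoted above; treating 1 MB as 1,000,000 bytes is an assumption.
QUERY_LIMITS = {"snowflake_bytes": 1_000_000, "trino_chars": 1_000_000}


def dictionary_to_regex(words):
    """Compile dictionary entries into one alternation, escaping metacharacters."""
    return "^(?:" + "|".join(re.escape(w) for w in words) + ")$"


def fits_limits(pattern):
    """Check the pattern alone against each limit (the full query is larger)."""
    return {
        "snowflake": len(pattern.encode("utf-8")) <= QUERY_LIMITS["snowflake_bytes"],
        "trino": len(pattern) <= QUERY_LIMITS["trino_chars"],
    }


small = dictionary_to_regex(["Acme Bank", "First National (US)"])
large = dictionary_to_regex([f"institution{i:07d}" for i in range(60_000)])
print(fits_limits(small), fits_limits(large))
```

Escaping each entry keeps characters such as parentheses literal, so "First National (US)" matches itself rather than being read as a group.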
Immuta will start up a Databricks cluster to complete the identification job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
The following Databricks Unity Catalog securable objects are supported with Immuta, but cannot be used with identification:
Volumes (external and managed)
Models
Functions
Using a large number of files to store the data in a table with a large number of rows may result in the Databricks planner scanning the entire table, resulting in a slow performing query.
The Redshift cluster must be up and running for identification to run successfully.
To use AWS access key authentication on a Redshift data source with competitive criteria analysis identifiers, the following requirements must be met:
The AWS access key used to register the data source must be able to perform at least the following redshift-data API actions:
redshift-data:BatchExecuteStatement
redshift-data:CancelStatement
redshift-data:DescribeStatement
redshift-data:ExecuteStatement
redshift-data:GetStatementResult
redshift-data:ListStatements
The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that they onboard their data sources with.
If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options, formatted like the following:
region=us-east-2;clusterid=12345
Redshift Serverless data sources are not supported for competitive criteria analysis identifiers with the AWS access key authentication method.
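The access key requirements and connection string options above can be collected in a small sketch that assembles an IAM policy document and the options string. The account ID, cluster ID, user, and database values are placeholders, and using "*" as the resource for the redshift-data actions is a simplification. Review the result against your own AWS security requirements before using anything like it.

```python
import json

REDSHIFT_DATA_ACTIONS = [
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
]


def build_policy(region, account_id, cluster_id, db_user, db_name):
    """Assemble an IAM policy document covering the actions listed above."""
    prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Resource "*" is a simplification; scope it down where possible.
            {"Effect": "Allow", "Action": REDSHIFT_DATA_ACTIONS, "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    f"{prefix}:dbuser:{cluster_id}/{db_user}",
                    f"{prefix}:dbname:{cluster_id}/{db_name}",
                ],
            },
        ],
    }


def connection_options(region, cluster_id):
    """Format the additional connection string options for a custom URL."""
    return f"region={region};clusterid={cluster_id}"


# Placeholder values for illustration only.
policy = build_policy("us-east-2", "123456789012", "12345", "immuta_user", "analytics")
options = connection_options("us-east-2", "12345")
print(json.dumps(policy, indent=2))
print(options)
```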
This is only relevant to users who enabled and ran Immuta SDD prior to October 2023.
Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags applied to specific data sources. These tags will be removed the next time identification runs.
This how-to guide is for enabling identification for the first time. For additional information on identification, see the Data identification page.
Requirement: Immuta permission GOVERNANCE
Prerequisites
Identifiers can be added to and identification can run in any of your current domains. However, if you are not already using domains, set up a domain specifically to run identification:
Navigate to the Identifiers tab of your domain.
Click Get Started.
Add reference identifiers to your domain that are relevant to your data by clicking the checkboxes. The identifier becomes a point-in-time copy of the reference identifier. It has the same name, criteria, and tags. Note you cannot add multiple identifiers with the same name to the same domain, so if you want to add an improved reference identifier, edit the name.
Click Add Identifiers.
This action can be done within a domain from the Identifiers tab to create a domain-specific identifier, or it can be done from the Identifiers page to create a reference identifier.
Click Create New.
Enter a name and description for your identifier.
Click Next.
Enter criteria: Select the Type of criteria.
For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For a dictionary, enter the values in a comma-separated list to match against column values. Toggle the Case insensitive switch to on if you want the dictionary to match values regardless of case.
Click Next.
Select the tags to apply: Use the text box to search for a tag or type a tag name to create a new tag under the "Discovered . Entity" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.
Note that all user-created identifiers must be a 90% match or greater for the contents of the column to be tagged.
Once you have created identifiers relevant to your data, it is time to run them on your data. You may choose to run identification on a select number of data sources where you understand the data to assess and adjust the tags to reflect what you expect to see.
Navigate to the Domains page and select your domain.
Open the More Actions icon.
Select Run Identification from the dropdown.
After identification runs, you will receive a notification that the job is complete. Then, you can view the results from the data source dictionary.
Navigate to the data source overview page of the data source you have in the domain.
Click the Data Dictionary tab.
Assess whether the tags are applied as expected.
If you are happy with the tags, follow the Assign data sources to domains guide to add the rest of your data sources to the domain and then run identification on the domain again.
If you want additional tags, follow the Create an identifier guide again to create additional identifiers that matter to your data.
Requirement: Immuta permission GOVERNANCE or domain-specific Manage Identifiers
This action can be done within a domain from the Identifiers tab to create a domain-specific identifier, or it can be done from the Identifiers page to create a reference identifier.
Click Create New.
Enter a name and description for the new identifier.
Click Next.
Enter criteria: Select the Type of criteria.
For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
For a dictionary, enter the values in a comma-separated list to match against column values. Toggle the Case insensitive switch to on if you want the dictionary to match values regardless of case.
For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
Click Next.
Select the tags to apply: Use the text box to search for a tag or type a tag name to create a new tag under the "Discovered . Entity" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.
Note that all user-created identifiers must be a 90% match or greater for the contents of the column to be tagged.
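Because the regex criteria in the steps above must use RE2 syntax, patterns written for other engines may be rejected. RE2 does not support lookahead, lookbehind, or backreferences, and the sketch below flags those constructs before you paste a pattern into the form. It is a heuristic rather than an RE2 parser: it is not exhaustive, and it does not treat character classes specially, so a literal "(?=" inside square brackets would be flagged by mistake.

```python
def re2_incompatibilities(pattern):
    """List constructs in a pattern that RE2 is known not to support."""
    issues = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            # An escaped digit 1-9 is a backreference; other escapes are skipped.
            if pattern[i + 1] in "123456789":
                issues.append("backreference")
            i += 2
            continue
        if pattern.startswith(("(?=", "(?!"), i):
            issues.append("lookahead")
        elif pattern.startswith(("(?<=", "(?<!"), i):
            issues.append("lookbehind")
        i += 1
    return issues


plain = re2_incompatibilities(r"\d{3}-\d{2}-\d{4}")
lookaround = re2_incompatibilities(r"(?<!\d)\d{4}(?!\d)")
backref = re2_incompatibilities(r"(a)\1")
print(plain, lookaround, backref)
```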
Editing the details of an identifier from the identifiers page will only affect that identifier; no copies of that identifier will be impacted.
To edit a reference identifier:
Click Metadata in the navigation menu and select Identifiers.
Click the more actions icon of the identifier you want to edit.
Select Edit.
Edit the field you want to change.
Click Save.
Built-in identifiers cannot be edited.
To edit a domain identifier:
Click Domains and navigate to the domain.
Select the Identifiers tab.
Click the more actions icon of the identifier you want to edit.
Click Edit.
Edit the field you want to change.
Click Save.
Deleting a domain identifier from the identifiers page will remove it from the domain it is in.
To delete a reference identifier:
Click Metadata in the navigation menu and select Identifiers.
Click the more actions icon of the identifier you want to delete.
Select Delete and click Save.
Built-in identifiers cannot be deleted.
To delete a domain identifier:
Click Domains and navigate to the domain.
Select the Identifiers tab.
Click the more actions icon of the identifier you want to delete.
Select Delete and click Save.
Requirement: Immuta permission GOVERNANCE or domain-specific Manage Identifiers
Identification can be configured to run automatically or manually on a domain-by-domain basis. If you want to re-run identification after a data source or new identifiers have been added to a domain, you can run it manually from the UI.
Navigate to the Domains page and select your domain.
Click the Settings tab.
Set the autoscanning toggle:
On: Identification will automatically run when new data sources are added to the domain or when object sync detects new columns on data sources already in the domain.
Off: Identification will only run when manually started by a user.
Navigate to the Domains page and select your domain.
Select the more actions icon.
Select Run Identification and then select it again in the modal.
Navigate to the data source overview page.
Click the health status.
Select Re-run next to Sensitive Data Discovery (SDD).
If a governor, data owner, or data source expert disables a tag from the data dictionary applied by identification, the column will not be re-tagged next time identification runs. When a tag is disabled, it will not completely disappear, and it can be manually enabled through the tag side sheet.
To disable a tag:
Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click the tag you want to remove.
Click Disable in the side sheet and then click Confirm.
Identifiers in domains is released as GA and these identifier updates are coupled with that release.
The following identifiers have been improved to better match their intended data patterns. These updates have only been made to the built-in reference identifiers. If these are already in your domains, they will remain there as domain-specific identifiers with the previous pattern. If you want to add these improved identifiers to your domain, edit the name because identifier names must be unique within each domain.
To see more about the specific changes made, see the annotations on the .
AUSTRALIA_MEDICARE_NUMBER
AUSTRALIA_PASSPORT
BRAZIL_CPF_NUMBER
CANADA_PASSPORT
CREDIT_CARD_NUMBER
DATE
DOMAIN_NAME
FDA_CODE
FRANCE_NIR
GENDER
ICD10_CODE
IMEI_HARDWARE_ID
MAC_ADDRESS
PERSON_NAME
POSTAL_CODE
SPAIN_NIF_NUMBER
TIME
UK_NATIONAL_INSURANCE_NUMBER
URL
US_HEALTHCARE_NPI
US_SOCIAL_SECURITY_NUMBER
US_STATE
The following identifiers are deprecated and no longer included in the reference identifiers. If these are already in your domains, they will remain there as domain-specific identifiers.
AGE
DENMARK_CPR_NUMBER
FINLAND_NATIONAL_ID_NUMBER
FRANCE_CNI
GERMANY_IDENTITY_CARD_NUMBER
SPAIN_NIE_NUMBER
SWEDEN_NATIONAL_ID_NUMBER
SWEDEN_PASSPORT
THAILAND_NATIONAL_ID_NUMBER
UK_TAXPAYER_REFERENCE
US_BANK_ROUTING_MICR
US_PASSPORT
US_TOLLFREE_PHONE_NUMBER
The following identifiers are newly created to identify common data patterns. Copy these new reference identifiers to any of your domains.
BELGIUM_NATIONAL_REGISTRATION_NUMBER: Detects numeric strings consistent with Belgium's National Registration Number. Requires 11 characters in the form YY.MM.DD-NNN-XX, where YY.MM.DD corresponds to birth date, NNN is a number, and XX is a checksum digit.
COUNTRY: Detects strings consistent with the names of all countries in the world. This identifier is case-insensitive.
FINANCIAL_INSTITUTIONS: Matches strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, including alternative names.
GREAT_BRITAIN_DRIVERS_LICENSE: Previously named UK_DRIVERS_LICENSE_NUMBER. Renamed because it does not detect license numbers from Northern Ireland.
ICD_10_PCS: Detects strings consistent with procedure codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from 2020.
NAICS_CODE: Detects strings consistent with North American Industry Classification System (NAICS) codes. A two-digit number represents a basic sector, and each additional digit represents a more specific subsector, up to a maximum of six digits.
SEC_STOCK_TICKER: Matches strings consistent with the stock tickers recognized by the U.S. Securities and Exchange Commission (SEC).
US_PERSON_FULL_NAME: Detects strings consistent with a person's {first name} space {last name}. Uses the same names from the PERSON_NAME identifier. This identifier must match at least 20% of the data sampled and is case-insensitive.
US_STREET_ADDRESS: Previously named STREET_ADDRESS.
62 built-in identifiers are released for use with identification.
Of identification's three criteria options, regex and dictionary are competitive. This means that when assessing your data, if multiple identifiers could match, only one with competitive criteria will be chosen to tag the data. To better understand how Immuta executes this competition, read further.
Immuta employs a three-phased competitive criteria analysis approach for identification:
Sampling: No data is moved, and Immuta checks the identifiers against a sample of data from the table.
Qualifying: Identifiers whose criteria match less than 90% of the sampled data are filtered out.
Scoring: The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.
In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.
In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Immuta instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.
The sample size is decided based on the number of identifiers and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the domain, not the data size. The sample size dependence on the number of identifiers is weak and will not exceed 13,000 rows.
Number of regex and dictionary identifiers | Requested sample size
5 | 7369 rows
50 | 9211 rows
500 | 11053 rows
5000 | 12895 rows
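The weak dependence on the number of identifiers is visible in the figures above: each tenfold increase in identifiers adds about 1,842 rows. As an observation about these published numbers only, and not a documented Immuta formula, all four values are reproduced by rounding 800 × ln(2000 × k) up to the next whole row, where k is the number of identifiers. That expression has the shape of a Hoeffding-style sample bound, although whether Immuta derives its sample size that way is not stated. The sketch below checks the fit and applies the stated 13,000-row ceiling.

```python
import math

# Values published in the table above: identifier count -> requested sample rows.
DOCUMENTED = {5: 7369, 50: 9211, 500: 11053, 5000: 12895}
MAX_ROWS = 13_000  # the stated ceiling on requested rows


def requested_sample_rows(identifier_count):
    """Logarithmic fit to the documented values (an observation, not a spec)."""
    fitted = math.ceil(800 * math.log(2000 * identifier_count))
    return min(fitted, MAX_ROWS)


for count, documented_rows in DOCUMENTED.items():
    print(count, requested_sample_rows(count), documented_rows)
```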
In practice, the number of sampled values for each column may be less than the requested number of rows because columns are not independently sampled but rather projected from a row-wise sample. This can impact the sample when the target table has fewer rows than requested, when some of the column values are null, or because of technology-specific limitations.
Snowflake and Starburst (Trino): Immuta implements table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Immuta implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased toward earlier records.
All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
All platforms: Null values in the sample do not count toward qualification or scoring. However, they reduce the number of values available to match against the patterns, because the sample size is not adjusted to compensate for ignored nulls.
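To make the Databricks and Redshift behavior concrete, the following sketch estimates how many rows a flat 10% sample capped at 10,000 rows would yield for a given table size. The function name is hypothetical, and the calculation gives an expected size only; it does not model how the platform actually selects rows.

```python
def expected_best_effort_rows(table_rows: int, cap: int = 10_000) -> int:
    """Expected rows from a flat 10% sample, capped at `cap` rows."""
    if table_rows < 0:
        raise ValueError("table_rows must be non-negative")
    # Integer division avoids floating-point rounding on the 10% step.
    return min(table_rows // 10, cap)


# A 50,000-row table yields about 5,000 sampled rows, so a request for
# 7,369 rows would be under-sampled; the cap is reached at 100,000 rows.
for rows in (50_000, 100_000, 1_000_000):
    print(rows, expected_best_effort_rows(rows))
```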
During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if its hit rate on the remote sample exceeds the predefined threshold. This threshold is a 90% match for most built-in identifiers; however, a few built-in identifiers have lower thresholds. The 90% threshold also applies to all custom identifiers to ensure the criteria match the data within the column and to avoid false positives. Note that threshold calculations are relative to the number of non-null entries for each column.
If no identifiers qualify, then no identifier is assessed for scoring and the column is not tagged.
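The qualification rule can be sketched as a small calculation over per-column counts. This is a minimal illustration, not Immuta's implementation: the function names are hypothetical, and treating a match fraction equal to the threshold as qualifying is an assumption.

```python
from typing import Optional


def match_fraction(hits: int, sampled: int, nulls: int) -> Optional[float]:
    """Fraction of non-null sampled values that matched the criteria."""
    non_null = sampled - nulls
    if non_null <= 0:
        return None  # nothing to evaluate, so the identifier cannot qualify
    return hits / non_null


def qualifies(hits: int, sampled: int, nulls: int, threshold: float = 0.90) -> bool:
    """True when the match fraction meets the threshold (assumed inclusive)."""
    fraction = match_fraction(hits, sampled, nulls)
    return fraction is not None and fraction >= threshold


# 850 hits in 1,000 sampled rows fails at 85%, but if 100 of those rows
# are null the fraction becomes 850 / 900 and the identifier qualifies.
print(qualifies(850, 1000, 0), qualifies(850, 1000, 100))
```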
During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers since the ability to satisfy the more difficult-to-satisfy identifier itself serves as evidence that it is more likely. This phase ends by returning a single most likely identifier per the inference process.
Here is a set of regex identifiers and a sample of data:
Identifiers:
Identifier 1: [a-zA-Z0-9]{3} - This regex matches three-character strings made up of the letters a-z (lowercase or uppercase) or the digits 0-9.
Identifier 2: [a-c]{3} - This regex matches three-character strings made up of the lowercase letters a-c.
Identifier 3: (a|b|d){3} - This regex matches three-character strings made up of the lowercase letters a, b, or d.
Sample data:
Value | Identifier 1 | Identifier 2 | Identifier 3
dad | Yes | ❌ | Yes
baa | Yes | Yes | Yes
add | Yes | ❌ | Yes
add | Yes | ❌ | Yes
cab | Yes | Yes | ❌
bad | Yes | ❌ | Yes
aba | Yes | Yes | Yes
baa | Yes | Yes | Yes
dad | Yes | ❌ | Yes
baa | Yes | Yes | Yes
When qualifying the identifiers, Identifier 1 and Identifier 3 both match 90% or more of the data. Identifier 2 does not, and is disqualified.
Then the qualified identifiers are scored. Here, Identifier 1, despite matching 100% of the data, is unspecific and could match over 200,000 values. On the other hand, Identifier 3 matches just at 90% but is very specific with only 27 available values.
Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
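The example can be reproduced with Python's re module. The sketch below measures each identifier's hit rate on the sample, applies the 90% threshold, and then picks the qualified identifier that accepts the fewest strings. That last step is a deliberate simplification standing in for Immuta's scoring inference, which this guide does not specify in detail; the variable and function names are hypothetical.

```python
import re

# Each identifier maps to (pattern, number of distinct strings it accepts).
IDENTIFIERS = {
    "Identifier 1": (r"[a-zA-Z0-9]{3}", 62 ** 3),  # 238,328 possible values
    "Identifier 2": (r"[a-c]{3}", 3 ** 3),         # 27 possible values
    "Identifier 3": (r"(a|b|d){3}", 3 ** 3),       # 27 possible values
}
SAMPLE = ["dad", "baa", "add", "add", "cab", "bad", "aba", "baa", "dad", "baa"]
THRESHOLD = 0.90


def hit_rate(pattern: str, values: list) -> float:
    """Fraction of values that the pattern matches in full."""
    regex = re.compile(pattern)
    hits = sum(1 for value in values if regex.fullmatch(value))
    return hits / len(values)


rates = {name: hit_rate(pattern, SAMPLE) for name, (pattern, _) in IDENTIFIERS.items()}
qualified = [name for name, rate in rates.items() if rate >= THRESHOLD]
# Simplified scoring (assumption): prefer the most restrictive qualified identifier.
winner = min(qualified, key=lambda name: IDENTIFIERS[name][1])

print(rates)
print(winner)
```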
Dictionaries are part of the competitive process, while column-name regex criteria are not.
Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.
Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.
Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See a list of the built-in Discovered tags below and the Built-in identifier reference page for information about where these tags will be applied by the built-in identifiers.
All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.
Argentina
This tag is applied to data recognized as specific to Argentina (e.g., an Argentina National Identity Number).
Australia
This tag is applied to data recognized as specific to Australia (e.g., an Australian Medicare number or Australian passport number).
Belgium
This tag is applied to data recognized as specific to Belgium (e.g., a Belgium National ID card).
Brazil
This tag is applied to data recognized as specific to Brazil (e.g., a Brazil CPF number).
Canada
This tag is applied to data recognized as specific to Canada (e.g., a British Columbia PHN, OHIP string, Canadian passport number, or Quebec's HIN).
Chile
This tag is for data specific to Chile.
China
This tag is for data specific to China.
Colombia
This tag is for data specific to Colombia.
Denmark
This tag is applied to data recognized as specific to Denmark (e.g., a Denmark CPR or Person-number).
Finland
This tag is applied to data recognized as specific to Finland (e.g., a Finland National ID number).
France
This tag is applied to data recognized as specific to France (e.g., a French National ID card number, France National ID number, or French passport number).
Germany
This tag is applied to data recognized as specific to Germany (e.g., a German driver's license number or a Germany Identity Card number).
Hong Kong
This tag is for data specific to Hong Kong.
India
This tag is for data specific to India.
Indonesia
This tag is for data specific to Indonesia.
Japan
This tag is for data specific to Japan.
Korea
This tag is for data specific to Korea.
Mexico
This tag is for data specific to Mexico.
Netherlands
This tag is for data specific to the Netherlands.
Norway
This tag is for data specific to Norway.
Paraguay
This tag is for data specific to Paraguay.
Peru
This tag is for data specific to Peru.
Poland
This tag is for data specific to Poland.
Singapore
This tag is for data specific to Singapore.
Spain
This tag is applied to data recognized as specific to Spain (e.g., Spain Foreigner Identification number, Spain Tax Identification number, or Spanish passport number).
Sweden
This tag is applied to data recognized as specific to Sweden (e.g., a Sweden National ID number or Swedish passport number).
Taiwan
This tag is for data specific to Taiwan.
Thailand
This tag is applied to data recognized as specific to Thailand (e.g., a Thailand National ID number).
Turkey
This tag is for data specific to Turkey.
UK
This tag is applied to data recognized as specific to the United Kingdom (e.g., a United Kingdom driver's license number, United Kingdom National Insurance number, or United Kingdom Taxpayer Reference number).
Uruguay
This tag is for data specific to Uruguay.
US
This tag is applied to data recognized as specific to the U.S. (e.g., an FDA code, United States ATIN, ABA routing number, DEA number, United States EIN, United States NPI number, United States ITIN, United States passport number, United States Preparer Taxpayer ID number, United States SSN, United States territory or state, or United States toll-free phone number).
Venezuela
This tag is for data specific to Venezuela.
All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.
Aadhaar Individual
This tag is for Aadhaar Individual numbers.
Adoption Taxpayer ID Number
This tag is applied to data recognized as a United States Adoption Taxpayer Identification number.
Age
This tag is applied to data recognized as an age.
Bank Account
This tag is for bank account numbers.
Bank Routing MICR
This tag is applied to data recognized as an American Bankers Association routing number.
Bankers CUSIP ID
This tag is for CUSIP identification numbers for stocks and bonds.
British Columbia Health Network Number
This tag is applied to data recognized as British Columbia's Personal Health Number.
BSN Number
This tag is for Netherlands citizen service numbers.
CDC Number
This tag is for CDC numbers.
CDI Number
This tag is for CDI numbers.
CIC Number
This tag is for CIC numbers.
CNI
This tag is applied to data recognized as a French National ID card number.
CPF Number
This tag is applied to data recognized as Brazil's CPF number.
CPR Number
This tag is applied to data recognized as Denmark's Personal Identification number.
Credit Card Number
This tag is applied to data recognized as a credit card number.
CRYPTO
This tag is applied to data recognized as a Bitcoin Invoice Address.
CURP Number
This tag is for Mexican CURP numbers.
Date
This tag is applied to data recognized as a date.
Date of Birth
This tag is applied to data recognized as a date of birth.
DEA Number
This tag is applied to data recognized as the DEA number of a healthcare provider.
DNI Number
This tag is applied to data recognized as an Argentina National Identity number.
Domain Name
This tag is applied to data recognized as a domain.
Driver's License Number
This tag is applied to data recognized as driver's license numbers from Germany or the United Kingdom.
Electronic Mail Address
This tag is applied to data recognized as an email address.
Employer ID Number
This tag is applied to data recognized as an Employer Identification number from the United States.
Ethnic Group
This tag is applied to data recognized as an ethnic group.
FDA Code
This tag is applied to data recognized as the code of a drug or ingredient registered with the FDA.
Financial Institution
This tag is applied to data recognized as the names of financial institutions based on lists provided by the FDIC and OCC, including alternative names.
Gender
This tag is applied to data recognized as a gender.
GST Individual
This tag is for Indian GST individual numbers.
Healthcare NPI
This tag is applied to data recognized as a United States National Provider Identifier number.
IBAN Code
This tag is applied to data recognized as an International Bank Account number.
ICD10 Code
This tag is applied to data recognized as an ICD10 code from the International Statistical Classification of Diseases and Related Health Problems.
ICD10 Procedure Code
This tag is applied to data recognized as an ICD10 procedure code from the International Statistical Classification of Diseases and Related Health Problems.
ICD9 Code
This tag is for ICD9 codes from the International Statistical Classification of Diseases and Related Health Problems.
ID Number
This tag is for any ID number.
Identity Card Number
This tag is applied to data recognized as an identity card number from Germany.
IMEI
This tag is applied to data recognized as an International Mobile Equipment Identity number.
Individual Number
This tag is for any individual number.
Individual Taxpayer ID Number
This tag is applied to data recognized as a United States Individual Taxpayer Identification Number.
IP Address
This tag is applied to data recognized as an IP address.
Location
This tag is applied to data recognized as a country, state, address, or municipality.
MAC Address
This tag is applied to data recognized as a Media Access Control address.
MAC Address Local
This tag is applied to data recognized as a local Media Access Control address.
Medicare Number
This tag is applied to data recognized as a Medicare number from Australia.
NAICS Code
This tag is applied to data recognized as a North American Industry Classification System (NAICS) code.
National Health Service Number
This tag is for national health service numbers.
National ID Card Number
This tag is applied to data recognized as a national ID card number from Belgium.
National ID Number
This tag is applied to data recognized as a national ID number from Finland, Sweden, or Thailand.
National Insurance Number
This tag is applied to data recognized as a United Kingdom national insurance number.
National Registration ID Number
This tag is for national registration ID numbers.
National Registration Number
This tag is applied to data recognized as a national registration number from Belgium.
NI Number
This tag is for Norway NI numbers.
NIE Number
This tag is applied to data recognized as a Spanish Foreigner Identification number.
NIF Number
This tag is applied to data recognized as a Spanish Tax Identification number.
NIK Number
This tag is applied to data recognized as an Indonesian personal identification number (NIK).
NIR
This tag is applied to data recognized as France's National ID number.
Ontario Health Insurance Number
This tag is applied to data recognized as part of an Ontario Health Insurance Plan string.
PAN Individual
This tag is for PAN Individual numbers.
Passport
This tag is applied to data recognized as a passport number from Australia, Canada, France, Spain, Sweden, or the United States.
Person Name
This tag is applied to data recognized as people's names.
PESEL Number
This tag is for Poland PESEL numbers.
Postal Code
This tag is applied to data recognized as a United States zip code.
Preparer Taxpayer ID Number
This tag is applied to data recognized as a Preparer Taxpayer ID number.
Quebec Health Insurance Number
This tag is applied to data recognized as a Quebec Health Insurance Number.
Resident ID Number
This tag is for China Resident ID numbers.
RRN
This tag is for Korea Resident Registration numbers.
SEC Stock Ticker
This tag is applied to data recognized as a stock ticker recognized by the U.S. Securities and Exchange Commission (SEC).
Social Insurance Number
This tag is applied to data recognized as a social insurance number.
Social Security Number
This tag is applied to data recognized as a United States Social Security Number.
State
This tag is applied to data recognized as a state of the United States.
Swift Code
This tag is applied to data recognized as a SWIFT code.
Tax File Number
This tag is applied to data recognized as a tax file number.
Taxpayer ID Number
This tag is applied to data recognized as Taxpayer ID numbers from the United States.
Taxpayer Reference
This tag is applied to data recognized as United Kingdom Taxpayer Reference numbers.
Telephone Number
This tag is applied to data recognized as a phone number.
Tollfree Telephone Number
This tag is applied to data recognized as a United States toll-free phone number.
URL
This tag is applied to data recognized as a URL.
Vehicle Identifier or Serial Number
This tag is applied to data recognized as a VIN.
Immuta comes with a pack of built-in identifiers that look for common data types. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own domains and edit the tags applied by them.
Identifiers must match at least 90% of the sampled data for a column to be tagged, with the exceptions noted in the identifier descriptions below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.
ARGENTINA_DNI_NUMBER
Detects strings consistent with Argentina's National Identity (DNI) Number. Requires an eight-digit number with periods after the second and fifth digits.
Discovered.Country.Argentina
Discovered.Entity.DNI Number
AUSTRALIA_MEDICARE_NUMBER
Detects numeric strings consistent with Australian Medicare Number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Spaces must be placed between the fourth and fifth and ninth and tenth digits. Optional eleventh digit separated by a / or a space.
Discovered.Country.Australia
Discovered.Entity.Medicare Number
AUSTRALIA_PASSPORT
Detects strings consistent with the Australian Passport number. A string of 8 or 9 characters is required, with a starting uppercase character (A, B, C, D, E, F, G, H, J, L, M, N, R, X, or U) or a two-character alphabetic prefix (P followed by A, B, C, D, E, F, U, W, X, or Z) followed by seven numeric digits.
Discovered.Country.Australia
Discovered.Entity.Passport
BELGIUM_NATIONAL_ID_CARD_NUMBER
Detects numeric strings consistent with Belgium's National ID card. Requires a twelve-digit number with a required hyphen (-) between the third and fourth digits. Allows for an optional hyphen between the tenth and eleventh digits.
Discovered.Country.Belgium
Discovered.Entity.National ID Card Number
BELGIUM_NATIONAL_REGISTRATION_NUMBER New
Detects numeric strings consistent with Belgium's National Registration Number. Requires 11 digits in the form YY.MM.DD-NNN-XX, where YY.MM.DD corresponds to the birth date, NNN is a number, and XX is a two-digit checksum.
Discovered.Country.Belgium
Discovered.Entity.National Registration Number
BITCOIN_INVOICE_ADDRESS
Detects strings consistent with the following Bitcoin Invoice Address formats: P2PKH, P2SH, and Bech32.
Discovered.Entity.CRYPTO
BRAZIL_CPF_NUMBER
Detects numeric strings consistent with Brazil's CPF (Cadastro de Pessoas Físicas) number. Requires an eleven-digit numeric string with optional non-numeric separators (., -, or space) after the third, sixth, and ninth digits.
Discovered.Country.Brazil
Discovered.Entity.CPF Number
CANADA_BC_PHN
Detects numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with hyphens (-) or spaces after the fourth and seventh digits.
Discovered.Country.Canada
Discovered.Entity.British Columbia Health Network Number
CANADA_OHIP
Detects alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP). Requires a twelve-digit capitalized alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth digits.
Discovered.Country.Canada
Discovered.Entity.Ontario Health Insurance Number
CANADA_PASSPORT
Detects strings consistent with the Canadian Passport Number format. Allows for two formats. One format requires two capital letters followed by six digits. The other format requires one letter, followed by six digits, and ends in two letters.
Discovered.Country.Canada
Discovered.Entity.Passport
CANADA_QUEBEC_HIN
Detects alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.
Discovered.Country.Canada
Discovered.Entity.Quebec Health Insurance Number
COUNTRY New
Detects strings consistent with the names of all countries in the world. This identifier is case-insensitive.
Discovered.Entity.Location
CREDIT_CARD_NUMBER
Detects strings consistent with a credit card number with prefixes matching major credit card companies.
Discovered.Entity.Credit Card Number
DATE
Detects strings consistent with dates in various formats or columns with a date data type: date, date+time, or timestamp. This identifier is case-insensitive.
Discovered.Entity.Date
DOMAIN_NAME
Detects strings that begin with a letter and are no more than 225 characters. A full domain can have one to four labels separated by periods (.). Each label can be 1 to 63 alphanumeric characters long, and each label after the first must be in the dictionary list of possible labels. This identifier is case-insensitive.
Discovered.Entity.Domain Name
EMAIL_ADDRESS
Detects strings consistent with an email address. Requires a username of fewer than 255 characters, followed by @, a domain of fewer than 255 characters, and a top-level domain of between 2 and 20 characters.
Discovered.Entity.Electronic Mail Address
ETHNIC_GROUP
Detects strings consistent with the US Census . This identifier allows for dashes to be used in place of spaces and is case-insensitive.
Discovered.Entity.Ethnic Group
FDA_CODE
Detects a string consistent with a drug or ingredient registered with the Food and Drug Administration (FDA). Must start with between 4 to 5 digits, followed by a hyphen, followed by 3 to 4 digits, followed by a hyphen, and finishing with 1 to 2 digits.
Discovered.Country.US
Discovered.Entity.FDA Code
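For readers writing a comparable custom identifier, the FDA_CODE description translates fairly directly into a regular expression. The pattern below is written from the prose description only and is an assumption; the built-in identifier's actual pattern is not shown in this guide and may differ.

```python
import re

# 4-5 digits, hyphen, 3-4 digits, hyphen, 1-2 digits (per the description above).
FDA_CODE_SKETCH = re.compile(r"\d{4,5}-\d{3,4}-\d{1,2}")


def looks_like_fda_code(value: str) -> bool:
    """True when the whole string fits the described FDA code layout."""
    return FDA_CODE_SKETCH.fullmatch(value) is not None


print(looks_like_fda_code("12345-678-90"))
```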
FINANCIAL_INSTITUTIONS New
Detects strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, including alternative names.
Discovered.Entity.Financial Institutions
FRANCE_NIR
Detects numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit.
Discovered.Country.France
Discovered.Entity.NIR
FRANCE_PASSPORT
Detects alphanumeric strings consistent with the French Passport number. Requires two numbers followed by two uppercase letters and ends with five digits.
Discovered.Country.France
Discovered.Entity.Passport
GENDER
Detects strings consistent with and common abbreviations. This identifier is case-insensitive.
Discovered.Entity.Gender
GERMANY_DRIVERS_LICENSE_NUMBER
Detects alphanumeric strings consistent with Germany's driver's license number. Requires an eleven-element string of the format CDDCCCCCCDC where C is an uppercase Latin letter and D is a numeric digit.
Discovered.Country.Germany
Discovered.Entity.Drivers License Number
GREAT_BRITAIN_DRIVERS_LICENSE
Detects alphanumeric strings consistent with the United Kingdom's driver's license number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for decade of birth, two digits for month of birth (incremented by 50 for female drivers), two digits for day of birth, one digit for year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.
Discovered.Country.UK
Discovered.Entity.Drivers License Number
IBAN_CODE
Detects strings consistent with an International Bank Account Number (IBAN). Requires a string in the form ZZ-DD-BBAN, where ZZ is a country code, DD is two numeric digits, and BBAN is a Basic Bank Account Number comprising two to seven groups of three to five uppercase alphanumeric characters, optionally separated by space or dash, and optionally followed by a final group of length one to three.
Discovered.Entity.IBAN Code
ICD10_CODE
Detects strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2025. This identifier is case-insensitive.
Discovered.Entity.ICD10 Code
ICD_10_PCS New
Detects strings consistent with procedure codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from 2020.
Discovered.Entity.ICD10 Procedure Code
IMEI_HARDWARE_ID
Detects strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 or 16 digits with optional hyphens or spaces after the 2nd, 8th, and 14th digits.
Discovered.Entity.IMEI
IP_ADDRESS
Detects IP Addresses in the V4 and V6 formats. This identifier is case-insensitive.
Discovered.Entity.IP Address
LOCATION
Detects ISO3166 formatted locations. This identifier must match at least 80% of the data sampled.
Discovered.Entity.Location
MAC_ADDRESS
Detects strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon or hyphen.
Discovered.Entity.MAC Address
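A rough regex equivalent of the MAC_ADDRESS rule is sketched here. Accepting both uppercase and lowercase hexadecimal digits, and allowing colons and hyphens to be mixed within one value, are assumptions; the description above does not say how the built-in identifier treats either case.

```python
import re

# Six pairs of hex digits, each pair after the first preceded by ':' or '-'.
MAC_ADDRESS_SKETCH = re.compile(r"[0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5}")


def looks_like_mac_address(value: str) -> bool:
    """True when the whole string is twelve hex digits in separated pairs."""
    return MAC_ADDRESS_SKETCH.fullmatch(value) is not None


print(looks_like_mac_address("00:1A:2B:3C:4D:5E"))
```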
NAICS_CODE New
Detects strings consistent with North American Industry Classification System (NAICS) codes. A two-digit number represents a basic sector, and each subsequent digit represents a more specific subsector, up to a maximum of six digits.
Discovered.Entity.NAICS Code
PERSON_NAME
Detects strings consistent with a dictionary of people's names. The name dictionary is US-centric with person names drawn from the US Social Security database, covering 80% of the U.S. population. This identifier must match at least 45% of the data sampled. This identifier is case-insensitive.
Discovered.Entity.Person Name
PHONE_NUMBER
Detects strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone numbers naming convention. Optional area codes allowed.
Discovered.Entity.Telephone Number
POSTAL_CODE
Detects strings consistent with a valid US Zip code with an optional +4 separated by a dash. Only valid five-digit zip codes are detected. This identifier is case-insensitive.
Discovered.Entity.Postal Code
SEC_STOCK_TICKER New
Detects strings consistent with the stock tickers recognized by the U.S. Securities and Exchange Commission (SEC).
Discovered.Entity.Stock Ticker Symbol
SPAIN_NIF_NUMBER
Detects strings consistent with Spain's Tax Identification number. Requires a string with nine alphanumeric characters. Requires either eight digits followed by an optional hyphen or space and a single uppercase letter or the initial character must be X, Y, or Z, followed by an optional dash or space, seven numeric digits, followed by an optional dash or space, and finally, by a single uppercase letter.
Discovered.Country.Spain
Discovered.Entity.NIF Number
SPAIN_PASSPORT
Detects strings consistent with Spain's passport number. Requires an eight- or nine-character string starting with either two or three uppercase letters followed by six numeric digits.
Discovered.Country.Spain
Discovered.Entity.Passport
SWIFT_CODE
Detects alphanumeric strings consistent with a SWIFT code (or Bank Identifier Code (BIC)) format. Requires values consistent with AAAAAACCDDD, where A is an uppercase letter, C is an uppercase letter or numeric digit, and DDD is an optional three-character sequence of uppercase letters or numeric digits.
Discovered.Entity.Swift Code
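The AAAAAACCDDD layout maps onto a short pattern with an optional trailing group. This is a sketch derived from the description, with a hypothetical function name; it checks shape only and does not confirm that a code belongs to a real institution.

```python
import re

# Six uppercase letters, two uppercase alphanumerics, optional three more.
SWIFT_CODE_SKETCH = re.compile(r"[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")


def looks_like_swift_code(value: str) -> bool:
    """True when the whole string has the 8- or 11-character SWIFT shape."""
    return SWIFT_CODE_SKETCH.fullmatch(value) is not None


print(looks_like_swift_code("DEUTDEFF"), looks_like_swift_code("DEUTDEFF500"))
```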
TIME
Detects strings consistent with times in various formats or data type: time. If date is included in the time, it will not match. Use the DATE identifier instead.
Discovered.Entity.Date
UK_NATIONAL_INSURANCE_NUMBER
Detects alphanumeric strings consistent with the United Kingdom's National Insurance Number. Requires a nine-character string. The first two characters must be uppercase letters, followed by an optional space, then six digits with optional spaces or hyphens (-) every two digits, ending with A, B, C, or D.
Discovered.Country.UK
Discovered.Entity.National Insurance Number
URL
Detects strings consistent with a URL. The string must begin with a common scheme, followed by a string, and end with a top-level domain of no more than 128 alphanumeric characters.
Discovered.Entity.URL
US_DEA_NUMBER
Detects alphanumeric strings consistent with a Drug Enforcement Administration (DEA) number assigned to a health care provider. Requires a length of nine characters. The first two characters must be uppercase alphanumeric characters, and the last seven characters are numeric digits. The first character may not be I, N, O, Q, V, W, Y, or Z.
Discovered.Country.US
Discovered.Entity.DEA Number
US_EMPLOYER_IDENTIFICATION_NUMBER
Detects numeric strings consistent with a United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.
Discovered.Country.US
Discovered.Entity.Employer ID Number
US_HEALTHCARE_NPI
Detects 10-digit numeric strings consistent with US National Provider Identifier (NPI). It must either start with 80840 followed by a 1 or 2, or it must begin with a 1 or 2.
Discovered.Country.US
Discovered.Entity.Healthcare NPI
US_PERSON_FULL_NAME New
Detects strings consistent with a person's {first name} space {last name}. Uses the same names from the PERSON_NAME identifier. This identifier must match at least 20% of the data sampled and is case-insensitive.
Discovered.Entity.Person Name
US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER
Detects strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P that is followed by eight digits.
Discovered.Country.US
Discovered.Entity.Preparer Taxpayer ID Number
US_SOCIAL_SECURITY_NUMBER
Detects strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.
Discovered.Country.US
Discovered.Entity.Social Security Number
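The exclusions in the US_SOCIAL_SECURITY_NUMBER rule are a good fit for negative lookaheads. In the sketch below, the optional hyphens between the three parts are an assumption, since the description does not state which separators the built-in identifier accepts; the pattern itself is illustrative and is not Immuta's.

```python
import re

SSN_SKETCH = re.compile(
    r"(?!000|666|9\d\d)\d{3}"  # area: not 000, not 666, not 900-999
    r"-?(?!00)\d{2}"           # group: not 00 (hyphen optional, assumed)
    r"-?(?!0000)\d{4}"         # serial: not 0000 (hyphen optional, assumed)
)


def looks_like_ssn(value: str) -> bool:
    """True when the whole string fits the described SSN constraints."""
    return SSN_SKETCH.fullmatch(value) is not None


print(looks_like_ssn("123-45-6789"), looks_like_ssn("666-12-3456"))
```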
US_STATE
Detects strings consistent with either a full name or two-letter abbreviation of a US state or territory.
Discovered.Country.US
Discovered.Entity.State
US_STREET_ADDRESS
Detects strings consistent with U.S. street addresses. Requires the street naming convention of {address_number} {street_name} {unit number (optional)} with an optional road suffix after the street name. The maximum length for street name is 20 alphanumeric characters. This identifier must match at least 80% of the data sampled and is case-insensitive.
Discovered.Entity.Location
VEHICLE_IDENTIFICATION_NUMBER
Detects strings consistent with Vehicle Identification Numbers. A valid World Manufacturer Identifier is required.
Discovered.Country.US
Discovered.Entity.Vehicle Identifier or Serial Number