Data Discovery

Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using frameworks, rules, and patterns, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.

Supported technologies

SDD supports data discovery on data sources from the following technologies:

Snowflake
Databricks or Databricks Unity Catalog
Starburst (Trino): SDD for Starburst (Trino) is currently in public preview and available to all accounts. Enable this feature on the Immuta app settings page.
Redshift: SDD for Redshift is currently in private preview and available to all accounts. Please reach out to your Immuta representative to enable it on your tenant.

Architecture

To evaluate your data, SDD generates a SQL query using the identification framework's rules; the Immuta system account then executes that query in the native technology. Immuta receives the query result, containing the column name and the matching rules but no raw data values. These results are then used to apply the resulting tags to the appropriate columns.
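To make the "no raw data values" point concrete, here is a minimal sketch of how a discovery query could be assembled from a set of rules. It is illustrative only and is not the query Immuta actually generates: the function name, the rule dictionary, the `sampled` CTE, and the Snowflake-flavored `COUNT_IF`/`REGEXP_LIKE` SQL are all assumptions made for this example.

```python
def build_discovery_query(table, columns, rules, sample_rows=1000):
    """Return SQL whose result rows are (column_name, rule_name, hits, non_null).

    Hypothetical helper: identifiers are interpolated without quoting and
    backslashes in regexes are not escaped, so this is a sketch, not
    production-ready query generation.
    """
    selects = []
    for column in columns:
        for rule_name, regex in rules.items():
            escaped = regex.replace("'", "''")  # escape quotes for a SQL string literal
            selects.append(
                f"SELECT '{column}' AS column_name, '{rule_name}' AS rule_name, "
                f"COUNT_IF(REGEXP_LIKE({column}, '{escaped}')) AS hits, "
                f"COUNT({column}) AS non_null FROM sampled"
            )
    body = "\nUNION ALL\n".join(selects)
    # The sample is read inside the data platform; only aggregates leave it.
    return f"WITH sampled AS (SELECT * FROM {table} LIMIT {sample_rows})\n{body}"


query = build_discovery_query(
    "analytics.customers",
    ["ssn", "city"],
    {"NINE_DIGITS": "[0-9]{9}", "LOWER_WORD": "[a-z]+"},
)
print(query)
```

Every selected expression is either a string literal (the column and rule names) or an aggregate count, so the result set describes which rules matched which columns without carrying any column values.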

This evaluation and tagging process occurs whenever SDD runs, which is triggered automatically by the following events:

A new data source is created.
Schema monitoring is enabled and a new data source is detected.
Column detection is enabled and new columns are detected. Here, SDD will only run on new columns and no existing tags will be removed or changed.

Users can also manually trigger SDD to run from a data source's overview page or the identification frameworks page.

Components

Sensitive data discovery (SDD) runs frameworks to discover data. Each framework is a collection of rules, and each rule contains a single criteria and the resulting tags that are applied when the criteria's conditions have been met. See the sections below for more information on each component.

Identification framework

An identification framework is a collection of rules that will look for a particular criteria and tag any columns where those conditions are met. While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in Default Framework, which contains all the built-in patterns and assigns the built-in Discovered tags based on pattern matching.

For a how-to on the framework actions users can take, see the Manage frameworks page.

Global framework

Each organization has a single global framework that will apply to all the data sources in Immuta by default, unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. Users can bypass this global framework by applying a specific framework to a set of data sources.
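The fallback behavior described above can be sketched in a few lines. The function and the example framework and data source names are hypothetical; the sketch only models the rule that a data source uses its assigned framework if it has one and the global framework otherwise.

```python
def effective_framework(data_source, assignments, global_framework):
    """Return the framework SDD would use for a data source.

    `assignments` maps data source names to a specifically assigned
    framework; anything not listed falls back to the global framework.
    """
    return assignments.get(data_source, global_framework)


assignments = {"hr.employees": "HR Framework"}  # hypothetical assignment
print(effective_framework("hr.employees", assignments, "Default Framework"))
print(effective_framework("sales.orders", assignments, "Default Framework"))
```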

Rule

A rule pairs a criteria with the resulting tags to apply to data that matches that criteria. When Immuta recognizes the criteria, it can tag the data to describe its type. Each rule is specific to its own framework, but all of a framework's rules can be copied to create a new framework.

For a how-to on the rule actions users can take, see the Manage rules page.

Criteria

Criteria are the conditions that need to be met for resulting tags to be applied to data.

Supported criteria types

Competitive pattern analysis: This criteria is a process that will review all the regex and dictionary patterns within the rules of the framework and search for the pattern with the best fit. If there are multiple rules in a framework using competitive pattern analysis, only one will be applied to any column. To learn more about the competitive nature, see the How competitive pattern analysis works guide.
Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.

Pattern

A pattern is the type of data Immuta will look for to meet the requirements to tag a column. Patterns can be used in rules across multiple frameworks, but each pattern can only be used once within a framework. Immuta comes with built-in patterns to discover common categories of data. These patterns cannot be modified and are within preset rules with preset tags. Users can also create their own unique patterns to find their specific data. SDD only supports regex patterns written in RE2 syntax.

Supported pattern types

The three types of patterns are described below:

Regex: This pattern contains a case-insensitive regular expression that searches for matches against column values.
Column name: This pattern includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.
Dictionary: This pattern contains a list of words and phrases to match against column values.
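The difference between the three pattern types can be illustrated with a small local sketch. It uses Python's `re` module rather than RE2, so it only approximates SDD for simple expressions, and the helper names, the full-match behavior, and the handling of nulls are assumptions made for the example. Case sensitivity is exposed as a parameter for value matching instead of being hard-coded.

```python
import re


def value_regex_hit_rate(pattern, values, ignore_case=True):
    """Fraction of non-null column values that fully match a regex pattern."""
    compiled = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(bool(compiled.fullmatch(v)) for v in non_null) / len(non_null)


def column_name_matches(pattern, column_name):
    """Column name patterns look only at the name, never at the values."""
    return re.search(pattern, column_name, re.IGNORECASE) is not None


def dictionary_hit_rate(entries, values, case_sensitive=False):
    """Fraction of non-null column values found in a list of words and phrases."""
    normalize = (lambda s: s) if case_sensitive else str.lower
    lookup = {normalize(e) for e in entries}
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(normalize(v) in lookup for v in non_null) / len(non_null)


print(value_regex_hit_rate(r"\d{3}-\d{2}-\d{4}", ["123-45-6789", "987-65-4321", "n/a", None]))
print(column_name_matches(r"ssn|social", "CUSTOMER_SSN"))
print(dictionary_hit_rate(["red", "green"], ["Red", "green", "blue"]))
```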

Configuration

Only application admins can enable sensitive data discovery (SDD) globally on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.

Tag mutability

When SDD is manually triggered by a data owner, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted Discovered tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
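The tag mutability rules above can be modeled as a small function. This is a simplified sketch, not Immuta's implementation: the data structures, the `"sdd"`/`"manual"` source labels, and the trigger names are assumptions chosen to mirror the three behaviors in the paragraph (manual runs replace SDD tags, new-column runs touch only new columns, and disabled tags are not re-applied).

```python
def apply_sdd_run(existing, discovered, trigger, new_columns=(), disabled=()):
    """Return the column tags after an SDD run.

    existing:   {column: {tag: source}}, source is "sdd" or "manual"
    discovered: {column: set of tags prescribed by this run}
    trigger:    "manual" (data owner re-run) or "new_columns" (column detection)
    disabled:   iterable of (column, tag) pairs disabled in the data dictionary
    """
    disabled = set(disabled)
    result = {col: dict(tags) for col, tags in existing.items()}
    if trigger == "manual":
        targets = set(result) | set(discovered)
        for col in targets:
            # Drop tags previously applied by SDD; manually applied tags stay.
            result[col] = {t: s for t, s in result.get(col, {}).items() if s != "sdd"}
    elif trigger == "new_columns":
        targets = set(new_columns)  # existing columns are left untouched
    else:
        raise ValueError(f"unknown trigger: {trigger!r}")
    for col in targets:
        tags = result.setdefault(col, {})
        for tag in discovered.get(col, ()):
            if (col, tag) not in disabled:
                tags.setdefault(tag, "sdd")
    return result


existing = {"age": {"Discovered.PII": "sdd", "Reviewed": "manual"}}
discovered = {"age": {"Discovered.Entity.Age"}, "notes": {"Discovered.PHI"}}
print(apply_sdd_run(existing, discovered, "manual"))
print(apply_sdd_run(existing, discovered, "new_columns", new_columns=["notes"]))
```

In the manual run, `Discovered.PII` is removed from `age` because the latest run no longer prescribes it, while the manually applied `Reviewed` tag survives. In the new-column run, `age` is unchanged and only `notes` receives tags.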

Performance

The amount of time it takes to run identification on a data source depends on several factors:

Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.

The time it takes to run SDD for all newly onboarded data sources in Immuta is not limited by SDD performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.

Testing

For users interested in testing SDD, note that the built-in patterns by Immuta require a certain amount of confidence to be assigned to a column. This means that with synthetic data, there may be situations where the data is not real enough to fit the confidence needed to match patterns. To test SDD, use a dev environment, create copies of your tables, or use the API to run a dryRun and see the tags that would be applied to your data by SDD.

Considerations

Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the Default Framework, then when the pattern is matched the column will not be tagged. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.

Supported data types and casing

| Type of identifier | Supported data types | Case sensitivity |
| --- | --- | --- |
| Data regex | Text string columns | Case-sensitive |
| Column name regex | Any column | Not case-sensitive |
| Dictionary | Text string columns | Can be toggled in the identifier definition |

Limitations with dictionary patterns

Immuta compiles dictionary patterns into a regex that is sent in the body of a query.

For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.
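To estimate whether a dictionary is likely to fit, you can approximate the compiled regex locally. The sketch below is an assumption-heavy approximation: it uses Python's `re.escape` (whose escaping may differ from what Immuta emits), reads "1 MB" as 1,000,000 bytes, and takes the size of the rest of the query as a caller-supplied guess.

```python
import re

# Assumption: "1 MB" is interpreted as 10**6 bytes for this estimate.
SNOWFLAKE_QUERY_TEXT_LIMIT_BYTES = 1_000_000


def compile_dictionary(entries):
    """Compile dictionary entries into a single alternation regex."""
    # Longest entries first so longer phrases are tried before shorter ones.
    ordered = sorted(set(entries), key=lambda e: (-len(e), e))
    return "(?:" + "|".join(re.escape(e) for e in ordered) + ")"


def fits_in_query(pattern, overhead_bytes=0, limit=SNOWFLAKE_QUERY_TEXT_LIMIT_BYTES):
    """Check the UTF-8 size of the pattern plus the rest of the query text."""
    return len(pattern.encode("utf-8")) + overhead_bytes <= limit


small = compile_dictionary(["New York", "Paris", "Rome"])
print(fits_in_query(small))

large = compile_dictionary(f"entry{i:07d}" for i in range(200_000))
print(fits_in_query(large))
```

The second dictionary has 200,000 twelve-character entries, so its compiled form is well over the limit even before the rest of the query is counted.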

Databricks limitations

For Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary cost if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
SDD for Databricks Unity Catalog will only work on data sources authenticated with a personal access token (PAT). OAuth machine-to-machine (M2M) is not supported with SDD.

Starburst (Trino) limitation

SDD will only work on Starburst (Trino) data sources authenticated with username and password. OAuth 2.0 is not supported with SDD.

Redshift limitations

Redshift Spectrum is not supported with native SDD.

Redshift supported authentication methods

Username and password is fully supported with native SDD.
Okta is not supported with native SDD.
AWS access key is supported with limitations with native SDD:
- The AWS access key used to register the data source must be able to perform, at a minimum, the following redshift-data API actions:
  - redshift-data:BatchExecuteStatement
  - redshift-data:CancelStatement
  - redshift-data:DescribeStatement
  - redshift-data:ExecuteStatement
  - redshift-data:GetStatementResult
  - redshift-data:ListStatements
- The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that they onboard their data sources with.
- If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options formatted like the following:
  region=us-east-2;clusterid=12345
- Redshift Serverless data sources are not supported for native SDD with the AWS access key authentication method.
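The access key requirements above lend themselves to a quick pre-flight check before registering a data source. The sketch below is a hypothetical helper, not part of Immuta or the AWS SDK: it parses the additional connection string options in the `key=value;key=value` form shown above and reports which required options or actions are missing from a list you supply.

```python
REQUIRED_OPTIONS = ("region", "clusterid")

# Actions listed in the requirements above.
REQUIRED_ACTIONS = {
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
    "redshift:GetClusterCredentials",
}


def parse_connection_options(options):
    """Parse 'region=us-east-2;clusterid=12345' into a dictionary."""
    parsed = {}
    for part in options.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"malformed option: {part!r}")
        parsed[key.strip().lower()] = value.strip()
    return parsed


def missing_options(options):
    """Return the required options absent from a connection options string."""
    parsed = parse_connection_options(options)
    return [name for name in REQUIRED_OPTIONS if name not in parsed]


def missing_actions(allowed_actions):
    """Return the required actions that are not in the allowed set."""
    return sorted(REQUIRED_ACTIONS - set(allowed_actions))


print(parse_connection_options("region=us-east-2;clusterid=12345"))
print(missing_options("region=us-east-2"))
print(missing_actions(["redshift-data:ExecuteStatement"]))
```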

Migrating from legacy to native SDD

These limitations are only relevant to users who have previously enabled and run Immuta SDD.

If you had legacy SDD enabled, running native SDD can result in different tags being applied because native SDD is more accurate and has fewer false positives than legacy SDD. Running a new SDD scan against a table will change the context of the resulting tags, but no Discovered tags previously applied by legacy SDD will be removed.

See the Migrate from legacy to native SDD page for more information.

How-to Guides

Enable Sensitive Data Discovery (SDD)

Requirement: Immuta permission GOVERNANCE

This how-to guide is for enabling sensitive data discovery (SDD). For additional information on sensitive data discovery and classification, see the reference guides.

Navigate to the App Settings page and scroll to the Sensitive Data Discovery section.
Select the Enable Sensitive Data Discovery (SDD) checkbox to enable SDD.
Click Save and then click Confirm to apply your changes. Note that the Immuta tenant will have a system restart.

To run SDD for a select group of data sources, make the following request using the Immuta API, specifying the data sources in the request body:

curl \
    --request 'POST' \
    'https://your-immuta-url.immuta.com/sdd/run' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: 438a3096966c4a5188b3b468cedb213e' \
    --data '{"sources":["Example Data Source Name", "Example Data Source 2 Name"]}'

A successful request will have the code 200 and a body with the number of jobs created from the request:

{
    "jobCount": 2
}

Navigate to the data source overview page of the data source you listed in the payload.
Click the Data Dictionary tab.
Assess whether the Discovered and classification tags applied are accurate.
If they are, then repeat the steps above for more of your data sources. Once a majority of your data sources appear to have accurate tags, run SDD on all data sources. If the tags are not accurate, you will need to tune SDD and classification frameworks. See the Manage Patterns and Manage Rules guides for instructions.

Run SDD on all data sources

Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run SDD and then select it again in the modal.

Run SDD on all data sources using the API

Requirement: Immuta permission GOVERNANCE

Make the following request using the Immuta API to run SDD for all data sources, specifying all as true:

curl \
    --request 'POST' \
    'https://your-immuta-url.immuta.com/sdd/run' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: 438a3096966c4a5188b3b468cedb213e' \
    --data '{"all": true}'

A successful request will have the code 200 and a body with the number of jobs created from the request:

{
    "jobCount": 12
}
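The two requests above differ only in their payload, so they can share one helper when scripted. The sketch below uses only the Python standard library and mirrors the endpoint, headers, and payloads from the curl examples; the function names and the placeholder API key are assumptions, and the sketch does not cover error handling or any other endpoint.

```python
import json
import urllib.request


def build_sdd_run_request(base_url, api_key, sources=None, run_all=False):
    """Build the POST /sdd/run request for specific sources or for all sources."""
    if run_all == bool(sources):
        raise ValueError("specify either a list of sources or run_all=True, not both or neither")
    payload = {"all": True} if run_all else {"sources": list(sources)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/sdd/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def run_sdd(request):
    """Send the request and return the number of jobs created."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["jobCount"]


request = build_sdd_run_request(
    "https://your-immuta-url.immuta.com",
    "YOUR_API_KEY",  # placeholder, replace with a real key
    sources=["Example Data Source Name", "Example Data Source 2 Name"],
)
print(request.get_method(), request.full_url)
# job_count = run_sdd(request)  # uncomment to send the request to your tenant
```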

Manage Identification Frameworks

Requirements:

SDD enabled
Registered data sources
Immuta permission GOVERNANCE

Create a framework

Create a framework with no rules

Click the Discover icon in the navigation menu and select the Frameworks tab.
Click Create New.
Enter a Name for the framework.
Enter a Description for the framework.
Select the option to Create empty framework.
Click Create.

After you create the framework, you can add rules to it; see the Create a rule section.

Copy an existing framework and its rules

Click the Discover icon in the navigation menu and select the Frameworks tab.
Click Create New.
Enter a Name for the framework.
Enter a Description for the framework.
Select the option to Create rules from an existing framework.
Select the checkbox for the framework you want to copy. You can only copy a single framework. For more information about a framework, click the framework name to open a new tab with details about the framework.
Click Create.

Assign a framework to data sources

To assign a framework to run on specific data sources,

Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to assign and navigate to the Data Sources tab.
Click Add Data Sources.
Select the checkbox for the data source you want this framework to run on. You may select more than one.
Click Add Data Source(s).

Remove data sources from a framework

After a data source is removed from a framework, it will use the global framework for any SDD scans and the tags applied by the removed framework will be replaced. To remove data sources from a framework,

Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to remove data sources from and navigate to the Data Sources tab.
Select the checkbox for the data source you want to remove from the framework. You may select more than one.
Select the Bulk Actions more options.
Select Remove Data Sources.
Click Confirm.

Delete a framework

Deleting a framework will remove it from any data sources. Those data sources will then use the global framework for any SDD scans and the tags applied by the deleted framework will be replaced. Governors can delete any framework, and users with the CREATE_DATA_SOURCE or CREATE_DATA_SOURCE_IN_PROJECT permissions can only delete frameworks they created. To delete a framework,

Click the Discover icon in the navigation menu and select the Frameworks tab.
Select Remove.
Click Confirm.

Manage Patterns

Requirements:

SDD enabled
Immuta permission GOVERNANCE

Create a pattern

Click the Discover icon in the navigation menu and select the Patterns tab.
Click Create New.
In the modal, enter a name for the new pattern.
Write a Description for the type of data the pattern will find.
Select the pattern type.
1. For regex and column name regex, enter the regex.
2. For dictionary, enter the values you want the pattern to match and toggle the switch on if you want them to be case-sensitive.
Click Create Pattern.
See the Create a rule section to add your new pattern to a framework.

Note that all user-created patterns must be a 90% match or greater for the contents of the column to be tagged.

Edit a pattern

Editing a pattern will affect any rule built off the pattern throughout Immuta. To edit a pattern,

Click the Discover icon in the navigation menu and select the Patterns tab.
Click the name of the pattern you want to edit.
Click Edit.
Edit the field you want to change. Note that any shadowed field is not editable; to change it, you must delete and re-create the pattern.
Click Save.

Built-in patterns cannot be edited.

Delete a pattern

Deleting a pattern will remove it from Immuta and remove all the rules that relied on it in the frameworks throughout Immuta. To delete a pattern,

Click the Discover icon in the navigation menu and select the Patterns tab.
Click the three dot menu in the Action column for the pattern you want to delete.
Select Remove.
Click Confirm.

Built-in patterns cannot be deleted.

Manage Rules

Requirements:

SDD enabled
Immuta permission GOVERNANCE

Create a rule

You can only have one rule per pattern in the framework. If you do not see the pattern for the rule you want to create, then it already has a rule built off of it.

Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to edit and navigate to the Discovery Rules tab.
Click Create New.
Select the Tags to apply from the dropdown. The tags you select are the tags applied when the pattern is matched. Note that resulting tags must be under the Discovered parent tag and cannot be parent tags themselves unless they have already been manually applied to a data source.
Select the Criteria type from the dropdown. See the Supported criteria types section.
1. Competitive pattern analysis is for regex and dictionary patterns.
2. Column name is for column name patterns.
Select the Pattern from the dropdown.
Click Create Rule.

Edit a rule

Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework of the rule you want to edit and navigate to the Discovery Rules tab.
Select the rule you want to edit.
Click Edit.
Edit the field you want to change. Note that any shadowed field is not editable; to change it, you must delete and re-create the rule.
Click Save.

Delete a rule

After a rule is deleted, the tags it previously applied are removed the next time SDD runs on a data source. To delete a rule,

Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to edit and navigate to the Discovery Rules tab.
Click the three dot menu in the Action column for the rule you want to delete.
Select Remove.
Click Confirm.

Manage SDD on Data Sources

Requirements:

SDD enabled
Registered data sources
Immuta permission GOVERNANCE

Run SDD using a specific framework

SDD runs automatically, but if you want to re-run SDD when a new global framework is set or when new rules have been added, you can run it manually through the UI:

Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run SDD and then select it again in the modal.

Run SDD on a data source

SDD runs automatically, but if you want to re-run SDD when a new global framework is set or when new rules have been added, you can run it manually for specific data sources through the UI:

Navigate to the data source overview page.
Click the health status.
Select Re-run next to Sensitive Data Discovery (SDD).

Verify discovered tags

If sensitive data discovery has been enabled, then manually adding tags to columns in the data dictionary will be unnecessary in most cases. The data owner will just need to verify that the Discovered tags are correct.

Disable Discovered tags from the data dictionary

If a governor, data owner, or data source expert disables a Discovered tag from the data dictionary, the column will not be re-tagged when that data source's fingerprint is recalculated or SDD is re-run. When a Discovered tag is disabled, the tag will not completely disappear, so it can be manually enabled through the tag side sheet.

To disable a discovered tag,

Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click the tag you want to remove.
Click Disable in the side sheet and then click Confirm.

Manage Global SDD Settings

Requirement: Immuta permission APPLICATION_ADMIN

Configure the global framework

Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Enter the request-friendly name of your global template in the Global SDD Template Name field. This name can be found in the tooltip on the framework's detail page.
Click Save, and then Confirm your changes.

Migrate From Legacy to Native SDD

This guide provides information and best practices for migrating from the deprecated legacy sensitive data discovery (SDD) option to the improved native SDD. This guide is for users who have already enabled SDD on their tenant and have Discovered tags on their data sources.

Before you begin

Native vs legacy SDD

Legacy SDD is deprecated. It will be removed and replaced by native SDD. Native SDD is significantly improved from legacy SDD for discovering and tagging your data with upgrades to the built-in patterns. Additionally, the greatest benefit is the respect for data residency. Native SDD doesn't move any of your data when running. The discovery is done right in your data platform, and the platform only returns the matching patterns and column names to Immuta.

See the Sensitive data discovery reference page for more information on native SDD.

Requirements

Native SDD requires Snowflake, Databricks, Redshift, or Starburst (Trino) data sources
Legacy SDD enabled on your tenant
Legacy SDD tags applied to your data sources: To find out if you have legacy SDD tags applied, create a governance report as described in the Understand the context of your tags section.

Enable native SDD

Contact your Immuta representative to enable native SDD on your Immuta tenant. Note that unless specifically disabled, all Immuta installations after the 2024.2 LTS have native SDD automatically enabled. To check for yourself whether native SDD is already running and tagging your data before you reach out to your representative, proceed to the Understand the context of your tags section.

This action will not change anything immediately on your tenant; however, anytime SDD runs in the future, it will be native SDD instead of the legacy version.

To assess native SDD for your data, proceed with the steps below. If you do not review native SDD, the legacy SDD tags will all remain on your data source columns. However, when SDD automatically runs on new data sources and columns, it will apply native SDD tags, and because of the improvements to SDD, it may tag different data than legacy SDD.

Understand the context of your tags

Requirement: Immuta permission GOVERNANCE

Manually run SDD globally to run native SDD on your data sources.
To check the tags on an individual data source, navigate to the data source data dictionary and select a Discovered tag. On the tag side sheet, you can determine the context of the tag. When patterns match data, native SDD will apply tags, and their tag context will be Sensitive Data Discovery. Any tags with the context Legacy Sensitive Data Discovery were not matched by native SDD but will remain on the data source.
To check your tags globally, navigate to the governance reports page and build a report for sensitive data discovery. This report will present the legacy tags on your data sources' columns and native SDD tags that are also on those columns. Use this report to assess the context of the Discovered tags and understand if native SDD is matching the data you want it to.

These actions will allow you to understand the differences between how native SDD and legacy SDD tag your data and whether your data is recognized as expected by native SDD or if legacy SDD was over-tagging your data. This way you can better tune SDD to your data.

If there are any legacy SDD tags that you want native SDD to catch, you need to tune native SDD so that this type of data is discovered in future tables and columns; see guidance on that in the next section.

Tune SDD

Requirement: Immuta permission GOVERNANCE

Using the report you built above, complete these actions to tune SDD:

Focus on a legacy SDD tag that was properly applied to your data. Assess whether the native SDD tag on that column is more accurate than the legacy tag. If the native SDD tag is missing or incorrect, proceed to the next step.
Create a new regex or dictionary pattern to discover this data. Ensure it is specific and will match your data with a 90% confidence.
Create a new rule in your framework using the new pattern and the Discovered tag you want applied to the data.
Complete the steps above for all legacy SDD tags.
Retest your updated rules and patterns by re-running SDD on the select data sources and continue refining to the level of accuracy you want.

Completing the actions above will create parity between the tags legacy SDD applied to your data and the tags native SDD will apply in the future.

Reference Guides

How Competitive Pattern Analysis Works

Of sensitive data discovery's three pattern types, regex and dictionary are competitive. This means that when multiple patterns could match your data, only one of the competitive patterns is chosen to tag it. Read on to understand how Immuta runs this competition.

Discover employs a three-phased competitive pattern analysis approach for sensitive data discovery (SDD):

Sampling: No data is moved, and Immuta checks the patterns against a sample of data from the table.
Qualifying: Patterns that have less than a 90% match are filtered out.
Scoring: The remaining patterns are compared with one another to find the most specific pattern that qualifies and matches the sample.

In the end, competitive pattern analysis aims to find a single pattern for each column that best describes the data format.

Sampling

In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the pattern has matched a value in the column) information for each active pattern. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active patterns over a row sample.

The sample size is decided based on the number of patterns and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary patterns being run in the framework, not the data size. The sample size dependence on the number of patterns is weak and will not exceed 13,000 rows.

| Number of patterns | Sample size |
| --- | --- |
| 5 | 7369 rows |
| 50 | 9211 rows |
| 500 | 11053 rows |
| 5000 | 12895 rows |
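The weak dependence on the number of patterns is logarithmic in the listed values: each sample size is 1842 rows larger than the previous one. As an observation rather than a documented formula, the four listed sample sizes (7369, 9211, 11053, and 12895 rows) are reproduced for 5, 50, 500, and 5000 patterns by a Hoeffding-style bound with a union bound over the patterns. The tolerance `epsilon=0.025` and failure probability `delta=0.001` below are values inferred by fitting the listed sizes, and treating 13,000 as a hard cap is an assumption based on the statement that the sample will not exceed that size.

```python
import math


def requested_sample_rows(num_patterns, epsilon=0.025, delta=0.001, cap=13_000):
    """Rows needed so that every pattern's hit rate is within epsilon of its
    true value with probability at least 1 - delta (Hoeffding + union bound).

    epsilon and delta are inferred values, not documented Immuta settings.
    """
    rows = math.ceil(math.log(2 * num_patterns / delta) / (2 * epsilon ** 2))
    return min(rows, cap)


for count in (5, 50, 500, 5000):
    print(count, requested_sample_rows(count))
```

Because the pattern count only appears inside the logarithm, multiplying the number of patterns by ten adds the same fixed number of rows each time.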

Sampling considerations

In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer than the requested number of rows, when many of the column values are null, or because of technology-specific limitations.

Snowflake and Starburst (Trino): Discover implements native table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
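The under-sampling effect on Databricks and Redshift follows directly from the 10% sample and the 10,000-row cap. The sketch below computes the expected number of sampled rows for a table of a given size and compares it with a requested sample size; the function names are hypothetical, and the actual number of rows returned by a random 10% sample will vary around this expectation.

```python
def best_effort_sample_rows(table_rows, percent=10, cap=10_000):
    """Expected rows from a flat percentage sample, capped at `cap` rows."""
    return min(table_rows * percent // 100, cap)


def is_under_sampled(table_rows, requested_rows):
    """True when the best-effort sample is expected to fall short of the request."""
    return best_effort_sample_rows(table_rows) < requested_rows


for rows in (20_000, 50_000, 100_000, 1_000_000):
    print(rows, best_effort_sample_rows(rows), is_under_sampled(rows, 7369))
```

For example, a 50,000-row table yields about 5,000 sampled rows, which falls short of a 7369-row request, while any table with 100,000 or more rows reaches the 10,000-row cap.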

Qualifying

During the qualification phase, patterns that do not agree with the data are disqualified. A pattern agrees with the data if the hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is a 90% match for most built-in patterns; however, two built-in patterns have lower thresholds. The 90% threshold also applies to all custom patterns, which ensures that a pattern matches the data within the column and avoids false positives. If no patterns qualify, then no pattern is assessed for scoring and the column is not tagged.

Scoring

During the scoring phase, a machine inference is carried out among all qualified patterns, combining pattern-derived complexity information with hit rate information to determine which pattern best describes the sample data. This process prefers the more restrictive of two competing patterns since the ability to satisfy the more difficult-to-satisfy pattern itself serves as evidence that it is more likely. This phase ends by returning a single most likely pattern per the inference process.

Example

Here is a set of regex patterns and a sample of data:

Patterns:

1. [a-zA-Z0-9]{3} - This pattern will match 3 character strings with the characters a-z, lowercase or uppercase, or digits 0-9.
2. [a-c]{3} - This pattern will match 3 character strings with the characters a-c, lowercase.
3. (a|b|d){3} - This pattern will match 3 character strings with the characters a, b, or d, lowercase.

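| Sample data | Matches Pattern 1 | Matches Pattern 2 | Matches Pattern 3 |
| --- | --- | --- | --- |
| dad | Yes | No | Yes |
| baa | Yes | Yes | Yes |
| add | Yes | No | Yes |
| add | Yes | No | Yes |
| cab | Yes | Yes | No |
| bad | Yes | No | Yes |
| aba | Yes | Yes | Yes |
| baa | Yes | Yes | Yes |
| dad | Yes | No | Yes |
| baa | Yes | Yes | Yes |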

When qualifying the patterns, Pattern 1 and Pattern 3 both match 90% or more of the data. Pattern 2 does not, and is disqualified.

Then the qualified patterns are scored. Here, Pattern 1, despite matching 100% of the data, is unspecific and could match over 200,000 values. On the other hand, Pattern 3 matches just at 90% but is very specific with only 27 available values.

Therefore, with the specificity taken into account, Pattern 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
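The example can be reproduced with a short script. This is a deliberately simplified stand-in for the real analysis: qualification uses the 90% threshold described above, but scoring here just picks the qualified pattern that accepts the fewest distinct strings, whereas Immuta's scoring is a machine inference that combines complexity and hit rate. The accepted-string counts are supplied by hand, and the function names are hypothetical.

```python
import re

# (regex, number of distinct strings the pattern accepts)
PATTERNS = {
    "Pattern 1": (r"[a-zA-Z0-9]{3}", 62 ** 3),  # 238,328 possible values
    "Pattern 2": (r"[a-c]{3}", 3 ** 3),         # 27 possible values
    "Pattern 3": (r"(a|b|d){3}", 3 ** 3),       # 27 possible values
}
SAMPLE = ["dad", "baa", "add", "add", "cab", "bad", "aba", "baa", "dad", "baa"]


def hit_count(regex, values):
    """Number of sample values that fully match the regex (case-sensitive)."""
    return sum(1 for v in values if re.fullmatch(regex, v))


def best_pattern(patterns, values, threshold_percent=90):
    """Qualify patterns at the threshold, then prefer the most specific one."""
    qualified = {}
    for name, (regex, accepted) in patterns.items():
        # Integer comparison avoids floating-point issues at exactly 90%.
        if hit_count(regex, values) * 100 >= threshold_percent * len(values):
            qualified[name] = accepted
    if not qualified:
        return None  # no pattern qualifies, so the column is not tagged
    smallest = min(qualified.values())
    winners = [name for name, accepted in qualified.items() if accepted == smallest]
    return winners[0] if len(winners) == 1 else None  # ties are inconclusive


for name, (regex, _) in PATTERNS.items():
    print(name, hit_count(regex, SAMPLE), "of", len(SAMPLE))
print("Winner:", best_pattern(PATTERNS, SAMPLE))
```

Pattern 1 matches 10 of 10 values, Pattern 2 matches 5 of 10 and is disqualified, and Pattern 3 matches 9 of 10, so Pattern 3 wins as the more specific of the two qualified patterns.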

Important notes

Dictionaries are considered patterns by Immuta and are part of the competitive process, while column-name regex patterns are not.
Scoring ties are rare but can occur if the same pattern is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return a pattern in the case of a tie.
Pattern complexity analysis is sensitive to the total number of strings a pattern accepts or, equivalently for dictionaries, the number of entries. Therefore, patterns that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.

Built-in Pattern Reference

Previous documentation referred to rules and patterns as classifiers or identifiers. The language is being updated to be more accurate and to avoid confusion with Detect classification.

Immuta comes with a set of built-in patterns that look for common data types. These patterns were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can build their own rules using these built-in patterns, which will customize the resulting tags based on the organization's needs.

When using SDD with classification frameworks, it is recommended to use the default resulting tags listed in the table below for these built-in patterns. This ensures that the framework rules apply sensitivity tags as intended.

Pattern descriptions and default resulting tags

Pattern

Description

Resulting tags from the default rules

AGE

Matches numeric strings between 10 and 199.

Discovered.PII

Discovered.Identifier Indirect

Discovered.PHI

Discovered.Entity.Age

ARGENTINA_DNI_NUMBER

Matches strings consistent with Argentina National Identity (DNI) Number. Requires an eight-digit number with optional periods between the second and third and fifth and sixth digit.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Argentina

Discovered.PHI

Discovered.Entity.DNI Number

AUSTRALIA_MEDICARE_NUMBER

Matches numeric strings consistent with Australian Medicare number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Optional spaces can be placed between the fourth and fifth and ninth and tenth digit. The optional 11th digit separated by a / can be present. A checksum is required.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Australia

Discovered.PHI

Discovered.Entity.Medicare Number

AUSTRALIA_PASSPORT

Matches strings consistent with an Australian passport number. Requires an 8- or 9-character string that starts with either a single upper case letter (N, E, D, F, A, C, U, X) or a two-letter prefix (P followed by A, B, C, D, E, F, U, W, X, or Z), followed by seven digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Australia

Discovered.PHI

Discovered.Entity.Passport

BELGIUM_NATIONAL_ID_CARD_NUMBER

Matches numeric strings consistent with Belgium's National ID card number. Requires a twelve-digit number with hyphens (-) between the third and fourth digits and between the tenth and eleventh digits. A two-digit checksum is required.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Belgium

Discovered.PHI

Discovered.Entity.National ID Card Number

BITCOIN_INVOICE_ADDRESS

Matches strings consistent with the following Bitcoin invoice address formats: P2PKH, P2SH, and Bech32. P2PKH and P2SH addresses must start with a 1 or a 3, respectively, followed by 25 to 34 alphanumeric characters, excluding l, I, O, and 0. Bech32 addresses must begin with bc1 and be followed by 39 characters. To be identified, an address must have a valid checksum.

Discovered.Entity.CRYPTO

Discovered.PCI

BRAZIL_CPF_NUMBER

Matches numeric strings consistent with Brazil's CPF (Cadastro de Pessoas Físicas) number. Requires an eleven-digit numeric string with non-numeric separators after the third, sixth, and ninth digits. A two-digit checksum is required.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Brazil

Discovered.PHI

Discovered.Entity.CPF Number

CANADA_BC_PHN

Matches numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with optional hyphens (-) or spaces after the fourth and seventh digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Canada

Discovered.PHI

Discovered.Entity.British Columbia Health Network Number

CANADA_OHIP

Matches alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP) number. Requires a twelve-character alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth characters. The final two characters are a checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Canada

Discovered.PHI

Discovered.Entity.Ontario Health Insurance Number

CANADA_PASSPORT

Matches strings consistent with the Canadian Passport Number format.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Canada

Discovered.PHI

Discovered.Entity.Passport

CANADA_QUEBEC_HIN

Matches alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Canada

Discovered.PHI

Discovered.Entity.Quebec Health Insurance Number

CREDIT_CARD_NUMBER

Matches strings consistent with a credit card number with prefixes matching major credit card companies. Must include a valid checksum.

Discovered.PCI

Discovered.Entity.Credit Card Number

DATE

Matches strings consistent with dates. These can include days of the week, dates, and date times.

Discovered.Entity.Date

DENMARK_CPR_NUMBER

Matches numeric strings consistent with Denmark's Personal Identification Number (CPR-number or Person-number). Requires a ten-digit number in either DDMMYY-SSSS or DDMMYYSSSS format. The first six digits are the individual's birth date in day, month, year format. The final four digits are the sequence number.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Denmark

Discovered.PHI

Discovered.Entity.CPR Number

DOMAIN_NAME

Matches domain names using a very broad pattern.

Discovered.Entity.Domain Name

EMAIL_ADDRESS

Matches strings consistent with an email address. Requires a username of fewer than 255 characters, followed by @, a domain of fewer than 255 characters, and a top-level domain of between 2 and 20 characters.

Discovered.PHI

Discovered.Entity.Electronic Mail Address

Discovered.Identifier Direct

ETHNIC_GROUP

Matches strings consistent with the US Census race designations.

Discovered.PII

Discovered.Entity.Ethnic Group

FDA_CODE

Matches strings consistent with a drug or ingredient registered with the Food and Drug Administration (FDA). Must start with 4 to 6 digits, followed by a hyphen, then 3 to 4 digits, another hyphen, and finally one to two digits.

Discovered.Country.US

Discovered.Entity.FDA Code

FINLAND_NATIONAL_ID_NUMBER

Matches a string consistent with Finland's National ID number. Requires an eleven-character string in a DDMMYYCZZZQ format. The first six digits are an individual's birth date in Day, Month, Year format. The C character is a century of birth indicator (+ for the years 1800-1899, - for years 1900-1999, and A for years 2000-2099). ZZZ is an individual ID number, and Q is a required checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Finland

Discovered.PHI

Discovered.Entity.National ID Number

FRANCE_CNI

Matches numeric strings consistent with the French National ID card number (carte nationale d'identité). Requires a twelve-digit numeric string.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.France

Discovered.PHI

Discovered.Entity.CNI

FRANCE_NIR

Matches numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit. The 14th and 15th digits act as a checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.France

Discovered.PHI

Discovered.Entity.NIR

FRANCE_PASSPORT

Matches alphanumeric strings consistent with the French passport number. Requires two digits followed by two upper case letters and then five digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.France

Discovered.PHI

Discovered.Entity.Passport

GENDER

Matches strings consistent with gender or gender abbreviations.

Discovered.PII

Discovered.Identifier Indirect

Discovered.PHI

Discovered.Entity.Gender

GERMANY_DRIVERS_LICENSE_NUMBER

Matches alphanumeric strings consistent with Germany's Driver's License number. Requires an eleven-element string, with a digit or a letter followed by two digits, 6 digits or letters, one digit, and one digit or letter.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Germany

Discovered.PHI

Discovered.Entity.Drivers License Number

GERMANY_IDENTITY_CARD_NUMBER

Matches alphanumeric strings consistent with Germany's Identity Card number. Requires a single letter followed by eight digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Germany

Discovered.PHI

Discovered.Entity.Identity Card Number

IBAN_CODE

Matches strings consistent with an International Bank Account Number (IBAN). Must contain a valid country code.

Discovered.Entity.IBAN Code

ICD10_CODE

Matches strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2020.

Discovered.Entity.ICD10 Code

IMEI_HARDWARE_ID

Matches strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 digits with optional hyphens or spaces after the second, 8th, and 14th digits.

Discovered.Entity.IMEI

IP_ADDRESS

Matches IP addresses in the IPv4 and IPv6 formats.

Discovered.Entity.IP Address

LOCATION

Matches strings consistent with Countries, States, Addresses, or Municipalities. By default focuses on locations in the United States.

Discovered.Entity.Location

MAC_ADDRESS

Matches strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon.

Discovered.Entity.MAC Address

MAC_ADDRESS_LOCAL

Matches strings consistent with a local Media Access Control (MAC) address.

Discovered.Entity.MAC Address Local

PERSON_NAME

Matches strings consistent with a dictionary of people's names. Names are drawn from the US Social Security database.

Discovered.PII

Discovered.PHI

Discovered.Entity.Person Name

Discovered.Identifier Indirect

PHONE_NUMBER

Matches strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone number format.

Discovered.Entity.Telephone Number

POSTAL_CODE

Matches strings consistent with a valid US zip code with an optional +4. Only valid 5 digit zip codes are detected.

Discovered.Entity.Postal Code

SPAIN_NIE_NUMBER

Matches strings consistent with Spain's Foreigner Identification number. Requires an eight-character string. The initial character must be X, Y, or Z, followed by seven digits, then by an optional hyphen or space and a single checksum character.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Spain

Discovered.PHI

Discovered.Entity.NIE Number

SPAIN_NIF_NUMBER

Matches strings consistent with Spain's Tax Identification number. Requires an eight-character string. Requires eight digits followed by an optional hyphen or space and a single checksum character.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Spain

Discovered.PHI

Discovered.Entity.NIF Number

SPAIN_PASSPORT

Matches strings consistent with Spain's Passport number. Requires an eight- or nine-character string, starting with either two or three letters followed by six digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Spain

Discovered.PHI

Discovered.Entity.Passport

STREET_ADDRESS

Matches strings consistent with street addresses. Primarily looks for strings consistent with the United States street naming convention.

Discovered.Entity.Location

SWEDEN_NATIONAL_ID_NUMBER

Matches numeric strings consistent with Sweden's National ID number. Requires a ten- or twelve-digit string that must start with a date in either the YYMMDD or YYYYMMDD format. An optional - or + character then separates the four ending digits. The final digit is a checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Sweden

Discovered.PHI

Discovered.Entity.National ID Number

SWEDEN_PASSPORT

Matches numeric strings consistent with Sweden's Passport number. Requires an 8-digit number.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Sweden

Discovered.PHI

Discovered.Entity.Passport

SWIFT_CODE

Matches alphanumeric strings consistent with a SWIFT code (or Bank Identifier Code (BIC)) format.

Discovered.Entity.Swift Code

THAILAND_NATIONAL_ID_NUMBER

Matches strings consistent with Thailand's National ID number. Requires a 13-digit number with optional spaces or hyphens (-) after the first, fifth, tenth, and twelfth digits. The final digit is a checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.Thailand

Discovered.PHI

Discovered.Entity.National ID Number

TIME

Matches strings consistent with times. Can contain both date and time pieces.

Discovered.Entity.Date

UK_DRIVERS_LICENSE_NUMBER

Matches alphanumeric strings consistent with the United Kingdom's Driver's License number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for decade of birth, two digits for month of birth (incremented by 50 for female drivers), two digits for day of birth, one digit for year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.UK

Discovered.PHI

Discovered.Entity.Drivers License Number

UK_NATIONAL_INSURANCE_NUMBER

Matches alphanumeric strings consistent with the United Kingdom's National Insurance number. Requires a nine-character string. The first two characters must be letters, followed by an optional space, then six digits with optional spaces or hyphens (-) after every two digits, ending with a letter.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.UK

Discovered.PHI

Discovered.Entity.National Insurance Number

UK_TAXPAYER_REFERENCE

Matches ten-digit numeric strings consistent with UK Taxpayer Reference (UTR) numbers. The final digit is a checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.UK

Discovered.PHI

Discovered.Entity.Taxpayer Reference

URL

Matches strings consistent with a Uniform Resource Locator (URL). The string must begin with http://, https://, ftp://, file:///, or mailto:, followed by a string and ending with a top-level domain of no more than 128 characters.

Discovered.Entity.URL

US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER

Matches numeric strings consistent with a United States Adoption Taxpayer Identification Number (ATIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having 93 as an allowed Group Number.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.US

Discovered.PHI

Discovered.Entity.Adoption Taxpayer ID Number

US_BANK_ROUTING_MICR

Matches numeric strings consistent with an American Bankers Association (ABA) Routing Number. Must be a nine-digit number starting with 0, 1, 2, 3, 6, or 7, followed by eight digits. The final digit is a checksum.

Discovered.Country.US

Discovered.Entity.Bank Routing MICR

US_DEA_NUMBER

Matches alphanumeric strings consistent with a Drug Enforcement Administration (DEA) number that is assigned to a health care provider. Must be nine characters long. The first two characters must be alphanumeric, and the last seven characters must be digits. The final digit is a checksum.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.US

Discovered.Entity.DEA Number

US_EMPLOYER_IDENTIFICATION_NUMBER

Matches numeric strings consistent with a United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.

Discovered.Country.US

Discovered.Entity.Employer ID Number

US_HEALTHCARE_NPI

Matches numeric strings consistent with US National Provider Identifier (NPI). Strings must be either 10 or 15 digits with the final digit being a valid checksum.

Discovered.PII

Discovered.Country.US

Discovered.Entity.Healthcare NPI

Discovered.Identifier Undetermined

US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER

Matches numeric strings consistent with a United States Individual Taxpayer Identification Number (ITIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having a limited set of allowed Group Numbers.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.US

Discovered.PHI

Discovered.Entity.Individual Taxpayer ID Number

US_PASSPORT

Matches numeric strings consistent with United States Passport number. Strings must contain nine digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.US

Discovered.PHI

Discovered.Entity.Passport

US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER

Matches strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P that is followed by 8 digits.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.US

Discovered.Entity.Preparer Taxpayer ID Number

US_SOCIAL_SECURITY_NUMBER

Matches strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.

Discovered.PII

Discovered.Identifier Direct

Discovered.Country.US

Discovered.PHI

Discovered.Entity.Social Security Number

US_STATE

Matches strings consistent with either a full name or two-letter abbreviation of a US state or territory.

Discovered.Country.US

Discovered.Entity.State

US_TOLLFREE_PHONE_NUMBER

Matches strings consistent with a US toll-free telephone number. Allowed area codes are 800, 88+any digit, or 899.

Discovered.Country.US

Discovered.Entity.Tollfree Telephone Number

VEHICLE_IDENTIFICATION_NUMBER

Matches strings consistent with Vehicle Identification Numbers. A checksum is required as well as a valid World Manufacturer Identifier.

Discovered.Country.US

Discovered.Entity.Vehicle Identifier or Serial Number
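
Several patterns in the table above combine a format requirement with a checksum. The Python sketch below shows what that kind of per-value check can look like for four of them. It is not Immuta's implementation: the function names are hypothetical, the table does not name the checksum algorithms, and the sketch assumes the publicly documented ones (Luhn for credit card numbers, the 3-7-1 weighted sum for ABA routing numbers, and the modulo-31 check character for Finnish identity codes). The SSN check encodes only the constraints listed in the table and assumes hyphens are optional.

```python
import re

# Check characters for Finnish identity codes (assumed public algorithm).
FIN_CHECK = "0123456789ABCDEFHJKLMNPRSTUVWXY"


def luhn_ok(value: str) -> bool:
    """Luhn checksum, assumed here for CREDIT_CARD_NUMBER."""
    digits = re.sub(r"[ -]", "", value)  # tolerate spaces and hyphens
    if not digits.isdigit() or len(digits) < 2:
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def aba_routing_ok(value: str) -> bool:
    """US_BANK_ROUTING_MICR: nine digits, restricted first digit, checksum."""
    if not re.fullmatch(r"[012367]\d{8}", value):
        return False
    d = [int(ch) for ch in value]
    total = (3 * (d[0] + d[3] + d[6])
             + 7 * (d[1] + d[4] + d[7])
             + (d[2] + d[5] + d[8]))
    return total % 10 == 0


def finland_id_ok(value: str) -> bool:
    """FINLAND_NATIONAL_ID_NUMBER: DDMMYYCZZZQ with check character Q."""
    m = re.fullmatch(r"(\d{6})([+\-A])(\d{3})([0-9A-Y])", value)
    if not m:
        return False
    birth, _century, individual, check = m.groups()
    # The birth date itself is not validated in this sketch.
    return FIN_CHECK[int(birth + individual) % 31] == check


def ssn_ok(value: str) -> bool:
    """US_SOCIAL_SECURITY_NUMBER: area, group, and serial constraints."""
    m = re.fullmatch(r"(\d{3})-?(\d{2})-?(\d{4})", value)
    if not m:
        return False
    area, group, serial = m.groups()
    if area == "000" or group == "00" or serial == "0000":
        return False
    return area != "666" and not area.startswith("9")


for name, check, sample in [
    ("credit card", luhn_ok, "4111 1111 1111 1111"),
    ("ABA routing", aba_routing_ok, "021000021"),
    ("Finnish ID", finland_id_ok, "131052-308T"),
    ("US SSN", ssn_ok, "666-12-3456"),
]:
    print(f"{name}: {sample} -> {check(sample)}")
```

SDD itself evaluates whole columns through a generated SQL query in the data platform rather than single values in Python, so these functions illustrate only the per-value rules that the pattern descriptions spell out.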

Built-in Discovered Tags Reference

Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. The built-in Discovered tags are listed below; see the built-in pattern reference for information about where the built-in rules apply these tags.

Country tags

All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.

Child tag name

Description

Entity tags

All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.

Identifier tags

None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . Identifier Direct.

Personal information tags

None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . PCI.
