Built-in Identifier Changelog

May 21, 2025

Identifiers in domains is now generally available (GA), and the identifier updates below are coupled with that release.

Improvements

The following identifiers have been improved to better match their intended data patterns. These updates have only been made to the built-in reference identifiers. If these are already in your domains, they will remain there as domain-specific identifiers with the previous pattern. If you want to add these improved identifiers to your domain, edit the name, because identifier names must be unique within each domain.

To see more about the specific changes made, see the annotations on the Built-in identifier reference page.

  • AUSTRALIA_MEDICARE_NUMBER

  • AUSTRALIA_PASSPORT

  • BRAZIL_CPF_NUMBER

  • CANADA_PASSPORT

  • CREDIT_CARD_NUMBER

  • DATE

  • DOMAIN_NAME

  • FDA_CODE

  • FRANCE_NIR

  • GENDER

  • ICD10_CODE

  • IMEI_HARDWARE_ID

  • MAC_ADDRESS

  • PERSON_NAME

  • POSTAL_CODE

  • SPAIN_NIF_NUMBER

  • TIME

  • UK_NATIONAL_INSURANCE_NUMBER

  • URL

  • US_HEALTHCARE_NPI

  • US_SOCIAL_SECURITY_NUMBER

  • US_STATE

Deprecations

The following identifiers are deprecated and no longer included in the reference identifiers. If these are already in your domains, they will remain there as domain-specific identifiers.

  • AGE

  • DENMARK_CPR_NUMBER

  • FINLAND_NATIONAL_ID_NUMBER

  • FRANCE_CNI

  • GERMANY_IDENTITY_CARD_NUMBER

  • SPAIN_NIE_NUMBER

  • SWEDEN_NATIONAL_ID_NUMBER

  • SWEDEN_PASSPORT

  • THAILAND_NATIONAL_ID_NUMBER

  • UK_TAXPAYER_REFERENCE

  • US_BANK_ROUTING_MICR

  • US_PASSPORT

  • US_TOLLFREE_PHONE_NUMBER

New

The following identifiers are newly created to identify common data patterns. Copy these new reference identifiers to any of your domains.

  • BELGIUM_NATIONAL_REGISTRATION_NUMBER: Detects numeric strings consistent with Belgium's National Registration Number. Requires 11 digits in the form YY.MM.DD-NNN-XX, where YY.MM.DD corresponds to the birth date, NNN is a number, and XX is a two-digit checksum.

  • COUNTRY: Detects strings consistent with the names of all countries in the world. This identifier is case-insensitive.

  • FINANCIAL_INSTITUTIONS: Matches strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, including alternative names.

  • GREAT_BRITAIN_DRIVERS_LICENSE: Previously named UK_DRIVERS_LICENSE_NUMBER. Renamed because it does not detect license numbers from Northern Ireland.

  • ICD_10_PCS: Detects strings consistent with procedure codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from 2020.

  • NAICS_CODE: Detects strings consistent with North American Industry Classification System (NAICS) codes. A two-digit number represents a basic sector, and each subsequent digit represents a more specific subsector, with a maximum of six digits.

  • SEC_STOCK_TICKER: Matches strings consistent with the stock tickers recognized by the U.S. Securities and Exchange Commission (SEC).

  • US_PERSON_FULL_NAME: Detects strings consistent with a person's {first name} space {last name}. Uses the same names from the PERSON_NAME identifier. This identifier must match at least 20% of the data sampled and is case-insensitive.

  • US_STREET_ADDRESS: Previously named STREET_ADDRESS.

First identifier pack released

62 built-in identifiers are released for use with identification.
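As an illustration of the kind of pattern BELGIUM_NATIONAL_REGISTRATION_NUMBER describes (a birth date, a sequence number, and a checksum), here is a minimal sketch of a plausibility check. The function name is hypothetical, and the mod-97 rule is an assumption taken from the publicly described layout of the Belgian number, not from this changelog; Immuta's actual criteria may differ. The sketch does not validate the birth date itself.

```python
import re


def is_plausible_belgian_nrn(value: str) -> bool:
    """Check the 11-digit layout and a mod-97 check number (assumed rule)."""
    # Drop separators such as ".", "-", or spaces before counting digits.
    digits = re.sub(r"[.\-\s]", "", value)
    if not re.fullmatch(r"\d{11}", digits):
        return False
    body, check = digits[:9], int(digits[9:])
    # Assumption: check = 97 - (body mod 97). For births from 2000 onward,
    # a "2" is prepended to the nine-digit body before taking the remainder.
    candidates = (97 - int(body) % 97, 97 - int("2" + body) % 97)
    return check in candidates


print(is_plausible_belgian_nrn("85.07.30-033-28"))
print(is_plausible_belgian_nrn("85.07.30-033-29"))
```

Because the two-digit year is ambiguous, the sketch accepts a check number that is valid under either century rule.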


    How Competitive Pattern Analysis Works

    Of identification's three criteria options, regex and dictionary are competitive. This means that when assessing your data, if multiple identifiers could match, only one with competitive criteria will be chosen to tag the data. To better understand how Immuta executes this competition, read further.

    Immuta employs a three-phased competitive criteria analysis approach for identification:

    1. Sampling: No data is moved, and Immuta checks the identifiers against a sample of data from the table.

    2. Qualifying: Identifiers whose criteria match less than 90% of the sample are filtered out.

    3. Scoring: The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.

    In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.

    Sampling

    In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Immuta instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.

    The sample size is decided based on the number of identifiers and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the domain, not the data size. The sample size dependence on the number of identifiers is weak and will not exceed 13,000 rows.

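    | Number of identifiers | Sample size |
    | --- | --- |
    | 5 | 7369 rows |
    | 50 | 9211 rows |
    | 500 | 11053 rows |
    | 5000 | 12895 rows |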
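The published sample sizes (5 identifiers: 7369 rows, 50: 9211, 500: 11053, 5000: 12895) grow logarithmically with the number of identifiers, which is why the dependence is described as weak. As an observation only, these four values coincide with a two-sided Hoeffding bound combined with a union bound over the criteria, using a margin of error of 0.025 and a failure probability of 0.001. The sketch below reproduces the table under that assumption; the function name and both parameters are hypothetical and fitted to the table, not a formula documented by Immuta.

```python
import math


def hoeffding_sample_size(num_criteria: int,
                          epsilon: float = 0.025,
                          delta: float = 0.001) -> int:
    """Rows needed so that every hit-rate estimate is within `epsilon`
    of the true rate with probability at least 1 - delta, using the
    two-sided Hoeffding bound and a union bound over the criteria."""
    return math.ceil(math.log(2 * num_criteria / delta) / (2 * epsilon ** 2))


for k in (5, 50, 500, 5000):
    print(k, hoeffding_sample_size(k))
# 5 7369
# 50 9211
# 500 11053
# 5000 12895
```

Each tenfold increase in the number of criteria adds roughly 1,842 rows under this fit, which is consistent with the statement that the sample will not exceed 13,000 rows.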

    Sampling considerations

    In practice, the number of sampled values for each column may be less than the requested number of rows because columns are not independently sampled but rather projected from a row-wise sample. This can reduce the sample when the target table has fewer rows than requested, when some of the column values are null, or because of technology-specific limitations.

    • Snowflake and Starburst (Trino): Immuta implements table sampling by row count.

    • Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Immuta implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.

    • All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
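The arithmetic behind the Databricks and Redshift note can be sketched as follows. The helper name and its parameters are hypothetical, and the result is the expected row count of a flat percentage sample rather than the exact count a given query returns.

```python
def best_effort_sample_rows(table_rows: int, percent: int = 10, cap: int = 10_000) -> int:
    """Expected rows from a flat `percent` row sample, capped at `cap` rows."""
    return min(table_rows * percent // 100, cap)


for rows in (20_000, 100_000, 5_000_000):
    print(rows, best_effort_sample_rows(rows))
# 20000 2000
# 100000 10000
# 5000000 10000
```

A table needs at least 100,000 rows before a 10% sample reaches the 10,000-row cap, so smaller tables yield proportionally smaller samples.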

    Qualifying

    During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if its hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is a 90% match for most built-in identifiers; however, a few built-in identifiers have lower threshold requirements. The 90% threshold is also standard for all custom identifiers, to ensure the criteria matches the data within the column and to avoid false positives. Note that threshold calculations are relative to the number of non-null entries for each column.

    If no identifiers qualify, then no identifier is assessed for scoring and the column is not tagged.
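A minimal sketch of the qualification rule, assuming the hit rate is computed over non-null values only and compared against a 90% threshold. The function names are hypothetical, and the values are matched locally only for illustration; in the process described above, Immuta receives just the column-wise hit rates from the remote database.

```python
from typing import Callable, Iterable, Optional


def hit_rate(values: Iterable[Optional[str]],
             matches: Callable[[str], bool]) -> Optional[float]:
    """Fraction of non-null values that satisfy the criteria."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # nothing to measure, so nothing can qualify
    hits = sum(1 for v in non_null if matches(v))
    return hits / len(non_null)


def qualifies(rate: Optional[float], threshold: float = 0.90) -> bool:
    """An identifier qualifies when its hit rate reaches the threshold."""
    return rate is not None and rate >= threshold


# Nine numeric values, one non-matching value, and five nulls.
column = ["10001"] * 9 + ["n/a"] + [None] * 5
rate = hit_rate(column, str.isdigit)
print(rate, qualifies(rate))
```

Here the nulls are ignored, so the rate is 9 out of 10 rather than 9 out of 15, and the column qualifies.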

    Scoring

    During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers since the ability to satisfy the more difficult-to-satisfy identifier itself serves as evidence that it is more likely. This phase ends by returning a single most likely identifier per the inference process.

    Example

    Here is a set of regex identifiers and a sample of data:

    Identifiers:

    1. [a-zA-Z0-9]{3} - This regex will match 3 character strings with the characters a-z, lowercase or uppercase, or digits 0-9.

    2. [a-c]{3} - This regex will match 3 character strings with the characters a-c, lowercase.

    3. (a|b|d){3} - This regex will match 3 character strings with the characters a, b, or d, lowercase.

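    | Sample data | Matches Identifier 1 | Matches Identifier 2 | Matches Identifier 3 |
    | --- | --- | --- | --- |
    | add | Yes | ❌ | Yes |
    | cab | Yes | Yes | ❌ |
    | bad | Yes | ❌ | Yes |
    | aba | Yes | Yes | Yes |
    | baa | Yes | Yes | Yes |
    | dad | Yes | ❌ | Yes |
    | baa | Yes | Yes | Yes |
    | dad | Yes | ❌ | Yes |
    | baa | Yes | Yes | Yes |
    | add | Yes | ❌ | Yes |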

    When qualifying the identifiers, Identifier 1 and Identifier 3 both match 90% or more of the data. Identifier 2 does not, and is disqualified.

    Then the qualified identifiers are scored. Identifier 1 matches 100% of the data but is unspecific: it accepts 238,328 (62^3) possible values. Identifier 3, on the other hand, matches exactly 90% but is very specific, accepting only 27 (3^3) possible values.

    Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
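The example can be replayed with a short script. It computes each identifier's hit rate on the ten sample values, applies the 90% threshold, and then picks the qualified identifier that accepts the fewest strings. That last step is a deliberate simplification standing in for the inference described in the scoring phase, and all names in the script are hypothetical.

```python
import re

SAMPLE = ["add", "cab", "bad", "aba", "baa", "dad", "baa", "dad", "baa", "add"]

# name -> (regex, number of distinct strings the regex accepts)
IDENTIFIERS = {
    "Identifier 1": (r"[a-zA-Z0-9]{3}", 62 ** 3),
    "Identifier 2": (r"[a-c]{3}", 3 ** 3),
    "Identifier 3": (r"(a|b|d){3}", 3 ** 3),
}


def hit_rate(pattern: str, values: list) -> float:
    """Fraction of values that the pattern matches in full."""
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)


rates = {name: hit_rate(pattern, SAMPLE) for name, (pattern, _) in IDENTIFIERS.items()}
qualified = [name for name, rate in rates.items() if rate >= 0.90]
# Simplified scoring: prefer the qualifier that is hardest to satisfy.
winner = min(qualified, key=lambda name: IDENTIFIERS[name][1])

print(rates)
print(qualified)
print(winner)
```

Identifier 2 falls well below the threshold and is dropped, and of the two remaining identifiers the more restrictive one, Identifier 3, is selected.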

    Important notes

    • Dictionaries are part of the competitive process, while column name regex criteria are not.

    • Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.

    • Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.

    • All platforms: Null values included in the sample do not count towards qualification or scoring. However, they reduce the number of values available to match against the patterns, because the sample size is not adjusted for the ignored null values.


    Built-in Discovered Tags Reference

    Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See a list of the built-in Discovered tags below and the Built-in identifier reference page for information about where these tags will be applied by the built-in identifiers.

    Country tags

    All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.

    | Child tag name | Description |
    | --- | --- |
    | Argentina | This tag is applied to data recognized as specific to Argentina (e.g., an Argentina National Identity Number). |
    | Australia | This tag is applied to data recognized as specific to Australia (e.g., an Australian Medicare number or Australian passport number). |
    | Belgium | This tag is applied to data recognized as specific to Belgium (e.g., a Belgium National ID card). |
    | Brazil | This tag is applied to data recognized as specific to Brazil (e.g., a Brazil CPF number). |
    | Canada | This tag is applied to data recognized as specific to Canada (e.g., a British Columbia PHN, OHIP string, Canadian passport number, or Quebec's HIN). |
    | Chile | This tag is for data specific to Chile. |
    | China | This tag is for data specific to China. |
    | Colombia | This tag is for data specific to Colombia. |
    | Denmark | This tag is applied to data recognized as specific to Denmark (e.g., a Denmark CPR or Person-number). |
    | Finland | This tag is applied to data recognized as specific to Finland (e.g., a Finland National ID number). |
    | France | This tag is applied to data recognized as specific to France (e.g., a French National ID card number, France National ID number, or French passport number). |
    | Germany | This tag is applied to data recognized as specific to Germany (e.g., a German driver's license number or a Germany Identity Card number). |
    | Hong Kong | This tag is for data specific to Hong Kong. |
    | India | This tag is for data specific to India. |
    | Indonesia | This tag is for data specific to Indonesia. |
    | Japan | This tag is for data specific to Japan. |
    | Korea | This tag is for data specific to Korea. |
    | Mexico | This tag is for data specific to Mexico. |
    | Netherlands | This tag is for data specific to the Netherlands. |
    | Norway | This tag is for data specific to Norway. |
    | Paraguay | This tag is for data specific to Paraguay. |
    | Peru | This tag is for data specific to Peru. |
    | Poland | This tag is for data specific to Poland. |
    | Singapore | This tag is for data specific to Singapore. |
    | Spain | This tag is applied to data recognized as specific to Spain (e.g., Spain Foreigner Identification number, Spain Tax Identification number, or Spanish passport number). |
    | Sweden | This tag is applied to data recognized as specific to Sweden (e.g., a Sweden National ID number or Swedish passport number). |
    | Taiwan | This tag is for data specific to Taiwan. |
    | Thailand | This tag is applied to data recognized as specific to Thailand (e.g., a Thailand National ID number). |
    | Turkey | This tag is for data specific to Turkey. |
    | UK | This tag is applied to data recognized as specific to the United Kingdom (e.g., a United Kingdom driver's license number, United Kingdom National Insurance number, or United Kingdom Taxpayer Reference number). |
    | Uruguay | This tag is for data specific to Uruguay. |
    | US | This tag is applied to data recognized as specific to the U.S. (e.g., an FDA code, United States ATIN, ABA routing number, DEA number, United States EIN, United States NPI number, United States ITIN, United States passport number, United States Preparer Taxpayer ID number, United States SSN, United States territory or state, or United States toll-free phone number). |
    | Venezuela | This tag is for data specific to Venezuela. |

    Entity tags

    All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.

    | Child tag name | Description |
    | --- | --- |
    | Aadhaar Individual | This tag is for Aadhaar Individual numbers. |
    | Adoption Taxpayer ID Number | This tag is applied to data recognized as a United States Adoption Taxpayer Identification number. |
    | Age | This tag is applied to data recognized as an age. |
    | Bank Account | This tag is for bank account numbers. |
    | Bank Routing MICR | This tag is applied to data recognized as an American Bankers Association routing number. |
    | Bankers CUSIP ID | This tag is for CUSIP identification numbers for stocks and bonds. |
    | British Columbia Health Network Number | This tag is applied to data recognized as British Columbia's Personal Health Number. |
    | BSN Number | This tag is for Netherlands citizen service numbers. |
    | CDC Number | This tag is for CDC numbers. |
    | CDI Number | This tag is for CDI numbers. |
    | CIC Number | This tag is for CIC numbers. |
    | CNI | This tag is applied to data recognized as a French National ID card number. |
    | CPF Number | This tag is applied to data recognized as Brazil's CPF number. |
    | CPR Number | This tag is applied to data recognized as Denmark's Personal Identification number. |
    | Credit Card Number | This tag is applied to data recognized as a credit card number. |
    | CRYPTO | This tag is applied to data recognized as a Bitcoin Invoice Address. |
    | CURP Number | This tag is for Mexican CURP numbers. |
    | Date | This tag is applied to data recognized as a date. |
    | Date of Birth | This tag is applied to data recognized as a date of birth. |
    | DEA Number | This tag is applied to data recognized as the DEA number of a healthcare provider. |
    | DNI Number | This tag is applied to data recognized as an Argentina National Identity number. |
    | Domain Name | This tag is applied to data recognized as a domain. |
    | Driver's License Number | This tag is applied to data recognized as driver's license numbers from Germany or the United Kingdom. |
    | Electronic Mail Address | This tag is applied to data recognized as an email address. |
    | Employer ID Number | This tag is applied to data recognized as an Employer Identification number from the United States. |
    | Ethnic Group | This tag is applied to data recognized as an ethnic group. |
    | FDA Code | This tag is applied to data recognized as the code of a drug or ingredient registered with the FDA. |
    | Financial Institution | This tag is applied to data recognized as the names of financial institutions based on lists provided by the FDIC and OCC, including alternative names. |
    | Gender | This tag is applied to data recognized as a gender. |
    | GST Individual | This tag is for Indian GST individual numbers. |
    | Healthcare NPI | This tag is applied to data recognized as a United States National Provider Identifier number. |
    | IBAN Code | This tag is applied to data recognized as an International Bank Account number. |
    | ICD10 Code | This tag is applied to data recognized as an ICD10 code from the International Statistical Classification of Diseases and Related Health Problems. |
    | ICD10 Procedure Code | This tag is applied to data recognized as an ICD10 procedure code from the International Statistical Classification of Diseases and Related Health Problems. |
    | ICD9 Code | This tag is for ICD9 codes from the International Statistical Classification of Diseases and Related Health Problems. |
    | ID Number | This tag is for any ID number. |
    | Identity Card Number | This tag is applied to data recognized as an identity card number from Germany. |
    | IMEI | This tag is applied to data recognized as an International Mobile Equipment Identity number. |
    | Individual Number | This tag is for any individual number. |
    | Individual Taxpayer ID Number | This tag is applied to data recognized as a United States Individual Taxpayer Identification Number. |
    | IP Address | This tag is applied to data recognized as an IP address. |
    | Location | This tag is applied to data recognized as a country, state, address, or municipality. |
    | MAC Address | This tag is applied to data recognized as a Media Access Control address. |
    | MAC Address Local | This tag is applied to data recognized as a local Media Access Control address. |
    | Medicare Number | This tag is applied to data recognized as a Medicare number from Australia. |
    | NAICS Code | This tag is applied to data recognized as a North American Industry Classification System (NAICS) code. |
    | National Health Service Number | This tag is for national health service numbers. |
    | National ID Card Number | This tag is applied to data recognized as a national ID card number from Belgium. |
    | National ID Number | This tag is applied to data recognized as a national ID number from Finland, Sweden, or Thailand. |
    | National Insurance Number | This tag is applied to data recognized as a United Kingdom national insurance number. |
    | National Registration ID Number | This tag is for national registration ID numbers. |
    | National Registration Number | This tag is applied to data recognized as a national registration number from Belgium. |
    | NI Number | This tag is for Norway NI numbers. |
    | NIE Number | This tag is applied to data recognized as a Spanish Foreigner Identification number. |
    | NIF Number | This tag is applied to data recognized as a Spanish Tax Identification number. |
    | NIK Number | This tag is applied to data recognized as an Indonesian personal identification number (NIK). |
    | NIR | This tag is applied to data recognized as France's National ID number. |
    | Ontario Health Insurance Number | This tag is applied to data recognized as part of an Ontario Health Insurance Plan string. |
    | PAN Individual | This tag is for PAN Individual numbers. |
    | Passport | This tag is applied to data recognized as a passport number from Australia, Canada, France, Spain, Sweden, or the United States. |
    | Person Name | This tag is applied to data recognized as people's names. |
    | PESEL Number | This tag is for Poland PESEL numbers. |
    | Postal Code | This tag is applied to data recognized as a United States zip code. |
    | Preparer Taxpayer ID Number | This tag is applied to data recognized as a Preparer Taxpayer ID number. |
    | Quebec Health Insurance Number | This tag is applied to data recognized as a Quebec Health Insurance Number. |
    | Resident ID Number | This tag is for China Resident ID numbers. |
    | RRN | This tag is for Korea Resident Registration numbers. |
    | SEC Stock Ticker | This tag is applied to data recognized as a stock ticker recognized by the U.S. Securities and Exchange Commission (SEC). |
    | Social Insurance Number | This tag is applied to data recognized as a social insurance number. |
    | Social Security Number | This tag is applied to data recognized as a United States Social Security Number. |
    | State | This tag is applied to data recognized as a state of the United States. |
    | Swift Code | This tag is applied to data recognized as a SWIFT code. |
    | Tax File Number | This tag is applied to data recognized as a tax file number. |
    | Taxpayer ID Number | This tag is applied to data recognized as Taxpayer ID numbers from the United States. |
    | Taxpayer Reference | This tag is applied to data recognized as United Kingdom Taxpayer Reference numbers. |
    | Telephone Number | This tag is applied to data recognized as a phone number. |
    | Tollfree Telephone Number | This tag is applied to data recognized as a United States toll-free phone number. |
    | URL | This tag is applied to data recognized as a URL. |
    | Vehicle Identifier or Serial Number | This tag is applied to data recognized as a VIN. |

    Built-in Identifier Reference

    Immuta comes with a pack of built-in identifiers that look for common data types. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own domains and edit the tags applied by them.

    Identifiers must match at least 90% of the sampled data to be tagged, with three exceptions noted below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.

    Identifier descriptions and default resulting tags

    Identifier
    Description
    Resulting tags from the default identifier
    • Discovered.Entity.CRYPTO

    BRAZIL_CPF_NUMBER Improved

    Detects numeric strings consistent with Brazil's CPF (Cadastro de Pessoas Físicas) number. Requires an eleven-digit numeric string with optional non-numeric separators (., -, or space) after the third, sixth, and ninth digits.

    • Discovered.Country.Brazil

    • Discovered.Entity.CPF Number

    CANADA_BC_PHN

    Detects numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with hyphens (-) or spaces after the fourth and seventh digits.

    • Discovered.Country.Canada

    • Discovered.Entity.British Columbia Health Network Number

    CANADA_OHIP

    Detects alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP). Requires a twelve-digit capitalized alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth digits.

    • Discovered.Country.Canada

    • Discovered.Entity.Ontario Health Insurance Number

    CANADA_PASSPORT Improved

    Detects strings consistent with the Canadian Passport Number format. Allows for two formats. One format requires two capital letters followed by six digits. The other format requires one letter followed by six digits and ending in two letters.

    • Discovered.Country.Canada

    • Discovered.Entity.Passport

    CANADA_QUEBEC_HIN

    Detects alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.

    • Discovered.Country.Canada

    • Discovered.Entity.Quebec Health Insurance Number

    COUNTRY New

    Detects strings consistent with the names of all countries in the world. This identifier is case-insensitive.

    • Discovered.Entity.Location

    CREDIT_CARD_NUMBER Improved

    Detects strings consistent with a credit card number with prefixes matching major credit card companies.

    • Discovered.Entity.Credit Card Number

    DATE Improved

    Detects strings consistent with dates in over 30 different formats or date type: date, date+time, or timestamp. This identifier is case-insensitive.

    • Discovered.Entity.Date

    DOMAIN_NAME Improved

    Detects strings that begin with a letter and are no more than 225 characters. A full domain can have one to four labels separated by a period (.). Each label can be one to 63 alphanumeric characters long, and each label after the first must be in the dictionary list of possible labels. This identifier is case-insensitive.

    • Discovered.Entity.Domain Name

    EMAIL_ADDRESS

    Detects strings consistent with an email address. Usernames are required to be fewer than 255 characters, followed by @, a domain of fewer than 255 characters, and a top-level domain of between 2 and 20 characters.

    • Discovered.Entity.Electronic Mail Address

    ETHNIC_GROUP

    Detects strings consistent with the US Census race designations. This identifier allows for dashes to be used in place of spaces and is case-insensitive.

    • Discovered.Entity.Ethnic Group

    FDA_CODE Improved

    Detects strings consistent with a drug or ingredient registered with the Food and Drug Administration (FDA). Must start with 4 or 5 digits, followed by a hyphen, then 3 or 4 digits, another hyphen, and finally 1 or 2 digits.

    • Discovered.Country.US

    • Discovered.Entity.FDA Code

    FINANCIAL_INSTITUTIONS New

    Detects strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, including alternative names.

    • Discovered.Entity.Financial Institutions

    FRANCE_NIR Improved

    Detects numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit.

    • Discovered.Country.France

    • Discovered.Entity.NIR

    FRANCE_PASSPORT

    Detects alphanumeric strings consistent with the French Passport number. Requires two digits followed by two uppercase letters and ending with five digits.

    • Discovered.Country.France

    • Discovered.Entity.Passport

    GENDER Improved

    Detects strings consistent with gender types and common abbreviations. This identifier is case-insensitive.

    • Discovered.Entity.Gender

    GERMANY_DRIVERS_LICENSE_NUMBER

    Detects alphanumeric strings consistent with Germany's driver's license number. Requires an eleven-element string of the format CDDCCCCCCDC where C is an uppercase Latin letter and D is a numeric digit.

    • Discovered.Country.Germany

    • Discovered.Entity.Drivers License Number

    GREAT_BRITAIN_DRIVERS_LICENSE New

    Detects alphanumeric strings consistent with the United Kingdom's driver's license number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for decade of birth, two digits for month of birth (incremented by 50 for female drivers), two digits for day of birth, one digit for year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.

    • Discovered.Country.UK

    • Discovered.Entity.Drivers License Number

    IBAN_CODE

    Detects strings consistent with an International Bank Account Number (IBAN). Requires a string in the form ZZ-DD-BBAN, where ZZ is a country code, DD is two numeric digits, and BBAN is a Basic Bank Account Number comprising two to seven groups of three to five uppercase alphanumeric characters, optionally separated by space or dash, and optionally followed by a final group of length one to three.

    • Discovered.Entity.IBAN Code

    ICD10_CODE Improved

    Detects strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2025. This identifier is case-insensitive.

    • Discovered.Entity.ICD10 Code

    ICD_10_PCS New

    Detects strings consistent with procedure codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from 2020.

    • Discovered.Entity.ICD10 Procedure Code

    IMEI_HARDWARE_ID Improved

    Detects strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 or 16 digits with optional hyphens or spaces after the 2nd, 8th, and 14th digits.

    • Discovered.Entity.IMEI

    IP_ADDRESS

    Detects IP Addresses in the V4 and V6 formats. This identifier is case-insensitive.

    • Discovered.Entity.IP Address

    LOCATION

    Detects ISO3166 formatted locations. This identifier must match at least 80% of the data sampled.

    • Discovered.Entity.Location

    MAC_ADDRESS Improved

    Detects strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon or hyphen.

    • Discovered.Entity.MAC Address

    NAICS_CODE New

    Detects strings consistent with North American Industry Classification System (NAICS) codes. A two-digit number represents a basic sector, and each subsequent digit represents a more specific subsector, with a maximum of six digits.

    • Discovered.Entity.NAICS Code

    PERSON_NAME Improved

    Detects strings consistent with a dictionary of people's names. The name dictionary is US-centric with person names drawn from the US Social Security database, covering 80% of the U.S. population. This identifier must match at least 45% of the data sampled. This identifier is case-insensitive.

    • Discovered.Entity.Person Name

    PHONE_NUMBER Improved

    Detects strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone numbers naming convention. Optional area codes allowed.

    • Discovered.Entity.Telephone Number

    POSTAL_CODE Improved

    Detects strings consistent with a valid US Zip code with an optional +4 separated by a dash. Only valid five-digit zip codes are detected. This identifier is case-insensitive.

    • Discovered.Entity.Postal Code

    SEC_STOCK_TICKER New

    Detects strings consistent with the stock tickers recognized by the U.S. Securities and Exchange Commission (SEC).

    • Discovered.Entity.Stock Ticker Symbol

    SPAIN_NIF_NUMBER Improved

    Detects strings consistent with Spain's Tax Identification number. Requires a string with nine alphanumeric characters in one of two forms: eight digits followed by an optional hyphen or space and a single uppercase letter; or an initial X, Y, or Z, followed by an optional dash or space, seven numeric digits, another optional dash or space, and a single uppercase letter.

    • Discovered.Country.Spain

    • Discovered.Entity.NIF Number
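The two accepted NIF forms can be expressed as a single alternation. This is a sketch under the description above; the names are hypothetical and no control-letter validation is attempted, since the description does not mention one.

```python
import re

# Hypothetical pattern with two alternatives:
#   1. eight digits, optional hyphen/space, one uppercase letter
#   2. X, Y, or Z, optional hyphen/space, seven digits,
#      optional hyphen/space, one uppercase letter
NIF_PATTERN = re.compile(r"(?:\d{8}[- ]?[A-Z]|[XYZ][- ]?\d{7}[- ]?[A-Z])")


def looks_like_nif(value: str) -> bool:
    """Return True when the whole string matches either NIF form."""
    return NIF_PATTERN.fullmatch(value) is not None
```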

    SPAIN_PASSPORT

    Detects strings consistent with Spain's Passport Number. Requires an eight- or nine-character string starting with either two or three uppercase letters followed by six numeric digits.

    • Discovered.Country.Spain

    • Discovered.Entity.Passport

    SWIFT_CODE

    Detects alphanumeric strings consistent with a SWIFT code (or Bank Identifier Code (BIC)) format. Requires values consistent with AAAAAACCDDD, where A is an uppercase letter, C is an uppercase letter or numeric digit, and DDD is an optional three-character sequence of uppercase letters or numeric digits.

    • Discovered.Entity.Swift Code
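The AAAAAACCDDD layout above maps directly onto a regular expression. The following is a minimal sketch with hypothetical names; it checks the shape only and does not verify that the code belongs to a real institution.

```python
import re

# Hypothetical pattern: six uppercase letters (A), two uppercase
# letters or digits (C), and an optional three-character suffix (DDD).
SWIFT_PATTERN = re.compile(r"[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")


def looks_like_swift(value: str) -> bool:
    """Return True for 8- or 11-character strings with the SWIFT shape."""
    return SWIFT_PATTERN.fullmatch(value) is not None
```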

    TIME Improved

    Detects strings consistent with times in various formats, or columns with the data type time. Values that include a date along with the time will not match; use the DATE identifier instead.

    • Discovered.Entity.Date

    UK_NATIONAL_INSURANCE_NUMBER Improved

    Detects alphanumeric strings consistent with the United Kingdom's National Insurance Number. Requires a nine-character string. The first two characters must be uppercase letters, followed by an optional space, then six digits with optional spaces or hyphens (-) every two digits, ending with A, B, C, or D.

    • Discovered.Country.UK

    • Discovered.Entity.National Insurance Number
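A rough regular-expression reading of the National Insurance Number rule is sketched below. The names are hypothetical, and the sketch assumes that an optional space or hyphen is also allowed between the last pair of digits and the final letter.

```python
import re

# Hypothetical pattern: two uppercase letters, optional space, three
# pairs of digits with optional space/hyphen separators, then A-D.
NINO_PATTERN = re.compile(r"[A-Z]{2} ?\d{2}[ -]?\d{2}[ -]?\d{2}[ -]?[A-D]")


def looks_like_nino(value: str) -> bool:
    """Return True when the whole string has the described shape."""
    return NINO_PATTERN.fullmatch(value) is not None
```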

    URL Improved

    Detects strings consistent with a URL. The string must begin with a common scheme, followed by a string, and end with a top-level domain of no more than 128 alphanumeric characters.

    • Discovered.Entity.URL

    US_DEA_NUMBER

    Detects alphanumeric strings consistent with a Drug Enforcement Administration (DEA) number assigned to a health care provider. The string must be nine characters long. The first two characters must be uppercase alphanumeric characters, and the last seven characters must be numeric digits. The first character may not be I, N, O, Q, V, W, Y, or Z.

    • Discovered.Country.US

    • Discovered.Entity.DEA Number
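The DEA number rule combines a fixed shape with an exclusion list for the first character. A negative lookahead keeps that exclusion readable. This is a sketch with hypothetical names; it follows the description above only and does not apply any check-digit calculation.

```python
import re

# Hypothetical pattern: the lookahead rejects the excluded first
# characters, then two uppercase alphanumerics and seven digits follow.
DEA_PATTERN = re.compile(r"(?![INOQVWYZ])[A-Z0-9]{2}\d{7}")


def looks_like_dea(value: str) -> bool:
    """Return True for nine-character strings with the DEA shape."""
    return DEA_PATTERN.fullmatch(value) is not None
```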

    US_EMPLOYER_IDENTIFICATION_NUMBER

    Detects numeric strings consistent with a United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.

    • Discovered.Country.US

    • Discovered.Entity.Employer ID Number

    US_HEALTHCARE_NPI Improved

    Detects 10-digit numeric strings consistent with US National Provider Identifier (NPI). It must either start with 80840 followed by a 1 or 2, or it must begin with a 1 or 2.

    • Discovered.Country.US

    • Discovered.Entity.Healthcare NPI

    US_PERSON_FULL_NAME New

    Detects strings consistent with a person's {first name} space {last name}. Uses the same names from the PERSON_NAME identifier. This identifier must match at least 20% of the data sampled and is case-insensitive.

    • Discovered.Entity.Person Name

    US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER

    Detects strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P that is followed by eight digits.

    • Discovered.Country.US

    • Discovered.Entity.Preparer Taxpayer ID Number

    US_SOCIAL_SECURITY_NUMBER Improved

    Detects strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.

    • Discovered.Country.US

    • Discovered.Entity.Social Security Number
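The Social Security Number rule has both a shape and value constraints on each part, so a regular expression alone is awkward. The sketch below splits the value into area, group, and serial numbers and applies the exclusions described above. The names are hypothetical, and accepting optional hyphens between the parts is an assumption.

```python
import re

# Hypothetical pattern: area (3 digits), group (2), serial (4),
# with optional hyphens between the parts (an assumption).
SSN_PATTERN = re.compile(r"(\d{3})-?(\d{2})-?(\d{4})")


def looks_like_ssn(value: str) -> bool:
    """Apply the shape check and the area/group/serial exclusions."""
    match = SSN_PATTERN.fullmatch(value)
    if match is None:
        return False
    area, group, serial = match.groups()
    # No part may consist entirely of zeroes.
    if area == "000" or group == "00" or serial == "0000":
        return False
    # Area numbers 666 and 900-999 are excluded.
    if area == "666" or area.startswith("9"):
        return False
    return True
```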

    US_STATE Improved

    Detects strings consistent with either a full name or two-letter abbreviation of a US state or territory.

    • Discovered.Country.US

    • Discovered.Entity.State

    US_STREET_ADDRESS New

    Detects strings consistent with U.S. street addresses. Requires the street naming convention of {address_number} {street_name} {unit number (optional)} with an optional road suffix after the street name. The maximum length for street name is 20 alphanumeric characters. This identifier must match at least 80% of the data sampled and is case-insensitive.

    • Discovered.Entity.Location

    VEHICLE_IDENTIFICATION_NUMBER

    Detects strings consistent with Vehicle Identification Numbers. A valid World Manufacturer Identifier is required.

    • Discovered.Country.US

    • Discovered.Entity.Vehicle Identifier or Serial Number

    ARGENTINA_DNI_NUMBER

    Detects strings consistent with Argentina's National Identity (DNI) Number. Requires an eight-digit number with periods after the second and fifth digits.

    • Discovered.Country.Argentina

    • Discovered.Entity.DNI Number
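As a minimal sketch of the DNI shape described above, with hypothetical names, the pattern requires a period after the second and fifth digits:

```python
import re

# Hypothetical pattern: eight digits grouped as 2.3.3 with literal periods.
DNI_PATTERN = re.compile(r"\d{2}\.\d{3}\.\d{3}")


def looks_like_dni(value: str) -> bool:
    """Return True when the whole string has the NN.NNN.NNN shape."""
    return DNI_PATTERN.fullmatch(value) is not None
```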

    AUSTRALIA_MEDICARE_NUMBER Improved

    Detects numeric strings consistent with an Australian Medicare Number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Spaces are required between the fourth and fifth digits and between the ninth and tenth digits. An optional eleventh digit may follow, separated by a / or a space.

    • Discovered.Country.Australia

    • Discovered.Entity.Medicare Number
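The Medicare Number layout above can be sketched as a regular expression with two required spaces and an optional trailing digit. The names are hypothetical, and the sketch checks the shape only.

```python
import re

# Hypothetical pattern: first digit 2-6, groups of 4, 5, and 1 digits
# separated by spaces, then an optional eleventh digit after "/" or " ".
MEDICARE_PATTERN = re.compile(r"[2-6]\d{3} \d{5} \d(?:[ /]\d)?")


def looks_like_medicare(value: str) -> bool:
    """Return True when the whole string has the described layout."""
    return MEDICARE_PATTERN.fullmatch(value) is not None
```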

    AUSTRALIA_PASSPORT Improved

    Detects strings consistent with the Australian Passport number. A string of 8 or 9 characters is required, with a starting uppercase character (A, B, C, D, E, F, G, H, J, L, M, N, R, X, or U) or a two-character alphabetic prefix (P followed by A, B, C, D, E, F, U, W, X, or Z) followed by seven numeric digits.

    • Discovered.Country.Australia

    • Discovered.Entity.Passport
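The Australian passport prefixes listed above translate into two character classes. This is a sketch with hypothetical names that follows the listed letters exactly:

```python
import re

# Hypothetical pattern: either one letter from the single-letter list,
# or "P" plus one letter from the second list, then seven digits.
AU_PASSPORT_PATTERN = re.compile(r"(?:[A-HJLMNRXU]|P[A-FUWXZ])\d{7}")


def looks_like_au_passport(value: str) -> bool:
    """Return True for 8- or 9-character strings with an allowed prefix."""
    return AU_PASSPORT_PATTERN.fullmatch(value) is not None
```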

    BELGIUM_NATIONAL_ID_CARD_NUMBER

    Detects numeric strings consistent with Belgium's National ID card. Requires a twelve-digit number with a required hyphen (-) between the third and fourth digits. Allows for an optional hyphen between the tenth and eleventh digits.

    • Discovered.Country.Belgium

    • Discovered.Entity.National ID Card Number
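A minimal sketch of the Belgian ID card shape, with hypothetical names: the first hyphen is required and the second is optional.

```python
import re

# Hypothetical pattern: 3 digits, required hyphen, 7 digits,
# optional hyphen, 2 digits (twelve digits in total).
BE_ID_CARD_PATTERN = re.compile(r"\d{3}-\d{7}-?\d{2}")


def looks_like_be_id_card(value: str) -> bool:
    """Return True when the whole string has the described shape."""
    return BE_ID_CARD_PATTERN.fullmatch(value) is not None
```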

    BELGIUM_NATIONAL_REGISTRATION_NUMBER New

    Detects numeric strings consistent with Belgium's National Registration Number. Requires 11 characters in the form YY.MM.DD-NNN-XX, where YY.MM.DD corresponds to birth date, NNN is a number, and XX is a checksum digit.

    • Discovered.Country.Belgium

    • Discovered.Entity.National Registration Number
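The sketch below checks the YY.MM.DD-NNN-XX shape and then verifies the checksum using the commonly documented modulo-97 rule for Belgian national numbers: XX equals 97 minus the remainder of the first nine digits divided by 97, with a leading 2 added to those digits for births from 2000 onward. Whether the built-in identifier applies this calculation is an assumption, and the names are hypothetical.

```python
import re

# Hypothetical pattern for the YY.MM.DD-NNN-XX layout.
NRN_PATTERN = re.compile(r"(\d{2})\.(\d{2})\.(\d{2})-(\d{3})-(\d{2})")


def looks_like_belgian_nrn(value: str) -> bool:
    """Check the layout, then the modulo-97 checksum."""
    match = NRN_PATTERN.fullmatch(value)
    if match is None:
        return False
    digits = "".join(match.groups())
    base, check = digits[:9], int(digits[9:])
    # The century is not encoded, so try both the pre-2000 form and
    # the 2000-onward form (nine digits prefixed with "2").
    candidates = (int(base), int("2" + base))
    return any(97 - (number % 97) == check for number in candidates)
```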

    BITCOIN_INVOICE_ADDRESS

    Detects strings consistent with the following Bitcoin Invoice Address formats: P2PKH, P2SH, and Bech32.