1 of 35

Manage Data Metadata

Immuta uses tags primarily to enforce policies, but tags can also be used for generating Immuta reports and search results in the Immuta UI.

Connect external catalogs

Integrate your existing data catalog with Immuta.

Data discovery

Automatically discover and tag data based on its content (like column names).

Data classification

Automatically tag data based on its sensitivity and the associated risk level.

Connect External Catalogs

Connect an external catalog to use tagging capabilities outside of Immuta and pull tags from external table schemas. Once the catalog has been connected, Immuta ingests a data dictionary from the catalog and applies data source and column tags directly to the data source. These tags can then be used to create policies.

Getting started

This getting started guide outlines how to use external catalogs in Immuta.

How-to guide

Configure an external catalog: Configure Alation, Collibra, or a custom REST catalog to ingest tags into Immuta.

Reference guides

External catalog integrations: This reference guide describes the requirements of the external catalogs Immuta supports.
Custom REST catalog introduction: This reference guide describes the custom catalog option for users to make API calls to retrieve metadata on their data.
Custom REST catalog interface endpoints: This reference guide describes the endpoints for configuring a custom REST catalog.

Getting Started with External Catalogs

The how-to guides linked on this page illustrate how to link an external catalog with Immuta to ingest tags.

Best practice: Use a single catalog; having more than one can lead to multiple truths and data leaks.

Link an Alation external catalog

Requirement: A catalog with tags that correspond to your Immuta data sources

Configuration process

Connect your Alation catalog to Immuta.
When changes are made to the external catalog, refresh external tags.

Link a Collibra external catalog

Requirements:

A physical data dictionary with assets that correspond to your Immuta data sources
The Collibra global role Catalog or Catalog Author

Configuration process

Connect your Collibra catalog to Immuta.
When changes are made to the external catalog, refresh external tags.

Link a Microsoft Purview external catalog

Requirements:

A catalog with assets that correspond to your Immuta data sources
The ability to create a registered app in the Azure portal

Configuration process

Connect your Microsoft Purview catalog to Immuta.
When changes are made to the external catalog, refresh external tags.

Link a custom REST catalog

Requirements:

A catalog with tags that correspond to your Immuta data sources
Authenticate with the Immuta API

Configuration process

Connect your custom REST Catalog to Immuta.
When changes are made to the external catalog, refresh external tags.

Link Databricks Unity Catalog for tag ingestion

Requirements:

Fewer than 2,500 Databricks Unity Catalog data sources registered in Immuta
Databricks privileges listed on the Configure a Databricks Unity Catalog integration page

Configuration process

Once you register data sources, table and column tags from Databricks Unity Catalog will be ingested and applied to the corresponding data sources in Immuta.

Link Snowflake for tag ingestion

Requirements:

A Snowflake user who can set the following permissions:
1. GRANT IMPORTED PRIVILEGES ON DATABASE snowflake
2. GRANT APPLY TAG ON ACCOUNT
Snowflake Enterprise Edition or higher

Configuration process

Configure Snowflake tag ingestion in Immuta.
When changes are made to the tags in Snowflake, refresh external tags

Configure an External Catalog

This page outlines how to connect an external catalog on the Immuta app settings page. For details on external catalogs in Immuta, see the External catalog reference guide.

Link an Alation catalog

Requirements:

APPLICATION_ADMIN Immuta permission
An Alation API access token connected to a user with the Server Admin permission

To change the default expiration period for your Alation catalog's API tokens, see configure the expiration period for Alation API tokens.

Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter a Display Name and select Alation from the dropdown menu.
Complete the URL and API key fields. The API key must be an API access token for your Alation instance connected to a user with the Server Admin permission.
Configure whether or not Alation tags and custom fields are imported as Immuta tags:
- Link Alation tags: When selected, Immuta imports Alation tags as Immuta tags.
- Link Alation Custom Fields: When selected, Immuta imports Alation custom fields as Immuta tags. Follow the Alation documentation to create an Alation custom field, add permissions to your custom field, and apply custom fields to tables and columns.
Opt to select Upload Certificates.
1. Upload the Certificate Authority, Certificate File, and Key File.
2. Opt to enable Strict SSL by selecting the checkbox.
Click the Test Connection button.
Once the connection is successful, click Save.

Link a Collibra catalog

Requirement: APPLICATION_ADMIN Immuta permission

Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter the Display Name and select Collibra from the dropdown menu.
Enter the HTTP endpoint of the catalog in the URL field.
Complete the Username and Password fields. Note: This is the username and the password that Immuta can use to connect to the external catalog.
Complete the Asset Mappings modal to set which Collibra asset types align to the Immuta data source and column. Immuta will only link data sources from the asset types you specify.
Complete the Attributes as Tags modal to specify which Collibra attributes you want in Immuta. These attributes will come in as parent tags with their values as children tags.
Opt to select Upload Certificates.
1. Upload the Certificate Authority, Certificate File, and Key File.
2. Opt to enable Strict SSL by selecting the checkbox.
Click the Test Connection button.
Once the connection is successful, click Save.

Link a Microsoft Purview external catalog

Private preview

The Microsoft Purview catalog integration is only available to select accounts. Contact your Immuta representative to enable this feature.

Requirement: APPLICATION_ADMIN Immuta permission

Prerequisite

Supported account type: "Accounts in this organizational directory only"
Microsoft-Graph: User.Read API permission
A client secret

Using that registered app, navigate to Immuta and complete the following:

Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter the Display Name and select Microsoft Purview from the dropdown menu.
Complete the following fields:
1. Enter the Microsoft Purview endpoint URL including the Azure Account Name, like https://<ACCOUNTNAME>.purview.azure.com, in the Purview Endpoint URL field.
2. Complete the Microsoft Entra Directory (tenant) ID and Microsoft Entra (client) ID fields.
3. Enter the Microsoft Entra Application Client Secret ID for Immuta to authenticate and connect to the Purview API. The secret cannot be expired.
Click the Test Connection button.
Once the test is successful, click Save.

Link a custom REST catalog

Requirement: APPLICATION_ADMIN Immuta permission

Integrating a custom REST catalog service with Immuta requires implementing a REST interface. For details about the necessary endpoints that must be serviced, see the Custom REST catalog interface endpoints page.

Navigate to the App Settings page.
Scroll to 2 External Catalogs, and click Add Catalog.
Enter the Display Name and select Rest from the dropdown menu.
Select the Internal Plugin checkbox if the catalog has been uploaded to Immuta as a custom server plugin.
Complete the following fields:
1. Enter the HTTP endpoint of the catalog in the URL field.
2. Complete the Username and Password fields.
3. Enter the path of the Tags Endpoint.
4. Enter the path of the Data Source Endpoint.
5. Enter the path to the information page for a data source in the Data Source Link Template field.
Opt to enter the path to the information page for a column in the Column Link Template field.
Opt to upload a Catalog Image.
Opt to select Upload Certificates.
1. Upload the Certificate Authority, Certificate File, and Key File.
2. Opt to enable Strict SSL by selecting the checkbox.
Click the Test Connection button.
Click the Test Data Source Link.
Once both tests are successful, click Save.

Enable Snowflake tag ingestion

See the Configure a Snowflake integration page for guidance on configuring tag ingestion.

If Snowflake data sources existed before configuring tag ingestion, Immuta will automatically sync those data sources to the catalog and apply tags to them. Immuta will automatically check the external catalog for changes and sync data sources to the catalog every 24 hours.

Enable Databricks Unity Catalog tag ingestion

See the Configure a Databricks Unity Catalog integration page for guidance on configuring tag ingestion.

If Databricks Unity Catalog data sources existed before configuring tag ingestion, Immuta will automatically sync those data sources to the catalog and apply tags to them. Immuta will automatically check the external catalog for changes and sync data sources to the catalog every 24 hours.

Manually link catalogs to data sources

You can manually link and remove external catalogs from data sources on the data source details tab.

Navigate to your data source.
In the connection information section, click the Link Catalog icon (or Unlink Catalog to remove an external catalog from a data source).
Select your external catalog from the dropdown menu.
Click Link to confirm.

Manually sync external catalog tags

Navigate to your data source and click the data source Health dropdown menu.
Click Re-run in the External Catalog section.

Reference Guides

External Catalog Introduction

Users who want to use tags from outside of Immuta can connect an external catalog to automatically pull and apply them to Immuta data sources. These tags can then be used to drive or .

Supported external catalogs

Immuta supports the following external catalogs:

To configure an external catalog, see the .

Architecture

Once an external catalog has been configured on the Immuta app settings page, there are two recurring process steps:

Linking to data sources and columns: Whenever a new data source is created or an external catalog is set up, Immuta will attempt to automatically link data sources to their corresponding assets in the external catalog. This is done by comparing the fully qualified name of a data source in Immuta with its corresponding asset name in the external catalog, so data sources must have the same name in Immuta and the external catalog. Alternatively, a user can also a data source to an asset in an external catalog. Once a data source has been linked to an external catalog, it can be seen on the data source's detail page.
Pull and apply tags in Immuta: Using the link established in the first step, Immuta polls the external catalog to ingest and apply tags to each data source and its columns. Immuta checks every 24 hours for any relevant metadata changes in the connected external catalog. Tags originating from an external catalog can be found on the tags list page and on the data dictionary page for each data source.

See below for more information about the way Immuta integrates with each supported external catalog provider.

Alation

Immuta's Alation integration supports importing both tags and custom fields, Alation's two primary ways of allowing data stewards to apply metadata to data assets.

Tags: Tags are a single word or phrase that can be attached to most Alation objects by nearly anyone. For instance, users can add a PCI tag for financial data.
Custom fields: Custom fields are key-value pairs that can only be attached and removed by authorized users. Unlike tags, custom fields can have multiple values associated with a single key. For example, the custom field DK_STEWARD could have MARKETING, FINANCE, and CUSTOMER values associated with it. Using Alation custom fields allows you to explicitly control who can modify information associated with that field inside of Alation, whereas Alation standard tags are modifiable by any user inside of Alation.

When pulled into Immuta, Alation tags and custom fields will be applied to data sources as either column or data source tags in Immuta. Importing both Alation tags and custom fields into Immuta provides full flexibility for customers leveraging the Alation enterprise data catalog, no matter what operating model they choose to document their metadata in Alation.

Collibra

Immuta's Collibra integration supports importing both tags and attributes. Additionally, data source and column descriptions from the connected Collibra catalog will be pulled into Immuta.

Tags: Tags are a single word or phrase that can be attached to objects in Collibra. For instance, users can add a PHI tag on health-related data assets.
Attributes: Attributes in Collibra are a characteristic that describes an asset with an individual field. Unlike tags, attributes can have multiple values associated with a single key. For example, the attribute classification could have non sensitive, sensitive, and highly sensitive values associated with it. Using Collibra attributes allows you to explicitly control who can modify information associated with that field inside of Collibra, whereas Collibra standard tags are modifiable by any user inside of Collibra.

When pulled into Immuta, Collibra tags and attributes will be applied to data sources as either column or data source tags in Immuta. Importing both Collibra tags and attributes into Immuta provides full flexibility for customers leveraging the Collibra data catalog, no matter what operating model they choose to document their metadata in Collibra.

How Immuta gets metadata from Collibra

Limitations

Collibra assets must have unique full names in order for Immuta to guarantee exact matching. If there are multiple Collibra assets with the same name, Immuta will link to the first asset it matches to.
Columns must have a direct relation to their parent asset in Collibra. Indirect/inherited relations are not supported and will result in column tags and attributes not being ingested into Immuta.

Microsoft Purview catalog

Private preview

The Microsoft Purview catalog integration is only available to select accounts. Contact your Immuta representative to enable this feature.

How Immuta gets metadata from Microsoft Purview

Linking to data sources and columns in Microsoft Purview: Immuta links data sources to assets in Microsoft Purview by looking up the fully qualified name of an entity. The composition of the fully qualified name in Microsoft Purview differs depending on the technology type backing the data source.
Pull and apply tags in Immuta from Microsoft Purview: Immuta polls Microsoft Purview every 24 hours for all tags.

Limitations

Standard tags from Purview do not get ingested into Immuta
The current implementation only supports Databricks Unity Catalog, Snowflake and Azure Synapse Analytics data sources and their associated columns
Managed attributes are supported, but have the following limitations:
- If a managed attribute is applied to an Immuta data source but later expires, it will still appear as a tag on the data source. Expired attributes must be removed from the object in Purview for the tag to be removed from the Immuta data source.
- The following managed attribute data types are not supported and will not be applied to Immuta data sources as tags:
  - Dates
  - Number types
  - Rich text

Custom REST catalog

If users have an unsupported catalog, or have customized their catalog integration, they can connect through the REST Catalog using the Immuta API.

Databricks Unity Catalog tag ingestion

Design partner preview: This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.

Snowflake tag ingestion

External catalog behaviors

Tags ingested from external catalogs cannot be edited within Immuta. To edit, delete, or add a tag from an external catalog to a data source or column, make the change in the external catalog.
Immuta searches all external catalog providers once per day and links data sources without an external catalog attached to them to the first catalog that matches.
S3 data sources cannot currently be linked to external catalogs.

Resources

Custom REST Catalog Interface Introduction

The custom REST catalog integration allows Immuta to make a defined set of API calls to a Custom REST service you develop to retrieve metadata. The Custom REST service receives Immuta's calls, and then collects the relevant information and delivers it back to Immuta.

The diagram below highlights the main feature of Immuta's Custom REST Catalog integration.

Through a Custom REST Catalog, you can build and maintain your own solutions that provide metadata required to effectively use Immuta within your organization.

Section Contents

API Interface Specification Documentation: This page details the endpoints and data schemas of the API and contains example requests and responses.

Custom REST Catalog Interface Endpoints

Architectural Overview

The diagram below contrasts Immuta's provided catalog integration architecture with this Customer REST Catalog interface - which gives the customer tremendous control over the metadata being provided to Immuta.

The custom-developed service must be built to receive and handle calls to the REST endpoints specified below. Immuta will call these endpoints as detailed below when certain events occur and at various intervals. The required responses to complete the connection are also detailed.

General Concepts

Tags in Immuta

Tags are attributes applied to data - either at the top, data source, level or at the individual column level.

Tags in Immuta take the form of a nested tree structure. There are "parents", "children", "grand-children", etc.:

| Parent (root)
|\_ Child1
|   \_ Grandchild1 (leaf)
 \_ Child2
    \_ Grandchild1 (leaf)

The REST Catalog interface interprets a tag's relationship mapping from a string based on a standard "dot" (.) notation, like:

"Parent.Child1.Grandchild1"

Tags returned must meet the following constraints:

They must be no longer than 500 characters. Longer tags will not throw an error but will be truncated silently at 500 characters.
They must be composed of letters, digits, underscores, dashes, and whitespace characters. A period (.) is used as a separator as described above. Other special characters are not supported.

A tag object has a single id property, which is used to uniquely identify the tag within the catalog. This id may be of either a string or integer type, and its value is completely up to the designer of the REST Catalog service. Common examples include: a standard integer value, a UUID, or perhaps a hash of the tag's string value (if it is unique within the system).

For this Customer REST Catalog interface, tags are represented in JSON like:

"<string>[.<string>[.<string>...]]": {
    "id": "<unique identifier, string or int>"
},

For example, the object below specifies 3 different tags:

"REST_Catalog_Root": {
    "id": "id_is_set_by_catalog_and_can_be_int_or_string"
},
"REST_Catalog_Root.Child1": {
    "id": "d3e859da-40e9-43d2-a302-294458e79a64"
},
"REST_Catalog_Root.Child2.Grandchild1": {
    "id": 10
}

Descriptions in Immuta

Descriptions are strings that, like tags, can be applied to either a data source or an individual column. These strings support UTF-8, including special and various language characters.

Authentication

Immuta can make requests to your REST Catalog service using any of the following authentication methods:

Username and password: Immuta can send requests with a username and a password in the Authorization HTTP header. In this case, the custom REST service will need to be able to parse a Basic Authorization Header and validate the credentials sent with it.
PKI Certificate: Immuta can also send requests using a CA certificate, a certificate, and a key.
NO Authentication: Immuta can make unauthenticated requests to your REST Catalog service. However, this should only be used if you have other security measures in place (e.g., if the service is in an isolated network that's reachable only by your Immuta environment).

Authentication and specific endpoints

When accessing the /dataSource and /tags endpoints, Immuta will use the configured username and password. If you choose to also protect the human-readable pages with authentication, users will be prompted to authenticate when they first visit those pages.

Endpoint Specification

GET `/tags`

Overview

The /tags endpoint is used to collect ALL the tags the catalog can provide. It is used by Immuta to populate Immuta's tags list in the Governance section. These tags can then be used for policy creation ahead of actual data sources being created that make use of them. This enables policies to immediately apply when data sources are registered with Immuta.

As with all external catalogs, tags ingested by Immuta from the REST catalog interface are not able to be modified locally within Immuta as this catalog becomes the "source of truth" for them. This results in the tags showing in Immuta with either a lock icon next to them, or without the delete button that would allow a user to manually remove them from an assigned data source or column.

Request

The /tags endpoint receives a simple GET request from Immuta. No payload nor query parameters are required.

Example request:

curl http://<your_custom_rest_catalog>/tags \
     --header 'Authorization: Basic <base64 of username:password>'

Response

Example response:

{
  "REST_Catalog_Root": {
      "id": "id_is_set_by_catalog_and_can_be_int_or_string"
  },
  "REST_Catalog_Root.Child1": {
      "id": "d3e859da-40e9-43d2-a302-294458e79a64"
  },
  "REST_Catalog_Root.Child2.Grandchild1": {
      "id": 10
  }
}

POST `/dataSource`

Overview

The /dataSource endpoint does the vast majority of the work. It receives a POST request from Immuta, and returns the mapping of a data source and its columns to the applied tags and descriptions.

Immuta will try to fetch metadata for a data source in the system at various times:

During data source creation. During data source creation, Immuta will send metadata to the REST Catalog service, most notably the connection details of the data source, which includes the schema and table name. It is important that the Custom REST service implemented can parse this information and search its records for an appropriate record to return with an ID unique to this data source in its catalogMetadata object.
When a user manually links the data source. Data sources that either fail to auto-link, or that were created prior to the Custom REST catalog being configured, can still be manually linked. To do so, a data source owner can provide the ID of the asset as defined by the Custom REST Catalog via the Immuta UI. In order for this to work, the Custom REST Catalog service must support matching data source assets by unique ID.
During various refreshes. Once linked, Immuta will periodically call the /dataSource endpoint to ensure information is up to date.

Request

Immuta's POST requests to the /dataSource endpoint will consist of a payload containing many of the elements outlined below:

This object must be parsed by the in Custom REST Catalog order to determine the specific data source metadata being requested.

For the most part, Immuta will provide the id of the data source as part of the catalogMetadata. This should be used as the primary metadata lookup value.

{
  "catalogMetadata": {
    "id": <unique integer or string value>
  }
}

When a data source is being created, such an id will not yet be known to Immuta. Immuta will instead send handlerInfo information as part of the request.

{
  "handlerInfo": {
    "schema": "schema_name",
    "table": "table_name"
  }
}

When an id is not specified, the schema and table name elements should be parsed in an attempt to identify the desired catalog entry and provide an appropriate id. If such a lookup is successful and an id is returned to Immuta in the catalogMetadata section, Immuta will establish an automatic link between the the new data source and the catalog entry, and future references will use that id.

Response

Example response:

  "catalogMetadata": {
    "id": <unique integer or string>
  },
  "description": <string>,
  "tags": {
    "Parent": {
      "id": <tag_id1>
    },
  },
  "dictionary": {
    "some_column_name": {
      "catalogMetadata": {
        "id": <col_id1>
      },
      "description": "This column has example data in it",
      "tags": {
        "Parent.Child1": {
          "id": <tag_id2>
        },
        "Parent.Child1.Grandchild1": {
          "id": <tag_id3>
        }
      }
    }
  }
}

GET `/dataSource/page/{id}`

Overview

This endpoint returns a human-readable information page from the REST catalog for the data source associated with {id}. Immuta provides this as a mechanism for allowing the REST catalog to provide additional information about the data source that may not be directly ingested by or visible within Immuta. This link is accessed in the Immuta UI when a user clicks the catalog logo associated with the data source.

Request

Immuta will send a GET request to the /dataSource/page/{id} endpoint, where {id} will be:

Example request:

curl http://<your_custom_rest_catalog>/dataSource/page/123

Response

The Custom REST Catalog can either provide such a page directly, or can redirect the user to any resource where the appropriate page would be provided - for example a backing full service catalog such as Collibra, if this Custom REST catalog is simply being used to support a custom data model.

Example response:

<html> 
  <head> 
    <title>data source 123</title> 
  </head> 
  <body> data source 123 is a data source that was created just for documentation.
  </body> 
</html>

GET `/column/{id}`

Overview

This endpoint returns the catalog's human-readable information page for the column associated with {id}. Immuta provides this as a mechanism for allowing the REST catalog to provide additional information about the specific column that may not be directly ingested by or visible within Immuta.

Request

Immuta will send a GET request to the /column/{id} endpoint, where {id} will be:

Example request:

curl http://<your_custom_rest_catalog>/column/10

Response

Example response:

<html>
  <head>
    <title>data source 123 Column 10</title>
  </head>
  <body>
    Column 10 is full of example data for documentation reasons.
  </body>
</html>

Data Discovery

Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identification frameworks and identifiers, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.

Architecture

To evaluate your data, SDD generates a SQL query using the identification framework's identifiers; the Immuta system account then executes that query in the remote technology. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. These results are then used to apply the resulting tags to the appropriate columns.

This evaluating and tagging process occurs when identification runs and happens automatically from the following events, if a global framework is set:

A new data source is created.
Schema monitoring is enabled, and a new data source is detected.

The following actions will also trigger identification:

Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed. Note, this will use the identification framework that already ran on the data source.
A user manually triggers it from the data source health check menu. Note, this will use the identification framework that already applies to the data source or the global framework, if set.
A user manually triggers it from the identification frameworks page.
A user manually triggers it through the API.

Users can manually run identification from a data source's overview page or the identification frameworks page.

Components

Sensitive data discovery (SDD) runs frameworks to discover data. These frameworks are a collection of identifiers. These identifiers contain a single criteria and the tags that will be applied when the criteria's conditions have been met. See the sections below for more information on each component.

Identification framework

An identification framework is a group of identifiers that will look for particular criteria and tag any columns where those conditions are met.

While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in "Default Framework," which contains all the built-in identifiers and assigns the built-in Discovered tags.

For a how-to on the framework actions users can take, see the Manage frameworks page.

Global framework

Each organization can set a global framework to apply to all the data sources in Immuta by default unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. If a global framework is set, identification will run on all new data sources. If a global framework is not set, identification will only run on data sources manually applied to an identification framework.

Users can set any framework as the global framework or leave the global framework field blank.

Identifier

An identifier is a criteria and the tags to apply to data that matches the criteria. When Immuta recognizes that criteria, it can tag the data to describe the type.

Immuta comes with built-in identifiers to discover common categories of data. These identifiers cannot be modified or deleted. Users can also create their own unique identifiers to find their specific data.

Improved identifiers

A new and improved pack of the built-in identifiers was released October 2024.

If you are interested in these improved identifiers, reach out to your Immuta support professional.

For a how-to on the identifier actions users can take, see the Create an identifier page.

Supported criteria types for identifiers

Competitive criteria analysis: This criteria is a process that will review all the regex and dictionary criteria within the identifiers of the framework and search for the identifier with the best fit. In this review, each competitive criteria analysis identifier in the framework competes against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier will apply per column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about the competitive nature, see the How competitive criteria analysis works guide.
- Regex: This criteria contains a case-insensitive regular expression that searches for matches against column values. SDD only supports regular expressions (regex) written in RE2 syntax.
- Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied. SDD only supports regular expressions (regex) written in RE2 syntax.

Supported technologies

Sensitive data discovery has varied support for data sources from different technologies based on the identifier type.

Technology

Regex

Dictionary

Column name regex

Snowflake

Supported

Databricks

Supported

Starburst (Trino)

Supported

Redshift

Supported

Azure Synapse Analytics

Not supported

Supported

Amazon S3

Not supported

Supported

Google BigQuery

Not supported

Supported

Configuration

Only application admins can enable sensitive data discovery (SDD) globally on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.

Tag mutability

When SDD is manually triggered by a data owner, all column tags previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted Discovered tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.

Performance

The amount of time it takes to run identification on a data source depends on several factors:

Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.

The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by SDD performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.

Testing

For users interested in testing SDD, note that the built-in identifiers by Immuta require a 90% match to data to be assigned to a column. This means that with synthetic data, there may be situations where the data is not real enough to fit the confidence needed to match identifiers. To test SDD, use a dev environment, create copies of your tables, or use the API to run a dryRun and see the tags that would be applied to your data by SDD.

Considerations

Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the Default Framework, then when the identifier is matched the column will not be tagged. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.

Supported data types and casing

Type of identifier

Supported data types

Case sensitivity

Data regex*

Text string columns

Case-sensitive

Column name regex

Any column

Not case-sensitive

Dictionary

Text string columns

Can be toggled in the identifier definition

*Two built-in patterns support and match based on additional data types:

DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.
TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.

Limitations with dictionary patterns

Immuta compiles dictionary patterns into a regex that is sent in the body of a query.

For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.

Databricks limitation

For Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.

Starburst (Trino) limitation

Authentication method

Column name regex identifiers

Competitive criteria analysis identifiers

Username and password

Supported

Not supported

Redshift limitations

The Redshift cluster must be up and running for SDD to successfully run
Redshift Spectrum is only supported with column name regex identifiers

Redshift supported authentication methods

Authentication method

Column name regex identifiers

Competitive criteria analysis identifiers

Username and password

Supported

AWS access key

Supported

Not supported

AWS access key limitations

To use AWS access key authentication on a Redshift data source and have competitive criteria analysis identifiers supported,

The AWS access key used to register the data source must be able to do a minimum of the following redshift-data API actions:
- redshift-data:BatchExecuteStatement
- redshift-data:CancelStatement
- redshift-data:DescribeStatement
- redshift-data:ExecuteStatement
- redshift-data:GetStatementResult
- redshift-data:ListStatements
The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that they onboard their data sources with.
If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options formatted like the following:
```
  region=us-east-2;clusterid=12345
```
Redshift Serverless data sources are not supported for competitive criteria analysis identifiers with the AWS access key authentication method.

Legacy SDD

This is only relevant to users who enabled and ran Immuta SDD prior to October 2023.

Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags applied to specific data sources. These tags can be disabled from data sources but cannot be removed.

Introduction

Immuta allows you to automate discovering and tagging data across your data platform. Tagging is critical for two reasons:

It allows you to define data sensitivity, which in turn allows you to monitor where you have potential data security issues and gaps in your security posture.
It allows you to abstract your physical structure from your access policy logic. For example, you can build access policies like mask all columns tagged Person Name (where Person Name was auto-tagged by Discover) rather than much less scalable policies that must be knowledgeable of your physical layers like mask column x in database y in data platform z.

Challenge and goals

Today’s sensitive data discovery tools give you a shallow overview of your data corpus across a long list of platforms. They give you pointers on where you have sensitive data without the granularity to drive your column- or row-level access controls. They help you understand what data you possess according to a regulatory framework, like HIPAA or PCI but without the details needed to automate your audits or compliance reporting. Knowing that you need to drive east to west on a road map from New York to California is helpful but ultimately insufficient to get you from a specific location to another.

Existing tools promise a high degree of automation, yet their many false positives result in painful manual work that never stops. Although data gets scanned automatically, performance breaks down at scale, or you manually need to fine-tune the computing resources of the scanners. Last but not least, your security team objects to the agent-based processing that requires taking data out of your data platform, and the associated data residency concerns may give you pause.

At Immuta, we believe that data security should not be painful. We believe that you can innovate and move quickly, while at the same time protecting your data and adhering to your internal policies and external regulations. Technology and automation allow you to make the right trade-off decisions quickly. It all starts with highly accurate and actionable metadata. If you trust your metadata and if it’s actionable, you can leverage it to automatically grant access to data, mask sensitive information, and automate your audit reporting.

Immuta Discover was built to tackle those challenges and address them through a unique architecture that was designed in collaboration with the largest financial institutions, healthcare companies, and government agencies in the world. The cloud and AI paradigm requires a fundamentally different approach. You must assume that your data is dynamic, unique, and collected in a multitude of different geographies and legal jurisdictions. Immuta Discover is built for this new world and its specific demands.

How does it work?

Scalability through in-platform processing

Identifying and classifying data requires analyzing and looking at the data - there’s no way around it. Immuta Discover does all the analysis and processing inside the remote technology. It takes advantage of those platforms’ inherent scalability to enable you to analyze large amounts of data quickly, efficiently, and without the need for separate resource optimization for containers or virtual machines.

Data residency compliance by design

By processing data directly inside the data platform, Immuta Discover automatically adheres to data residency and locality requirements. If you run your data warehouse or lake globally - across North America, the European Union, and Asia - Immuta processes the data in the region where your data is stored. No data ever leaves the data platform, and it will never move across different cloud regions.

Improved security and simplicity through agentless scanning

In-platform processing greatly reduces risk and improves your data security posture. Provisioning agents, whether they’re in a container, virtual machine, or Amazon Machine Image (AMI), create complexity and an unnecessary security risk. Not only can those agents become compromised, but their misconfiguration might lead to data leaks to other parts of your cloud infrastructure. An agentless approach can better leverage data platform optimizations to process data instead of transferring it out to re-optimize and analyze. This simplifies operations and increases efficiency for your infrastructure teams.

Cross-platform consistency

The advantages of in-platform processing are abundant, but implementing it across a multitude of platforms is challenging. Immuta helps bypass the obstacles by doing all the heavy lifting for you and building in specific implementations for each technology. Although all those implementations are ultimately different, Immuta abstracts the results to one standardized taxonomy, so you can have consistently accurate and granular metadata across all your data stores.

Granular query-level classification

Immuta Discover classifies data on a column level and instantaneously identifies schema changes. Only with that level of granularity and automation can you adhere to your audit requirements and understand what actions have been taken on your data. For example, if non-sensitive data is joined with sensitive data at query time, Immuta Discover will monitor and record that for your review. Continuous schema monitoring ensures schema changes never result in holes in your access controls and data security posture.

Highly accurate and actionable metadata

Trust in your metadata is critical for data security.

To unblock your data consumers, you need to automate your data access controls; this requires trusting that your classification and metadata are accurate and actionable. Immuta Discover provides you with highly accurate metadata and tags out-of-the-box and assists you in fine-tuning the classification mechanism to deal with false positives quickly. That enables you to build policies that dynamically grant or restrict access to protected data (like PHI or PII) depending on who is accessing it and what protections you want to apply.

Components of Discover

Immuta Discover works in three phases: identification, categorization, and classification.

Identification: In this first phase, data is identified by its kind – for example, a name or an age. This identification can be manually performed, externally provided by a catalog, or automatically determined by Immuta Discover through column-level analysis of patterns.
Categorization: In the second phase, data is categorized in the context of where it appears, subject to your active frameworks. For example, a record occurring in a clinical context containing both a name and individual health data is protected health information (PHI) under HIPAA.
This categorization of data helps to understand the context it is in, including information like whether or not a record pertains to an individual, the composition and kinds of identifiers present, the data subject, whether the data belongs to any controlled data categories under certain legislation, etc.
Classification: In the third and final phase, data is classified according to its sensitivity level (e.g., Customer Financial Data is Highly Sensitive) and the risk associated to the data subject. Detect dashboards support 3 sensitivity levels. However, customers are free to customize the sensitivity names for the tags as needed.

Getting Started with Discover

Immuta can scan your data sources and apply relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.

Requirements

Registered data sources
Immuta permission GOVERNANCE

Implement SDD

Sensitive data discovery (SDD) is an Immuta Discover feature that identifies your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.

To learn more, see the Data discovery page.

Enable sensitive data discovery

Enable sensitive data discovery on your tenant. Opt to have SDD run automatically for new data sources by setting a global framework, or run SDD granularly by applying data sources to specific frameworks.

Configure SDD for your data

For additional control, create your own identifiers to recognize the data that matters to you. Add these identifiers to new frameworks and specify the data sources that need this framework. This fine-level control creates automatic tagging that is relevant and accurate to your data, requiring fewer manual adjustments to the resulting tags.

Customize SDD for your data:
1. Create identifiers
2. Create a framework
Assign your framework to specific data sources

Adjust Discovered tags

If you have any tags that are applied to your data sources by SDD that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again if identification is re-run.

Verify tags
Disable Discovered tags

Reference pages:

Immuta comes with a default framework containing built-in Discovered tags and built-in identifiers. These identifiers and tags can be used in your own frameworks.

Built-in Discovered tags
Built-in identifiers

Implement classification

Classification is an Immuta Discover feature that categorizes your data based on the content and the associated risk the data poses. This increases your understanding of your data and allows you to make faster decisions about it.

Configure classification for your data

To create or manage a framework using the Immuta API, see the Frameworks API reference page.

Adjust classification tags

If you have any tags that are applied to your data sources by classification that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again when classification is re-run.

Assess your data source tags.
Tune your data dictionaries.

How-to Guides

Use Identifiers in Domains

Private preview: This feature is only available to select accounts.

Identifiers in domains allows you to use the same domains you already organize your data in to hold identifiers and run sensitive data discovery (SDD) without having to use identification frameworks. See the Identifiers in domains guide for more information about the feature and limitations.

Prerequisites

Identifiers can be added and SDD can be run in any of your current domains. However, if you are not already using domains, set up a domain specifically to run SDD:

Add identifiers to a domain

Navigate to the Identifiers tab of your domain.
Click Get Started.
Add reference identifiers to your domain that are relevant to your data by clicking the checkboxes. Note: When added to your domain, the identifier is a point-in-time copy of the reference identifier. It has the same name, pattern, and tags.
Click Add Identifiers.

Create a new identifier

This can be done within a domain from the Identifiers tab to create a domain-specific identifier, or it can be done from the Discover Identifiers page to create a reference identifier.

Click Create New.
Enter a name and description for your identifier.
Click Next.
Enter criteria: Select the Type of criteria.
1. For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2.
2. For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
3. For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary to be case sensitive.
Click Next.
Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered.Entity" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.

Enable Sensitive Data Discovery (SDD)

Requirements:

Immuta permission GOVERNANCE
Registered data sources; see the reference page for supported technologies

This how-to guide is for enabling sensitive data discovery (SDD) for the first time. For additional information on sensitive data discovery, see the Data discovery page.

Turn on SDD

Requirement: Immuta permission APPLICATION_ADMIN

Navigate to the App Settings page and scroll to the Sensitive Data Discovery section.
Select the Enable Sensitive Data Discovery (SDD) checkbox to enable SDD.
Click Save and then click Confirm to apply your changes. Note that the Immuta tenant will have a system restart.

Note that the global framework is not set by default, so SDD will not run automatically on any data sources. Set a global framework to have identification automatically run on all new data sources.

Configure the global framework

Requirement: Immuta permission APPLICATION_ADMIN

Navigate to the App Settings page and scroll to the Sensitive Data Discovery section.
Enter the request-friendly name of your global identification framework in the Global SDD Template Name field. This name can be found in the URL when you navigate to the identification framework's page.
Click Save, and then Confirm your changes.

Create a new framework with identifiers

Once SDD is enabled on your tenant, SDD will automatically run when new data sources are added, but it must be manually run for all existing data sources. This allows you to test out SDD with a select few data sources without worrying that it will add tags throughout all your data sources.

For this step, you will pick the identifiers to match the data that matters to your organization. For example, for international data, you may want to enable many different identifiers for many countries, like the "Australia Passport" identifier and the "Finland National ID Number" identifier. However, if you are dealing with United States domestic financial data, those identifiers would be irrelevant. In that case, it would be better to identify the data likely to appear, like Bitcoin or US Bank Routing MICR.

First, create an empty framework,

Navigate to Discover and Identification.
Select Create New.
Enter a Name and Description for your new identification framework.
Select Create empty framework.

Then, add a new identifier to that framework,

Navigate to Discover and Identifiers.
Use the checkboxes to select all the identifiers relevant to your data. Tip: From the overview page you can see the name and the tags that will be applied by the identifier. To better understand the data it will match, click the name to read the description.
Once you have checked the identifiers you want in your framework, click Add to Framework.
Type the framework name in the text box.
Click Add to Framework.

Run identification on your data sources

Once you have created a framework relevant to your data, it is time to test it on your data and customize it. Run identification on a select number of data sources where you understand the data to assess and adjust the tags to reflect what you expect to see.

Add those select data sources to your new framework,

Navigate to Discover and Identification.
Click your new framework name.
Navigate to the Data Sources tab.
Click Add Data Sources.
Check the checkboxes for the select data sources you want to try SDD on.
Click Add Data Source(s).

Then, run identification on those data sources,

Navigate to Discover and Identification.
Click the action menu for your new framework.
Click Run Identification.

View the identification results

After identification runs, you will receive a notification that the job is complete. Then, you can view the results from the data source dictionary.

Navigate to the data source overview page of the data source you added to the framework.
Click the Data Dictionary tab.
Assess whether the Discovered tags are applied as expected.
If you are happy with the Discovered tags, follow the Assign data sources to frameworks guide to add the rest of your data sources to the framework and follow the Run identification guide to run identification on all your data sources.
If you want additional tags, follow the Create an identifier guide to create identifiers that matter to your data.

Manage Identification Frameworks

Requirements:

Sensitive data discovery (SDD) enabled
Immuta permission GOVERNANCE

Create an identification framework

Create an identification framework with no identifiers

Click the Discover icon in the navigation menu and select the Identification tab.
Click Create New.
Enter a Name and Description for the identification framework.
Select the option to Create empty framework.
Click Create.

After you create the identification framework, you can create new identifiers.

Copy an existing identification framework and its identifiers

Click the Discover icon in the navigation menu and select the Identification tab.
Click Create New.
Enter a Name and Description for the identification framework.
Select the option to Create identifiers from an existing framework.
Select the checkbox for the framework you want to copy. You can only copy a single framework. For more information about a framework, click the framework name to open a new tab with details about the framework.
Click Create.

Manage an identification framework's identifiers

Add an identifier to a framework

To add an identifier to a framework,

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click Add Identifier.
Choose in the dropdown to add an identifier from those already in Immuta or create a new identifier for the framework.
- For existing identifiers: Opt to edit the tags. Then click Add Identifier.
- For new identifiers:
  1. Fill out a Name and Description.
  2. Enter criteria: Select the Type of criteria.
    For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2.
    For column name regex, enter a regex to be matched against column names. The default criteria encoding is not case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
    For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary to be case sensitive.
  3. Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered" hierarchy to apply to columns that match your identifier.
  4. Click Next to review your new identifier and click Create Identifier to create it.

Edit an identifier in a framework

Only tags can be edited within a framework. Edits made to an identifier within a framework will only impact that specific identifier. To fully edit an identifier (including the name, description, or criteria) for all frameworks, use the Edit an identifier how-to guide.

To edit the tags applied by an identifier for a framework,

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click the more actions icon for an identifier and select Edit tags.
Remove the tags or type a tag name to add tags.
Click Save.

Delete an identifier from a framework

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework name for the identification framework you want to edit.
Click the more actions icon for an identifier and select Delete.
Click Delete again in the modal.

Manage an identification framework's data sources

Assign an identification framework to data sources

To assign a framework to run on specific data sources,

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework you want to assign and navigate to the Data Sources tab.
Click Add Data Sources.
Select the checkbox for the data source you want this framework to run on. You may select more than one.
Click Add Data Source(s).

Remove data sources from an identification framework

After a data source is removed from a framework, it will use the global framework for any SDD scans and the tags applied by the removed framework will be replaced. The global framework is signified by the globe icon.

To remove data sources from a framework,

Click the Discover icon in the navigation menu and select the Identification tab.
Select the framework you want to remove data sources from and navigate to the Data Sources tab.
Select the checkbox for the data source you want to remove from the framework. You may select more than one.
Select Remove and click Remove again in the modal.

Delete an identification framework

Requirement: No data sources assigned to the framework

To delete a framework,

Click the Discover icon in the navigation menu and select the Identification tab.
Click the more actions icon in the Action column for the framework you want to delete. The global framework cannot be deleted. If you want to delete it, configure a different framework as the global framework.
Select Delete and click Delete again in the modal.

Manage Identifiers

Requirements:

Sensitive data discovery (SDD) enabled
Immuta permission GOVERNANCE
Registered data sources; see the reference page for supported technologies

Create an identifier

Click the Discover icon in the navigation menu and select the Identifiers tab.
Click Create New.
Enter a Name and Description for the new identifier.
Enter criteria: Select the Type of criteria.
1. For regex, enter a regex to be matched against column values. The default criteria encoding is case-sensitive. You can change this encoding using the regex criteria. The regex must use RE2. These identifiers are only supported on Snowflake, Databricks, Starburst (Trino), and Redshift data sources.
2. For a dictionary, enter the values in a comma-separated list to match against column values. Opt to toggle the Case insensitive switch to on if you want the dictionary to be case sensitive. These identifiers are only supported on Snowflake, Databricks, Starburst (Trino), and Redshift data sources.
3. For column name regex, enter a regex to be matched against column names. The default criteria encoding is case-insensitive. You can change this encoding using the regex criteria. The regex must use RE2 syntax.
Select the tags to apply: Use the text box to search for a tag under the "Discovered" hierarchy or type a tag name to create a new tag under the "Discovered" hierarchy to apply to columns that match your identifier.
Click Next to review your new identifier and click Create Identifier to create it.
See the Manage identification frameworks page to add your new identifier to a framework.

Note that all user-created identifiers must be a 90% match or greater for the contents of the column to be tagged.

Edit an identifier

Editing the details or criteria of an identifier from the identifiers menu will affect any framework with that identifier throughout Immuta. Editing the tags will only affect new frameworks the identifier is added to.

To edit an identifier,

Click the Discover icon in the navigation menu and select the Identifiers tab.
Click the name of the identifier you want to edit.
Click Edit.
Edit the field you want to change.
Click Save.

Built-in identifiers cannot be edited.

Delete an identifier

Deleting an identifier will remove it from all the frameworks it is in throughout Immuta.

To delete an identifier,

Click the Discover icon in the navigation menu and select the Identifiers tab.
Click the more actions icon in the Action column for the identifier you want to delete.
Select Delete and click Delete again in the modal.

Built-in identifiers cannot be deleted.

Run and Manage Sensitive Data Discovery on Data Sources

Requirements:

Sensitive data discovery enabled
Registered data sources; see the reference page for supported technologies
Immuta permission GOVERNANCE

Identification (or sensitive data discovery (SDD)) runs automatically. If you want to re-run identification when a new global framework is set or when new identifiers have been added to a framework, you can manually run it for all data sources using the API or from the UI by following a how-to below.

Run identification using a specific framework

Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run Identification and then select it again in the modal.

Run identification on a data source

Navigate to the data source overview page.
Click the health status.
Select Re-run next to Sensitive Data Discovery (SDD).

Verify discovered tags

Verify discovered tags

If sensitive data discovery has been enabled, then manually adding tags to columns in the data dictionary will be unnecessary in most cases. The data owner will just need to verify that the Discovered tags are correct.

Disable Discovered tags from the data dictionary

If a governor, data owner, or data source expert disables a Discovered tag from the data dictionary, the column will not be re-tagged next time identification (or SDD) runs. When a Discovered tag is disabled, it will not completely disappear, and it can be manually enabled through the tag side sheet.

To disable a discovered tag,

Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click the tag you want to remove.
Click Disable in the side sheet and then click Confirm.

Reference Guides

Identifiers in Domains

Private preview: This feature is only available to select accounts.

Sensitive data discovery (SDD) runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criteria and the tags that will be applied when the criteria's conditions have been met.

Identifier

There are two types of identifiers in Immuta:

Reference identifiers: These identifiers are a library of the identifiers that can be added to domains. When added to a domain, reference identifiers are copied over and become domain-specific identifiers.
1. Immuta comes with built-in identifiers to discover common categories of data. These cannot be modified or deleted.
2. Data governors can create their own reference identifiers for use within your organization.
Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when SDD runs.
1. Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.
2. If a domain-specific identifier was copied over from a reference identifier, there is no lineage and any edits to the reference identifier will not be reflected in the domain-specific copy.

Criteria

Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.

SDD only supports regular expressions (regex) written in RE2 syntax.

Supported criteria types for identifiers

Competitive criteria analysis: This criteria is a process that will review all the regex and dictionary criteria within the identifiers of the domain and search for the identifier with the best fit. In this review, each competitive criteria analysis identifier in the domain competes against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier for each domain will apply per column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about the competitive nature, see the How competitive criteria analysis works guide.
- Regex: This criteria contains a case-insensitive regular expression that searches for matches against column values.
- Dictionary: This criteria contains a list of words and phrases to match against column values.
Column name: This criteria includes a case-insensitive regular expression matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied.

Create a new identifier in the Immuta UI or with the sdd/classifier endpoint.

What has changed with SDD when using this new feature?

If you used SDD prior to this feature release in January 2025, there are some differences:

There are now two types of identifiers:
1. Reference identifiers
2. Domain identifiers
See information about these in the Identifiers section.
There is a new permission to manage identifiers within domains: Manage Identifiers. The permission allows you to do the following:
1. Create an identifier within your domain
2. View the reference identifiers in Immuta
3. Add, edit, and delete identifiers within your domain
The following have been removed:
1. Identification frameworks: Previously, all identifiers had to be contained within a framework and that framework had to be assigned to a data source to run. Now, identifiers are added to domains with data sources.
2. Global framework: Previously, a global framework could be set to run SDD automatically on all new data sources. This behavior cannot be achieved with identifiers in domains.

Limitations

See the table below for information on when SDD runs with the SDD feature before vs after with identifiers in domains.

Event

Before

After

SDD runs automatically on all new data sources

Yes, if a global framework is set

SDD runs automatically on new data sources found from schema monitoring

Yes, if a global framework is set

SDD runs automatically on new columns found from column detection in a data source where SDD has already run

Yes

SDD runs when a user manually triggers it from the data source health check menu

Yes

SDD runs when a user manually triggers it from the domain's page

Yes

SDD runs when a user manually triggers it from the identification framework page

Yes

SDD runs when a user manually triggers it through the API

Yes

Built-in Identifier Reference

Immuta comes with a set of built-in identifiers that look for common data types. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own frameworks and edit the tags applied by them.

Identifiers must match at least 90% of the sampled data to be tagged, with two exceptions noted below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.

Deprecation notice

The following Discovered tags have been deprecated:

Discovered.Identifier Direct
Discovered.Identifier Indirect
Discovered.Identifier Undetermined
Discovered.PCI
Discovered.PHI
Discovered.PII

New SaaS tenants will not see these tags applied by SDD. Current tenants relying on these tags for policies should contact their Immuta representative for support before these tags are removed from the product in December 2024.

Identifier descriptions and default resulting tags

Identifier

Description

Resulting tags from the default identifier

AGE

Matches numeric strings between 10 and 199.

Discovered.Entity.Age

ARGENTINA_DNI_NUMBER

Matches strings consistent with Argentina National Identity (DNI) Number. Requires an eight-digit number with optional periods between the second and third and fifth and sixth digit.

Discovered.Country.Argentina

Discovered.Entity.DNI Number

AUSTRALIA_MEDICARE_NUMBER

Matches numeric strings consistent with Australian Medicare number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Optional spaces can be placed between the fourth and fifth and ninth and tenth digit. The optional 11th digit separated by a / can be present. A checksum is required.

Discovered.Country.Australia

Discovered.Entity.Medicare Number

AUSTRALIA_PASSPORT

Matches strings consistent with Australian Passport number. An 8- or 9-character string is required, with a starting upper case character (N, E, D, F, A, C, U, X) or a two-character starting character (P followed by A, B, C, D, E, F, U, W, X, or Z) followed by seven digits.

Discovered.Country.Australia

Discovered.Entity.Passport

BELGIUM_NATIONAL_ID_CARD_NUMBER

Matches numeric strings consistent with Belgium's National ID card. Requires a twelve-digit number with hyphen (-) between the third and fourth digit and tenth and eleventh digits. A two checksum is required.

Discovered.Country.Belgium

Discovered.Entity.National ID Card Number

BITCOIN_INVOICE_ADDRESS

Matches strings consistent with the following Bitcoin Invoice Address formats: P2PKH, P2SH, and Bech32. P2PKH and P2SH must start with a 1 or a 3, respectively, followed by 25 - 34 alphanumeric characters, excluding l, I, O, and 0. Bech32 formats must begin with bc1 and be followed by 39 characters. To be identified, any addresses must have a valid checksum.

Discovered.Entity.CRYPTO

BRAZIL_CPF_NUMBER

Matches a numeric string consistent with Brazil's CPF (Cadastro Pessoal de Pessoa Física) number. An eleven-digit numeric string with non-numeric separators after the third, sixth, and ninth digits. A two digit checksum is required.

Discovered.Country.Brazil

Discovered.Entity.CPF Number

CANADA_BC_PHN

Matches numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with optional hyphen (-) or spaces after the fourth and seventh digits.

Discovered.Country.Canada

Discovered.Entity.British Columbia Health Network Number

CANADA_OHIP

Matches alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP). Requires a twelve-digit alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth digits. The final two characters are a checksum.

Discovered.Country.Canada

Discovered.Entity.Ontario Health Insurance Number

CANADA_PASSPORT

Discovered.Country.Canada

Discovered.Entity.Passport

CANADA_QUEBEC_HIN

Matches alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.

Discovered.Country.Canada

Discovered.Entity.Quebec Health Insurance Number

CREDIT_CARD_NUMBER

Matches strings consistent with a credit card number with prefixes matching major credit card companies. Must include a valid checksum.

Discovered.Entity.Credit Card Number

DATE

Matches strings consistent with dates. These can include days of the week, dates, and date times.

Discovered.Entity.Date

DENMARK_CPR_NUMBER

Matches numeric strings consistent with Personal Identification Number (CPR-number or Person-number). Requires a ten-digit number with either a DDMMYY-SSSS or DDMMYYSSSS format. The first six digits are an individual's birth date in Day, Month, Year format. The final four digits comprise the sequence number.

Discovered.Country.Denmark

Discovered.Entity.CPR Number

DOMAIN_NAME

Matches domain names using a very broad pattern.

Discovered.Entity.Domain Name

EMAIL_ADDRESS

Detect strings consistent with an email address. Usernames are required to be fewer than 255 characters, follow by @a, a domain of fewer than 255 characters, and a top level domain of between 2 and 20 characters.

Discovered.Entity.Electronic Mail Address

ETHNIC_GROUP

Matches strings consistent with the US Census race designations.

Discovered.Entity.Ethnic Group

FDA_CODE

Matches a string consistent with a drug or ingredient registered with Food and Drug Administration (FDA). Must start with between 4 to 6 digits, followed by a hyphen, followed by 3 to 4 digits, followed by a hyphen, and finishing with one to two digits.

Discovered.Country.US

Discovered.Entity.FDA Code

FINANCIAL_INSTITUTIONS New

Matches strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, includes alternative names.

Discovered.Entity.Financial Institutions

FINLAND_NATIONAL_ID_NUMBER

Matches a string consistent with Finland's National ID number. Requires an eleven-character string in a DDMMYYCZZZQ format. The first six digits are an individual's birth date in Day, Month, Year format. The C character is a century of birth indicator (+ for the years 1800-1899, - for years 1900-1999, and A for years 2000-2099). ZZZ is an individual ID number, and Q is a required checksum.

Discovered.Country.Finland

Discovered.PHI

Discovered.Entity.National ID Number

FRANCE_CNI

Matches numeric strings consistent with the French National ID card number (carte nationale d'identité). Requires a twelve-digit numeric string.

Discovered.Country.France

Discovered.Entity.CNI

FRANCE_NIR

Matches numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit. The 14th and 15th digits act as a checksum.

Discovered.Country.France

Discovered.Entity.NIR

FRANCE_PASSPORT

Matches alphanumeric strings consistent with the French Passport number. Requires two numbers followed by two upper case letters and ends with 5 digits.

Discovered.Country.France

Discovered.Entity.Passport

GENDER

Matches strings consistent with gender or gender abbreviations.

Discovered.Entity.Gender

GERMANY_DRIVERS_LICENSE_NUMBER

Matches alphanumeric strings consistent with Germany's Driver's License number. Requires an eleven-element string, with a digit or a letter followed by two digits, 6 digits or letters, one digit, and one digit or letter.

Discovered.Country.Germany

Discovered.Entity.Drivers License Number

GERMANY_IDENTITY_CARD_NUMBER

Matches alphanumeric strings consistent with Germany's Identity Card number. Requires a single letter followed by eight digits.

Discovered.Country.Germany

Discovered.Entity.Identity Card Number

IBAN_CODE

Matches strings consistent with an International Bank Account Number (IBAN). Must contain a valid country code.

Discovered.Entity.IBAN Code

ICD10_CODE

Matches strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2020.

Discovered.Entity.ICD10 Code

IMEI_HARDWARE_ID

Matches strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 digits with optional hyphens or spaces after the second, 8th, and 14th digits.

Discovered.Entity.IMEI

IP_ADDRESS

Matches IP Addresses in the V4 and V6 formats.

Discovered.Entity.IP Address

LOCATION

Matches strings consistent with Countries or Municipalities. By default focuses on locations in the United States.

Discovered.Entity.Location

MAC_ADDRESS

Matches strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon.

Discovered.Entity.MAC Address

MAC_ADDRESS_LOCAL

Matches strings consistent with a local Media Access Control (MAC) address.

Discovered.Entity.MAC Address Local

PERSON_NAME

Matches strings consistent with a dictionary of people's names. Names are drawn from the US Social Security database. This identifier must match at least 45% of the data sampled.

Discovered.Entity.Person Name

PHONE_NUMBER

Matches strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone numbers naming convention.

Discovered.Entity.Telephone Number

POSTAL_CODE

Matches strings consistent with a valid US zip code with an optional +4. Only valid 5 digit zip codes are detected.

Discovered.Entity.Postal Code

SEC_STOCK_TICKER New

Matches strings consistent with the stock tickers recognized by the U.S. Securities and Exchange Commission (SEC).

Discovered.Entity.Stock Ticker Symbol

SPAIN_NIE_NUMBER

Matches strings consistent with Spain's Foreigner Identification number. Requires an eight-character string. The initial character must be X, Y, or Z, followed by seven digits, then by an optional hyphen or space and a single checksum character.

Discovered.Country.Spain

Discovered.Entity.NIE Number

SPAIN_NIF_NUMBER

Matches strings consistent with Spain's Tax Identification number. Requires an eight-character string. Requires eight digits followed by an optional hyphen or space and a single checksum character.

Discovered.Country.Spain

Discovered.Entity.NIF Number

SPAIN_PASSPORT

Matches strings consistent with Spain's Passport number. Requires an eight- or nine-character string, starting with either two or three letters followed by six digits.

Discovered.Country.Spain

Discovered.Entity.Passport

STREET_ADDRESS

Matches strings consistent with street addresses. Primarily looks for strings consistent with the United States street naming convention. This identifier must match at least 80% of the data sampled.

Discovered.Entity.Location

SWEDEN_NATIONAL_ID_NUMBER

Matches numeric strings consistent with Sweden's Nation ID number. Requires a ten- or twelve-digit string that must start with a date in either the YYMMDD or YYYYMMDD formats. An optional - or + character then separates four ending digits. The final digit is a checksum.

Discovered.Country.Sweden

Discovered.Entity.National ID Number

SWEDEN_PASSPORT

Matches numeric strings consistent with Sweden's Passport number. Requires an 8-digit number.

Discovered.Country.Sweden

Discovered.Entity.Passport

SWIFT_CODE

Matches alphanumeric strings consistent with a SWIFT code (or Bank Identifier Code (BIC)) format.

Discovered.Entity.Swift Code

THAILAND_NATIONAL_ID_NUMBER

Matches strings consistent with Thailand's National ID number. Requires a 13-digit number with optional spaces or hyphens (-) after the first, fifth, tenth, and twelfth digits. The final digit is a checksum.

Discovered.Country.Thailand

Discovered.Entity.National ID Number

TIME

Matches strings consistent with times. Can contain both date and time pieces.

Discovered.Entity.Date

UK_DRIVERS_LICENSE_NUMBER

Matches alphanumeric strings consistent with the United Kingdom's Driver's License number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for decade of birth, two digits for month of birth (incremented by 50 for female drivers), two digits for day of birth, one digit for year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.

Discovered.Country.UK

Discovered.Entity.Drivers License Number

UK_NATIONAL_INSURANCE_NUMBER

Matches alphanumeric strings consistent with the United Kingdom's National Insurance number. Requires a nine-character string. The first two digits must be letters, followed by an optional space, then six digits with optional spaces or hyphens (-) every two digits, ending with a letter.

Discovered.Country.UK

Discovered.Entity.National Insurance Number

UK_TAXPAYER_REFERENCE

Matches ten-digit numeric strings consistent with UK Taxpayer Reference (UTR) numbers. The final digit is a checksum.

Discovered.Country.UK

Discovered.Entity.Taxpayer Reference

URL

Matches string consistent with a Uniform Resource Locator (URL). String must begin with http://, https://, ftp://, file:///, or mailto:, followed by a string and ending with a top level domain of no more than 128 characters.

Discovered.Entity.URL

US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER

Matches a numeric string consistent United States Adoption Taxpayer Identification Number (ATIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having 93 as an allowed Group Number.

Discovered.Country.US

Discovered.Entity.Adoption Taxpayer ID Number

US_BANK_ROUTING_MICR

Matches numeric string consistent with an American Bankers Association (ABA) Routing Number. Must be a nine-digit number starting with 0, 1, 2, 3, 6, or 7, followed by eight digits. The final digit is a checksum.

Discovered.Country.US

Discovered.Entity.Bank Routing MICR

US_DEA_NUMBER

Matches alphanumeric strings consistent with a Drug Enforcement Administration (DEA) number that is assigned to a health care provider. Must be a length of nine characters. The first two digits must be alphanumeric, and the last seven digits must be digits. The final digit is a checksum.

Discovered.Country.US

Discovered.Entity.DEA Number

US_EMPLOYER_IDENTIFICATION_NUMBER

Matches numeric string consistent United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.

Discovered.Country.US

Discovered.Entity.Employer ID Number

US_HEALTHCARE_NPI

Matches numeric strings consistent with US National Provider Identifier (NPI). Strings must be either 10 or 15 digits with the final digit being a valid checksum.

Discovered.Country.US

Discovered.Entity.Healthcare NPI

US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER

Matches a numeric string consistent United States Individual Taxpayer Identification Number (ITIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having a limited set of allowed Group Numbers.

Discovered.Country.US

Discovered.Entity.Individual Taxpayer ID Number

US_PASSPORT

Matches numeric strings consistent with United States Passport number. Strings must contain nine digits.

Discovered.Country.US

Discovered.Entity.Passport

US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER

Matches strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P that is followed by 8 digits.

Discovered.Country.US

Discovered.Entity.Preparer Taxpayer ID Number

US_SOCIAL_SECURITY_NUMBER

Matches strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.

Discovered.Country.US

Discovered.Entity.Social Security Number

US_STATE

Matches strings consistent with either a full name or two-letter abbreviation of a US state or territory.

Discovered.Country.US

Discovered.Entity.State

US_TOLLFREE_PHONE_NUMBER

Matches strings consistent with a US toll-free telephone number. Allowed area codes are 800, 88+any digit, or 899.

Discovered.Country.US

Discovered.Entity.Tollfree Telephone Number

VEHICLE_IDENTIFICATION_NUMBER

Matches strings consistent with Vehicle Identification Numbers. A checksum is required as well as a valid World Manufacturer Identifier.

Discovered.Country.US

Discovered.Entity.Vehicle Identifier or Serial Number

Improved Pack: Built-in Identifier Reference

Public preview

This feature is available to all tenants. Reach out to your Immuta support professional to use this feature.

Immuta comes with a pack of built-in identifiers that look for common data types. And since the first pack was released, improvements have been made. These improvements are now available in this improved pack, which includes some unchanged identifiers, but also many new and improved versions of legacy identifiers. These identifiers were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can add these built-in identifiers to their own frameworks and edit the tags applied by them.

Identifiers must match at least 90% of the sampled data to be tagged, with three exceptions noted below. See the How competitive pattern analysis works guide for more information about sampling and thresholds.

Identifier descriptions and default resulting tags

Identifier

Description

Resulting tags from the default identifier

ARGENTINA_DNI_NUMBER

Detects strings consistent with Argentina's National Identity (DNI) Number. Requires an eight-digit number with periods after the second and fifth digits.

Discovered.Country.Argentina

Discovered.Entity.DNI Number

AUSTRALIA_MEDICARE_NUMBER

Detects numeric strings consistent with Australian Medicare Number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Spaces must be placed between the fourth and fifth and ninth and tenth digits. Optional eleventh digit separated by a / or a space.

Discovered.Country.Australia

Discovered.Entity.Medicare Number

AUSTRALIA_PASSPORT

Detects strings consistent with the Australian Passport number. A string of 8 or 9 characters is required, with a starting uppercase character (A, B, C, D, E, F, G, H, J, L, M, N, R, X, or U) or a two-character alphabetic prefix (P followed by A, B, C, D, E, F, U, W, X, or Z) followed by seven numeric digits.

Discovered.Country.Australia

Discovered.Entity.Passport

BELGIUM_NATIONAL_ID_CARD_NUMBER

Detects numeric strings consistent with Belgium's National ID card. Requires a twelve-digit number with a required hyphen (-) between the third and fourth digits. Allows for an optional hyphen between the tenth and eleventh digits.

Discovered.Country.Belgium

Discovered.Entity.National ID Card Number

BELGIUM_NATIONAL_REGISTRATION_NUMBER New

Detects numeric strings consistent with Belgium's National Registration Number. Requires 11 characters in the form YY.MM.DD-NNN-XX, where YY.MM.DD corresponds to birth date, NNN is a number, and XX is a checksum digit.

Discovered.Country.Belgium
Discovered.Entity.National Registration Number

BITCOIN_INVOICE_ADDRESS

Detects strings consistent with the following Bitcoin Invoice Address formats: P2PKH, P2SH, and Bech32.

Discovered.Entity.CRYPTO

BRAZIL_CPF_NUMBER

Detects a numeric string consistent with Brazil's CPF (Cadastro de Pessoas F\u00edsica) number. An eleven-digit numeric string with optional non-numeric separators (., -, or space) after the third, sixth, and ninth digits.

Discovered.Country.Brazil

Discovered.Entity.CPF Number

CANADA_BC_PHN

Detects numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with hyphens (-) or spaces after the fourth and seventh digits.

Discovered.Country.Canada

Discovered.Entity.British Columbia Health Network Number

CANADA_OHIP

Detects alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP). Requires a twelve-digit capitalized alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth digits.

Discovered.Country.Canada

Discovered.Entity.Ontario Health Insurance Number

CANADA_PASSPORT

Detects strings consistent with the Canadian Passport Number format. Allows for two formats. One format requires two capital letters followed by six digits. The other format requires one letter, followed by six digits, and ends in two letters.

Discovered.Country.Canada

Discovered.Entity.Passport

CANADA_QUEBEC_HIN

Detects alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.

Discovered.Country.Canada

Discovered.Entity.Quebec Health Insurance Number

COUNTRY New

Detects strings consistent with the names of all countries in the world. This identifier is case-insensitive.

Discovered.Entity.Location

CREDIT_CARD_NUMBER

Detects strings consistent with a credit card number with prefixes matching major credit card companies.

Discovered.Entity.Credit Card Number

DATE

Detects strings consistent with dates in or date type: date, date+time, or timestamp. This identifier is case-insensitive.

Discovered.Entity.Date

DOMAIN_NAME

Detects strings that begin with a letter and are no more than 225 characters. A full domain can have one to four labels separated by a .. Each label can be one to 63 alphanumeric characters long. And each label after the first must be in the dictionary list of possible labels. This identifier is case-insensitive.

Discovered.Entity.Domain Name

EMAIL_ADDRESS

Detect strings consistent with an email address. Usernames are required to be fewer than 255 characters, follow by @, a domain of fewer than 255 characters, and a top level domain of between 2 and 20 characters.

Discovered.Entity.Electronic Mail Address

ETHNIC_GROUP

Detects strings consistent with the US Census . This identifier allows for dashes to be used in place of spaces and is case-insensitive.

Discovered.Entity.Ethnic Group

FDA_CODE

Detects a string consistent with a drug or ingredient registered with the Food and Drug Administration (FDA). Must start with between 4 to 5 digits, followed by a hyphen, followed by 3 to 4 digits, followed by a hyphen, and finishing with 1 to 2 digits.

Discovered.Country.US

Discovered.Entity.FDA Code

FINANCIAL_INSTITUTIONS New

Detects strings consistent with names of financial institutions based on lists provided by the FDIC and OCC, includes alternative names.

Discovered.Entity.Financial Institutions

FRANCE_NIR

Detects numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit.

Discovered.Country.France

Discovered.Entity.NIR

FRANCE_PASSPORT

Detects alphanumeric strings consistent with the French Passport number. Requires two numbers followed by two uppercase letters and ends with five digits.

Discovered.Country.France

Discovered.Entity.Passport

GENDER

Detects strings consistent with and common abbreviations. This identifier is case-insensitive.

Discovered.Entity.Gender

GERMANY_DRIVERS_LICENSE_NUMBER

Detects alphanumeric strings consistent with Germany's driver's license number. Requires an eleven-element string of the format CDDCCCCCCDC where C is an uppercase Latin letter and D is a numeric digit.

Discovered.Country.Germany

Discovered.Entity.Drivers License Number

GREAT_BRITAIN_DRIVERS_LICENSE

Detects alphanumeric strings consistent with the United Kingdom's driver's license number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for decade of birth, two digits for month of birth (incremented by 50 for female drivers), two digits for day of birth, one digit for year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.

Discovered.Country.UK

Discovered.Entity.Drivers License Number

IBAN_CODE

Detects strings consistent with an International Bank Account Number (IBAN). Requires a string in the form ZZ-DD-BBAN, where ZZ is a country code, DD is two numeric digits, and BBAN is a Basic Bank Account Number comprising two to seven groups of three to five uppercase alphanumeric characters, optionally separated by space or dash, and optionally followed by a final group of length one to three.

Discovered.Entity.IBAN Code

ICD10_CODE

Detects strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2025. This identifier is case-insensitive.

Discovered.Entity.ICD10 Code

ICD_10_PCS New

Detects strings consistent with procedure codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from 2020.

Discovered.Entity.ICD10 Procedure Code

IMEI_HARDWARE_ID

Detects strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 or 16 digits with optional hyphens or spaces after the 2nd, 8th, and 14th digits.

Discovered.Entity.IMEI

IP_ADDRESS

Detects IP Addresses in the V4 and V6 formats. This identifier is case-insensitive.

Discovered.Entity.IP Address

LOCATION

Detects ISO3166 formatted locations. This identifier must match at least 80% of the data sampled.

Discovered.Entity.Location

MAC_ADDRESS

Detects strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon or hyphen.

Discovered.Entity.MAC Address

NAICS_CODE New

Detects strings consistent with North American Industry Classification System (NAICS). A two-digit number represents a basic sector and each preceding digit represents a more specific sub sector with a maximum of six digits.

Discovered.Entity.NAICS Code

PERSON_NAME

Detects strings consistent with a dictionary of people's names. The name dictionary is US-centric with person names drawn from the US Social Security database, covering 80% of the U.S. population. This identifier must match at least 45% of the data sampled. This identifier is case-insensitive.

Discovered.Entity.Person Name

PHONE_NUMBER

Detects strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone numbers naming convention. Optional area codes allowed.

Discovered.Entity.Telephone Number

POSTAL_CODE

Detects strings consistent with a valid US Zip code with an optional +4 separated by a dash. Only valid five-digit zip codes are detected. This identifier is case-insensitive.

Discovered.Entity.Postal Code

SEC_STOCK_TICKER New

Detects strings consistent with the stock tickers recognized by the U.S. Securities and Exchange Commission (SEC).

Discovered.Entity.Stock Ticker Symbol

SPAIN_NIF_NUMBER

Detects strings consistent with Spain's Tax Identification number. Requires a string with nine alphanumeric characters. Requires either eight digits followed by an optional hyphen or space and a single uppercase letter or the initial character must be X, Y, or Z, followed by an optional dash or space, seven numeric digits, followed by an optional dash or space, and finally, by a single uppercase letter.

Discovered.Country.Spain

Discovered.Entity.NIF Number

SPAIN_PASSPORT

Detects string consistent with Spain's Passport Number. Requires a eight- or nine-character string starting with either two or three uppercase letters followed by six numeric digits.

Discovered.Country.Spain

Discovered.Entity.Passport

SWIFT_CODE

Detects alphanumeric strings consistent with a SWIFT code (or Bank Identifier Code (BIC)) format. Requires values consistent with AAAAAACCDDD, where A is an uppercase letter, C is an uppercase letter or numeric digit, and DDD is an optional three-character sequence of uppercase letters or numeric digits.

Discovered.Entity.Swift Code

TIME

Detects strings consistent with times in various formats or data type: time. If date is included in the time, it will not match. Use the DATE identifier instead.

Discovered.Entity.Date

UK_NATIONAL_INSURANCE_NUMBER

Detects alphanumeric strings consistent with the United Kingdom's National Insurance Number. Requires a nine-character string. The first two digits must be uppercase letters, followed by an optional space, then six digits with optional spaces or hyphens (-) every two digits, ending with A, B, C, or D.

Discovered.Country.UK

Discovered.Entity.National Insurance Number

URL

Detects string consistent with a URL. String must begin with a common schema, followed a string and ending with a top level domain of no more than 128 alphanumeric characters.

Discovered.Entity.URL

US_DEA_NUMBER

Detects alphanumeric strings consistent a Drug Enforcement Administration (DEA) number is assigned to a health care provider. It must have a length of nine characters. The first two digits must be uppercase alphanumeric characters, and the last seven characters are numeric digits. The first character may not be I, N, O, Q, V, W, Y, or Z.

Discovered.Country.US

Discovered.Entity.DEA Number

US_EMPLOYER_IDENTIFICATION_NUMBER

Detects numeric string consistent United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.

Discovered.Country.US

Discovered.Entity.Employer ID Number

US_HEALTHCARE_NPI

Detects 10-digit numeric strings consistent with US National Provider Identifier (NPI). It must either start with 80840 followed by a 1 or 2, or it must begin with a 1 or 2.

Discovered.Country.US

Discovered.Entity.Healthcare NPI

US_PERSON_FULL_NAME New

Detects strings consistent with a person's {first name} space {last name}. Uses the same names from the PERSON_NAME identifier. This identifier must match at least 20% of the data sampled and is case-insensitive.

Discovered.Entity.Person Name

US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER

Detects strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P that is followed by eight digits.

Discovered.Country.US

Discovered.Entity.Preparer Taxpayer ID Number

US_SOCIAL_SECURITY_NUMBER

Detects strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.

Discovered.Country.US

Discovered.Entity.Social Security Number

US_STATE

Detects strings consistent with either a full name or two-letter abbreviation of a US state or territory.

Discovered.Country.US

Discovered.Entity.State

US_STREET_ADDRESS

Detects strings consistent with U.S. street addresses. Requires the street naming convention of {address_number} {street_name} {unit number (optional)} with an optional road suffix after the street name. The maximum length for street name is 20 alphanumeric characters. This identifier must match at least 80% of the data sampled and is case-insensitive.

Discovered.Entity.Location

VEHICLE_IDENTIFICATION_NUMBER

Detects strings consistent with Vehicle Identification Numbers. A valid World Manufacturer Identifier is required.

Discovered.Country.US

Discovered.Entity.Vehicle Identifier or Serial Number

Built-in Discovered Tags Reference

Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See a list of the built-in Discovered tags below and the for information about where these tags will be applied by the built-in identifiers.

Country tags

All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered . Country . Argentina.

Child tag name

Description

Entity tags

All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered . Entity . Aadhaar Individual.

Identifier tags

Deprecation notice

The following identifier tags have been deprecated. New SaaS tenants will not see these tags applied by SDD. Current tenants relying on these tags for policies should contact their Immuta representative for support before these tags are removed from the product.

None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . Identifier Direct.

Personal information tags

Deprecation notice

None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered . PCI.

How Competitive Pattern Analysis Works

Of sensitive data discovery's , regex and dictionary are competitive. This means that when assessing your data, if multiple identifiers could match, only one with competitive criteria will be chosen to tag the data. To better understand how Immuta executes this competition, read further.

Discover employs a three-phased competitive criteria analysis approach for sensitive data discovery (SDD):

: No data is moved, and Immuta checks the identifiers against a sample of data from the table.
: Identifiers with a criteria match of less than a 90% match are filtered out.
: The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.

In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.

Sampling

In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.

The sample size is decided based on the number of identifiers and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the framework, not the data size. The sample size dependence on the number of identifiers is weak and will not exceed 13,000 rows.

Number of identifiers

Sample size

Sampling considerations

In practice, the number of sampled values for each column may be less than the requested number of rows because columns are not independently sampled but rather projected from a row-wise sample. This can impact the sample when the target table has less than the requested number of rows, when some of the column values are null, or because of technology-specific limitations.

Snowflake and Starburst (Trino): Discover implements table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with less than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
All platforms: Any null values included in the sample will not count towards the qualification or scoring when included in the sample. However, it will lower the number of available values to match against the patterns, as the sample size is not dynamic based on the ignored null values.

Qualifying

During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if the on the remote sample exceeds the predefined threshold. This threshold is 90% match for most built-in identifiers; however, a few built-in identifiers have lower threshold . The 90% threshold is standard for all custom identifiers as well to ensure the criteria matches the data within the column and to avoid false positives. Note that threshold calculations are relative to the number of non-null entries for each column.

If no identifiers qualify, then no identifier is assessed for scoring and the column is not tagged.

Scoring

During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers since the ability to satisfy the more difficult-to-satisfy identifier itself serves as evidence that it is more likely. This phase ends by returning a single most likely identifier per the inference process.

Example

Here are a set of regex identifiers and a sample of data:

Identifiers:

[a-zA-Z0-9]{3} - This regex will match 3 character strings with the characters a-z, lowercase or uppercase, or digits 0-9.
[a-c]{3} - This regex will match 3 character strings with the characters a-c, lowercase.
(a|b|d){3} - This regex will match 3 character strings with the characters a, b, or d, lowercase.

When qualifying the identifiers, Identifier 1 and Identifier 3 both match 90% or more of the data. Identifier 2 does not, and is disqualified.

Then the qualified identifiers are scored. Here, Identifier 1, despite matching 100% of the data, is unspecific and could match over 200,000 values. On the other hand, Identifier 3 matches just at 90% but is very specific with only 27 available values.

Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.

Important notes

Dictionaries are part of the competitive process, while column-name regex are not.
Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.
Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.

Data Classification

Classification is the process in which data is categorized by the content and the associated risk level based on context. Classification complements sensitive data discovery (SDD), and the tags classification applies can give additional information in the Detect dashboards for data sources.

How-to guides

: Use the API to activate a classification framework.
: Create a classification framework using a provided template.

Reference guide

: This reference guide describes classification frameworks and how classification works in Immuta.

How-to Guides

Activate Classification Frameworks

Requirements:

Registered
Immuta permission GOVERNANCE

To activate a classification framework,

Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Activate.

Deactivate a classification framework

Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Deactivate.

Activate and manage classification frameworks using the API

To activate a framework using the Immuta API, see the .

Adjust Identification and Classification Framework Tags

Requirements:

Registered
Immuta permission GOVERNANCE

Immuta Discover provides identifiers out-of-the-box to recognize and tag data. Users can then utilize classification frameworks and build them to apply tags based off those identifier tags and their own catalog tags.

Tune identification frameworks and identifiers first to adjust where Discovered tags are applied. Because classification frameworks can apply classification tags from the Discovered tags, tuning SDD should come first and will have trickle-down effects on classification. Customizing SDD requires some initial work but will automate data tagging for all data sources in the future.

Follow the steps below to tune SDD for your data:

.
.
.
: This will remove the tags from any previous identification frameworks and re-run identification with your new framework. From here, either continue to edit identifiers to reconfigure the applied tags, or if you are happy with the results, proceed to the next step.
.

After SDD has applied entity tags, any active classification frameworks will automatically reapply their tags to account for any changes to Discovered tags. It may be necessary to adjust the classification tags based on your organization's data, security, and compliance needs.

Assess your data source tags

Requirement: Immuta permission GOVERNANCE or data owner

Target some data sources to manually review tags:

Navigate to the data dictionary for the data source by opening the Data Sources page and selecting a data source. Click the Data Dictionary tab to open the data dictionary.
The data dictionary lists the data source columns, with details about the name, data type, and a list of the tags on each column. Assess whether the tags are accurate to your data.

If you find that too many tags are applied

Tags may be unexpected but still accurate to your data. Additionally, they may have been applied because they were found to be the best match from the identifiers in the framework.

If you want to improve SDD and personalize it to your data, assess why the tag was applied to your data:

Is the identifier incorrectly matching this specific column, but correct in other places? It must have been the most correct match found by identification. Create a better match by completing the following steps:

If you want to remove the unexpected tags, use one of the following how-to guides:

Ensure the Discovered tags are applied properly by adjusting SDD.

If you find that tags are missing

If you were expecting some sensitive data to be tagged and it is not, enable additional tags using one of the following how-to guides:

Ensure the Discovered tags are applied properly by adjusting SDD.

Tune your data dictionaries

Requirement: Immuta permissions GOVERNANCE and AUDIT

Navigate to the Data Sources page and select the data sources that you assessed and noted issues.
Click the Data Dictionary tab.
Delete unnecessary tags by clicking on the tag you want to remove from the column, and select Disable from the tag side sheet.
To add tags,
1. Click Add Tags in the Actions column.
2. Begin typing the name of the tag you want to add in the Search by Name field and select the tag from the dropdown list.
3. Click Add.

How to Use a Classification Framework with Your Own Tags

After you have registered data sources in Immuta, you can start automating data classification of a column based on its context, which is the combination of

associated tags already applied to the column
tags applied to the neighboring columns and
table tags on the data source.

The starter framework in this how-to is built to map a classification scale of restricted, confidential, internal, and public to Immuta's three-level scale, which is used in the data source and query event dashboards.

Follow this guide to map your tags to the example framework, or consult the for more information about the framework schema.

Customize the framework

Using the example framework below, customize the framework for your organization's classification tags:

{
  "shortName": "ECMC Framework",
  "name": "External Catalog Mapping Classification Framework",
  "description": "This framework maps the classification tags the organization has in Collibra to Immuta data sources.",
  "": [
    {
      "name": "ECMC.Confidentiality.Highly Sensitive",
      "source": "curated",
      "": [
        {
          "": "confidentiality",
          "": 2
        }
      ]
    },
    {
      "name": "ECMC.Confidentiality.Sensitive",
      "source": "curated",
      "sensitivities": [
        {
          "dimension": "confidentiality",
          "sensitivity": 1
        }
      ]
    },
    {
      "name": "ECMC.Confidentiality.Nonsensitive",
      "source": "curated",
      "sensitivities": []
    }
  ],
  "": [
    {
      "name": "ECMC 00001",
      "": {
        "name": "ECMC.Confidentiality.Highly Sensitive",
        "": "curated"
      },
      "": [
        {
          "name": "Restricted",
          "source": "collibra"
        }
      ],
      "": [],
      "": []
    },
    {
      "name": "ECMC 00002",
      "classificationTag": {
        "name": "ECMC.Confidentiality.Sensitive",
        "source": "curated"
      },
      "columnTags": [
        {
          "name": "Confidential",
          "source": "collibra"
        }
      ],
      "neighborColumnTags": [],
      "tableTags": []
    },
    {
      "name": "ECMC 00003",
      "classificationTag": {
        "name": "ECMC.Confidentiality.Sensitive",
        "source": "curated"
      },
      "columnTags": [
        {
          "name": "Internal",
          "source": "collibra"
        }
      ],
      "neighborColumnTags": [],
      "tableTags": []
    },
    {
      "name": "ECMC 00004",
      "classificationTag": {
        "name": "ECMC.Confidentiality.Nonsensitive",
        "source": "curated"
      },
      "columnTags": [
        {
          "name": "Public",
          "source": "curated"
        }
      ],
      "neighborColumnTags": [],
      "tableTags": []
    }
  ],
  "": true
}

Parameters

tags: These tags are automatically created in Immuta with the sensitivity you assign. They must not already exist in Immuta. All tags used in the classificationTag parameter should be defined here.
tags.sensitivities: This is metadata for the sensitivity of the new tag. Use confidentiality for dimension. Options for sensitivity are 1 (shown as sensitive in Detect dashboards) and 2 (shown as highly sensitive in Detect dashboards). For nonsensitive, leave this parameter empty.
rules: These are the rules for applying the tags defined above. Each rule contains the classification tag to apply if the requirements are met and the requirements: the column tags, neighboring column tags, and table tags that must be present. All requirements within each defined rule must be met for the classification tag to be applied.
rules.classificationTag: The name and source of the tag you want applied if the rule requirements are met. This classification tag must be defined in tags. The source is curated.
rules.columnTags: These are the required tags for a column. If the tags defined here are found on a column, and the other tag rules are met, then the rule's classificationTag will be applied to the same column.
rules.neighborColumnTags: These are the required tags on other columns in the data source (or in the query if dynamic query classification is enabled). If the tags defined here are found on any column in the data source, and the other tag rules are met, then the rule's classificationTag will be applied to all the neighboring columns.
rules.tableTags: These are the required tags on the data source. If the tags defined here are found on the data source, and the other tag rules are met, then the rule's classificationTag will be applied to all the columns in that data source.
active: When true the framework is active and will apply tags when the rules are met.

How to edit rules

Follow the example below to map your tags to the rules in the example framework.

This example framework has a rule where columns tagged DSF.Interpretation.Credentials.Secret by sensitive data discovery will be tagged RAF.Confidentiality.High:

"rules": [
{
    "name": "RAF 00004",
    "classificationTag": {
      "name": "RAF.Confidentiality.High",
      "source": "curated"
    },
    "columnTags": [
    {
        "name": "DSF.Interpretation.Credentials.Secret",
        "source": "curated"
    }
    ],
    "neighborColumnTags": [],
    "tableTags": []
}
]

To translate this to your tags, replace the name and source value of the columnTags, neighborColumnTags, or tableTags with your own. This new example is for a Collibra tag from the external catalog that an organization uses for confidential data. This rule now states: Apply the classification tag RAF.Confidentiality.High to a column if it has the collibra tag Confidential. Repeat this for your organization's remaining classification levels.

"rules": [
{
    "name": "RAF 00004",
    "classificationTag": {
      "name": "RAF.Confidentiality.High",
      "source": "curated"
    },
    "columnTags": [
    {
        "name": "Confidential",
        "source": "collibra"
    }
    ],
    "neighborColumnTags": [],
    "tableTags": []
}
]

Find the `name` and `source` for your tags

If you do not know the name or source for your tags, you can list your tags using the Immuta API:

curl \
    --request GET \
    --header "accept: application/json" \
    --header "Authorization: Bearer <your-token." \
    https://your-immuta-url.com/tag

This request will list all the tags in your Immuta environment, similar to this example response:

[
  {
    "id": 114,
    "name": "DataProperties.Cross-Sectional",
    "source": "curated",
    "deleted": false,
    "systemCreated": true
  },
  {
    "id": 2,
    "name": "Discovered.Country.Argentina",
    "source": "curated",
    "deleted": false,
    "systemCreated": true
  },
  {
    "id": 9,
    "name": "Discovered.Country.Australia",
    "source": "collibra",
    "deleted": false,
    "systemCreated": true
  }
]

Activate your new framework

Requirement: Immuta permission GOVERNANCE

Once you have made all the customizations to the example framework, make the following request using the Immuta API, with your full customized framework as the payload.

curl \
    --request POST \
    --header "Content-Type: application/json" \
    --header "Authorization: Bearer <your-token>" \
    --data @example-payload.json \
    https://your.immuta.com/frameworks/

Your new framework will now be visible in the Immuta UI by navigating the the Classification section under Discover.

Reference Guide

Classification Frameworks

Classification is the process in which data is categorized by the content and the associated risk level based on context. To classify your data, Discover evaluates your data in two phases:

Sensitive data discovery (SDD) runs to identify your data by content type. The data is discovered and evaluated by the identifier it matches and is tagged.
Classification runs to classify your data by its context. The data is classified by the rules within a framework and the tags currently applied to the column and table. Once the data is classified, it's tagged with special tags with additional metadata used in the as sensitivity and visualize when that sensitive data is accessed.

Both phases of classification in Immuta can be customized to find and tag the data your organization cares about. After data is classified, classification tags can be used to or .

Using Discover classification to assign risk and sensitivity levels to your data and Detect dashboards to visualize the risk levels offers these benefits:

Increasing the semantic understanding of your data to better meet compliance requirements
Reducing the time to make decisions about what data access is allowed under what purposes
Reducing the effort and time to respond to auditors about data access in your company
Reducing the labor of classifying data to enumerate what data is within the scope of security or regulatory compliance frameworks

What is the difference between entity tags and classification tags?

Both entity and classification tags describe the content of data on a per-column basis, and you can use them to and . However, there are key differences between the two:

Entity tags are applied through identification and describe what the data is. SDD applies entity tags to columns based on the patterns of the data.
Classification tags are applied through categorization and risk assessment and describe the context of the data and the risk it poses. Using classification frameworks, classification tags are applied to columns based on the entity tags previously applied by SDD. Additional classification tags can then be applied, providing even more context or expressing the property of the record rather than just the column.

Why isn’t entity tagging sufficient for classification?

Entity tags describe the contents of individual columns, in isolation. But you don't access individual columns in isolation, so why would you determine their sensitivity that way? Entity tags do not attempt to and cannot contextualize column contents with neighboring columns' contents. This means that connections between data are lost if they cannot be identified through a pattern within the column itself. Classification tags describe the contents of a table with the context of all its columns, providing a holistic view of the risk of the data for what it is, rather than the pattern it fits. Context is necessary to understand whether your data is public or private data, risky or safe to have ungoverned access, or sensitive and creating toxic joins when accessed with other tables.

For example, under HIPAA, a list of procedures a doctor performed is only considered protected health information (PHI) if it can be associated with the identity of patients. Since entity tagging operates on a single column-by-column basis, it cannot reason whether or not a column containing procedure codes merits classification as PHI. Therefore, entity tagging will not tag procedure codes as PHI. But classification tagging will tag it PHI if it detects patient identity information in the other columns of the table.

Additionally, entity tagging does not indicate how sensitive the data is, but classification tags can carry a sensitivity level. For example, an entity tag may identify a column that contains telephone numbers, but the entity tag alone cannot say that the column is sensitive. A phone number associated with a person may be classified as sensitive, while the publicly listed phone number of a company might not be considered sensitive.

After you understand what entities your data contains using SDD, you need to adopt frameworks that determine what combinations of data constitute sensitive data and their level of sensitivity.

What is a framework?

Frameworks are a set of data categories and a set of classification rules to place data into those categories. In Immuta, the data categories are represented by tags, and when data fits a classification rule the tag is applied:

Classification rules determine how each classification tag is applied. These rules can apply tags based on tags already on the column, tags applied to neighboring columns, and tags applied to the data source. This means that the complete data source is considered when classifying your data sources, and even tags applied to individual columns can affect the risk level of the entire data source.

Frameworks are often built off of an interpretation of regulatory frameworks or standards, such as the US Health Insurance Portability and Accountability Act (HIPAA) and the PCI standard. However, organizations can also build frameworks that represent their internal business processes. When used in Immuta, they automate data tagging and provide information about what data you have immediately after it is registered in Immuta.

What are the benefits of classification?

Data classification is a process, and with Immuta, much of it is automated. This means that you can reap the benefits of classified and tagged data quicker and easier than manually classifying and tagging it:

Build data platform compliance: Create classification frameworks to identify and classify your data based on the industry practices and regulations your organization needs to abide by. Once the frameworks are built, they will automatically tag data as it's registered, ensuring your data sources are properly tagged to abide by the regulations you care about.

Manage Tags

How-to Guides

Add Tags to Data Sources and Projects

Add tags to data sources

Click the Data icon in the navigation menu and select the Data Sources tab.
Select a data source.
Click the Add Tags button on the Details tab.
Begin typing a tag name in the Search by Name field and select the tag from the dropdown list.
Click Add. A list of the applied tags will populate on the Details tab.
Repeat as necessary for other data sources and tags.

Remove tags from data sources

Click the Data icon in the navigation menu and select the Data Sources tab.
Select a data source.
Scroll to the Tags section on the Details tab, and click on the tag you want to remove.
Click Delete in the side sheet and then click Confirm.

Manage data dictionary tags

The data dictionary lists the columns within the data source and the value type of the data within each column. From this page, governors can add tags to or remove them from specific columns in a data source.

Add tags to the data dictionary

Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to add a tag to and click Add Tags.
Begin typing in the Search by Name field and select the tag from the dropdown list.
Click Add. The applied tag will appear below the column name in the data dictionary.

Remove tags from the data dictionary

Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click on the tag you want to delete.
Click Delete in the side sheet and then click Confirm.

Manage project tags

Click the Data icon and select Projects in the navigation menu.
Select a project.
Click the Add Tags button on the Project Overview tab.
Begin typing in the Search by Name field that appears, and then select the tag from the dropdown list.
Click Add. A list of the applied tags will populate on the project overview.

Remove tags from projects

Click the Data icon and select Projects in the navigation menu.
Select a project.
Scroll to the Tags section on the Overview tab, and then click the tag you want to delete.
Click Delete in the side sheet and then click Confirm.

Reference guides

For information about data sources and tags, see the following guides:

How-to guides

In addition to adding and managing data source tags as outlined above, data owners can manage data source

Manage Data Metadata

Connect external catalogs

Data discovery

Data classification

Manage tags

Connect External Catalogs

Getting started

How-to guide

Reference guides

Getting Started with External Catalogs

Link an Alation external catalog

Configuration process

Link a Collibra external catalog

Configuration process

Link a Microsoft Purview external catalog

Configuration process

Link a custom REST catalog

Configuration process

Link Databricks Unity Catalog for tag ingestion

Configuration process

Link Snowflake for tag ingestion

Configuration process

Configure an External Catalog

Link an Alation catalog

Link a Collibra catalog

Link a Microsoft Purview external catalog

Prerequisite

Link a custom REST catalog

Enable Snowflake tag ingestion

Enable Databricks Unity Catalog tag ingestion

Manually link catalogs to data sources

Manually sync external catalog tags

Reference Guides

External Catalog Introduction

Supported external catalogs

Architecture

Alation

Collibra

How Immuta gets metadata from Collibra

Limitations

Microsoft Purview catalog

How Immuta gets metadata from Microsoft Purview

Limitations

Custom REST catalog

Databricks Unity Catalog tag ingestion

Snowflake tag ingestion

External catalog behaviors

Resources

Custom REST Catalog Interface Introduction

Section Contents

Custom REST Catalog Interface Endpoints

Architectural Overview

General Concepts

Tags in Immuta

Descriptions in Immuta

Authentication

Endpoint Specification

GET /tags

Overview

Request

Response

POST /dataSource

Overview

Request

Response

GET /dataSource/page/{id}

Overview

Request

Response

GET /column/{id}

Overview

Request

Response

Data Discovery

Architecture

Components

Identification framework

Global framework

Identifier

Supported criteria types for identifiers

GET `/tags`

POST `/dataSource`

GET `/dataSource/page/{id}`

GET `/column/{id}`