Discover scans your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
This guide illustrates how to implement sensitive data discovery and classification.
This reference guide discusses the components and benefits of Immuta Discover.
This reference guide describes the design of Immuta Discover.
Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
The guides in this section discuss the components of SDD and how to use it to tag your data.
Classification is the process in which data is categorized by the content and the associated risk level based on context. The guides in this section illustrate how to configure and customize classification for your organization.
Requirements:
Native SDD enabled and turned on
Immuta permission GOVERNANCE
Click the Discover icon in the navigation menu and select the Frameworks tab.
Click Create New.
Enter a Name for the framework.
Enter a Description for the framework.
Select the option to Create empty framework.
Click Create.
After you create the framework, you can create new rules for it.
Click the Discover icon in the navigation menu and select the Frameworks tab.
Click Create New.
Enter a Name for the framework.
Enter a Description for the framework.
Select the option to Create rules from an existing framework.
Select the checkbox for the framework you want to copy. You can only copy a single framework. For more information about a framework, click the framework name to open a new tab with details about the framework.
Click Create.
To assign a framework to run on specific data sources,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to assign and navigate to the Data Sources tab.
Click Add Data Sources.
Select the checkbox for the data source you want this framework to run on. You may select more than one.
Click Add Data Source(s).
After a data source is removed from a framework, it will use the global framework for any SDD scans and the tags applied by the removed framework will be replaced. To remove data sources from a framework,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to remove data sources from and navigate to the Data Sources tab.
Select the checkbox for the data source you want to remove from the framework. You may select more than one.
Select the Bulk Actions more options menu.
Select Remove Data Sources.
Click Confirm.
Deleting a framework will remove it from any data sources. Those data sources will then use the global framework for any SDD scans, and the tags applied by the deleted framework will be replaced. Governors can delete any framework; users with the CREATE_DATA_SOURCE or CREATE_DATA_SOURCE_IN_PROJECT permission can only delete frameworks they created. To delete a framework,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Click the three dot menu in the Action column for the framework you want to delete. Note that the global framework cannot be deleted. If you want to delete it, configure a different framework as the global framework.
Select Remove.
Click Confirm.
Discover automates discovering and tagging data across your data platform. It encompasses the identification and classification of data using frameworks.
Native SDD enabled
The Immuta UI has separate sections for identification frameworks and classification frameworks. Both frameworks are made of rules, criteria, and resulting tags, but the criteria types differ for each framework type. Identification frameworks use competitive pattern matching and column name matching to discover data types and tag them. Classification frameworks use tags on the column, neighboring columns, and data source for context and then tag the columns based on that context. Find more information about each framework type below.
Identification frameworks run with sensitive data discovery (SDD). They use data patterns to discover data and tag it based on what the data is.
Competitive pattern analysis: This criteria is a process that reviews all the regex and dictionary patterns within the rules of the framework and searches for the pattern with the best fit. In this review, all the competitive pattern analysis criteria in the framework compete against each other to find the most specific pattern that fits the data. The resulting tags for the best pattern's rule are then applied to the column.
Regex pattern: This pattern contains a case-insensitive regular expression that searches for matches against column values. Create a regex pattern in the UI or with the sdd/classifier endpoint (a sketch of calling this endpoint follows this list).
Dictionary pattern: This pattern contains a list of words and phrases to match against column values. Create a dictionary pattern in the UI or with the sdd/classifier endpoint.
Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.
Column name pattern: This pattern includes a case-insensitive regular expression matched against column names, not against the values in the column. Create a column name pattern in the UI or with the sdd/classifier endpoint.
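The sdd/classifier endpoint referenced in this list can also be called directly. The sketch below shows one way a custom regex pattern might be registered with it; the endpoint path comes from this guide, while the tenant URL, API key, payload field names, and authentication header are assumptions included only to illustrate the shape of the call.

```python
# Minimal sketch of creating a custom regex pattern with the sdd/classifier
# endpoint referenced above. The endpoint path comes from this guide; the
# payload field names (name, type, config.regex) and the auth header are
# assumptions -- confirm them against your Immuta API reference.
import requests

IMMUTA_URL = "https://your-immuta-tenant.example.com"  # hypothetical tenant URL
API_KEY = "your-api-key"                               # hypothetical API key

payload = {
    "name": "EMPLOYEE_ID",                # hypothetical pattern name
    "description": "Internal employee IDs of the form EMP-123456",
    "type": "regex",                      # assumed type value for a regex pattern
    "config": {"regex": r"EMP-\d{6}"},    # RE2-compatible expression
}

resp = requests.post(
    f"{IMMUTA_URL}/sdd/classifier",
    json=payload,
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()
print(resp.json())
```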
To start using identification frameworks in the UI, see the Getting started guide.
To manage identification frameworks with the API, see the /sdd/template endpoint reference guide.
Classification frameworks run with the classify service. They determine rule match and criteria fit based on proximity tags and then tag data based on the context it is within.
Match column tag: This criteria applies resulting tags based on specific tags already on the column.
Match neighboring column tag: This criteria applies resulting tags based on specific tags on neighboring columns. A toy sketch of this tag-matching logic follows below.
To manage classification frameworks in the UI, see the Activate frameworks guide.
To create a classification framework with the API, see the /frameworks endpoint reference guide.
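To make the two criteria above concrete, the toy sketch below shows the general shape of tag-based matching: a rule's resulting tags are applied when the tags it requires appear on the column or on its neighbors. This is an illustration only, not Immuta's implementation; the rule structure and the DSF.* tag names are hypothetical.

```python
# Toy illustration of classification-style tag matching. The rule structure
# and resulting tag names are hypothetical; Immuta's classify service applies
# this kind of logic internally against real framework rules.
from typing import List, Set

def apply_rules(column_tags: Set[str], neighbor_tags: Set[str], rules: List[dict]) -> Set[str]:
    """Return the resulting tags whose rule criteria match the column's context."""
    resulting: Set[str] = set()
    for rule in rules:
        column_match = set(rule.get("columnTags", [])) <= column_tags
        neighbor_match = set(rule.get("neighborTags", [])) <= neighbor_tags
        if column_match and neighbor_match:
            resulting |= set(rule["resultingTags"])
    return resulting

rules = [
    # Match column tag: the column itself carries a person-name tag.
    {"columnTags": ["Discovered.Entity.Person Name"], "neighborTags": [],
     "resultingTags": ["DSF.Personal"]},
    # Match neighboring column tag: a health code appears in a sibling column.
    {"columnTags": [], "neighborTags": ["Discovered.Entity.ICD10 Code"],
     "resultingTags": ["DSF.Health"]},
]

print(apply_rules({"Discovered.Entity.Person Name"},
                  {"Discovered.Entity.ICD10 Code"}, rules))
# e.g., {'DSF.Personal', 'DSF.Health'}
```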
Private preview: This feature is only available to select accounts.
The data inventory dashboard visualizes information about your organization's data. It presents your entire data corpus within the context of the frameworks you have actively tagging your data with details like when your data was scanned last or how much of the scanned data is relevant to your active frameworks.
In the data inventory dashboard, you will see tiles for scanned coverage and the percent of data scanned within a specific time frame. These tiles reference data scanned by an identification framework with SDD. To increase the number of your data sources that have been scanned, run SDD.
The next section of the dashboard shows tiles for the compliance frameworks. Each graph separates the columns found to contain the data relevant to the compliance framework from those that do not. These graphs update every time classification runs; the events that trigger classification are listed in the workflow section below.
For information on the frameworks visualized in the dashboard, see the Immuta frameworks reference guide.
The Discover workflow involves both identification with SDD and classification:
A user with the GOVERNANCE permission enables SDD and activates classification frameworks.
Users register data in Immuta.
SDD runs:
Immuta generates a SQL query using the identification framework's rules.
That query is executed in the native database.
Immuta receives the query results containing the column name and the matching rules but no raw data values.
SDD applies the resulting tags to the relevant columns.
Classification runs:
The data source's current tags are checked against the framework's rules.
When a matching rule is found, the resulting tags are applied to the relevant columns.
Users with the GOVERNANCE permission or data owners can view the data inventory dashboard with visualizations of their scanned data.
This workflow will run when a new data source is manually registered in Immuta or found from schema monitoring. Additionally, SDD alone will run from the following events:
A new data source is created.
Schema monitoring is enabled, and a new data source is detected.
Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed.
A user manually triggers it from the data source health check menu.
A user manually triggers it from the identification frameworks page.
A user manually triggers it through the API.
Classification will run from the following events:
A framework gets created, updated, or deleted.
A tag gets added to or removed from a column manually or by SDD.
A tag gets added to a data source.
A user manually triggers it from the data source health check menu.
A user manually triggers it through the API.
Customizing classification frameworks currently requires users to use the Immuta API.
Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using frameworks, rules, and patterns, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
Native SDD supports data discovery on data sources from the following technologies:
Starburst (Trino): Native SDD for Starburst (Trino) is currently in public preview and available to all accounts. Enable this feature on the Immuta app settings page.
Redshift: Native SDD for Redshift is currently in private preview and available to all accounts. Please reach out to your Immuta representative to enable it on your tenant.
To evaluate your data, SDD generates a SQL query using the identification framework's rules; the Immuta system account then executes that query in the native technology. Immuta receives the query result, containing the column name and the matching rules but no raw data values. These results are then used to apply the resulting tags to the appropriate columns.
This evaluating and tagging process occurs when SDD runs, which happens automatically from the following events:
A new data source is created.
Schema monitoring is enabled and a new data source is detected.
Column detection is enabled and new columns are detected. Here, SDD will only run on new columns and no existing tags will be removed or changed.
Users can also manually trigger SDD to run from a data source's overview page or the identification frameworks page.
Sensitive data discovery (SDD) runs frameworks to discover data. These frameworks are a collection of rules. These rules contain a single criteria and the resulting tags that will be applied when the criteria's conditions have been met. See the sections below for more information on each component.
An identification framework is a collection of rules that will look for a particular criteria and tag any columns where those conditions are met. While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in Default Framework, which contains all the built-in patterns and assigns the built-in Discovered tags based on pattern matching.
For a how-to on the framework actions users can take, see the Manage frameworks page.
Each organization has a single global framework that will apply to all the data sources in Immuta by default, unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. Users can bypass this global framework by applying a specific framework to a set of data sources.
A rule is a criteria and the resulting tags to apply to data that matches the criteria. When Immuta recognizes that criteria, it can tag the data to describe the type. Each rule is specific to its own framework, but all a framework's rules can be copied to create a new framework.
For a how-to on the rule actions users can take, see the Manage rules page.
Criteria are the conditions that need to be met for resulting tags to be applied to data.
Competitive pattern analysis: This criteria is a process that will review all the regex and dictionary patterns within the rules of the framework and search for the pattern with the best fit. If there are multiple rules in a framework using competitive pattern analysis, only one will be applied to any column. To learn more about the competitive nature, see the How competitive pattern analysis works guide.
Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.
A pattern is the type of data Immuta will look for to meet the requirements to tag a column. They can be used in rules across multiple frameworks, but can only be used once within each framework. Immuta comes with built-in patterns to discover common categories of data. These patterns cannot be modified and are within preset rules with preset tags. Users can also create their own unique patterns to find their specific data. SDD only supports regex patterns written in RE2 syntax.
The three types of patterns are described below:
Regex: This pattern contains a case-insensitive regular expression that searches for matches against column values.
Column name: This pattern includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.
Dictionary: This pattern contains a list of words and phrases to match against column values.
Only application admins can enable sensitive data discovery (SDD) globally on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.
When SDD is manually triggered by a data owner, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted Discovered tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
The amount of time it takes to run identification on a data source depends on several factors:
Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.
The time it takes to run SDD for all newly onboarded data sources in Immuta is not limited by SDD performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.
For users interested in testing SDD, note that the built-in patterns by Immuta require a certain amount of confidence to be assigned to a column. This means that with synthetic data, there may be situations where the data is not real enough to fit the confidence needed to match patterns. To test SDD, use a dev environment, create copies of your tables, or use the API to run a dryRun and see the tags that would be applied to your data by SDD.
Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the Default Framework, then when the pattern is matched the column will not be tagged. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.
Data regex: Applies to text string columns. Case-sensitive.
Column name regex: Applies to any column. Not case-sensitive.
Dictionary: Applies to text string columns. Case sensitivity can be toggled in the identifier definition.
Immuta compiles dictionary patterns into a regex that is sent in the body of a query.
For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.
For Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary cost if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
Native SDD for Databricks Unity Catalog will only work on data sources authenticated with a personal access token (PAT). OAuth machine-to-machine (M2M) is not supported with SDD.
Native SDD will only work on Starburst (Trino) data sources authenticated with username and password. OAuth 2.0 is not supported with SDD.
Redshift Spectrum is not supported with native SDD.
Username and password is fully supported with native SDD.
Okta is not supported with native SDD.
AWS access key is supported with limitations with native SDD:
The AWS access key used to register the data source must be able to perform, at a minimum, the following redshift-data API actions:
redshift-data:BatchExecuteStatement
redshift-data:CancelStatement
redshift-data:DescribeStatement
redshift-data:ExecuteStatement
redshift-data:GetStatementResult
redshift-data:ListStatements
The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database used to onboard the data sources.
If using a custom URL, the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options.
Redshift Serverless data sources are not supported for native SDD with the AWS access key authentication method.
These limitations are only relevant to users who have previously enabled and run Immuta SDD.
If you had legacy SDD enabled, running native SDD can result in different tags being applied because native SDD is more accurate and has fewer false positives than legacy SDD. Running a new SDD scan against a table will change the context of the resulting tags, but no Discovered tags previously applied by legacy SDD will be removed.
See the Migrate from legacy to native SDD page for more information.
Immuta allows you to automate discovering and tagging data across your data platform. Tagging is critical for two reasons:
It allows you to define data sensitivity, which in turn allows you to monitor where you have potential data security issues and gaps in your security posture.
It allows you to abstract your physical structure from your access policy logic. For example, you can build access policies like mask all columns tagged PII (where PII was auto-tagged by Discover) rather than much less scalable policies that must be knowledgeable of your physical layers like mask column x in database y in data platform z.
Today’s sensitive data discovery tools give you a shallow overview of your data corpus across a long list of platforms. They give you pointers on where you have sensitive data without the granularity to drive your column- or row-level access controls. They help you understand what data you possess according to a regulatory framework, like HIPAA or PCI, but without the details needed to automate your audits or compliance reporting. Knowing that you need to drive east to west on a road map from New York to California is helpful but ultimately insufficient to get you from a specific location to another.
Existing tools promise a high degree of automation, yet their many false positives result in painful manual work that never stops. Although data gets scanned automatically, performance breaks down at scale, or you manually need to fine-tune the computing resources of the scanners. Last but not least, your security team objects to the agent-based processing that requires taking data out of your data platform, and the associated data residency concerns may give you pause.
At Immuta, we believe that data security should not be painful. We believe that you can innovate and move quickly, while at the same time protecting your data and adhering to your internal policies and external regulations. Technology and automation allow you to make the right trade-off decisions quickly. It all starts with highly accurate and actionable metadata. If you trust your metadata and if it’s actionable, you can leverage it to automatically grant access to data, mask sensitive information, and automate your audit reporting.
Immuta Discover was built to tackle those challenges and address them through a unique architecture that was designed in collaboration with the largest financial institutions, healthcare companies, and government agencies in the world. The cloud and AI paradigm requires a fundamentally different approach. You must assume that your data is dynamic, unique, and collected in a multitude of different geographies and legal jurisdictions. Immuta Discover is built for this new world and its specific demands.
Identifying and classifying data requires analyzing and looking at the data - there’s no way around it. Immuta Discover does all the analysis and processing inside the native technology. It takes advantage of those platforms’ inherent scalability to enable you to analyze large amounts of data quickly, efficiently, and without the need for separate resource optimization for containers or virtual machines.
By processing data directly inside the data platform, Immuta Discover automatically adheres to data residency and locality requirements. If you run your data warehouse or lake globally - across North America, the European Union, and Asia - Immuta processes the data in the region where your data is stored. No data ever leaves the data platform, and it will never move across different cloud regions.
In-platform processing greatly reduces risk and improves your data security posture. Provisioning agents, whether in a container, virtual machine, or Amazon Machine Image (AMI), creates complexity and unnecessary security risk. Not only can those agents become compromised, but their misconfiguration might lead to data leaks to other parts of your cloud infrastructure. An agentless approach can better leverage data platform optimizations to process data instead of transferring it out to re-optimize and analyze. This simplifies operations and increases efficiency for your infrastructure teams.
The advantages of in-platform processing are abundant, but implementing it across a multitude of platforms is challenging. Immuta helps bypass the obstacles by doing all the heavy lifting for you and building in specific implementations for each technology. Although all those implementations are ultimately different, Immuta abstracts the results to one standardized taxonomy, so you can have consistently accurate and granular metadata across all your data stores.
Immuta Discover classifies data on a column level and instantaneously identifies schema changes. Only with that level of granularity and automation can you adhere to your audit requirements and understand what actions have been taken on your data. For example, if non-sensitive data is joined with sensitive data at query time, Immuta Discover will monitor and record that for your review. Continuous schema monitoring ensures schema changes never result in holes in your access controls and data security posture.
Trust in your metadata is critical for data security.
To unblock your data consumers, you need to automate your data access controls; this requires trusting that your classification and metadata are accurate and actionable. Immuta Discover provides you with highly accurate metadata and tags out-of-the-box and assists you in fine-tuning the classification mechanism to deal with false positives quickly. That enables you to build policies that dynamically grant or restrict access to protected data (like PHI or PII) depending on who is accessing it and what protections you want to apply.
Immuta Discover works in three phases: identification, categorization, and classification.
Identification: In this first phase, data is identified by its kind – for example, a name or an age. This identification can be manually performed, externally provided by a catalog, or automatically determined by Immuta Discover through column-level analysis of patterns.
Categorization: In the second phase, data is categorized in the context of where it appears, subject to any active data compliance or security frameworks. For example, a record occurring in a clinical context containing both a name and individual health data is protected health information (PHI) under HIPAA.
While every phase can and should be customized, for categorization Immuta provides a bundle of default frameworks. The generic Data Security Framework provides the base for the specific frameworks and gives fine-grained categorization of your data into a consistent set of security and compliance concepts. This categorization of data helps to understand the context it is in, including information like whether or not a record pertains to an individual, the composition and kinds of identifiers present, the data subject, whether the data belongs to any controlled data categories under certain legislation, etc.
The categorization provided by the Immuta classification frameworks may be used out-of-the-box; however, they are best leveraged as a starting point for purpose-built compliance frameworks implementing organization-specific compliance categories.
Classification: In the third and final phase, data is classified according to its sensitivity level (e.g., Customer Financial Data is Highly Sensitive) and the risk associated with the data subject. Immuta supplies sensitivity level defaults in Detect and risk assessment default tags based on standard industry practice. However, customers are free to customize the assignments under their respective views.
Requirements:
Native SDD enabled and turned on
Immuta permission GOVERNANCE
Click the Discover icon in the navigation menu and select the Patterns tab.
Click Create New.
In the modal, enter a name for the new pattern.
Write a Description for the type of data the pattern will find.
Select the Type of pattern.
For regex and column name regex, enter the regex.
For dictionary, enter the values you want the pattern to match and toggle the switch on if you want them to be case-sensitive.
Click Create Pattern.
See the Manage rules page to add your new pattern to a framework.
Note that all user-created patterns must be a 90% match or greater for the contents of the column to be tagged.
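As a rough way to anticipate the 90% threshold noted above, the sketch below computes the fraction of sampled values a candidate regex matches. Immuta's own sampling and scoring are internal, so this is only a local sanity check while drafting a pattern; the regex and sample values are hypothetical.

```python
# Rough local gauge of the 90% match threshold noted above. Immuta's own
# sampling and scoring are internal, so treat this only as a sanity check
# while drafting a pattern. The regex and sample values are hypothetical.
import re

candidate = re.compile(r"EMP-\d{6}", re.IGNORECASE)  # case-insensitive, RE2-compatible subset

sampled_values = ["EMP-004211", "emp-990001", "EMP-123456", "unknown", "EMP-777777",
                  "EMP-000001", "EMP-314159", "EMP-271828", "EMP-161803", "EMP-999999"]

matches = sum(1 for value in sampled_values if candidate.fullmatch(value))
match_rate = matches / len(sampled_values)
print(f"{match_rate:.0%} of sampled values match")  # 90% here: 9 of 10 values match
```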
Editing a pattern will affect any rule built off the pattern throughout Immuta. To edit a pattern,
Click the Discover icon in the navigation menu and select the Patterns tab.
Click the name of the pattern you want to edit.
Click Edit.
Edit the field you want to change. Note that any shadowed field is not editable; the pattern must be deleted and re-created to change it.
Click Save.
Built-in patterns cannot be edited.
Deleting a pattern will remove it from Immuta and remove all the rules that relied on it in the frameworks throughout Immuta. To delete a pattern,
Click the Discover icon in the navigation menu and select the Patterns tab.
Click the three dot menu in the Action column for the pattern you want to delete.
Select Remove.
Click Confirm.
Built-in patterns cannot be deleted.
Requirement: Immuta permission GOVERNANCE
This how-to guide is for enabling sensitive data discovery (SDD). For additional information on sensitive data discovery and classification, see the Discover architecture page.
Navigate to the App Settings page and scroll to the Sensitive Data Discovery section.
Select the Enable Sensitive Data Discovery (SDD) checkbox to enable SDD.
Click Save and then click Confirm to apply your changes. Note that the Immuta tenant will have a system restart.
Run SDD for a select group of data sources using one of the following options:
Make the following request using the Immuta API, specifying the data sources in the payload.
A successful request will have the code 200 and a body with the number of jobs created from the request:
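The sketch below illustrates one way such a request might look. The endpoint path (/sdd/run), the dataSourceIds field name, the tenant URL, and the authentication header are all assumptions; confirm them against your Immuta API reference.

```python
# Sketch of triggering SDD for specific data sources with the Immuta API.
# The endpoint path (/sdd/run) and the payload field names are assumptions;
# confirm them against your tenant's API reference. The response is expected
# to carry the number of jobs created, per the guide above.
import requests

IMMUTA_URL = "https://your-immuta-tenant.example.com"  # hypothetical tenant URL
API_KEY = "your-api-key"                               # hypothetical API key

resp = requests.post(
    f"{IMMUTA_URL}/sdd/run",              # assumed endpoint path
    json={"dataSourceIds": [12, 15]},     # hypothetical data source IDs and field name
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()                   # expect HTTP 200 on success
print(resp.json())                        # body includes the number of jobs created
```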
Navigate to the data source overview page of the data source you listed in the payload.
Click the Data Dictionary tab.
Assess whether the Discovered and classification tags applied are accurate.
If they are, then repeat the steps above for more of your data sources. Once a majority of your data sources appear to have accurate tags, run SDD on all your data sources. If the tags are not accurate, you will need to tune SDD and classification frameworks. See the Adjust frameworks and tags guide for instructions.
Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run SDD and then select it again in the modal.
Requirement: Immuta permission GOVERNANCE
Make the following request using the Immuta API to run SDD for all data sources, specifying all as true:
A successful request will have the code 200 and a body with the number of jobs created from the request:
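A sketch of that request, under the same assumptions as the earlier example (the /sdd/run path, tenant URL, and authentication header are not confirmed here; only the all parameter comes from this guide):

```python
# Sketch of triggering SDD for all data sources, with "all" set to true as
# described above. The endpoint path and auth header are the same assumptions
# as in the earlier sketch; confirm them against your API reference.
import requests

IMMUTA_URL = "https://your-immuta-tenant.example.com"  # hypothetical tenant URL
API_KEY = "your-api-key"                               # hypothetical API key

resp = requests.post(
    f"{IMMUTA_URL}/sdd/run",            # assumed endpoint path
    json={"all": True},                 # "all": true runs SDD on every data source
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()                 # expect HTTP 200 on success
print(resp.json())                      # body includes the number of jobs created
```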
Discover scans your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
Native SDD enabled and turned on
Immuta permission GOVERNANCE
Sensitive data discovery (SDD) is an Immuta Discover feature that scans your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
To learn more, see the Data discovery page.
Enable sensitive data discovery to start using the default framework on all of your registered data sources. This out-of-the-box framework discovers common data types and tags them automatically when a new data source is registered.
For additional control, create your own patterns to recognize the data that matters to you. Add these patterns to new frameworks and specify the data sources that need this framework. This fine-level control creates automatic tagging that is relevant and accurate to your data, requiring fewer manual adjustments to the resulting tags.
Customize SDD for your data:
If you have any tags that are applied to your data sources by SDD that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again if SDD is re-run.
Reference pages:
Immuta comes with a default framework containing built-in Discovered tags and built-in patterns. These patterns and tags can be used in your own frameworks.
Classification is an Immuta Discover feature that categorizes your data based on the content and the associated risk the data poses. This increases your understanding of your data and allows you to make faster decisions about it.
Enable classification from the Immuta app settings page.
Activate any of the following frameworks:
To start seeing classification tags, enable the Data Security Framework.
If you are using Snowflake and want to see information on your sensitive data in Detect, enable the Risk Assessment Framework.
Opt to enable any of the other compliance frameworks.
Complete the following steps for each framework you want to activate:
Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Activate.
To configure or manage a framework using the Immuta API, see the Frameworks API reference page.
If you have any tags that are applied to your data sources by classification that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again when classification is re-run.
Requirement: Immuta permission APPLICATION_ADMIN
Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Enter the request-friendly name of your global template in the Global SDD Template Name field. This name can be found in the tooltip on the framework's detail page.
Click Save, and then Confirm your changes.
Requirements:
Native SDD enabled and turned on
Immuta permission GOVERNANCE
You can only have one rule per pattern in the framework. If you do not see the pattern for the rule you want to create, then it already has a rule built off of it.
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to edit and navigate to the Discovery Rules tab.
Click Create New.
Select the Tags to apply from the dropdown. The tags you select are the tags applied when the pattern is matched. Note that resulting tags must be under the Discovered parent tag and cannot be parent tags themselves unless they have already been manually applied to a data source.
Select the Criteria type from the dropdown:
Competitive pattern analysis is for regex and dictionary patterns.
Column name is for column name patterns.
Select the Pattern from the dropdown.
Click Create Rule.
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework of the rule you want to edit and navigate to the Discovery Rules tab.
Select the rule you want to edit.
Click Edit.
Edit the field you want to change. Note that any shadowed field is not editable; the rule must be deleted and re-created to change it.
Click Save.
Deleting a rule removes the tags that rule applied the next time SDD runs on a data source. To delete a rule,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to edit and navigate to the Discovery Rules tab.
Click the three dot menu in the Action column for the rule you want to delete.
Select Remove.
Click Confirm.
Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See the list of built-in Discovered tags below and the built-in patterns reference for information about where these tags will be applied by the built-in rules.
All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered.Country.Argentina.
All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered.Entity.Aadhaar Individual.
None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered.Identifier Direct.
None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered.PCI.
Requirements:
Native SDD enabled and turned on
Registered data sources
Immuta permission GOVERNANCE
SDD runs automatically, but if you want to re-run SDD when a new global framework is set or when new rules have been added, you can re-run it for specific frameworks through the UI:
Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run SDD and then select it again in the modal.
SDD runs automatically, but if you want to re-run SDD when a new global framework is set or when new rules have been added, you can re-run it for specific data sources through the UI:
Navigate to the data source overview page.
Click the health status.
Select Re-run next to Sensitive Data Discovery (SDD).
Verify discovered tags
If sensitive data discovery has been enabled, then manually adding tags to columns in the data dictionary will be unnecessary in most cases. The data owner will just need to verify that the Discovered tags are correct.
If a governor, data owner, or data source expert disables a Discovered tag from the data dictionary, the column will not be re-tagged when that data source's fingerprint is recalculated or SDD is re-run. When a Discovered tag is disabled, the tag will not completely disappear, so it can be manually enabled through the tag side sheet.
To disable a discovered tag,
Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click the tag you want to remove.
Click Disable in the side sheet and then click Confirm.
This guide provides information and best practices for migrating from the deprecated legacy sensitive data discovery (SDD) option to the improved native SDD. This guide is for users who have already enabled SDD on their tenant and have Discovered tags on their data sources.
Legacy SDD is deprecated. It will be removed and replaced by native SDD. Native SDD is significantly improved from legacy SDD for discovering and tagging your data with upgrades to the built-in patterns. Additionally, the greatest benefit is the respect for data residency. Native SDD doesn't move any of your data when running. The discovery is done right in your data platform, and the platform only returns the matching patterns and column names to Immuta.
See the reference guides in this section for more information on native SDD.
Native SDD requires Snowflake, Databricks, Redshift, or Starburst (Trino) data sources
Legacy SDD enabled on your tenant
Legacy SDD tags applied to your data sources: To find out if you have legacy SDD tags applied, create a governance report as described in the section below.
Contact your Immuta representative to enable native SDD on your Immuta tenant. Note that unless specifically disabled, all Immuta installations after the 2024.2 LTS have native SDD automatically enabled. Proceed to the next section if you want to check for yourself whether native SDD is already running and tagging your data before you reach out to the representative.
This action will not change anything immediately on your tenant; however, anytime SDD runs in the future, it will be native SDD instead of the legacy version.
To assess native SDD for your data, proceed with the steps below. If you do not review native SDD, the legacy SDD tags will all remain on your data source columns. However, when SDD runs on new data sources and columns, it will apply native SDD tags, and because of the improvements to SDD, it may tag different data than legacy SDD.
Requirement: Immuta permission GOVERNANCE
To check the tags on an individual data source, navigate to the data source data dictionary and select a Discovered tag. On the tag side sheet, you can determine the context of the tag. When patterns match data, native SDD will apply tags, and their tag context will be Sensitive Data Discovery. Any tags with the context Legacy Sensitive Data Discovery were not matched by native SDD but will remain on the data source.
To check your tags globally, navigate to the governance reports page and build a report for sensitive data discovery. This report will present the legacy tags on your data sources' columns and native SDD tags that are also on those columns. Use this report to assess the context of the Discovered tags and understand if native SDD is matching the data you want it to.
These actions will allow you to understand the differences between how native SDD and legacy SDD tag your data and whether your data is recognized as expected by native SDD or if legacy SDD was over-tagging your data. This way you can better tune SDD to your data.
If there are any legacy SDD tags that you want native SDD to catch, you need to tune native SDD so that this type of data is discovered in future tables and columns; see guidance on that in the next section.
Requirement: Immuta permission GOVERNANCE
Using the report you built above, complete these actions to tune SDD:
Focus on a legacy SDD tag that was properly applied to your data. Assess whether the native SDD tag applied to the column instead is more accurate than the legacy tag. If the native tag is applied incorrectly, proceed to the next step.
Complete the steps above for all legacy SDD tags.
Completing the actions above will create parity between what legacy SDD was tagging your data and what native SDD will tag in the future.
In previous documentation, rule and pattern are referred to as classifier or identifier. The language is being updated to rule and pattern to be more accurate and to avoid conflating these terms with other concepts.
Immuta comes with a set of built-in patterns that look for common data types. These patterns were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can build their own rules using these built-in patterns, which will customize the resulting tags based on the organization's needs.
When using SDD with classification frameworks, it is recommended to use the default resulting tags listed in the table below for these built-in patterns. This ensures that the framework rules apply sensitivity tags as intended.
Trigger SDD to run native SDD on your data sources.
Create a new pattern to discover this data. Ensure it is specific and will match your data with 90% confidence.
Create a new rule in your framework using the new pattern and the Discovered tag you want applied to the data.
Retest your updated rules and patterns by re-running SDD, and continue refining to the level of accuracy you want.
Aadhaar Individual
This tag is for Aadhaar Individual numbers.
Adoption Taxpayer ID Number
This tag is applied to data recognized as a United States Adoption Taxpayer Identification number.
Age
This tag is applied to data recognized as an age.
Bank Account
This tag is for bank account numbers.
Bank Routing MICR
This tag is applied to data recognized as an American Bankers Association routing number.
Bankers CUSIP ID
This tag is for CUSIP identification numbers for stocks and bonds.
British Columbia Health Network Number
This tag is applied to data recognized as British Columbia's Personal Health Number.
BSN Number
This tag is for Netherlands citizen service numbers.
CDC Number
This tag is for CDC numbers.
CDI Number
This tag is for CDI numbers.
CIC Number
This tag is for CIC numbers.
CNI
This tag is applied to data recognized as a French National ID card number.
CPF Number
This tag is applied to data recognized as Brazil's CPF number.
CPR Number
This tag is applied to data recognized as Denmark's Personal Identification number.
Credit Card Number
This tag is applied to data recognized as a credit card number.
CURP Number
This tag is for Mexican CURP numbers.
CRYPTO
This tag is applied to data recognized as a Bitcoin Invoice Address.
Date
This tag is applied to data recognized as a date.
Date of Birth
This tag is applied to data recognized as a date of birth.
DEA Number
This tag is applied to data recognized as the DEA number of a healthcare provider.
DNI Number
This tag is applied to data recognized as an Argentina National Identity number.
Domain Name
This tag is applied to data recognized as a domain.
Driver's License Number
This tag is applied to data recognized as driver's license numbers from Germany or the United Kingdom.
Electronic Mail Address
This tag is applied to data recognized as an email address.
Employer ID Number
This tag is applied to data recognized as an Employer Identification number from the United States.
Ethnic Group
This tag is applied to data recognized as an ethnic group.
FDA Code
This tag is applied to data recognized as the code of a drug or ingredient registered with the FDA.
Gender
This tag is applied to data recognized as a gender.
GST Individual
This tag is for Indian GST individual numbers.
Healthcare NPI
This tag is applied to data recognized as a United States National Provider Identifier number.
IBAN Code
This tag is applied to data recognized as an International Bank Account number.
ICD10 Code
This tag is applied to data recognized as an ICD10 code from the International Statistical Classification of Diseases and Related Health Problems.
ICD9 Code
This tag is for ICD9 codes from the International Statistical Classification of Diseases and Related Health Problems.
ID Number
This tag is for any ID number.
Identity Card Number
This tag is applied to data recognized as an identity card number from Germany.
IMEI
This tag is applied to data recognized as an International Mobile Equipment Identity number.
Individual Number
This tag is for any individual number.
Individual Taxpayer ID Number
This tag is applied to data recognized as a United States Individual Taxpayer Identification Number.
IP Address
This tag is applied to data recognized as an IP address.
Location
This tag is applied to data recognized as a country, state, address, or municipality.
MAC Address
This tag is applied to data recognized as a Media Access Control address.
MAC Address Local
This tag is applied to data recognized as a local Media Access Control address.
Medicare Number
This tag is applied to data recognized as a Medicare number from Australia.
National Health Service Number
This tag is for national health service numbers.
National ID Card Number
This tag is applied to data recognized as a national ID card number from Belgium.
National ID Number
This tag is applied to data recognized as a national ID number from Finland, Sweden, and Thailand.
National Insurance Number
This tag is applied to data recognized as a United Kingdom national insurance number.
National Registration ID Number
This tag is for national registration ID numbers.
NI Number
This tag is for Norway NI numbers.
NIE Number
This tag is applied to data recognized as a Spanish Foreigner Identification number.
NIF Number
This tag is applied to data recognized as a Spanish Tax Identification number.
NIK Number
This tag is applied to data recognized as an Indonesian personal identification number (NIK).
NIR
This tag is applied to data recognized as France's National ID number.
Ontario Health Insurance Number
This tag is applied to data recognized as part of an Ontario Health Insurance Plan string.
PAN Individual
This tag is for PAN Individual numbers.
Passport
This tag is applied to data recognized as a passport number from Australia, Canada, France, Spain, Sweden, and the United States.
Person Name
This tag is applied to data recognized as people's names.
PESEL Number
This tag is for Poland PESEL numbers.
Postal Code
This tag is applied to data recognized as a United States zip code.
Preparer Taxpayer ID Number
This tag is applied to data recognized as a Preparer Taxpayer ID number.
Quebec Health Insurance Number
This tag is applied to data recognized as a Quebec Health Insurance Number.
Resident ID Number
This tag is for China Resident ID numbers.
RRN
This tag is for Korea Resident Registration numbers.
Social Insurance Number
This tag is applied to data recognized as a social insurance number.
Social Security Number
This tag is applied to data recognized as a United States Social Security Number.
State
This tag is applied to data recognized as a state of the United States.
Swift Code
This tag is applied to data recognized as a SWIFT code.
Tax File Number
This tag is applied to data recognized as a tax file number.
Taxpayer ID Number
This tag is applied to data recognized as Taxpayer ID numbers from the United States.
Taxpayer Reference
This tag is applied to data recognized as United Kingdom Taxpayer Reference numbers.
Telephone Number
This tag is applied to data recognized as a phone number.
Tollfree Telephone Number
This tag is applied to data recognized as a United States toll-free phone number.
URL
This tag is applied to data recognized as a URL.
Vehicle Identifier or Serial Number
This tag is applied to data recognized as a VIN.
Identifier Direct
This tag is applied to data recognized as a direct identifier that can be uniquely associated with an individual. Examples of direct identifiers include: name, username, email, official individual identification numbers such as passport or identity card numbers, or privately issued individual identification numbers such as a student ID.
Identifier Indirect
This tag is applied to data recognized as an indirect identifier that is not uniquely associated with an individual. However this indirect identifier could become distinguishable when combined with other attributes. Examples of indirect identifiers include: age and affinity.
Identifier Undetermined
This tag is applied to data which could be an identifier associated with an individual.
PCI
This tag is applied to data recognized as payment card information.
PHI
This tag is applied to data recognized as personal health data.
PII
This tag is applied to data recognized as personally identifiable information.
Argentina
This tag is applied to data recognized as specific to Argentina (e.g., an Argentina National Identity Number).
Australia
This tag is applied to data recognized as specific to Australia (e.g., an Australian Medicare number or Australian passport number).
Belgium
This tag is applied to data recognized as specific to Belgium (e.g., a Belgium National ID card).
Brazil
This tag is applied to data recognized as specific to Brazil (e.g., a Brazil CPF number).
Canada
This tag is applied to data recognized as specific to Canada (e.g., a British Columbia PHN, OHIP string, Canadian passport number, or Quebec's HIN).
Chile
This tag is for data specific to Chile.
China
This tag is for data specific to China.
Colombia
This tag is for data specific to Colombia.
Denmark
This tag is applied to data recognized as specific to Denmark (e.g., a Denmark CPR or Person-number).
Finland
This tag is applied to data recognized as specific to Finland (e.g., a Finland National ID number).
France
This tag is applied to data recognized as specific to France (e.g., a French National ID card number, France National ID number, or French passport number).
Germany
This tag is applied to data recognized as specific to Germany (e.g., a German driver's license number or a Germany Identity Card number).
Hong Kong
This tag is for data specific to Hong Kong.
India
This tag is for data specific to India.
Indonesia
This tag is for data specific to Indonesia.
Japan
This tag is for data specific to Japan.
Korea
This tag is for data specific to Korea.
Mexico
This tag is for data specific to Mexico.
Netherlands
This tag is for data specific to Netherlands.
Norway
This tag is for data specific to Norway.
Paraguay
This tag is for data specific to Paraguay.
Peru
This tag is for data specific to Peru.
Poland
This tag is for data specific to Poland.
Singapore
This tag is for data specific to Singapore.
Spain
This tag is applied to data recognized as specific to Spain (e.g., Spain Foreigner Identification number, Spain Tax Identification number, or Spanish passport number).
Sweden
This tag is applied to data recognized as specific to Sweden (e.g., a Sweden National ID number or Swedish passport number).
Taiwan
This tag is for data specific to Taiwan.
Thailand
This tag is applied to data recognized as specific to Thailand (e.g., a Thailand National ID number).
Turkey
This tag is for data specific to Turkey.
UK
This tag is applied to data recognized as specific to the United Kingdom (e.g., a United Kingdom driver's license number, United Kingdom National Insurance number, or United Kingdom Taxpayer Reference number).
Uruguay
This tag is for data specific to Uruguay.
US
This tag is applied to data recognized as specific to the U.S. (e.g., an FDA code, United States ATIN, ABA routing number, DEA number, United States EIN, United States NPI number, United States ITIN, United States passport number, United States Preparer Taxpayer ID number, United States SSN, United States territory or state, or United States toll-free phone number).
Venezuela
This tag is for data specific to Venezuela.
AGE
Matches numeric strings between 10 and 199.
Discovered.PII
Discovered.Identifier Indirect
Discovered.PHI
Discovered.Entity.Age
ARGENTINA_DNI_NUMBER
Matches strings consistent with Argentina National Identity (DNI) Number. Requires an eight-digit number with optional periods between the second and third and fifth and sixth digit.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Argentina
Discovered.PHI
Discovered.Entity.DNI Number
AUSTRALIA_MEDICARE_NUMBER
Matches numeric strings consistent with Australian Medicare number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Optional spaces can be placed between the fourth and fifth and ninth and tenth digits. An optional 11th digit separated by a / can be present. A checksum is required.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Australia
Discovered.PHI
Discovered.Entity.Medicare Number
AUSTRALIA_PASSPORT
Matches strings consistent with Australian Passport number. An 8- or 9-character string is required, with a starting upper case character (N, E, D, F, A, C, U, X) or a two-character starting character (P followed by A, B, C, D, E, F, U, W, X, or Z) followed by seven digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Australia
Discovered.PHI
Discovered.Entity.Passport
BELGIUM_NATIONAL_ID_CARD_NUMBER
Matches numeric strings consistent with Belgium's National ID card. Requires a twelve-digit number with a hyphen (-) between the third and fourth digits and the tenth and eleventh digits. A two-digit checksum is required.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Belgium
Discovered.PHI
Discovered.Entity.National ID Card Number
BITCOIN_INVOICE_ADDRESS
Matches strings consistent with the following Bitcoin Invoice Address formats: P2PKH, P2SH, and Bech32. P2PKH and P2SH must start with a 1 or a 3, respectively, followed by 25 - 34 alphanumeric characters, excluding l, I, O, and 0. Bech32 formats must begin with bc1 and be followed by 39 characters. To be identified, any addresses must have a valid checksum.
Discovered.Entity.CRYPTO
Discovered.PCI
BRAZIL_CPF_NUMBER
Matches a numeric string consistent with Brazil's CPF (Cadastro de Pessoas Físicas) number. An eleven-digit numeric string with non-numeric separators after the third, sixth, and ninth digits. A two-digit checksum is required.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Brazil
Discovered.PHI
Discovered.Entity.CPF Number
CANADA_BC_PHN
Matches numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with optional hyphens (-) or spaces after the fourth and seventh digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.British Columbia Health Network Number
CANADA_OHIP
Matches alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP). Requires a twelve-character alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth digits. The final two characters are a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.Ontario Health Insurance Number
CANADA_PASSPORT
Matches strings consistent with the Canadian Passport Number format.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.Passport
CANADA_QUEBEC_HIN
Matches alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.Quebec Health Insurance Number
CREDIT_CARD_NUMBER
Matches strings consistent with a credit card number with prefixes matching major credit card companies. Must include a valid checksum.
Discovered.PCI
Discovered.Entity.Credit Card Number
DATE
Matches strings consistent with dates. These can include days of the week, dates, and date times.
Discovered.Entity.Date
DENMARK_CPR_NUMBER
Matches numeric strings consistent with Personal Identification Number (CPR-number or Person-number). Requires a ten-digit number with either a DDMMYY-SSSS or DDMMYYSSSS format. The first six digits are an individual's birth date in Day, Month, Year format. The final four digits comprise the sequence number.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Denmark
Discovered.PHI
Discovered.Entity.CPR Number
DOMAIN_NAME
Matches domain names using a very broad pattern.
Discovered.Entity.Domain Name
EMAIL_ADDRESS
Matches strings consistent with an email address. The username must be fewer than 255 characters, followed by @, a domain of fewer than 255 characters, and a top-level domain of between 2 and 20 characters.
Discovered.PHI
Discovered.Entity.Electronic Mail Address
Discovered.Identifier Direct
ETHNIC_GROUP
Matches strings consistent with the US Census race designations.
Discovered.PII
Discovered.Entity.Ethnic Group
FDA_CODE
Matches a string consistent with a drug or ingredient registered with the Food and Drug Administration (FDA). Must start with 4 to 6 digits, followed by a hyphen, 3 to 4 digits, another hyphen, and one to two digits.
Discovered.Country.US
Discovered.Entity.FDA Code
FINLAND_NATIONAL_ID_NUMBER
Matches a string consistent with Finland's National ID number. Requires an eleven-character string in a DDMMYYCZZZQ format. The first six digits are an individual's birth date in day, month, year format. The C character is a century-of-birth indicator (+ for the years 1800-1899, - for the years 1900-1999, and A for the years 2000-2099). ZZZ is an individual ID number, and Q is a required checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Finland
Discovered.PHI
Discovered.Entity.National ID Number
FRANCE_CNI
Matches numeric strings consistent with the French National ID card number (carte nationale d'identité). Requires a twelve-digit numeric string.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.France
Discovered.PHI
Discovered.Entity.CNI
FRANCE_NIR
Matches numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit. The 14th and 15th digits act as a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.France
Discovered.PHI
Discovered.Entity.NIR
FRANCE_PASSPORT
Matches alphanumeric strings consistent with the French passport number. Requires two digits followed by two uppercase letters and five digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.France
Discovered.PHI
Discovered.Entity.Passport
GENDER
Matches strings consistent with gender or gender abbreviations.
Discovered.PII
Discovered.Identifier Indirect
Discovered.PHI
Discovered.Entity.Gender
GERMANY_DRIVERS_LICENSE_NUMBER
Matches alphanumeric strings consistent with Germany's Driver's License number. Requires an eleven-character string: a digit or letter, followed by two digits, six digits or letters, one digit, and one digit or letter.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Germany
Discovered.PHI
Discovered.Entity.Drivers License Number
GERMANY_IDENTITY_CARD_NUMBER
Matches alphanumeric strings consistent with Germany's Identity Card number. Requires a single letter followed by eight digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Germany
Discovered.PHI
Discovered.Entity.Identity Card Number
IBAN_CODE
Matches strings consistent with an International Bank Account Number (IBAN). Must contain a valid country code.
Discovered.Entity.IBAN Code
ICD10_CODE
Matches strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2020.
Discovered.Entity.ICD10 Code
IMEI_HARDWARE_ID
Matches strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 digits with optional hyphens or spaces after the second, eighth, and fourteenth digits.
Discovered.Entity.IMEI
IP_ADDRESS
Matches IP Addresses in the V4 and V6 formats.
Discovered.Entity.IP Address
LOCATION
Matches strings consistent with Countries, States, Addresses, or Municipalities. By default, it focuses on locations in the United States.
Discovered.Entity.Location
MAC_ADDRESS
Matches strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon.
Discovered.Entity.MAC Address
MAC_ADDRESS_LOCAL
Matches strings consistent with a local Media Access Control (MAC) address.
Discovered.Entity.MAC Address Local
PERSON_NAME
Matches strings consistent with a dictionary of people's names. Names are drawn from the US Social Security database.
Discovered.PII
Discovered.PHI
Discovered.Entity.Person Name
Discovered.Identifier Indirect
PHONE_NUMBER
Matches strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone number format.
Discovered.Entity.Telephone Number
POSTAL_CODE
Matches strings consistent with a valid US ZIP code, with an optional +4 extension. Only valid five-digit ZIP codes are detected.
Discovered.Entity.Postal Code
SPAIN_NIE_NUMBER
Matches strings consistent with Spain's Foreigner Identification number. Requires an eight-character string: the initial character must be X, Y, or Z, followed by seven digits, then an optional hyphen or space and a single checksum character.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Spain
Discovered.PHI
Discovered.Entity.NIE Number
SPAIN_NIF_NUMBER
Matches strings consistent with Spain's Tax Identification number. Requires eight digits followed by an optional hyphen or space and a single checksum character.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Spain
Discovered.PHI
Discovered.Entity.NIF Number
SPAIN_PASSPORT
Matches strings consistent with Spain's Passport number. Requires an eight- or nine-character string, starting with either two or three letters followed by six digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Spain
Discovered.PHI
Discovered.Entity.Passport
STREET_ADDRESS
Matches strings consistent with street addresses. Primarily looks for strings consistent with the United States street naming convention.
Discovered.Entity.Location
SWEDEN_NATIONAL_ID_NUMBER
Matches numeric strings consistent with Sweden's National ID number. Requires a ten- or twelve-digit string that must start with a date in either the YYMMDD or YYYYMMDD format. An optional - or + character then separates the four ending digits. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Sweden
Discovered.PHI
Discovered.Entity.National ID Number
SWEDEN_PASSPORT
Matches numeric strings consistent with Sweden's Passport number. Requires an 8-digit number.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Sweden
Discovered.PHI
Discovered.Entity.Passport
SWIFT_CODE
Matches alphanumeric strings consistent with the SWIFT code (also known as a Bank Identifier Code, or BIC) format.
Discovered.Entity.Swift Code
THAILAND_NATIONAL_ID_NUMBER
Matches strings consistent with Thailand's National ID number. Requires a 13-digit number with optional spaces or hyphens (-) after the first, fifth, tenth, and twelfth digits. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Thailand
Discovered.PHI
Discovered.Entity.National ID Number
TIME
Matches strings consistent with times. Can contain both date and time components.
Discovered.Entity.Date
UK_DRIVERS_LICENSE_NUMBER
Matches alphanumeric strings consistent with the United Kingdom's Driver's License number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for the decade of birth, two digits for the month of birth (incremented by 50 for female drivers), two digits for the day of birth, one digit for the year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.UK
Discovered.PHI
Discovered.Entity.Drivers License Number
UK_NATIONAL_INSURANCE_NUMBER
Matches alphanumeric strings consistent with the United Kingdom's National Insurance number. Requires a nine-character string. The first two characters must be letters, followed by an optional space, then six digits with optional spaces or hyphens (-) every two digits, ending with a letter.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.UK
Discovered.PHI
Discovered.Entity.National Insurance Number
UK_TAXPAYER_REFERENCE
Matches ten-digit numeric strings consistent with UK Taxpayer Reference (UTR) numbers. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.UK
Discovered.PHI
Discovered.Entity.Taxpayer Reference
URL
Matches strings consistent with a Uniform Resource Locator (URL). The string must begin with http://, https://, ftp://, file:///, or mailto:, followed by a string and ending with a top-level domain of no more than 128 characters.
Discovered.Entity.URL
US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER
Matches a numeric string consistent with a United States Adoption Taxpayer Identification Number (ATIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having 93 as an allowed Group Number.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Adoption Taxpayer ID Number
US_BANK_ROUTING_MICR
Matches numeric strings consistent with an American Bankers Association (ABA) Routing Number. Must be a nine-digit number starting with 0, 1, 2, 3, 6, or 7, followed by eight digits. The final digit is a checksum.
Discovered.Country.US
Discovered.Entity.Bank Routing MICR
US_DEA_NUMBER
Matches alphanumeric strings consistent with a Drug Enforcement Administration (DEA) number assigned to a health care provider. Must be nine characters long. The first two characters must be alphanumeric, and the last seven must be digits. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.Entity.DEA Number
US_EMPLOYER_IDENTIFICATION_NUMBER
Matches numeric strings consistent with a United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.
Discovered.Country.US
Discovered.Entity.Employer ID Number
US_HEALTHCARE_NPI
Matches numeric strings consistent with US National Provider Identifier (NPI). Strings must be either 10 or 15 digits with the final digit being a valid checksum.
Discovered.PII
Discovered.Country.US
Discovered.Entity.Healthcare NPI
Discovered.Identifier Undetermined
US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER
Matches a numeric string consistent with a United States Individual Taxpayer Identification Number (ITIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having a limited set of allowed Group Numbers.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Individual Taxpayer ID Number
US_PASSPORT
Matches numeric strings consistent with a United States passport number. Strings must contain nine digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Passport
US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER
Matches strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P followed by eight digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.Entity.Preparer Taxpayer ID Number
US_SOCIAL_SECURITY_NUMBER
Matches strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Social Security Number
US_STATE
Matches strings consistent with either a full name or two-letter abbreviation of a US state or territory.
Discovered.Country.US
Discovered.Entity.State
US_TOLLFREE_PHONE_NUMBER
Matches strings consistent with a US toll-free telephone number. Allowed area codes are 800, 88 followed by any digit, or 899.
Discovered.Country.US
Discovered.Entity.Tollfree Telephone Number
VEHICLE_IDENTIFICATION_NUMBER
Matches strings consistent with Vehicle Identification Numbers. A checksum is required as well as a valid World Manufacturer Identifier.
Discovered.Country.US
Discovered.Entity.Vehicle Identifier or Serial Number
Public preview: This feature is in public preview and available to all accounts.
Discover comes preconfigured with a bundle of classification frameworks for use out-of-the-box once endorsed by your organization's admins. These frameworks are designed by Immuta’s Legal Engineering and Research Engineering teams and informed by data privacy regulations and security standards: GDPR, CCPA, GLBA, HIPAA, PCI, and global best practices. They are a starting point for companies to customize to their own classification, security, and risk policies.
The Data Security Framework is the general classification framework. It provides the groundwork for categorizing data based on its context, but it is not specific to any regulatory framework and does not assign sensitivity or risk values to the data it tags. It provides a consistent taxonomy used throughout Immuta, from the other built-in frameworks, to customized frameworks that classify data valuable to your organization, to Secure data and subscription policies.
The Data Security Framework is a supportive tool that accelerates data classification. Use the Data Security Framework in tandem with Discover identification frameworks out-of-the-box for the easy and quick onboarding of data sources and tags. Then, choose the compliance frameworks that matter to your industry or start building your own classification frameworks that assign sensitivity to the specific data of your organization. Your organization's compliance team should review the compliance frameworks as you would a template for a policy or contract and adapt them as needed to ensure a complete inventory and proper classification of your data.
You can view the Data Security Framework tags and their descriptions from the tags page in the UI or from the data dictionary when they are applied to a data source. Note the field and record tags: while they seem similar, both are necessary to convey the content of your data. Field tags describe the content of the columns, and record tags describe the content of the table.
Use the Data Security Framework with the Risk Assessment Framework
To classify your data, use both the Data Security Framework to set the groundwork for classification and the Risk Assessment Framework to apply tags with sensitivity metadata based on the Data Security Framework tags. With Snowflake, these frameworks together will show sensitive queries in Detect dashboards.
The Risk Assessment Framework provides visible tags describing your data's sensitivity based on the confidentiality risks it poses to your organization or the data subjects.
Use the Risk Assessment Framework out-of-the-box with the Data Security Framework and Discover identification frameworks to provide sensitivities to view in the Detect dashboards. Additionally, you can copy the framework using the API and create new rules to assign risk level and sensitivity to other data specific to your use case.
The risk assessment tags have sensitivity level metadata assigned to them that will appear in the Detect dashboards as non-sensitive (when no risk assessment tag is applied), sensitive, or highly sensitive. Additionally, use the risk assessment tags to build Secure policies that restrict access to high-risk and confidential data.
RAF.Confidentiality.Medium: Indicates confidential data with medium privacy risk to the data subject. Sensitivity: Sensitive (1).
RAF.Confidentiality.High: Indicates confidential data with high privacy risk to the data subject. Sensitivity: Highly-Sensitive (2).
RAF.Confidentiality.Very High: Indicates confidential data with very high privacy risk to the data subject. Sensitivity: Highly-Sensitive (3).
Private preview: This feature is in private preview and available to select accounts.
Use the Data Security Framework with regulatory frameworks
The Data Security Framework provides the necessary translation of Discovered entity tags to classification tags. Without the Data Security Framework enabled, the regulatory frameworks will not work with your data automatically and will require customization.
Immuta comes with four regulatory frameworks informed by the best practices of a specific regulation or standard. These are designed by Immuta’s Legal Engineering and Research Engineering teams as a general interpretation, but each organization should customize them based on their internal practices:
CCPA Framework: Classifies personal sensitive information controlled under the California Consumer Privacy Act (CCPA), as amended by the California Privacy Rights Act (CPRA). This framework tags personal information, including communication content (like the body of a text message) and details about an individual's sexual orientation, religion, race, or biometric data.
GDPR Framework: Classifies personal data of specific categories protected under the EU General Data Protection Regulation (GDPR). This framework tags personal data, including details about an individual's health, sexual orientation, religion, race, or biometric data.
HIPAA Framework: Classifies protected health data controlled under the US Health Insurance Portability and Accountability Act (HIPAA). This framework tags health data connected to a specific individual.
PCI Framework: Classifies payment card information relevant to the Payment Card Industry (PCI) standard. This framework tags payment card information, including account, authentication, and cardholder data.
Some compliance frameworks are used to add context and apply Data Security Framework tags. Use the data inventory dashboard to enable frameworks with information on the other frameworks they depend on.
Organizations are responsible for making their own independent assessment of the framework rules. The framework rules are only templates and are not necessarily adapted to the specific context in which an organization operates. Framework rules do not constitute legal advice. They do not create any commitments or assurances from Immuta that users will necessarily comply with the statutes or standards that have informed these framework rules.
Private preview
This feature is only available to select accounts. To activate classification frameworks without the private preview feature, use the .
Requirements:
Native SDD enabled and turned on
Registered data sources
Immuta permission GOVERNANCE
To activate a classification framework,
Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Activate.
Repeat this process for all frameworks relevant to your data. See the for information on Immuta's built-in frameworks.
To deactivate a classification framework,
Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Deactivate.
About Classification in Immuta
Public preview: This feature is available to all accounts.
Classification is the process in which data is categorized by the content and the associated risk level based on context. To classify your data, Discover evaluates your data in phases:
Sensitive data discovery (SDD) runs to identify your data by content type. The data is discovered and evaluated by the pattern it matches and is tagged.
The Data Security Framework scans those tags and any other tags applied to the data source and columns to categorize the data by context. This phase considers the data and the data surrounding it to understand the category of the data within the context of the data source.
Other regulatory-based frameworks scan and build off of the Data Security Framework tags. These frameworks are specific to regulations and standards and tag the data that matters to each framework.
The Risk Assessment Framework scans and builds off of the Data Security Framework. This framework tags data with specific risk assessment tags that describe the risk the data poses to your organization or the data subject. They also contain additional metadata used in the to describe the risk as sensitivity and visualize when that sensitive data is accessed.
Every phase of classification in Immuta can be customized to find and tag the data your organization cares about. Users can customize the Data Security Framework to find, match, and tag data they want categorized based on the organization's processes. Then, users can modify the by adjusting the sensitivity of classification tags to the organization’s policies or creating new tags and rules in customized frameworks. After data is classified, classification tags can be used to or .
Using Discover classification to assign risk and sensitivity levels to your data and Detect dashboards to visualize the risk levels offers these benefits:
Increasing the semantic understanding of your data to better meet compliance requirements
Reducing the time to make decisions about what data access is allowed under what purposes
Reducing the effort and time to respond to auditors about data access in your company
Reducing the labor of classifying data to enumerate what data is within the scope of security or regulatory compliance frameworks
Both entity and classification tags describe the content of data on a per-column basis, and you can use them to and . However, there are key differences between the two:
Entity tags are applied through identification and describe what the data is. SDD applies entity tags to columns based on the patterns of the data.
Classification tags are applied through categorization and risk assessment and describe the context of the data and the risk it poses. Using classification frameworks, classification tags are applied to columns based on the entity tags previously applied by SDD. Additional classification tags can then be applied, providing even more context or expressing the property of the record rather than just the column.
Entity tags describe the contents of individual columns, in isolation. But you don't access individual columns in isolation, so why would you determine their sensitivity that way? Entity tags do not attempt to and cannot contextualize column contents with neighboring columns' contents. This means that connections between data are lost if they cannot be identified through a pattern within the column itself. Classification tags describe the contents of a table with the context of all its columns, providing a holistic view of the risk of the data for what it is, rather than the pattern it fits. Context is necessary to understand whether your data is public or private data, risky or safe to have ungoverned access, or sensitive and creating toxic joins when accessed with other tables.
Additionally, entity tagging does not indicate how sensitive the data is, but classification tags can carry a sensitivity level. For example, an entity tag may identify a column that contains telephone numbers, but the entity tag alone cannot say that the column is sensitive. A phone number associated with a person may be classified as sensitive, while the publicly listed phone number of a company might not be considered sensitive.
After you understand what entities your data contains using SDD, you need to adopt frameworks that determine what combinations of data constitute sensitive data and their level of sensitivity.
Frameworks are a set of data categories and a set of classification rules to place data into those categories. In Immuta, the data categories are represented by tags, and when data fits a classification rule the tag is applied:
Classification rules determine how each classification tag is applied. These rules can apply tags based on tags already on the column, tags applied to neighboring columns, and tags applied to the data source. This means that the complete data source is considered when classifying your data sources, and even tags applied to individual columns can affect the risk level of the entire data source.
Data classification is a process, and with Immuta, much of it is automated. This means you can reap the benefits of classified and tagged data more quickly and easily than by manually classifying and tagging it:
Requirements:
Native SDD enabled and turned on
Registered data sources
Immuta permission GOVERNANCE
Immuta Discover provides identification frameworks out-of-the-box to recognize and tag data, and Discover also provides classification frameworks out-of-the-box to categorize and classify data. These frameworks are all generic to industry practices and should be customized to each organization's specific needs.
Tune SDD frameworks, rules, and patterns first to adjust where Discovered tags are applied. Because classification frameworks apply classification tags from the Discovered tags, tuning SDD should come first and will have trickle-down effects on classification. Customizing SDD requires some initial work but will automate data tagging for all data sources in the future.
Follow the steps below to tune SDD from the Default Framework:
: It is recommended to copy the Default Framework and adjust the rules from there.
.
.
: This will remove the tags from any previous identification frameworks and rerun SDD with your new framework. From here, either continue to edit patterns and rules to reconfigure the applied tags, or if you are happy with the results, proceed to the next step.
.
After SDD has applied entity tags, classification frameworks will automatically reapply their tags to account for any changes to Discovered tags. It may be necessary to adjust the classification tags based on your organization's data, security, and compliance needs.
Requirements:
Immuta permission AUDIT
Use the Detect dashboards to review queries at different sensitivity levels and review the tags that have been applied to your data source columns to understand the tags that Immuta applied there:
Have an Immuta user subscribed to a data source make multiple queries to a data source in Snowflake. The user should query both non-sensitive and sensitive data.
Navigate to the Audit page and click ↻Native Query Audit to pull in queries made in Snowflake.
Navigate to the Events (Beta) page. Note that Snowflake has a 15-minute data latency for all audit events.
Select the Event Id of one of the queries. Click the Columns tab.
The Columns tab lists the columns in the query, organized from highest to lowest sensitivity, along with the tags applied to each column. Check that the columns you know to be sensitive are here.
For example, if the query has a column with last names, you should see a minimum of the following tags: Discovered.PII, DSF.Personal, DSF.Record.Subject.Type.Individual, DSF.Record.Identifiability.Identifiable, and DSF.Control.Personal.
Note any sensitive columns not labeled as sensitive.
Complete steps 2-5 for as many queries as you want.
Requirement: Immuta permission GOVERNANCE or data owner
Target some data sources to manually review tags:
Navigate to the data dictionary for the data source by opening the Data Sources page and selecting a data source. Click the Data Dictionary tab to open the data dictionary.
The data dictionary lists the data source columns, with details about the name, data type, and a list of the tags on each column. Assess whether the tags are accurate to your data.
Tags may be unexpected but still accurate to your data. Additionally, they may have been applied because they were found to be the best match from the SDD patterns in the framework.
If you want to improve SDD and personalize it to your data,
Assess why the tag was applied to your data.
Is the pattern incorrectly matching this specific column, but correct in other places? It must have been the most correct match found by SDD. Create a better match by completing the following steps:
If you want to remove the unexpected tags, use one of the following how-to guides:
If you were expecting some sensitive data to be tagged and it is not, enable additional tags using one of the following how-to guides:
Requirement: Immuta permissions GOVERNANCE and AUDIT
Navigate to the Data Sources page and select the data sources that you assessed and noted issues with.
Click the Data Dictionary tab.
Delete unnecessary tags by clicking the tag you want to remove from the column and selecting Disable from the tag side sheet.
To add tags,
Click Add Tags in the Actions column.
Begin typing the name of the tag you want to add in the Search by Name field and select the tag from the dropdown list.
Click Add.
The built-in in Immuta provide a quick way to leverage your own catalog or data platform tags to establish classification tags. These classification tags can then be used in the Immuta Data Platform for query activity visualizations, monitors, reports, and policies. After you have configured a data catalog integration and registered data sources in Immuta, you can start automating data classification of a column based on its context by considering the combination of its associated tags, its neighboring columns' tags, or its table tag. Classification frameworks also provide . To use classification frameworks with your current tags from an external catalog, use one of the following options:
Follow the tutorial below: This starter framework is built to map a classification scale of restricted, confidential, internal, and public to Immuta's three-level scale. It requires an , but all other steps are described below.
: This minimal framework allows you to map your own classification tags to Immuta classification tags. Then, your users' queries will have a sensitivity score on the Detect dashboard and in audit logs based on the classification tags on the data columns they queried. Use this option if you have already classified your organization’s data in an external catalog and want that metadata reflected in Immuta as Sensitive and Highly Sensitive.
: This option allows you to map your own tags describing your data to Immuta's predefined classification tags in the context of a specific compliance framework. Immuta provides built-in frameworks for GDPR, CCPA, and HIPAA. Map your tags to the most comparable Data Security Framework (DSF) tag, and Immuta will apply the classification tag based on the framework. Use this option if you have descriptive tags on your data and want that metadata mapped to a specific compliance framework.
Follow this guide to map your external catalog tags to the example framework, or consult the for more information about the framework schema.
Using the example framework below, customize the framework for your organization's classification tags.
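The snippet below is a minimal sketch of what such a framework payload might look like, assuming the field nesting implied by the parameter descriptions that follow. The framework name, tag names, and rule contents are illustrative placeholders rather than Immuta-defined values, so consult the framework API reference for the authoritative schema.

```json
{
  "name": "Example Catalog Classification Framework",
  "active": true,
  "tags": [
    {
      "name": "Example.Confidentiality.High",
      "source": "curated",
      "sensitivities": [
        { "dimension": "confidentiality", "sensitivity": 2 }
      ]
    }
  ],
  "rules": [
    {
      "classificationTag": { "name": "Example.Confidentiality.High", "source": "curated" },
      "columnTags": [
        { "name": "Confidential", "source": "collibra" }
      ]
    }
  ]
}
```

The parameters referenced in this sketch are described below.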
tags: These tags are automatically created in Immuta with the sensitivity you assign. All tags used in the classificationTag parameter should be defined here.
tags.sensitivities: This is metadata for the sensitivity of the new tag. Use confidentiality for dimension. Options for sensitivity are 1 (shown as sensitive in Detect dashboards) and 2 (shown as highly sensitive in Detect dashboards). For nonsensitive, leave this parameter empty.
rules: These are the rules for applying the tags defined above.
rules.classificationTag: This classification tag must be defined in tags. Use the name you want; the source is curated. This is the tag that will be applied if the rule requirement is met.
rules.columnTags: This object represents tags on a column. If the tag defined here is found on a column, then the rule's classificationTag will be applied to the same column.
rules.neighborColumnTags: This object represents tags on other columns in the data source. If the tag defined here is found on any column in the data source, then the rule's classificationTag will be applied to all the neighboring columns.
rules.tableTags: This object represents tags on the data source. If the tag defined here is found on the data source, then the rule's classificationTag will be applied to all the columns in that data source.
active: When true, the framework is active and will apply tags when the rules are met.
Follow the example below to map your external tags to the rules in the example framework.
The Immuta built-in Risk Assessment Framework has a rule where columns tagged DSF.Interpretation.Credentials.Secret by sensitive data discovery will be tagged RAF.Confidentiality.High.
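Expressed in the schema sketched above, that rule might look roughly like this; the field nesting and source values are assumptions drawn from the parameter descriptions rather than the verbatim built-in definition:

```json
{
  "classificationTag": { "name": "RAF.Confidentiality.High", "source": "curated" },
  "columnTags": [
    { "name": "DSF.Interpretation.Credentials.Secret", "source": "curated" }
  ]
}
```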
To translate this to your tags, replace the name and source values of the columnTags, neighborColumnTags, or tableTags with your own. This new example is for a Collibra tag that an organization uses for confidential data. The rule now states: apply the classification tag RAF.Confidentiality.High to a column if it has the collibra tag Confidential. Repeat this for your organization's remaining classification levels.
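A sketch of the rule after that substitution, with the same caveat that the exact field nesting is assumed:

```json
{
  "classificationTag": { "name": "RAF.Confidentiality.High", "source": "curated" },
  "columnTags": [
    { "name": "Confidential", "source": "collibra" }
  ]
}
```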
Find the name and source for your tags
If you do not know the name or source for your tags, you can list your tags using the Immuta API.
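A request along these lines is what you would send; the endpoint path and authentication header shown here are placeholders, so substitute the tag listing route and auth scheme from your version's Immuta API documentation.

```bash
# Sketch only: /tag and the bearer token header are placeholders; check your
# Immuta API reference for the exact tag listing endpoint and authentication.
curl -X GET \
  -H "Authorization: Bearer <your-api-token>" \
  "https://<your-immuta-host>/tag"
```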
This request will list all the tags in your Immuta environment, similar to this example response:
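The entries below are invented examples that show the shape of the values you need; the actual response will list your environment's tags, and the fields to copy into your rules are name and source.

```json
[
  { "name": "Confidential", "source": "collibra" },
  { "name": "Internal", "source": "collibra" }
]
```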
Requirement: Immuta permission GOVERNANCE
Once you have made all the customizations to the example framework, make the following request using the Immuta API, with your full customized framework as the payload.
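A sketch of that request is below, assuming the customized framework has been saved to a local file named framework.json; the endpoint path is a placeholder, so use the framework route from your version's Immuta API documentation.

```bash
# Sketch only: /frameworks is a placeholder path; check your Immuta API
# reference for the exact framework creation endpoint. framework.json holds
# your customized framework payload.
curl -X POST \
  -H "Authorization: Bearer <your-api-token>" \
  -H "Content-Type: application/json" \
  -d @framework.json \
  "https://<your-immuta-host>/frameworks"
```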
Your new framework will now be visible in the Immuta UI by navigating to the Classification section under Discover.
Of sensitive data discovery's pattern types, regex and dictionary patterns are competitive. This means that when assessing your data, if multiple patterns could match, only one of the competitive patterns will be chosen to tag the data. To better understand how Immuta executes this competition, read further.
Discover employs a three-phased competitive pattern analysis approach for sensitive data discovery (SDD):
: No data is moved, and Immuta checks the patterns against a sample of data from the table.
: Patterns that have less than a 90% match are filtered out.
: The remaining patterns are compared with one another to find the most specific pattern that qualifies and matches the sample.
In the end, competitive pattern analysis aims to find a single pattern for each column that best describes the data format.
In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the pattern has matched a value in the column) information for each active pattern. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active patterns over a row sample.
The sample size is decided based on the number of patterns and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary patterns being run in the framework, not the data size. The sample size dependence on the number of patterns is weak and will not exceed 13,000 rows.
In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer than the requested number of rows, when many of the column values are null, or because of technology-specific limitations.
Snowflake and Starburst (Trino): Discover implements native table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with less than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can be significantly slower, and performance varies with the performance of the query that defines the view.
During the scoring phase, a machine inference is carried out among all qualified patterns, combining pattern-derived complexity information with hit rate information to determine which pattern best describes the sample data. This process prefers the more restrictive of two competing patterns since the ability to satisfy the more difficult-to-satisfy pattern itself serves as evidence that it is more likely. This phase ends by returning a single most likely pattern per the inference process.
Here are a set of regex patterns and a sample of data:
Patterns:
Pattern 1: [a-zA-Z0-9]{3} - Matches three-character strings made up of the letters a-z (lowercase or uppercase) or the digits 0-9.
Pattern 2: [a-c]{3} - Matches three-character strings made up of the lowercase letters a, b, and c.
Pattern 3: (a|b|d){3} - Matches three-character strings made up of the lowercase letters a, b, and d.
When qualifying the patterns, Pattern 1 and Pattern 3 both match 90% or more of the data. Pattern 2 does not, and is disqualified.
Then the qualified patterns are scored. Here, Pattern 1, despite matching 100% of the data, is unspecific and could match over 200,000 values. On the other hand, Pattern 3 matches only 90% of the data but is very specific, with just 27 possible values.
Therefore, with the specificity taken into account, Pattern 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
Dictionaries are considered patterns by Immuta and are part of the competitive process, while column-name regex patterns are not.
Scoring ties are rare but can occur if the same pattern is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return a pattern in the case of a tie.
Pattern complexity analysis is sensitive to the total number of strings a pattern accepts or, equivalently for dictionaries, the number of entries. Therefore, patterns that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.
To activate a framework using , see the .
For example, under HIPAA, a list of procedures a doctor performed is only considered protected health information (PHI) if it can be associated with the identity of patients. Since entity tagging operates on a single column-by-column basis, it cannot reason whether or not a column containing procedure codes merits classification as PHI. Therefore, entity tagging will not tag procedure codes as PHI. But classification tagging will tag it PHI if it detects patient identity information in the other columns of the table. This is an example that Immuta built-in frameworks can address out-of-the-box using the .
Classification tags are applied based on the Discovered tags from SDD or other tags on the data source. Classification tags contain additional metadata about each column, such as the source of the tag, the dimension, and the sensitivity level. This metadata is used in the framework rules and complex formulas that assign the sensitivity of queries visible in .
Frameworks are often built off of an interpretation of regulatory frameworks or standards, such as the US Health Insurance Portability and Accountability Act (HIPAA) and the PCI standard. However, organizations can also build frameworks that represent their internal business processes. When used in Immuta, they automate data tagging and provide, through the , information about what data you have immediately after it is registered in Immuta.
See the for more information about the frameworks Immuta provides out-of-the-box.
Quick data access control: Use Discover to identify and classify your data immediately after registration in Immuta. Then, off of those tags. This repeatable process will protect your data in its current state and whenever any new data sources are created. Automate the process further with ; schema monitoring allows you to register data just once. Then, Immuta will monitor your data environment for changes and, when found, update the data source in Immuta, update the tags on that data source, and then update user access based on your governance policies when changes happen.
Scale your data monitoring: Use Discover to identify and classify your data immediately after registration in Immuta. Then, view your data users' access to your sensitive and risky data through the .
Build data platform compliance: Use and customize the to identify and classify your data based on the industry practices and regulations your organization needs to abide by. The Immuta compliance frameworks are templates to provide a strong starting point for further customization to what matters to your organization. Once those frameworks are built, use them to classify your data immediately after data registration in Immuta.
Snowflake integration (If you are using Databricks, use the how-to below.)
Is the pattern incorrectly matching your data and irrelevant to your organization? .
.
.
so this column is correctly matched by SDD.
.
.
. Note that classification tags build off of other tags, so removing a single classification or Discovered tag can have trickle-down effects on the data source.
.
.
.
. Note that classification tags build off of other tags, so adding a single classification or Discovered tag can have trickle-down effects on the data source.
.
Tags can be edited on an individual basis for each data source. If broad changes to the classification framework are necessary to re-tag your data, use the .
For more information about these parameters see the .
During the qualification phase, patterns that do not agree with the data are disqualified. A pattern agrees with the data if the hit rate on the remote sample exceeds the predefined threshold. This threshold is a 90% match for most built-in patterns; however, two built-in patterns have a lower threshold. The 90% threshold is standard for all custom patterns as well, to ensure the pattern matches the data within the column and to avoid false positives. If no patterns qualify, then no pattern is assessed for scoring and the column is not tagged.
Requested sample size by number of regex and dictionary patterns:
5 patterns: 7,369 rows
50 patterns: 9,211 rows
500 patterns: 11,053 rows
5,000 patterns: 12,895 rows
Sample data for the competitive pattern analysis example above (10 values): dad, baa, add, add, cab, bad, aba, baa, dad, baa.