Discover scans your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
This guide illustrates how to implement sensitive data discovery and classification.
This reference guide discusses the components and benefits of Immuta Discover.
This reference guide describes the design of Immuta Discover.
Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
The guides in this section discuss the components of SDD and how to use it to tag your data.
Classification is the process in which data is categorized by the content and the associated risk level based on context. The guides in this section illustrate how to configure and customize classification for your organization.
Requirements:
Native SDD enabled and turned on
Immuta permission GOVERNANCE
Click the Discover icon in the navigation menu and select the Frameworks tab.
Click Create New.
Enter a Name for the framework.
Enter a Description for the framework.
Select the option to Create empty framework.
Click Create.
After you create the framework, you can create new rules for it.
Click the Discover icon in the navigation menu and select the Frameworks tab.
Click Create New.
Enter a Name for the framework.
Enter a Description for the framework.
Select the option to Create rules from an existing framework.
Select the checkbox for the framework you want to copy. You can only copy a single framework. For more information about a framework, click the framework name to open a new tab with details about the framework.
Click Create.
To assign a framework to run on specific data sources,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to assign and navigate to the Data Sources tab.
Click Add Data Sources.
Select the checkbox for the data source you want this framework to run on. You may select more than one.
Click Add Data Source(s).
After a data source is removed from a framework, it will use the global framework for any SDD scans and the tags applied by the removed framework will be replaced. To remove data sources from a framework,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to remove data sources from and navigate to the Data Sources tab.
Select the checkbox for the data source you want to remove from the framework. You may select more than one.
Select the Bulk Actions more options menu.
Select Remove Data Sources.
Click Confirm.
Deleting a framework will remove it from any data sources. Those data sources will then use the global framework for any SDD scans, and the tags applied by the deleted framework will be replaced. Governors can delete any framework; users with the CREATE_DATA_SOURCE or CREATE_DATA_SOURCE_IN_PROJECT permission can only delete frameworks they created. To delete a framework,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Click the three dot menu in the Action column for the framework you want to delete. Note that the global framework cannot be deleted. If you want to delete it, configure a different framework as the global framework.
Select Remove.
Click Confirm.
Discover automates discovering and tagging data across your data platform. It encompasses the identification and classification of data using frameworks.
Native SDD enabled
The Immuta UI has separate sections for identification frameworks and classification frameworks. Both frameworks are made of rules, criteria, and resulting tags, but the criteria types differ for each framework type. Identification frameworks use competitive pattern matching and column name matching to discover data types and tag them. Classification frameworks use tags on the column, neighboring columns, and data source for context and then tag the columns based on that context. Find more information about each framework type below.
Identification frameworks run with sensitive data discovery (SDD). They use data patterns to discover data and tag it based on what the data is.
Competitive pattern analysis: This criteria is a process that reviews all the regex and dictionary patterns within the rules of the framework and searches for the pattern with the best fit. In this review, all the competitive pattern analysis criteria in the framework compete against each other to find the most specific pattern that fits the data. The resulting tags for the best pattern's rule are then applied to the column.
Regex pattern: This pattern contains a case-insensitive regular expression that searches for matches against column values. Create a regex pattern in the UI or with the sdd/classifier endpoint (a sketch of calling this endpoint follows this list).
Dictionary pattern: This pattern contains a list of words and phrases to match against column values. Create a dictionary pattern in the UI or with the sdd/classifier endpoint.
Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.
Column name pattern: This pattern includes a case-insensitive regular expression matched against column names, not against the values in the column. Create a column name pattern in the UI or with the sdd/classifier endpoint.
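The sdd/classifier endpoint referenced in this list can also be called directly. The sketch below shows one way a custom regex pattern might be registered with it; the endpoint path comes from this guide, while the tenant URL, API key, payload field names, and authentication header are assumptions included only to illustrate the shape of the call.

```python
# Minimal sketch of creating a custom regex pattern with the sdd/classifier
# endpoint referenced above. The endpoint path comes from this guide; the
# payload field names (name, type, config.regex) and the auth header are
# assumptions -- confirm them against your Immuta API reference.
import requests

IMMUTA_URL = "https://your-immuta-tenant.example.com"  # hypothetical tenant URL
API_KEY = "your-api-key"                               # hypothetical API key

payload = {
    "name": "EMPLOYEE_ID",                # hypothetical pattern name
    "description": "Internal employee IDs of the form EMP-123456",
    "type": "regex",                      # assumed type value for a regex pattern
    "config": {"regex": r"EMP-\d{6}"},    # RE2-compatible expression
}

resp = requests.post(
    f"{IMMUTA_URL}/sdd/classifier",
    json=payload,
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()
print(resp.json())
```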
To start using identification frameworks in the UI, see the Getting started guide.
To manage identification frameworks with the API, see the /sdd/template endpoint reference guide.
Classification frameworks run with the classify service. They determine rule match and criteria fit based on proximity tags and then tag data based on the context it is within.
Match column tag: This criteria applies resulting tags based on specific tags already on the column.
Match neighboring column tag: This criteria applies resulting tags based on specific tags on neighboring columns. A toy sketch of this tag-matching logic follows below.
To manage classification frameworks in the UI, see the Activate frameworks guide.
To create a classification framework with the API, see the /frameworks endpoint reference guide.
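To make the two criteria above concrete, the toy sketch below shows the general shape of tag-based matching: a rule's resulting tags are applied when the tags it requires appear on the column or on its neighbors. This is an illustration only, not Immuta's implementation; the rule structure and the DSF.* tag names are hypothetical.

```python
# Toy illustration of classification-style tag matching. The rule structure
# and resulting tag names are hypothetical; Immuta's classify service applies
# this kind of logic internally against real framework rules.
from typing import List, Set

def apply_rules(column_tags: Set[str], neighbor_tags: Set[str], rules: List[dict]) -> Set[str]:
    """Return the resulting tags whose rule criteria match the column's context."""
    resulting: Set[str] = set()
    for rule in rules:
        column_match = set(rule.get("columnTags", [])) <= column_tags
        neighbor_match = set(rule.get("neighborTags", [])) <= neighbor_tags
        if column_match and neighbor_match:
            resulting |= set(rule["resultingTags"])
    return resulting

rules = [
    # Match column tag: the column itself carries a person-name tag.
    {"columnTags": ["Discovered.Entity.Person Name"], "neighborTags": [],
     "resultingTags": ["DSF.Personal"]},
    # Match neighboring column tag: a health code appears in a sibling column.
    {"columnTags": [], "neighborTags": ["Discovered.Entity.ICD10 Code"],
     "resultingTags": ["DSF.Health"]},
]

print(apply_rules({"Discovered.Entity.Person Name"},
                  {"Discovered.Entity.ICD10 Code"}, rules))
# e.g., {'DSF.Personal', 'DSF.Health'}
```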
Private preview: This feature is only available to select accounts.
The data inventory dashboard visualizes information about your organization's data. It presents your entire data corpus within the context of the frameworks you have actively tagging your data with details like when your data was scanned last or how much of the scanned data is relevant to your active frameworks.
In the data inventory dashboard, you will see tiles for scanned coverage and the percent of data scanned within a specific time frame. These tiles reference data scanned by an identification framework with SDD. To increase the number of your data sources that have been scanned, run SDD.
The next section of the dashboard shows tiles for the compliance frameworks. Each graph separates the columns found to contain the data relevant to the compliance framework from those that do not. These graphs update every time classification runs; the events that trigger classification are listed in the workflow section below.
For information on the frameworks visualized in the dashboard, see the Immuta frameworks reference guide.
The Discover workflow involves both identification with SDD and classification:
A user with the GOVERNANCE permission enables SDD and activates classification frameworks.
Users register data in Immuta.
SDD runs:
Immuta generates a SQL query using the identification framework's rules.
That query is executed in the native database.
Immuta receives the query results containing the column name and the matching rules but no raw data values.
SDD applies the resulting tags to the relevant columns.
Classification runs:
The data source's current tags are checked against the framework's rules.
When a matching rule is found, the resulting tags are applied to the relevant columns.
Users with the GOVERNANCE permission or data owners can view the data inventory dashboard with visualizations of their scanned data.
This workflow will run when a new data source is manually registered in Immuta or found from schema monitoring. Additionally, SDD alone will run from the following events:
A new data source is created.
Schema monitoring is enabled, and a new data source is detected.
Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed.
A user manually triggers it from the data source health check menu.
A user manually triggers it from the identification frameworks page.
A user manually triggers it through the API.
Classification will run from the following events:
A framework gets created, updated, or deleted.
A tag gets added to or removed from a column manually or by SDD.
A tag gets added to a data source.
A user manually triggers it from the data source health check menu.
A user manually triggers it through the API.
Customizing classification frameworks currently requires users to use the Immuta API.
Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using frameworks, rules, and patterns, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
Native SDD supports data discovery on data sources from the following technologies:
Starburst (Trino): Native SDD for Starburst (Trino) is currently in public preview and available to all accounts. Enable this feature on the Immuta app settings page.
Redshift: Native SDD for Redshift is currently in private preview and available to all accounts. Please reach out to your Immuta representative to enable it on your tenant.
To evaluate your data, SDD generates a SQL query using the identification framework's rules; the Immuta system account then executes that query in the native technology. Immuta receives the query result, containing the column name and the matching rules but no raw data values. These results are then used to apply the resulting tags to the appropriate columns.
This evaluating and tagging process occurs when SDD runs, which happens automatically from the following events:
A new data source is created.
Schema monitoring is enabled and a new data source is detected.
Column detection is enabled and new columns are detected. Here, SDD will only run on new columns and no existing tags will be removed or changed.
Users can also manually trigger SDD to run from a data source's overview page or the identification frameworks page.
Sensitive data discovery (SDD) runs frameworks to discover data. These frameworks are a collection of rules. These rules contain a single criteria and the resulting tags that will be applied when the criteria's conditions have been met. See the sections below for more information on each component.
An identification framework is a collection of rules that will look for a particular criteria and tag any columns where those conditions are met. While organizations can have multiple frameworks, only one may be applied to each data source. Immuta has the built-in Default Framework, which contains all the built-in patterns and assigns the built-in Discovered tags based on pattern matching.
For a how-to on the framework actions users can take, see the Manage frameworks page.
Each organization has a single global framework that will apply to all the data sources in Immuta by default, unless they have a different framework assigned. It is labeled on the frameworks page with a globe icon. Users can bypass this global framework by applying a specific framework to a set of data sources.
A rule is a criteria and the resulting tags to apply to data that matches the criteria. When Immuta recognizes that criteria, it can tag the data to describe the type. Each rule is specific to its own framework, but all a framework's rules can be copied to create a new framework.
For a how-to on the rule actions users can take, see the Manage rules page.
Criteria are the conditions that need to be met for resulting tags to be applied to data.
Competitive pattern analysis: This criteria is a process that will review all the regex and dictionary patterns within the rules of the framework and search for the pattern with the best fit. If there are multiple rules in a framework using competitive pattern analysis, only one will be applied to any column. To learn more about the competitive nature, see the How competitive pattern analysis works guide.
Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.
A pattern is the type of data Immuta will look for to meet the requirements to tag a column. They can be used in rules across multiple frameworks, but can only be used once within each framework. Immuta comes with built-in patterns to discover common categories of data. These patterns cannot be modified and are within preset rules with preset tags. Users can also create their own unique patterns to find their specific data. SDD only supports regex patterns written in RE2 syntax.
The three types of patterns are described below:
Regex: This pattern contains a case-insensitive regular expression that searches for matches against column values.
Column name: This pattern includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.
Dictionary: This pattern contains a list of words and phrases to match against column values.
Only application admins can enable sensitive data discovery (SDD) globally on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis.
When SDD is manually triggered by a data owner, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted Discovered tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
The amount of time it takes to run identification on a data source depends on several factors:
Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.
Identifiers: The number of identifiers being used weakly impacts the time to run identification.
Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.
Views: Performance on views is limited by the performance of the query that defines the view.
The time it takes to run SDD for all newly onboarded data sources in Immuta is not limited by SDD performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.
For users interested in testing SDD, note that the built-in patterns by Immuta require a certain amount of confidence to be assigned to a column. This means that with synthetic data, there may be situations where the data is not real enough to fit the confidence needed to match patterns. To test SDD, use a dev environment, create copies of your tables, or use the API to run a dryRun and see the tags that would be applied to your data by SDD.
Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use the Default Framework, then when the pattern is matched the column will not be tagged. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.
Data regex: Applies to text string columns. Case-sensitive.
Column name regex: Applies to any column. Not case-sensitive.
Dictionary: Applies to text string columns. Case sensitivity can be toggled in the identifier definition.
Immuta compiles dictionary patterns into a regex that is sent in the body of a query.
For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB.
For Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary cost if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
Native SDD for Databricks Unity Catalog will only work on data sources authenticated with a personal access token (PAT). OAuth machine-to-machine (M2M) is not supported with SDD.
Native SDD will only work on Starburst (Trino) data sources authenticated with username and password. OAuth 2.0 is not supported with SDD.
Redshift Spectrum is not supported with native SDD.
Username and password is fully supported with native SDD.
Okta is not supported with native SDD.
AWS access key is supported with limitations with native SDD:
The AWS access key used to register the data source must be able to perform, at a minimum, the following redshift-data API actions:
redshift-data:BatchExecuteStatement
redshift-data:CancelStatement
redshift-data:DescribeStatement
redshift-data:ExecuteStatement
redshift-data:GetStatementResult
redshift-data:ListStatements
The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database used to onboard the data sources.
If using a custom URL, the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options.
Redshift Serverless data sources are not supported for native SDD with the AWS access key authentication method.
These limitations are only relevant to users who have previously enabled and run Immuta SDD.
If you had legacy SDD enabled, running native SDD can result in different tags being applied because native SDD is more accurate and has fewer false positives than legacy SDD. Running a new SDD scan against a table will change the context of the resulting tags, but no Discovered tags previously applied by legacy SDD will be removed.
See the Migrate from legacy to native SDD page for more information.
Immuta allows you to automate discovering and tagging data across your data platform. Tagging is critical for two reasons:
It allows you to define data sensitivity, which in turn allows you to monitor where you have potential data security issues and gaps in your security posture.
It allows you to abstract your physical structure from your access policy logic. For example, you can build access policies like mask all columns tagged PII (where PII was auto-tagged by Discover) rather than much less scalable policies that must be knowledgeable of your physical layers like mask column x in database y in data platform z.
Today’s sensitive data discovery tools give you a shallow overview of your data corpus across a long list of platforms. They give you pointers on where you have sensitive data without the granularity to drive your column- or row-level access controls. They help you understand what data you possess according to a regulatory framework, like HIPAA or PCI, but without the details needed to automate your audits or compliance reporting. Knowing that you need to drive east to west on a road map from New York to California is helpful but ultimately insufficient to get you from a specific location to another.
Existing tools promise a high degree of automation, yet their many false positives result in painful manual work that never stops. Although data gets scanned automatically, performance breaks down at scale, or you manually need to fine-tune the computing resources of the scanners. Last but not least, your security team objects to the agent-based processing that requires taking data out of your data platform, and the associated data residency concerns may give you pause.
At Immuta, we believe that data security should not be painful. We believe that you can innovate and move quickly, while at the same time protecting your data and adhering to your internal policies and external regulations. Technology and automation allow you to make the right trade-off decisions quickly. It all starts with highly accurate and actionable metadata. If you trust your metadata and if it’s actionable, you can leverage it to automatically grant access to data, mask sensitive information, and automate your audit reporting.
Immuta Discover was built to tackle those challenges and address them through a unique architecture that was designed in collaboration with the largest financial institutions, healthcare companies, and government agencies in the world. The cloud and AI paradigm requires a fundamentally different approach. You must assume that your data is dynamic, unique, and collected in a multitude of different geographies and legal jurisdictions. Immuta Discover is built for this new world and its specific demands.
Identifying and classifying data requires analyzing and looking at the data - there’s no way around it. Immuta Discover does all the analysis and processing inside the native technology. It takes advantage of those platforms’ inherent scalability to enable you to analyze large amounts of data quickly, efficiently, and without the need for separate resource optimization for containers or virtual machines.
By processing data directly inside the data platform, Immuta Discover automatically adheres to data residency and locality requirements. If you run your data warehouse or lake globally - across North America, the European Union, and Asia - Immuta processes the data in the region where your data is stored. No data ever leaves the data platform, and it will never move across different cloud regions.
In-platform processing greatly reduces risk and improves your data security posture. Provisioning agents, whether in a container, virtual machine, or Amazon Machine Image (AMI), creates complexity and unnecessary security risk. Not only can those agents become compromised, but their misconfiguration might lead to data leaks to other parts of your cloud infrastructure. An agentless approach can better leverage data platform optimizations to process data instead of transferring it out to re-optimize and analyze. This simplifies operations and increases efficiency for your infrastructure teams.
The advantages of in-platform processing are abundant, but implementing it across a multitude of platforms is challenging. Immuta helps bypass the obstacles by doing all the heavy lifting for you and building in specific implementations for each technology. Although all those implementations are ultimately different, Immuta abstracts the results to one standardized taxonomy, so you can have consistently accurate and granular metadata across all your data stores.
Immuta Discover classifies data on a column level and instantaneously identifies schema changes. Only with that level of granularity and automation can you adhere to your audit requirements and understand what actions have been taken on your data. For example, if non-sensitive data is joined with sensitive data at query time, Immuta Discover will monitor and record that for your review. Continuous schema monitoring ensures schema changes never result in holes in your access controls and data security posture.
Trust in your metadata is critical for data security.
To unblock your data consumers, you need to automate your data access controls; this requires trusting that your classification and metadata are accurate and actionable. Immuta Discover provides you with highly accurate metadata and tags out-of-the-box and assists you in fine-tuning the classification mechanism to deal with false positives quickly. That enables you to build policies that dynamically grant or restrict access to protected data (like PHI or PII) depending on who is accessing it and what protections you want to apply.
Immuta Discover works in three phases: identification, categorization, and classification.
Identification: In this first phase, data is identified by its kind – for example, a name or an age. This identification can be manually performed, externally provided by a catalog, or automatically determined by Immuta Discover through column-level analysis of patterns.
Categorization: In the second phase, data is categorized in the context of where it appears, subject to any active data compliance or security frameworks. For example, a record occurring in a clinical context containing both a name and individual health data is protected health information (PHI) under HIPAA.
While every phase can and should be customized, for categorization Immuta provides a bundle of default frameworks. The generic Data Security Framework provides the base for the specific frameworks and gives fine-grained categorization of your data into a consistent set of security and compliance concepts. This categorization of data helps to understand the context it is in, including information like whether or not a record pertains to an individual, the composition and kinds of identifiers present, the data subject, whether the data belongs to any controlled data categories under certain legislation, etc.
The categorization provided by the Immuta classification frameworks may be used out-of-the-box; however, they are best leveraged as a starting point for purpose-built compliance frameworks implementing organization-specific compliance categories.
Classification: In the third and final phase, data is classified according to its sensitivity level (e.g., Customer Financial Data is Highly Sensitive) and the risk associated with the data subject. Immuta supplies sensitivity level defaults in Detect and risk assessment default tags based on standard industry practice. However, customers are free to customize the assignments under their respective views.
Requirements:
Native SDD enabled and turned on
Immuta permission GOVERNANCE
Click the Discover icon in the navigation menu and select the Patterns tab.
Click Create New.
In the modal, enter a name for the new pattern.
Write a Description for the type of data the pattern will find.
Select the Type of pattern.
For regex and column name regex, enter the regex.
For dictionary, enter the values you want the pattern to match and toggle the switch on if you want them to be case-sensitive.
Click Create Pattern.
See the Manage rules page to add your new pattern to a framework.
Note that all user-created patterns must be a 90% match or greater for the contents of the column to be tagged.
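As a rough way to anticipate the 90% threshold noted above, the sketch below computes the fraction of sampled values a candidate regex matches. Immuta's own sampling and scoring are internal, so this is only a local sanity check while drafting a pattern; the regex and sample values are hypothetical.

```python
# Rough local gauge of the 90% match threshold noted above. Immuta's own
# sampling and scoring are internal, so treat this only as a sanity check
# while drafting a pattern. The regex and sample values are hypothetical.
import re

candidate = re.compile(r"EMP-\d{6}", re.IGNORECASE)  # case-insensitive, RE2-compatible subset

sampled_values = ["EMP-004211", "emp-990001", "EMP-123456", "unknown", "EMP-777777",
                  "EMP-000001", "EMP-314159", "EMP-271828", "EMP-161803", "EMP-999999"]

matches = sum(1 for value in sampled_values if candidate.fullmatch(value))
match_rate = matches / len(sampled_values)
print(f"{match_rate:.0%} of sampled values match")  # 90% here: 9 of 10 values match
```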
Editing a pattern will affect any rule built off the pattern throughout Immuta. To edit a pattern,
Click the Discover icon in the navigation menu and select the Patterns tab.
Click the name of the pattern you want to edit.
Click Edit.
Edit the field you want to change. Note that any shadowed field is not editable; the pattern must be deleted and re-created to change it.
Click Save.
Built-in patterns cannot be edited.
Deleting a pattern will remove it from Immuta and remove all the rules that relied on it in the frameworks throughout Immuta. To delete a pattern,
Click the Discover icon in the navigation menu and select the Patterns tab.
Click the three dot menu in the Action column for the pattern you want to delete.
Select Remove.
Click Confirm.
Built-in patterns cannot be deleted.
Requirement: Immuta permission GOVERNANCE
This how-to guide is for enabling sensitive data discovery (SDD). For additional information on sensitive data discovery and classification, see the Discover architecture page.
Navigate to the App Settings page and scroll to the Sensitive Data Discovery section.
Select the Enable Sensitive Data Discovery (SDD) checkbox to enable SDD.
Click Save and then click Confirm to apply your changes. Note that the Immuta tenant will have a system restart.
Run SDD for a select group of data sources using one of the following options:
Make the following request using the Immuta API, specifying the data sources in the payload.
A successful request will have the code 200 and a body with the number of jobs created from the request:
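The sketch below illustrates one way such a request might look. The endpoint path (/sdd/run), the dataSourceIds field name, the tenant URL, and the authentication header are all assumptions; confirm them against your Immuta API reference.

```python
# Sketch of triggering SDD for specific data sources with the Immuta API.
# The endpoint path (/sdd/run) and the payload field names are assumptions;
# confirm them against your tenant's API reference. The response is expected
# to carry the number of jobs created, per the guide above.
import requests

IMMUTA_URL = "https://your-immuta-tenant.example.com"  # hypothetical tenant URL
API_KEY = "your-api-key"                               # hypothetical API key

resp = requests.post(
    f"{IMMUTA_URL}/sdd/run",              # assumed endpoint path
    json={"dataSourceIds": [12, 15]},     # hypothetical data source IDs and field name
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()                   # expect HTTP 200 on success
print(resp.json())                        # body includes the number of jobs created
```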
Navigate to the data source overview page of the data source you listed in the payload.
Click the Data Dictionary tab.
Assess whether the Discovered and classification tags applied are accurate.
If they are, then repeat the steps above for more of your data sources. Once a majority of your data sources appear to have accurate tags, run SDD on all your data sources. If the tags are not accurate, you will need to tune SDD and classification frameworks. See the Adjust frameworks and tags guide for instructions.
Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run SDD and then select it again in the modal.
Requirement: Immuta permission GOVERNANCE
Make the following request using the Immuta API to run SDD for all data sources, specifying all as true:
A successful request will have the code 200 and a body with the number of jobs created from the request:
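A sketch of that request, under the same assumptions as the earlier example (the /sdd/run path, tenant URL, and authentication header are not confirmed here; only the all parameter comes from this guide):

```python
# Sketch of triggering SDD for all data sources, with "all" set to true as
# described above. The endpoint path and auth header are the same assumptions
# as in the earlier sketch; confirm them against your API reference.
import requests

IMMUTA_URL = "https://your-immuta-tenant.example.com"  # hypothetical tenant URL
API_KEY = "your-api-key"                               # hypothetical API key

resp = requests.post(
    f"{IMMUTA_URL}/sdd/run",            # assumed endpoint path
    json={"all": True},                 # "all": true runs SDD on every data source
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()                 # expect HTTP 200 on success
print(resp.json())                      # body includes the number of jobs created
```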
Discover scans your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
Native SDD enabled and turned on
Immuta permission GOVERNANCE
Sensitive data discovery (SDD) is an Immuta Discover feature that scans your data sources and applies relevant tags when data is recognized. This eliminates a manual tagging process for your data, saving you time and providing standard taxonomy across all your data sources.
To learn more, see the Data discovery page.
Enable sensitive data discovery to start using the default framework on all of your registered data sources. This out-of-the-box framework discovers common data types and tags them automatically when a new data source is registered.
For additional control, create your own patterns to recognize the data that matters to you. Add these patterns to new frameworks and specify the data sources that need this framework. This fine-level control creates automatic tagging that is relevant and accurate to your data, requiring fewer manual adjustments to the resulting tags.
Customize SDD for your data:
If you have any tags that are applied to your data sources by SDD that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again if SDD is re-run.
Reference pages:
Immuta comes with a default framework containing built-in Discovered tags and built-in patterns. These patterns and tags can be used in your own frameworks.
Classification is an Immuta Discover feature that categorizes your data based on the content and the associated risk the data poses. This increases your understanding of your data and allows you to make faster decisions about it.
Enable classification from the Immuta app settings page.
Activate any of the following frameworks:
To start seeing classification tags, enable the Data Security Framework.
If you are using Snowflake and want to see information on your sensitive data in Detect, enable the Risk Assessment Framework.
Opt to enable any of the other compliance frameworks.
Complete the following steps for each framework you want to activate:
Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Activate.
To configure or manage a framework using the Immuta API, see the Frameworks API reference page.
If you have any tags that are applied to your data sources by classification that you don't want, you can easily disable these tags for each data source. This ensures that they will not be applied to the data source again when classification is re-run.
Requirement: Immuta permission APPLICATION_ADMIN
Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Enter the request-friendly name of your global template in the Global SDD Template Name field. This name can be found in the tooltip on the framework's detail page.
Click Save, and then Confirm your changes.
Requirements:
Native SDD enabled and turned on
Immuta permission GOVERNANCE
You can only have one rule per pattern in the framework. If you do not see the pattern for the rule you want to create, then it already has a rule built off of it.
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to edit and navigate to the Discovery Rules tab.
Click Create New.
Select the Tags to apply from the dropdown. The tags you select are the tags applied when the pattern is matched. Note that resulting tags must be under the Discovered parent tag and cannot be parent tags themselves unless they have already been manually applied to a data source.
Select the Criteria type from the dropdown:
Competitive pattern analysis is for regex and dictionary patterns.
Column name is for column name patterns.
Select the Pattern from the dropdown.
Click Create Rule.
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework of the rule you want to edit and navigate to the Discovery Rules tab.
Select the rule you want to edit.
Click Edit.
Edit the field you want to change. Note that any shadowed field is not editable; the rule must be deleted and re-created to change it.
Click Save.
Deleting a rule removes the tags that rule applied the next time SDD runs on a data source. To delete a rule,
Click the Discover icon in the navigation menu and select the Frameworks tab.
Select the framework you want to edit and navigate to the Discovery Rules tab.
Click the three dot menu in the Action column for the rule you want to delete.
Select Remove.
Click Confirm.
Immuta is pre-configured with a set of tags that can be used to write global policies before data sources even exist. See the list of built-in Discovered tags below and the built-in patterns reference for information about where these tags will be applied by the built-in rules.
All the tags below belong to the Country parent. For example, the full tag name will appear as Discovered.Country.Argentina.
All the tags below belong to the Entity parent. For example, the full tag name will appear as Discovered.Entity.Aadhaar Individual.
None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered.Identifier Direct.
None of the tags below have an additional parent or child tag. For example, the full tag name will appear as Discovered.PCI.
Requirements:
Native SDD enabled and turned on
Registered data sources
Immuta permission GOVERNANCE
SDD runs automatically, but if you want to re-run SDD when a new global framework is set or when new rules have been added, you can re-run it for specific frameworks through the UI:
Click the Discover icon and the Identification tab in the navigation menu.
Select the more actions icon.
Select Run SDD and then select it again in the modal.
SDD runs automatically, but if you want to re-run SDD when a new global framework is set or when new rules have been added, you can re-run it for specific data sources through the UI:
Navigate to the data source overview page.
Click the health status.
Select Re-run next to Sensitive Data Discovery (SDD).
Verify discovered tags
If sensitive data discovery has been enabled, then manually adding tags to columns in the data dictionary will be unnecessary in most cases. The data owner will just need to verify that the Discovered tags are correct.
If a governor, data owner, or data source expert disables a Discovered tag from the data dictionary, the column will not be re-tagged when that data source's fingerprint is recalculated or SDD is re-run. When a Discovered tag is disabled, the tag will not completely disappear, so it can be manually enabled through the tag side sheet.
To disable a discovered tag,
Navigate to a data source and click the Data Dictionary tab.
Scroll to the column you want to remove the tag from and click the tag you want to remove.
Click Disable in the side sheet and then click Confirm.
This guide provides information and best practices for migrating from the deprecated legacy sensitive data discovery (SDD) option to the improved native SDD. This guide is for users who have already enabled SDD on their tenant and have Discovered tags on their data sources.
Legacy SDD is deprecated. It will be removed and replaced by native SDD. Native SDD is significantly improved from legacy SDD for discovering and tagging your data with upgrades to the built-in patterns. Additionally, the greatest benefit is the respect for data residency. Native SDD doesn't move any of your data when running. The discovery is done right in your data platform, and the platform only returns the matching patterns and column names to Immuta.
See the reference guides in this section for more information on native SDD.
Native SDD requires Snowflake, Databricks, Redshift, or Starburst (Trino) data sources
Legacy SDD enabled on your tenant
Legacy SDD tags applied to your data sources: To find out if you have legacy SDD tags applied, create a governance report as described in the section below.
Contact your Immuta representative to enable native SDD on your Immuta tenant. Note that unless specifically disabled, all Immuta installations after the 2024.2 LTS have native SDD automatically enabled. Proceed to the next section if you want to check for yourself whether native SDD is already running and tagging your data before you reach out to the representative.
This action will not change anything immediately on your tenant; however, anytime SDD runs in the future, it will be native SDD instead of the legacy version.
To assess native SDD for your data, proceed with the steps below. If you do not review native SDD, the legacy SDD tags will all remain on your data source columns. However, when SDD runs on new data sources and columns, it will apply native SDD tags, and because of the improvements to SDD, it may tag different data than legacy SDD.
Requirement: Immuta permission GOVERNANCE
To check the tags on an individual data source, navigate to the data source data dictionary and select a Discovered tag. On the tag side sheet, you can determine the context of the tag. When patterns match data, native SDD will apply tags, and their tag context will be Sensitive Data Discovery. Any tags with the context Legacy Sensitive Data Discovery were not matched by native SDD but will remain on the data source.
To check your tags globally, navigate to the governance reports page and build a report for sensitive data discovery. This report will present the legacy tags on your data sources' columns and native SDD tags that are also on those columns. Use this report to assess the context of the Discovered tags and understand if native SDD is matching the data you want it to.
These actions will allow you to understand the differences between how native SDD and legacy SDD tag your data and whether your data is recognized as expected by native SDD or if legacy SDD was over-tagging your data. This way you can better tune SDD to your data.
If there are any legacy SDD tags that you want native SDD to catch, you need to tune native SDD so that this type of data is discovered in future tables and columns; see guidance on that in the next section.
Requirement: Immuta permission GOVERNANCE
Using the report you built above, complete these actions to tune SDD:
Focus on a legacy SDD tag that was properly applied to your data. Assess whether the native SDD tag applied to the column instead is more accurate than the legacy tag. If the native tag is applied incorrectly, proceed to the next step.
Complete the steps above for all legacy SDD tags.
Completing the actions above will create parity between what legacy SDD was tagging your data and what native SDD will tag in the future.
In previous documentation, rule and pattern are referred to as classifier or identifier. The language is being updated to rule and pattern to be more accurate and to avoid conflating these terms with other concepts.
Immuta comes with a set of built-in patterns that look for common data types. These patterns were written by Immuta's research and development team and cannot be deleted or edited by users. However, users can build their own rules using these built-in patterns, which will customize the resulting tags based on the organization's needs.
When using SDD with classification frameworks, it is recommended to use the default resulting tags listed in the table below for these built-in patterns. This ensures that the framework rules apply sensitivity tags as intended.
Trigger SDD to run native SDD on your data sources.
Create a new pattern to discover this data. Ensure it is specific and will match your data with 90% confidence.
Create a new rule in your framework using the new pattern and the Discovered tag you want applied to the data.
Retest your updated rules and patterns by re-running SDD, and continue refining to the level of accuracy you want.
Aadhaar Individual
This tag is for Aadhaar Individual numbers.
Adoption Taxpayer ID Number
This tag is applied to data recognized as a United States Adoption Taxpayer Identification number.
Age
This tag is applied to data recognized as an age.
Bank Account
This tag is for bank account numbers.
Bank Routing MICR
This tag is applied to data recognized as an American Bankers Association routing number.
Bankers CUSIP ID
This tag is for CUSIP identification numbers for stocks and bonds.
British Columbia Health Network Number
This tag is applied to data recognized as British Columbia's Personal Health Number.
BSN Number
This tag is for Netherlands citizen service numbers.
CDC Number
This tag is for CDC numbers.
CDI Number
This tag is for CDI numbers.
CIC Number
This tag is for CIC numbers.
CNI
This tag is applied to data recognized as a French National ID card number.
CPF Number
This tag is applied to data recognized as Brazil's CPF number.
CPR Number
This tag is applied to data recognized as Denmark's Personal Identification number.
Credit Card Number
This tag is applied to data recognized as a credit card number.
CURP Number
This tag is for Mexican CURP numbers.
CRYPTO
This tag is applied to data recognized as a Bitcoin Invoice Address.
Date
This tag is applied to data recognized as a date.
Date of Birth
This tag is applied to data recognized as a date of birth.
DEA Number
This tag is applied to data recognized as the DEA number of a healthcare provider.
DNI Number
This tag is applied to data recognized as an Argentina National Identity number.
Domain Name
This tag is applied to data recognized as a domain.
Driver's License Number
This tag is applied to data recognized as driver's license numbers from Germany or the United Kingdom.
Electronic Mail Address
This tag is applied to data recognized as an email address.
Employer ID Number
This tag is applied to data recognized as an Employer Identification number from the United States.
Ethnic Group
This tag is applied to data recognized as an ethnic group.
FDA Code
This tag is applied to data recognized as the code of a drug or ingredient registered with the FDA.
Gender
This tag is applied to data recognized as a gender.
GST Individual
This tag is for Indian GST individual numbers.
Healthcare NPI
This tag is applied to data recognized as a United States National Provider Identifier number.
IBAN Code
This tag is applied to data recognized as an International Bank Account number.
ICD10 Code
This tag is applied to data recognized as an ICD10 code from the International Statistical Classification of Diseases and Related Health Problems.
ICD9 Code
This tag is for ICD9 codes from the International Statistical Classification of Diseases and Related Health Problems.
ID Number
This tag is for any ID number.
Identity Card Number
This tag is applied to data recognized as an identity card number from Germany.
IMEI
This tag is applied to data recognized as an International Mobile Equipment Identity number.
Individual Number
This tag is for any individual number.
Individual Taxpayer ID Number
This tag is applied to data recognized as a United States Individual Taxpayer Identification Number.
IP Address
This tag is applied to data recognized as an IP address.
Location
This tag is applied to data recognized as a country, state, address, or municipality.
MAC Address
This tag is applied to data recognized as a Media Access Control address.
MAC Address Local
This tag is applied to data recognized as a local Media Access Control address.
Medicare Number
This tag is applied to data recognized as a Medicare number from Australia.
National Health Service Number
This tag is for national health service numbers.
National ID Card Number
This tag is applied to data recognized as a national ID card number from Belgium.
National ID Number
This tag is applied to data recognized as a national ID number from Finland, Sweden, and Thailand.
National Insurance Number
This tag is applied to data recognized as a United Kingdom national insurance number.
National Registration ID Number
This tag is for national registration ID numbers.
NI Number
This tag is for Norway NI numbers.
NIE Number
This tag is applied to data recognized as a Spanish Foreigner Identification number.
NIF Number
This tag is applied to data recognized as a Spanish Tax Identification number.
NIK Number
This tag is applied to data recognized as an Indonesian personal identification number (NIK).
NIR
This tag is applied to data recognized as France's National ID number.
Ontario Health Insurance Number
This tag is applied to data recognized as part of an Ontario Health Insurance Plan string.
PAN Individual
This tag is for PAN Individual numbers.
Passport
This tag is applied to data recognized as a passport number from Australia, Canada, France, Spain, Sweden, and the United States.
Person Name
This tag is applied to data recognized as people's names.
PESEL Number
This tag is for Poland PESEL numbers.
Postal Code
This tag is applied to data recognized as a United States zip code.
Preparer Taxpayer ID Number
This tag is applied to data recognized as a Preparer Taxpayer ID number.
Quebec Health Insurance Number
This tag is applied to data recognized as a Quebec Health Insurance Number.
Resident ID Number
This tag is for China Resident ID numbers.
RRN
This tag is for Korea Resident Registration numbers.
Social Insurance Number
This tag is applied to data recognized as a social insurance number.
Social Security Number
This tag is applied to data recognized as a United States Social Security Number.
State
This tag is applied to data recognized as a state of the United States.
Swift Code
This tag is applied to data recognized as a SWIFT code.
Tax File Number
This tag is applied to data recognized as a tax file number.
Taxpayer ID Number
This tag is applied to data recognized as Taxpayer ID numbers from the United States.
Taxpayer Reference
This tag is applied to data recognized as United Kingdom Taxpayer Reference numbers.
Telephone Number
This tag is applied to data recognized as a phone number.
Tollfree Telephone Number
This tag is applied to data recognized as a United States toll-free phone number.
URL
This tag is applied to data recognized as a URL.
Vehicle Identifier or Serial Number
This tag is applied to data recognized as a VIN.
Identifier Direct
This tag is applied to data recognized as a direct identifier that can be uniquely associated with an individual. Examples of direct identifiers include: name, username, email, official individual identification numbers such as passport or identity card numbers, or privately issued individual identification numbers such as a student ID.
Identifier Indirect
This tag is applied to data recognized as an indirect identifier that is not uniquely associated with an individual. However this indirect identifier could become distinguishable when combined with other attributes. Examples of indirect identifiers include: age and affinity.
Identifier Undetermined
This tag is applied to data which could be an identifier associated with an individual.
PCI
This tag is applied to data recognized as payment card information.
PHI
This tag is applied to data recognized as personal health data.
PII
This tag is applied to data recognized as personally identifiable information.
Argentina
This tag is applied to data recognized as specific to Argentina (e.g., an Argentina National Identity Number).
Australia
This tag is applied to data recognized as specific to Australia (e.g., an Australian Medicare number or Australian passport number).
Belgium
This tag is applied to data recognized as specific to Belgium (e.g., a Belgium National ID card).
Brazil
This tag is applied to data recognized as specific to Brazil (e.g., a Brazil CPF number).
Canada
This tag is applied to data recognized as specific to Canada (e.g., a British Columbia PHN, OHIP string, Canadian passport number, or Quebec's HIN).
Chile
This tag is for data specific to Chile.
China
This tag is for data specific to China.
Colombia
This tag is for data specific to Colombia.
Denmark
This tag is applied to data recognized as specific to Denmark (e.g., a Denmark CPR or Person-number).
Finland
This tag is applied to data recognized as specific to Finland (e.g., a Finland National ID number).
France
This tag is applied to data recognized as specific to France (e.g., a French National ID card number, France National ID number, or French passport number).
Germany
This tag is applied to data recognized as specific to Germany (e.g., a German driver's license number or a Germany Identity Card number).
Hong Kong
This tag is for data specific to Hong Kong.
India
This tag is for data specific to India.
Indonesia
This tag is for data specific to Indonesia.
Japan
This tag is for data specific to Japan.
Korea
This tag is for data specific to Korea.
Mexico
This tag is for data specific to Mexico.
Netherlands
This tag is for data specific to Netherlands.
Norway
This tag is for data specific to Norway.
Paraguay
This tag is for data specific to Paraguay.
Peru
This tag is for data specific to Peru.
Poland
This tag is for data specific to Poland.
Singapore
This tag is for data specific to Singapore.
Spain
This tag is applied to data recognized as specific to Spain (e.g., Spain Foreigner Identification number, Spain Tax Identification number, or Spanish passport number).
Sweden
This tag is applied to data recognized as specific to Sweden (e.g., a Sweden National ID number or Swedish passport number).
Taiwan
This tag is for data specific to Taiwan.
Thailand
This tag is applied to data recognized as specific to Thailand (e.g., a Thailand National ID number).
Turkey
This tag is for data specific to Turkey.
UK
This tag is applied to data recognized as specific to the United Kingdom (e.g., a United Kingdom driver's license number, United Kingdom National Insurance number, or United Kingdom Taxpayer Reference number).
Uruguay
This tag is for data specific to Uruguay.
US
This tag is applied to data recognized as specific to the U.S. (e.g., an FDA code, United States ATIN, ABA routing number, DEA number, United States EIN, United States NPI number, United States ITIN, United States passport number, United States Preparer Taxpayer ID number, United States SSN, United States territory or state, or United States toll-free phone number).
Venezuela
This tag is for data specific to Venezuela.
AGE
Matches numeric strings between 10 and 199.
Discovered.PII
Discovered.Identifier Indirect
Discovered.PHI
Discovered.Entity.Age
ARGENTINA_DNI_NUMBER
Matches strings consistent with Argentina National Identity (DNI) Number. Requires an eight-digit number with optional periods between the second and third and fifth and sixth digit.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Argentina
Discovered.PHI
Discovered.Entity.DNI Number
AUSTRALIA_MEDICARE_NUMBER
Matches numeric strings consistent with Australian Medicare number. Requires a ten- or eleven-digit number. The starting digit must be between 2 and 6, inclusive. Optional spaces can be placed between the fourth and fifth and ninth and tenth digits. An optional 11th digit separated by a / can be present. A checksum is required.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Australia
Discovered.PHI
Discovered.Entity.Medicare Number
AUSTRALIA_PASSPORT
Matches strings consistent with Australian Passport number. An 8- or 9-character string is required, with a starting upper case character (N, E, D, F, A, C, U, X) or a two-character starting character (P followed by A, B, C, D, E, F, U, W, X, or Z) followed by seven digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Australia
Discovered.PHI
Discovered.Entity.Passport
BELGIUM_NATIONAL_ID_CARD_NUMBER
Matches numeric strings consistent with Belgium's National ID card. Requires a twelve-digit number with a hyphen (-) between the third and fourth digits and the tenth and eleventh digits. A two-digit checksum is required.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Belgium
Discovered.PHI
Discovered.Entity.National ID Card Number
BITCOIN_INVOICE_ADDRESS
Matches strings consistent with the following Bitcoin Invoice Address formats: P2PKH, P2SH, and Bech32. P2PKH and P2SH must start with a 1 or a 3, respectively, followed by 25 - 34 alphanumeric characters, excluding l, I, O, and 0. Bech32 formats must begin with bc1 and be followed by 39 characters. To be identified, any addresses must have a valid checksum.
Discovered.Entity.CRYPTO
Discovered.PCI
BRAZIL_CPF_NUMBER
Matches a numeric string consistent with Brazil's CPF (Cadastro de Pessoas Físicas) number. An eleven-digit numeric string with non-numeric separators after the third, sixth, and ninth digits. A two-digit checksum is required.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Brazil
Discovered.PHI
Discovered.Entity.CPF Number
CANADA_BC_PHN
Matches numeric strings consistent with British Columbia's Personal Health Number (PHN). Requires a ten-digit numeric string with optional hyphens (-) or spaces after the fourth and seventh digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.British Columbia Health Network Number
CANADA_OHIP
Matches alphanumeric strings consistent with Ontario's Health Insurance Plan (OHIP). Requires a twelve-character alphanumeric code. Optional hyphens (-) or spaces can appear after the fourth, seventh, and tenth digits. The final two characters are a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.Ontario Health Insurance Number
CANADA_PASSPORT
Matches strings consistent with the Canadian Passport Number format.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.Passport
CANADA_QUEBEC_HIN
Matches alphanumeric strings consistent with Quebec's Health Insurance Number (HIN). Requires four alphabetic characters followed by an optional space or hyphen (-), and then eight digits with an optional hyphen or space after the fourth digit.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Canada
Discovered.PHI
Discovered.Entity.Quebec Health Insurance Number
CREDIT_CARD_NUMBER
Matches strings consistent with a credit card number with prefixes matching major credit card companies. Must include a valid checksum.
Discovered.PCI
Discovered.Entity.Credit Card Number
DATE
Matches strings consistent with dates. These can include days of the week, dates, and date times.
Discovered.Entity.Date
DENMARK_CPR_NUMBER
Matches numeric strings consistent with Personal Identification Number (CPR-number or Person-number). Requires a ten-digit number with either a DDMMYY-SSSS or DDMMYYSSSS format. The first six digits are an individual's birth date in Day, Month, Year format. The final four digits comprise the sequence number.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Denmark
Discovered.PHI
Discovered.Entity.CPR Number
DOMAIN_NAME
Matches domain names using a very broad pattern.
Discovered.Entity.Domain Name
EMAIL_ADDRESS
Matches strings consistent with an email address. The username must be fewer than 255 characters, followed by @, a domain of fewer than 255 characters, and a top-level domain of between 2 and 20 characters.
Discovered.PHI
Discovered.Entity.Electronic Mail Address
Discovered.Identifier Direct
ETHNIC_GROUP
Matches strings consistent with the US Census race designations.
Discovered.PII
Discovered.Entity.Ethnic Group
FDA_CODE
Matches a string consistent with a drug or ingredient registered with the Food and Drug Administration (FDA). Must start with 4 to 6 digits, followed by a hyphen, 3 to 4 digits, another hyphen, and one to two digits.
Discovered.Country.US
Discovered.Entity.FDA Code
FINLAND_NATIONAL_ID_NUMBER
Matches a string consistent with Finland's National ID number. Requires an eleven-character string in a DDMMYYCZZZQ format. The first six digits are an individual's birth date in day, month, year format. The C character is a century-of-birth indicator (+ for the years 1800-1899, - for the years 1900-1999, and A for the years 2000-2099). ZZZ is an individual ID number, and Q is a required checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Finland
Discovered.PHI
Discovered.Entity.National ID Number
FRANCE_CNI
Matches numeric strings consistent with the French National ID card number (carte nationale d'identité). Requires a twelve-digit numeric string.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.France
Discovered.PHI
Discovered.Entity.CNI
FRANCE_NIR
Matches numeric strings consistent with France's National ID number (Numéro d'Inscription au Répertoire). Requires a fifteen-digit numeric string. An optional hyphen (-) or space can appear after the 13th digit. The 14th and 15th digits act as a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.France
Discovered.PHI
Discovered.Entity.NIR
FRANCE_PASSPORT
Matches alphanumeric strings consistent with the French passport number. Requires two digits followed by two uppercase letters and five digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.France
Discovered.PHI
Discovered.Entity.Passport
GENDER
Matches strings consistent with gender or gender abbreviations.
Discovered.PII
Discovered.Identifier Indirect
Discovered.PHI
Discovered.Entity.Gender
GERMANY_DRIVERS_LICENSE_NUMBER
Matches alphanumeric strings consistent with Germany's Driver's License number. Requires an eleven-character string: a digit or letter, followed by two digits, six digits or letters, one digit, and one digit or letter.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Germany
Discovered.PHI
Discovered.Entity.Drivers License Number
GERMANY_IDENTITY_CARD_NUMBER
Matches alphanumeric strings consistent with Germany's Identity Card number. Requires a single letter followed by eight digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Germany
Discovered.PHI
Discovered.Entity.Identity Card Number
IBAN_CODE
Matches strings consistent with an International Bank Account Number (IBAN). Must contain a valid country code.
Discovered.Entity.IBAN Code
ICD10_CODE
Matches strings consistent with codes from the International Statistical Classification of Diseases and Related Health Problems (ICD), as drawn from the Clinical Modification lexicon from the year 2020.
Discovered.Entity.ICD10 Code
IMEI_HARDWARE_ID
Matches strings consistent with an International Mobile Equipment Identity (IMEI) number. Must contain 15 digits with optional hyphens or spaces after the second, eighth, and fourteenth digits.
Discovered.Entity.IMEI
IP_ADDRESS
Matches IP Addresses in the V4 and V6 formats.
Discovered.Entity.IP Address
LOCATION
Matches strings consistent with Countries, States, Addresses, or Municipalities. By default, it focuses on locations in the United States.
Discovered.Entity.Location
MAC_ADDRESS
Matches strings consistent with a Media Access Control (MAC) address. Must contain twelve hexadecimal digits, with every two digits separated by a colon.
Discovered.Entity.MAC Address
MAC_ADDRESS_LOCAL
Matches strings consistent with a local Media Access Control (MAC) address.
Discovered.Entity.MAC Address Local
PERSON_NAME
Matches strings consistent with a dictionary of people's names. Names are drawn from the US Social Security database.
Discovered.PII
Discovered.PHI
Discovered.Entity.Person Name
Discovered.Identifier Indirect
PHONE_NUMBER
Matches strings consistent with telephone numbers. Primarily looks for strings consistent with the United States telephone number format.
Discovered.Entity.Telephone Number
POSTAL_CODE
Matches strings consistent with a valid US ZIP code, with an optional +4 extension. Only valid five-digit ZIP codes are detected.
Discovered.Entity.Postal Code
SPAIN_NIE_NUMBER
Matches strings consistent with Spain's Foreigner Identification number. Requires an eight-character string: the initial character must be X, Y, or Z, followed by seven digits, then an optional hyphen or space and a single checksum character.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Spain
Discovered.PHI
Discovered.Entity.NIE Number
SPAIN_NIF_NUMBER
Matches strings consistent with Spain's Tax Identification number. Requires eight digits followed by an optional hyphen or space and a single checksum character.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Spain
Discovered.PHI
Discovered.Entity.NIF Number
SPAIN_PASSPORT
Matches strings consistent with Spain's Passport number. Requires an eight- or nine-character string, starting with either two or three letters followed by six digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Spain
Discovered.PHI
Discovered.Entity.Passport
STREET_ADDRESS
Matches strings consistent with street addresses. Primarily looks for strings consistent with the United States street naming convention.
Discovered.Entity.Location
SWEDEN_NATIONAL_ID_NUMBER
Matches numeric strings consistent with Sweden's National ID number. Requires a ten- or twelve-digit string that must start with a date in either the YYMMDD or YYYYMMDD format. An optional - or + character then separates the four ending digits. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Sweden
Discovered.PHI
Discovered.Entity.National ID Number
SWEDEN_PASSPORT
Matches numeric strings consistent with Sweden's Passport number. Requires an 8-digit number.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Sweden
Discovered.PHI
Discovered.Entity.Passport
SWIFT_CODE
Matches alphanumeric strings consistent with the SWIFT code (also known as a Bank Identifier Code, or BIC) format.
Discovered.Entity.Swift Code
THAILAND_NATIONAL_ID_NUMBER
Matches strings consistent with Thailand's National ID number. Requires a 13-digit number with optional spaces or hyphens (-) after the first, fifth, tenth, and twelfth digits. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.Thailand
Discovered.PHI
Discovered.Entity.National ID Number
TIME
Matches strings consistent with times. Can contain both date and time components.
Discovered.Entity.Date
UK_DRIVERS_LICENSE_NUMBER
Matches alphanumeric strings consistent with the United Kingdom's Driver's License number. Requires either a 16- or 18-character string. The first five characters represent the driver's surname, padded with 9s, followed by a single digit for the decade of birth, two digits for the month of birth (incremented by 50 for female drivers), two digits for the day of birth, one digit for the year of birth, two letters, an arbitrary digit, and two digits. Two additional digits can be present for each license issuance.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.UK
Discovered.PHI
Discovered.Entity.Drivers License Number
UK_NATIONAL_INSURANCE_NUMBER
Matches alphanumeric strings consistent with the United Kingdom's National Insurance number. Requires a nine-character string. The first two characters must be letters, followed by an optional space, then six digits with optional spaces or hyphens (-) every two digits, ending with a letter.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.UK
Discovered.PHI
Discovered.Entity.National Insurance Number
UK_TAXPAYER_REFERENCE
Matches ten-digit numeric strings consistent with UK Taxpayer Reference (UTR) numbers. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.UK
Discovered.PHI
Discovered.Entity.Taxpayer Reference
URL
Matches strings consistent with a Uniform Resource Locator (URL). The string must begin with http://, https://, ftp://, file:///, or mailto:, followed by a string and ending with a top-level domain of no more than 128 characters.
Discovered.Entity.URL
US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER
Matches a numeric string consistent with a United States Adoption Taxpayer Identification Number (ATIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having 93 as an allowed Group Number.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Adoption Taxpayer ID Number
US_BANK_ROUTING_MICR
Matches numeric strings consistent with an American Bankers Association (ABA) Routing Number. Must be a nine-digit number starting with 0, 1, 2, 3, 6, or 7, followed by eight digits. The final digit is a checksum.
Discovered.Country.US
Discovered.Entity.Bank Routing MICR
US_DEA_NUMBER
Matches alphanumeric strings consistent with a Drug Enforcement Administration (DEA) number assigned to a health care provider. Must be nine characters long. The first two characters must be alphanumeric, and the last seven must be digits. The final digit is a checksum.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.Entity.DEA Number
US_EMPLOYER_IDENTIFICATION_NUMBER
Matches numeric strings consistent with a United States Employer Identification Number (EIN). Strings must contain nine digits with a hyphen after the second digit.
Discovered.Country.US
Discovered.Entity.Employer ID Number
US_HEALTHCARE_NPI
Matches numeric strings consistent with US National Provider Identifier (NPI). Strings must be either 10 or 15 digits with the final digit being a valid checksum.
Discovered.PII
Discovered.Country.US
Discovered.Entity.Healthcare NPI
Discovered.Identifier Undetermined
US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER
Matches a numeric string consistent with a United States Individual Taxpayer Identification Number (ITIN). Requires a string similar in format to a US Social Security Number, but starting with a 9 in the Area Number and having a limited set of allowed Group Numbers.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Individual Taxpayer ID Number
US_PASSPORT
Matches numeric strings consistent with a United States passport number. Strings must contain nine digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Passport
US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER
Matches strings consistent with a Preparer Taxpayer ID number. Strings must have nine characters, starting with a P followed by eight digits.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.Entity.Preparer Taxpayer ID Number
US_SOCIAL_SECURITY_NUMBER
Matches strings consistent with a US Social Security Number. Strings must contain nine digits and comprise three parts: the three left-most digits designating the area number, the middle two digits designating the group number, and the four right-most digits designating the serial number. For a column to be tagged, none of these parts can contain all zeroes, and area numbers must not be 666 or in the range of 900-999.
Discovered.PII
Discovered.Identifier Direct
Discovered.Country.US
Discovered.PHI
Discovered.Entity.Social Security Number
US_STATE
Matches strings consistent with either a full name or two-letter abbreviation of a US state or territory.
Discovered.Country.US
Discovered.Entity.State
US_TOLLFREE_PHONE_NUMBER
Matches strings consistent with a US toll-free telephone number. Allowed area codes are 800, 88 followed by any digit, or 899.
Discovered.Country.US
Discovered.Entity.Tollfree Telephone Number
VEHICLE_IDENTIFICATION_NUMBER
Matches strings consistent with Vehicle Identification Numbers. A checksum is required as well as a valid World Manufacturer Identifier.
Discovered.Country.US
Discovered.Entity.Vehicle Identifier or Serial Number
Public preview: This feature is in public preview and available to all accounts.
Discover comes preconfigured with a bundle of classification frameworks for use out-of-the-box once endorsed by your organization's admins. These frameworks are designed by Immuta’s Legal Engineering and Research Engineering teams and informed by data privacy regulations and security standards: GDPR, CCPA, GLBA, HIPAA, PCI, and global best practices. They are a starting point for companies to customize to their own classification, security, and risk policies.
The Data Security Framework is the general classification framework. It provides the groundwork for categorizing data based on its context, but it is not specific to any regulatory framework and does not assign sensitivity or risk values to the data it tags. It provides a consistent taxonomy used throughout Immuta, from the other built-in frameworks, to customized frameworks that classify data valuable to your organization, to Secure data and subscription policies.
The Data Security Framework is a supportive tool that accelerates data classification. Use the Data Security Framework in tandem with Discover identification frameworks out-of-the-box for the easy and quick onboarding of data sources and tags. Then, choose the compliance frameworks that matter to your industry or start building your own classification frameworks that assign sensitivity to the specific data of your organization. Your organization's compliance team should review the compliance frameworks as you would a template for a policy or contract and adapt them as needed to ensure a complete inventory and proper classification of your data.
You can view the Data Security Framework tags and their descriptions from the tags page in the UI or from the data dictionary when they are applied to a data source. Note the field and record tags: while they seem similar, both are necessary to convey the content of your data. Field tags describe the content of the columns, and record tags describe the content of the table.
Use the Data Security Framework with the Risk Assessment Framework
To classify your data, use both the Data Security Framework to set the groundwork for classification and the Risk Assessment Framework to apply tags with sensitivity metadata based on the Data Security Framework tags. With Snowflake, these frameworks together will show sensitive queries in Detect dashboards.
The Risk Assessment Framework provides visible tags describing your data's sensitivity based on the confidentiality risks it poses to your organization or the data subjects.
Use the Risk Assessment Framework out-of-the-box with the Data Security Framework and Discover identification frameworks to provide sensitivities to view in the Detect dashboards. Additionally, you can copy the framework using the API and create new rules to assign risk level and sensitivity to other data specific to your use case.
The risk assessment tags have sensitivity level metadata assigned to them that will appear in the Detect dashboards as non-sensitive (when no risk assessment tag is applied), sensitive, or highly sensitive. Additionally, use the risk assessment tags to build Secure policies that restrict access to high-risk and confidential data.
RAF.Confidentiality.Medium: Indicates confidential data with medium privacy risk to the data subject. Sensitivity: Sensitive (1).
RAF.Confidentiality.High: Indicates confidential data with high privacy risk to the data subject. Sensitivity: Highly-Sensitive (2).
RAF.Confidentiality.Very High: Indicates confidential data with very high privacy risk to the data subject. Sensitivity: Highly-Sensitive (3).
Private preview: This feature is in private preview and available to select accounts.
Use the Data Security Framework with regulatory frameworks
The Data Security Framework provides the necessary translation of Discovered entity tags to classification tags. Without the Data Security Framework enabled, the regulatory frameworks will not work with your data automatically and will require customization.
Immuta comes with four regulatory frameworks informed by the best practices of a specific regulation or standard. These are designed by Immuta’s Legal Engineering and Research Engineering teams as a general interpretation, but each organization should customize them based on their internal practices:
CCPA Framework: Classifies personal sensitive information controlled under the California Consumer Privacy Act (CCPA), as amended by the California Privacy Rights Act (CPRA). This framework tags personal information, including communication content (like the body of a text message) and details about an individual's sexual orientation, religion, race, or biometric data.
GDPR Framework: Classifies personal data of specific categories protected under the EU General Data Protection Regulation (GDPR). This framework tags personal data, including details about an individual's health, sexual orientation, religion, race, or biometric data.
HIPAA Framework: Classifies protected health data controlled under the US Health Insurance Portability and Accountability Act (HIPAA). This framework tags health data connected to a specific individual.
PCI Framework: Classifies payment card information relevant to the Payment Card Industry (PCI) standard. This framework tags payment card information, including account, authentication, and cardholder data.
Some compliance frameworks are used to add context and apply Data Security Framework tags. Use the data inventory dashboard to enable frameworks with information on the other frameworks they depend on.
Organizations are responsible for making their own independent assessment of the framework rules. The framework rules are only templates and are not necessarily adapted to the specific context in which an organization operates. Framework rules do not constitute legal advice. They do not create any commitments or assurances from Immuta that users will necessarily comply with the statutes or standards that have informed these framework rules.
Private preview
This feature is only available to select accounts. To activate classification frameworks without the private preview feature, use the .
Requirements:
Native SDD enabled and turned on
Registered data sources
Immuta permission GOVERNANCE
To activate a classification framework,
Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Activate.
Repeat this process for all frameworks relevant to your data. See the for information on Immuta's built-in frameworks.
To deactivate a classification framework,
Navigate to Discover and select the Classification tab.
Click the more actions icon in the Actions column for the framework you want to activate.
Select Deactivate.
About Classification in Immuta
Public preview: This feature is available to all accounts.
Classification is the process in which data is categorized by the content and the associated risk level based on context. To classify your data, Discover evaluates your data in phases:
Sensitive data discovery (SDD) runs to identify your data by content type. The data is discovered and evaluated by the pattern it matches and is tagged.
The Data Security Framework scans those tags and any other tags applied to the data source and columns to categorize the data by context. This phase considers the data and the data surrounding it to understand the category of the data within the context of the data source.
Other regulatory-based frameworks scan and build off of the Data Security Framework tags. These frameworks are specific to regulations and standards and tag the data that matters to each framework.
The Risk Assessment Framework scans and builds off of the Data Security Framework. This framework tags data with specific risk assessment tags that describe the risk the data poses to your organization or the data subject. They also contain additional metadata used in the to describe the risk as sensitivity and visualize when that sensitive data is accessed.
Every phase of classification in Immuta can be customized to find and tag the data your organization cares about. Users can customize the Data Security Framework to find, match, and tag data they want categorized based on the organization's processes. Then, users can modify the by adjusting the sensitivity of classification tags to the organization’s policies or creating new tags and rules in customized frameworks. After data is classified, classification tags can be used to or .
Using Discover classification to assign risk and sensitivity levels to your data and Detect dashboards to visualize the risk levels offers these benefits:
Increasing the semantic understanding of your data to better meet compliance requirements
Reducing the time to make decisions about what data access is allowed under what purposes
Reducing the effort and time to respond to auditors about data access in your company
Reducing the labor of classifying data to enumerate what data is within the scope of security or regulatory compliance frameworks
Both entity and classification tags describe the content of data on a per-column basis, and you can use them to and . However, there are key differences between the two:
Entity tags are applied through identification and describe what the data is. SDD applies entity tags to columns based on the patterns of the data.
Classification tags are applied through categorization and risk assessment and describe the context of the data and the risk it poses. Using classification frameworks, classification tags are applied to columns based on the entity tags previously applied by SDD. Additional classification tags can then be applied, providing even more context or expressing the property of the record rather than just the column.
Entity tags describe the contents of individual columns, in isolation. But you don't access individual columns in isolation, so why would you determine their sensitivity that way? Entity tags do not attempt to and cannot contextualize column contents with neighboring columns' contents. This means that connections between data are lost if they cannot be identified through a pattern within the column itself. Classification tags describe the contents of a table with the context of all its columns, providing a holistic view of the risk of the data for what it is, rather than the pattern it fits. Context is necessary to understand whether your data is public or private data, risky or safe to have ungoverned access, or sensitive and creating toxic joins when accessed with other tables.
Additionally, entity tagging does not indicate how sensitive the data is, but classification tags can carry a sensitivity level. For example, an entity tag may identify a column that contains telephone numbers, but the entity tag alone cannot say that the column is sensitive. A phone number associated with a person may be classified as sensitive, while the publicly listed phone number of a company might not be considered sensitive.
After you understand what entities your data contains using SDD, you need to adopt frameworks that determine what combinations of data constitute sensitive data and their level of sensitivity.
Frameworks are a set of data categories and a set of classification rules to place data into those categories. In Immuta, the data categories are represented by tags, and when data fits a classification rule the tag is applied:
Classification rules determine how each classification tag is applied. These rules can apply tags based on tags already on the column, tags applied to neighboring columns, and tags applied to the data source. This means that the complete data source is considered when classifying your data sources, and even tags applied to individual columns can affect the risk level of the entire data source.
Data classification is a process, and with Immuta, much of it is automated. This means you can reap the benefits of classified and tagged data more quickly and easily than by manually classifying and tagging it:
Requirements:
Native SDD enabled and turned on
Registered data sources
Immuta permission GOVERNANCE
Immuta Discover provides identification frameworks out-of-the-box to recognize and tag data, and Discover also provides classification frameworks out-of-the-box to categorize and classify data. These frameworks are all generic to industry practices and should be customized to each organization's specific needs.
Tune SDD frameworks, rules, and patterns first to adjust where Discovered tags are applied. Because classification frameworks apply classification tags from the Discovered tags, tuning SDD should come first and will have trickle-down effects on classification. Customizing SDD requires some initial work but will automate data tagging for all data sources in the future.
Follow the steps below to tune SDD from the Default Framework:
: It is recommended to copy the Default Framework and adjust the rules from there.
.
.
: This will remove the tags from any previous identification frameworks and rerun SDD with your new framework. From here, either continue to edit patterns and rules to reconfigure the applied tags, or if you are happy with the results, proceed to the next step.
.
After SDD has applied entity tags, classification frameworks will automatically reapply their tags to account for any changes to Discovered tags. It may be necessary to adjust the classification tags based on your organization's data, security, and compliance needs.
Requirements:
Immuta permission AUDIT
Use the Detect dashboards to review queries at different sensitivity levels and review the tags that have been applied to your data source columns to understand the tags that Immuta applied there:
Have an Immuta user subscribed to a data source make multiple queries to a data source in Snowflake. The user should query both non-sensitive and sensitive data.
Navigate to the Audit page and click ↻Native Query Audit to pull in queries made in Snowflake.
Navigate to the Events (Beta) page. Note that Snowflake has a 15-minute data latency for all audit events.
Select the Event Id of one of the queries. Click the Columns tab.
The Columns tab lists the columns in the query, organized from highest to lowest sensitivity, along with the tags applied to each column. Check that the columns you know to be sensitive are here.
For example, if the query has a column with last names, you should see a minimum of the following tags: Discovered.PII, DSF.Personal, DSF.Record.Subject.Type.Individual, DSF.Record.Identifiability.Identifiable, and DSF.Control.Personal.
Note any sensitive columns not labeled as sensitive.
Complete steps 2-5 for as many queries as you want.
Requirement: Immuta permission GOVERNANCE or data owner
Target some data sources to manually review tags:
Navigate to the data dictionary for the data source by opening the Data Sources page and selecting a data source. Click the Data Dictionary tab to open the data dictionary.
The data dictionary lists the data source columns, with details about the name, data type, and a list of the tags on each column. Assess whether the tags are accurate to your data.
Tags may be unexpected but still accurate to your data. Additionally, they may have been applied because they were found to be the best match from the SDD patterns in the framework.
If you want to improve SDD and personalize it to your data,
Assess why the tag was applied to your data.
Is the pattern incorrectly matching this specific column, but correct in other places? It must have been the most correct match found by SDD. Create a better match by completing the following steps:
If you want to remove the unexpected tags, use one of the following how-to guides:
If you were expecting some sensitive data to be tagged and it is not, enable additional tags using one of the following how-to guides:
Requirement: Immuta permissions GOVERNANCE and AUDIT
Navigate to the Data Sources page and select the data sources that you assessed and noted issues with.
Click the Data Dictionary tab.
Delete unnecessary tags by clicking the tag you want to remove from the column and selecting Disable from the tag side sheet.
To add tags,
Click Add Tags in the Actions column.
Begin typing the name of the tag you want to add in the Search by Name field and select the tag from the dropdown list.
Click Add.
The built-in in Immuta provide a quick way to leverage your own catalog or data platform tags to establish classification tags. These classification tags can then be used in the Immuta Data Platform for query activity visualizations, monitors, reports, and policies. After you have configured a data catalog integration and registered data sources in Immuta, you can start automating data classification of a column based on its context by considering the combination of its associated tags, its neighboring columns' tags, or its table tag. Classification frameworks also provide . To use classification frameworks with your current tags from an external catalog, use one of the following options:
Follow the tutorial below: This starter framework is built to map a classification scale of restricted, confidential, internal, and public to Immuta's three-level scale. It requires an , but all other steps are described below.
: This minimal framework allows you to map your own classification tags to Immuta classification tags. Then, your users' queries will have a sensitivity score on the Detect dashboard and in audit logs based on the classification tags on the data columns they queried. Use this option if you have already classified your organization’s data in an external catalog and want that metadata reflected in Immuta as Sensitive and Highly Sensitive.
: This option allows you to map your own tags describing your data to Immuta's predefined classification tags in the context of a specific compliance framework. Immuta provides built-in frameworks for GDPR, CCPA, and HIPAA. Map your tags to the most comparable Data Security Framework (DSF) tag, and Immuta will apply the classification tag based on the framework. Use this option if you have descriptive tags on your data and want that metadata mapped to a specific compliance framework.
Follow this guide to map your external catalog tags to the example framework, or consult the for more information about the framework schema.
Using the example framework below, customize the framework for your organization's classification tags.
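The snippet below is a minimal sketch of what such a framework payload might look like, assuming the field nesting implied by the parameter descriptions that follow. The framework name, tag names, and rule contents are illustrative placeholders rather than Immuta-defined values, so consult the framework API reference for the authoritative schema.

```json
{
  "name": "Example Catalog Classification Framework",
  "active": true,
  "tags": [
    {
      "name": "Example.Confidentiality.High",
      "source": "curated",
      "sensitivities": [
        { "dimension": "confidentiality", "sensitivity": 2 }
      ]
    }
  ],
  "rules": [
    {
      "classificationTag": { "name": "Example.Confidentiality.High", "source": "curated" },
      "columnTags": [
        { "name": "Confidential", "source": "collibra" }
      ]
    }
  ]
}
```

The parameters referenced in this sketch are described below.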
tags: These tags are automatically created in Immuta with the sensitivity you assign. All tags used in the classificationTag parameter should be defined here.
tags.sensitivities: This is metadata for the sensitivity of the new tag. Use confidentiality for dimension. Options for sensitivity are 1 (shown as sensitive in Detect dashboards) and 2 (shown as highly sensitive in Detect dashboards). For nonsensitive, leave this parameter empty.
rules: These are the rules for applying the tags defined above.
rules.classificationTag: This classification tag must be defined in tags. Use the name you want; the source is curated. This is the tag that will be applied if the rule requirement is met.
rules.columnTags: This object represents tags on a column. If the tag defined here is found on a column, then the rule's classificationTag will be applied to the same column.
rules.neighborColumnTags: This object represents tags on other columns in the data source. If the tag defined here is found on any column in the data source, then the rule's classificationTag will be applied to all the neighboring columns.
rules.tableTags: This object represents tags on the data source. If the tag defined here is found on the data source, then the rule's classificationTag will be applied to all the columns in that data source.
active: When true, the framework is active and will apply tags when the rules are met.
Follow the example below to map your external tags to the rules in the example framework.
The Immuta built-in Risk Assessment Framework has a rule where columns tagged DSF.Interpretation.Credentials.Secret by sensitive data discovery will be tagged RAF.Confidentiality.High.
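Expressed in the schema sketched above, that rule might look roughly like this; the field nesting and source values are assumptions drawn from the parameter descriptions rather than the verbatim built-in definition:

```json
{
  "classificationTag": { "name": "RAF.Confidentiality.High", "source": "curated" },
  "columnTags": [
    { "name": "DSF.Interpretation.Credentials.Secret", "source": "curated" }
  ]
}
```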
To translate this to your tags, replace the name and source values of the columnTags, neighborColumnTags, or tableTags with your own. This new example is for a Collibra tag that an organization uses for confidential data. The rule now states: apply the classification tag RAF.Confidentiality.High to a column if it has the collibra tag Confidential. Repeat this for your organization's remaining classification levels.
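A sketch of the rule after that substitution, with the same caveat that the exact field nesting is assumed:

```json
{
  "classificationTag": { "name": "RAF.Confidentiality.High", "source": "curated" },
  "columnTags": [
    { "name": "Confidential", "source": "collibra" }
  ]
}
```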
Find the name and source for your tags
If you do not know the name or source for your tags, you can list your tags using the Immuta API.
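A request along these lines is what you would send; the endpoint path and authentication header shown here are placeholders, so substitute the tag listing route and auth scheme from your version's Immuta API documentation.

```bash
# Sketch only: /tag and the bearer token header are placeholders; check your
# Immuta API reference for the exact tag listing endpoint and authentication.
curl -X GET \
  -H "Authorization: Bearer <your-api-token>" \
  "https://<your-immuta-host>/tag"
```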
This request will list all the tags in your Immuta environment, similar to this example response:
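The entries below are invented examples that show the shape of the values you need; the actual response will list your environment's tags, and the fields to copy into your rules are name and source.

```json
[
  { "name": "Confidential", "source": "collibra" },
  { "name": "Internal", "source": "collibra" }
]
```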
Requirement: Immuta permission GOVERNANCE
Once you have made all the customizations to the example framework, make the following request using the Immuta API, with your full customized framework as the payload.
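A sketch of that request is below, assuming the customized framework has been saved to a local file named framework.json; the endpoint path is a placeholder, so use the framework route from your version's Immuta API documentation.

```bash
# Sketch only: /frameworks is a placeholder path; check your Immuta API
# reference for the exact framework creation endpoint. framework.json holds
# your customized framework payload.
curl -X POST \
  -H "Authorization: Bearer <your-api-token>" \
  -H "Content-Type: application/json" \
  -d @framework.json \
  "https://<your-immuta-host>/frameworks"
```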
Your new framework will now be visible in the Immuta UI by navigating to the Classification section under Discover.
Of sensitive data discovery's pattern types, regex and dictionary patterns are competitive. This means that when assessing your data, if multiple patterns could match, only one of the competitive patterns will be chosen to tag the data. To better understand how Immuta executes this competition, read further.
Discover employs a three-phased competitive pattern analysis approach for sensitive data discovery (SDD):
: No data is moved, and Immuta checks the patterns against a sample of data from the table.
: Patterns that have less than a 90% match are filtered out.
: The remaining patterns are compared with one another to find the most specific pattern that qualifies and matches the sample.
In the end, competitive pattern analysis aims to find a single pattern for each column that best describes the data format.
In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the pattern has matched a value in the column) information for each active pattern. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active patterns over a row sample.
The sample size is decided based on the number of patterns and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary patterns being run in the framework, not the data size. The sample size dependence on the number of patterns is weak and will not exceed 13,000 rows.
In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer than the requested number of rows, when many of the column values are null, or because of technology-specific limitations.
Snowflake and Starburst (Trino): Discover implements native table sampling by row count.
Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with less than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
All platforms: Sampling from views can be significantly slower, and performance varies with the performance of the query that defines the view.
During the scoring phase, a machine inference is carried out among all qualified patterns, combining pattern-derived complexity information with hit rate information to determine which pattern best describes the sample data. This process prefers the more restrictive of two competing patterns since the ability to satisfy the more difficult-to-satisfy pattern itself serves as evidence that it is more likely. This phase ends by returning a single most likely pattern per the inference process.
Here are a set of regex patterns and a sample of data:
Patterns:
Pattern 1: [a-zA-Z0-9]{3} - Matches three-character strings made up of the letters a-z (lowercase or uppercase) or the digits 0-9.
Pattern 2: [a-c]{3} - Matches three-character strings made up of the lowercase letters a, b, and c.
Pattern 3: (a|b|d){3} - Matches three-character strings made up of the lowercase letters a, b, and d.
When qualifying the patterns, Pattern 1 and Pattern 3 both match 90% or more of the data. Pattern 2 does not, and is disqualified.
Then the qualified patterns are scored. Here, Pattern 1, despite matching 100% of the data, is unspecific and could match over 200,000 values. On the other hand, Pattern 3 matches only 90% of the data but is very specific, with just 27 possible values.
Therefore, with the specificity taken into account, Pattern 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
Dictionaries are considered patterns by Immuta and are part of the competitive process, while column-name regex patterns are not.
Scoring ties are rare but can occur if the same pattern is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return a pattern in the case of a tie.
Pattern complexity analysis is sensitive to the total number of strings a pattern accepts or, equivalently for dictionaries, the number of entries. Therefore, patterns that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.
To activate a framework using , see the .
For example, under HIPAA, a list of procedures a doctor performed is only considered protected health information (PHI) if it can be associated with the identity of patients. Since entity tagging operates on a single column-by-column basis, it cannot reason whether or not a column containing procedure codes merits classification as PHI. Therefore, entity tagging will not tag procedure codes as PHI. But classification tagging will tag it PHI if it detects patient identity information in the other columns of the table. This is an example that Immuta built-in frameworks can address out-of-the-box using the .
Classification tags are applied based on the Discovered tags from SDD or other tags on the data source. Classification tags contain additional metadata about each column, such as the source of the tag, the dimension, and the sensitivity level. This metadata is used in the framework rules and complex formulas that assign the sensitivity of queries visible in .
Frameworks are often built off of an interpretation of regulatory frameworks or standards, such as the US Health Insurance Portability and Accountability Act (HIPAA) and the PCI standard. However, organizations can also build frameworks that represent their internal business processes. When used in Immuta, they automate data tagging and provide, through the , information about what data you have immediately after it is registered in Immuta.
See the for more information about the frameworks Immuta provides out-of-the-box.
Quick data access control: Use Discover to identify and classify your data immediately after registration in Immuta. Then, off of those tags. This repeatable process will protect your data in its current state and whenever any new data sources are created. Automate the process further with ; schema monitoring allows you to register data just once. Then, Immuta will monitor your data environment for changes and, when found, update the data source in Immuta, update the tags on that data source, and then update user access based on your governance policies when changes happen.
Scale your data monitoring: Use Discover to identify and classify your data immediately after registration in Immuta. Then, view your data users' access to your sensitive and risky data through the .
Build data platform compliance: Use and customize the to identify and classify your data based on the industry practices and regulations your organization needs to abide by. The Immuta compliance frameworks are templates to provide a strong starting point for further customization to what matters to your organization. Once those frameworks are built, use them to classify your data immediately after data registration in Immuta.
Snowflake integration (If you are using Databricks, use the how-to below.)
Is the pattern incorrectly matching your data and irrelevant to your organization? .
.
.
so this column is correctly matched by SDD.
.
.
. Note that classification tags build off of other tags, so removing a single classification or Discovered tag can have trickle-down effects on the data source.
.
.
.
. Note that classification tags build off of other tags, so adding a single classification or Discovered tag can have trickle-down effects on the data source.
.
Tags can be edited on an individual basis for each data source. If broad changes to the classification framework are necessary to re-tag your data, use the .
For more information about these parameters see the .
During the qualification phase, patterns that do not agree with the data are disqualified. A pattern agrees with the data if the hit rate on the remote sample exceeds the predefined threshold. This threshold is a 90% match for most built-in patterns; however, two built-in patterns have a lower threshold. The 90% threshold is standard for all custom patterns as well, to ensure the pattern matches the data within the column and to avoid false positives. If no patterns qualify, then no pattern is assessed for scoring and the column is not tagged.
Requested sample size by number of regex and dictionary patterns:
5 patterns: 7,369 rows
50 patterns: 9,211 rows
500 patterns: 11,053 rows
5,000 patterns: 12,895 rows
Sample data for the competitive pattern analysis example above (10 values): dad, baa, add, add, cab, bad, aba, baa, dad, baa.