Data Discovery
Sensitive data discovery (SDD) is an Immuta feature that uses data patterns to determine what type of data your column represents. Using rules and data samples from your tables, Immuta matches your data against those patterns and assigns the appropriate tags to your data dictionary. This saves you the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
Architecture
SDD works by checking a sample of data from each table against a framework composed of rules. If a rule's pattern matches a column of the sampled data with the appropriate amount of confidence, the resulting tag is applied to that column, signifying the data it contains. Data can be sampled in one of two ways:
- Native SDD (public preview) generates a query using data source connection information and the identification framework that it sends to Snowflake or Databricks. Immuta receives a query result containing the column name and matching rule, and Immuta uses those results to apply the resulting tags to the appropriate columns.
- Non-native SDD queries a small sample of data for each data source in Immuta. This sample is temporarily held in memory to check for rule matches. Then Immuta applies the resulting tags to those columns where matches were found.
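As a rough mental model of this flow (sample values, check each rule's pattern, collect the resulting tags), the Python sketch below is purely illustrative: the `Rule` class and `tags_for_column` function are hypothetical names and do not correspond to Immuta's implementation or API.

```python
import re
from dataclasses import dataclass, field

# Purely illustrative model of the SDD flow described above; these names are
# hypothetical and do not correspond to Immuta's implementation or API.

@dataclass
class Rule:
    pattern: str                 # regex checked against sampled column values
    resulting_tags: list = field(default_factory=list)
    min_confidence: float = 0.9  # fraction of sampled values that must match

def tags_for_column(sampled_values, framework_rules):
    """Return the resulting tags of every rule whose pattern matches the sample."""
    tags = []
    for rule in framework_rules:
        regex = re.compile(rule.pattern, re.IGNORECASE)
        matched = sum(1 for value in sampled_values if regex.fullmatch(str(value)))
        if sampled_values and matched / len(sampled_values) >= rule.min_confidence:
            tags.extend(rule.resulting_tags)
    return tags
```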
This sampling and tagging process occurs when SDD runs, which happens automatically after any of the following events:
- A new data source is created.
- Schema detection is enabled and a new data source is detected.
- Column detection is enabled and new columns are detected. In this case, SDD runs only on the new columns, and no existing tags are removed or changed.
Users can also manually trigger SDD to run from a data source's overview page.
Components
Sensitive data discovery (SDD) runs frameworks to discover data. A framework is a collection of rules, and each rule contains a single criteria and the resulting tags that are applied when the criteria's conditions have been met. See the sections below for more information on each component.
Identification framework
An identification framework is a collection of rules, each of which looks for a particular criteria and tags any column where those conditions are met. While organizations can have multiple frameworks, only one may be applied to each data source.
For a how-to on the framework actions users can take within the UI, see the Manage frameworks page.
Global framework
Each organization has a single global framework that will apply to all the data sources in Immuta. However, users can bypass this global framework and apply a specific framework to a set of data sources using the API.
An active global framework cannot be deleted.
Rule
The rule is the basic building block of SDD. Each rule pairs a criteria with the resulting tags to apply to data that matches that criteria. When Immuta recognizes the criteria, it tags the data to describe its type.
Users can use the API to create their own rules with unique patterns to find their specific data. Once a rule has been created, it can be enabled or disabled on the frameworks page.
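As a loose illustration of what a custom rule pairs together, the Python dictionary below sketches one possible shape. It is not the actual request body the Immuta API expects: the field names, criteria type, and tag value are assumptions, and only `minConfidence` and `sampleSize` correspond to attributes mentioned later on this page.

```python
# Hypothetical representation of a custom SDD rule; an illustration of the
# concept only, not the request body the Immuta API expects.
custom_rule = {
    "name": "EMPLOYEE_ID",                       # hypothetical rule name
    "criteria": {
        "type": "competitive pattern analysis",  # assumed label for the criteria type
        "pattern": r"EMP-\d{6}",                 # regex matched against column values
        "minConfidence": 0.9,                    # ignored by native SDD, still required otherwise
        "sampleSize": 1000,                      # ignored by native SDD, still required otherwise
    },
    "resultingTags": ["Discovered.Entity.Employee ID"],  # hypothetical tag
}
```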
Criteria
Criteria are the conditions that need to be met for resulting tags to be applied to data.
Supported criteria types
- Competitive pattern analysis: This criteria reviews all the regex and dictionary patterns within the rules of the framework and searches for the pattern that best fits the column's values. The resulting tags of that pattern's rule are then applied to the column.
- Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.
Pattern
A pattern defines the type of data Immuta looks for when deciding whether a column meets the requirements to be tagged. Immuta comes with built-in patterns to discover common categories of data. These patterns cannot be modified and are used within preset rules with preset tags. Users can use the API to create their own rules with unique patterns to find their specific data.
Supported pattern types
The three types of patterns are described below:
- Regex: This pattern contains a case-insensitive regular expression that searches for matches against column values.
- Column name: This pattern includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.
- Dictionary: This pattern contains a list of words and phrases to match against column values.
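The short Python sketch below contrasts where each pattern type is evaluated: regex and dictionary patterns against column values, and column name patterns against the column name only. The sample data and expressions are made up for illustration and are not Immuta code.

```python
import re

# Conceptual illustration (not Immuta code) of where each pattern type applies.
column_name = "billing_email"
column_values = ["alice@example.com", "bob@example.com", "n/a"]

# Regex pattern: matched (case-insensitively) against column VALUES.
value_regex = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+", re.IGNORECASE)
value_hits = [v for v in column_values if value_regex.fullmatch(v)]

# Column name pattern: matched (case-insensitively) against the column NAME only.
name_regex = re.compile(r".*email.*", re.IGNORECASE)
name_match = bool(name_regex.fullmatch(column_name))

# Dictionary pattern: a list of words and phrases matched against column VALUES.
dictionary = {"n/a", "unknown", "none"}
dictionary_hits = [v for v in column_values if v.lower() in dictionary]

print(value_hits, name_match, dictionary_hits)
# ['alice@example.com', 'bob@example.com'] True ['n/a']
```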
Built-in pattern example
Immuta comes with a rule that searches for U.S. Social Security numbers: US_SOCIAL_SECURITY_NUMBER. This rule uses a competitive pattern analysis criteria with a built-in regex pattern that searches for strings of exactly nine digits, with or without hyphens after the third and fifth digits, with a leading digit between 0 and 8. Each column is given a score of the percentage of values that match the defined regex pattern. If this score is both above the minimum confidence and the highest confidence of all the other competitive pattern analysis criteria within the framework, then the resulting tags of the rule will be applied to the column:
- Discovered.PII
- Discovered.Identifier Direct
- Discovered.Country.US
- Discovered.PHI
- Discovered.Entity.Social Security Number
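To make the scoring concrete, the Python sketch below applies an assumed regex for the nine-digit format described above and compares the resulting score to a minimum confidence. The exact built-in pattern and threshold are internal to Immuta, so treat the regex and numbers here as illustrative only.

```python
import re

# Illustrative only: the built-in pattern and minimum confidence are internal to
# Immuta. This regex is an assumption written from the description above: exactly
# nine digits, optional hyphens after the third and fifth digits, leading digit 0-8.
SSN_REGEX = re.compile(r"[0-8]\d{2}-?\d{2}-?\d{4}")
MIN_CONFIDENCE = 0.9  # assumed threshold, for the sake of the example

def score(values, regex):
    """Fraction of sampled values that fully match the pattern."""
    return sum(1 for v in values if regex.fullmatch(v)) / len(values)

sampled_column = ["123-45-6789", "234567890", "987-65-4321", "345-67-8901"]
competing_patterns = {"US_SOCIAL_SECURITY_NUMBER": SSN_REGEX}  # other rules' patterns omitted

scores = {name: score(sampled_column, rx) for name, rx in competing_patterns.items()}
best_name, best_score = max(scores.items(), key=lambda item: item[1])

# Tags are applied only if the best score clears the minimum confidence and beats
# every other competitive pattern analysis criteria in the framework.
print(best_name, best_score, best_score >= MIN_CONFIDENCE)
# US_SOCIAL_SECURITY_NUMBER 0.75 False  ("987-65-4321" starts with 9, so it is rejected)
```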
Tag mutability
When SDD is triggered by a data owner on the data source overview or through the API, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns.
Native SDD for Snowflake and Databricks
Public preview
This feature is available to all accounts. To enable native SDD, reach out to your account manager to request that the feature be enabled. After the feature has been added, enable SDD from the Immuta app settings page.
Requirements
- Immuta SaaS instance
- Snowflake or Databricks integration
Performance
Native SDD offers a significant performance improvement over non-native SDD. The amount of time it takes to identify data depends on the number of text columns in the data source and the number of rules in the framework. The number of rows has little impact on the time because data sampling has near-constant performance. However, views perform significantly worse due to extra query compilation time.
The time it takes to run SDD for all newly onboarded data sources in Immuta is not limited by native SDD performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.
Considerations
Deleting the built-in Discovered tags is not recommended: if you delete built-in Discovered tags and use SDD, columns will not be tagged when their patterns are matched. As an alternative, tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.
Native SDD
- Limitations with custom regex patterns:
  - Custom regex patterns are case sensitive.
  - Custom regex patterns are only supported on columns with the data type string.
- Limitations with custom dictionary patterns:
  - Immuta compiles custom dictionary patterns into a regex that is sent in the body of a query. For Snowflake, the size of the dictionary is limited by the overall query text size limit in Snowflake of 1 MB (see the sketch after this list).
  - Custom dictionary patterns are only supported on columns with the data type string.
- For native SDD for Databricks, Immuta will start up a Databricks cluster to complete the SDD job if one is not already running. This can cause unnecessary cost if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.
- Native SDD for Databricks only checks for rules on columns with the data type string.
- The attribute minConfidence is not supported for native SDD, but it is still required for any data sources from data providers other than Snowflake and Databricks. While it should be included in any custom patterns you create, it will be ignored for native SDD.
- The attribute sampleSize is not supported for native SDD because Immuta calculates the optimal sample size for the data and patterns you have in your framework. However, it is still required for any data sources from data providers other than Snowflake and Databricks. While it should be included in any custom patterns you create, it will be ignored for native SDD.
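Because custom dictionary patterns are compiled into a regex that travels inside the query text, a rough way to sanity-check a large dictionary against Snowflake's 1 MB query text limit is sketched below. The compilation shown (a simple escaped alternation) is an assumption about how such a regex could be built, not Immuta's actual implementation.

```python
import re

# Assumed, simplified compilation of a dictionary pattern into a single regex
# alternation; Immuta's actual compilation may differ. Useful only to estimate
# whether a large dictionary could approach Snowflake's 1 MB query text limit.
SNOWFLAKE_QUERY_TEXT_LIMIT = 1_000_000  # roughly 1 MB, in bytes

def compile_dictionary(words):
    """Join escaped dictionary entries into one alternation regex."""
    return "|".join(re.escape(w) for w in words)

dictionary = ["visa", "mastercard", "american express", "discover"]
pattern = compile_dictionary(dictionary)

estimated_size = len(pattern.encode("utf-8"))
if estimated_size >= SNOWFLAKE_QUERY_TEXT_LIMIT:
    print(f"Dictionary regex is {estimated_size} bytes; it may exceed the query text limit.")
else:
    print(f"Dictionary regex is {estimated_size} bytes; well within the 1 MB limit.")
```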
Migrating from non-native to native SDD
These limitations are only relevant to users who have previously enabled and run Immuta SDD.
If you had non-native SDD enabled, running native SDD can result in different resulting tags being applied because native SDD is more accurate and has fewer false positives than non-native SDD. Running a new SDD scan against a table will remove any old resulting tags on that table that are not present in the new scan and put a new set of resulting tags in place. This could have some impacts:
- If policies are built around Discovered tags, there could be a change in access as a result of tag changes.
- If you are using Immuta Detect, changing the applied tags will result in a change to the classification results that drive Detect events and dashboards.
Non-native SDD
- Non-native SDD does not run on data sources with over 1600 columns.
- There is no UI available for non-native SDD. All actions must be completed using the API or CLI.
Configure and customize SDD
To understand the configuration settings and customize SDD options using the API, see the Data discovery customization page.