Configure Sensitive Data Discovery (Public Preview)
Audience: Data Owners and Data Governors
Content Summary: This feature allows users to customize how sensitive data is detected and what tags are applied to that data.
Sensitive Data Discovery (SDD) comprises two major elements:
Classifiers: The classifier is the basic building block of SDD. Essentially, a classifier includes a pattern (e.g., a regex or a list of values) and a list of tags to apply to data that matches that pattern. For example, if a column sample matches a regex defined in a classifier, then all the tags in that classifier will be applied to that column. By default, all classifiers are matched against data sources when SDD is triggered, unless a template (defined below) is applied to a data source. Details about types of classifiers are provided below.
Templates: A template is a collection of classifiers and settings that are used to drive the configuration of SDD runs. Users may apply a template globally or to a specific set of data sources.
When SDD is triggered on a data source, it will use the classifiers and settings in its configured template to run the detection job. If no template has been configured, SDD will use the global settings, described below.
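The relationship between a classifier's pattern and its tags can be illustrated with a small Python sketch. This is a conceptual model only, not Immuta's implementation: the `Classifier` class, the `detect_tags` function, the tag names, and the rule that a single matching sample value counts as a match are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Classifier:
    """Hypothetical model: a pattern plus the tags applied on a match."""
    name: str
    pattern: str            # regex matched against sampled column values
    tags: tuple[str, ...]   # every tag listed here is applied on a match


def detect_tags(sample: list[str], classifiers: list[Classifier]) -> set[str]:
    """Return the union of tags from every classifier that matches the sample.

    Assumption: a classifier matches when any sampled value fully matches
    its regex. The real matching criteria may differ.
    """
    tags: set[str] = set()
    for classifier in classifiers:
        regex = re.compile(classifier.pattern)
        if any(regex.fullmatch(value) for value in sample):
            tags.update(classifier.tags)
    return tags


# Hypothetical classifiers and tag names.
ssn = Classifier("ssn", r"\d{3}-\d{2}-\d{4}", ("Discovered.PII", "Discovered.SSN"))
zip_code = Classifier("zip", r"\d{5}", ("Discovered.Zip",))

print(sorted(detect_tags(["123-45-6789", "n/a"], [ssn, zip_code])))
# -> ['Discovered.PII', 'Discovered.SSN']
```

Note that both tags of the matching classifier are returned together, mirroring the rule that all tags in a classifier are applied when its pattern matches.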
There are two types of classifiers:
Built-in Classifiers: These classifiers are included with Immuta and detect common categories of sensitive data (such as social security numbers, zip codes, and routing numbers). They cannot be modified. Users can list built-in classifiers through the Immuta API or view the Built-In Classifiers Reference page.
Custom Classifiers: Custom classifiers allow Data Governors to create their own regular expressions, dictionaries, and tags that SDD will use to detect sensitive data.
Custom Classifier Types
The three types of custom classifiers are described in the table below.
| Custom Classifier Type | Definition | Use Case |
| --- | --- | --- |
| Regex Classifier | This classifier contains a case-insensitive regular expression that allows users to match a custom regex against column values. | If the built-in classifiers do not contain a regex that could match against values within your data sources, use this classifier to create your own regex. See Create a Regex Classifier for a specific use case example. |
| Column Name Regex Classifier | This classifier includes a case-insensitive regular expression that is only matched against column names, not against the values in the column. | If a column name clearly denotes that it contains sensitive data, you could create this classifier to match the regex against the name of columns instead of the column values. See Create a Column Name Regex Classifier for a specific use case example. |
| Dictionary Classifier | This classifier contains a list of words and phrases to match against column values. | Create a dictionary classifier if there are words or phrases included in your datasets that may be sensitive, but will not be detected by the built-in classifiers. See Create a Dictionary Classifier for a specific use case example. |
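The difference between the three custom classifier types comes down to what is matched and against what. The Python sketch below models each type as a function of a column name and its sampled values. All function names and example patterns are hypothetical, and two behaviors are assumptions: a single matching value is treated as a match, and dictionary entries are compared exactly and case-sensitively, since the table does not specify how dictionary comparison works.

```python
import re


def regex_classifier(pattern: str):
    """Match a case-insensitive regex against column values."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda column_name, values: any(regex.search(v) for v in values)


def column_name_regex_classifier(pattern: str):
    """Match a case-insensitive regex against the column name only."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda column_name, values: regex.search(column_name) is not None


def dictionary_classifier(entries: list[str]):
    """Match column values against a fixed list of words and phrases.

    Assumption: exact, case-sensitive comparison in this sketch.
    """
    lookup = set(entries)
    return lambda column_name, values: any(v in lookup for v in values)


# Hypothetical examples of each type.
account_id = regex_classifier(r"^acct-\d{6}$")
salary_col = column_name_regex_classifier(r"salary|wage")
project = dictionary_classifier(["Project Falcon", "Project Osprey"])

print(account_id("id", ["ACCT-001234"]))        # True: the regex ignores case
print(salary_col("Annual_Salary", ["52000"]))   # True: only the name is checked
print(project("notes", ["Project Osprey"]))     # True: value is in the dictionary
print(project("notes", ["project osprey"]))     # False: exact-case lookup here
```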
SDD Global Settings
When SDD is triggered on a data source that has a template applied, the classifiers in that template run the detection job. Data sources without a template use the classifiers or template defined in the global settings. By default, the global settings use all classifiers in the system to run the detection. However, a System Administrator can configure Immuta to use a global template instead. While a template is set as the global template, users cannot delete it.
The global template can be updated on the App Settings page in the Advanced Configuration section:
fingerprints:
  classification:
    globalTemplate: MY_GLOBAL_TEMPLATE_NAME # or use null to restore default behavior
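The order of precedence described in this section (a data source's own template first, then the global template, then all classifiers) can be sketched in Python as follows. The function name, its parameters, and the template and classifier names are hypothetical and only illustrate the selection logic; they are not part of the Immuta API.

```python
from typing import Optional


def classifiers_for_run(
    data_source_template: Optional[str],
    global_template: Optional[str],
    templates: dict[str, list[str]],
    all_classifiers: list[str],
) -> list[str]:
    """Pick which classifiers a detection run would use."""
    if data_source_template is not None:
        # A template applied to the data source takes priority.
        return templates[data_source_template]
    if global_template is not None:
        # Otherwise fall back to the global template, if one is configured.
        return templates[global_template]
    # Default behavior (globalTemplate is null): use every classifier.
    return all_classifiers


# Hypothetical templates and classifier names.
templates = {"finance": ["ssn", "routing_number"], "minimal": ["ssn"]}
everything = ["ssn", "zip", "routing_number", "email"]

print(classifiers_for_run("finance", "minimal", templates, everything))
# -> ['ssn', 'routing_number']
print(classifiers_for_run(None, "minimal", templates, everything))
# -> ['ssn']
print(classifiers_for_run(None, None, templates, everything))
# -> ['ssn', 'zip', 'routing_number', 'email']
```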
SDD runs automatically when users create a new data source or when a new column is detected through schema monitoring, but users can also trigger SDD in the Immuta UI, through the Immuta CLI, or through the API.
Users can also configure SDD to perform a dryRun, which allows them to see what tags would be applied to a data source without actually applying them. See the Run Sensitive Data Discovery on Data Sources tutorial.
When a Data Owner triggers SDD, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, when SDD is triggered because schema monitoring detects a new column, no tags are modified on existing columns.
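The tagging rules above, together with the dryRun behavior, can be summarized in a short Python sketch. It tracks only the tags applied by SDD for each column. The function name, the trigger labels, and the tag names are hypothetical, and the sketch assumes that `previous` contains an entry (possibly empty) for every existing column.

```python
def next_sdd_tags(
    previous: dict[str, set[str]],
    detected: dict[str, set[str]],
    trigger: str,
    dry_run: bool = False,
) -> dict[str, set[str]]:
    """Return the SDD-applied tags per column after a run.

    trigger is "data_owner" or "schema_monitoring" (hypothetical labels).
    """
    if dry_run:
        # A dry run only reports `detected`; stored tags stay unchanged.
        return {col: set(tags) for col, tags in previous.items()}
    if trigger == "data_owner":
        # Previous SDD tags are removed and replaced by the latest results.
        columns = previous.keys() | detected.keys()
        return {col: set(detected.get(col, ())) for col in columns}
    if trigger == "schema_monitoring":
        # Existing columns keep their tags; only new columns are tagged.
        result = {col: set(tags) for col, tags in previous.items()}
        for col, tags in detected.items():
            if col not in previous:
                result[col] = set(tags)
        return result
    raise ValueError(f"unknown trigger: {trigger}")


# Hypothetical column and tag names.
previous = {"email": {"Discovered.Email"}, "notes": set()}
detected = {"email": {"Discovered.PII"}, "phone": {"Discovered.Phone"}}

owner_run = next_sdd_tags(previous, detected, "data_owner")
monitor_run = next_sdd_tags(previous, detected, "schema_monitoring")
preview = next_sdd_tags(previous, detected, "data_owner", dry_run=True)

print(owner_run["email"])    # {'Discovered.PII'}: old SDD tag replaced
print(monitor_run["email"])  # {'Discovered.Email'}: existing column untouched
print(monitor_run["phone"])  # {'Discovered.Phone'}: new column tagged
print(preview == previous)   # True: a dry run changes nothing
```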
Two common workflows for using SDD are outlined below. The first illustrates how to apply a single global template to all data sources, while the second outlines how users can create and apply templates to data sources they own.
Workflow 1: Apply a Global Template to All Data Sources
- Data Governor creates a template using one or more built-in or custom classifiers.
- System Administrator adds this template to the global settings so that it applies to all data sources.
- Users trigger SDD on data sources.
Workflow 2: Apply a Template to a Specific Data Source
- Data Governor creates one or more custom classifiers.
- Data Owner creates a template containing one or more classifiers.
- Data Owner applies their template to one or more data sources.
- Data Owner triggers SDD on one or more data sources, and tags are applied to columns where sensitive data was detected.