Sensitive Data Discovery

Deprecation notice

Support for this feature has been deprecated.

Sensitive data discovery (SDD) is an Immuta feature that uses sensitive data patterns to determine what type of data each column represents. Using identification rules and data samples from your tables, Immuta recognizes the type of data in each column and can apply the appropriate tags in your data dictionary. This saves you from identifying your data manually and gives you a standard taxonomy across all your data sources in Immuta.

Architecture

SDD samples data from each table and checks the sample against templates composed of built-in or custom identifiers. If an identifier's pattern matches a column of the sampled data with sufficient confidence, the corresponding tag is applied to that column to indicate the type of data it contains.

SDD queries a small sample of data for each data source in Immuta. This sample is temporarily held in memory to check for identifier matches. Then Immuta applies the relevant tags to those columns where matches were found.

This sampling and tagging process will happen anytime SDD is run. SDD can be triggered through the Immuta CLI, through the API, or in the Immuta UI on the data sources overview page. SDD will also run automatically anytime one of the following events occurs:

A new data source is created.
Schema detection is enabled and a new data source is detected.
Column detection is enabled and new columns are detected. In this case, SDD runs only on the new columns, and no existing tags are removed or changed.

Components

Sensitive data discovery (SDD) comprises two major elements: identifiers and templates.

Identifier

The identifier is the basic building block of SDD. Each identifier in Immuta consists of a unique pattern (e.g., a regex or a list of values) and a list of tags to apply to data that matches the pattern. When Immuta recognizes that pattern, it can infer the type of data and tag the column accordingly. For example, Immuta has the built-in identifier US_SOCIAL_SECURITY_NUMBER. Immuta uses a regex to look for strings of exactly nine digits, with or without hyphens after the third and fifth digits, with a leading digit between 0 and 8. SDD then scores each column by the percentage of sampled values that match the pattern. This score determines whether the configured tags are applied to the column. Once SDD finds a column that fits the expected pattern of US_SOCIAL_SECURITY_NUMBER with a sufficient match score, it applies the identifier's tags to that column.
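The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Immuta's implementation: the regex is written from the prose description rather than copied from the built-in identifier, and the names SSN_PATTERN, match_score, and should_tag are hypothetical.

```python
import re

# Illustrative pattern written from the description above; it is NOT the
# regex Immuta ships with US_SOCIAL_SECURITY_NUMBER.
# Nine digits, optional hyphens after the third and fifth digits,
# leading digit between 0 and 8.
SSN_PATTERN = re.compile(r"[0-8]\d{2}-?\d{2}-?\d{4}")


def match_score(values, pattern):
    """Return the fraction of sampled values that fully match the pattern."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if pattern.fullmatch(str(v)))
    return hits / len(values)


def should_tag(values, pattern, min_confidence):
    """Apply tags only when the score reaches the confidence threshold."""
    return match_score(values, pattern) >= min_confidence


# A hypothetical column sample: two values match, two do not.
sample = ["123-45-6789", "078051120", "912-34-5678", "n/a"]
score = match_score(sample, SSN_PATTERN)
```

Here the third value is rejected because its leading digit is 9, and the fourth is not numeric, so half of the sample matches. Whether the column is tagged then depends on the threshold passed to should_tag.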

There are two types of identifiers:

Built-in identifier: These identifiers are included with Immuta and discover common categories of data (such as social security numbers, zip codes, and routing numbers). They cannot be modified. Users can list built-in identifiers through the Immuta API or view the Built-in identifiers reference page.
Custom identifier: Custom identifiers allow data governors to create their own regular expressions, dictionaries, and tags that SDD will use to discover and tag data.

By default, all identifiers are matched against data sources when SDD is triggered, unless a template is applied to a data source.

Supported identifier types

Identifiers match data in one of three ways, described below:

Regex identifier: This identifier contains a case-insensitive regular expression that allows users to match a custom regex against column values.
Column name regex identifier: This identifier includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.
Dictionary identifier: This identifier contains a list of words and phrases to match against column values.
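To make the difference between the three identifier types concrete, the sketch below models each one as a small Python function. The function names are hypothetical, and two details are assumptions rather than documented behavior: values are required to match the regex in full, and dictionary entries are compared exactly.

```python
import re


def regex_identifier_hits(values, regex):
    """Regex identifier: case-insensitive match against column VALUES.
    Requiring a full match is an assumption made for this sketch."""
    pattern = re.compile(regex, re.IGNORECASE)
    return [v for v in values if pattern.fullmatch(v)]


def column_name_matches(column_name, regex):
    """Column name regex identifier: case-insensitive match against the
    column NAME only; the values in the column are never inspected."""
    return re.search(regex, column_name, re.IGNORECASE) is not None


def dictionary_hits(values, entries):
    """Dictionary identifier: match column values against a list of words
    and phrases. Exact comparison is an assumption made for this sketch."""
    lookup = set(entries)
    return [v for v in values if v in lookup]


# Hypothetical inputs for illustration.
account_ids = ["ab-1234", "AB-5678", "zz-0000x"]
regex_hits = regex_identifier_hits(account_ids, r"ab-\d{4}")
name_hit = column_name_matches("Customer_EMAIL", r"e-?mail")
dict_hits = dictionary_hits(["red", "Blue", "green"], ["red", "green"])
```

The first two functions differ only in what they look at: one reads the sampled values, the other reads nothing but the column name.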

Templates

A template is a collection of identifiers and settings that drive the configuration of SDD runs. The settings users can apply through templates include the following:

classifiers is the list of identifiers applied to data sources in the SDD run.
tags is an optional override for the tags applied by the identifiers.
minConfidence is an optional override for the minConfidence established in the identifier(s). When the detection confidence is at least the percentage defined in minConfidence, tags are applied.
sampleSize is an optional override for how many records to sample from the data source.

Users may apply a template globally or to a specific set of data sources. When SDD is triggered on a data source, it will use the identifiers and settings in its configured template to run the detection job. If no template has been configured, SDD will use the global settings. By default, the global settings will use all identifiers in the system to run the detection.
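The override and fallback behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not the shape of the Immuta API: the Identifier and Template classes, the snake_case field names, and the 0.0 to 1.0 confidence scale are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Identifier:
    name: str
    tags: List[str]
    min_confidence: float  # assumed scale: 0.0 to 1.0


@dataclass
class Template:
    classifiers: List[str]                  # names of identifiers to run
    tags: Optional[List[str]] = None        # optional override
    min_confidence: Optional[float] = None  # optional override
    sample_size: Optional[int] = None       # optional override


def identifiers_for_run(all_identifiers, template=None):
    """With no template, fall back to every identifier in the system."""
    if template is None:
        return list(all_identifiers)
    wanted = set(template.classifiers)
    return [i for i in all_identifiers if i.name in wanted]


def effective_settings(identifier, template=None):
    """Template values, when present, override the identifier's own."""
    tags = identifier.tags
    min_confidence = identifier.min_confidence
    if template is not None:
        if template.tags is not None:
            tags = template.tags
        if template.min_confidence is not None:
            min_confidence = template.min_confidence
    return tags, min_confidence


# Hypothetical identifiers and template.
ssn = Identifier("US_SOCIAL_SECURITY_NUMBER", ["Discovered.SSN"], 0.8)
account = Identifier("ACCOUNT_ID", ["Discovered.Account"], 0.6)
template = Template(classifiers=["ACCOUNT_ID"], min_confidence=0.9)

selected = identifiers_for_run([ssn, account], template)
tags, threshold = effective_settings(account, template)
```

With this template, only ACCOUNT_ID runs, its tags are left as the identifier defines them, and its threshold is raised to the template's value. Passing no template selects both identifiers with their own settings.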

Considerations

SDD does not run on data sources with over 1600 columns.
Deleting the built-in Discovered tags is not recommended. If you delete a built-in Discovered tag and then run SDD, columns that match the corresponding identifier will not be tagged. Instead, disable tags on a column-by-column basis from the data dictionary, or turn SDD off for individual data sources when creating them.

Configure and customize SDD

To configure settings and customize SDD, see the SDD pre-configuration page.