Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Select the checkbox to enable SDD, and then click Save and Confirm to apply your changes.
Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Enter the name of your global template in the Global SDD Template Name field.
Click Save, and then Confirm your changes.
When a sample size is not specified in a template, SDD uses the default sample size of 1000 records. To adjust the default sample size:
Click the App Settings icon in the left sidebar.
Click Sensitive Data Discovery in the left panel to navigate to that section.
Enter the number of rows in a data source you would like sampled when running SDD in the Default SDD Sample Size field.
Click Save, and then Confirm your changes.
Only application admins can enable sensitive data discovery (SDD) on the Immuta app settings page. Then, data source creators can disable SDD on a data-source-by-data-source basis. Additionally, governors, data source owners, and data source experts can disable any unwanted Discovered tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
When SDD is triggered on a data source, the job is run for the identifiers within the set template. If a template is not set, the identifier and template within the SDD job are defined by the global setting. By default, the global setting will run for all identifiers in the system. However, a system administrator can configure Immuta to use a custom global template instead.
An active global template cannot be deleted.
SDD uses a sample of data to assess the likelihood that a column contains data that fits the pattern specified in the configured identifiers.
The default for SDD is to sample 1000 records (the sample size) during this process. However, administrators can configure the sample size taken by SDD on the Immuta app settings page. In general, increasing the sample size increases the accuracy of SDD predictions, but decreasing the number of records sampled during SDD may be necessary to meet some organizations' compliance requirements.
When SDD is triggered by a data owner, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because a new column is detected by schema monitoring, tags will only be applied to the new column, and no tags will be modified on existing columns.
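The difference between the two triggers can be sketched as follows. This is a simplified illustration rather than Immuta's implementation: it assumes SDD-applied tags can be recognized by the Discovered. prefix, and the function names are hypothetical.

```python
SDD_PREFIX = "Discovered."  # assumption: SDD-applied tags are recognized by this prefix


def rerun_by_owner(current, latest):
    """Owner-triggered run: drop previously applied SDD tags, then apply the latest result."""
    updated = {}
    for column, tags in current.items():
        kept = {t for t in tags if not t.startswith(SDD_PREFIX)}
        updated[column] = kept | set(latest.get(column, ()))
    return updated


def run_for_new_columns(current, latest, new_columns):
    """Schema-monitoring run: tag only the new columns and leave existing columns untouched."""
    updated = {column: set(tags) for column, tags in current.items()}
    for column in new_columns:
        updated[column] = set(latest.get(column, ()))
    return updated
```

In the first function, a manually applied tag such as Finance survives the run, while stale Discovered tags are replaced. In the second, existing columns keep every tag they already had.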
Users can also configure SDD to perform a dryRun, which allows them to see what tags would be applied to a data source without actually applying them. See the Run sensitive data discovery on data sources page for details.
Two common workflows for using SDD are outlined below. The first illustrates how to apply a single global template to all data sources, while the second outlines how users can create and apply templates to data sources they own.
Data governor creates a template using one or more built-in or custom identifiers.
Data governor creates one or more custom identifiers:
Deprecation notice
Support for this feature has been deprecated.
Sensitive data discovery (SDD) is an Immuta feature that uses sensitive data patterns to determine what type of data your column represents. Using identification rules and data samples from your tables, Immuta matches your data and can assign the appropriate tags to your data dictionary. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.
SDD works by looking at a sample of data from each table that it checks against templates compiled of built-in or customized identifiers. If an identifier's pattern is matched with a column of the sampled data with an appropriate amount of confidence, then the corresponding tag is applied to that column, signifying the data it contains.
SDD queries a small sample of data for each data source in Immuta. This sample is temporarily held in memory to check for identifier matches. Then Immuta applies the relevant tags to those columns where matches were found.
This sampling and tagging process will happen anytime SDD is run. SDD can be triggered through the Immuta CLI, through the API, or in the Immuta UI on the data sources overview page. SDD will also run automatically anytime one of the following events occurs:
A new data source is created.
Schema detection is enabled and a new data source is detected.
Column detection is enabled and new columns are detected. Here, SDD will only run on new columns and no existing tags will be removed or changed.
Sensitive data discovery (SDD) comprises two major elements: identifiers and templates.
The identifier is the basic building block of SDD. Each identifier in Immuta is a unique pattern (e.g., a regex or a list of values) and a list of tags to apply to data that matches the pattern. When Immuta recognizes that pattern, it can understand the type of data and tag the data to describe the type. For example, Immuta has the built-in identifier US_SOCIAL_SECURITY_NUMBER. Immuta will use a regex to look for strings of exactly nine digits, with or without hyphens after the third and fifth digits, with a leading digit between 0 and 8. SDD then scores columns by the percentage of values that match the pattern defined. This score determines whether or not the configured tags will be applied to a column. Once it finds a column that fits the expected pattern of US_SOCIAL_SECURITY_NUMBER with a reasonable match score, it will know how to tag it.
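The scoring step described above can be illustrated with a short sketch. The regex is written from the prose description only and is not Immuta's actual built-in pattern. The function names, the tag name, and the use of a 0 to 1 fraction for the confidence threshold are assumptions made for illustration.

```python
import re

# Nine digits, optional hyphens after the third and fifth digits, leading digit 0-8.
# Assumption: written from the prose description, not copied from Immuta.
SSN_PATTERN = re.compile(r"[0-8]\d{2}-?\d{2}-?\d{4}")


def match_score(values, pattern):
    """Return the fraction of non-null sampled values that fully match the pattern."""
    values = [v for v in values if v is not None]
    if not values:
        return 0.0
    hits = sum(1 for v in values if pattern.fullmatch(str(v)))
    return hits / len(values)


def tags_for_column(values, pattern, tags, min_confidence):
    """Return the identifier's tags only when the score reaches the threshold."""
    return list(tags) if match_score(values, pattern) >= min_confidence else []


sample = ["123-45-6789", "078051120", "912-34-5678", "hello"]
score = match_score(sample, SSN_PATTERN)  # two of the four values match
```

With this sample, the third value is rejected because of its leading 9, so a threshold of 0.5 would apply the tags and a threshold of 0.6 would not.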
There are two types of identifiers:
Built-in identifier: These identifiers are included with Immuta and discover common categories of data (such as social security numbers, zip codes, and routing numbers). They cannot be modified. Users can list built-in identifiers through the Immuta API or view the Built-in identifiers reference page.
Custom identifier: Custom identifiers allow data governors to create their own regular expressions, dictionaries, and tags that SDD will use to discover and tag data.
By default, all identifiers are matched against data sources when SDD is triggered, unless a template is applied to a data source.
The three types of custom identifiers are described below:
Regex identifier: This identifier contains a case-insensitive regular expression that allows users to match a custom regex against column values.
Column name regex identifier: This identifier includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.
Dictionary identifier: This identifier contains a list of words and phrases to match against column values.
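A rough sketch of how the three custom identifier types differ in what they inspect is shown below. The function names are hypothetical, and whether Immuta matches whole values or substrings is not stated here, so the full-match and search choices are assumptions.

```python
import re


def regex_identifier_hits(values, pattern):
    """Regex identifier: count column values matching a case-insensitive regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for v in values if rx.fullmatch(v))


def column_name_regex_matches(column_name, pattern):
    """Column name regex identifier: test the column name only, never the values."""
    return re.search(pattern, column_name, re.IGNORECASE) is not None


def dictionary_identifier_hits(values, words, case_sensitive=False):
    """Dictionary identifier: count column values that appear in a list of words."""
    if case_sensitive:
        lookup = set(words)
        return sum(1 for v in values if v in lookup)
    lookup = {w.lower() for w in words}
    return sum(1 for v in values if v.lower() in lookup)
```

The first and third functions operate on sampled values and therefore feed a confidence score, while the second needs only the column name.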
A template is a collection of identifiers and settings that drive the configuration of SDD runs. The settings users can apply through templates include the following:
classifiers (identifiers) are applied to data sources in the SDD run.
tags is an optional override for the tags applied by the identifiers.
minConfidence is an optional override for the minConfidence established in the identifier(s). When the detection confidence is at least the percentage defined in minConfidence, tags are applied.
sampleSize is an optional override for how many records to sample from the data source.
Users may apply a template globally or to a specific set of data sources. When SDD is triggered on a data source, it will use the identifiers and settings in its configured template to run the detection job. If no template has been configured, SDD will use the global settings. By default, the global settings will use all identifiers in the system to run the detection.
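The fallback order described above can be sketched as a small function. The dictionary layout and function name are hypothetical. Only the setting names classifiers and sampleSize and the default of 1000 records come from this documentation.

```python
DEFAULT_SAMPLE_SIZE = 1000  # default sample size when a template does not set one


def resolve_run_config(source_template, global_template, all_identifiers):
    """Pick the identifiers and sample size for an SDD run on one data source."""
    # A template configured on the data source wins over the global template.
    template = source_template or global_template
    if template is None:
        # No template anywhere: run every identifier in the system.
        return {"classifiers": list(all_identifiers), "sampleSize": DEFAULT_SAMPLE_SIZE}
    return {
        "classifiers": list(template["classifiers"]),
        "sampleSize": template.get("sampleSize", DEFAULT_SAMPLE_SIZE),
    }
```

For example, a data source with its own template ignores a configured global template, and a data source with neither falls back to all identifiers.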
SDD does not run on data sources with over 1600 columns.
Deleting the built-in Discovered tags is not recommended: If you do delete built-in Discovered tags and use SDD, when the identifier is detected, the column will not be tagged. Tags can be disabled on a column-by-column basis from the data dictionary, or SDD can be turned off on a data-source-by-data-source basis when creating a data source.
To configure settings and customize SDD, see the SDD pre-configuration page.
In previous documentation, identifier is referred to as classifier. The language is being updated to identifier to be more accurate and not conflate meaning with the Immuta data classification and frameworks feature.
Use case: Custom regex identifier
A regular expression (regex) custom identifier allows you to create your own rules that enable Immuta's sensitive data discovery to find matches based on a regex pattern. For example, if a table contains account numbers in the form of xxxxxxxxx-xxx-x, you could define a regex pattern in a custom identifier to identify and tag these columns. The tutorial below uses this scenario to illustrate creating this identifier.
Save the custom regex identifier payload in a .json file.
Create the identifier using one of these methods:
Immuta CLI
HTTP API
If the request is successful, you will receive a response that contains details about the identifier.
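As an illustration of the payload saved in the first step, the sketch below builds one in Python and serializes it to JSON. The attribute names follow the attribute table for this identifier type. The type value regex, the identifier name, the assumption that each x is a digit, and the use of 0.5 to express 50% confidence are all assumptions and should be checked against the Sensitive data discovery API page.

```python
import json
import re

# Assumption: each "x" in xxxxxxxxx-xxx-x stands for one digit.
ACCOUNT_NUMBER_REGEX = r"^\d{9}-\d{3}-\d$"

identifier_payload = {
    "name": "ACCOUNT_NUMBER_CUSTOM",          # hypothetical request-friendly name
    "displayName": "Account Number (custom)",  # hypothetical display name
    "description": "Matches account numbers in the form xxxxxxxxx-xxx-x.",
    "type": "regex",                           # assumed type value
    "config": {
        "minConfidence": 0.5,                  # assumed representation of 50%
        "tags": ["Discovered.account-number"],
        "regex": ACCOUNT_NUMBER_REGEX,
    },
}

# Quick local check of the pattern before sending it anywhere.
looks_valid = re.fullmatch(ACCOUNT_NUMBER_REGEX, "123456789-123-4") is not None

payload_json = json.dumps(identifier_payload, indent=2)
```

Testing the pattern locally against a few known values before creating the identifier makes it easier to separate regex mistakes from configuration mistakes.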
Continue to one of the following tutorials:
Specify the data sources you would like to run SDD on, and save the payload in a .json file.
Or choose to run SDD on all the data sources in Immuta, and save the payload in a .json file.
Trigger SDD using one of these methods:
Immuta CLI
HTTP API
If sensitive data discovery was successfully run, you will receive a response similar to this:
Users can test how SDD will apply tags to their data sources by completing a dryRun, which allows users to test templates and tags:
test templates: If a template is specified in the payload when dryRun is true, SDD will use this template instead of the template applied to the data source. Note: SDD will error if a template is specified here when dryRun is false.
test tags: Instead of applying tags, SDD just returns the tags that would be applied to the data source. This allows users to evaluate whether or not identifiers or templates are applying tags correctly without updating the data source.
After evaluating whether or not the tags have been applied appropriately, users can then make necessary changes to a template before triggering SDD again.
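The rule about templates and dryRun can be captured in a small payload builder. The attribute names sources, dryRun, and template come from this documentation, but the shape of their values (for example, whether sources holds names or IDs) is an assumption, and the function itself is hypothetical.

```python
def build_run_payload(sources, dry_run=False, template=None):
    """Build a request body for an SDD run, rejecting the invalid combination early."""
    if template is not None and not dry_run:
        # Mirrors the documented behavior: a template is only accepted on a dry run.
        raise ValueError("a template can only be specified when dryRun is true")
    payload = {"sources": list(sources), "dryRun": dry_run}
    if template is not None:
        payload["template"] = template
    return payload


# Hypothetical data source and template names.
preview = build_run_payload(["customer_accounts"], dry_run=True, template="my-template")
```

Validating on the client side only saves a round trip. The service remains the authority on what it accepts.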
To complete a dryRun,
Trigger SDD using one of these methods:
Immuta CLI
HTTP API
You will receive a response that illustrates tags that will be added, tags that will be removed, and the final SDD result:
Once you are satisfied with how tags are applied by SDD, set dryRun to false (or omit it from the payload).
Trigger SDD again:
Immuta CLI
HTTP API
If the request was successful, you will receive a response similar to this one:
Select a data source from your My Data Sources page.
Click the Health Check dropdown menu.
In the Sensitive Data Discovery (SDD) section, click Re-run.
Continue to one of the following tutorials:
Create a custom identifier: Data governors can create custom identifiers to define their own regular expressions, dictionaries, and tags that SDD will use to discover and tag data.
Find identifiers to include in your template using one of these methods:
Immuta CLI
HTTP API
If the request was successful, you will receive a list of available identifiers.
Save the template payload in a .json file. Use the tabs below to see different examples of templates.
Create the template:
Immuta CLI
HTTP API
If the request is successful, you will receive a response that contains details about the template. Use the tabs below to see different responses for different templates.
After the template is applied to data sources and sensitive data discovery is run, the Discovered.account-number tag will be applied to columns that Immuta identifies with 50% confidence, as configured in the identifier.
After the template is applied to data sources and sensitive data discovery is run, the Discovered.desk-location tag will be applied to columns when Immuta detects the values Research Lab, Blue Room, or Purple Room with 60% confidence, as configured in the identifier.
After the template is applied to data sources and sensitive data discovery is run, the Discovered.social-security-number tag will be applied to columns whose names match the ssn|social ?security regex, such as ssn, socialsecurity, or social security.
After the template is applied to data sources and sensitive data discovery is run, the Discovered.residence-hall tag will be applied to columns when Immuta detects values that match those listed in the Residence Halls data source with 70% confidence, as configured in the identifier.
Find templates to apply to your data sources:
Immuta CLI
HTTP API
If the request was successful, you will receive a list of available templates.
Select an appropriate template to apply to your data sources, and save the payload in a .json file:
Apply the template to your data source(s):
Immuta CLI
HTTP API
You will receive a response that indicates whether or not the template was successfully applied to your data sources.
Users cannot modify templates created by other data owners, but they can clone templates and make changes to the clone.
Get a list of templates to determine the template you want to clone using one of these methods:
Immuta CLI
HTTP API
Save the template clone name and details in a .json file.
Clone the template:
Immuta CLI
HTTP API
If the request was successful, you will receive a response that provides details about the template clone.
You can now modify the template, such as changing the identifiers (classifiers) included and the sampleSize.
To prevent an entity tag from being applied, you can create a template that configures the identifier containing that tag.
For example, the built-in PERSON_NAME identifier contains the following tags: Discovered.PHI, Discovered.PII, Discovered.Entity.Person Name, and Discovered.Identifier Indirect. However, your organization doesn't have any health data, so you don't want the PHI tag to be applied to your data sources, but you do want all the other tags within that identifier.
To override the Discovered.PHI tag, you would create a template that includes the PERSON_NAME identifier and removes Discovered.PHI from the list of tags in the template payload.
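The tag override amounts to filtering one tag out of the identifier's tag list before placing the list in the template. The sketch below shows only that filtering step. The layout of the resulting template entry is hypothetical and is not the documented payload structure.

```python
# Tags carried by the built-in PERSON_NAME identifier, as listed above.
PERSON_NAME_TAGS = [
    "Discovered.PHI",
    "Discovered.PII",
    "Discovered.Entity.Person Name",
    "Discovered.Identifier Indirect",
]


def tags_without(tags, unwanted):
    """Return the tags in their original order, minus any unwanted ones."""
    return [t for t in tags if t not in unwanted]


# Hypothetical template entry; consult the API reference for the real structure.
template_entry = {
    "name": "PERSON_NAME",
    "tags": tags_without(PERSON_NAME_TAGS, {"Discovered.PHI"}),
}
```

Because the override replaces the whole tag list, every tag that should still be applied has to be listed explicitly.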
View the details about the PERSON_NAME identifier so you know what to include in your template using one of these methods:
Immuta CLI
HTTP API
If the request was successful, the response will include details about the PERSON_NAME identifier.
Remove the Discovered.PHI tag from the list of tags in the identifier config, and save the template payload in a .json file.
Create the template:
Immuta CLI
HTTP API
If the request is successful, you will receive a response that details the new template:
Now that you've created a template, continue to one of the following tutorials:
Scenario: You've listed Immuta's built-in identifiers for sensitive data discovery, but you discover there is no identifier that can automatically identify and tag columns that contain account numbers in your database.
Attributes of all custom identifiers are provided on the Sensitive data discovery API page. However, attributes specific to the custom regex identifier are outlined in the table below.
Attribute | Description | Required |
---|---|---|
Generate your API key on the API Keys tab on your profile page and save the API key somewhere secure. You will include this API key in the authorization header when you make a request to the Immuta API or use it to configure your instance with the Immuta CLI.
Run sensitive data discovery on data sources: Trigger SDD to run on specified data sources.
Create a template: Although only data governors can create identifiers, data owners can add identifiers to templates, which they then apply to their data sources to override minConfidence or tags for identifiers within the template.
Attributes of all custom identifiers and templates are provided on the Sensitive data discovery API page. However, attributes specific to this section are outlined below.
Attribute | Description |
---|---|
Specify the data sources you would like to run sensitive data discovery on and set dryRun to true in the payload in a .json file. Note: You can also apply a template to a data source as a dryRun, like in the example below. However, when dryRun is false, a template cannot be included in the payload. Instead, the template must be applied to the data source before running SDD.
Run sensitive data discovery on data sources: Trigger SDD to run on specified data sources.
Create a template: Although only data governors can create identifiers, data owners can add identifiers to templates, which they then apply to their data sources to override minConfidence or tags for identifiers within the template.
Generate your API key on the API Keys tab on your profile page and save the API key somewhere secure. You will include this API key in the authorization header when you make a request to the Immuta API.
Attributes of all custom identifiers and templates are provided on the Sensitive data discovery API page. However, attributes specific to this section are outlined in the table below.
Attribute | Description |
---|---|
Opt to add your template to the SDD global settings so that Immuta will use this template to run SDD for all data sources.
Attribute | Description | Required |
---|---|---|
name | | Yes |
displayName | | Yes |
description | | Yes |
type | | Yes |
config | | Yes |
minConfidence* | | Yes |
tags* | | Yes |
regex* | | Yes |
template | |
sources | |
sources | |
all | |
wait | |
dryRun | |
template | |
Use case: Custom dictionary identifier
Scenario: You have data that includes the names of the rooms employees' desks are in across your organization. Although these locations may be considered sensitive in particular datasets, they would not be recognized by Immuta's built-in identifiers.
A custom dictionary identifier allows you to create your own rules that enable Immuta's sensitive data discovery to match a list of room names to values in the dataset. The tutorial below uses this scenario to illustrate creating this identifier.
Attributes of all custom identifiers are provided on the Sensitive data discovery API page. However, attributes specific to the custom dictionary identifier are outlined in the table below.
Generate your API key on the API Keys tab on your profile page and save the API key somewhere secure. You will include this API key in the authorization header when you make a request to the Immuta API or use it to configure your instance with the Immuta CLI.
Save the custom dictionary identifier payload in a .json file. The dictionary below contains the words Research Lab, Blue Room, and Purple Room.
Create the identifier using one of these methods:
Immuta CLI
HTTP API
If the request is successful, you will receive a response that contains details about the identifier.
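For illustration, the dictionary identifier payload from the first step could be assembled as shown below. The attribute names and the dictionary type value follow the attribute table for this identifier type. The identifier name, the description text, and the use of 0.6 to express 60% confidence are assumptions, and validate_tags is a hypothetical helper.

```python
import json


def validate_tags(tags):
    """Enforce the documented rule that every tag starts with 'Discovered.'."""
    bad = [t for t in tags if not t.startswith("Discovered.")]
    if bad:
        raise ValueError(f"tags must start with 'Discovered.': {bad}")
    return list(tags)


dictionary_payload = {
    "name": "DESK_LOCATION",           # hypothetical request-friendly name
    "displayName": "Desk Location",    # hypothetical display name
    "description": "Rooms that contain employee desks.",
    "type": "dictionary",
    "config": {
        "minConfidence": 0.6,          # assumed representation of 60%
        "tags": validate_tags(["Discovered.desk-location"]),
        "values": ["Research Lab", "Blue Room", "Purple Room"],
        "caseSensitive": False,        # documented default
    },
}

payload_json = json.dumps(dictionary_payload, indent=2)
```

The resulting JSON string is what would be saved to the .json file referenced in the steps above.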
Continue to one of the following tutorials:
Run sensitive data discovery on data sources: Trigger SDD to run on specified data sources.
Create a template: Although only data governors can create identifiers, data owners can add identifiers to templates, which they then apply to their data sources to override minConfidence or tags for identifiers within the template.
Attributes of identifiers and templates are provided on the Sensitive data discovery API page. However, attributes specific to listing identifiers are outlined in the table below.
The response lists all built-in identifiers that are currently supported in Immuta SDD and their details, including their name and description. For example,
Generate your API key on the API Keys tab on your profile page and save the API key somewhere secure. You will include this API key in the authorization header when you make a request to the Immuta API or use it to configure your instance with the Immuta CLI.
List built-in identifiers using one of these methods:
Immuta CLI
HTTP API
If the request was successful, you will receive a list of built-in identifiers.
Run sensitive data discovery on data sources: Trigger SDD to run on specified data sources.
Create a template: Although only data governors can create identifiers, data owners can add identifiers to templates, which they then apply to their data sources to override minConfidence or tags for identifiers within the template.
Create a custom identifier: Data governors can create custom identifiers to define their own regular expressions, dictionaries, and tags that SDD will use to discover and tag data.
dryRun (boolean) | When true, SDD will not update the tags on the data source(s) and will just return what tags would have been applied or removed. See for an example. Default is false. |
Attribute | Description |
---|---|
name (string) | Unique, request-friendly identifier name. |
displayName (string) | Unique, human-readable identifier name. |
description (string) | The identifier description. |
type (string) | The type of identifier: dictionary. |
config (object) | Includes config.minConfidence, config.tags, config.values, and config.caseSensitive (defaults to false). *See descriptions below. |
minConfidence* (number) | When the detection confidence is at least this percentage, tags are applied. |
tags* (array[string]) | The name of the tags to apply to the data source. Note: All tags must start with the Discovered. prefix. |
values* (array[string]) | The list of words to include in the dictionary. |
caseSensitive* (boolean) | Indicates whether or not values are case sensitive. Defaults to false. |

Attribute | Description |
---|---|
sortField (string) | The field by which to sort the search results: id, name, displayName, type, createdAt, or updatedAt. |
sortOrder (string) | Denotes whether to sort the results in ascending (asc) or descending (desc) order. Default is asc. |
offSet (integer) | Use in combination with limit to fetch pages. |
limit (integer) | Limits the number of results displayed per page. |
type (array[string]) | Searches for identifiers based on identifier type: builtIn. |
searchText (string) | A partial, case-insensitive search on name. |
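The query parameters above can be combined into a query string as in the sketch below. The parameter names and allowed values come from the table. The function name, the default values chosen here, and the choice to encode the type array as a repeated key are assumptions. The endpoint path is deliberately left out because it is not given in this section.

```python
from urllib.parse import urlencode

SORT_FIELDS = {"id", "name", "displayName", "type", "createdAt", "updatedAt"}


def identifier_query(search_text=None, types=("builtIn",), sort_field="name",
                     sort_order="asc", offset=0, limit=25):
    """Build the query string for listing identifiers."""
    if sort_field not in SORT_FIELDS:
        raise ValueError(f"unsupported sortField: {sort_field}")
    if sort_order not in ("asc", "desc"):
        raise ValueError(f"unsupported sortOrder: {sort_order}")
    params = [
        ("sortField", sort_field),
        ("sortOrder", sort_order),
        ("offSet", offset),
        ("limit", limit),
    ]
    # Assumption: array values are sent as a repeated key.
    params += [("type", t) for t in types]
    if search_text:
        params.append(("searchText", search_text))
    return urlencode(params)


query = identifier_query(search_text="social")
```

Paging through results would then mean calling the function again with offset increased by limit until a page comes back with fewer than limit results.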
Use case: Custom column name regex identifier
Scenario: You've listed Immuta's built-in identifiers for sensitive data discovery, but you discover there is no identifier that can automatically detect and tag columns that contain account numbers in your database.
A custom column name regular expression (regex) identifier allows you to create your own detectors that enable Immuta's sensitive data discovery to find column name matches based on a regex pattern. For example, if your database contains tables with social security numbers, you could define a regex pattern to match against the names of the column instead of the values within the column. The tutorial below uses this scenario to illustrate creating this identifier.
Attributes of all custom identifiers are provided on the Sensitive data discovery API page. However, attributes specific to the custom column name regex identifier are outlined in the table below.
Generate your API key on the API Keys tab on your profile page and save the API key somewhere secure. You will include this API key in the authorization header when you make a request to the Immuta API or use it to configure your instance with the Immuta CLI.
Save the custom column name regex identifier payload in a .json file. The regex ^ssn|social ?security$ looks for column names that match ssn, socialsecurity, or social security.
Create the identifier using one of these methods:
Immuta CLI
HTTP API
If the request is successful, you will receive a response that contains details about the identifier.
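The sketch below shows what such a payload might look like and which column names the example regex accepts. The attribute names and the columnNameRegex type value follow the attribute table for this identifier type. The identifier name and description are hypothetical, and the matching is done with Python's re module, which may not behave identically to the regex engine Immuta uses.

```python
import re

column_name_payload = {
    "name": "SSN_COLUMN_NAME",               # hypothetical request-friendly name
    "displayName": "SSN Column Name",        # hypothetical display name
    "description": "Matches column names that suggest social security numbers.",
    "type": "columnNameRegex",
    "config": {
        "columnNameRegex": "^ssn|social ?security$",
        "tags": ["Discovered.social-security-number"],
    },
}


def matching_columns(column_names, pattern):
    """Return the column names that the case-insensitive pattern matches."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in column_names if rx.search(name)]


columns = ["ssn", "SocialSecurity", "social security", "ssn_last4", "customer_id"]
matched = matching_columns(columns, column_name_payload["config"]["columnNameRegex"])
strict = matching_columns(columns, "^(ssn|social ?security)$")
```

Under Python's semantics, alternation binds more loosely than the anchors, so the pattern reads as "starts with ssn" or "ends with social security" and also accepts ssn_last4. Grouping the alternatives, as in the strict variant, limits matches to the three exact names.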
Continue to one of the following tutorials:
Run sensitive data discovery on data sources: Trigger SDD to run on specified data sources.
Create a template: Although only data governors can create identifiers, data owners can add identifiers to templates, which they then apply to their data sources to override minConfidence or tags for identifiers within the template.
Attribute | Description | Required |
---|---|---|
name (string) | Unique, request-friendly identifier name. | Yes |
displayName (string) | Unique, human-readable identifier name. | Yes |
description (string) | The identifier description. | Yes |
type (string) | The type of identifier: columnNameRegex. | Yes |
config (object) | Includes config.columnNameRegex and config.tags. *See descriptions for these below. | Yes |
tags* (array[string]) | The name of the tags to apply to the data source. Note: All tags must start with the Discovered. prefix. | Yes |
columnNameRegex* (string) | A case-insensitive regular expression to match against column names. | Yes |