Architecture

Discover automates the discovery and tagging of data across your data platform. It covers both the identification and the classification of data, each driven by frameworks.

Requirements

Sensitive data discovery (SDD) enabled
Frameworks enabled
Registered Snowflake, Databricks, Redshift, or Starburst (Trino) data sources

Components

The Immuta UI has separate sections for identification frameworks and classification frameworks. Both framework types consist of rules, criteria, and resulting tags, but the criteria types differ between them. Identification frameworks use competitive pattern analysis and column name matching to discover data types and tag them. Classification frameworks use the tags on a column, its neighboring columns, and its data source as context, and then tag the column based on that context. Each framework type is described below.

Identification frameworks

Identification frameworks run with sensitive data discovery (SDD). They use data patterns to discover data and tag it based on what the data is.

Supported criteria and pattern types

Competitive pattern analysis: This criterion reviews all the regex and dictionary patterns within the framework's rules and searches for the pattern with the best fit. The competitive pattern analysis criteria in the framework compete against one another to find the best and most specific pattern that fits the data. The resulting tags for the best pattern's rule are then applied to the column.
- Regex pattern: This pattern contains a case-insensitive regular expression that searches for matches against column values. Create a regex pattern in the UI or with the sdd/classifier endpoint.
- Dictionary pattern: This pattern contains a list of words and phrases to match against column values. Create a dictionary pattern in the UI or with the sdd/classifier endpoint.
Column name: This criterion matches a column name pattern against the column names in the data sources. The rule's resulting tags are applied to the column whose name matches.
- Column name pattern: This pattern includes a case-insensitive regular expression matched against column names, not against the values in the column. Create a column name pattern in the UI or with the sdd/classifier endpoint.
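
As a rough illustration of how patterns might compete, the sketch below scores each regex or dictionary pattern by the share of sampled column values it matches and keeps the best-scoring rule. The scoring function, the threshold, whole-value regex matching, the tag names, and every identifier (`Pattern`, `best_rule`, `column_name_matches`) are assumptions made for illustration. Immuta's actual fit metric is not described here, and SDD itself runs as a query in the remote database rather than in Python.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pattern:
    """A hypothetical value pattern attached to a rule."""
    rule_name: str
    tags: tuple
    matches: Callable[[str], bool]


def regex_pattern(rule_name, tags, expression):
    # Regex patterns are case-insensitive and are tested against column values.
    # Requiring the whole value to match is an assumption of this sketch.
    compiled = re.compile(expression, re.IGNORECASE)
    return Pattern(rule_name, tuple(tags), lambda v: compiled.fullmatch(v) is not None)


def dictionary_pattern(rule_name, tags, words):
    # Dictionary patterns test values against a list of words and phrases.
    # Case-insensitive comparison is an assumption of this sketch.
    lookup = {w.lower() for w in words}
    return Pattern(rule_name, tuple(tags), lambda v: v.lower() in lookup)


def best_rule(patterns, values, threshold=0.8):
    """Return the pattern with the highest match ratio, or None if below threshold."""
    best, best_score = None, 0.0
    for pattern in patterns:
        if not values:
            break
        score = sum(pattern.matches(v) for v in values) / len(values)
        if score > best_score:
            best, best_score = pattern, score
    return best if best_score >= threshold else None


def column_name_matches(expression, column_name):
    # Column name patterns test the column's name, never its values.
    return re.search(expression, column_name, re.IGNORECASE) is not None


# Hypothetical rules and tags, for illustration only.
patterns = [
    regex_pattern("us_phone", ["Discovered.Phone"], r"\d{3}-\d{3}-\d{4}"),
    dictionary_pattern("us_state", ["Discovered.State"], ["Ohio", "Texas", "Maine"]),
]
sample = ["555-123-4567", "555-987-6543", "n/a"]
winner = best_rule(patterns, sample, threshold=0.6)
```

Here the phone rule matches two of the three sampled values and the dictionary rule matches none, so `winner` is the phone pattern and its tags would be the ones applied to the column.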

To start using identification frameworks in the UI, see the Getting started guide.
To manage identification frameworks with the API, see the /sdd/template endpoint reference guide.

Classification frameworks

Classification frameworks run with the classify service. They evaluate each rule's criteria against proximity tags (the tags already on a column and on its neighboring columns) and then tag the data based on the context it sits in.

Supported criteria

Match column tag: This criterion applies resulting tags based on specific tags already on the column.
Match neighboring column tag: This criterion applies resulting tags based on specific tags on neighboring columns.
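
The two criteria above can be sketched as a small tag-matching routine. This is a minimal illustration, not Immuta's classify service: the `Column` and `Rule` types, the tag names, treating every other column in the same data source as a "neighbor", and requiring both criteria when a rule defines both are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Column:
    name: str
    tags: set = field(default_factory=set)


@dataclass(frozen=True)
class Rule:
    """Hypothetical classification rule: criteria plus resulting tags."""
    column_tag: str                # "match column tag" criterion
    neighbor_tag: Optional[str]    # optional "match neighboring column tag" criterion
    resulting_tags: frozenset


def classify(columns, rules):
    """Run one pass and return {column name: tags applied in this pass}."""
    applied = {}
    for column in columns:
        # Assumption: neighbors are all other columns in the same data source.
        neighbor_tags = set()
        for other in columns:
            if other is not column:
                neighbor_tags |= other.tags
        for rule in rules:
            if rule.column_tag not in column.tags:
                continue
            if rule.neighbor_tag is not None and rule.neighbor_tag not in neighbor_tags:
                continue
            applied.setdefault(column.name, set()).update(rule.resulting_tags)
    # Apply after evaluation so new tags do not affect this pass.
    for column in columns:
        column.tags |= applied.get(column.name, set())
    return applied


# Hypothetical columns, tags, and rule, for illustration only.
columns = [
    Column("dob", {"Discovered.Date of Birth"}),
    Column("last_name", {"Discovered.Person Name"}),
    Column("notes"),
]
rule = Rule(
    column_tag="Discovered.Date of Birth",
    neighbor_tag="Discovered.Person Name",
    resulting_tags=frozenset({"Example.Personal Data"}),
)
result = classify(columns, [rule])
```

Only `dob` satisfies both criteria, so it is the only column that receives the resulting tag. The sketch runs a single pass and does not re-run when tags change.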

To manage classification frameworks in the UI, see the Activate frameworks guide.
To create a classification framework with the API, see the /frameworks endpoint reference guide.

Data inventory dashboard

Private preview: This feature is only available to select accounts.

The data inventory dashboard visualizes information about your organization's data. It presents your entire data corpus in the context of the frameworks that are actively tagging your data, with details such as when your data was last scanned and how much of the scanned data is relevant to your active frameworks.

In the data inventory dashboard, you will see tiles for scanned coverage and for the percent of data scanned within a specific time frame. These tiles reflect data scanned by an identification framework with SDD. To increase the number of data sources that have been scanned, run SDD.

The next section of the dashboard shows tiles for the compliance frameworks. Each graph shows the split between columns that contain data relevant to the compliance framework and columns that do not. These graphs update every time classification runs, which happens on the classification events listed in the Frequency section.

For information on the frameworks visualized in the dashboard, see the Immuta frameworks reference guide.

Workflow

The Discover workflow involves both identification with SDD and classification:

A user with the GOVERNANCE permission enables SDD and activates classification frameworks.
Users register data in Immuta.
SDD runs:
1. Immuta generates a SQL query using the identification framework's rules.
2. That query is executed in the remote database.
3. Immuta receives the query results containing the column name and the matching rules but no raw data values.
4. SDD applies the resulting tags to the relevant columns.
Classification runs:
1. The data source's current tags are checked against the framework's rules.
2. When a matching rule is found, the resulting tags are applied to the relevant columns.
Users with the GOVERNANCE permission or data owners can view the data inventory dashboard with visualizations of their scanned data.
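
Steps 3 and 4 of the SDD run can be sketched as follows. The point of the example is that only column names and matching rule names come back from the remote database, so tagging needs no raw values. The result shape (pairs of column name and rule name), the function name `apply_sdd_results`, and the tag names are assumptions for illustration; the actual query and result format are not described here.

```python
def apply_sdd_results(query_results, rule_tags, column_tags):
    """Return a new {column: tags} mapping with SDD's resulting tags added.

    query_results: iterable of (column_name, rule_name) pairs (assumed shape).
    rule_tags:     {rule_name: iterable of resulting tags} from the framework.
    column_tags:   {column_name: set of tags} already on the data source.
    """
    # Copy so the caller's mapping is left untouched.
    updated = {column: set(tags) for column, tags in column_tags.items()}
    for column_name, rule_name in query_results:
        resulting = rule_tags.get(rule_name)
        if resulting is None:
            continue  # rule is not part of the framework: nothing to apply
        updated.setdefault(column_name, set()).update(resulting)
    return updated


# Hypothetical results and tags, for illustration only.
results = [("email", "email_rule"), ("phone", "phone_rule"), ("email", "unknown_rule")]
rule_tags = {"email_rule": ["Discovered.Email"], "phone_rule": ["Discovered.Phone"]}
existing = {"email": {"Manual.Reviewed"}}
tags = apply_sdd_results(results, rule_tags, existing)
```

Existing tags are preserved and the resulting tags are added alongside them. In this sketch, results that name a rule outside the framework are ignored.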

Frequency

This workflow runs when a new data source is manually registered in Immuta or detected by schema monitoring. Additionally, SDD alone runs on the following events:

A new data source is created.
Schema monitoring is enabled, and a new data source is detected.
Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed.
A user manually triggers it from the data source health check menu.
A user manually triggers it from the identification frameworks page.
A user manually triggers it through the API.

Classification runs on the following events:

A framework is created, updated, or deleted.
A tag is added to or removed from a column, either manually or by SDD.
A tag is added to a data source.
A user manually triggers it from the data source health check menu.
A user manually triggers it through the API.
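
The trigger lists above can be summarized as a small lookup. The event identifiers below are hypothetical names invented for this sketch, not Immuta event names or API values. The manual triggers are kept separate per process because the text lists them separately for SDD and for classification.

```python
# Events that start an SDD run, per the first list above.
SDD_EVENTS = {
    "data_source_created",
    "schema_monitoring_detected_data_source",
    "column_detection_found_new_columns",  # SDD runs on the new columns only
    "sdd_triggered_from_health_check_menu",
    "sdd_triggered_from_identification_frameworks_page",
    "sdd_triggered_through_api",
}

# Events that start a classification run, per the second list above.
CLASSIFICATION_EVENTS = {
    # New data sources run the full workflow, so they appear in both sets.
    "data_source_created",
    "schema_monitoring_detected_data_source",
    "framework_created",
    "framework_updated",
    "framework_deleted",
    "column_tag_added",    # manually or by SDD
    "column_tag_removed",  # manually or by SDD
    "data_source_tag_added",
    "classification_triggered_from_health_check_menu",
    "classification_triggered_through_api",
}


def processes_for(event):
    """Return the processes that the given event starts directly."""
    processes = []
    if event in SDD_EVENTS:
        processes.append("sdd")
    if event in CLASSIFICATION_EVENTS:
        processes.append("classification")
    return processes
```

The lookup covers direct triggers only. Because a tag added by SDD is itself a classification event, an SDD run that applies new tags can lead to a classification run as well.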

Caveat

Customizing classification frameworks is currently possible only through the Immuta API.

Discover section contents

Conceptual guides:

Data classification

Getting started guide:

Getting started with Discover

How-to guides:

Identification guides:
Classification guides:
- Activate a classification framework
- Adjust and accept entity and classification tags

Reference guides:


Last updated 6 months ago
