Data Identification

Identification is an Immuta feature that uses data patterns to determine what type of data your column represents. Using identifiers within domains, Immuta evaluates your data and can assign the appropriate tags to your data dictionary based on what it finds. This saves the time of identifying your data manually and provides the benefit of a standard taxonomy across all your data sources in Immuta.

Architecture

To evaluate your data, Immuta generates a SQL query using a domain's identifiers. The Immuta system account then executes that query in the remote technology to match any regex and dictionary identifiers. Immuta receives the query result, containing the column name and the matching identifiers but no raw data values. Column name identifiers are all matched within Immuta and don't require any query to the remote technology. These results are then used to apply the resulting tags to the appropriate columns.
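The flow above can be illustrated with a minimal sketch, assuming an in-memory SQLite table stands in for the remote technology. The table, column names, identifier names, patterns, and query shape are all hypothetical; the SQL that Immuta actually generates is not shown in this document. The sketch only demonstrates the key property described above: the query result carries column names, identifier names, and match counts, but no raw data values.

```python
import re
import sqlite3

# Hypothetical identifiers (name -> regex); these are not Immuta's built-in patterns.
IDENTIFIERS = {
    "EMAIL": r"^[^@\s]+@[^@\s]+\.[a-z]+$",
    "US_ZIP": r"^\d{5}$",
}

def regexp(pattern, value):
    # SQLite evaluates `X REGEXP Y` by calling the user function regexp(Y, X).
    return value is not None and re.search(pattern, value, re.IGNORECASE) is not None

def build_query(table, columns, identifiers):
    # One SELECT per (column, identifier) pair; each returns only aggregate counts.
    parts = []
    for col in columns:
        for name in identifiers:
            parts.append(
                f"SELECT '{col}' AS column_name, '{name}' AS identifier, "
                f"SUM({col} REGEXP :{name}) AS matched, COUNT({col}) AS sampled "
                f"FROM {table}"
            )
    return " UNION ALL ".join(parts)

# Stand-in for the remote technology.
conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE customers (contact TEXT, postal TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("a@example.com", "12345"), ("b@example.org", "90210"), ("c@example.net", "abcde")],
)

query = build_query("customers", ["contact", "postal"], IDENTIFIERS)
results = conn.execute(query, IDENTIFIERS).fetchall()

# Each row is (column_name, identifier, matched, sampled); no raw values are returned.
for row in results:
    print(row)
```

Column name identifiers are left out of this sketch because, as noted above, they are matched within Immuta and need no query against the remote technology.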

This evaluating and tagging process occurs when identification runs and happens automatically from the following event:

The following actions will also trigger identification:

Identifiers

Identification runs identifiers to discover data. These identifiers are grouped into domains with data sources. Each identifier contains a single criterion and the tags that will be applied when that criterion's conditions have been met.

There are two types of identifiers in Immuta:

  1. Reference identifiers: This is a library of the identifiers that can be added to domains. When added to a domain, a copy of the reference identifier is made as the domain-specific identifier.

    1. Immuta comes with built-in identifiers to discover common categories of data. These cannot be modified or deleted.

    2. Data governors can create their own reference identifiers for use within your organization.

  2. Domain-specific identifiers: These identifiers only exist within a specific domain and are checked against the data sources in that domain when identification runs.

    1. Users with the Manage Identifiers permission can create these identifiers or add them to a domain from a reference identifier.

    2. If a domain-specific identifier was copied over from a reference identifier, there is no lineage and any edits to the reference identifier will not be reflected in the domain-specific copy.

Criteria

Criteria are the conditions in an identifier that need to be met for resulting tags to be applied to data.

  • Competitive criteria analysis: This is a process that reviews all the regex and dictionary criteria within the identifiers of the domain and searches for the identifier with the best fit. In this review, the competitive criteria analysis identifiers in the domain compete against each other to find the best and most specific identifier that fits the data. The resulting tags for the best identifier are then applied to the column. Only one competitive criteria analysis identifier for each domain will apply per column. Competitive criteria identifiers, both built-in and custom, must match at least 90% of the data sampled. To learn more about how the competition works, see the How competitive criteria analysis works guide.

    • Regex: A case-insensitive regular expression (regex) that searches for matches against column values. Immuta only supports regular expressions written in RE2 syntax.

    • Dictionary: A list of words and phrases to match against column values.

  • Column name: A case-insensitive regular expression (regex) matched against column names, not against the values in the column. The identifier's tags will be applied to the column where the name is found. Multiple column name identifiers can match a column and be applied. Immuta only supports regular expressions written in RE2 syntax.
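The difference between the criteria types can be sketched as follows. This is an illustration under stated assumptions, not Immuta's implementation: the identifier names and patterns are hypothetical, Python's `re` module is used in place of an RE2 engine, and the "best fit" rule is simplified to the highest match rate (how Immuta ranks specificity is not described here). Only the 90% threshold, the one-winner rule for competitive identifiers, and the allow-many rule for column name identifiers come from the text above.

```python
import re

THRESHOLD = 0.90  # competitive identifiers must match at least 90% of the sampled data

def regex_criteria(pattern):
    # Case-insensitive regex searched for in column values.
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda value: compiled.search(value) is not None

def dictionary_criteria(words, case_sensitive=False):
    # List of words and phrases matched against column values.
    entries = set(words) if case_sensitive else {w.lower() for w in words}
    return lambda value: (value if case_sensitive else value.lower()) in entries

def best_competitive_identifier(sample, identifiers):
    """Return the single winning identifier name, or None if none reaches the threshold.

    Assumption: the winner is simply the identifier with the highest match rate.
    """
    if not sample:
        return None
    best_name, best_rate = None, 0.0
    for name, matches in identifiers.items():
        rate = sum(1 for value in sample if matches(value)) / len(sample)
        if rate >= THRESHOLD and rate > best_rate:
            best_name, best_rate = name, rate
    return best_name

def column_name_matches(column_name, name_identifiers):
    """Return every column name identifier that matches; several may apply at once."""
    return [
        name
        for name, pattern in name_identifiers.items()
        if re.search(pattern, column_name, re.IGNORECASE)
    ]

# Hypothetical sampled values and identifiers.
sample = ["red", "Blue", "green", "RED", "blue", "green", "red", "blue", "green", "n/a"]
competitive = {
    "COLOR": dictionary_criteria(["red", "green", "blue"]),
    "THREE_LETTERS": regex_criteria(r"^[a-z]{3}$"),
}
winner = best_competitive_identifier(sample, competitive)

name_identifiers = {"EMAIL_COLUMN": r"e-?mail", "CUSTOMER_COLUMN": r"^cust"}
name_hits = column_name_matches("customer_email", name_identifiers)
```

Here `COLOR` matches 9 of 10 sampled values and wins, while `THREE_LETTERS` matches only 3 of 10 and is discarded. Both column name identifiers match `customer_email`, so both would apply.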

Create a new identifier in the Immuta UI or with the sdd/identifier endpoint.

Supported technologies

Identification has varied support for data sources from different technologies based on the identifier type.

| Technology | Regex | Dictionary | Column name regex |
| --- | --- | --- | --- |
| Amazon S3 | Not supported | Not supported | Supported |
| Azure Synapse Analytics | Not supported | Not supported | Supported |
| Databricks | Supported | Supported | Supported |
| Google BigQuery | Not supported | Not supported | Supported |
| Redshift | Supported | Supported | Supported |
| Snowflake | Supported | Supported | Supported |
| Starburst (Trino) | Supported | Supported | Supported |

Tag mutability

When identification is manually triggered by a data owner, all column tags previously applied by identification are removed and the tags prescribed by the latest run are applied. However, if identification is triggered because a new column is detected by schema monitoring or object sync, tags will only be applied to the new column, and no tags will be modified on existing columns. Additionally, governors, data source owners, and data source experts can disable any unwanted tags in the data dictionary to prevent them from being used and auto-tagged on that data source in the future.
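The two behaviors described above can be sketched as a small function. This is a simplified model, not Immuta's implementation: the function name, the trigger labels, the tag names, and the data shapes are hypothetical, and only tags applied by identification are tracked.

```python
def apply_identification(existing_tags, run_results, trigger, new_columns=(), disabled=frozenset()):
    """Return the identification-applied tags per column after a run.

    existing_tags: {column: set of tags previously applied by identification}
    run_results:   {column: set of tags prescribed by the latest run}
    trigger:       "manual" (data owner) or "new_column" (schema monitoring / object sync)
    disabled:      set of (column, tag) pairs disabled in the data dictionary
    """
    def allowed(column, tags):
        # Disabled tags are not auto-tagged on that column again.
        return {tag for tag in tags if (column, tag) not in disabled}

    if trigger == "manual":
        # All previous identification tags are removed; the latest run's tags replace them.
        updated = {col: allowed(col, tags) for col, tags in run_results.items()}
        return {col: tags for col, tags in updated.items() if tags}

    if trigger == "new_column":
        # Existing columns keep their tags untouched; only new columns are tagged.
        updated = {col: set(tags) for col, tags in existing_tags.items()}
        for col in new_columns:
            tags = allowed(col, run_results.get(col, set()))
            if tags:
                updated[col] = tags
        return updated

    raise ValueError(f"unknown trigger: {trigger!r}")

# Hypothetical columns and tag names.
existing = {"email": {"Discovered.PII"}, "zip": {"Discovered.Postal"}}
latest = {"email": {"Discovered.Email"}, "phone": {"Discovered.Phone"}}

after_manual = apply_identification(existing, latest, "manual")
after_new_column = apply_identification(existing, latest, "new_column", new_columns=["phone"])
```

After the manual run, `zip` carries no identification tags because the latest run prescribed none for it. After the new-column run, `email` and `zip` are unchanged and only `phone` gains a tag.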

Performance

The amount of time it takes to run identification on a data source depends on several factors:

  • Columns: The time to run identification grows nearly linearly with the number of text columns in the data source.

  • Identifiers: The number of identifiers being used weakly impacts the time to run identification.

  • Row count: Performance of identification may vary depending on the sampling method used by each technology. For Snowflake, the number of rows has little impact on the time because data sampling has near-constant performance.

  • Views: Performance on views is limited by the performance of the query that defines the view. Running identification on complex views with large amounts of data is more likely to result in timeouts. Immuta recommends running identification on the underlying base tables.

The time it takes to run identification for all newly onboarded data sources in Immuta is not limited by identification's performance but by the execution of background jobs in Immuta. Consult your Immuta account manager when onboarding a large number of data sources to ensure the advanced settings are set appropriately for your organization.

Testing

For users interested in testing identification, note that Immuta's built-in identifiers must match at least 90% of the sampled data to be assigned to a column. With synthetic data, the values may therefore not be realistic enough to reach that threshold. To test identification, use a dev environment and create copies of your tables.

Audit

The following identification-related events are audited and can be found on the audit page in the UI:

  • SDDClassifierCreated: An identifier is created.

  • SDDClassifierDeleted: An identifier is deleted.

  • SDDClassifierUpdated: An identifier's criteria, description, name, or tag is updated.

  • TagApplied: A tag is applied to a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.

  • TagRemoved: A tag is removed from a data source or column. Tag events from identification will have actor.name.Immuta System Account and will include the related identifiers in the event as relatedResources.type.CLASSIFIERS.
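When reviewing exported audit events, the tag events that came from identification can be separated from manual tagging with a filter like the one below. This is a sketch under assumptions: the event dictionaries, the `action` field name, and the nesting of `actor` and `relatedResources` are hypothetical shapes inferred from the field paths mentioned above, so check them against real events from your audit export before relying on them.

```python
IDENTIFICATION_ACTOR = "Immuta System Account"
TAG_EVENTS = {"TagApplied", "TagRemoved"}

def is_identification_tag_event(event):
    """Return True for tag events that were produced by identification.

    Assumed event shape (hypothetical):
      {"action": ..., "actor": {"name": ...}, "relatedResources": [{"type": ...}, ...]}
    """
    actor_name = event.get("actor", {}).get("name")
    related_types = {res.get("type") for res in event.get("relatedResources", [])}
    return (
        event.get("action") in TAG_EVENTS
        and actor_name == IDENTIFICATION_ACTOR
        and "CLASSIFIERS" in related_types
    )

# Hypothetical sample events.
events = [
    {
        "action": "TagApplied",
        "actor": {"name": "Immuta System Account"},
        "relatedResources": [{"type": "CLASSIFIERS", "name": "EMAIL"}],
    },
    {
        "action": "TagApplied",
        "actor": {"name": "jane.doe@example.com"},
        "relatedResources": [],
    },
    {
        "action": "SDDClassifierCreated",
        "actor": {"name": "jane.doe@example.com"},
    },
]

from_identification = [e for e in events if is_identification_tag_event(e)]
```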

Considerations

Deleting the built-in Discovered tags is not recommended: If you delete built-in Discovered tags and use the built-in identifiers without editing their tags, columns that match those identifiers will not be tagged. As an alternative, disable tags on a column-by-column basis from the data dictionary, or do not add identifiers to domains, in which case identification will not run.

Supported data types and casing

| Type of identifier | Supported data types | Case sensitivity |
| --- | --- | --- |
| Data regex* | Text string columns | Case-sensitive |
| Column name regex | Any column | Not case-sensitive |
| Dictionary | Text string columns | Can be toggled in the identifier definition |

*Two built-in patterns support and match based on additional data types:

  • DATE: Columns will match this identifier if they are string and the regex matches or if the data type is date, date+time, or timestamp.

  • TIME: Columns will match this identifier if they are string and the regex matches or if the data type is time. Note that if the date is included in the data, it will not match this identifier.
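The DATE and TIME rules above can be sketched as a single check. The type names, the regular expressions, and the function name are hypothetical (real type names vary by technology, and the built-in patterns are not shown in this document). The 90% threshold is the one this page states for built-in identifiers.

```python
import re

THRESHOLD = 0.90

# Hypothetical normalized type names and patterns.
RULES = {
    "DATE": {"types": {"date", "datetime", "timestamp"}, "regex": r"^\d{4}-\d{2}-\d{2}$"},
    "TIME": {"types": {"time"}, "regex": r"^\d{2}:\d{2}(:\d{2})?$"},
}

def matches_builtin(identifier, data_type, sample=()):
    """Return True if a column matches the DATE or TIME built-in pattern."""
    rule = RULES[identifier]
    if data_type in rule["types"]:
        # Matched on data type alone; no values need to be inspected.
        return True
    if data_type != "string" or not sample:
        return False
    hits = sum(1 for value in sample if re.search(rule["regex"], value))
    return hits / len(sample) >= THRESHOLD

date_by_type = matches_builtin("DATE", "timestamp")
time_by_regex = matches_builtin("TIME", "string", ["10:30:00", "23:59"])
# A value that includes the date does not match TIME.
time_with_date = matches_builtin("TIME", "string", ["2024-01-01 10:30:00"])
```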

Limitations with query size

The size of the identification query for regex patterns and for dictionary patterns (which are compiled into a regex) is limited by the backing technology.

Databricks limitations

  • Immuta will start up a Databricks cluster to complete the identification job if one is not already running. This can cause unnecessary costs if the cluster becomes idle. Follow Databricks best practices to automatically terminate inactive clusters after a set period of time.

  • The following Databricks Unity Catalog securable objects are supported with Immuta, but cannot be used with identification:

    • Volumes (external and managed)

    • Models

    • Functions

  • Using a large number of files to store the data in a table with a large number of rows may result in the Databricks planner scanning the entire table, resulting in a slow performing query.

Redshift limitations

  • The Redshift cluster must be up and running for identification to successfully run.

AWS access key limitations

To use AWS access key authentication on a Redshift data source and have competitive criteria analysis identifiers supported, the following requirements apply:

  • The AWS access key used to register the data source must be able to do a minimum of the following redshift-data API actions:

    • redshift-data:BatchExecuteStatement

    • redshift-data:CancelStatement

    • redshift-data:DescribeStatement

    • redshift-data:ExecuteStatement

    • redshift-data:GetStatementResult

    • redshift-data:ListStatements

  • The AWS access key used to register the data source must have redshift:GetClusterCredentials for the cluster, user, and database that they onboard their data sources with.

  • If using a custom URL, then the data source registered with the AWS access key must have the region and clusterid included in the additional connection string options formatted like the following example:

  • Redshift Serverless data sources are not supported for competitive criteria analysis identifiers with the AWS access key authentication method.
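The permissions listed above could be collected into an IAM policy document along the lines of the sketch below. The action names come from the list above; everything else is an assumption for illustration: the function name, the region, account ID, cluster, user, and database values are placeholders, and the `"Resource": "*"` scope on the redshift-data statement is a simplification that you may want to narrow. Review the result against AWS documentation and your own security requirements before use.

```python
import json

# Actions taken from the requirements listed above.
REDSHIFT_DATA_ACTIONS = [
    "redshift-data:BatchExecuteStatement",
    "redshift-data:CancelStatement",
    "redshift-data:DescribeStatement",
    "redshift-data:ExecuteStatement",
    "redshift-data:GetStatementResult",
    "redshift-data:ListStatements",
]

def build_policy(region, account_id, cluster_id, db_user, db_name):
    """Build a hypothetical IAM policy document as a Python dict."""
    prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RedshiftDataApi",
                "Effect": "Allow",
                "Action": REDSHIFT_DATA_ACTIONS,
                "Resource": "*",  # assumption: broad scope kept for brevity
            },
            {
                "Sid": "ClusterCredentials",
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    f"{prefix}:dbuser:{cluster_id}/{db_user}",
                    f"{prefix}:dbname:{cluster_id}/{db_name}",
                ],
            },
        ],
    }

# Placeholder values; substitute the ones used to onboard your data sources.
policy = build_policy("us-east-1", "123456789012", "example-cluster", "immuta_user", "analytics")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```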

Legacy SDD

Legacy SDD was available before October 2023. It is no longer available, but some users may still see the term "legacy SDD" in the context of their data tags applied to specific data sources. These tags will be removed the next time identification runs.
