How Competitive Pattern Analysis Works

Of sensitive data discovery's three pattern options, regex and dictionary are competitive. This means that when multiple competitive patterns could match a column's data, only one of them is chosen to tag the data. The rest of this page explains how Immuta runs this competition.

Discover employs a three-phased competitive pattern analysis approach for sensitive data discovery (SDD):

  1. Sampling: No data is moved, and Immuta checks the patterns against a sample of data from the table.

  2. Qualifying: Patterns that match less than 90% of the sampled values are filtered out.

  3. Scoring: The remaining patterns are compared with one another to find the most specific pattern that qualifies and matches the sample.

In the end, competitive pattern analysis aims to find a single pattern for each column that best describes the data format.

Sampling

In the sampling process, no database contents are transmitted to Immuta. Instead, Discover instructs the remote database to measure, over a row sample, the column-wise hit rate (how often each active pattern matches a value in the column), and Immuta receives only that hit rate information for each active pattern.

The sample size is decided based on the number of patterns and the data size, when available. In the most simplified case, the requested number of sampled rows depends only on the number of regex and dictionary patterns being run in the framework, not the data size. The sample size depends only weakly on the number of patterns and will not exceed 13,000 rows.

Number of patterns    Sample size
5                     7369 rows
50                    9211 rows
500                   11053 rows
5000                  12895 rows
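This page does not state the formula behind these numbers. As an observation only, the four values in the table coincide with a Hoeffding-style bound, n = ceil(ln(2k / delta) / (2 * epsilon^2)) for k patterns, with epsilon = 0.025 and delta = 0.001. Both constants and the function name below are assumptions chosen because they reproduce the table; they are not documented Immuta parameters. The sketch also ignores any adjustment for data size and the 13,000-row limit.

```python
import math

# Assumed constants: chosen only because they reproduce the documented table.
EPSILON = 0.025  # hypothetical tolerated error in a measured hit rate
DELTA = 0.001    # hypothetical allowed probability of exceeding that error


def requested_sample_size(num_patterns: int) -> int:
    """Rows requested for a given number of regex and dictionary patterns."""
    return math.ceil(math.log(2 * num_patterns / DELTA) / (2 * EPSILON ** 2))


for k in (5, 50, 500, 5000):
    print(k, requested_sample_size(k))
```

The logarithm is what makes the dependence on the number of patterns weak: multiplying the pattern count by 10 adds roughly 1,842 rows each time.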

Sampling considerations

In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer rows than requested, when many of the column values are null, or because of technology-specific limitations.

  • Snowflake and Starburst (Trino): Discover implements native table sampling by row count.

  • Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy: a flat 10% row sample, capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
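To see why small tables are under-sampled under the best-effort strategy, the approximate number of sampled rows can be estimated as below. This is a sketch: the function name is hypothetical, and it assumes the 10% sample returns exactly 10% of the rows, whereas an actual row sample will vary around that figure.

```python
def best_effort_sample_rows(table_rows: int, percent: int = 10, cap: int = 10_000) -> int:
    """Approximate rows returned by a flat percentage sample with a row cap."""
    return min(table_rows * percent // 100, cap)


print(best_effort_sample_rows(20_000))     # 2000
print(best_effort_sample_rows(1_000_000))  # 10000
```

A 20,000-row table yields about 2,000 sampled rows, well below the 7,369 rows requested for even five patterns, while any table of 100,000 rows or more reaches the 10,000-row cap.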

Qualifying

During the qualification phase, patterns that do not agree with the data are disqualified. A pattern agrees with the data if its hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is a 90% match for most built-in patterns; however, two built-in patterns have a lower threshold. The 90% threshold also applies to all custom patterns, to ensure that the pattern matches the data within the column and to avoid false positives. If no patterns qualify, no pattern is passed on to scoring and the column is not tagged.
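The qualification step amounts to a filter on hit rates. The following minimal sketch assumes hit counts have already been collected per pattern for one column. The pattern names, the dictionary layout, and the function name are illustrative, and the lower thresholds of the two built-in exceptions are not modeled because this page does not list them. The comparison is inclusive because the example later on this page shows a pattern qualifying at exactly 90%.

```python
DEFAULT_THRESHOLD = 0.90  # documented threshold for custom and most built-in patterns


def qualify(hit_counts: dict[str, int], sampled_values: int,
            threshold: float = DEFAULT_THRESHOLD) -> list[str]:
    """Return the patterns whose hit rate reaches the threshold."""
    if sampled_values == 0:
        return []  # nothing was sampled, so nothing can qualify
    return [name for name, hits in hit_counts.items()
            if hits / sampled_values >= threshold]


# Hypothetical hit counts for one column with 10 sampled values.
print(qualify({"PATTERN_A": 10, "PATTERN_B": 5, "PATTERN_C": 9}, sampled_values=10))
# ['PATTERN_A', 'PATTERN_C']
```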

Scoring

During the scoring phase, a machine inference is carried out among all qualified patterns, combining pattern-derived complexity information with hit rate information to determine which pattern best describes the sample data. This process prefers the more restrictive of two competing patterns since the ability to satisfy the more difficult-to-satisfy pattern itself serves as evidence that it is more likely. This phase ends by returning a single most likely pattern per the inference process.
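One way to see why the more restrictive pattern is preferred: if a column's values really followed a broad pattern, they would almost never happen to satisfy a much narrower one. The sketch below computes that chance under the simplifying assumption that values are drawn uniformly from the strings the broad pattern accepts. It illustrates the reasoning only and is not Immuta's scoring function; the function name and the two pattern sizes are hypothetical.

```python
from math import comb


def chance_of_narrow_hits(narrow_size: int, broad_size: int,
                          samples: int, min_hits: int) -> float:
    """Probability that at least `min_hits` of `samples` values, drawn uniformly
    from the broad pattern's strings, also satisfy the narrow pattern.

    Assumes every string the narrow pattern accepts is also accepted by the broad one.
    """
    p = narrow_size / broad_size
    return sum(comb(samples, k) * p ** k * (1 - p) ** (samples - k)
               for k in range(min_hits, samples + 1))


# Hypothetical sizes: a narrow pattern accepting 27 strings, a broad one accepting 238,328.
print(chance_of_narrow_hits(27, 238_328, samples=10, min_hits=9))
```

The printed probability is vanishingly small, so observing nine narrow matches out of ten is strong evidence that the narrow pattern, not the broad one, describes the column.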

Example

Here is a set of regex patterns and a sample of data:

Patterns:

  1. [a-zA-Z0-9]{3} - This pattern matches three-character strings made up of the letters a-z, lowercase or uppercase, or the digits 0-9.

  2. [a-c]{3} - This pattern matches three-character strings made up of the lowercase letters a-c.

  3. (a|b|d){3} - This pattern matches three-character strings made up of the lowercase letters a, b, or d.

Sample data    Matches Pattern 1    Matches Pattern 2    Matches Pattern 3
dad            Yes                  No                   Yes
baa            Yes                  Yes                  Yes
add            Yes                  No                   Yes
add            Yes                  No                   Yes
cab            Yes                  Yes                  No
bad            Yes                  No                   Yes
aba            Yes                  Yes                  Yes
baa            Yes                  Yes                  Yes
dad            Yes                  No                   Yes
baa            Yes                  Yes                  Yes

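When qualifying the patterns, Pattern 1 (10 of 10 values) and Pattern 3 (9 of 10 values) both match 90% or more of the data. Pattern 2 matches only the 5 values that contain no "d", so it falls short of the threshold and is disqualified.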

Then the qualified patterns are scored. Here, Pattern 1, despite matching 100% of the data, is unspecific: it accepts over 200,000 possible values. Pattern 3, on the other hand, matches exactly 90% of the data but is very specific, accepting only 27 possible values.

Therefore, with the specificity taken into account, Pattern 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
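The example can be reproduced with a short script. This is a simplified stand-in, not Immuta's implementation: it assumes each pattern must match the whole value, counts accepted strings by hand for these three patterns, and replaces the inference step with "pick the qualified pattern that accepts the fewest strings". The actual scoring phase combines complexity with hit rate in a way this page does not specify.

```python
import re

SAMPLE = ["dad", "baa", "add", "add", "cab", "bad", "aba", "baa", "dad", "baa"]

# Pattern text mapped to the number of three-character strings it accepts.
PATTERNS = {
    r"[a-zA-Z0-9]{3}": 62 ** 3,  # 238,328 strings
    r"[a-c]{3}": 3 ** 3,         # 27 strings
    r"(a|b|d){3}": 3 ** 3,       # 27 strings
}


def hit_rate(pattern: str, values: list[str]) -> float:
    """Fraction of values that the pattern matches in full."""
    return sum(re.fullmatch(pattern, v) is not None for v in values) / len(values)


rates = {p: hit_rate(p, SAMPLE) for p in PATTERNS}
qualified = [p for p, r in rates.items() if r >= 0.90]
# Stand-in for scoring: prefer the most restrictive qualified pattern.
winner = min(qualified, key=PATTERNS.get)

print(rates)   # hit rates of 1.0, 0.5, and 0.9 respectively
print(winner)  # (a|b|d){3}
```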

Important notes

  • Dictionaries are considered patterns by Immuta and are part of the competitive process, while column-name regex patterns are not.

  • Scoring ties are rare but can occur if the same pattern is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return a pattern in the case of a tie.

  • Pattern complexity analysis is sensitive to the total number of strings a pattern accepts or, equivalently for dictionaries, the number of entries. Therefore, patterns that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.
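As an illustration of "the same pattern in different forms", the two regexes below are written differently but accept exactly the same strings, so neither specificity nor hit rate can separate them. The patterns and the helper name are hypothetical, and brute-force enumeration over a small alphabet is only practical for tiny examples like this one.

```python
import re
from itertools import product


def accepted_strings(pattern: str, alphabet: str, length: int) -> set[str]:
    """Enumerate every string of `length` over `alphabet` that the pattern matches in full."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    return {s for s in candidates if re.fullmatch(pattern, s)}


form_a = accepted_strings(r"[a-c]{3}", "abcd", 3)
form_b = accepted_strings(r"(a|b|c){3}", "abcd", 3)
print(form_a == form_b, len(form_a))  # True 27
```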

Copyright © 2014-2024 Immuta Inc. All rights reserved.