Of sensitive data discovery's three pattern options, regex and dictionary patterns are competitive. This means that when Immuta assesses your data and multiple patterns could match a column, only one of the competitive patterns is chosen to tag the data. The rest of this page explains how Immuta runs this competition.
Discover employs a three-phase competitive pattern analysis approach for sensitive data discovery (SDD):
1. Sampling: No data is moved; Immuta checks the patterns against a sample of data from the table.
2. Qualifying: Patterns that match less than 90% of the sample are filtered out.
3. Scoring: The remaining patterns are compared with one another to find the most specific pattern that qualifies and matches the sample.
In the end, competitive pattern analysis aims to find a single pattern for each column that best describes the data format.
In the sampling process, no database contents are transmitted to Immuta. Instead, Discover instructs the remote database to measure, over a row sample, the column-wise hit rate of every active pattern, and Immuta receives only that hit rate information. The hit rate is the number of times the pattern matched a value in the column.
The sample size is decided based on the number of patterns and, when available, the data size. In the simplest case, the requested number of sampled rows depends only on the number of regex and dictionary patterns being run in the framework, not on the data size. This dependence is weak, and the requested sample size will not exceed 13,000 rows.
Number of patterns | Sample size |
---|---|
5 | 7369 rows |
50 | 9211 rows |
500 | 11053 rows |
5000 | 12895 rows |
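The four published sample sizes happen to be reproduced by a Hoeffding-style bound with a union bound over the patterns. The sketch below is only a curve fit to those four values: the function name `requested_sample_size` and the parameters `eps` (0.025) and `delta` (0.001) are assumptions chosen to match the table, not documented Immuta settings, and the fit says nothing about pattern counts outside the table.

```python
import math

def requested_sample_size(num_patterns: int, eps: float = 0.025, delta: float = 0.001) -> int:
    """Hypothetical fit: rows needed so that every pattern's hit rate is
    estimated within +/- eps with probability at least 1 - delta."""
    return math.ceil(math.log(2 * num_patterns / delta) / (2 * eps ** 2))

# Sample sizes listed in the documentation table.
documented = {5: 7369, 50: 9211, 500: 11053, 5000: 12895}

for patterns, rows in documented.items():
    # Prints the pattern count, the fitted value, and the documented value.
    print(patterns, requested_sample_size(patterns), rows)
```

Because the pattern count sits inside a logarithm, multiplying the number of patterns by ten adds only about 1,842 rows, which is consistent with the weak dependence described above.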
In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer rows than requested, when many of the column values are null, or because of technology-specific limitations.
- Snowflake and Starburst (Trino): Discover implements native table sampling by row count.
- Databricks and Redshift: Due to technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy comprising a flat 10% row sample capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
- All platforms: Sampling from views can be significantly slower, depending on the performance of the query that defines the view.
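To see why smaller Databricks and Redshift tables are under-sampled, the best-effort strategy can be approximated as follows. This is a rough sketch: `best_effort_sample_rows` and `is_under_sampled` are hypothetical helpers, and the 10% sample is treated as exact rather than random.

```python
def best_effort_sample_rows(table_rows: int, rate_percent: int = 10, cap: int = 10_000) -> int:
    """Approximate rows returned by a flat 10% sample capped at 10,000 rows."""
    return min(table_rows * rate_percent // 100, cap)

def is_under_sampled(table_rows: int, requested_rows: int) -> bool:
    """True when the best-effort sample is smaller than the sample that was requested."""
    return best_effort_sample_rows(table_rows) < min(requested_rows, table_rows)

print(best_effort_sample_rows(50_000))     # 5000
print(best_effort_sample_rows(2_000_000))  # 10000
print(is_under_sampled(50_000, 7_369))     # True
```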
During the qualification phase, patterns that do not agree with the data are disqualified. A pattern agrees with the data if its hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is a 90% match for most built-in patterns; however, two built-in patterns have a lower threshold. The 90% threshold also applies to all custom patterns, to ensure the pattern matches the data within the column and to avoid false positives. If no patterns qualify, no pattern is passed on to scoring and the column is not tagged.
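The qualification rule amounts to a threshold filter over per-pattern hit rates. Here is a minimal sketch, assuming the hit rates have already been measured as fractions of the sample. The `qualify` helper, the pattern names, and the example rates are hypothetical, and `overrides` is only a placeholder for the built-in patterns that use a lower threshold.

```python
DEFAULT_THRESHOLD = 0.90  # standard threshold for custom and most built-in patterns

def qualify(hit_rates, overrides=None):
    """Return the names of patterns whose hit rate meets their threshold.

    hit_rates: mapping of pattern name -> fraction of sampled values matched.
    overrides: optional mapping of pattern name -> lower threshold (hypothetical).
    """
    overrides = overrides or {}
    return [
        name
        for name, rate in hit_rates.items()
        if rate >= overrides.get(name, DEFAULT_THRESHOLD)
    ]

# Made-up hit rates for one column.
column_hit_rates = {"PATTERN_A": 1.00, "PATTERN_B": 0.50, "PATTERN_C": 0.90}
print(qualify(column_hit_rates))  # ['PATTERN_A', 'PATTERN_C']
```

An empty result corresponds to the case where no pattern qualifies and the column is left untagged.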
During the scoring phase, a machine inference is carried out among all qualified patterns, combining pattern-derived complexity information with hit rate information to determine which pattern best describes the sample data. This process prefers the more restrictive of two competing patterns, because data that satisfies the harder-to-satisfy pattern is itself evidence that this pattern is the more likely description. This phase ends by returning the single most likely pattern according to the inference process.
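One way to build intuition for why the more restrictive pattern wins is a toy likelihood comparison. The model below is an illustrative assumption, not Immuta's inference procedure: it supposes that a value drawn from a pattern is equally likely to be any of the strings the pattern accepts, and that a non-matching value is noise occurring with a small probability `noise`. The names `log_score`, `accepted`, and `universe` are hypothetical.

```python
import math

def log_score(hit_rate, sample_size, accepted, universe, noise=0.05):
    """Toy log-likelihood of a sample under one pattern.

    accepted: number of distinct strings the pattern accepts.
    universe: number of distinct strings a noisy value could be.
    """
    hits = round(hit_rate * sample_size)
    misses = sample_size - hits
    per_hit = math.log((1 - noise) / accepted)  # a match: one of `accepted` strings
    per_miss = math.log(noise / universe)       # a miss: an arbitrary noisy string
    return hits * per_hit + misses * per_miss

# A broad pattern with a perfect hit rate vs. a narrow one at 90%.
broad = log_score(hit_rate=1.0, sample_size=10, accepted=62 ** 3, universe=62 ** 3)
narrow = log_score(hit_rate=0.9, sample_size=10, accepted=3 ** 3, universe=62 ** 3)
print(narrow > broad)  # True
```

In this toy model, every matched value is far more probable under the narrow pattern, and that advantage outweighs the penalty for the single miss.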
Here is a set of regex patterns and a sample of data:
Patterns:
1. [a-zA-Z0-9]{3}
   - This pattern matches 3-character strings made up of the letters a-z, lowercase or uppercase, or the digits 0-9.
2. [a-c]{3}
   - This pattern matches 3-character strings made up of the lowercase letters a-c.
3. (a|b|d){3}
   - This pattern matches 3-character strings made up of the lowercase letters a, b, or d.
When qualifying the patterns, Pattern 1 and Pattern 3 both match 90% or more of the data. Pattern 2 does not, and is disqualified.
Then the qualified patterns are scored. Here, Pattern 1, despite matching 100% of the data, is unspecific: it accepts over 200,000 distinct values (62^3 = 238,328). Pattern 3, on the other hand, matches exactly 90% of the data but is very specific, accepting only 27 distinct values (3^3).
Therefore, with the specificity taken into account, Pattern 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
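The worked example can be checked with a short script. This is a simplified sketch rather than Immuta's implementation: it measures hit rates with Python's `re.fullmatch`, applies the 90% threshold, and then stands in for the scoring phase by picking the qualified pattern that accepts the fewest strings. The accepted-string counts are entered by hand, and all variable names are hypothetical.

```python
import re

samples = ["dad", "baa", "add", "add", "cab", "bad", "aba", "baa", "dad", "baa"]

# Pattern -> number of distinct strings it accepts (computed by hand).
patterns = {
    r"[a-zA-Z0-9]{3}": 62 ** 3,  # 238,328
    r"[a-c]{3}": 3 ** 3,         # 27
    r"(a|b|d){3}": 3 ** 3,       # 27
}

def hit_rate(pattern, values):
    """Fraction of values that the pattern matches in full."""
    return sum(re.fullmatch(pattern, v) is not None for v in values) / len(values)

rates = {p: hit_rate(p, samples) for p in patterns}
qualified = [p for p, rate in rates.items() if rate >= 0.90]
winner = min(qualified, key=patterns.get)

print(rates)
print(winner)  # (a|b|d){3}
```

Pattern 2 is as specific as Pattern 3, but it matches only the values without a "d" (half of the sample), so it is disqualified before scoring.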
Dictionaries are considered patterns by Immuta and are part of the competitive process, while column-name regex patterns are not.
Scoring ties are rare but can occur if the same pattern is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return a pattern in the case of a tie.
Pattern complexity analysis is sensitive to the total number of strings a pattern accepts or, equivalently for dictionaries, the number of entries. Therefore, patterns that accept much more than is necessary to describe the intended column data format may perform more poorly in the competitive analysis because they are easier to satisfy.
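For small patterns, the number of accepted strings can be found by brute force, which also shows how two differently written patterns can accept exactly the same strings and therefore tie. This is a minimal sketch using only Python's standard library. `accepted_strings` is a hypothetical helper, and the four-letter alphabet is chosen only to keep the enumeration cheap.

```python
import re
from itertools import product

def accepted_strings(pattern, alphabet="abcd", length=3):
    """Enumerate every fixed-length string over `alphabet` that the pattern fully matches."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    return {s for s in candidates if re.fullmatch(pattern, s)}

class_form = accepted_strings(r"[a-c]{3}")
alternation_form = accepted_strings(r"(a|b|c){3}")

print(len(class_form))                 # 27
print(class_form == alternation_form)  # True: same pattern in two forms

# For a dictionary, the complexity is simply the number of entries.
dictionary = {"red", "green", "blue"}
print(len(dictionary))                 # 3
```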
Sample data for the example patterns:
Sample data | Matches Pattern 1 | Matches Pattern 2 | Matches Pattern 3 |
---|---|---|---|
dad | Yes | No | Yes |
baa | Yes | Yes | Yes |
add | Yes | No | Yes |
add | Yes | No | Yes |
cab | Yes | Yes | No |
bad | Yes | No | Yes |
aba | Yes | Yes | Yes |
baa | Yes | Yes | Yes |
dad | Yes | No | Yes |
baa | Yes | Yes | Yes |