# How Competitive Criteria Analysis Works

Of sensitive data discovery's [three criteria options](/2024.3/discover-your-data/data-discovery.md#identifier), regex and dictionary are competitive. This means that when assessing your data, if multiple identifiers could match, only one with competitive criteria will be chosen to tag the data. To better understand how Immuta executes this competition, read further.

Discover employs a three-phased competitive criteria analysis approach for sensitive data discovery (SDD):

1. [Sampling](#sampling): No data is moved, and Immuta checks the identifiers against a sample of data from the table.
2. [Qualifying](#qualification): Identifiers whose criteria match less than 90% of the sampled values are filtered out.
3. [Scoring](#scoring): The remaining identifiers are compared with one another to find the most specific criteria that qualifies and matches the sample.

In the end, competitive criteria analysis aims to find a single identifier for each column that best describes the data format.

## Sampling

In the sampling process, no database contents are transmitted to Immuta; instead, Immuta receives only the column-wise hit rate (the number of times the criteria has matched a value in the column) information for each active identifier. To do this, Discover instructs a remote database to measure column-wise hit rate information for all active identifiers over a row sample.

The sample size is based on the number of identifiers and, when available, the data size. In the simplest case, the requested number of sampled rows depends only on the number of regex and dictionary criteria being run in the framework, not on the data size. The sample size depends only weakly on the number of identifiers and will not exceed 13,000 rows.

| Number of identifiers | Sample size |
| --------------------- | ----------- |
| 5                     | 7369 rows   |
| 50                    | 9211 rows   |
| 500                   | 11053 rows  |
| 5000                  | 12895 rows  |
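
The four rows above grow by 1,842 rows for every tenfold increase in identifiers, which is consistent with a logarithmic relationship. The sketch below is only a curve fit to this table, not a documented Immuta formula: the function name `fitted_sample_size` and the constants 800 and 2000 are assumptions chosen to reproduce the listed values, and the 13,000-row cap is taken from the text above.

```python
import math

MAX_SAMPLE_ROWS = 13_000  # upper bound stated in the text


def fitted_sample_size(num_identifiers: int) -> int:
    """Curve fit to the table (hypothetical, not Immuta's published formula)."""
    if num_identifiers < 1:
        raise ValueError("num_identifiers must be at least 1")
    # The constants 800 and 2000 are assumptions picked to match the table rows.
    rows = math.ceil(800 * math.log(2000 * num_identifiers))
    return min(rows, MAX_SAMPLE_ROWS)


for n in (5, 50, 500, 5000):
    print(n, fitted_sample_size(n))
```

Values between or beyond the listed rows are extrapolations of the fit and may not reflect what Discover actually requests.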

### Sampling considerations

In practice, the number of sampled values for each column may be less than the requested number of rows. This happens when the target table has fewer rows than requested, when many of the column values are `null`, or because of technology-specific limitations.

* Snowflake and Starburst (Trino): Discover implements native table sampling by row count.
* Databricks and Redshift: Because of technology limitations and the inability to predict the size of the table, Discover implements a best-effort sampling strategy: a flat 10% row sample, capped at the first 10,000 sampled rows. In particular, under-sampling may occur on tables with fewer than 100,000 rows. Moreover, the resulting sample is biased towards earlier records.
* All platforms: Sampling from views can have significantly slower performance that varies by the performance of the query that defines the view.
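
The arithmetic behind the Databricks and Redshift strategy can be sketched as follows. This is an illustration of the 10% sample and 10,000-row cap described above, not Discover's implementation; the function name `expected_best_effort_sample` is hypothetical, and the result is approximate because `null` values reduce the usable sample further.

```python
SAMPLE_FRACTION_PERCENT = 10  # flat 10% row sample, per the text
SAMPLE_ROW_CAP = 10_000       # only the first 10,000 sampled rows are kept


def expected_best_effort_sample(table_rows: int) -> int:
    """Approximate number of rows sampled (illustrative only)."""
    sampled = table_rows * SAMPLE_FRACTION_PERCENT // 100
    return min(sampled, SAMPLE_ROW_CAP)


for rows in (20_000, 100_000, 5_000_000):
    print(rows, expected_best_effort_sample(rows))
```

A 20,000-row table yields roughly 2,000 sampled rows, which shows why tables below 100,000 rows are under-sampled relative to the 10,000-row cap.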

## Qualifying

During the qualification phase, identifiers that do not agree with the data are disqualified. An identifier agrees with the data if its hit rate on the remote sample meets or exceeds the predefined threshold. This threshold is 90% for most built-in identifiers; however, two built-in identifiers have lower thresholds[^1]. The 90% threshold also applies to all custom identifiers, which ensures the criteria matches the data within the column and avoids false positives. If no identifiers qualify, no identifier is scored and the column is not tagged.
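
A minimal sketch of this filter, assuming hit rates are available as fractions between 0 and 1. The helper `qualify`, the dictionary `THRESHOLD_OVERRIDES`, and the sample hit rates are hypothetical; the threshold values come from this page and its footnote. The comparison uses `>=` because the example on this page qualifies an identifier that matches exactly 90% of the sample.

```python
DEFAULT_THRESHOLD = 0.90
# Lower thresholds for two built-in identifiers, per the footnote.
THRESHOLD_OVERRIDES = {"STREET_ADDRESS": 0.80, "PERSON_NAME": 0.45}


def qualify(hit_rates: dict[str, float]) -> list[str]:
    """Return identifiers whose hit rate meets their threshold (hypothetical helper)."""
    return [
        name
        for name, rate in hit_rates.items()
        if rate >= THRESHOLD_OVERRIDES.get(name, DEFAULT_THRESHOLD)
    ]


# Made-up illustration values, not real measurements.
column_hit_rates = {"PERSON_NAME": 0.50, "STREET_ADDRESS": 0.70, "MY_CUSTOM_ID": 0.93}
print(qualify(column_hit_rates))
```

An empty result corresponds to the case where no identifier qualifies and the column is left untagged.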

## Scoring

During the scoring phase, a machine inference is carried out among all qualified identifiers, combining criteria-derived complexity information with hit rate information to determine which identifier best describes the sample data. This process prefers the more restrictive of two competing identifiers, because data that satisfies a harder-to-satisfy identifier is itself evidence that this identifier is the more likely description. The phase ends by returning the single most likely identifier according to the inference process.

## Example

Here is a set of regex identifiers and a sample of data:

**Identifiers**:

1. `[a-zA-Z0-9]{3}` - This regex matches three-character strings made up of the letters a-z (lowercase or uppercase) or the digits 0-9.
2. `[a-c]{3}` - This regex matches three-character strings made up of the lowercase letters a-c.
3. `(a|b|d){3}` - This regex matches three-character strings made up of the lowercase letters a, b, or d.

| Sample data | Matches Identifier 1 | Matches Identifier 2 | Matches Identifier 3 |
| ----------- | -------------------- | -------------------- | -------------------- |
| dad         | Yes                  | :x:                  | Yes                  |
| baa         | Yes                  | Yes                  | Yes                  |
| add         | Yes                  | :x:                  | Yes                  |
| add         | Yes                  | :x:                  | Yes                  |
| cab         | Yes                  | Yes                  | :x:                  |
| bad         | Yes                  | :x:                  | Yes                  |
| aba         | Yes                  | Yes                  | Yes                  |
| baa         | Yes                  | Yes                  | Yes                  |
| dad         | Yes                  | :x:                  | Yes                  |
| baa         | Yes                  | Yes                  | Yes                  |

When **qualifying** the identifiers, Identifier 1 (100%) and Identifier 3 (90%) both match 90% or more of the data. Identifier 2 matches only 50% of the data and is disqualified.

Then the qualified identifiers are **scored**. Identifier 1 matches 100% of the data but is unspecific: it accepts 238,328 (62 × 62 × 62) distinct values. Identifier 3, on the other hand, matches exactly 90% of the data but is very specific, accepting only 27 distinct values.

Therefore, with the specificity taken into account, Identifier 3 would be the match for this column, and its tags would be applied to the data source in Immuta.
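
The example can be reproduced with a short script. This is a simplified sketch, not Immuta's inference process: it assumes each value must match the regex in full, and it stands in for scoring by picking the qualified identifier that accepts the fewest strings. The helper `hit_rate` and the variable names are hypothetical.

```python
import re

# Regexes copied from the example; the integer is the number of distinct
# three-character strings each one accepts (62**3 and 3**3).
IDENTIFIERS = {
    "Identifier 1": (r"[a-zA-Z0-9]{3}", 62 ** 3),
    "Identifier 2": (r"[a-c]{3}", 3 ** 3),
    "Identifier 3": (r"(a|b|d){3}", 3 ** 3),
}
SAMPLE = ["dad", "baa", "add", "add", "cab", "bad", "aba", "baa", "dad", "baa"]
THRESHOLD = 0.90


def hit_rate(pattern: str, values: list[str]) -> float:
    """Fraction of values the regex matches in full (full match is an assumption)."""
    hits = sum(1 for v in values if re.fullmatch(pattern, v))
    return hits / len(values)


rates = {name: hit_rate(rx, SAMPLE) for name, (rx, _) in IDENTIFIERS.items()}
qualified = [name for name, rate in rates.items() if rate >= THRESHOLD]
# Stand-in for scoring: prefer the qualified identifier accepting the fewest strings.
winner = min(qualified, key=lambda name: IDENTIFIERS[name][1])

print(rates)
print(qualified)
print(winner)
```

The real scoring phase combines complexity and hit rate in a single inference, so the "fewest accepted strings" rule here only illustrates why the more specific identifier is preferred.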

## Important notes

* Dictionaries are part of the competitive process, while column name regex is not.
* Scoring ties are rare but can occur if the same criteria (either dictionary or regex) is specified more than once (even in different forms). Scoring ties are inconclusive, and the scoring phase will not return an identifier in the case of a tie.
* Criteria complexity analysis is sensitive to the total number of strings an identifier accepts or, equivalently for dictionaries, the number of entries. Therefore, identifiers that accept much more than is necessary to describe the intended column data format may perform worse in the competitive analysis because they are easier to satisfy.

[^1]: STREET\_ADDRESS has an 80% threshold and PERSON\_NAME has a 45% threshold due to dictionary sizing limitations.

