Masking Policies
Masking policies hide values in data, providing various levels of utility while still preserving privacy. Immuta offers column masking and cell-level masking.
As with all Immuta policy types, use global policies when authoring masking policies to manage policies at scale. When using global policies, tagging your data with metadata becomes critical and is described in detail in the Compliantly open more sensitive data for ML and analytics use case.
The masking options described on this page can be implemented in a variety of use cases, and there are several different approaches for masking data that allow you to make tradeoffs between privacy (how far you go with masking) and utility (how much you want the masked data to be useful to the data consumer). Use the table below to determine the circumstance under which a function should be used.
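❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌
✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌
✅ | n/a | n/a | n/a | n/a | n/a | ❌ | ✅ | ❌
❌ | ❌ | ❌ | ❌ | n/a | ❌ | ❌ | ❌ | ❌
✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅
✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅
✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅
❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌
10/10 | 10/10 | Variable | 6/10 | 4/10 | 2/10 | 5/10 | 8/10 | Variable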
Cell-level masking
Building a cell masking policy is done in the same manner as building a column masking policy. The primary difference is that, when selecting who the policy should apply to, you inject a where clause.
For example, a regular masking policy looks like the following:
Mask columns tagged Discovered.Entity.Social Security Number using hashing for everyone except members of group admins
The cells can be conditionally masked by changing the for to a where:
Mask columns tagged Discovered.Entity.Social Security Number using hashing where country_of_residence = 'US' for everyone except members of group admins
That policy checks the country_of_residence column in the table: if the value is US, the cell is masked; otherwise, the data is presented in the clear as usual.
When referencing columns in custom SQL, it is recommended that you not use the physical column name as shown in the example above. Instead, use the @columnTagged('tag name') function. This allows you to target the policy on any table with a country_of_residence column, no matter how that column is spelled on the physical table. For example, you would change the policy to the following:
Mask columns tagged Discovered.Entity.Social Security Number using hashing where @columnTagged('country') = 'US' for everyone except members of group admins
This example policy targets the column with the tag country in the policy logic dynamically.
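The behavior of the tag-based cell-level policy above can be sketched in Python. This is only an illustration of the logic, not Immuta's implementation: the helper names (`resolve_tagged_column`, `apply_policy`), the physical column name `cntry_of_res`, and the sample rows are all hypothetical.

```python
import hashlib

SSN_TAG = "Discovered.Entity.Social Security Number"

def resolve_tagged_column(column_tags, tag):
    """Return the one physical column carrying `tag` (hypothetical stand-in for @columnTagged)."""
    matches = [col for col, tags in column_tags.items() if tag in tags]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one column tagged {tag!r}, found {len(matches)}")
    return matches[0]

def apply_policy(row, column_tags, user_groups):
    """Hash SSN-tagged cells where the country-tagged column equals 'US', except for admins."""
    if "admins" in user_groups:
        return dict(row)  # members of group admins meet the exception
    country_column = resolve_tagged_column(column_tags, "country")
    masked = dict(row)
    if row[country_column] == "US":
        for column, tags in column_tags.items():
            if SSN_TAG in tags:
                digest = hashlib.sha256(str(row[column]).encode("utf-8"))
                masked[column] = digest.hexdigest()
    return masked

# Hypothetical table whose country column is spelled differently on the physical table.
column_tags = {"ssn": {SSN_TAG}, "cntry_of_res": {"country"}, "name": set()}
us_row = {"ssn": "123-45-6789", "cntry_of_res": "US", "name": "A"}
fr_row = {"ssn": "987-65-4321", "cntry_of_res": "FR", "name": "B"}
```

Because the condition is resolved through the tag rather than the column name, the same policy logic applies to any table that has a column tagged country.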
Complex data types: masking fields within struct columns
Spark supports a class of data types called complex types, which can represent multiple data values in a single column. Immuta supports masking fields within array and struct columns:
Array: an ordered collection of elements
Struct: a collection of elements that are primitive or complex types
Without this feature enabled, the struct and array columns of a data source default to jsonb in the Data Dictionary, and the masking policies that users can apply to jsonb columns are limited. For example, if a user wanted to mask PII inside the column patient in the image below, they would have to apply null masking to the entire column or use a custom function instead of just masking name or address.

After Complex Data Types is enabled on the App settings page, the column type for struct columns for new data sources will display as struct in the data dictionary. (For data sources that are already in Immuta, users can edit the data source and change the column types for the appropriate columns from jsonb to struct.) Once struct fields are available, they can be searched, tagged, and used in masking policies. For example, a user could tag name, ssn, and street as PII instead of the entire patient column.
After a global or local policy masks the columns containing PII, users who do not meet the exception specified in the policy will see these values masked:

Note: Immuta uses the > delimiter to indicate that a field is nested instead of the . delimiter, since field and column names could include a period (.).
Struct columns with many fields
To get column information about a data source, Immuta executes a DESCRIBE call for the table. In this call, Spark returns a simple string representation of the schema for each column in the table. For the patient column above, the simple string would look like this:
struct<name:string,ssn:string,age:int,address:struct<city:string,state:string,zipCode:string,street:string>>
Immuta then parses this string into the following format for the data source's dictionary:
{
  dataType: 'struct',
  children: [
    {
      name: 'name',
      dataType: 'text'
    },
    {
      name: 'ssn',
      dataType: 'text'
    },
    {
      name: 'age',
      dataType: 'integer'
    },
    {
      name: 'address',
      dataType: 'struct',
      children: [
        {
          name: 'city',
          dataType: 'text'
        },
        {
          name: 'state',
          dataType: 'text'
        },
        {
          name: 'zipCode',
          dataType: 'text'
        },
        {
          name: 'street',
          dataType: 'text'
        },
      ]
    }
  ]
}
However, if the struct contains more than 25 fields, Spark truncates the string, causing the parser to fail and fall back to jsonb. Immuta will attempt to avoid this failure by increasing the number of fields allowed in the server-side property setting, maxToStringFields.
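To make the parsing step concrete, here is a minimal Python sketch that turns a simple schema string like the one above into the nested dictionary format. It is not Immuta's parser: the function names are hypothetical, and the type mapping covers only the two conversions visible in the example (string to text, int to integer).

```python
# Assumption: only the type conversions visible in the example are mapped.
TYPE_MAP = {"string": "text", "int": "integer"}

def split_top_level(fields):
    """Split a field list on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(fields):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(fields[start:i])
            start = i + 1
    parts.append(fields[start:])
    return parts

def parse_type(type_str):
    """Parse 'struct<a:string,b:struct<...>>' or a primitive type into a dict."""
    type_str = type_str.strip()
    if type_str.startswith("struct<") and type_str.endswith(">"):
        inner = type_str[len("struct<"):-1]
        children = []
        for field in (split_top_level(inner) if inner else []):
            # The field name ends at the first colon; the rest is its type.
            name, _, field_type = field.partition(":")
            child = {"name": name.strip()}
            child.update(parse_type(field_type))
            children.append(child)
        return {"dataType": "struct", "children": children}
    return {"dataType": TYPE_MAP.get(type_str, type_str)}

schema = ("struct<name:string,ssn:string,age:int,"
          "address:struct<city:string,state:string,zipCode:string,street:string>>")
parsed = parse_type(schema)
```

A parser of this shape depends on the string being complete, which is why a truncated schema string cannot be parsed into struct fields.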
Conditional masking
You can conditionally mask a value based on a value in another column. This allows you to build a policy that looks something like "Mask bank account number where country = 'USA'" instead of stating that the bank account number should always be masked.
Note: When building conditional masking policies with custom SQL statements, avoid using a column that is masked using randomized response in the SQL statement, as this can lead to different behavior depending on your data platform and may produce results that are unexpected.
Constant
Masking with a constant replaces any value in a column with a specified value. For example, you can replace the values in a column with the constant Redacted. The underlying data will appear to be a constant, removing any utility of that data.
Apply this policy to strings that require a specific repeated value.
Custom function
This option uses SQL functions native to the underlying database to transform the values in a column. This can be used in numerous use cases, but notional examples include top-coding to some upper limit, a custom hash function, and string manipulation.
Single quotes enclosing the regex and escaping special characters are required. The following example masks telephone numbers variably depending on the presence of a dash (implying a prefix), space, or only digits:
REGEXP_REPLACE(@column, '(\\+?\\d{0,3}[-\\s]?)?\\d{4}', '****')
The image below illustrates authoring a global policy using this custom function:

Limitations
The masking functions are executed against the remote database directly. A poorly written function could lead to poor quality results, data leaks, and performance hits.
Using custom functions can result in changes to the original data type. In order to prevent query errors you must ensure that you cast this result back to the original type.
The function must be valid for the data type of the selected column. If it is not:
Local policies will error and show a message that the function is not valid.
Global policies will error and change to the default masking type (hashing for text and NULL for all others).
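As an illustration of the top-coding use case and the type limitation above, the following Python sketch caps a value at an upper limit and casts the result back to the original type. In Immuta the equivalent logic would be written with SQL functions native to the underlying database; the function name `top_code` and the limit used here are hypothetical.

```python
def top_code(value, upper_limit):
    """Cap `value` at `upper_limit`, then cast back to the type of `value`."""
    original_type = type(value)
    capped = min(value, upper_limit)
    # Comparing an int against a float limit can yield a float; cast it back
    # so downstream consumers still receive the column's original type.
    return original_type(capped)

salary = top_code(250000, 150000.0)
```

Without the final cast, the capped salary would come back as a float even though the column holds integers, which is the kind of type change that leads to query errors.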
Format preserving masking
Format preserving masking uses a reversible function to mask the data so that the underlying structure of a value is preserved: the length and type of the value are maintained. This is appropriate when the masked value should appear in the same format as the underlying value. For example, masking social security numbers or credit card numbers with format preserving masking returns masked values in a format consistent with social security numbers or credit card numbers, respectively.
This masking type also allows users to submit an unmasking request to users who meet the exceptions of the policy.
There is larger overhead with this masking type, and it should really only be used when format is critically valuable, such as situations when an engineer is building an application where downstream systems validate content. In almost all analytical use cases, format should not matter.
Hashing
Hashing masks the values with an irreversible SHA-256 hash, which is consistent for the same value throughout the data source, so you can count or track the specific values without knowing the true raw value.
This policy type is appropriate for cases where the underlying value is sensitive, but there is a need to segment the population. Such attributes could be addresses, time segments, or countries. It is important to note that hashing is susceptible to inference attacks based on prior knowledge of the population distribution. For example, if state is hashed, and the dataset is a sample across the United States, then an adversary could assume that the most frequently occurring hash value is California. As such, it's most secure to use the hashing mask on attributes that are evenly distributed across a population.
Hashed values are different across data sources, so you cannot join on hashed values unless you enable masked joins on data sources within a project. Immuta prevents joins on hashed values to protect against link attacks where two data owners may have exposed data with the same masked column (a quasi-identifier), but their data combined by that masked value could result in a sensitive data leak.
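The properties described above (consistency within a data source, different values across data sources, and exposure to frequency-based inference) can be sketched in Python. Mixing a per-data-source salt into the hash is an assumption made for this illustration, not a description of how Immuta derives its hashes; `hash_mask` and the source names are hypothetical.

```python
import hashlib
from collections import Counter

def hash_mask(value, data_source_salt):
    """SHA-256 of the value combined with a per-data-source salt (assumed mechanism)."""
    material = f"{data_source_salt}:{value}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()

# A skewed sample: one state occurs more often than the others.
states = ["CA", "CA", "CA", "TX", "TX", "NY"]
masked_a = [hash_mask(s, "source_a") for s in states]
masked_b = [hash_mask(s, "source_b") for s in states]

# Counting still works on masked values, which is what enables segmentation
# and also what enables inference from a known population distribution.
most_common_hash, count = Counter(masked_a).most_common(1)[0]
```

Here the most frequent hash in `masked_a` corresponds to CA, so anyone who knows the population skews toward California can infer that mapping without reversing the hash. The values in `masked_a` and `masked_b` do not match, so the two sources cannot be joined on the masked column.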
New column added policy
This templated policy pairs with schema monitoring to mask newly added columns to data sources until data owners review and approve these changes from the requests tab of their profile page.
When this policy is activated by a governor, it will automatically be enforced on data sources that have the New tag applied to them.
To learn how to activate this policy, navigate to the Clone, activate, or stage a global policy how-to guide.
NULL
This masking type replaces the values in the column with NULL, removing any identifiability from the column and all utility of the data.
Apply this policy to numeric or text attributes that have a high re-identification risk, but little analytic value (names and personal identifiers).
Randomized response
Randomized response masks data by slightly randomizing the values in a column, preserving the utility of the data while preventing outsiders from inferring content of specific records.
This function randomizes the displayed value to make the true value uncertain, but maintains some analytic utility.
For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers to make it difficult to single out individuals. However, consider these survey participants, a cohort of male welders who share the same zip code:
... | ... | ... | ... | ...
880d0096 | 75002 | Male | Welder | Y
f267334b | 75002 | Male | Welder | Y
bfdb43db | 75002 | Male | Welder | Y
260930ce | 75002 | Male | Welder | Y
046dc7fb | 75002 | Male | Welder | Y
... | ... | ... | ... | ...
All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences, and, even though direct identifiers have been removed, outsiders could infer substance abuse for an individual if they knew a male welder in this zip code.
In this scenario, using randomized response would change some of the Y's in substance_abuse to N's and vice versa; consequently, outsiders couldn't be sure of the displayed value of substance_abuse given in any individual row, as they wouldn't know which rows had changed.
The randomization is applied differently to categorical and quantitative values. In both cases, the noise can be increased to enhance privacy or reduced to preserve more analytic value. Immuta requires that you opt in to use this masking policy type.
Categorical randomized response: Categorical values are randomized by replacing a value with some non-zero probability. Not all values are randomized, and the consumer of the data is not told which values are randomized and which ones remain unchanged. Values are replaced by selecting a different value uniformly at random from among all other values. If a randomized response policy were applied to a “state” column, a person’s residency could flip from Maryland to Virginia, which would provide ambiguity to the actual state of residency. This policy is appropriate when obscuring sensitive values such as medical diagnosis or survey responses.
Datetime and numeric randomized response: Numeric and datetime randomized response apply a tunable, unbiased noise to the nominal value. This noise can obscure the underlying value, but the impact of the noise is reduced in aggregate. This masking type can be applied to sensitive numerical attributes, such as salary, age, or treatment dates.
How the randomization works
Immuta applies a random number generator (RNG) that is seeded with some fixed attributes of the data source, column, backing technology, and the value of the high cardinality column, an approach that simulates cached randomness without having to actually cache anything.
For string data, the random number generator essentially flips a biased coin. If the coin comes up as tails, which it does with the frequency of the replacement rate configured in the policy, then the value is changed to any other possible value in the column, selected uniformly at random from among those values. If the coin comes up as heads, the true value is released.
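The seeded, biased coin flip for string data can be sketched as follows. This is a simplified illustration under stated assumptions: the way the seed is derived (a SHA-256 digest of the joined attributes) and the names `seeded_rng` and `randomize_category` are hypothetical, not Immuta's implementation.

```python
import hashlib
import random

def seeded_rng(data_source, column, backing_technology, key_value):
    """Build an RNG whose seed depends only on fixed attributes of the row."""
    material = "|".join([data_source, column, backing_technology, str(key_value)])
    seed = int.from_bytes(hashlib.sha256(material.encode("utf-8")).digest()[:8], "big")
    return random.Random(seed)

def randomize_category(true_value, all_values, replacement_rate, rng):
    """With probability `replacement_rate`, swap in a different value chosen uniformly."""
    others = [v for v in all_values if v != true_value]
    if others and rng.random() < replacement_rate:
        return rng.choice(others)  # "tails": release some other value
    return true_value              # "heads": release the true value

def masked_value(row_key, true_value, all_values, replacement_rate):
    """Hypothetical wrapper: the same row always yields the same displayed value."""
    rng = seeded_rng("health_survey", "substance_abuse", "spark", row_key)
    return randomize_category(true_value, all_values, replacement_rate, rng)
```

Because the seed is recomputed from the same attributes on every query, a given row shows the same displayed value each time, which is the effect of cached randomness without storing anything.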
For numeric data, Immuta uses the RNG to add a random shift from a 0-centered Laplace distribution with the standard deviation specified in the policy configuration. For most purposes, knowing the distribution is not important, but the net effect is that on average the reported values should be the true value plus or minus the specified deviation value.
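A minimal sketch of the numeric case, assuming the policy's standard deviation is converted to a Laplace scale via scale = standard deviation / sqrt(2) (a Laplace distribution with scale b has variance 2b^2). The function names are hypothetical, and the fixed seed exists only to make this sketch reproducible.

```python
import math
import random
import statistics

def laplace_shift(std_dev, rng):
    """Draw from a 0-centered Laplace distribution with the given standard deviation."""
    scale = std_dev / math.sqrt(2)
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def randomize_number(true_value, std_dev, rng):
    """Report the true value plus a random Laplace shift."""
    return true_value + laplace_shift(std_dev, rng)

rng = random.Random(2024)  # fixed seed for this sketch only
shifts = [laplace_shift(5.0, rng) for _ in range(20000)]
mean_shift = statistics.fmean(shifts)
spread = statistics.pstdev(shifts)
```

Over many rows, `mean_shift` stays close to 0 and `spread` close to the configured deviation, which is why aggregates remain useful while individual values are obscured.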
Preserving data utility
Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.
Additionally, randomized response gives deniability of record content, not dataset participation, so individual rows can be displayed.
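To show why aggregate utility survives, the sketch below corrects an observed proportion for the known replacement rate. It assumes the categorical mechanism described earlier (a replaced value is drawn uniformly from the other k - 1 values), so the observed share of a value is t(1 - p) + (1 - t)p/(k - 1) for true share t and replacement rate p. The function name is hypothetical.

```python
def estimate_true_share(observed_share, replacement_rate, num_categories):
    """Invert observed = t*(1 - p) + (1 - t)*p/(k - 1) to recover the true share t."""
    p, k = replacement_rate, num_categories
    inflow = p / (k - 1)  # chance that a row holding another value displays this one
    denominator = 1 - p - inflow
    if abs(denominator) < 1e-12:
        raise ValueError("replacement rate makes the true share unidentifiable")
    return (observed_share - inflow) / denominator

# Example: 30% of rows are truly 'Y', values are Y/N, replacement rate is 20%.
true_share, p = 0.30, 0.20
observed = true_share * (1 - p) + (1 - true_share) * p  # expected displayed share of 'Y'
recovered = estimate_true_share(observed, p, num_categories=2)
```

An analyst who knows the replacement rate can therefore estimate population-level proportions, while still being unable to say whether any single row shows its true value.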
Regular expression (regex)
Deprecation notice
Support for masking with a non-global regex on Redshift data sources has been deprecated. Policy authors must use the global flag (by selecting Global in the regex policy builder) when masking using a regex on Redshift data sources.
See the Deprecations page for EOL dates.
This masking option uses a regular expression to replace all or a portion of a column value.
This policy is similar to replacing with a constant, but it provides more utility because you can retain portions of the true value. Regex replacement also allows some groupings to be maintained while providing greater ambiguity to the disclosed value. This masking technique is useful when the underlying data has some consistent structure, the unmasked underlying data represents some re-identification risk, and a regular expression can be used to mask the underlying data to be less identifiable.
When authoring the policy in Immuta, the regex and the replacement value do not need to be in single or double quotes.
The following regex rule would mask the final digits of an IP address:
Mask using a regex \d+$ the value in the columns ip_address for everyone.
In this case, the regular expression \d+$ breaks down as follows:
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
This ensures we capture the last digit(s) after the last . in the IP address. We can then enter the replacement for what we captured, which in this case is XXX. The outcome of the policy would look like this: 164.16.13.XXX
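The same substitution can be tried outside Immuta with Python's `re` module, which treats `\d+$` the same way for single-line values. The input addresses below are hypothetical examples.

```python
import re

def mask_ip_suffix(ip_address):
    """Replace the trailing run of digits with the literal XXX."""
    return re.sub(r"\d+$", "XXX", ip_address)

masked = mask_ip_suffix("164.16.13.27")
```

Only the final octet is replaced, so addresses in the same subnet still group together after masking.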
This regex rule applies masking to telephone numbers variably depending on the presence of a dash (implying a prefix), space, or only digits:
Mask using a regex (\+?\d{0,3}[-\s]?)?\d{4} the value in the column tagged Discovered...Telephone Number for everyone.
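To see how this pattern behaves on different inputs, here is a small Python sketch using `re.sub` with `****` as the replacement (the replacement value and the sample numbers are assumptions for illustration).

```python
import re

PHONE_PATTERN = re.compile(r"(\+?\d{0,3}[-\s]?)?\d{4}")

def mask_phone(number):
    """Replace each match of the telephone pattern with ****."""
    return PHONE_PATTERN.sub("****", number)

examples = {
    "555-1234": mask_phone("555-1234"),          # prefix and dash are consumed
    "555 1234": mask_phone("555 1234"),          # prefix and space are consumed
    "1234": mask_phone("1234"),                  # digits only
    "202-555-0147": mask_phone("202-555-0147"),  # only the last prefix group matches
}
```

The first three inputs collapse to `****`, while the longer number keeps its leading `202-` because the optional prefix group covers at most three digits and one separator before the final four digits.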
The image below illustrates authoring a regex global policy that will apply to Databricks Unity Catalog data sources:

Reversibility
Deprecation notice
Support for unmask requests has been deprecated.
Support for reversible masking on Redshift data sources has been deprecated.
See the Deprecations page for EOL dates.
This masking option masks the values reversibly and allows users to submit an unmasking request to users who meet the exceptions of the policy. The user receiving the unmasking request must send the unmasked value to the requester.
This policy type is appropriate for cases where the underlying value is sensitive, but there is a need to segment the population. Such attributes could be addresses, time segments, or countries. It is important to note that reversible hashing is susceptible to inference attacks based on prior knowledge of the population distribution. For example, if state is hashed, and the dataset is a sample across the United States, then an adversary could assume that the most frequently occurring hash value is California. As such, it's most secure to use the reversible hashing mask on attributes that are evenly distributed across a population.
Hashed values are different across data sources, so you cannot join on hashed values unless you enable masked joins on data sources within a project. Immuta prevents joins on hashed values to protect against link attacks where two data owners may have exposed data with the same masked column (a quasi-identifier), but their data combined by that masked value could result in a sensitive data leak.
Apply this policy when there is a need to obscure a value while allowing an authorized user to recover the underlying value. Reversibly masked fields can leak the length of their contents, so it is important to consider whether or not this may be an attack vector for applications involving its use.
Rounding
Rounding masking policies reduce, round, or truncate numeric or datetime values to a fixed precision.
This technique hides precision from numeric values while providing more utility than simply hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.
Datetime rounding: This policy truncates the precision of a datetime value to a user-defined precision. minute, hour, day, month, and year are the supported precisions.
Numeric rounding: This policy maps the nominal value to the ceiling of some specified bandwidth. Immuta has a recommended bandwidth based on the Freedman-Diaconis rule.
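Both rounding behaviors can be sketched in Python. The function names are hypothetical, and the bandwidth helper applies the textbook Freedman-Diaconis formula (2 x IQR / n^(1/3)); how Immuta computes its recommended bandwidth in detail is not described here, so treat that part as an assumption.

```python
import math
import statistics
from datetime import datetime

def truncate_datetime(value, precision):
    """Drop all components finer than `precision`."""
    if precision == "minute":
        return value.replace(second=0, microsecond=0)
    if precision == "hour":
        return value.replace(minute=0, second=0, microsecond=0)
    if precision == "day":
        return value.replace(hour=0, minute=0, second=0, microsecond=0)
    if precision == "month":
        return value.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if precision == "year":
        return value.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported precision: {precision}")

def round_up_to_bandwidth(value, bandwidth):
    """Map a numeric value to the ceiling of its bandwidth-sized bucket."""
    return math.ceil(value / bandwidth) * bandwidth

def freedman_diaconis_bandwidth(values):
    """Textbook Freedman-Diaconis width: 2 * IQR / cube root of n."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

stamp = datetime(2024, 3, 15, 13, 47, 59)
bandwidth = freedman_diaconis_bandwidth([1, 2, 3, 4, 5, 6, 7, 8])
```

For example, `truncate_datetime(stamp, "hour")` keeps the date and hour but zeroes the minutes and seconds, and `round_up_to_bandwidth(37, 10)` places the value in the bucket ending at 40.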
Masking policy support by integration
See the data policy support matrix for an outline of masking policies supported by each integration.
Mixing masking policies on the same column
In some cases, you may want several different masking policies applied to the same column through Otherwise policies. To build these policies, select everyone who instead of everyone or everyone except. After you specify who the masking policy applies to, select how it applies to everyone else in the Otherwise condition.
You can add and remove tags in Otherwise conditions for global policies (unlike local policy Otherwise conditions), as illustrated above; however, all tags or regular expressions included in the initial everyone who rule must be included in an everyone or everyone except rule in the additional clauses.

