Skip to content

Column Masking

Column masking policies allow you to hide the data in a column. However, there are several different approaches for masking data that allow you to make tradeoffs between privacy (how far you go with masking) vs utility (how much you want the masked data to be useful to the data consumer).

As with all Immuta policy types, it is recommended that you use global policies when authoring masking policies to manage policies at scale. When using global policies, tagging your data with metadata becomes critical and is described in detail in the Compliantly open more sensitive data for ML and analytics use case.

Types

  • Categorical Randomized Response: Categorical values are randomized by replacing a value with some non-zero probability. Not all values are randomized, and the consumer of the data is not told which values are randomized and which ones remain unchanged. Values are replaced by selecting a different value uniformly at random from among all other values. If a randomized response policy were applied to a “state” column, a person’s residency could flip from Maryland to Virginia, which would provide ambiguity to the actual state of residency. This policy is appropriate when obscuring sensitive values such as medical diagnosis or survey responses.

  • Custom Function: This function uses SQL functions native to the underlying database to transform the values in a column. This can be used in numerous use cases, but notional examples include top-coding to some upper limit, a custom hash function, and string manipulation.

  • K-Anonymization: Masking through k-anonymization is a distinct policy that can operate over multiple attributes. A k-anonymization policy applies rounding and NULL masking policies over multiple columns so that the columns contain at least “K” records, where K is a positive integer. As a result, attributes will only be disclosed when there is a sufficient number of observations. This policy is appropriate to apply over indirect identifiers, such as zip code, gender, or age. Generally, each of these identifiers is not uniquely linked to an individual, but when combined with other identifiers can be associated with a single person. Applying k-anonymization to these attributes provides the anonymity of crowds so that individual rows are made indistinct from each other, reducing the re-identification risk by making it unclear which record corresponds to a specific person. Immuta requires that you opt in to use this masking policy type. To enable k-anonymization for your account, contact your Immuta representative.

    Immuta supports k-anonymization of text, numeric, and time-based data types.

  • Mask with Format Preserving Masking: This function masks using a reversible function but does so in a way that the underlying structure of a value is preserved. This means the length and type of a value are maintained. This is appropriate when the masked value should appear in the same format as the underlying value. Examples of this would include social security numbers and credit card numbers where Mask with Format Preserving Masking would return masked values in a format consistent with credit cards or social security numbers, respectively. There is larger overhead with this masking type, and it should really only be used when format is critically valuable, such as situations when an engineer is building an application where downstream systems validate content. In almost all analytical use cases, format should not matter.

  • Mask with Reversibility: This function masks in a way that an authorized user can “unmask” a value and reveal the value to an authorized user. Masking with Reversibility is appropriate when there is a need to obscure a value while allowing an authorized user to recover the underlying value. All of the same use cases and caveats that apply to Replace with Hashing apply to this function. Reversibly masked fields can leak the length of their contents, so it is important to consider whether or not this may be an attack vector for applications involving its use.

  • Randomized Response: This function randomizes the displayed value to make the true value uncertain, but maintains some analytic utility. The randomization is applied differently to both categorical and quantitative values. In both cases, the noise can be increased to enhance privacy or reduced to preserve more analytic value.

  • Datetime and Numeric Randomized Response: Numeric and datetime randomized response apply a tunable, unbiased noise to the nominal value. This noise can obscure the underlying value, but the impact of the noise is reduced in aggregate. This masking type can be applied to sensitive numerical attributes, such as salary, age, or treatment dates.

  • Replace with Constant: This function replaces any value in a column with a specified value. The underlying data will appear to be a constant. This masking carries the same privacy and utility guarantees as Replace with NULL. Apply this policy to strings that require a specific repeated value.

  • Replace with Hashing: This function masks the values with an irreversible hash, which is consistent for the same value throughout the data source, so you can count or track the specific values, but not know the true raw value. This is appropriate for cases where the underlying value is sensitive, but there is a need to segment the population. Such attributes could be addresses, time segments, or countries. It is important to note that hashing is susceptible to inference attacks based on prior knowledge of the population distribution. For example, if “state” is hashed, and the dataset is a sample across the United States, then an adversary could assume that the most frequently occurring hash value is California. As such, it's most secure to use the hashing mask on attributes that are evenly distributed across a population.

  • Replace with Null: This function replaces any value in a column with NULL. This removes any identifiability from the column and removes all utility of the data. Apply this policy to numeric or text attributes that have a high re-identification risk, but little analytic value (names and personal identifiers).

  • Replace with REGEX: This function uses a regular expression to replace all or a portion of an attribute. REGEX replacement allows for some groupings to be maintained, while providing greater ambiguity to the disclosed value. This masking technique is useful when the underlying data has some consistent structure, the remasked underlying data represents some re-identification risk, and a regular expression can be used to mask the underlying data to be less identifiable.

  • Rounding: Immuta’s rounding policy reduces, rounds, or truncates numeric or datetime values to a fixed precision. This policy is appropriate when it is important to maintain analytic value of a quantity, but not at its native precision.

    • Date/Time Rounding: This policy truncates the precision of a datetime value to a user-defined precision. minute, hour, day, months, and year are the supported precisions.

    • Numeric Rounding: This policy maps the nominal value to the ceiling of some specified bandwidth. Immuta has a recommended bandwidth based on the Freedman-Diaconis rule.

Masking circumstances

The masking functions described above can be implemented in a variety of use cases. Use the table below to determine the circumstance under which a function should be used.

Masking Matrix Functions

Circumstance descriptions

  • Applicable to Numeric Data: The masking function can be applied to numeric values.

  • Column-Value Determinism: Repeated values in the same column are masked with the same output.

  • Introduces NULLs: The masking function may, under normal or irregular circumstances, return NULL values.

  • Performance: How performant the masking function will be (10/10 being the best).

  • Preserves Appearance: The output masked value resembles the valid column values. For example, a masking function would output phone numbers when given phone numbers. Here, NULL values are not counted against this property.

  • Preserves Averages: The average of the masked values (avg(mask(v))) will be near the average of the values in the clear (avg(v)).

  • Suitable for De-Identification: The masking function can be used to obscure record identifiers, hiding data subject identities and preventing future linking against other identified data.

  • Provides Deniability of Record Content: A (possibly identified) person can plausibly attribute the appearance of the value to the masking function. This is a desirable property of masking functions that retain analytic utility, as such functions must necessarily leak information about the original value. Fields masked with these functions provide strong protections against value inference attacks.

  • Preserves Equality and Grouping: Each value will be masked to the same value consistently without colliding with others. Therefore, equal values remain equal under masking while unequal values remain unequal, preserving equality. This implies that counting statistics are also preserved.

  • Preserves Message Length: The length of the masked value is equal to the length of the original value.

  • Preserves Range Statistics: The number of data values falling in a particular range is preserved. For strings, this can be interpreted as the number of strings falling between any two values by alphabetical order.

  • Preserves Value Locality: The output will remain near the input, which may be important for analytic purposes.

  • Reversible: Qualified individuals can reveal the original input value.

Masking policy support by integration

Masking Policy Support by Integration

Since Global Policies can apply masking policies across multiple different databases at once, if an unsupported masking policy is applied to a column, Immuta will revert to NULLing that column.

See the integration support matrix for an outline of masking policies supported by each integration.