Data Policies

Data policies determine what users see when they query data in a table they have access to.

Overview

This guide provides a general overview of data policies and their behavior.

How-to guides

Author a masking data policy
Author a minimization policy
Author a purpose-based restriction policy
Author a restricted data policy
Author a row-level policy
Author a time-based restriction policy
Certification, exemptions, and diffs: Certify policies, exempt users from policies, and view policy diffs on a data source.
External masking interface: Set up Immuta to use the encryption or masking algorithm in your external masking service.

Reference guides

All data policy types: This guide describes all the data policies available in Immuta.
Masking policies: This guide describes the types of masking policies available and when to use each.
Row-level policies: Row-level policies compare data values with user metadata at query-time to determine whether or not the querying user should have access to the individual rows of data.
Custom WHERE clause functions: This guide describes the custom functions you can use to extend the PostgreSQL WHERE syntax.
Data policy conflicts and fallback: In some cases, two conflicting global masking policies apply to a single data source. This guide describes how Immuta handles those conflicts.
Custom data policy certifications: When building a global data policy, governors can create custom certifications that must be acknowledged by data owners when the policy is applied to data sources.
Orchestrated masking policies: These policies reduce conflicts between masking policies that apply to a single column, allowing policies to scale more effectively across your organization.

Overview

Data policies manage what users see when they query data in a table they have access to.

There are three different ways to restrict data with data policies:

Row-level policies: Filter rows from certain users at query time.
Masking policies: Mask values in a column at query time.
Cell-level masking: Mask specific cells in a column based on separate values in the same row at query time.

Once applied, a data policy is always enforced for all users, following the principle of least privilege, unless optional exceptions are added to the policy. Data policy exceptions are built using any of the following conditions, which can be combined with boolean logic:

If the user is a member of a group (or several groups)
If the user possesses a particular attribute (or several attributes)
If the user is acting under a purpose (or several purposes) for which the data is allowed to be used
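
As a rough illustration of how these conditions combine, the sketch below evaluates group, attribute, and purpose conditions joined with boolean logic. The `User` class, the helper functions, and the sample condition are all hypothetical; they are not part of Immuta's API.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user model; not an Immuta API object."""
    groups: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)  # attribute name -> set of values
    purposes: set = field(default_factory=set)      # purposes the user is acting under


def in_group(user, group):
    return group in user.groups


def has_attribute(user, name, value):
    return value in user.attributes.get(name, set())


def acting_under(user, purpose):
    return purpose in user.purposes


def is_exempt(user):
    # Sample exception: member of AUDIT, OR
    # (Department attribute is Fraud AND acting under the purpose Investigation).
    return in_group(user, "AUDIT") or (
        has_attribute(user, "Department", "Fraud")
        and acting_under(user, "Investigation")
    )
```

A user who satisfies neither branch of the expression stays subject to the policy, which matches the default-enforced behavior described above.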

Data policy exceptions are very similar to subscription policies from this perspective. With subscription policies, nobody has access to a newly created table until someone says otherwise with a subscription policy (as long as you follow the recommended practice for newly created tables and views). Similarly, when a masking policy is set on a column or a row-level policy on a table, it applies to everyone until someone says otherwise with an exception to the data policy.

Leveraging lookup tables

If user metadata is stored in a table in the same data platform where a policy is enforced, it is not necessary to move that user metadata into Immuta. Instead, it can be referenced directly using functions in data policies.

Below is an example row-level policy that leverages a lookup table to dynamically drive access to rows in a table:

CREDIT_CARD_NUMBER | TRANSACTION_LOCATION | TRANSACTION_TIME | ACCESS_LEVEL

The final column in the table, ACCESS_LEVEL, defines who can see that row of data.

Now consider the following hierarchy:

In this diagram, there are 11 different access levels (AL) to data and the tree defines access. For example, if a user has Vegetables, they get access levels 2, 3, 4, 9, 10, and 11. If a user has Pear, they only get access level 8. In other words, a user with Vegetables would see the first row of the above table, a user with Pear would see the second row of the above table, and a user with Food would see both rows of the table.

Taking the example further, that hierarchy tree is represented as a table in the data platform that we wish to use to drive the row-level policy:

That hierarchy lookup table can be referenced in the row-level policy as user metadata like this:

@columnTagged('access_level') IN (SELECT ACCESS_LEVEL from [lookup table] where @attributeValuesContains('user_level', 'ROOT'))

Walking through the policy step-by-step:

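@columnTagged('access_level'): This allows us to target multiple tables that have an ACCESS_LEVEL column needing protection with a single policy. Simply tag all the ACCESS_LEVEL columns with the access_level tag and this policy will apply to all of them.
IN (SELECT ACCESS_LEVEL from [lookup table]: This selects the matching ACCESS_LEVEL values from the lookup table to use as the IN clause for filtering the actual business table.
where @attributeValuesContains('user_level', 'ROOT'): This restricts the lookup to rows whose ROOT column value matches one of the querying user's values for the user_level attribute, so only the access levels granted to that user are returned.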

You can then add metadata to your users in Immuta, such as Vegetables or Pear, and they will see the appropriate rows in the business table in question.
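
To make the filtering logic concrete, here is a small in-memory sketch of the same idea. The lookup rows for Vegetables and Pear follow the access levels listed above. The Food rows (levels 1 through 11), the ROOT column layout, and the two transaction rows are assumptions made for illustration only.

```python
# Hypothetical lookup table: one (ROOT, ACCESS_LEVEL) row per level a node grants.
LOOKUP = (
    [("Vegetables", level) for level in (2, 3, 4, 9, 10, 11)]
    + [("Pear", 8)]
    + [("Food", level) for level in range(1, 12)]  # assumed: the root grants every level
)

# Hypothetical business rows; only ACCESS_LEVEL matters for the filter.
TRANSACTIONS = [
    {"CREDIT_CARD_NUMBER": "4111-0000-0000-0001", "ACCESS_LEVEL": 9},
    {"CREDIT_CARD_NUMBER": "4111-0000-0000-0002", "ACCESS_LEVEL": 8},
]


def allowed_levels(user_levels):
    """Mimics the subquery: access levels whose ROOT matches a user_level value."""
    return {level for root, level in LOOKUP if root in user_levels}


def visible_rows(user_levels, rows=TRANSACTIONS):
    """Mimics the outer filter: ACCESS_LEVEL IN (subquery)."""
    levels = allowed_levels(user_levels)
    return [row for row in rows if row["ACCESS_LEVEL"] in levels]
```

With this sample data, `visible_rows({"Vegetables"})` keeps only the level 9 row, `visible_rows({"Pear"})` keeps only the level 8 row, and `visible_rows({"Food"})` keeps both.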

The above example used a row-level policy, but it could instead do cell masking using the same technique:

Mask columns tagged Credit Card Number using hashing where @columnTagged('access_level') NOT IN (SELECT ACCESS_LEVEL from [lookup table] where @attributeValuesContains('user_level', 'ROOT'))

In this case, the credit card number will be masked if the access_level is not found for the user for that row.
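
The cell-masking variant can be sketched the same way. The function below is an illustration only: it hashes the credit card number with plain SHA-256 whenever the row's access level is not among the levels granted to the user. The row layout and function name are hypothetical.

```python
import hashlib


def mask_cells(rows, allowed_levels):
    """Hash CREDIT_CARD_NUMBER in rows whose ACCESS_LEVEL is NOT in allowed_levels."""
    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if row["ACCESS_LEVEL"] not in allowed_levels:
            raw = row["CREDIT_CARD_NUMBER"].encode("utf-8")
            row["CREDIT_CARD_NUMBER"] = hashlib.sha256(raw).hexdigest()
        result.append(row)
    return result
```

Unlike the row-level version, every row is returned; only the protected cell changes.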

Even if not using a lookup table, the power of the @columnTagged('tag name') function is apparent for applying your masking or row-level policies at scale.

How-to Guides

Author a Masking Data Policy

Best practice: write global policies

Build global policies with tags instead of writing local policies to manage data access. This practice will prevent you from having to write or rewrite single policies for every data source added to Immuta.

Determine your policy scope:
- Global policy: Click the Policies page icon in the left sidebar and select the Data Policies tab. Click Add Policy and enter a name for your policy.
- Local policy: Navigate to a specific data source and click the Policies tab. Scroll to the Data Policies section and click Add Policy.
Select Mask from the first dropdown menu.
Select columns tagged, columns with any tag, columns with no tags, all columns, or columns with names spelled like.
Select a masking type:
- using hashing
- with reversibility
- by making null
- using a constant: Enter a constant in the field that appears next to the masking type dropdown.
- using a regex:
  1. Enter a regular expression and replacement value in the fields that appear next to the masking type dropdown.
  2. From the next dropdown, choose to make the regex Case Insensitive and/or Global.
- by rounding: Select the Bucket Type and then enter the bucket size.
- with format preserving masking
- with K-Anonymization: Select either using fingerprint or requiring group size of at least and enter a group size in the subsequent dropdown menu.
- using randomized response
- using the custom function: Enter the custom function native to the underlying database.
  Note: The function must be valid for the data type of the column. If it is not, the default masking type will be applied to the column.
Select everyone except, everyone, or everyone who to continue the condition.
- everyone except: In the subsequent dropdown menus, choose is a member of group, possesses attribute, or is acting under purpose. Complete the condition with the subsequent dropdown menus.
- for everyone who: Complete the Otherwise clause. You can add more than one condition by selecting + Add Another Condition. The dropdown menu in the policy builder contains conjunctions for your policy. If you select or, only one of your conditions must apply to a user for them to see the data. If you select and, all of the conditions must apply.
Opt to complete the Enter Rationale for Policy (Optional) field, and then click Add.
For global policies: Click the dropdown menu beneath Where should this policy be applied and select When selected by data owners, On all data sources, or On data sources. If you selected On data sources, finish the condition in one of the following ways:
- tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with columns tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with column names spelled like: Select this option, and then enter a regex and choose a modifier in the subsequent fields.
- in server: Select this option and then choose a server from the subsequent dropdown menu to apply the policy to data sources that share this connection string.
- created between: Select this option and then choose a start date and an end date in the subsequent dropdown menus.
Click Create Policy. If creating a global policy, you then need to click Activate Policy or Stage Policy.

Create a custom certification for a global policy

This step is optional, but data governors can add certifications that outline acknowledgements or require approvals from data owners. For example, data governors could add a custom certification that states that data owners must verify that tags have been added correctly to their data sources before certifying the policy.

Click Add Certification in the data policy builder.
Enter a Certification Label and Certification Text in the corresponding fields of the dialog that appears.
Click Save.

Author a Minimization Policy

Determine your policy scope:
- Global policy: Click the Policies page icon in the left sidebar and select the Data Policies tab. Click Add Policy and enter a name for your policy.
- Local policy: Navigate to a specific data source and click the Policies tab. Scroll to the Data Policies section and click Add Policy.
Select Minimize data source from the first dropdown.
Complete the enter percentage field to limit the amount of data returned at query-time.
Select for everyone except from the next dropdown menu to continue the condition. Additional options include for everyone and for everyone who.
Use the next field to choose the attribute, group, or purpose that you will match values against.
Notes:
- If you choose for everyone who as a condition, complete the Otherwise clause before continuing to the next step.
- You can add more than one condition by selecting + Add Another Condition. The dropdown menu in the far right of the Policy Builder contains conjunctions for your policy. If you select or, only one of your conditions must apply to a user for them to see the data. If you select and, all of the conditions must apply.
Opt to complete the Enter Rationale for Policy (Optional) field, and then click Add.
For global policies: Click the dropdown menu beneath Where should this policy be applied, and select On all data sources, On data sources, or When selected by data owners. If you select On data sources, finish the condition in one of the following ways:
- tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with columns tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with column names spelled like: Select this option, and then enter a regex and choose a modifier in the subsequent fields.
- in server: Select this option and then choose a server from the subsequent dropdown menu to apply the policy to data sources that share this connection string.
- created between: Select this option and then choose a start date and an end date in the subsequent dropdown menus.
Click Create Policy. If creating a global policy, you then need to click Activate Policy or Stage Policy.

Author a Purpose-Based Restriction Policy

Requirement and prerequisite:

CREATE_DATA_SOURCE or GOVERNANCE Immuta permission
A

Build the policy

Determine your policy scope:
- Global policy: Click the Policies page icon in the left sidebar and select the Data Policies tab. Click Add Policy and enter a name for your policy.
- Local policy: Navigate to a specific data source and click the Policies tab. Scroll to the Data Policies section and click Add Policy.
Select Limit usage to purpose(s) in the first dropdown menu.
In the next field, select a specific purpose that you would like to restrict usage of this data source to or ANY PURPOSE. You can add more than one condition by selecting + Add Another Condition. The dropdown menu in the policy builder contains conjunctions for your policy. If you select or, only one of your conditions must apply to a user for them to see the data. If you select and, all of the conditions must apply.
Select for everyone or for everyone except. If you select for everyone except, you must select conditions that will drive the policy such as group, purpose, or attribute.
Opt to complete the Enter Rationale for Policy (Optional) field, and then click Add.
For global policies: Click the dropdown menu beneath Where should this policy be applied, and select On all data sources, On data sources, or When selected by data owners. If you select On data sources, finish the condition in one of the following ways:
- tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with columns tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with column names spelled like: Select this option, and then enter a regex and choose a modifier in the subsequent fields.
- in server: Select this option and then choose a server from the subsequent dropdown menu to apply the policy to data sources that share this connection string.
- created between: Select this option and then choose a start date and an end date in the subsequent dropdown menus.
Click Create Policy. If creating a global policy, you then need to click Activate Policy or Stage Policy.

Author a Restricted Data Policy

Data owners who are not governors can write restricted subscription and data policies, which allow them to enforce policies on multiple data sources simultaneously, eliminating the need to write redundant local policies.

Unlike global policies, the application of these policies is restricted to the data sources owned by the users or groups specified in the policy and will change as users' ownerships change.

Click Policies in the left sidebar and select Data Policies.
Click Add Policy and complete the Enter Name field.
Select how the policy should protect the data. Click a link below for instructions on building that specific data policy:
Opt to complete the Enter Rationale for Policy (Optional) field, and then click Add.
From the Where should this policy be applied dropdown menu, select When selected by data owners, On all data sources, or On data sources. If you selected On data sources, finish the condition in one of the following ways:
- tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with columns tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with column names spelled like: Select this option, and then enter a regex and choose a modifier in the subsequent fields.
- in server: Select this option and then choose a server from the subsequent dropdown menu to apply the policy to data sources that share this connection string.
- created between: Select this option and then choose a start date and an end date in the subsequent dropdown menus.
Beneath Whose Data Sources should this policy be restricted to, add users or groups to the policy restriction by typing in the text fields and selecting from the dropdown menus that appear.
Click Create Policy, and then click Activate Policy or Stage Policy.

Author a Row-Level Policy

Determine your policy scope:
- Global policy: Click the Policies page icon in the left sidebar and select the Data Policies tab. Click Add Policy and enter a name for your policy.
- Local policy: Navigate to a specific data source and click the Policies tab. Scroll to the Data Policies section and click Add Policy.
Select the Only show rows action from the first dropdown.
Choose one of the following policy conditions:
- Where user
  1. Choose the condition that will drive the policy from the next dropdown: is a member of a group or possesses an attribute.
  2. Use the next field to choose the attribute, group, or purpose that you will match values against.
  3. Use the next dropdown menu to choose the tag that will drive this policy. You can add more than one condition by selecting + Add Another Condition. The dropdown menu in the far right of the policy builder contains conjunctions for your policy. If you select or, only one of your conditions must apply to a user for them to see the data. If you select and, all of the conditions must apply.
- Where the value in the column tagged
  1. Select the tag from the next dropdown menu.
  2. From the subsequent dropdown, choose is or is not in the list, and then enter a list of comma-separated values.
- Where
  1. Enter a valid SQL WHERE clause in the subsequent field. When you place your cursor in this field, a tooltip details valid input and the column names of your data source. See the Custom WHERE clause functions guide for more information about specific functions.
- Never
  The never condition blocks all access to the data source.
  1. Choose the condition that will drive the policy from the next dropdown: for everyone, for everyone except, or for everyone who.
  2. Select the condition that will further define the policy: is a member of group, is acting under a purpose, or possesses attribute.
  3. Use the next field to choose the group, purpose, or attribute that you will match values against.
Choose for everyone, for everyone except, or for everyone who to drive the policy. If you choose for everyone except, use the subsequent dropdown to choose the group, purpose, or attribute for your condition. If you choose for everyone who as a condition, complete the Otherwise clause before continuing to the next step.
Opt to complete the Enter Rationale for Policy (Optional) field, and then click Add.
For global policies: Click the dropdown menu beneath Where should this policy be applied, and select On all data sources, On data sources, or When selected by data owners. If you select On data sources, finish the condition in one of the following ways:
- tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with columns tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with column names spelled like: Select this option, and then enter a regex and choose a modifier in the subsequent fields.
- in server: Select this option and then choose a server from the subsequent dropdown menu to apply the policy to data sources that share this connection string.
- created between: Select this option and then choose a start date and an end date in the subsequent dropdown menus.
Click Create Policy. If creating a global policy, you then need to click Activate Policy or Stage Policy.

Author a Time-Based Restriction Policy

Determine your policy scope:
- Global policy: Click the Policies page icon in the left sidebar and select the Data Policies tab. Click Add Policy and enter a name for your policy.
- Local policy: Navigate to a specific data source and click the Policies tab. Scroll to the Data Policies section and click Add Policy.
Select Only show data by time from the first dropdown.
Select where data is more recent than or older than from the next dropdown, and then enter the number of minutes, hours, days, or years that you would like to restrict the data source to. Note that unlike many other policies, there is no field to select a column to drive the policy. This type of policy will be driven by the data source's event-time column, which is selected at data source creation.
Choose for everyone, for everyone except, or for everyone who to drive the policy. If you choose for everyone except, use the subsequent dropdown to choose the group, purpose, or attribute for your condition. If you choose for everyone who as a condition, complete the Otherwise clause before continuing to the next step.
Opt to complete the Enter Rationale for Policy (Optional) field, and then click Add.
For global policies: Click the dropdown menu beneath Where should this policy be applied, and select On all data sources, On data sources, or When selected by data owners. If you select On data sources, finish the condition in one of the following ways:
- tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with columns tagged: Select this option and then search for tags in the subsequent dropdown menu.
- with column names spelled like: Select this option, and then enter a regex and choose a modifier in the subsequent fields.
- in server: Select this option and then choose a server from the subsequent dropdown menu to apply the policy to data sources that share this connection string.
- created between: Select this option and then choose a start date and an end date in the subsequent dropdown menus.
Click Create Policy. If creating a global policy, you then need to click Activate Policy or Stage Policy.
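
The effect of a time-based restriction can be pictured as a filter on the event-time column. The sketch below is a simplified model, not Immuta's implementation; the function name, the `mode` strings, and the row layout are assumptions. Note that `timedelta` has no unit for years, so a year-based window would need to be approximated in days.

```python
from datetime import datetime, timedelta, timezone


def filter_by_time(rows, event_time_column, window, mode="more recent than", now=None):
    """Keep rows whose event time is newer (or older) than now minus the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    if mode == "more recent than":
        return [row for row in rows if row[event_time_column] > cutoff]
    if mode == "older than":
        return [row for row in rows if row[event_time_column] < cutoff]
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `filter_by_time(rows, "event_time", timedelta(days=7))` keeps only rows from the last seven days.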

Certification, Exemptions, and Diffs

Required permissions

To manage and apply existing policies to data sources, a user must either have the CREATE_DATA_SOURCE Immuta permission or be manually assigned the owner role on a data source.

Certify global policies

After a policy with a certification requirement is applied to a data source, data owners will receive a notification indicating that they need to certify the policy.

Navigate to the Policies tab of the affected data source, and review the policy in the Data Policies section.
Click Certify Policy.
In the Policy Certification modal, click Sign and Certify.

Add policy exemptions

Policy exemptions are disabled in Immuta by default. Once the setting is enabled on the app settings page, data owners can exempt users from policies on a per-data-source basis to allow those users to see all the data, regardless of the global or local policies applied.

Select a data source and click the Policies tab.
In the Data Policies menu, click Add Exemptions. This button will only be visible if policy exemptions have been enabled.
Enter the names of the users or groups to exempt from your policies.
Click Create to finish your exemption policy.
Click Save All to apply the policy to your data source.

View policy diffs

Once you have a data policy in effect, you can view the changes in your policies by clicking the Policy Diff button in the data policies section on a data source's policies tab.

The Policy Diff button displays previous policies and the current policy applied to the data source.

External Masking Interface

Deprecation notice: Support for this feature has been deprecated.

Use Deterministic IVs/Salt

Use deterministic IVs/salt to ensure the same value is masked consistently throughout the data, as Immuta always pushes down the masked version of the literal when the querying user is exempt from the policy.

Authentication

Immuta can make requests to your External Masking service with one of two authentication methods:

Username and password authentication: Immuta can send requests with a username and a password in the Authorization HTTP header. In this case, your service will need to be able to parse a Basic Authorization header and validate the credentials sent with it.
PKI Certificate: Immuta can send requests using a CA certificate, a certificate, and a key.

Alternatively, Immuta can make unauthenticated requests to your REST masking service. This is recommended only if you have other security measures in place (e.g., if the service is in an isolated network that's reachable only by your Immuta environment).

Endpoints

POST /

Description

The unmask action allows Immuta to build predicates that can be used to query data that is consistently masked at rest in the remote database; it does not dynamically mask data at query time.

To dynamically mask data, use .

This endpoint accepts a set of values and a directive to either mask or unmask them.

Request Body

Your service will need to parse and process the following body parameters:

Below is an example request payload to mask values in the ssn and ccn columns:

Below is an example request payload to unmask values in the ssn and ccn columns:

Response Body

Your service will need to return a map of values that corresponds to the columns and values that were specified in the request. It is important that your service returns the same column keys and that the position of each masked/unmasked value in your response corresponds to the masked/unmasked value from the request.

For example, the following request

could return the following body:

Notice that both ssn and ccn columns are present and that each of them contains the exact number of values specified in the request. Immuta will fail to validate responses to its request under the following circumstances:

The response contains column keys that were not present in the request.
The response is missing column keys that were present in the request.
The response doesn't contain the exact number of values for each of the corresponding column keys in the request.
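
The three failure conditions above can be checked with a few lines of code. This is a sketch for testing your own service, under the assumption that the values in both request and response are a mapping from column name to a list of values; the function name is hypothetical and this is not Immuta's validation code.

```python
def validate_response(request_values, response_values):
    """Return a list of problems; an empty list means the shapes match."""
    problems = []
    # Column keys in the response that were not in the request.
    for column in sorted(response_values.keys() - request_values.keys()):
        problems.append(f"unexpected column key: {column}")
    # Column keys from the request that the response dropped.
    for column in sorted(request_values.keys() - response_values.keys()):
        problems.append(f"missing column key: {column}")
    # Shared columns must carry exactly as many values as were sent.
    for column in sorted(request_values.keys() & response_values.keys()):
        if len(request_values[column]) != len(response_values[column]):
            problems.append(f"wrong number of values for column: {column}")
    return problems
```

The check covers keys and counts only. Whether each value sits at the correct position cannot be verified from the shape alone, so the service itself must preserve order.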

Examples

Below are some very simplistic implementation examples of a service with mask() and unmask() functions:
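
A toy sketch of such a service in Python follows. It is an illustration only and makes several assumptions: values arrive as a mapping from column name to a list of strings, masking is a keyed HMAC-SHA256 token, and an in-memory dictionary stands in for whatever store a real service would use to reverse tokens. The fixed key plays the role of the deterministic salt, so the same input always produces the same token.

```python
import hashlib
import hmac

# Hypothetical fixed key; keeping it constant makes masking deterministic.
SECRET_KEY = b"replace-with-a-real-secret"

# Token -> original value, so unmask() can reverse mask(). In-memory only.
_vault = {}


def mask(values):
    """Mask every value per column; column keys and value order are preserved."""
    masked = {}
    for column, column_values in values.items():
        tokens = []
        for value in column_values:
            token = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
            _vault[token] = value
            tokens.append(token)
        masked[column] = tokens
    return masked


def unmask(values):
    """Reverse mask(); raises KeyError for tokens this service never issued."""
    return {column: [_vault[token] for token in tokens] for column, tokens in values.items()}
```

Both functions return the same column keys and the same number of values per column as they receive, which is the shape the response rules above require.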

Reference Guides

Data Policy Types

Once a user is subscribed to a data source, the data policies that are applied to that data source determine what data the user sees.

For all data policies, you must establish the conditions under which they will be enforced. Immuta allows you to append multiple conditions to the data. Those conditions are based on user attributes and groups (which can come from multiple identity management systems and be applied as conditions in the same policy), or on the purposes users are acting under through Immuta projects.

Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced:

Exclusionary condition example: Mask using hashing values in columns tagged PII on all data sources for everyone except users in the group AUDIT.
Inclusionary condition example: Only show rows where user is a member of a group that matches the value in the column tagged Department.
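
The two example conditions can be modeled as follows. This is a sketch under stated assumptions: users are represented only by a set of group names, hashing is plain SHA-256, and the function names are hypothetical.

```python
import hashlib


def mask_pii_for_everyone_except(value, user_groups, exempt_group="AUDIT"):
    """Exclusionary: the value is masked unless the user is in the exempt group."""
    if exempt_group in user_groups:
        return value
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def only_show_rows_matching_group(rows, user_groups, column="Department"):
    """Inclusionary: a row is visible only when its column value matches a user group."""
    return [row for row in rows if row[column] in user_groups]
```

The first function starts from "everyone is restricted" and carves out an exception; the second starts from "nothing is visible" and admits only matching rows.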

Policy support

Integration support matrix

Certain policies are not supported, or supported with caveats*, depending on the integration:

*Supported with Caveats:

On Databricks data sources, joins will not be allowed on data protected with replace with NULL/constant policies.
Snowflake k-anonymization: This policy type is only supported if you are using the query engine, which is disabled by default. Reach out to your Immuta representative if you need to enable this policy type for your account.
Starburst (Trino):
- K-anonymization, randomized response, and format preserving masking are only supported if you are using the query engine, which is disabled by default. Reach out to your Immuta representative if you need to enable this policy type for your account.
- The Immuta function @iam for WHERE clause policies can block the creation of views.

Policy types

Inclusionary policies

For example, governors could mask values using hashing for users acting under a specified purpose while masking those same values by making null for everyone else who accesses the data.

This variation can be created by selecting for everyone who when available from the condition dropdown menus and then completing the Otherwise clause.
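
The hashing-or-null example can be sketched as a single function with an Otherwise branch. The purpose name and function name below are hypothetical, and plain SHA-256 stands in for the hashing policy.

```python
import hashlib


def mask_for_everyone_who(value, user_purposes, purpose="Research"):
    """Hash for users acting under `purpose`; otherwise make the value null."""
    if purpose in user_purposes:
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return None  # the Otherwise clause: everyone else sees NULL
```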

Limit to purpose policies

For example, if the purpose Research included Marketing, Product, and Onboarding as sub-purposes, a governor could write the following global policy:

Limit usage to purpose(s) Research for everyone on data sources tagged PHI.

This hierarchy allows you to create this as a single purpose instead of creating separate purposes, which must then each be added to policies as they evolve.

Now, any user acting under the purpose or sub-purpose of Research - whether Research.Marketing or Research.Onboarding - will meet the criteria of this policy. Consequently, purpose hierarchies eliminate the need for a governor to rewrite these global policies when sub-purposes are added or removed. Furthermore, if new projects with new Research purposes are added, for example, the relevant global policy will automatically be enforced.
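
One way to picture the hierarchy check is as a match on the dotted purpose name. The sketch below assumes that sub-purposes are written as `Parent.Child`, as in Research.Marketing; the function name is hypothetical and this is not Immuta's matching code.

```python
def purpose_matches(policy_purpose, user_purpose):
    """True if user_purpose is the policy purpose itself or one of its sub-purposes."""
    return user_purpose == policy_purpose or user_purpose.startswith(policy_purpose + ".")
```

Matching on `policy_purpose + "."` rather than on the bare prefix keeps an unrelated purpose such as ResearchOps from satisfying a policy written for Research.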

Masking policies

Masking policies hide values in data, providing various levels of utility while still preserving privacy.

Hashing

This policy masks the values with an irreversible SHA-256 hash, which is consistent for the same value throughout the data source, so you can count or track specific values without knowing the true raw value.
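
The consistency property is easy to see with Python's standard library. The sketch below uses plain, unsalted SHA-256 for illustration; whether and how a salt is applied in practice is not covered here.

```python
import hashlib


def hash_mask(value):
    """Return the SHA-256 hex digest of the value's string form."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


emails = ["a@example.com", "b@example.com", "a@example.com"]
masked = [hash_mask(email) for email in emails]

# Equal inputs give equal digests, so distinct values can still be counted.
distinct_count = len(set(masked))
```

Here `distinct_count` is 2, the same as for the raw values, even though the digests reveal nothing readable about the addresses.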

Replace with NULL

This policy makes values null, removing any utility of the data the policy applies to.

Replace with constant

With this policy, you can replace the values with the same constant value you choose, such as 'Redacted', removing any utility of that data.

Regular expression (regex)

This policy is similar to replacing with a constant, but it provides more utility because you can retain portions of the true value. For example, the following regex rule would mask the final digits of an IP address:

Mask using a regex \d+$ the value in the columns ip_address for everyone.

In this case, the regular expression \d+$ breaks down as follows:

\d matches a digit (equal to [0-9])

+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

This ensures we capture the last digit(s) after the last . in the IP address. We can then enter the replacement for what we captured, which in this case is XXX. The outcome of the policy would look like this: 164.16.13.XXX
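
The same substitution can be tried outside the policy builder with Python's `re` module. The sample address is made up, and regex dialects can differ slightly between Python and the underlying data platform.

```python
import re


def regex_mask(value, pattern=r"\d+$", replacement="XXX"):
    """Replace every match of pattern in value with replacement."""
    return re.sub(pattern, replacement, value)


print(regex_mask("164.16.13.25"))  # 164.16.13.XXX
```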

Rounding

This is a technique to hide precision from numeric values while providing more utility than simply hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.
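
A minimal sketch of both kinds of rounding follows. For simplicity it floors numbers to the start of their bucket and truncates timestamps to the start of the chosen unit, which is an assumption; the exact rounding rule applied by a given integration may differ.

```python
import math
from datetime import datetime


def round_to_bucket(value, bucket_size):
    """Floor a number to the start of its bucket, e.g. 37 with size 10 gives 30."""
    return math.floor(value / bucket_size) * bucket_size


def truncate_datetime(ts, unit):
    """Drop precision below the given unit: hour, day, month, or year."""
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit!r}")
```

Every value inside a bucket maps to the same output, so aggregate analysis still works while individual precision is hidden.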

With reversibility

Note: The user receiving the unmasking request must send the unmasked value to the requester.

With Reversible Masking, the raw values are switched out with consistent values to allow analysis without revealing the underlying sensitive data. The direct identifier is replaced with a token that can still be tracked or counted.

With format preserving masking

This option masks the value, but preserves the length and type of the value.

Preserving the data format is important if the format has some relevance to the analysis at hand. For example, you may need to retain the integer column type, or the first 6 digits of a 12-digit number may have an important meaning.

Custom function

This option uses functions native to the underlying database to transform the column.

Limitations

The masking functions are executed against the remote database directly. A poorly written function could lead to poor quality results, data leaks, and performance hits.
Using custom functions can result in changes to the original data type. In order to prevent query errors you must ensure that you cast this result back to the original type.
The function must be valid for the data type of the selected column. If it is not:
- Local policies will error and show a message that the function is not valid.
- Global policies will error and change to the default masking type (hashing for text and NULL for all others).

Conditionally masking

For all of the policies above, both at the local and global policy levels, you can conditionally mask the value based on a value in another column. This allows you to build a policy that looks something like "Mask bank account number where country = 'USA'" instead of masking the bank account number unconditionally.
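
The bank account example can be sketched as a per-row decision. The column names `country` and `bank_account_number`, the constant, and the function name are hypothetical.

```python
def mask_bank_account(row, constant="REDACTED"):
    """Mask bank_account_number only in rows where country is 'USA'."""
    masked = dict(row)  # copy so the input row is left untouched
    if masked["country"] == "USA":
        masked["bank_account_number"] = constant
    return masked
```

Rows from other countries pass through unchanged, so the masking decision is made cell by cell rather than for the whole column.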

With k-anonymization

Sample data is processed during computation of k-anonymization policies

When a k-anonymization policy is applied to a data source, the columns targeted by the policy are queried under a fingerprinting process that generates rules enforcing k-anonymity. The results of this query, which may contain data that is subject to regulatory constraints such as GDPR or HIPAA, are stored in Immuta's metadata database.

The location of the metadata database depends on your deployment:

Self-managed Immuta deployment: The metadata database is located in the server where you have your external metadata database deployed.
SaaS Immuta deployment: The metadata database is located in the AWS global segment you have chosen to deploy Immuta.

K-anonymity is measured by grouping records in a data source that contain the same values for a common set of quasi identifiers (QIs) - publicly known attributes (such as postal codes, dates of birth, or gender) that are consistently, but ambiguously, associated with an individual.

The k-anonymity of a data source is defined as the number of records within the least populated cohort, which means that the QIs of any single record cannot be distinguished from at least k other records. In this way, a record with QIs cannot be uniquely associated with any one individual in a data source, provided k is greater than 1.
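As a rough illustration of the definition above, the following sketch groups records by their quasi identifiers and reports the size of the smallest cohort. The record layout, column names, and the `k_anonymity` helper are hypothetical and are not part of Immuta.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the least populated cohort for the given QIs."""
    cohorts = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return min(cohorts.values())


# Hypothetical sample data, for illustration only.
records = [
    {"zip": "21201", "gender": "F"},
    {"zip": "21201", "gender": "F"},
    {"zip": "21201", "gender": "M"},
    {"zip": "21202", "gender": "M"},
    {"zip": "21202", "gender": "M"},
]

k_all = k_anonymity(records, ["zip", "gender"])  # smallest cohort: ("21201", "M")
k_zip = k_anonymity(records, ["zip"])            # smallest cohort: "21202"
```

In this sample, the single record for ("21201", "M") forms a cohort of one, so the combination of both columns is less anonymous than the zip column alone.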

In Immuta, masking with k-anonymization examines pairs of values across columns and hides groups that do not appear at least the specified number of times (k). For example, if one column contains street numbers and another contains street names, the group 123, "Main Street" probably would appear frequently while the group 123, "Diamondback Drive" probably would show up much less. Since the second group appears infrequently, the values could potentially identify someone, so this group would be masked.

After the fingerprint service identifies columns with a low number of distinct values, users will only be able to select those columns when building the policy. Users can either use a minimum group size (k) given by the fingerprint or manually select the value of k.

Masking multiple columns with k-anonymization

Governors can write global data policies using k-anonymization in the global data policy builder.

When this global policy is applied to data sources, it will mask all columns matching the specified tag.

Applying k-anonymization over disjoint sets of columns in separate policies does not guarantee k-anonymization over their union.

If you select multiple columns to mask with k-anonymization in the same policy, the policy is driven by how many times these values appear together. If the groups appear fewer than k times, they will be masked.
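The difference between masking columns together and masking them in separate policies can be sketched as follows. This is a simplified model, not Immuta's implementation: it assumes rare groups are replaced with NULL (`None`) only, and the rows and the `mask_rare_groups` helper are hypothetical.

```python
from collections import Counter


def mask_rare_groups(rows, columns, k):
    """Null out `columns` in every row whose combined values appear fewer than k times."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    masked = []
    for row in rows:
        new_row = dict(row)
        if counts[tuple(row[c] for c in columns)] < k:
            for c in columns:
                new_row[c] = None
        masked.append(new_row)
    return masked


# Hypothetical sample data, for illustration only.
rows = [
    {"gender": "F", "state": "Ohio"},
    {"gender": "F", "state": "Ohio"},
    {"gender": "M", "state": "Ohio"},
    {"gender": "M", "state": "Utah"},
    {"gender": "M", "state": "Utah"},
]

# One policy over both columns: the lone ("M", "Ohio") group is masked.
joint = mask_rare_groups(rows, ["gender", "state"], k=2)

# Two policies, one per column: every individual value appears at least
# twice, so nothing is masked and ("M", "Ohio") stays unique.
separate = mask_rare_groups(mask_rare_groups(rows, ["gender"], k=2), ["state"], k=2)
```

The `separate` result shows why k-anonymization over disjoint column sets does not guarantee k-anonymization over their union.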

For example, if Policy A

Policy A: Mask with k-anonymization the values in the columns gender and state requiring a group size of at least 2 for everyone

was applied to this data source

the values would be masked like this:

Note: Selecting many columns to mask with k-anonymization increases the processing that must occur to calculate the policy, so saving the policy may take time.

However, if you select to mask the same columns with k-anonymization in separate policies, Policy C and Policy D,

Policy C: Mask with k-anonymization the values in the column gender requiring a group size of at least 2 for everyone
Policy D: Mask with k-anonymization the values in the column state requiring a group size of at least 2 for everyone

the values in the columns will be masked separately instead of as groups. Therefore, the values in that same data source would be masked like this:

Using randomized response

All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences, and, even though direct identifiers have been removed and k-anonymization has been applied, outsiders could infer substance abuse for an individual if they knew a male welder in this zip code.

In this scenario, using randomized response would change some of the Y's in substance_abuse to N's and vice versa; consequently, outsiders couldn't be sure of the displayed value of substance_abuse given in any individual row, as they wouldn't know which rows had changed.

How the randomization works

Immuta applies a random number generator (RNG) that is seeded with some fixed attributes of the data source, column, backing technology, and the value of the high cardinality column, an approach that simulates cached randomness without having to actually cache anything.

For numeric data, Immuta uses the RNG to add a random shift from a 0-centered Laplace distribution with the standard deviation specified in the policy configuration. For most purposes, knowing the distribution is not important, but the net effect is that on average the reported values should be the true value plus or minus the specified deviation value.
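A minimal sketch of this idea, under stated assumptions: the seed is built by hashing a few fixed strings (the exact attributes and hashing used by Immuta are not specified here), and the noise is drawn by inverse-CDF sampling from a Laplace distribution whose scale is derived from the configured standard deviation. The function names are hypothetical.

```python
import hashlib
import math
import random


def laplace_noise(seed_parts, std_dev):
    """Draw one 0-centered Laplace sample, fully determined by seed_parts."""
    seed = hashlib.sha256("|".join(seed_parts).encode("utf-8")).hexdigest()
    rng = random.Random(seed)
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    while u == -0.5:        # avoid log(0) in the (very unlikely) edge case
        u = rng.random() - 0.5
    scale = std_dev / math.sqrt(2)  # Laplace std dev equals scale * sqrt(2)
    return -scale * math.copysign(1.0, u) * math.log1p(-2.0 * abs(u))


def randomized_value(true_value, data_source, column, key_value, std_dev):
    """Shift true_value by noise seeded from fixed attributes of the row."""
    return true_value + laplace_noise([data_source, column, str(key_value)], std_dev)


# The same inputs always yield the same shifted value, so nothing is cached.
first = randomized_value(100.0, "claims", "salary", "row-1", 5.0)
second = randomized_value(100.0, "claims", "salary", "row-1", 5.0)
```

Because the seed depends only on fixed attributes, repeated queries return the same shifted value, which mimics cached randomness as described above.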

Preserving data utility

Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.

Additionally, randomized response provides deniability of record content, not of dataset participation, so individual rows can be displayed.
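To illustrate how aggregate utility can survive randomization, consider a simplified model that is an assumption of this example: a two-valued column (Y/N) in which each value is flipped independently with a known probability p. If q is the observed share of Y values, the true share can be estimated as (q - p) / (1 - 2p). The helper name below is hypothetical.

```python
def estimate_true_proportion(observed_proportion, flip_probability):
    """Invert q = pi * (1 - p) + (1 - pi) * p to recover pi."""
    if flip_probability == 0.5:
        raise ValueError("A flip probability of 0.5 destroys all signal.")
    return (observed_proportion - flip_probability) / (1 - 2 * flip_probability)


# If 30% of rows are truly Y and 10% of values are flipped, analysts
# observe 0.30 * 0.9 + 0.70 * 0.1 = 0.34 and can work back to 0.30.
observed = 0.30 * 0.9 + 0.70 * 0.1
estimate = estimate_true_proportion(observed, 0.1)
```

The estimate applies to the population as a whole; it says nothing certain about any single row.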

Mixing masking policies on the same column

In some cases, you may want several different masking policies applied to the same column through Otherwise policies. To build these policies, select everyone who instead of everyone or everyone except. After you specify who the masking policy applies to, select how it applies to everyone else in the Otherwise condition.

You can add and remove tags in Otherwise conditions for global policies (unlike local policy Otherwise conditions), as illustrated above; however, all tags or regular expressions included in the initial everyone who rule must be included in an everyone or everyone except rule in the additional clauses.

Complex data types: masking fields within struct columns (public preview)

Feature limitations

Masking struct and array columns is only available for Databricks data sources.
Immuta only supports Parquet and Delta table types.

Spark supports a class of data types called complex types, which can represent multiple data values in a single column. Immuta supports masking fields within array and struct columns:

array: an ordered collection of elements
struct: a collection of elements that are primitive or complex types

Without this feature enabled, the struct and array columns of a data source default to jsonb in the Data Dictionary, and the masking policies that users can apply to jsonb columns are limited. For example, if a user wanted to mask PII inside the column patient in the image below, they would have to apply null masking to the entire column or use a custom function instead of just masking name or address.

After a global or local policy masks the columns containing PII, users who do not meet the exception specified in the policy will see these values masked:

Note: Immuta uses the > delimiter rather than the . delimiter to indicate that a field is nested, since field and column names could themselves include a period.

Caveats

Struct Columns with Many Fields

If users have struct columns with many fields, they will need to either

create the data source against a cluster running Spark 3 or
add spark.debug.maxToStringFields 1000 to their Spark 2 cluster's configuration.

To get column information about a data source, Immuta executes a DESCRIBE call for the table. In this call, Spark returns a simple string representation of the schema for each column in the table. For the patient column above, the simple string would look like this:

struct<name:string,ssn:string,age:int,address:struct<city:string,state:string,zipCode:string,street:text>>

Immuta then parses this string into the following format for the data source's dictionary:

However, if the struct contains more than 25 fields, Spark truncates the string, causing the parser to fail and fall back to jsonb. Immuta will attempt to avoid this failure by increasing the number of fields allowed in the server-side property setting, maxToStringFields; however, this only works with clusters on a Spark 3 runtime. The maxToStringFields configuration in Spark 2 cannot be set through the ODBC driver and can only be set through the Spark configuration on the cluster with spark.debug.maxToStringFields 1000 on cluster startup.
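The kind of parsing described in this section can be sketched as follows. This is not Immuta's parser; it is a small recursive example that flattens a Spark simple string into nested field paths joined with the > delimiter. The function names are hypothetical, and array types are not handled.

```python
def split_top_level(body):
    """Split on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])
    return parts


def flatten_struct(column, type_string):
    """Return (path, type) pairs for every leaf field of a struct column."""
    if not (type_string.startswith("struct<") and type_string.endswith(">")):
        return [(column, type_string)]
    fields = []
    for part in split_top_level(type_string[len("struct<"):-1]):
        name, _, field_type = part.partition(":")
        fields.extend(flatten_struct(f"{column} > {name}", field_type))
    return fields


schema = ("struct<name:string,ssn:string,age:int,"
          "address:struct<city:string,state:string,zipCode:string,street:text>>")
paths = flatten_struct("patient", schema)
```

A string that Spark has truncated no longer matches this grammar, which is the kind of situation in which a parser cannot recover the fields.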

External masking

Deprecation notice: Support for this feature has been deprecated.

This feature allows Immuta to unmask data that is masked at rest in a remote database using a customer-provided encryption or masking algorithm. To do so,

Data owners apply these tags to columns that are masked (with encryption or another algorithm) in the remote database.

Unmasking process

Immuta will only unmask externally masked data if two conditions are met:

A masking policy is applied against that tagged column.
The querying user is exempt from that policy.

When a user who is exempt from the policy restrictions queries that masked column using a filter, Immuta converts the literal being queried using the external algorithm provided. Consider the following example:

The social_security_number column is masked on-ingest and has the tag externally_masked_data applied to it.
This masking policy is applied to the data source in Immuta: Mask using hashing the values in the column tagged externally_masked_data except for users who belong to the group view_masked_values.
The querying user belongs to the view_masked_values group.

When the user above runs the query select * from table A where social_security_number = 220869988, Immuta converts 220869988 to the masked value using the provided algorithm to query the database and return matching rows.

Use equality queries only

Queries against masked values on-ingest should be equality queries only. For example, if an exempt user ran a query like select * from table A where social_security_number > 220869988, the results may not make sense (depending on the algorithm used for masking the data).
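The equality-only guidance can be illustrated with a small sketch. The masking algorithm here is a keyed HMAC chosen purely as a stand-in for a customer-provided algorithm; the key, rows, and function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-key"  # hypothetical key for the stand-in algorithm


def external_mask(value):
    """Stand-in for the customer-provided masking algorithm."""
    return hmac.new(SECRET, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


# Values are masked on ingest, so the remote table only holds masked values.
table = [
    {"social_security_number": external_mask(220869988), "name": "A"},
    {"social_security_number": external_mask(220869989), "name": "B"},
]


def query_equals(rows, column, literal):
    """Mask the literal first, then compare it with the stored masked values."""
    masked_literal = external_mask(literal)
    return [row for row in rows if row[column] == masked_literal]


matches = query_equals(table, "social_security_number", 220869988)
```

Equality works because the same input always maps to the same masked value. A comparison such as > would instead compare masked values, whose ordering generally has no relationship to the ordering of the original numbers.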

Tutorials

Row-level security policies

These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.

Matching

These policies match a user attribute with a row/object/file attribute to determine if that row/object/file should be visible. This process uses a direct string match, so the user attribute must exactly match the data attribute for the user to see that row of data.

For example, to restrict access to insurance claims data to the state in which the user's home office is located, you could build a policy such as this:

Only show rows where user possesses an attribute in Office Location that matches the value in the column State for everyone except when user is a member of group Legal.

In this case, the Office Location is retrieved by the identity management system as a user attribute or group. If the user's attribute (Office Location) was Missouri, rows containing the value Missouri in the State column in the data source would be the only rows visible to that user.
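The matching logic in this example can be sketched in a few lines. The user and row structures below are hypothetical; only the policy wording comes from the example above.

```python
def visible_rows(rows, user):
    """Apply the example policy: match Office Location to State, except for Legal."""
    if "Legal" in user["groups"]:
        return list(rows)  # members of Legal are excepted from the policy
    offices = set(user["attributes"].get("Office Location", []))
    # Direct string match: "missouri" would not match "Missouri".
    return [row for row in rows if row["State"] in offices]


claims = [
    {"claim_id": 1, "State": "Missouri"},
    {"claim_id": 2, "State": "Kansas"},
]
analyst = {"groups": ["Analysts"], "attributes": {"Office Location": ["Missouri"]}}
counsel = {"groups": ["Legal"], "attributes": {}}

analyst_view = visible_rows(claims, analyst)
counsel_view = visible_rows(claims, counsel)
```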

WHERE clause policy

This policy can be thought of as a table "view" created automatically for the user based on the condition of the policy. For example, in the policy below, users who are not members of the Admins group will only see taxi rides where passenger_count < 2.

Only show rows where public.us.taxis.passenger_count <2 for everyone except when user is a member of group Admins.

WHERE clause policy requirement

All columns referenced in the policy must have fully qualified names. Any column names that are unqualified (just the column name) will default to a column of the data source the policy is being applied to (if one matches the name).

Time-based restrictions

These policies restrict access to rows/objects/files that fall within the time restrictions set in the policy. If a data source has time-based restriction policies, queries run against the data source by a user will only return rows/blobs whose event-time column/attribute contains a date within a certain range.

The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources.
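As a minimal sketch of a time-based restriction, the following filter keeps only rows whose event time falls within a trailing window. The window definition (the last N days, measured from a supplied current time) and all names are assumptions of this example.

```python
from datetime import datetime, timedelta


def within_window(rows, event_time_column, days, now):
    """Keep rows whose event time is within the last `days` days of `now`."""
    cutoff = now - timedelta(days=days)
    return [row for row in rows if cutoff <= row[event_time_column] <= now]


rides = [
    {"ride_id": 1, "pickup_time": datetime(2024, 6, 29, 8, 0)},
    {"ride_id": 2, "pickup_time": datetime(2024, 5, 1, 8, 0)},
]
recent = within_window(rides, "pickup_time", days=30, now=datetime(2024, 6, 30))
```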

Minimization

These policies return a limited percentage of the data at query time. The data is randomly sampled, but the sample is the same for all users. For example, you could limit certain users to only 10% of the data. Immuta uses a hashing policy to return approximately 10% of the data, and the data returned will always be the same; however, the exact number of rows exposed depends on the distribution of high cardinality columns in the database and the hashing type available. Additionally, Immuta will adjust the data exposed when new rows are added or removed.

Best practice: row count

Immuta recommends you use a table with over 1,000 rows for the best results when using a data minimization policy.
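One way to picture hash-based minimization is the sketch below: each row's high cardinality value is hashed into one of 100 buckets, and only rows in the first N buckets are returned. The choice of SHA-256 and the bucket scheme are assumptions of this example, not a description of the hashing Immuta uses.

```python
import hashlib


def in_sample(key, percent):
    """Deterministically decide whether a row belongs to the sample."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Roughly 10% of keys are kept, and the same keys are kept on every query.
sample = [key for key in range(10000) if in_sample(key, 10)]
```

The sample size is only approximately 10% because it depends on how the hashed values happen to fall into buckets, which is also why small tables give less predictable results.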

Masked columns as input for row-level policies

Public preview: This feature is currently in public preview and available to all accounts.

If a global masking policy applies to a column, you can still use that masked column in a global row-level policy.

Consider the following policy examples:

Masking policy: Mask values in columns tagged Country for everyone except users in group Admin.
Row-level policy: Only show rows where user possesses an attribute in OfficeLocation that matches the value in column tagged Country for everyone.

Both of these policies use the Country tag to restrict access. Therefore, the masking policy and the row-level policy would apply to data source columns with the tag Country for users who are not in the Admin group.

Limitations

New column added policy

Masking Policies

Masking policies hide values in data, providing various levels of utility while still preserving privacy. Immuta offers column masking and cell-level masking.

Column masking

Column masking policies allow you to hide the data in a column. However, there are several different approaches for masking data that allow you to make tradeoffs between privacy (how far you go with masking) vs utility (how much you want the masked data to be useful to the data consumer).

As with all Immuta policy types, it is recommended that you use global policies to manage policies at scale. When using global policies, tagging your data with metadata becomes critical and is described in detail in the use case.

Types

Categorical Randomized Response: Categorical values are randomized by replacing a value with some non-zero probability. Not all values are randomized, and the consumer of the data is not told which values are randomized and which ones remain unchanged. Values are replaced by selecting a different value uniformly at random from among all other values. If a randomized response policy were applied to a “state” column, a person’s residency could flip from Maryland to Virginia, which would provide ambiguity to the actual state of residency. This policy is appropriate when obscuring sensitive values such as medical diagnosis or survey responses.
Custom Function: This function uses SQL functions native to the underlying database to transform the values in a column. This can be used in numerous use cases, but notional examples include top-coding to some upper limit, a custom hash function, and string manipulation.
K-Anonymization: Masking through k-anonymization is a distinct policy that can operate over multiple attributes. A k-anonymization policy applies rounding and NULL masking policies over multiple columns so that the columns contain at least “K” records, where K is a positive integer. As a result, attributes will only be disclosed when there is a sufficient number of observations. This policy is appropriate to apply over indirect identifiers, such as zip code, gender, or age. Generally, each of these identifiers is not uniquely linked to an individual, but when combined with other identifiers can be associated with a single person. Applying k-anonymization to these attributes provides the anonymity of crowds so that individual rows are made indistinct from each other, reducing the re-identification risk by making it unclear which record corresponds to a specific person. Immuta requires that you opt in to use this masking policy type. To enable k-anonymization for your account, contact your Immuta representative. Immuta supports k-anonymization of text, numeric, and time-based data types.
Mask with Format Preserving Masking: This function masks using a reversible function but does so in a way that the underlying structure of a value is preserved. This means the length and type of a value are maintained. This is appropriate when the masked value should appear in the same format as the underlying value. Examples of this would include social security numbers and credit card numbers where Mask with Format Preserving Masking would return masked values in a format consistent with credit cards or social security numbers, respectively. There is larger overhead with this masking type, and it should really only be used when format is critically valuable, such as situations when an engineer is building an application where downstream systems validate content. In almost all analytical use cases, format should not matter.
Mask with Reversibility: This function masks in a way that allows an authorized user to “unmask” a value and reveal the underlying value. Masking with Reversibility is appropriate when there is a need to obscure a value while allowing an authorized user to recover the underlying value. All of the same use cases and caveats that apply to Replace with Hashing apply to this function. Reversibly masked fields can leak the length of their contents, so it is important to consider whether or not this may be an attack vector for applications involving its use.
Randomized Response: This function randomizes the displayed value to make the true value uncertain, but maintains some analytic utility. The randomization is applied differently to categorical and quantitative values. In both cases, the noise can be increased to enhance privacy or reduced to preserve more analytic value.
Datetime and Numeric Randomized Response: Numeric and datetime randomized response apply a tunable, unbiased noise to the nominal value. This noise can obscure the underlying value, but the impact of the noise is reduced in aggregate. This masking type can be applied to sensitive numerical attributes, such as salary, age, or treatment dates.
Replace with Constant: This function replaces any value in a column with a specified value. The underlying data will appear to be a constant. This masking carries the same privacy and utility guarantees as Replace with NULL. Apply this policy to strings that require a specific repeated value.
Replace with Hashing: This function masks the values with an irreversible hash, which is consistent for the same value throughout the data source, so you can count or track the specific values, but not know the true raw value. This is appropriate for cases where the underlying value is sensitive, but there is a need to segment the population. Such attributes could be addresses, time segments, or countries. It is important to note that hashing is susceptible to inference attacks based on prior knowledge of the population distribution. For example, if “state” is hashed, and the dataset is a sample across the United States, then an adversary could assume that the most frequently occurring hash value is California. As such, it's most secure to use the hashing mask on attributes that are evenly distributed across a population.
Replace with Null: This function replaces any value in a column with NULL. This removes any identifiability from the column and removes all utility of the data. Apply this policy to numeric or text attributes that have a high re-identification risk, but little analytic value (names and personal identifiers).
Replace with REGEX: This function uses a regular expression to replace all or a portion of an attribute. REGEX replacement allows for some groupings to be maintained, while providing greater ambiguity to the disclosed value. This masking technique is useful when the underlying data has some consistent structure, the unmasked underlying data represents some re-identification risk, and a regular expression can be used to mask the underlying data to be less identifiable.
Rounding: Immuta’s rounding policy reduces, rounds, or truncates numeric or datetime values to a fixed precision. This policy is appropriate when it is important to maintain analytic value of a quantity, but not at its native precision.
- Date/Time Rounding: This policy truncates the precision of a datetime value to a user-defined precision. `minute`, `hour`, `day`, `month`, and `year` are the supported precisions.
- Numeric Rounding: This policy maps the nominal value to the ceiling of some specified bandwidth. Immuta has a recommended bandwidth based on the Freedman-Diaconis rule.
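The two rounding behaviors can be sketched as follows. The Freedman-Diaconis rule gives a bin width of 2 × IQR / n^(1/3); how Immuta derives its recommended bandwidth from that rule is not described here, so the helper below is only an illustration, and all function names are hypothetical.

```python
import math
import statistics
from datetime import datetime


def freedman_diaconis_width(values):
    """Bin width from the Freedman-Diaconis rule: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


def round_to_bandwidth(value, bandwidth):
    """Map a value to the ceiling of its bandwidth-sized bucket."""
    return math.ceil(value / bandwidth) * bandwidth


def truncate_datetime(value, precision):
    """Truncate a datetime to year, month, day, hour, or minute precision."""
    order = ["year", "month", "day", "hour", "minute"]
    keep = order[: order.index(precision) + 1]
    defaults = {"month": 1, "day": 1, "hour": 0, "minute": 0}
    reset = {field: defaults[field] for field in order if field not in keep}
    return value.replace(second=0, microsecond=0, **reset)


width = freedman_diaconis_width([1, 2, 3, 4, 5, 6, 7, 8])
rounded = round_to_bandwidth(37, 10)
truncated = truncate_datetime(datetime(2024, 5, 17, 13, 45, 30), "hour")
```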

Cell-level masking

For example, a regular masking policy looks like the following:

Mask columns tagged Discovered.Entity.Social Security Number using hashing for everyone except members of group admins

The cells can be conditionally masked by changing the for to a where:

Mask columns tagged Discovered.Entity.Social Security Number using hashing where country_of_residence = 'US' for everyone except members of group admins

That policy will check the country_of_residence column in the table. If the value is US, the cell will be masked; otherwise, the data will be presented in the clear as usual.

Mask columns tagged Discovered.Entity.Social Security Number using hashing where @columnTagged('country') = 'US' for everyone except members of group admins

This example policy targets the column with the tag country in the policy logic dynamically.
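The per-row decision made by a cell-level masking policy can be sketched as below. SHA-256 stands in for the hashing mask, and the row layout, column names, and function names are hypothetical.

```python
import hashlib


def hash_mask(value):
    """Stand-in hashing mask: same input, same output."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def apply_cell_mask(row, target_column, condition_column, condition_value,
                    user_groups, exempt_group="admins"):
    """Mask target_column only where the condition holds and the user is not exempt."""
    masked = dict(row)
    if exempt_group in user_groups:
        return masked
    if row[condition_column] == condition_value:
        masked[target_column] = hash_mask(row[target_column])
    return masked


us_row = {"ssn": "220-86-9988", "country_of_residence": "US"}
ca_row = {"ssn": "123-45-6789", "country_of_residence": "CA"}

masked_us = apply_cell_mask(us_row, "ssn", "country_of_residence", "US", ["analysts"])
clear_ca = apply_cell_mask(ca_row, "ssn", "country_of_residence", "US", ["analysts"])
admin_us = apply_cell_mask(us_row, "ssn", "country_of_residence", "US", ["admins"])
```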

Masking circumstances

The masking functions described above can be implemented in a variety of use cases. Use the table below to determine the circumstance under which a function should be used.

Circumstance descriptions

Applicable to Numeric Data: The masking function can be applied to numeric values.
Column-Value Determinism: Repeated values in the same column are masked with the same output.
Introduces NULLs: The masking function may, under normal or irregular circumstances, return NULL values.
Performance: How performant the masking function will be (10/10 being the best).
Preserves Appearance: The output masked value resembles the valid column values. For example, a masking function would output phone numbers when given phone numbers. Here, NULL values are not counted against this property.
Preserves Averages: The average of the masked values (avg(mask(v))) will be near the average of the values in the clear (avg(v)).
Suitable for De-Identification: The masking function can be used to obscure record identifiers, hiding data subject identities and preventing future linking against other identified data.
Provides Deniability of Record Content: A (possibly identified) person can plausibly attribute the appearance of the value to the masking function. This is a desirable property of masking functions that retain analytic utility, as such functions must necessarily leak information about the original value. Fields masked with these functions provide strong protections against value inference attacks.
Preserves Equality and Grouping: Each value will be masked to the same value consistently without colliding with others. Therefore, equal values remain equal under masking while unequal values remain unequal, preserving equality. This implies that counting statistics are also preserved.
Preserves Message Length: The length of the masked value is equal to the length of the original value.
Preserves Range Statistics: The number of data values falling in a particular range is preserved. For strings, this can be interpreted as the number of strings falling between any two values by alphabetical order.
Preserves Value Locality: The output will remain near the input, which may be important for analytic purposes.
Reversible: Qualified individuals can reveal the original input value.

Masking policy support by integration

Since Global Policies can apply masking policies across multiple different databases at once, if an unsupported masking policy is applied to a column, Immuta will revert to NULLing that column.

Row-Level Policies

Policy logic

Immuta row-level policies compare data values with user metadata at query-time to determine whether or not the querying user should have access to the individual rows of data.

Referencing data values

The values contained in one or many columns in the table in question (or a ) need to be referenced by the policy for its logic to take effect.

For example, consider the policy below:

Only show rows where user is a member of a group that matches the value in the column tagged Department.

The data values (the values in the column tagged Department) are matched against the user attribute (their groups) to determine whether or not rows will be visible to the user accessing the data.

The policy targets columns tagged Department; this means that this policy can be applied globally across all tables and data platforms that have that tag with this single policy rather than having to build a separate policy for individual tables and columns.

Leveraging custom functions

It is also possible to use custom functions in row-level policies for more complex use cases.

These functions wrap Immuta context into free-form SQL logic for the row-level policy. That context can include the attributes (@attributeValuesContains()) or groups (@groupsContains()) possessed by the user, or the username (@username), all of which are injected into the SQL at runtime.

Avoid referencing explicit column names in custom functions and instead use the @columnTagged('tag name') function in SQL. In doing so, you can avoid having to reference the physical database world with the custom SQL policies and instead continue to target the metadata/tag world.

Custom WHERE Clause Functions

Overview

The Immuta policy builder allows you to use custom functions that reference important Immuta metadata from within your where clause. These custom functions can be seen as utilities that help you create policies more easily. Using the Immuta Policy Builder, you can include these functions in your policy queries by choosing where in the sub-action drop-down menu.

Custom Functions

The `@attributeValuesContains()` Function

This function returns true for a given row if the provided column evaluates to an attribute value for which the querying user has a corresponding attribute value. This function requires two arguments and accepts no more than three arguments.

Parameters

The `@columnTagged()` Function

This function returns the column name with the specified tag.

If this function is used in a Global Policy and the tag doesn't exist on a data source, the policy will not be applied.
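The behavior described above can be modeled with a small sketch: every @columnTagged('tag') reference in a WHERE clause is replaced by the name of the column carrying that tag, and the policy is skipped when a tag is missing. This is an illustrative model only; the `resolve_column_tagged` helper and the tag mapping are hypothetical, and columns with multiple matches are not considered.

```python
import re

COLUMN_TAGGED = re.compile(r"@columnTagged\('([^']+)'\)")


def resolve_column_tagged(where_clause, column_tags):
    """Substitute tagged column names, or return None if a tag is not on the data source."""
    tag_to_column = {
        tag: column for column, tags in column_tags.items() for tag in tags
    }
    missing = [t for t in COLUMN_TAGGED.findall(where_clause) if t not in tag_to_column]
    if missing:
        return None  # models "the policy will not be applied"
    return COLUMN_TAGGED.sub(lambda m: tag_to_column[m.group(1)], where_clause)


clause = "@columnTagged('country') = 'US'"
resolved = resolve_column_tagged(clause, {"country_of_residence": ["country"]})
skipped = resolve_column_tagged(clause, {"state": ["region"]})
```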

Parameters

The `@groupsContains()` Function

This function returns true for a given row if the provided column evaluates to a group to which the querying user belongs. This function requires at least one argument.

Parameters

The `@hasAttribute()` Function

This function returns a boolean indicating if the current user has the specified attribute name and value combination. If the specified attribute name or attribute value has a single quote, you will need to escape it using a \'\' expression within a custom WHERE policy.

Parameters

The `@iam` Function

This function returns the IAM ID for the current user.

Parameters

None.

The `@isInGroups()` Function

This function returns a boolean indicating if the current user is a member of all of the specified groups. If any of the specified groups has a single quote, you will need to escape it using a \'\' expression within a custom WHERE policy.

Parameters

The `@isUsingPurpose()` Function

This function returns a boolean indicating if the current user is using the specified purpose. If the specified purpose has a single quote, you will need to escape it using a \'\' expression within a custom WHERE policy.

Parameters

The `@purposesContains()` Function

This function returns true for a given row if the provided column evaluates to a purpose under which the querying user is currently acting. This function requires at least one argument and accepts no more than two arguments.

Parameters

The `@username` Function

This function returns the current user's username.

Parameters

None.

Data Policy Conflicts and Fallback

Masking policy conflicts

In some cases, two conflicting global masking policies apply to a single data source. When this happens, the policy containing a tag deeper in the hierarchy will apply to the data source to resolve the conflict.

Consider the following global data policies created by a data governor:

Data policy 1:

Mask columns tagged PII by making null for everyone on data sources with columns tagged PII

Data policy 2:

Mask columns tagged PII.SSN using hashing for everyone on data sources with columns tagged PII.SSN

If a data owner creates a data source and applies the PII.SSN tag to a column, both of these global masking policies will apply to the column with that tag. Instead of having a conflict, the policy containing a deeper tag in the hierarchy will apply.

In this example, data policy 2 will be applied to the data source because PII.SSN is deeper and thus considered more specific than PII. If data owners wanted to use data policy 1 on the data source instead, they would need to disable data policy 2.

Should two or more masking policies target the same column and have the same hierarchy depth, the policy that was authored first will win out. This is a conservative approach that avoids the original policy being changed unexpectedly.
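The two resolution rules above (deeper tag wins, then earliest authored wins) can be sketched as a single sort key. The policy dictionaries and the `created` ordering field are hypothetical representations.

```python
def winning_policy(policies):
    """Pick the policy with the deepest tag; break ties by earliest creation."""
    return min(
        policies,
        key=lambda p: (-len(p["tag"].split(".")), p["created"]),
    )


policies = [
    {"name": "Data policy 1", "tag": "PII", "masking": "null", "created": 1},
    {"name": "Data policy 2", "tag": "PII.SSN", "masking": "hashing", "created": 2},
]
winner = winning_policy(policies)

# Same depth: the policy authored first wins.
tied = [
    {"name": "Later", "tag": "PII.SSN", "masking": "null", "created": 5},
    {"name": "Earlier", "tag": "PII.SSN", "masking": "hashing", "created": 3},
]
tie_winner = winning_policy(tied)
```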

Row-level policy conflicts

Similar to masking policies, it is possible for two or more row-level policies to target the same table. When this occurs, all of the row-level policies are applied and combined with AND, meaning the user must satisfy every policy to see any rows in the table.

To OR separate row-level policies together, build them into a single Immuta policy together with an OR.

Masking policy intelligent fallbacks

When masking columns, the type of the column matters. For example, it is not possible to hash a numeric column, because the hash would render the number as a string.

Many data platforms make the user account for this by building separate data policies for every column type that could exist now or in the future, which is quite onerous.

Instead, Immuta has intelligent fallbacks. An intelligent fallback occurs when a masking type targets a column type that is incompatible with the masking type. In this case, Immuta will fall back to the most appropriate masking type which retains the level of privacy or better required by the previous type.

For example, if a hashing masking type hits a numeric type, it would intelligently fall back to nulling the column instead, since nulls are allowed in numeric types.
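A fallback of this kind can be modeled as a compatibility lookup. The table below is a deliberately small, hypothetical example that encodes only the hashing case described above, and it always falls back to NULL, which is a simplification of choosing "the most appropriate" masking type.

```python
# Hypothetical compatibility table: masking type -> column types it supports.
COMPATIBLE = {
    "hashing": {"text"},
    "null": {"text", "numeric", "datetime"},
}


def effective_masking(masking_type, column_type):
    """Return the requested masking type, or fall back to null if incompatible."""
    if column_type in COMPATIBLE.get(masking_type, set()):
        return masking_type
    return "null"  # nulling never reveals more than the requested type would


on_text = effective_masking("hashing", "text")
on_numeric = effective_masking("hashing", "numeric")
```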

Lockout policies

Sometimes a global data policy will target a table and the policy cannot be applied as written. This can happen for several reasons, but the most common is that the row-level policy logic is not relevant to the table in question.

For example, with the following policy

@attributeValuesContains('Attribute Name', 'SOME_COLUMN')

If SOME_COLUMN does not exist in the table, the row-level policy will not work (this is why it is always recommended to use the @columnTagged('tag name') function instead of hard coding column names).

In the case where an error such as this occurs with a global data policy, the lockout policy will kick in. The lockout policy is a row-level policy that blocks any rows from returning for any users. This may seem extreme, but since Immuta does not know how to apply the policy, the lockout policy avoids data leaks until the policy is edited to work correctly.
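The lockout behavior can be pictured with the following sketch, in which a policy that references a column missing from the table returns no rows for anyone. The function and data structures are hypothetical.

```python
def enforce_row_policy(rows, table_columns, policy_column, allowed_values):
    """Filter rows by policy_column, or lock out everyone if the column is missing."""
    if policy_column not in table_columns:
        return []  # lockout: the policy cannot be applied as written
    return [row for row in rows if row[policy_column] in allowed_values]


rows = [{"region": "East", "amount": 10}, {"region": "West", "amount": 20}]
columns = ["region", "amount"]

applied = enforce_row_policy(rows, columns, "region", {"East"})
locked_out = enforce_row_policy(rows, columns, "SOME_COLUMN", {"East"})
```

Returning nothing is the safe default here: showing all rows would leak data that the policy author intended to restrict.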

Custom Data Policy Certifications

When building a global data policy, governors can create custom certifications, which must then be acknowledged by data owners when the policy is applied to data sources.

For example, data governors could add a custom certification that states that data owners must verify that tags have been added correctly to their data sources before certifying the policy.

When a global data policy with a custom certification is cloned, the certification is also cloned. If the user who clones the policy and custom certification is not a governor, the policy will only be applied to data sources that user owns.

Orchestrated Masking Policies

Private preview

This feature is only available to select accounts. Contact your Immuta representative to enable this feature.

Orchestrated masking policies (OMP) reduce conflicts between masking policies that apply to a single column, allowing policies to scale more effectively across your organization. OMP also fosters distributed data stewardship: policy authors who share responsibility for a data set can protect it, while data consumers acting under various roles or purposes can still access the data.

When multiple masking policies apply to a column, Immuta combines the exception conditions of the masking policies so that data subscribers can access the data when they satisfy any one of those exception conditions. Multiple masking policies will be enforced on a column if the following conditions are true:

Policies use the same masking type.
Policies use the for everyone except condition.

Requirements

Databricks Spark or Starburst (Trino) integration

Supported masking policy types

OMP supports the following masking types:

Constant
Hashing
Format preserving masking
Null
Regex
Rounding

Global policy logic

Previous policy logic

Governors can apply policies to all columns in a data source or target specific columns with tags or a regular expression. Without orchestrated masking policies enabled, when multiple global policies apply to the same columns, Immuta could only apply one of those policies.

Consider the following example to examine how policies behaved when one tag is used in two different policies:

Mask PII Global Policy 1: Mask using hashing the value in columns tagged email except when user is acting under the purpose Email Campaign.
Mask PII Global Policy 2: Mask using hashing the value in columns tagged email except when user is acting under purpose Marketing.

For columns tagged email, only one of these policies is enforced. The Mask PII Global Policy 2 is not applied to the data source, so Immuta is not enforcing the masking policy properly for users who should be able to see emails because they are acting under the Marketing purpose.

Consider the following example where multiple masking policies apply to columns that have multiple tags, resulting in one policy applying:

Global Policy 3: Mask using hashing the value in columns tagged Employee Data unless users are acting under the purpose Retention Analysis.
Global Policy 4: Mask using hashing the value in columns tagged HR Data unless users are acting under the purpose Employee Satisfaction Survey.

If a column is tagged Employee Data and HR Data, Immuta will only apply one of the policies.

Orchestrated masking policy logic

With orchestrated masking policies, Immuta applies multiple global masking policies that apply to a single column by combining the policy exceptions with OR. For these policies to combine, the masking type must be identical and the policy must use the for everyone except condition.

Consider the following example, in which both policies will apply to the data source:

Mask PII Global Policy 1: Mask using hashing the value in columns tagged email except when user is acting under the purpose Email Campaign.
Mask PII Global Policy 2: Mask using hashing the value in columns tagged email except when user is acting under purpose Marketing.

Users acting under the purpose Marketing or Email Campaign will be able to see emails in the clear.

However, in the following example, only one of these policies will apply to the data source because one masks using a constant and the other masks using hashing:

Global Policy 5: Mask using the constant REDACTED the value in columns tagged Employee Data unless users are acting under the purpose Retention Analysis.
Global Policy 6: Mask using hashing the value in columns tagged HR Data unless users are acting under the purpose Employee Satisfaction Survey.
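The combination rule can be sketched as follows. This is an illustration of the logic described above, not Immuta's implementation; the `MaskingPolicy` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Set


@dataclass(frozen=True)
class MaskingPolicy:
    name: str
    masking_type: str                    # for example "hashing" or "constant"
    everyone_except: bool                # True when the policy uses "for everyone except"
    exception_purposes: FrozenSet[str]   # purposes that are exempt from masking


def can_combine(policies: Sequence[MaskingPolicy]) -> bool:
    """Policies combine only if the masking type is identical and all use 'for everyone except'."""
    same_type = len({p.masking_type for p in policies}) == 1
    return same_type and all(p.everyone_except for p in policies)


def sees_clear_value(policies: Sequence[MaskingPolicy], user_purposes: Set[str]) -> bool:
    """With combined policies, meeting any one exception (OR) exposes the value."""
    if not can_combine(policies):
        raise ValueError("policies cannot be combined; only one of them would apply")
    return any(p.exception_purposes & user_purposes for p in policies)


policy_1 = MaskingPolicy("Mask PII Global Policy 1", "hashing", True, frozenset({"Email Campaign"}))
policy_2 = MaskingPolicy("Mask PII Global Policy 2", "hashing", True, frozenset({"Marketing"}))
policy_5 = MaskingPolicy("Global Policy 5", "constant", True, frozenset({"Retention Analysis"}))
policy_6 = MaskingPolicy("Global Policy 6", "hashing", True, frozenset({"Employee Satisfaction Survey"}))

marketing_sees_email = sees_clear_value([policy_1, policy_2], {"Marketing"})
mixed_types_combine = can_combine([policy_5, policy_6])
```

Policies 1 and 2 combine because both hash and both use for everyone except, while policies 5 and 6 do not because their masking types differ.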

Limitations

No UI enhancements were made in this release. Multiple masking policies applied to the same column are visible on a data source, but there is no indication that the exceptions are combined with OR.
Masking types must match exactly for the policies to be combined. For example, both policies must mask using rounding.
Existing policies will not automatically migrate to the new policy logic when you enable the feature. To re-compute existing policies with the new logic, you must manually trigger global policy changes by staging and re-enabling each policy.

Data Policy Types

Once a user is subscribed to a data source, the data policies that are applied to that data source determine what data the user sees.

Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced:

exclusionary condition example: Mask using hashing values in columns tagged PII on all data sources for everyone except users in the group AUDIT.
inclusionary condition example: Only show rows where user is a member of a group that matches the value in the column tagged Department.

Policy support

Integration support matrix

Certain policies are not supported, or supported with caveats*, depending on the integration:

*Supported with Caveats:

On Databricks data sources, joins will not be allowed on data protected with replace with NULL/constant policies.
Snowflake k-anonymization: This policy type is only supported if you are using the query engine, which is disabled by default. Reach out to your Immuta representative if you need to enable this policy type for your account.
Starburst (Trino):
- K-anonymization, randomized response, and format preserving masking are only supported if you are using the query engine, which is disabled by default. Reach out to your Immuta representative if you need to enable these policy types for your account.
- The Immuta function @iam for WHERE clause policies can block the creation of views.

Policy types

Inclusionary policies

For all policies except , inclusionary logic allows governors to vary policy actions with an Otherwise clause.

For example, governors could mask values using hashing for users acting under a specified purpose while making those same values null for everyone else who accesses the data.

This variation can be created by selecting for everyone who when available from the condition dropdown menus and then completing the Otherwise clause.
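The effect of an Otherwise clause can be sketched as a two-branch function. This is illustrative only: SHA-256 stands in for whatever hashing Immuta applies, and `mask_value` and its parameters are hypothetical names.

```python
import hashlib
from typing import Optional, Set


def mask_value(value: str, user_purposes: Set[str], required_purpose: str) -> Optional[str]:
    """Hash for users acting under the purpose; otherwise return null (None)."""
    if required_purpose in user_purposes:
        # "for everyone who" branch: mask using hashing (SHA-256 is an assumption here).
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    # Otherwise branch: mask by making the value null.
    return None


hashed = mask_value("jane@example.com", {"Email Campaign"}, "Email Campaign")
nulled = mask_value("jane@example.com", set(), "Email Campaign")
```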

Limit to purpose policies

Purposes help define the scope and use of data within a project and allow users to meet . Governors create and manage purposes and their sub-purposes, which project owners then add to their project(s) and use to drive Data Policies.

Purposes can be constructed as a hierarchy, meaning that purposes can contain nested sub-purposes, much like in Immuta. This design allows more flexibility in managing purpose-based restriction policies and transparency in the relationships among purposes.

For example, if the purpose Research included Marketing, Product, and Onboarding as sub-purposes, a governor could write the following global policy:

Limit usage to purpose(s) Research for everyone on data sources tagged PHI.

This hierarchy allows you to create this as a single purpose instead of creating separate purposes, which must then each be added to policies as they evolve.
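A rough sketch of how such a hierarchy could be evaluated is shown below. It assumes that a user acting under a sub-purpose satisfies a policy written against the parent purpose; that assumption, the `SUB_PURPOSES` mapping, and both function names are for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy taken from the example: Research contains three sub-purposes.
SUB_PURPOSES: Dict[str, List[str]] = {"Research": ["Marketing", "Product", "Onboarding"]}


def covered_purposes(purpose: str, hierarchy: Dict[str, List[str]]) -> Set[str]:
    """Return the purpose together with all of its nested sub-purposes."""
    covered = {purpose}
    for child in hierarchy.get(purpose, []):
        covered |= covered_purposes(child, hierarchy)
    return covered


def satisfies(policy_purpose: str, acting_purpose: str, hierarchy: Dict[str, List[str]]) -> bool:
    """Assumption: acting under a sub-purpose satisfies a policy on its parent."""
    return acting_purpose in covered_purposes(policy_purpose, hierarchy)


marketing_allowed = satisfies("Research", "Marketing", SUB_PURPOSES)
sales_allowed = satisfies("Research", "Sales", SUB_PURPOSES)
```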

Please refer to the for a tutorial on purpose-based restrictions on data.

Masking policies

Masking policies hide values in data, providing various levels of utility while still preserving privacy.

Hashing

Hashed values are different across data sources, so you cannot join on hashed values unless you . Immuta prevents joins on hashed values to protect against link attacks where two data owners may have exposed data with the same masked column (a quasi-identifier), but their data combined by that masked value could result in a sensitive data leak.

Replace with NULL

This policy makes values null, removing any utility of the data the policy applies to.

Replace with constant

With this policy, you can replace the values with a constant of your choice, such as 'Redacted', removing any utility of that data.

Regular expression (regex)

Mask using a regex \d+$ the value in the columns ip_address for everyone.

In this case, the regular expression \d+$ breaks down as follows:

\d matches a digit (equal to [0-9]).

+ is a quantifier that matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any).
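To see what this pattern matches, the following sketch applies it with Python's `re` module. The replacement string `XXX` and the sample IP address are assumptions made for illustration; the policy text above does not specify them.

```python
import re

# re.ASCII restricts \d to [0-9], matching the description of the pattern.
TRAILING_DIGITS = re.compile(r"\d+$", re.ASCII)


def mask_trailing_digits(value: str, replacement: str = "XXX") -> str:
    """Replace the run of digits at the end of the string, if there is one."""
    return TRAILING_DIGITS.sub(replacement, value)


masked_ip = mask_trailing_digits("192.168.1.42")
unchanged = mask_trailing_digits("no digits at the end")
```

Only the final group of digits is replaced, so the earlier octets of the address stay visible.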

Rounding

With reversibility

This option masks the values using hashing, but allows users to submit an to users who meet the exceptions of the policy.

Note: The user receiving the unmasking request must send the unmasked value to the requester.

With format preserving masking

This option masks the value, but preserves the length and type of the value.

This option also allows users to submit an to users who meet the exceptions of the policy.

Custom function

This option uses functions native to the underlying database to transform the column.

Limitations

The masking functions are executed against the remote database directly. A poorly written function could lead to poor quality results, data leaks, and performance hits.
Using custom functions can result in changes to the original data type. To prevent query errors, you must cast the result back to the original type.
The function must be valid for the data type of the selected column. If it is not:
- Local policies will error and show a message that the function is not valid.
- Global policies will error and change to the default masking type (hashing for text and NULL for all others).

Conditionally masking

Note: When building conditional masking policies with custom SQL statements, avoid using a column that is masked using in the SQL statement, as this can lead to different behavior depending on your data platform and may produce results that are unexpected.

With k-anonymization

Sample data is processed during computation of k-anonymization policies

The location of the metadata database depends on your deployment:

Self-managed Immuta deployment: The metadata database is located in the server where you have your external metadata database deployed.
SaaS Immuta deployment: The metadata database is located in the AWS global segment you have chosen to deploy Immuta.

To ensure this process does not violate your organization's data localization regulations, you need to first activate this masking policy type before you can use it in your Immuta tenant. To enable k-anonymization for your account, see the .

Note: The default cardinality cutoff for columns to qualify for k-anonymization is 500. For details about adjusting this setting, navigate to the .

Masking multiple columns with k-anonymization

Governors can write global data policies using k-anonymization in the global data policy builder.

When this global policy is applied to data sources, it will mask all columns matching the specified tag.

Applying k-anonymization over disjoint sets of columns in separate policies does not guarantee k-anonymization over their union.

For example, if Policy A

Policy A: Mask with k-anonymization the values in the columns gender and state requiring a group size of at least 2 for everyone

was applied to this data source

Gender

State

the values would be masked like this:

Gender

State

Note: Selecting many columns to mask with k-anonymization increases the processing that must occur to calculate the policy, so saving the policy may take time.

However, if you select to mask the same columns with k-anonymization in separate policies, Policy C and Policy D,

Policy C: Mask with k-anonymization the values in the column gender requiring a group size of at least 2 for everyone
Policy D: Mask with k-anonymization the values in the column state requiring a group size of at least 2 for everyone

the values in the columns will be masked separately instead of as groups. Therefore, the values in that same data source would be masked like this:

Gender

State
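The difference between masking columns together and masking them separately can be illustrated with a simplified, suppression-based sketch. This is not Immuta's algorithm: `k_anonymize` is a hypothetical helper that nulls values whose combination appears fewer than k times, and the sample rows are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def k_anonymize(rows: List[Row], columns: Sequence[str], k: int) -> List[Row]:
    """Null out `columns` in rows whose combination of values occurs fewer than k times."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    result = []
    for row in rows:
        masked = dict(row)
        if counts[tuple(row[c] for c in columns)] < k:
            for c in columns:
                masked[c] = None
        result.append(masked)
    return result


# Invented sample data, for illustration only.
rows = [
    {"gender": "Female", "state": "Ohio"},
    {"gender": "Female", "state": "Ohio"},
    {"gender": "Male", "state": "Ohio"},
    {"gender": "Male", "state": "Texas"},
]

# One policy over both columns: groups are formed on (gender, state) pairs.
joint = k_anonymize(rows, ["gender", "state"], k=2)

# Two separate policies: each column is checked on its own.
separate = k_anonymize(k_anonymize(rows, ["gender"], k=2), ["state"], k=2)
```

In the separate case, the third row still shows the unique pair Male and Ohio, so the union of the two columns is not 2-anonymous even though each column is on its own.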

Using randomized response

This policy masks data by slightly randomizing the values in a column, while preventing outsiders from inferring content of specific records.

For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers and apply to indirect identifiers to make it difficult to single out individuals. However, consider these survey participants, a cohort of male welders who share the same zip code:

participant_id

zip_code

gender

occupation

substance_abuse

How the randomization works

For string data, the random number generator essentially flips a biased coin. If the coin comes up tails, which it does with a frequency equal to the replacement rate, the value is changed to any other possible value in the column, selected uniformly at random from among those values. If the coin comes up heads, the true value is released.
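The coin flip described above can be sketched as follows. This is a simplified illustration rather than Immuta's implementation; `randomized_response` and its parameters are hypothetical names.

```python
import random
from typing import List, Sequence


def randomized_response(
    values: Sequence[str], replacement_rate: float, rng: random.Random
) -> List[str]:
    """Release each value, or with probability `replacement_rate` a different column value."""
    distinct = sorted(set(values))
    released = []
    for true_value in values:
        others = [v for v in distinct if v != true_value]
        # "Tails": swap in another possible value, chosen uniformly at random.
        if others and rng.random() < replacement_rate:
            released.append(rng.choice(others))
        else:
            # "Heads": release the true value.
            released.append(true_value)
    return released


answers = ["yes", "no", "no", "yes", "no"]
never_replaced = randomized_response(answers, 0.0, random.Random(0))
always_replaced = randomized_response(answers, 1.0, random.Random(0))
```

A replacement rate between 0 and 1 leaves most values intact while making any single released value deniable.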

Preserving data utility

Additionally, randomized response gives deniability of record content, not of dataset participation, so individual rows can be displayed.

Mixing masking policies on the same column

Complex data types: masking fields within struct columns (public preview)

Feature limitations

Masking struct and array columns is only available for Databricks data sources.
Immuta only supports Parquet and Delta table types.

Spark supports a class of data types called complex types, which can represent multiple data values in a single column. Immuta supports masking fields within array and struct columns:

array: an ordered collection of elements
struct: a collection of elements that are primitive or complex types

After Complex Data Types is enabled on the , the column type for struct columns for new data sources will display as struct in the Data Dictionary. (For data sources that are already in Immuta, users can edit the data source and change the column types for the appropriate columns from jsonb to struct.) Once struct fields are available, they can be searched, tagged, and used in masking policies. For example, a user could tag name, ssn, and street as PII instead of the entire patient column.

After a global or local policy masks the columns containing PII, users who do not meet the exception specified in the policy will see these values masked:

Note: Immuta uses the > delimiter, rather than the . delimiter, to indicate that a field is nested, since field and column names could include a period (.).

Caveats

Struct Columns with Many Fields

If users have struct columns with many fields, they will need to either

create the data source against a cluster running Spark 3 or
add spark.debug.maxToStringFields 1000 to their Spark 2 cluster's configuration.

struct<name:string,ssn:string,age:int,address:struct<city:string,state:string,zipCode:string,street:text>>

Immuta then parses this string into the following format for the data source's dictionary:

{
  dataType: 'struct',
  children: [
    {
      name: 'name',
      dataType: 'text'
    },
    {
      name: 'ssn',
      dataType: 'text'
    },
    {
      name: 'age',
      dataType: 'integer'
    },
    {
      name: 'address',
      dataType: 'struct',
      children: [
        {
          name: 'city',
          dataType: 'text'
        },
        {
          name: 'state',
          dataType: 'text'
        },
        {
          name: 'zipCode',
          dataType: 'text'
        },
        {
          name: 'street',
          dataType: 'text'
        },
      ]
    }
  ]
}
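The parsing step can be approximated with a small recursive function. This is a sketch under stated assumptions, not Immuta's parser: `parse_type` is a hypothetical name, and the `TYPE_NAMES` mapping (string to text, int to integer) is inferred from the example above.

```python
from typing import Any, Dict, List, Tuple

# Assumed mapping from Spark type names to dictionary type names.
TYPE_NAMES = {"string": "text", "int": "integer"}


def parse_type(text: str, pos: int = 0) -> Tuple[Dict[str, Any], int]:
    """Parse one type starting at `pos`; return (node, position just after the type)."""
    if text.startswith("struct<", pos):
        pos += len("struct<")
        children: List[Dict[str, Any]] = []
        while text[pos] != ">":
            colon = text.index(":", pos)
            name = text[pos:colon]
            child, pos = parse_type(text, colon + 1)
            children.append({"name": name, **child})
            if text[pos] == ",":
                pos += 1  # skip the separator between fields
        return {"dataType": "struct", "children": children}, pos + 1
    # Primitive type: read up to the next ',' or '>' (or the end of the string).
    end = pos
    while end < len(text) and text[end] not in ",>":
        end += 1
    raw = text[pos:end]
    return {"dataType": TYPE_NAMES.get(raw, raw)}, end


schema = (
    "struct<name:string,ssn:string,age:int,"
    "address:struct<city:string,state:string,zipCode:string,street:text>>"
)
dictionary, consumed = parse_type(schema)
```

The sketch handles only nested structs and primitive types; array types and malformed input are out of scope.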

External masking

Deprecation notice: Support for this feature has been deprecated.

This feature allows Immuta to unmask data that is masked at rest in a remote database using a customer-provided encryption or masking algorithm. To do so,

System Administrators build their own custom logic and security in an . Because Immuta always pushes down the masked version of the literal when the user is exempt from the policy, the organization should use deterministic IVs/salt to ensure the same value is masked consistently throughout the data.
System Administrators give Immuta access to the that will be used by data owners to indicate that data is masked at rest in the remote database.
Data owners apply these tags to columns that are masked (with encryption or another algorithm) in the remote database.
Data owners or governors that allow Immuta to reach out to this external REST service to unmask data according to the specifications in the policy.

Immuta’s External Masking feature expects data to be masked at rest by an external tool consistently on a per-cell basis in the remote database. Immuta then provides policy-based unmasking (and additional masking on top of this using ).

Unmasking process

Immuta will only unmask externally masked data if two conditions are met:

A masking policy is applied against that tagged column.
The querying user is exempt from that policy.

The social_security_number column is masked on-ingest and has the tag externally_masked_data applied to it.
This masking policy is applied to the data source in Immuta: Mask using hashing the values in the column tagged externally_masked_data except for users who belong to the group view_masked_values.
The querying user belongs to the view_masked_values group.
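The two conditions reduce to a simple conjunction, sketched below using the tag and group names from the example. The function `should_unmask` and its parameters are hypothetical and only illustrate the decision.

```python
from typing import Set


def should_unmask(
    column_tags: Set[str], policy_target_tag: str, user_groups: Set[str], exempt_group: str
) -> bool:
    """Unmask only if a masking policy targets the tagged column and the user is exempt."""
    policy_applies = policy_target_tag in column_tags
    user_is_exempt = exempt_group in user_groups
    return policy_applies and user_is_exempt


exempt_user = should_unmask(
    {"externally_masked_data"}, "externally_masked_data",
    {"view_masked_values"}, "view_masked_values",
)
other_user = should_unmask(
    {"externally_masked_data"}, "externally_masked_data",
    {"analysts"}, "view_masked_values",
)
```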

Use equality queries only

Tutorials

To configure External Masking, see the .
For an implementation guide, see the .

Row-level security policies

These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.

Note: When building row-level policies with custom SQL statements, avoid using a column that is masked using in the SQL statement, as this can lead to different behavior depending on whether you’re using Spark or Snowflake and may produce unexpected results.

Matching

For example, to restrict access to insurance claims data to the state in which the user's home office is located, you could build a policy such as this:

Only show rows where user possesses an attribute in Office Location that matches the value in the column State for everyone except when user is a member of group Legal.
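The logic of this policy can be sketched as a per-row check. The function and parameter names are hypothetical; only the attribute, column, and group names come from the policy above.

```python
from typing import Set


def row_visible(user_groups: Set[str], office_locations: Set[str], row_state: str) -> bool:
    """Show a row if the user is in Legal, or if an Office Location value matches State."""
    if "Legal" in user_groups:
        # Exception: members of group Legal see every row.
        return True
    return row_state in office_locations


analyst_sees_ohio = row_visible({"Analysts"}, {"Ohio"}, "Ohio")
analyst_sees_texas = row_visible({"Analysts"}, {"Ohio"}, "Texas")
legal_sees_texas = row_visible({"Legal"}, {"Ohio"}, "Texas")
```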

WHERE clause policy

Only show rows where public.us.taxis.passenger_count <2 for everyone except when user is a member of group Admins.

You can put any valid SQL in the policy. See the for a list of custom functions.
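The effect of such a policy can be approximated with an in-memory SQLite table. This is only a sketch of the filtering behavior: the `taxis` table, its sample rows, and the `visible_rows` helper are invented for illustration, and Immuta enforces the policy in the data platform rather than in application code.

```python
import sqlite3
from typing import List, Set, Tuple


def visible_rows(conn: sqlite3.Connection, user_groups: Set[str]) -> List[Tuple[int, int]]:
    """Apply the WHERE clause for everyone except members of group Admins."""
    query = "SELECT trip_id, passenger_count FROM taxis"
    if "Admins" not in user_groups:
        query += " WHERE passenger_count < 2"
    return conn.execute(query + " ORDER BY trip_id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxis (trip_id INTEGER, passenger_count INTEGER)")
conn.executemany("INSERT INTO taxis VALUES (?, ?)", [(1, 1), (2, 3), (3, 1)])

analyst_rows = visible_rows(conn, {"Analysts"})
admin_rows = visible_rows(conn, {"Admins"})
```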

WHERE clause policy requirement

Time-based restrictions

The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources.

Minimization

Best practice: row count

Immuta recommends you use a table with over 1,000 rows for the best results when using a data minimization policy.

Masked columns as input for row-level policies

Public preview: This feature is currently in public preview and available to all accounts.

If a global masking policy applies to a column, you can still use that masked column in a global row-level policy.

Consider the following policy examples:

Masking policy: Mask values in columns tagged Country for everyone except users in group Admin.
Row-level policy: Only show rows where user possesses an attribute in OfficeLocation that matches the value in column tagged Country for everyone.

Limitations

This feature is only available for and integrations.
This feature is only supported for , not local data policies.

New column added policy

This policy pairs with to mask newly added columns to data sources until data owners review and approve these changes from the requests tab of their profile page.

When this policy is activated by a governor, it will automatically be enforced on data sources that have the New tag applied to them by .

To learn how to activate this policy, navigate to the .
