
Policies in Immuta

Audience: Data Owners and Governors

Content Summary: Policies in Immuta are managed and applied to data sources and projects by Data Owners and Governors to restrict access to data. This section contains tutorials for creating and managing Global Policies and Local Policies for both of these user types.

This page outlines the types of policies users can create and manage in Immuta.

Global and Local Policies

Global Policies are created by Data Governors and apply to all data sources across an organization. In contrast, Local Policies can be created by Data Owners or Data Governors and apply to a specific data source.

Global and Local Policies each contain two categories: Subscription Policies and Data Policies.

Restricted Global Policies

Data Owners who are not Governors can write Restricted Global Policies for data sources that they own. With this feature, Data Owners have higher-level policy controls and can write and enforce policies on multiple data sources simultaneously, eliminating the need to write redundant Local Policies on data sources.

Unlike Global Policies, the application of these policies is restricted to the data sources owned by the users or groups specified in the policy and will change as users' ownerships change.

Subscription Policies

Video Tutorial: Subscription Policies

To access a data source, Immuta users must first be subscribed to that data source. A Subscription Policy determines who can request access and has one of four possible restriction levels:

  • Anyone: Users will automatically be granted access (Least Restricted).
  • Anyone Who Asks (and is Approved): Users will need to request access and be granted permission by the configured approvers (Moderately Restricted).
  • Users with Specific Groups/Authorizations: Only users with the specified groups/authorizations will be able to see the data source and subscribe (Moderately Restricted).
  • Individual Users You Select: The data source will not appear in search results; data owners must manually add/remove users (Most Restricted).

See Managing Users and Groups in a Data Source for details on managing Data Users.

Combining Global Subscription Policies

In some cases, multiple Global Subscription Policies created by a Data Governor may apply to a single data source. Rather than having the two policies conflict, the conditions of the Subscription Policies are combined using complex boolean logic, as illustrated in the example below.

Consider the following two Global Subscription Policies created by a Data Governor:

Sub 1: Allow users to subscribe when user is a member of group Legal on data sources tagged PII.SSN

Sub 2: Allow users to subscribe when user is a member of group Medical Claims on data sources tagged PII.SSN and tagged PII.DOB

If a Data Owner creates a data source and applies both the PII.SSN and PII.DOB tags, both of these Global Subscription Policies will apply. Instead of having a conflict, the Subscription Policies are combined:

Sub Policy Combined

In this example, users must be members of both the Legal and Medical Claims groups to subscribe to Demo Data Source 3, which contains the PII.SSN and PII.DOB tags.
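The combination described above can be sketched in a few lines of Python. This is only an illustration of the AND-style merge shown in the example, not Immuta's implementation; the SubscriptionPolicy class, its fields, and the choice to deny access when no policy applies are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubscriptionPolicy:
    """Hypothetical model of a Global Subscription Policy (not an Immuta type)."""
    required_group: str       # group a user must belong to
    required_tags: frozenset  # tags a data source must carry for the policy to apply


def applicable(policies, source_tags):
    """Return the policies whose required tags are all present on the data source."""
    return [p for p in policies if p.required_tags <= set(source_tags)]


def can_subscribe(user_groups, policies, source_tags):
    """Combine every applicable policy with AND, as in the example above."""
    matching = applicable(policies, source_tags)
    # Assumption: when no policy applies, this sketch denies access.
    return bool(matching) and all(p.required_group in user_groups for p in matching)


sub1 = SubscriptionPolicy("Legal", frozenset({"PII.SSN"}))
sub2 = SubscriptionPolicy("Medical Claims", frozenset({"PII.SSN", "PII.DOB"}))
tags = {"PII.SSN", "PII.DOB"}

print(can_subscribe({"Legal"}, [sub1, sub2], tags))                    # False
print(can_subscribe({"Legal", "Medical Claims"}, [sub1, sub2], tags))  # True
```

On a data source tagged only PII.SSN, only Sub 1 applies in this sketch, so membership in Legal alone would be enough.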

Once enabled on a data source, Global Subscription Policies can be edited and disabled by Data Owners. See the Local Policy Builder Tutorial for instructions.

Data Policies

Once a user is subscribed to a data source, the Data Policies that are applied to that data source determine what data the user sees. Data Policy types include masking, row redaction, differential privacy, and limiting to purpose.

Masking Policies

Masking policies hide values in data. They provide various levels of utility while still preserving data privacy. To create masking policies on object-backed data sources, you must create data dictionary entries, and the data format must be csv, tsv, or json.

Hashing

Local Masking Policy Video Tutorial: Hashing

This policy hashes the values with an irreversible SHA-256 hash. The hash is consistent for the same value throughout the data source, so you can count or track specific values without knowing the true raw value. The hash is unique per user, but consistent for that user within the data source. In other words, a user cannot share a hashed value with other users in a meaningful way, but can count and track it within the data source.

Hashed values differ across data sources, so you cannot join on hashed values. This protects against link attacks, in which two data owners have each exposed data with the same masked column (a quasi-identifier) and combining their data on that masked value could result in a sensitive data leak. However, joining on masked values can be enabled in Projects, if desired. Hashing is the default masking policy when you are not using the advanced masking policies listed below.
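The properties described above (consistent within a data source, different per user, different across data sources) can be illustrated with a salted SHA-256 hash. This is a minimal sketch under assumptions: the function name, the identifiers, and the way the user and data source are mixed into the hash are invented for illustration and do not describe how Immuta derives its hashes.

```python
import hashlib


def mask_hash(value, user_id, data_source_id):
    """SHA-256 hash of a value, salted per user and per data source.

    The salt layout (data source, user, value joined with '|') is an
    assumption made for this sketch.
    """
    material = f"{data_source_id}|{user_id}|{value}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


# Same user, same data source: the same value always hashes the same way,
# so the user can still count or group by the masked column.
a = mask_hash("123-45-6789", user_id="alice", data_source_id="claims")
b = mask_hash("123-45-6789", user_id="alice", data_source_id="claims")
print(a == b)  # True

# A different user or a different data source yields an unrelated hash,
# so masked values cannot be shared or joined on.
print(a == mask_hash("123-45-6789", user_id="bob", data_source_id="claims"))      # False
print(a == mask_hash("123-45-6789", user_id="alice", data_source_id="patients"))  # False
```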

Replace with Null

Local Masking Policy Video Tutorial: Replace with Null

Make all the values in the column null, removing any utility of this column.

Replace with Constant

Local Masking Policy Video Tutorial: Replace with a Constant

Replace all the values in the column with the same constant value you choose, such as 'Redacted', removing any utility of this column.

Regular Expression (regex)

This is similar to replacing with a constant, but it provides more utility because you can retain portions of the true value. For example, you could mask the final digits of an IP address with the following regex rule:

Regex IP example

In this case, the regular expression is \d+$, where

  • \d matches a digit (equal to [0-9]).
  • + is a quantifier that matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
  • $ asserts the position at the end of the string, or before the line terminator right at the end of the string (if any).

This ensures the rule captures the last digit(s) after the last . in the IP address. You can then enter the replacement for the captured text, which in this case is XXX, so the outcome of the policy looks like this: 164.16.13.XXX
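To see the same rule outside of the policy builder, it can be reproduced with Python's re module. The function name and the input address 164.16.13.12 are hypothetical (only the masked result appears in the example above); the pattern and the replacement are the ones from the text.

```python
import re


def mask_last_octet(ip_address, replacement="XXX"):
    """Replace the trailing run of digits (the final octet) with the replacement."""
    return re.sub(r"\d+$", replacement, ip_address)


print(mask_last_octet("164.16.13.12"))  # 164.16.13.XXX
```

Because \d+ is anchored by $, only the digits at the very end of the string are replaced; the first three octets keep their true values.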

Rounding

This technique hides precision from numeric values while providing more utility than simply hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.

Some may know this as k-anonymization; however, the Immuta platform does not provide guarantees of k. You can estimate k yourself by providing your own precision width. A high potential for a link attack always remains when leveraging k-anonymization or l-diversity. If you wish to retain guarantees of privacy, we suggest you use the differential privacy policy explained below.
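As a rough illustration of what a precision width does, the sketch below snaps numbers to fixed-width buckets and truncates timestamps to a coarser unit. The function names are hypothetical, and the sketch floors values to the start of their bucket rather than rounding to the nearest one, which is an assumption and may differ from how Immuta rounds.

```python
import math
from datetime import datetime


def round_to_width(value, width):
    """Snap a number to the lower edge of its bucket of the given width."""
    return math.floor(value / width) * width


def truncate_datetime(ts, unit):
    """Drop all precision finer than the given unit (hour, day, month, or year)."""
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


print(round_to_width(47, 10))        # 40
print(round_to_width(38.8977, 0.5))  # 38.5
print(truncate_datetime(datetime(2021, 3, 14, 15, 9, 26), "hour"))  # 2021-03-14 15:00:00
```

A wider bucket places more rows in each group, which is how you would raise your own estimate of k.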

With Reversibility

This option masks the values using hashing, but allows users to submit an unmasking request to users who meet the exceptions of the policy.

Unmask Values

With Format Preserving Masking

This option masks the value, but preserves the length and type of the value, as illustrated in the examples below.

Format Preserving Masking

Format Preserving Masking 2

This option also allows users to submit an unmasking request to users who meet the exceptions of the policy.

Conditionally Masking

For all of the above policies, both at the Local and Global Policy levels, you can conditionally mask the value based on a value in another column. This allows you to build a policy that looks something like "Mask bank account number where country = 'USA'" instead of always masking the bank account number.
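The effect of a conditional mask can be sketched on plain Python dictionaries. The function name, the column names, and the sample rows are hypothetical; the sketch only shows that the mask is applied row by row, depending on another column's value.

```python
def mask_conditionally(row, column, condition, mask):
    """Return a copy of the row with `column` masked only when condition(row) holds."""
    masked = dict(row)
    if condition(row):
        masked[column] = mask(row[column])
    return masked


rows = [
    {"country": "USA", "bank_account": "12345678"},
    {"country": "FRA", "bank_account": "87654321"},
]

# "Mask bank account number where country = 'USA'", here by replacing with null.
result = [
    mask_conditionally(
        row,
        "bank_account",
        condition=lambda r: r["country"] == "USA",
        mask=lambda value: None,
    )
    for row in rows
]
print(result[0]["bank_account"])  # None
print(result[1]["bank_account"])  # 87654321
```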

Mixing Masking Policies on the Same Column

In some cases, you may want several different masking policies on the same column. This is possible through what are called OTHERWISE policies. When building the policy, instead of selecting everyone / everyone except, select everyone who. You then specify who the masking policy applies to and must also select how it applies to everyone else (the OTHERWISE clause). You can add as many "everyone who" phrases as you need; however, you must always have a blanket OTHERWISE at the end.
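One way to picture an OTHERWISE policy is as an ordered list of "everyone who" rules followed by a mandatory fallback. The sketch below assumes the rules are checked from top to bottom and the first match wins; that ordering, the group names, and the function names are assumptions for illustration only.

```python
import hashlib


def hash_mask(value):
    """Mask by hashing."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def null_mask(value):
    """Mask by replacing with null."""
    return None


def unmasked(value):
    """No masking."""
    return value


def mask_for_user(value, user_groups, everyone_who, otherwise):
    """Apply the first 'everyone who' rule the user matches, else the OTHERWISE mask."""
    for group, mask in everyone_who:
        if group in user_groups:
            return mask(value)
    return otherwise(value)


# Hypothetical policy: Fraud investigators see the raw value, Analysts see a
# hash, OTHERWISE the value is replaced with null.
rules = [("Fraud", unmasked), ("Analysts", hash_mask)]

print(mask_for_user("12345678", {"Fraud"}, rules, otherwise=null_mask))      # 12345678
print(mask_for_user("12345678", {"Marketing"}, rules, otherwise=null_mask))  # None
```

Making the fallback a required argument mirrors the rule that a blanket OTHERWISE must always close the policy.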

Row-Level Security Policies

Video Tutorial: Row-level Security Policy

These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.

Matching

These policies match a user attribute with a row/object/file attribute to determine if that row/object/file should be visible. This process uses a direct string match, so the user attribute must exactly match the data attribute for the user to see that row of data.

For example, to restrict access to insurance claims data to the state in which the user's home office is located, you could build a policy such as this:

Row redaction matching example

In this case, the Office Location is retrieved from the identity management system as a user attribute (which can be an authorization or group). If the user's authorization (Office Location) were Missouri, rows containing the value Missouri in the State column in the data source would be the only rows visible to that user.

For object-backed sources, the State can be retrieved from places other than columns, depending on the database. For example, in S3 it is retrieved from the metadata or tags on the S3 object or the folder name. For HDFS it is retrieved from the xattr on the file or the folder name.
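For a query-backed source, the matching rule amounts to a filter on the chosen column. The following sketch uses hypothetical names and sample rows, and it only illustrates the exact, case-sensitive string match described above.

```python
def visible_rows(rows, column, user_attribute_values):
    """Keep only rows whose column value exactly matches one of the user's attribute values."""
    allowed = set(user_attribute_values)
    return [row for row in rows if row[column] in allowed]


claims = [
    {"claim_id": 1, "State": "Missouri"},
    {"claim_id": 2, "State": "Ohio"},
    {"claim_id": 3, "State": "Missouri"},
]

# A user whose Office Location authorization is Missouri sees claims 1 and 3.
print(visible_rows(claims, "State", ["Missouri"]))

# The match is a direct string comparison, so "missouri" matches nothing.
print(visible_rows(claims, "State", ["missouri"]))  # []
```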

WHERE Clause Policy

This policy can be thought of as a table "view" created automatically based on the condition of the policy. For example, in the policy below, users who are not members of the Admins group will only see taxi rides where passenger_count < 2. You can put any valid SQL WHERE clause in the policy.

Where clause example
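The "view" analogy can be made concrete with an in-memory SQLite database. This is only a sketch of the idea: the table, the view name, the sample rows, and the group check in Python are assumptions, and Immuta enforces the condition in its own layer rather than through a view you create.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_rides (ride_id INTEGER, passenger_count INTEGER)")
conn.executemany(
    "INSERT INTO taxi_rides VALUES (?, ?)",
    [(1, 1), (2, 3), (3, 1), (4, 2)],
)

# The policy's WHERE clause becomes the body of a restricted view.
conn.execute(
    "CREATE VIEW taxi_rides_restricted AS "
    "SELECT * FROM taxi_rides WHERE passenger_count < 2"
)


def query_rides(user_groups):
    """Members of Admins read the table; everyone else reads the restricted view."""
    source = "taxi_rides" if "Admins" in user_groups else "taxi_rides_restricted"
    sql = f"SELECT ride_id, passenger_count FROM {source} ORDER BY ride_id"
    return conn.execute(sql).fetchall()


print(query_rides({"Analysts"}))  # [(1, 1), (3, 1)]
print(query_rides({"Admins"}))    # [(1, 1), (2, 3), (3, 1), (4, 2)]
```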

Time-based Restrictions

Video Tutorial: Time-based Policy

These policies restrict access to rows/objects/files that fall within the time restrictions set in the policy. If a data source has time-based restriction policies, queries run against the data source by a user will only return rows/blobs whose event-time column/attribute holds a date within a certain range.

This type of policy can be used for both object-backed and query-backed data sources.

The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources. For S3, it can be retrieved from metadata or a tag on the S3 object, and for HDFS, it is retrieved from the xattr on the file.
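A time-based restriction can be pictured as a filter on the event-time value. The sketch below assumes the policy allows only the most recent window of data (for example, the last 7 days); the function name, the column name, and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta


def rows_within_window(rows, event_time_column, window, now):
    """Keep rows whose event time falls between (now - window) and now, inclusive."""
    cutoff = now - window
    return [row for row in rows if cutoff <= row[event_time_column] <= now]


events = [
    {"id": 1, "event_time": datetime(2024, 1, 30)},
    {"id": 2, "event_time": datetime(2024, 1, 1)},
    {"id": 3, "event_time": datetime(2023, 12, 1)},
]

recent = rows_within_window(
    events, "event_time", window=timedelta(days=7), now=datetime(2024, 1, 31)
)
print([row["id"] for row in recent])  # [1]
```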

Minimization

These policies restrict access to a limited percentage of the data. The data is randomly sampled, but the sample is the same for all users. For example, you could limit certain users to only 10% of the data. The data the user sees will always be the same, but new rows may be added as new data arrives in the system. This policy can only be applied to query-backed data sources.
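One common way to obtain a sample that looks random but is identical for every user is to hash a stable row key and keep the rows whose hash falls below a threshold. The sketch below uses that technique as an assumption; it is not a description of how Immuta selects its sample, and the function name and the use of a row key are hypothetical.

```python
import hashlib


def in_sample(row_key, percent):
    """Deterministically decide whether a row belongs to the sample.

    The row key is hashed into one of 10,000 buckets; the row is kept when
    its bucket falls within the requested percentage.
    """
    digest = hashlib.sha256(str(row_key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


row_keys = range(10_000)
sample = [key for key in row_keys if in_sample(key, 10)]
print(len(sample))  # roughly 1,000 of the 10,000 keys
```

Because the decision depends only on the row key, every user gets the same rows, existing rows never leave the sample, and newly arrived rows join it at about the configured rate.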

Differential Privacy

Video Tutorial: Differential Privacy

Differential privacy provides mathematical guarantees that you cannot pinpoint an individual (row) in the data. This anonymization applies the appropriate noise (if any) to the response based on the sensitivity of the query. For example, "average age" could be changed from 50.5 to 55 at query time. To do this, the Immuta SQL layer restricts queries run on the data to only aggregate queries (AVG, SUM, COUNT, etc.) and prevents very sensitive queries from running at all. This policy type can only be applied to query-backed data sources.
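To give a feel for how noise is added to an aggregate, here is a sketch of the textbook Laplace mechanism applied to a COUNT query. It is a generic illustration, not Immuta's algorithm: the function names, the epsilon parameter, and the sample data are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy the predicate.

    Adding or removing one row changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon. A smaller epsilon means more noise.
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


ages = [23, 35, 41, 52, 67, 70, 29, 58]
rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = private_count(ages, lambda age: age >= 50, epsilon=1.0, rng=rng)
print(noisy)  # a value near the true count of 4
```

Queries with higher sensitivity need proportionally more noise under this mechanism, which is consistent with very sensitive queries being blocked outright.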

We encourage you to read our blog on this topic that dives into details of the theories behind this powerful anonymization technique.

To create this policy, you must select a high-cardinality column in the data. This is typically the primary key column, but could also be a column with many unique values. Selecting a column with fewer than 90% unique values is not recommended, because invalid noise could be added to the responses.

It is also critical that you consider the latency tolerance on the data source when creating this policy. The latency tolerance drives how long differentially private query responses are cached. You should set this window to a length that allows sufficient time for the underlying data to change enough that the same query would return a statistically different result. The caching is done to avoid the privacy budget problem, in which a user asks similar questions consecutively in order to determine the real response.
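The role of the latency tolerance can be sketched as a time-limited cache keyed by the query text: while an entry is fresh, the same noisy answer is replayed instead of drawing new noise. The class and parameter names are hypothetical, and keying on the exact query text is a simplifying assumption.

```python
import time


class NoisyAnswerCache:
    """Caches one noisy answer per query text for a fixed latency tolerance."""

    def __init__(self, latency_tolerance_seconds, clock=time.monotonic):
        self._ttl = latency_tolerance_seconds
        self._clock = clock
        self._entries = {}  # query text -> (time stored, noisy answer)

    def get_or_compute(self, query, compute_noisy_answer):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: replay the same noisy answer
        answer = compute_noisy_answer()
        self._entries[query] = (now, answer)
        return answer


cache = NoisyAnswerCache(latency_tolerance_seconds=3600)
first = cache.get_or_compute("SELECT AVG(age) FROM patients", lambda: 55.0)
second = cache.get_or_compute("SELECT AVG(age) FROM patients", lambda: 48.0)
print(first == second)  # True: the repeated query does not get fresh noise
```

Replaying the cached answer means a user cannot average many differently noised responses to the same question to recover the true value.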

Limit to Purpose

Video Tutorial: Purpose-based Restriction Policy

Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors create and manage purposes and their sub-purposes, which project owners then add to their project(s) and use to drive Data Policies.

Purposes can be constructed as a hierarchy, meaning that purposes can contain nested sub-purposes, much like tags in Immuta. This design allows more flexibility in managing purpose-based restriction policies and transparency in the relationships among purposes.

For example, consider this organization of the sub-purposes of Research:

Sub-Purpose Builder

Instead of creating separate purposes, which must then each be added to policies as they evolve, a Governor could write the following Global Policy:

Limit usage to purpose(s) Research for everyone on data sources tagged PHI.

Now, any user acting under the purpose or sub-purpose of Research (whether Research.Marketing, Research.Onboarding.Customer, or Research.MedicalClaims) will meet the criteria of this policy. Consequently, purpose hierarchies eliminate the need for a Governor to rewrite these Global Policies when sub-purposes are added or removed. Furthermore, if new projects with new Research purposes are added, for example, the relevant Global Policy will automatically be enforced.
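The hierarchy check described above behaves like a prefix match on dot-separated purpose names. The following sketch assumes that naming scheme, as used in the examples; the function name is hypothetical.

```python
def satisfies_purpose(acting_purpose, required_purpose):
    """True when the acting purpose is the required purpose or one of its sub-purposes."""
    return (
        acting_purpose == required_purpose
        or acting_purpose.startswith(required_purpose + ".")
    )


print(satisfies_purpose("Research.Marketing", "Research"))            # True
print(satisfies_purpose("Research.Onboarding.Customer", "Research"))  # True
print(satisfies_purpose("Marketing", "Research"))                     # False
```

Matching on the full "Research." prefix, rather than on the bare string, keeps an unrelated purpose such as ResearchOps from qualifying by accident.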

Please refer to the Data Governor Policy Guide for instructions on purpose-based restrictions on data.

Conditions

For all of the rules above, you must also establish the conditions under which they will be enforced. Immuta allows you to append multiple conditions to the data. Those conditions are based on user attributes, which can be authorizations and groups from your identity management system, or purposes they are acting under via Immuta projects. Note that the authorizations and groups can be retrieved from multiple different identity management systems and applied as conditions to the same policy.

Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced. Immuta has determined the best direction for the condition to avoid inadvertent data leaks.

For example, rather than specifying every user attribute that should see the unmasked value, you instead specify the "special" attribute that is allowed to see the unmasked value, e.g. mask for everyone except. This is exclusionary. There are inclusionary policies, such as row level security matching, where you require that the user attribute matches the data attribute.

One final note on purposes: it's best to think about projects and purposes as additional entitlements for users. In other words, you would use projects to open users up to more data, not restrict them from more.

Data Policy Conflicts

In some cases, two conflicting Global Data Policies may apply to a single data source. When this happens, the policy containing a tag deeper in the hierarchy will apply to the data source to resolve the conflict.

Consider the following Global Data Policies created by a Data Governor:

Data Policy 1: Mask columns tagged PII by making null for everyone on data sources with columns tagged PII

Data Policy 2: Mask columns tagged PII.SSN using hashing for everyone on data sources with columns tagged PII.SSN

If a Data Owner creates a data source and applies the PII.SSN tag, both of these Global Data Policies will apply. Instead of having a conflict, the policy containing a deeper tag in the hierarchy will apply:

Data Policy Conflict

In this example, Data Policy 1 cannot be applied to the data source, because Data Policy 2 contains the deeper tag (PII.SSN). If Data Owners wanted to use Data Policy 1 on the data source instead, they would need to disable Data Policy 2.
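The rule that the policy with the deeper tag wins can be sketched as follows. Tag depth is taken to be the number of dot-separated segments, and a policy on a parent tag is assumed to also match columns carrying a child tag; both are assumptions for illustration, as are the function names and the dictionary layout.

```python
def tag_depth(tag):
    """Depth of a tag in the hierarchy: PII is 1, PII.SSN is 2."""
    return tag.count(".") + 1


def tag_applies(policy_tag, column_tag):
    """A policy tag matches the same tag or any tag nested beneath it."""
    return column_tag == policy_tag or column_tag.startswith(policy_tag + ".")


def winning_policy(policies, column_tags):
    """Among the policies that match a column tag, return the one with the deepest tag."""
    candidates = [
        policy
        for policy in policies
        if any(tag_applies(policy["tag"], tag) for tag in column_tags)
    ]
    return max(candidates, key=lambda policy: tag_depth(policy["tag"]), default=None)


policies = [
    {"rule": "mask by making null", "tag": "PII"},
    {"rule": "mask using hashing", "tag": "PII.SSN"},
]

print(winning_policy(policies, ["PII.SSN"]))
# {'rule': 'mask using hashing', 'tag': 'PII.SSN'}
```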

Once enabled on a data source, Global Data Policies can be edited and disabled by Data Owners. See the Local Policy Builder Tutorial for instructions.