Management of Policies, Projects, and Data Sources

Audience: Data Owners

Content Summary: This page describes Global and Local Policies, projects, and data sources in Immuta.

For step-by-step instructions related to these concepts, navigate to the Managing Policies Tutorial, Managing Projects Tutorial, or Managing Data Sources Tutorial.

Global and Local Policies

Global Policies can only be created by Data Governors and apply to all data sources across an organization. In contrast, Local Policies can be created by Data Owners or Data Governors and apply to a specific data source.

Global and Local Policies each contain two categories: Subscription Policies and Data Policies.

Subscription Policies

To access a data source, Immuta users must first be subscribed to that data source. A Subscription Policy determines who can request access and has one of four possible restriction levels:

  • Anyone: Users will automatically be granted access (Least Restricted).
  • Anyone Who Asks (and is Approved): Users will need to request access and be granted permission by the configured approvers (Moderately Restricted).
  • Users with Specific Groups/Authorizations: Only users with the specified groups/authorizations will be able to see the data source and subscribe (Moderately Restricted).
  • Individual Users You Select: The data source will not appear in search results; data owners must manually add/remove users (Most Restricted).

See Managing Users and Groups in a Data Source for details on managing Data Users.

Data Policies

Once a user is subscribed to a data source, the Data Policies that are applied to that data source determine what data the user sees. Data Policy types include masking, row redaction, differential privacy, and limiting to purpose.

Masking Policies

Masking policies hide values in data. They offer varying levels of utility while still preserving data privacy. To create masking policies on object-backed data sources, you must create data dictionary entries, and the data format must be CSV, TSV, or JSON.

Hashing

Hash the values to an irreversible SHA-256 hash. The hash is consistent for the same value throughout the data source, so you can count or track specific values without knowing the true raw value. The hash is unique per user, but consistent for that user within the data source. In other words, the user will not be able to share the hashed value with other users in a meaningful way, but will be able to count and track it within the data source.

Hashed values differ across data sources, so you cannot join on hashed values. This protects against link attacks, where two data owners may have exposed data with the same masked column (a quasi-identifier) and combining their data on that masked value could result in a sensitive data leak. However, joining on masked values can be enabled in Projects, if desired. Hashing is the default masking policy when you are not using the advanced masking policies listed below.
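The properties described above (consistent for one user within one data source, different across users and across data sources) can be illustrated with a keyed hash. This is a minimal sketch, not Immuta's implementation: it assumes HMAC-SHA-256 with a key scoped to the user and data source, and the names `mask_hash` and `SECRET` are hypothetical.

```python
import hashlib
import hmac


def mask_hash(value: str, user_id: str, data_source_id: str, secret: bytes) -> str:
    """Return a hex digest that is stable for one (user, data source, value)."""
    # Assumption: scope the key to the user and data source so the same raw
    # value produces a different mask for another user or another data source.
    scope = f"{user_id}|{data_source_id}".encode()
    key = hmac.new(secret, scope, hashlib.sha256).digest()
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()


SECRET = b"example-only-secret"  # hypothetical server-side secret

same_1 = mask_hash("123-45-6789", "alice", "claims", SECRET)
same_2 = mask_hash("123-45-6789", "alice", "claims", SECRET)
other_user = mask_hash("123-45-6789", "bob", "claims", SECRET)
other_source = mask_hash("123-45-6789", "alice", "payments", SECRET)

print(same_1 == same_2)        # consistent for one user within one data source
print(same_1 == other_user)    # differs per user
print(same_1 == other_source)  # differs per data source, so joins do not line up
```

Because the masks differ per data source, a join on the masked column finds no matches, which is the link-attack protection described above.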

Replace with Null

Make all the values in the column null, removing any utility of this column.

Replace with constant

Replace all the values in the column with the same constant value you choose, such as 'Redacted', removing any utility of this column.

Regular Expression (regex)

This is similar to replacing with a constant, but provides more utility because you can retain portions of the true value. For example, you could mask the final digits of an IP address with the following regex rule:

Regex IP example

In this case, the regular expression is \d+$, where

  • \d matches a digit (equal to [0-9]).
  • + is a quantifier that matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
  • $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any).

This ensures the expression captures the last digit(s) after the last . in the IP address. You then enter the replacement for what was captured, which in this case is XXX. The outcome of the policy would look like this: 164.16.13.XXX
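The same rule can be tried outside Immuta with Python's `re` module. This is only an illustration of the regular expression itself; the input address (ending in 42) and the function name `mask_trailing_digits` are assumptions made for the example.

```python
import re


def mask_trailing_digits(ip: str, replacement: str = "XXX") -> str:
    """Replace the digits at the end of the string with the replacement."""
    # \d+ matches one or more digits; $ anchors the match at the end.
    return re.sub(r"\d+$", replacement, ip)


masked = mask_trailing_digits("164.16.13.42")
print(masked)  # 164.16.13.XXX
```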

Rounding

This technique hides precision from numeric values while providing more utility than hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.

Some may know this as k-anonymization; however, the Immuta platform does not provide guarantees of k. You can estimate k yourself by providing your own precision width. A high potential for a link attack always remains when leveraging k-anonymization or l-diversity. If you wish to retain guarantees of privacy, we suggest you use the differential privacy policy explained below.
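As a rough illustration of rounding, the sketch below snaps a coordinate to a chosen precision width and rounds a timestamp to the nearest hour. The function names and the width of 0.5 degrees are assumptions for the example, not Immuta settings.

```python
from datetime import datetime, timedelta


def round_coordinate(value: float, width: float) -> float:
    """Snap a numeric value to the nearest multiple of the precision width."""
    return round(value / width) * width


def round_to_hour(ts: datetime) -> datetime:
    """Round a timestamp to the nearest hour (30 minutes or more rounds up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    if ts - floored >= timedelta(minutes=30):
        floored += timedelta(hours=1)
    return floored


lat = round_coordinate(38.8977, 0.5)
lon = round_coordinate(-77.0365, 0.5)
pickup = round_to_hour(datetime(2020, 1, 1, 10, 45))
print(lat, lon, pickup)
```

A wider precision width places more records in the same bucket, which trades utility for privacy.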

With Reversibility

This option masks the values using hashing, but allows users to submit an unmasking request to users who meet the exceptions of the policy.

Unmask Values

With Format Preserving Masking

This option masks the value, but preserves the length and type of the value, as illustrated in the examples below.

Format Preserving Masking

Format Preserving Masking 2

This option also allows users to submit an unmasking request to users who meet the exceptions of the policy.

Conditionally Masking

For all of the above policies, you can conditionally mask the value based on a value in another column. This allows you to build a policy such as "Mask bank account number where country = 'USA'" instead of masking the bank account number unconditionally.
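The example policy above can be pictured as a per-row check. This is a minimal sketch with hypothetical column names (`country`, `bank_account`) and a constant replacement standing in for whichever masking type is chosen.

```python
def mask_bank_account(row: dict) -> dict:
    """Mask bank_account only for rows where country is 'USA'."""
    masked = dict(row)  # copy so the original row is left untouched
    if row.get("country") == "USA":
        masked["bank_account"] = "REDACTED"
    return masked


us_row = mask_bank_account({"country": "USA", "bank_account": "12345678"})
fr_row = mask_bank_account({"country": "France", "bank_account": "87654321"})
print(us_row, fr_row)
```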

Mixing Masking Policies on the Same Column

In some cases you may want several different masking policies on the same column. This is possible through OTHERWISE policies. When building the policy, instead of selecting everyone / everyone except, select everyone who. You then specify who the masking policy applies to and must select how the column is masked for everyone else (the OTHERWISE clause). You can add as many "everyone who" phrases as you need; however, you must always have a blanket OTHERWISE at the end.
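One way to picture this is as an ordered list of rules with a mandatory fallback. The sketch below is illustrative only: the group names, the masking functions, and the first-match-wins evaluation order are assumptions, not documented Immuta behavior.

```python
from typing import Callable, List, Optional, Set, Tuple

MaskFn = Callable[[str], Optional[str]]
Rule = Tuple[Callable[[Set[str]], bool], MaskFn]


def mask_for_user(value: str, groups: Set[str], rules: List[Rule],
                  otherwise: MaskFn) -> Optional[str]:
    """Apply the first 'everyone who' rule that matches, else OTHERWISE."""
    for applies_to, mask in rules:
        if applies_to(groups):
            return mask(value)
    return otherwise(value)  # the blanket OTHERWISE is always required


rules: List[Rule] = [
    (lambda g: "Support" in g, lambda v: "XXXX" + v[-4:]),  # keep last four
    (lambda g: "Marketing" in g, lambda v: "REDACTED"),     # constant
]
otherwise: MaskFn = lambda v: None                           # replace with null

support_view = mask_for_user("123456789", {"Support"}, rules, otherwise)
marketing_view = mask_for_user("123456789", {"Marketing"}, rules, otherwise)
default_view = mask_for_user("123456789", {"Interns"}, rules, otherwise)
print(support_view, marketing_view, default_view)
```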

Row-Level Security Policies

These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.

Matching

These policies match a user attribute with a row/object/file attribute to determine if that row/object/file should be visible. This process uses a direct string match, so the user attribute would have to match exactly the data attribute in order to see that row of data.

For example, to restrict access to insurance claims data to the state in which the user's home office is located, you could build a policy such as this:

Row redaction matching example

In this case, the Office Location is retrieved by the identity management system as a user attribute (which can be an authorization or group). If the user's authorization (Office Location) was Missouri, rows containing the value Missouri in the State column in the data source would be the only rows visible to that user.

For object-backed sources, the State can be retrieved from places other than columns, depending on the database. For example, in S3 it is retrieved from the metadata or tags on the S3 object or the folder name. For HDFS it is retrieved from the xattr on the file or the folder name.
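The matching behavior for a relational source can be sketched as a filter that compares a user attribute to a column value with an exact string match. The function name and data layout below are hypothetical; the sketch only shows why "missouri" would not match "Missouri".

```python
from typing import Dict, List, Set


def visible_rows(rows: List[dict], user_attributes: Dict[str, Set[str]],
                 data_column: str, user_attribute: str) -> List[dict]:
    """Keep rows whose column value exactly equals one of the user's values."""
    allowed = user_attributes.get(user_attribute, set())
    return [row for row in rows if row[data_column] in allowed]


claims = [
    {"claim_id": 1, "State": "Missouri"},
    {"claim_id": 2, "State": "Ohio"},
    {"claim_id": 3, "State": "missouri"},  # different case, so no match
]
user = {"Office Location": {"Missouri"}}

result = visible_rows(claims, user, "State", "Office Location")
print(result)
```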

WHERE Clause Policy

This policy can be thought of as a table "view" created automatically based on the condition of the policy. For example, in the policy below, users who are not members of the Admins group will only see taxi rides where passenger_count < 2. You can put any valid SQL WHERE clause in the policy.

Where clause example
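The "view" analogy can be sketched with an in-memory SQLite table: users outside the exempt group get the policy's WHERE clause appended to their query. This is an illustration under assumed table and column names, not how Immuta rewrites queries; the clause is treated as trusted text written by the policy author.

```python
import sqlite3


def run_with_policy(conn: sqlite3.Connection, user_groups: set,
                    where_clause: str = "passenger_count < 2",
                    exempt_group: str = "Admins") -> list:
    """Return all rides for exempt users, filtered rides for everyone else."""
    sql = "SELECT ride_id, passenger_count FROM taxi_rides"
    if exempt_group not in user_groups:
        sql += f" WHERE {where_clause}"  # policy text, not end-user input
    return conn.execute(sql + " ORDER BY ride_id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_rides (ride_id INTEGER, passenger_count INTEGER)")
conn.executemany("INSERT INTO taxi_rides VALUES (?, ?)", [(1, 1), (2, 2), (3, 4)])

analyst_rows = run_with_policy(conn, {"Analysts"})
admin_rows = run_with_policy(conn, {"Admins"})
print(analyst_rows, admin_rows)
```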

Time Window

These policies restrict access to rows/objects/files that fall within the last x days/hours/minutes. Think of this as a moving window of time: rows of data older than the window are cut off.

The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources. For S3 it can be retrieved from metadata or a tag on the S3 object, and for HDFS it is retrieved from the xattr on the file.
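A moving window of this kind can be sketched as a cutoff computed from the current time. The field name `event_time` and the two-day window are assumptions for the example.

```python
from datetime import datetime, timedelta
from typing import List


def rows_in_window(rows: List[dict], event_time_field: str,
                   window: timedelta, now: datetime) -> List[dict]:
    """Keep only rows whose event time is no older than now - window."""
    cutoff = now - window
    return [row for row in rows if row[event_time_field] >= cutoff]


rows = [
    {"id": 1, "event_time": datetime(2024, 1, 10, 9, 0)},
    {"id": 2, "event_time": datetime(2024, 1, 8, 12, 0)},
    {"id": 3, "event_time": datetime(2024, 1, 1, 0, 0)},
]

today = rows_in_window(rows, "event_time", timedelta(days=2),
                       now=datetime(2024, 1, 10, 12, 0))
tomorrow = rows_in_window(rows, "event_time", timedelta(days=2),
                          now=datetime(2024, 1, 11, 12, 0))
print([r["id"] for r in today], [r["id"] for r in tomorrow])
```

Running the same filter a day later drops row 2, which shows how the window moves forward with the clock.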

Minimization

These policies restrict access to a limited percentage of the data. The data is randomly sampled, but the sample is the same for all users. For example, you could limit certain users to only 10% of the data. The data the user sees will always be the same, but new rows may be added as new data arrives in the system. This policy can only be applied to query-backed data sources.
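A sample that is random yet identical for every user, and that stays stable as new rows arrive, can be obtained by hashing a row key into a bucket. The sketch below is one such approach and is an assumption, not a description of Immuta's sampling; `in_sample` and the salt value are hypothetical.

```python
import hashlib


def in_sample(row_key: str, percent: float, salt: str = "data-source-salt") -> bool:
    """Deterministically decide whether a row belongs to the sample."""
    digest = hashlib.sha256((salt + row_key).encode()).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < percent / 100


keys = [f"row-{i}" for i in range(10_000)]
sample = [k for k in keys if in_sample(k, 10)]

# New rows do not change which of the existing rows are sampled.
more_keys = keys + [f"row-{i}" for i in range(10_000, 12_000)]
bigger_sample = [k for k in more_keys if in_sample(k, 10)]
print(len(sample), len(bigger_sample))
```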

Differential Privacy

Differential privacy provides mathematical guarantees that you cannot pinpoint an individual (row) in the data. This anonymization applies the appropriate noise (if any) to the response based on the sensitivity of the query. For example, “average age” could be changed from 50.5 to 55 at query time. To do this, the Immuta SQL layer restricts queries run on the data to aggregate queries only (AVG, SUM, COUNT, etc.) and prevents very sensitive queries from running at all. This policy type can only be applied to query-backed data sources.
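To make "noise based on the sensitivity of the query" concrete, the sketch below uses the Laplace mechanism, a standard differential privacy technique, on a COUNT query. The source does not say which mechanism Immuta uses, so treat the mechanism, the `epsilon` parameter, and the function names as assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows: list, epsilon: float, rng: random.Random) -> float:
    """Return a noisy COUNT(*); adding or removing one row changes it by 1."""
    sensitivity = 1.0
    return len(rows) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)  # seeded only to make the example repeatable
rows = list(range(100))
noisy = [private_count(rows, epsilon=1.0, rng=rng) for _ in range(2000)]
mean_noisy = sum(noisy) / len(noisy)
print(round(mean_noisy, 1))
```

Each individual answer is perturbed, but the noise averages out over many independent answers, which is why repeated queries have to be controlled.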

We encourage you to read our blog on this topic that dives into details of the theories behind this powerful anonymization technique.

To create this policy, you must select a high-cardinality column in the data. This is typically the primary key column, but could also be a column with many unique values. It is not recommended that you select a column with fewer than 90% unique values; otherwise, invalid noise could be added to the responses.

It is also critical that you consider the latency tolerance on the data source when creating this policy. The latency tolerance drives how long differentially private query responses are cached. You should set this window to a length that allows sufficient time for the underlying data to change enough that the same query would return a statistically meaningful different result. The caching avoids the privacy budget problem, in which a user asks similar questions consecutively in order to determine the real response.
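The role of the cache can be sketched as follows: within the latency tolerance, a repeated query gets the stored noisy answer instead of a fresh noise draw, so repetition reveals nothing new. This is a simplified illustration keyed on exact query text; the class name and the way similar (not identical) queries would be handled are assumptions.

```python
from typing import Callable, Dict, Tuple


class NoisyResultCache:
    """Cache noisy answers per query for a fixed latency tolerance."""

    def __init__(self, latency_tolerance_seconds: float) -> None:
        self.ttl = latency_tolerance_seconds
        self._entries: Dict[str, Tuple[float, float]] = {}

    def get(self, query: str, compute: Callable[[], float], now: float) -> float:
        cached = self._entries.get(query)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]  # same noisy answer, no new noise draw
        answer = compute()
        self._entries[query] = (now, answer)
        return answer


# Stand-in for successive noisy answers to the same query.
answers = iter([50.5, 55.0, 47.2])
cache = NoisyResultCache(latency_tolerance_seconds=3600)
query = "SELECT AVG(age) FROM patients"

first = cache.get(query, lambda: next(answers), now=0)
repeat = cache.get(query, lambda: next(answers), now=1800)
later = cache.get(query, lambda: next(answers), now=3600)
print(first, repeat, later)
```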

Limit to Purpose

Governors can create purposes which are then applied to projects and used to drive these types of policies, restricting users' access to data based on the project context they're working under. Governors can also require data sources to have a purpose restriction.

Conditions

For all of the rules above, you must also establish the conditions for which they will be enforced. Immuta allows you to append multiple conditions to the data. Those conditions are based on user attributes, which can be authorizations and groups from your identity management system, or purposes they are acting under via Immuta projects. Note that the authorizations and groups can be retrieved from multiple different identity management systems and applied as conditions to the same policy.

Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced. Immuta has determined the best direction for the condition to avoid inadvertent data leaks.

For example, rather than specifying every user attribute that should see the unmasked value, you instead specify the "special" attribute that is allowed to see the unmasked value, e.g. mask for everyone except. This is exclusionary. There are inclusionary policies, such as row level security matching, where you require that the user attribute matches the data attribute.

One final note on purposes: it's best to think about projects and purposes as additional entitlements for users. In other words, you would use projects to open users up to more data, not to restrict them from data.

Projects in Immuta

Project Owner Capabilities

Users with the CREATE_PROJECT permission are considered owners of the projects they create and have the following capabilities:

Governor Capabilities

Governors have the following capabilities for any project in their organization, even for projects that are private or that they are not members of:

Project Member Capabilities

Once subscribed to a project, all Immuta users have the following capabilities as project members:

Project Purposes, Acknowledgement Statements, and Settings

The Data Governor is responsible for configuring project purposes, acknowledgement statements, and settings.

  • Purposes: Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors can create purposes for project owners to use or owners can create their own purposes when they create their project (if the Governor allows them to). However, only Governors can delete purposes.

    Purposes Tab

  • Acknowledgement Statements: Projects containing purposes require owners and subscribers to acknowledge that they will only use the data for those purposes by affirming or rejecting acknowledgement statements. If users accept the statement, they become a project member. If they reject the acknowledgement statement, they are denied access to the project. Once acknowledged, data accessed under the provision of a project will be audited and the purposes will be noted. Immuta provides default acknowledgement statements, but Data Governors can customize these statements in the Purposes or Settings tabs.

    Project Member Acknowledgement

  • Settings: Governors can also determine if purposes are required to create a project, if purposes can be customized by project owners or must be chosen from purposes created by the data governor, or if a project can have more than one purpose. These settings are adjusted in the Settings tab of the Governance page and include the following options:

    • A purpose must be included in projects: Selecting this option will require that every project contain a purpose. Utilizing data within a project outside of the stated purposes is prohibited. Projects without purposes, however, have no set restrictions.
    • All data sources require a purpose restriction: Selecting this option will require every data source to have a purpose restriction.
    • A project can have more than one purpose: Selecting this option allows projects to have more than one purpose.
    • A project's purpose can change: Selecting this option will allow a project’s purpose to change at any time during the life of the project. Only users who created the project can change the purpose.
    • Projects can have custom purposes: Selecting this option will allow project owners to describe the purpose of their project themselves, rather than choosing from a list of purposes created by a Governor.

    Settings Tab

Switching Project Contexts

The Immuta UI provides a simple way to switch project contexts so that users can access various data sources while acting under the appropriate purpose. By default, there will be no project selected, even if the user belongs to one or more projects in Immuta.

When users change project contexts, all SQL queries or blob fetches that run through Immuta will reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.

Project Equalization

The same security restrictions regarding data sources are applied to projects; project members still need to be subscribed to data sources in order to access data, and only users with appropriate authorizations and credentials will be able to see the data if it contains any row-level or masking security.

However, Project Equalization improves collaboration by ensuring that the data in the project looks identical to all members, regardless of their level of access to data. When enabled, this feature automatically equalizes all permissions so that no project member has more access to data than the member with the least access.

Project Equalization

Note: Only project owners can add data sources to the project if this feature is enabled.

For instructions on enabling Project Equalization, navigate to the Project Owner guide.
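If each member's access is modeled as a set of authorizations, equalizing to the member with the least access amounts to taking what all members have in common. The sketch below is a simplified model under that assumption; the function name and the member data are hypothetical.

```python
from typing import Dict, Set


def equalized_authorizations(members: Dict[str, Set[str]]) -> Set[str]:
    """Return the authorizations shared by every project member."""
    if not members:
        return set()
    return set.intersection(*members.values())


members = {
    "ana": {"PII", "Finance", "US"},
    "ben": {"Finance", "US"},
    "cy": {"US", "Finance", "HR"},
}

shared = equalized_authorizations(members)
print(sorted(shared))
```

Under this model every member queries the project's data as if they held only the shared authorizations, so the data looks identical to all of them.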

Required Authorizations

This setting adjusts the minimum authorizations required to join the project and to access data within the project. When Project Equalization is enabled, Required Authorizations defaults to Immuta's recommended settings, but project owners can edit these settings by adding or removing parts of the authorizations. However, making these changes entails two potential disadvantages:

  • If you add authorizations, members might see more data as a whole, but at least some members of the project will be out of compliance. The status of users' compliance is visible from the Members tab within the project.

    Compliance Status

  • If you remove part of an authorization, the project will be open to users with fewer privileges, but this change might make less data visible to all project members. Removing from authorizations is only recommended if you foresee new users joining with less access to data than the current members.

Validation Frequency

This setting determines how often user credentials are validated, which is critical if users share data with project members outside of Immuta, as they need a way to verify that those members' permissions are still valid. Validation Frequency provides those means of verification.

Masked Joins

This feature allows masked columns to be joined within the context of a project.

Masked Joins

Note: Masked columns cannot be joined across data sources that are not linked by a project.

For instructions on enabling Masked Joins, navigate to the Project Owner guide.

Derived Data Sources

When Project Equalization is enabled, members can use data sources within the project to create a derived data source, which will dynamically inherit the policies from the parent source(s). If members use data outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.

Data Sources in Immuta

A data source is how Data Owners expose their data across their organization to other Immuta users. Throughout this process, the data is not copied. Instead, Immuta uses metadata from the data source to determine how to expose the data. In this sense, a data source is a virtual representation of data that exists in a remote data storage technology.

When a data source is exposed, policies (written by Data Owners and Data Governors) are dynamically enforced on the data, appropriately redacting and masking information depending on the attributes of the user accessing the data. Once the data source is exposed and subscribed to, the data can be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and collaboration.

Once subscribed, Data Users interact with data through Immuta using four access patterns: HDFS, Spark, SQL, and the Immuta Virtual Filesystem. Accessing data through Immuta ensures that users are only consuming policy-controlled data with thorough auditing.

Data Source User Roles

There are various roles users and groups can play relating to data sources, including:

  • Subscribers: Those who have access to the data source data. With the appropriate data accesses and authorizations, these users/groups can view files, run SQL queries, and generate analytics against the data source data. All users/groups granted access to a data source (except for those with the ingest role) have subscriber status.
  • Experts: Those who are knowledgeable about the data source data and can elaborate on it. They are also capable of managing the data source's documentation and Data Dictionary.
  • Owners: Those who create and manage new data sources and their users, documentation, Data Dictionaries, and queries. They are also capable of ingesting data into their data sources as well as adding ingest users (if their data source is object-backed).
  • Ingest: Those who are responsible for ingesting data for the data source. This role only applies to object-backed data sources (since query-backed data sources are ingested automatically). Ingest users cannot access any data once it's inside Immuta, but they are able to verify if their data was successfully ingested or not.

Data Source Types

Data sources fall in one of two categories: those that are backed by SQL technologies (query-backed data sources) and those that are not (object-backed data sources).

  • query-backed data sources: These data sources are accessible to subscribed Data Users through the Immuta Query Engine and appear as though they are Postgres tables.

  • object-backed data sources: These data sources are backed by data storage technologies that do not support SQL and can range from blob stores, to filesystems, to APIs.

Data Attributes

Data attributes are information about the data within the data source. These attributes are then matched against policy logic to determine if a row or object should be visible to a specific user. This matching is usually done between the data attribute and the user attribute.

For example, in the policy

Only show rows where Country='US' for everyone except when user is a member of group Finance

the data attribute (US in the Country column) is matched against the user attribute (Finance group) to determine whether or not rows will be visible to the user accessing the data. In this case, only members of the Finance group will see all rows in the data source; all other users will see only rows where Country is US.
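The example policy can be read as the following row filter. This is a sketch of the policy's logic only, with a hypothetical function name and sample rows.

```python
from typing import List, Set


def rows_for_user(rows: List[dict], user_groups: Set[str]) -> List[dict]:
    """Show only Country='US' rows unless the user is in the Finance group."""
    if "Finance" in user_groups:
        return list(rows)  # the exception: Finance members see every row
    return [row for row in rows if row["Country"] == "US"]


rows = [
    {"id": 1, "Country": "US"},
    {"id": 2, "Country": "DE"},
    {"id": 3, "Country": "US"},
]

finance_view = rows_for_user(rows, {"Finance"})
analyst_view = rows_for_user(rows, {"Analysts"})
print(len(finance_view), len(analyst_view))
```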

User Attributes

User attributes are values connected to specific Immuta user accounts. These attributes fall into three categories: permissions, groups, and authorizations.

These user attributes give users access to various Immuta features and drive data source policies.

Permissions

Permissions control what actions a user can take in Immuta, both API and UI actions. Permissions can be added to and removed from user accounts by a System Administrator (an Immuta user with the ADMIN permission); however, the permissions themselves are managed by Immuta, and the actions associated with the permissions cannot be altered.

Groups

Groups allow System Administrators to group sets of users together. Users can belong to any number of groups and can be added to or removed from groups at any time. Like authorizations, groups can be used to restrict what data a set of users has access to.

Authorizations

Authorizations are custom tags that are applied to users to restrict what data users can see. Authorizations can be added manually or mapped in from LDAP or Active Directory.

Data Dictionary

The Data Dictionary provides information about the columns within the data source, including column names and value types. Users subscribed to the data source can post and reply to discussion threads by commenting on the Data Dictionary.

Dictionary columns are automatically generated when the data source is created if the remote storage technology supports SQL. Otherwise, Data Owners or Experts can create the entries for the Data Dictionary manually.