Management of Policies, Projects, and Data Sources
Audience: Data Owners
Global and Local Policies
Global Policies can only be created by Data Governors and apply to all data sources across an organization. In contrast, Local Policies can be created by Data Owners or Data Governors and apply to a specific data source.
To access a data source, Immuta users must first be subscribed to that data source. A Subscription Policy determines who can request access and has one of four possible restriction levels:
- Anyone: Users will automatically be granted access (Least Restricted).
- Anyone Who Asks (and is Approved): Users will need to request access and be granted permission by the configured approvers (Moderately Restricted).
- Users with Specific Groups/Authorizations: Only users with the specified groups/authorizations will be able to see the data source and subscribe (Moderately Restricted).
- Individual Users You Select: The data source will not appear in search results; data owners must manually add/remove users (Most Restricted).
See Managing Users and Groups in a Data Source for details on managing Data Users.
Once a user is subscribed to a data source, the Data Policies that are applied to that data source determine what data the user sees. Data Policy types include masking, row redaction, differential privacy, and limiting to purpose.
You would use these to hide values in data. The masking policies have various levels of utility while still preserving data privacy. In order to create masking policies on object-backed data sources, you must create data dictionary entries and the data format must be either, csv, tsv, or json.
Hash the values to an irreversible sha256 hash, which is consistent for the same value throughout the data source so you can count or track the specific values, but not know the true raw value. The hash will be unique per user, but consistent for that user within the data source. In other words, the user will not be able to share the hashed value with other users in a meaningful way, but will be able to count and track it within the data source.
Hashed values are different across data sources, thus, you are not able to join on hashed values. This is done to protect against link attacks where two data owners may have exposed data with the same masked column (a quasi-identifier), but their data combined by that masked value could result in a sensitive data leak. However, joining on masked values can be enabled in Projects, if desired. This is the default masking policy when not doing advanced masking policies, listed below.
Replace with Null
Make all the values in the column null, removing any utility of this column.
Replace with constant
Replace all the values in the column with the same constant value you choose, such as 'Redacted', removing any utility of this column.
Regular Expression (regex)
This is similar to replacing with a constant, yet provides more utility as you can retain portions of the true value. For example, I could mask the final digits of an IP address with the following regex rule:
In this case, the regular expression
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
This ensures we capture the last digit(s) after the last
. in the ip address. We
then can enter the replacement for what we captured, which in this case is
XXX. So the
outcome of the policy, would look like this:
This is a technique to hide precision from numeric values yet providing more utility than simply hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.
This option masks the values using hashing, but allows users to submit an unmasking request to users who meet the exceptions of the policy.
With Format Preserving Masking
This option masks the value, but preserves the length and type of the value, as illustrated in the examples below.
This option also allows users to submit an unmasking request to users who meet the exceptions of the policy.
For all of the above policies, you are able to conditionally mask the value based on a value in another column. This allows you to build a policy that looks something like: "Mask bank account number where country = 'USA'" instead of blindly stating you want bank account masked always.
Mixing Masking Policies on the Same Column
There are some cases where you may want several different masking policies on the same column. This is possible as well through what is called OTHERWISE policies. To do so, when building the policy, instead of selecting everyone / everyone except, you can select everyone who. Once you do that, you specify who the masking policy applies to, and then must select how it applies to everyone else, e.g. OTHERWISE. You can add as many "everyone who" phrases that you need; however, you must always have a blanket OTHERWISE at the end.
Row-Level Security Policies
These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.
These policies match a user attribute with a row/object/file attribute to determine if that row/object/file should be visible. This process uses a direct string match, so the user attribute would have to match exactly the data attribute in order to see that row of data.
For example, to restrict access to insurance claims data to the state for which the user's home office is located, you could build a policy such as this:
In this case, the
Office Location is retrieved by the identity
management system as a user attribute (which can be an authorization or group). If the user's authorization
Office Location) was
Missouri, rows containing the value
Missouri in the
in the data source would be the only rows visible to that user.
For object-backed sources, the
State can be retrieved from places
other than columns, depending on the database. For example, in S3 it is
retrieved from the metadata or tags on the S3 object or the folder name. For HDFS it is
retrieved from the xattr on the file or the folder name.
WHERE Clause Policy
This policy can be thought of as a table "view" created automatically based on the condition of the policy.
For example, in the policy below, users who are not members of the
will only see taxi rides
where passenger_count < 2. You can put any
valid SQL WHERE clause in the policy.
These policies restrict access to rows/objects/files that fall within the last x days/hours/minutes. Think of this as a moving window of time, which is chopping off the rows of data that fall further back in that time window.
The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources. For S3 it can be retrieved by a metadata or tag on the S3 object, and for HDFS it is retrieved from the xattr on the file.
These policies restrict access to a limited percentage of the data, which is randomly sampled, but it is the same sample for all the users. For example, you could limit certain users to only 10% of the data. The data the user sees will always be the same, but new rows may be added as new data arrives in the system. This policy can only be applied to query-backed data sources.
Differential privacy provides mathematical guarantees that you cannot pinpoint an individual (row) in the data. This anonymization applies the appropriate noise (if any) to the response based on the sensitivity of the query. For example “average age” could be changed from 50.5 to 55 at query time. To do this the Immuta SQL layer restricts queries run on the data to only aggregate queries (AVG, SUM, COUNT, etc) and prevents very sensitive queries from running at all. This policy type can only be applied to query-backed data sources.
We encourage you to read our blog on this topic that dives into details of the theories behind this powerful anonymization technique.
In order to create this policy you must select a high cardinality column in the data. This is typically the primary key column, but could also be a column with many unique values. It is not recommended that you select a column with less than 90% unique values. Otherwise you could have invalid noise added to the responses.
It is also critical that you consider the latency tolerance on the data source when creating this policy. The latency tolerance drives how long differentially private query responses are cached. You should set this window to a length that allows sufficient time for the underlying data to change enough where the same query would get a statistically relevant dissimilar result. The caching is done to avoid the privacy budget problem, which is the problem of the user asking similar questions consecutively in order to determine the real response.
Limit to Purpose
Governors can create purposes which are then applied to projects and used to drive these types of policies, restricting users' access to data based on the project context they're working under. Governors can also require data sources to have a purpose restriction.
For all of the rules above, you must also establish the conditions for which they will be enforced. Immuta allows you to append multiple conditions to the data. Those conditions are based on user attributes, which can be authorizations and groups from your identity management system, or purposes they are acting under via Immuta projects. Note that the authorizations and groups can be retrieved from multiple different identity management systems and applied as conditions to the same policy.
Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced. Immuta has determined the best direction for the condition to avoid inadvertent data leaks.
For example, rather than specifying every user attribute that should see the unmasked value, you instead
specify the "special" attribute that is allowed to see the unmasked value, e.g.
mask for everyone except. This
is exclusionary. There are inclusionary policies, such as row level security matching, where
you require that the user attribute matches the data attribute.
One final note on purposes. It's best to think about projects and purposes as additional entitlements for users. In other words, you would use projects to open users up to more data, not restrict them from more.
Projects in Immuta
Project Owner Capabilities
Users with the
CREATE_PROJECT permission are considered owners of the projects they create and have the following capabilities:
- manage project members
- manage project documentation and purposes
- set subscription policies on the project
- enable Project Equalization
- enable Masked Joins
- manage project data sources
- manage project tags
- post, reply to, delete, and resolve discussion threads
- disable, delete, and restore the project
- create derived data sources
- create fingerprint versions
- switch project contexts
Governors have the following capabilities for any project in their organization, even for projects that are private or that they are not members of:
- manage project members
- set subscription policies on the project
- add data sources to and delete data sources from the project
- manage project tags
- post and reply to discussion threads and delete their own threads and replies
- disable and restore a project
- configure project purposes, acknowledgement statements, and settings
- switch project contexts
Project Member Capabilities
Once subscribed to a project, all Immuta users have the following capabilities as project members:
- add data sources to the project (unless Project Equalization or Masked Joins is enabled)
- remove data sources they’ve added to the project
- post and reply to discussion threads and delete their own discussion threads and replies
- create derived data sources
- create fingerprint versions
- switch project contexts
Project Purposes, Acknowledgement Statements, and Settings
The Data Governor is responsible for configuring project purposes, acknowledgement statements, and settings.
Purposes: Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors can create purposes for project owners to use or owners can create their own purposes when they create their project (if the Governor allows them to). However, only Governors can delete purposes.
Acknowledgement Statements: Projects containing purposes require owners and subscribers to acknowledge that they will only use the data for those purposes by affirming or rejecting acknowledgement statements. If users accept the statement, they become a project member. If they reject the acknowledgement statement, they are denied access to the project. Once acknowledged, data accessed under the provision of a project will be audited and the purposes will be noted. Immuta provides default acknowledgement statements, but Data Governors can customize these statements in the Purposes or Settings tabs.
Settings: Governors can also determine if purposes are required to create a project, if purposes can be customized by project owners or must be chosen from purposes created by the data governor, or if a project can have more than one purpose. These settings are adjusted in the Settings tab of the Governance page and include the following options:
- A purpose must be included in projects: Selecting this option will require that every project contain a purpose. Utilizing data within a project outside of the stated purposes is prohibited. Projects without purposes, however, have no set restrictions.
- All data sources require a purpose restriction: Selecting this option will require every data source to have a purpose restriction.
- A project can have more than one purpose: Selecting this option allows projects to have more than one purpose.
- A project's purpose can change: Selecting this option will allow a project’s purpose to change at any time during the life of the project. Only users who created the project can change the purpose.
- Projects can have custom purposes: Selecting this option will allow project owners to describe the purpose of their project themselves, rather than choosing from a list of purposes created by a Governor.
Switching Project Contexts
The Immuta UI provides a simple way to switch project contexts so that users can access various data sources while acting under the appropriate purpose. By default, there will be no project selected, even if the user belongs to one or more projects in Immuta.
When users change project contexts, all SQL queries or blob fetches that run through Immuta will reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.
The same security restrictions regarding data sources are applied to projects; project members still need to be subscribed to data sources in order to access data, and only users with appropriate authorizations and credentials will be able to see the data if it contains any row-level or masking security.
However, Project Equalization improves collaboration by ensuring that the data in the project looks identical to all members, regardless of their level of access to data. When enabled, this feature automatically equalizes all permissions so that no project member has more access to data than the member with the least access.
Note: Only project owners can add data sources to the project if this feature is enabled.
For instructions on enabling Project Equalization, navigate to the Project Owner guide.
This setting adjusts the minimum authorizations required to join the project and to access data within the project. When Project Equalization is enabled, Required Authorizations defaults to Immuta's recommended settings, but project owners can edit these settings by adding or removing parts of the authorizations. However, making these changes entails two potential disadvantages:
If you add authorizations, members might see more data as a whole, but at least some members of the project will be out of compliance. The status of users' compliance is visible from the Members tab within the project.
If you remove part of an authorization, the project will be open to users with fewer privileges, but this change might make less data visible to all project members. Removing from authorizations is only recommended if you foresee new users joining with less access to data than the current members.
This setting determines how often user credentials are validated, which is critical if users share data with project members outside of Immuta, as they need a way to verify that those members' permissions are still valid. Validation Frequency provides those means of verification.
This feature allows masked columns to be joined within the context of a project.
Note: Masked columns cannot be joined across data sources that are not linked by a project.
For instructions on enabling Masked Joins, navigate to the Project Owner guide.
Derived Data Sources
When Project Equalization is enabled, members can use data sources within the project to create a derived data source, which will dynamically inherit the policies from the parent source(s). If members use data outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.
Data Sources in Immuta
A data source is how Data Owners expose their data across their organization to other Immuta users. Throughout this process, the data is not copied. Instead, Immuta uses metadata from the data source to determine how to expose the data. In this sense, a data source is a virtual representation of data that exists in a remote data storage technology.
When a data source is exposed, policies (written by Data Owners and Data Governors) are dynamically enforced on the data, appropriately redacting and masking information depending on the attributes of the user accessing the data. Once the data source is exposed and subscribed to, the data can be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and collaboration.
Once subscribed, Data Users interact with data through Immuta through four different access patterns: HDFS, Spark, SQL, and the Immuta Virtual Filesystem. Accessing data through Immuta ensures that users are only consuming policy-controlled data with thorough auditing.
Data Source User Roles
There are various roles users and groups can play relating to data sources, including
- Subscribers: Those who have access to the data source data. With the appropriate data accesses and authorizations, these users/groups can view files, run SQL queries, and generate analytics against the data source data. All users/groups granted access to a data source (except for those with the ingest role) have subscriber status.
- Experts: Those who are knowledgeable about the data source data and can elaborate on it. They are also capable of managing the data source's documentation and Data Dictionary.
- Owners: Those who create and manage new data sources and their users, documentation, Data Dictionaries, and queries. They are also capable of ingesting data into their data sources as well as adding ingest users (if their data source is object-backed).
- Ingest: Those who are responsible for ingesting data for the data source. This role only applies to object-backed data sources (since query-backed data sources are ingested automatically). Ingest users cannot access any data once it's inside Immuta, but they are able to verify if their data was successfully ingested or not.
Data Source Types
Data sources fall in one of two categories: those that are backed by SQL technologies (query-backed data sources) and those that are not (object-backed data sources).
query-backed data sources: These data sources are accessible to subscribed Data Users through the Immuta Query Engine and appear as though they are Postgres tables.
object-backed data sources: These data sources are backed by data storage technologies that do not support SQL and can range from blob stores, to filesystems, to APIs.
Data attributes are information about the data within the data source. These attributes are then matched against policy logic to determine if a row or object should be visible to a specific user. This matching is usually done between the data attribute and the user attribute.
For example, in the policy
Only show rows where
Country='US' for everyone except when user is a member of group
the data attribute (
US in the
Country column) is matched against the user attribute
Finance group) to determine whether or not rows will be visible to the user accessing the data. In this case only
users who are a member of the Finance group will see all rows in the data source.
These user attributes give users access to various Immuta features and drive data source policies.
Permissions control what actions a user can take in Immuta, both API and UI actions. Permissions can be
added and removed from user accounts by a System Administrator (an Immuta user with the
ADMIN permission); however,
the permissions themselves are managed by Immuta, and the actions associated with the permissions cannot be altered.
Groups allow System Administrators to group sets of users together. Users can belong to any number of groups and can be added or removed from groups at any time. Like authorizations, groups can be used to restrict what data a set of users has access to.
Authorizations are custom tags that are applied to users and groups to restrict what data users can see. Authorizations can be added manually or mapped in from LDAP or Active Directory.
The Data Dictionary provides information about the columns within the data source, including column names and value types. Users subscribed to the data source can post and reply to discussion threads by commenting on the Data Dictionary.
Dictionary columns are automatically generated when the data source is created if the remote storage technology supports SQL. Otherwise, Data Owners or Experts can create the entries for the Data Dictionary manually.