Management of Policies, Projects, and Data Sources
Audience: Data Owners
Global and Local Policies
Global Policies can only be created by Data Governors and apply to all data sources across an organization. In contrast, Local Policies can be created by Data Owners or Data Governors and apply to a specific data source.
Restricted Global Policies
Data Owners who are not Governors can write Restricted Global Policies for data sources that they own. With this feature, Data Owners have higher-level policy controls and can write and enforce policies on multiple data sources simultaneously, eliminating the need to write redundant Local Policies on data sources.
Unlike Global Policies, the application of these policies is restricted to the data sources owned by the users or groups specified in the policy and will change as users' ownerships change.
Policy Export and Import
This feature allows Data Owners and Governors to export and import policies as JSON files so they can seamlessly move policies from one system to another, as long as the systems have identical configurations.
Exporting policies also allows them to be tracked, compared, and approved in systems like Git. If users want to test specific policies in their development environments and get approval before moving these policies to their production environments, they could use the Policy Export and Import feature to allow for this approval workflow.
Once enabled on the App Settings page by an Application Administrator, the Import Policies and Export Policies buttons will be visible on the Policies page for users who have the appropriate permissions (generally, a Data Owner or Governor).
When Export Policies is clicked, a .zip file containing all relevant policies will be downloaded; each Global Policy and each data source will be separated into its own JSON file.
The files exported are determined based on the user performing the export. For example, Data Owners will only be able to export policies for data sources that they own and Restricted Global Policies that they've created. Governors, however, can export all policies.
Once the files are exported, in the destination system, import can be selected to open the import modal, which gives options to import all files, remove certain files from the import, and export the current policy state as a backup. If policies are found in the current system that are not found in the import, a warning will display with an option to delete those policies.
Since policy updates are asynchronous, certain policy states will not carry through the import/export process. These include
- Policy disable. Manual policy disables will not be preserved after an import.
- Policy conflicts. Immuta's policy conflict logic is not deterministic, so after an import of Global Policies, there is no guarantee the current enabled policy state will be the same as it was in the export.
If the state of the destination system does not match the exact state of the source system (tags, data sources, users, IAMs, purposes, etc.), there is a significant chance that policies will fail to be applied or applied the same way as in the source system. These failures are reported, but, in general, import/export should not be attempted unless source and destination systems are identical.
The exported files contain the raw JSON format of a policy, not the simple policy language displayed in the UI, so there may be limits to how much users are able to use and understand comparisons of exported policies in Git or any other version-control workflow.
To access a data source, Immuta users must first be subscribed to that data source. A Subscription Policy determines who can request access and has one of four possible restriction levels:
- Anyone: Users will automatically be granted access (Least Restricted).
- Anyone Who Asks (and is Approved): Users will need to request access and be granted permission by the configured approvers (Moderately Restricted).
- Users with Specific Groups/Attributes: Only users with the specified groups/attributes will be able to see the data source and subscribe (Moderately Restricted).
- Individual Users You Select: The data source will not appear in search results; data owners must manually add/remove users (Most Restricted).
See Managing Users and Groups in a Data Source for details on managing Data Users.
Once a user is subscribed to a data source, the Data Policies that are applied to that data source determine what data the user sees. Data Policy types include masking, row redaction, differential privacy, and limiting to purpose.
You would use these to hide values in data. The masking policies have various levels of utility while still preserving data privacy. In order to create masking policies on object-backed data sources, you must create data dictionary entries and the data format must be either, csv, tsv, or json.
Hash the values to an irreversible sha256 hash, which is consistent for the same value throughout the data source so you can count or track the specific values, but not know the true raw value.
Hashed values are different across data sources, so you are not able to join on hashed values. This is done to protect against link attacks where two data owners may have exposed data with the same masked column (a quasi-identifier), but their data combined by that masked value could result in a sensitive data leak. However, joining on masked values can be enabled in Projects.
Replace with Null
Make all the values in the column null, removing any utility of this column.
Replace with constant
Replace all the values in the column with the same constant value you choose, such as 'Redacted', removing any utility of this column.
Regular Expression (regex)
This is similar to replacing with a constant, yet provides more utility as you can retain portions of the true value. For example, I could mask the final digits of an IP address with the following regex rule:
In this case, the regular expression
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
This ensures we capture the last digit(s) after the last
. in the ip address. We
then can enter the replacement for what we captured, which in this case is
XXX. So the
outcome of the policy, would look like this:
This is a technique to hide precision from numeric values yet providing more utility than simply hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.
This option masks the values using hashing, but allows users to submit an unmasking request to users who meet the exceptions of the policy.
With Format Preserving Masking
This option masks the value, but preserves the length and type of the value, as illustrated in the examples below.
This option also allows users to submit an unmasking request to users who meet the exceptions of the policy.
For all of the above policies, you are able to conditionally mask the value based on a value in another column. This allows you to build a policy that looks something like: "Mask bank account number where country = 'USA'" instead of blindly stating you want bank account masked always.
Note: When building conditional masking policies with custom SQL statements, avoid using a column that is masked using randomized response in the SQL statement, as this can lead to different behavior depending on whether you’re using the Query Engine, Spark, or Snowflake and may produce results that are unexpected.
K-anonymity is measured by grouping records in a data source that contain the same values for a common set of quasi identifiers (QIs) - publicly known attributes (such as postal codes, dates of birth, or gender) that are consistently, but ambiguously, associated with an individual.
The k-anonymity of a data source is defined as the number of records within the least populated cohort, which means that the QIs of any single record cannot be distinguished from at least k other records. In this way, a record with QIs cannot be uniquely associated with any one individual in a data source, provided k is greater than 1.
In Immuta, masking with K-Anonymization examines pairs of values across columns and hides groups that do not
appear at least the
specified number of
times (k). For example, if one column contains street numbers and another contains street names, the group
123, "Main Street" probably would appear frequently while the group
123, "Diamondback Drive" probably would show up
much less. Since the second group appears infrequently, the values could potentially identify someone, so this group
would be masked.
After the fingerprint service identifies columns with a low number of distinct values, users will only be able to select those columns when building the policy. Users can either use a minimum group size (k) given by the fingerprint or manually select the value of k.
Note: The default cardinality cutoff for columns to qualify for k-anonymization is 500. For details about adjusting this setting, navigate to the App Settings Tutorial.
Masking Multiple Columns with K-Anonymization
Governors can write Global Data Policies using K-Anonymization in the Global Data Policy Builder.
When this Global Policy is applied to data sources, it will mask all columns matching the specified tag.
Applying k-anonymization over disjoint sets of columns in separate policies does not guarantee k-anonymization over their union.
If you select multiple columns to mask with K-Anonymization in the same policy, the policy is driven by how many times these values appear together. If the groups appear fewer than k times, they will be masked.
For example, if Policy A
Policy A: Mask with K-Anonymization the values in the columns
staterequiring a group size of at least 2 for everyone
was applied to this data source
the values would be masked like this:
Note: Selecting many columns to mask with K-Anonymization increases the processing that must occur to calculate the policy, so saving the policy may take time.
However, if you select to mask the same columns with K-Anonymization in separate policies, Policy C and Policy D,
Policy C: Mask with K-Anonymization the values in the column
genderrequiring a group size of at least 2 for everyone
Policy D: Mask with K-Anonymization the values in the column
staterequiring a group size of at least 2 for everyone
the values in the columns will be masked separately instead of as groups. Therefore, the values in that same data source would be masked like this:
Using Randomized Response
This policy masks data by slightly randomizing the values in a column, preserving the utility of the data while preventing outsiders from inferring content of specific records.
For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers and apply k-anonymization to indirect identifiers to make it difficult to single out individuals. However, consider these survey participants, a cohort of male welders who share the same zip code:
All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences, and, even though direct identifiers have been removed and k-anonymization has been applied, outsiders could infer substance abuse for an individual if they knew a male welder in this zip code.
In this scenario, using randomized response would change some of the Y's in
substance_abuse to N's and vice versa;
consequently, outsiders couldn't be sure of the displayed value of
substance_abuse given in any individual row,
as they wouldn't know which rows had changed.
How the Randomization Works
Immuta applies a random number generator (RNG) that is seeded with some fixed attributes of the data source, column, backing technology, and the value of the high cardinality column, an approach that simulates cached randomness without having to actually cache anything.
For string data, the random number generator essentially flips a biased coin. If the coin comes up as tails, which it does with the frequency of the replacement rate configured in the policy, then the value is changed to any other possible value in the column, selected uniformly at random from among those values. If the coin comes up as heads, the true value is released.
For numeric data, Immuta uses the RNG to add a random shift from a 0-centered Laplace distribution with the standard deviation specified in the policy configuration. For most purposes, knowing the distribution is not important, but the net effect is that on average the reported values should be the true value plus or minus the specified deviation value.
Preserving Data Utility
Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.
Additionally, randomized response gives deniability of record content not dataset participation, so individual rows can be displayed (unlike differential privacy), which makes this policy easier for analysts to work with.
Mixing Masking Policies on the Same Column
There are some cases where you may want several different masking policies on the same column. This is possible as well through what is called OTHERWISE policies. To do so, when building the policy, instead of selecting everyone / everyone except, you can select everyone who. Once you do that, you specify who the masking policy applies to, and then must select how it applies to everyone else, e.g. OTHERWISE. You can add as many "everyone who" phrases that you need; however, you must always have a blanket OTHERWISE at the end.
SQL Support Matrix
The SQL Support Matrix button in the Data Policies section allows users to view all masking policy types and details what is supported for each access pattern.
Row-Level Security Policies
These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.
Note: When building row-level policies with custom SQL statements, avoid using a column that is masked using randomized response in the SQL statement, as this can lead to different behavior depending on whether you’re using the Query Engine, Spark, or Snowflake and may produce results that are unexpected.
These policies match a user attribute with a row/object/file attribute to determine if that row/object/file should be visible. This process uses a direct string match, so the user attribute would have to match exactly the data attribute in order to see that row of data.
For example, to restrict access to insurance claims data to the state for which the user's home office is located, you could build a policy such as this:
In this case, the
Office Location is retrieved by the identity
management system as a user attribute or group. If the user's attribute
Office Location) was
Missouri, rows containing the value
Missouri in the
in the data source would be the only rows visible to that user.
For object-backed sources, the
State can be retrieved from places
other than columns, depending on the database. For example, in S3 it is
retrieved from the metadata or tags on the S3 object or the folder name. For HDFS it is
retrieved from the xattr on the file or the folder name.
WHERE Clause Policy
This policy can be thought of as a table "view" created automatically based on the condition of the policy.
For example, in the policy below, users who are not members of the
will only see taxi rides
where passenger_count < 2. You can put any
valid SQL WHERE clause in the policy.
These policies restrict access to rows/objects/files that fall within the last x days/hours/minutes. Think of this as a moving window of time, which is chopping off the rows of data that fall further back in that time window.
The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources. For S3 it can be retrieved by a metadata or tag on the S3 object, and for HDFS it is retrieved from the xattr on the file.
These policies restrict access to a limited percentage of the data, which is randomly sampled, but it is the same sample for all the users. For example, you could limit certain users to only 10% of the data. The data the user sees will always be the same, but new rows may be added as new data arrives in the system. This policy can only be applied to query-backed data sources.
Differential privacy provides mathematical guarantees that you cannot pinpoint an individual (row) in the data. This anonymization applies the appropriate noise (if any) to the response based on the sensitivity of the query. For example “average age” could be changed from 50.5 to 55 at query time. To do this the Immuta SQL layer restricts queries run on the data to only aggregate queries (AVG, SUM, COUNT, etc) and prevents very sensitive queries from running at all. This policy type can only be applied to query-backed data sources.
We encourage you to read our blog on this topic that dives into details of the theories behind this powerful anonymization technique.
In order to create this policy you must select a high cardinality column in the data. This is typically the primary key column, but could also be a column with many unique values. It is not recommended that you select a column with less than 90% unique values. Otherwise you could have invalid noise added to the responses.
It is also critical that you consider the latency tolerance on the data source when creating this policy. The latency tolerance drives how long differentially private query responses are cached. You should set this window to a length that allows sufficient time for the underlying data to change enough where the same query would get a statistically relevant dissimilar result. The caching is done to avoid the privacy budget problem, which is the problem of the user asking similar questions consecutively in order to determine the real response.
Limit to Purpose
Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors create and manage purposes and their sub-purposes, which project owners then add to their project(s) and use to drive Data Policies.
Purposes can be constructed as a hierarchy, meaning that purposes can contain nested sub-purposes, much like tags in Immuta. This design allows more flexibility in managing purpose-based restriction policies and transparency in the relationships among purposes.
For example, consider this organization of the sub-purposes of Research:
Instead of creating separate purposes, which must then each be added to policies as they evolve, a Governor could write the following Global Policy:
Limit usage to purpose(s) Research for everyone on data sources tagged PHI.
Now, any user acting under the purpose or sub-purpose of
Research - whether
Research.MedicalClaims - will meet the criteria of this policy. Consequently,
purpose hierarchies eliminate
the need for a Governor to re-write these Global Policies when sub-purposes are added or removed. Furthermore, if new
projects with new Research purposes are added, for example, the relevant Global Policy will automatically be enforced.
For all of the rules above, you must also establish the conditions for which they will be enforced. Immuta allows you to append multiple conditions to the data. Those conditions are based on user attributes and groups (which can come from your identity management system), or purposes they are acting under via Immuta projects. Note that the attributes and groups can be retrieved from multiple different identity management systems and applied as conditions to the same policy.
Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced. Immuta has determined the best direction for the condition to avoid inadvertent data leaks.
For example, rather than specifying every user attribute that should see the unmasked value, you instead
specify the "special" attribute that is allowed to see the unmasked value, e.g.
mask for everyone except. This
is exclusionary. There are inclusionary policies, such as row level security matching, where
you require that the user attribute matches the data attribute.
One final note on purposes. It's best to think about projects and purposes as additional entitlements for users. In other words, you would use projects to open users up to more data, not restrict them from more.
Projects in Immuta
Project Owner Capabilities
Users with the
CREATE_PROJECT permission are considered owners of the projects they create and have the following capabilities:
- manage project members
- manage project documentation and purposes
- set subscription policies on the project
- enable Project Equalization
- enable Masked Joins
- manage project data sources
- manage project tags
- post, reply to, delete, and resolve discussion threads
- disable, delete, and restore the project
- create derived data sources
- create fingerprint versions
- switch project contexts
- manage native workspaces
Governors have the following capabilities for any project in their organization, even for projects that are private or that they are not members of:
- manage project members
- set subscription policies on the project
- add data sources to and delete data sources from the project
- manage project tags
- post and reply to discussion threads and delete their own threads and replies
- disable and restore a project
- configure project purposes, acknowledgement statements, and settings
- switch project contexts
Project Member Capabilities
Once subscribed to a project, all Immuta users have the following capabilities as project members:
- add data sources to the project (unless Project Equalization or Masked Joins is enabled)
- remove data sources they’ve added to the project
- post and reply to discussion threads and delete their own discussion threads and replies
- create derived data sources
- create fingerprint versions
- switch project contexts
Project Purposes, Acknowledgement Statements, and Settings
The Data Governor is responsible for configuring project purposes, acknowledgement statements, and settings.
Purposes: Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors can create purposes for project owners to use or owners can create their own purposes when they create their project (if the Governor allows them to). However, only Governors can delete purposes.
Acknowledgement Statements: Projects containing purposes require owners and subscribers to acknowledge that they will only use the data for those purposes by affirming or rejecting acknowledgement statements. If users accept the statement, they become a project member. If they reject the acknowledgement statement, they are denied access to the project. Once acknowledged, data accessed under the provision of a project will be audited and the purposes will be noted. Immuta provides default acknowledgement statements, but Data Governors can customize these statements in the Purposes or Settings tabs.
Settings: Governors can also determine if purposes are required to create a project, if purposes can be customized by project owners or must be chosen from purposes created by the data governor, or if a project can have more than one purpose. These settings are adjusted in the Settings tab of the Governance page and include the following options:
- A purpose must be included in projects: Selecting this option will require that every project contain a purpose. Utilizing data within a project outside of the stated purposes is prohibited. Projects without purposes, however, have no set restrictions.
- All data sources require a purpose restriction: Selecting this option will require every data source to have a purpose restriction.
- A project can have more than one purpose: Selecting this option allows projects to have more than one purpose.
- A project's purpose can change: Selecting this option will allow a project’s purpose to change at any time during the life of the project. Only users who created the project can change the purpose.
- Projects can have custom purposes: Selecting this option will allow project owners to describe the purpose of their project themselves, rather than choosing from a list of purposes created by a Governor.
Switching Project Contexts
The Immuta UI provides a simple way to switch project contexts so that users can access various data sources while acting under the appropriate purpose. By default, there will be no project selected, even if the user belongs to one or more projects in Immuta.
When users change project contexts, all SQL queries or blob fetches that run through Immuta will reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.
The same security restrictions regarding data sources are applied to projects; project members still need to be subscribed to data sources in order to access data, and only users with appropriate attributes and credentials will be able to see the data if it contains any row-level or masking security.
However, Project Equalization improves collaboration by ensuring that the data in the project looks identical to all members, regardless of their level of access to data. When enabled, this feature automatically equalizes all permissions so that no project member has more access to data than the member with the least access.
Note: Only project owners can add data sources to the project if this feature is enabled.
For instructions on enabling Project Equalization, navigate to the Project Owner guide.
Project Equalization and Subscription Policies
Once Project Equalization is enabled, the project Subscription Policy builder locks and can only be adjusted by manually editing the Equalized Entitlements. Then, the Subscription Policy will combine with the entitlement settings, depending on the policy type.
For example, consider the Subscription Policy of the following sample project, Fraud Prevention, before Project Equalization is enabled:
Subscription Policy: Allow users to subscribe when approved by anyone with permission Owner (of this project).
After enabling Project Equalization, the following Equalized Entitlement is recommended by Immuta: User is a member of group Accounting.
In this particular example, the Equalized Subscription Policy contains the Equalized Entitlement and the approval of the original policy, so users must satisfy both conditions to subscribe:
- the user must be a member of the group Accounting and
- the user must be approved by anyone with permission Owner (of this project).
However, the way entitlements and approvals combine differs depending on the policy type; for clarity, the table below illustrates various scenarios for each type. Every row demonstrates how a specific project Subscription Policy changes after Project Equalization is enabled (when an equalized entitlement is set and when no entitlement is set) and how the policy reverts if Project Equalization is subsequently disabled.
|Original Policy||Equalized Policy (Example Entitlement: member of group Accounting)||Equalized Policy (No Entitlement)||Policy After Disabling Equalization|
|Anyone||Allow user to subscribe when user is a member of group Accounting||Individual Users You Select||Individual Users You Select|
|Allow users to subscribe when approved by anyone with permission Owner (of this project)||Allow users to subscribe when they satisfy all of the following: is a member of group Accounting and is approved by anyone with permission Owner (of this Project)||Allow users to subscribe when approved by anyone with permission Owner (of this project)||Allow users to subscribe when approved by anyone with permission Owner (of this project)|
|Allow users to subscribe to the project when user is a member of group Legal||Allow users to subscribe to the project when user is a member of group Accounting||Individual Users You Select||Individual Users You Select|
|Individual Users You Select||Allow users to subscribe to the project when user is a member of group Accounting||Individual Users You Select||Individual Users You Select|
This setting adjusts the minimum entitlements (i.e., users' groups and attributes) required to join the project and to access data within the project. When Project Equalization is enabled, Equalized Entitlements default to Immuta's recommended settings, but project owners can edit these settings by adding or removing parts of the entitlements. However, making these changes entails two potential disadvantages:
If you add entitlements, members might see more data as a whole, but at least some members of the project will be out of compliance. The status of users' compliance is visible from the Members tab within the project.
If you remove entitlements, the project will be open to users with fewer privileges, but this change might make less data visible to all project members. Removing entitlements is only recommended if you foresee new users joining with less access to data than the current members.
This setting determines how often user credentials are validated, which is critical if users share data with project members outside of Immuta, as they need a way to verify that those members' permissions are still valid. Validation Frequency provides those means of verification.
This feature allows masked columns to be joined within the context of a project.
Note: Masked columns cannot be joined across data sources that are not linked by a project.
For instructions on enabling Masked Joins, navigate to the Project Owner guide.
Derived Data Sources
When Project Equalization is enabled, members can use data sources within the project to create a derived data source, which dynamically inherits the Subscription Policies and purpose restriction Data Policies from the parent source(s).
For example, consider these data sources, which each contain a Subscription and Data Policy:
Data Source A
Subscription Policy: Allow users to subscribe to the data source when user is a member of group Medical Claims
Data Policy: Mask by making null the value in the column(s) address except for members of group Legal
Data Source B
Subscription Policy: Allow users to subscribe to the data source when user is approved by anyone with permission Owner and anyone with permission Governance
Data Policy: Limit usage to purpose(s) Research for everyone
If a user creates a derived data source, Data Source C, from these two data sources, Data Source C will inherit these policies, which will be unchangeable:
Data Source C
Subscription Policy: Allow user to subscribe when they satisfy all of the following:
- is a member of group Legal and is a member of group Medical Claims
- is approved by anyone with permission Owner (of Data Source B) and anyone with permission Governance
Data Policy: Limit usage to purpose(s) Research for everyone
Note: If members use data outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.
For detailed instructions on creating a derived data source, navigate to the Managing Projects tutorial.
HDFS Native Workspace
This workspace allows native access to data on cluster without having to go through the Immuta SparkSession or Immuta Query Engine. Within a project, users can enable HDFS Native Workspace, which creates a workspace directory in HDFS (and a corresponding database in the Hive metastore) where users can write files.
After a Project Owner creates a workspace, users will only be able to access this HDFS directory and database when acting under the project, and they should use the SparkSQL session to copy data into the workspace. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access.
Once derived data is ready to be shared outside the workspace, it can be exposed as a derived data source in Immuta. At that point, the derived data source will inherit policies appropriately, and it will then be available through Immuta outside the project and can be used in future project workspaces by different teams in a compliant way.
- Administrators can opt to configure where all Immuta projects are kept in HDFS (default
/user/immuta/workspace). Note: If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
- Administrators can place a configuration value in the cluster configuration (
core-site.xml) to mark that cluster as unavailable for use as a workspace.
- Once a project is equalized, Project Owners can enable a workspace for the project.
- If more than one cluster is configured, Immuta will prompt for which to use.
- Once enabled, the full URI of where that workspace is located will display on the project page.
- Project Owners can also add connection information for Hive and/or Impala to allow Hive or Impala workspace sources to be created. The connection information provided and the Kerberos credentials configured for Immuta will be used for each derived Hive or Impala data source. The connection string for Hive or Impala will be displayed on the project page with the full URI.
- Project Owners can disable the workspace at any time.
- When disabled, the workspace will not allow reading/writing from project members any longer.
- Data sources living in this directory will still exist and their access will not be changed. (Subscribed users will still have access as usual.)
- All data in this directory will still exist, regardless of whether it belongs to a data source or not.
- Project Owners can purge all data in the workspace after it has been disabled. Project Owners can
- Purge all non-data-source data only.
- Purge all data (including data source data).
- When purging all data source data, sources can either be disabled or fully deleted.
- When a user is acting under the project context, Immuta will provide them read/write access to the project HDFS directory (using HDFS ACLs). If there are Immuta data sources already exposed in that directory, the user will bypass the namenode plugin if acting under the project for the data in that directory.
- Once a user is not acting under the project, all access to that directory will be revoked and they can only access data in that project as official Immuta data sources, if any exist.
- When users with the CREATE_DATA_SOURCE_IN_PROJECT permission create a derived data source with workspace enabled,
they will be prompted with a modified create data source workflow:
- The user will select the directory (starting with the project root directory) of the data they are exposing.
- If the directory contains parquet or ORC files, then Hive, Impala, and HDFS will be an option for the data source; otherwise, only HDFS will be available.
- Users will not be asked for the connection information because the Immuta user connection will be used to create the data source, which will ensure join pushdown and that the data source will work even when the user isn’t acting in the project. Note: Hive or Impala workspace sources are only available if the Project Owner added Hive or Impala connection information to the workspace.
- If Hive or Impala is selected as the data source type, Immuta will infer schema/partitions from files and generate create table statements for Hive.
- Once the data source is created, policy inheritance will take effect.
Note: To avoid data source collisions, Immuta will not allow HDFS and Hive/Impala data sources to be backed from the same location in HDFS.
Native Snowflake Workspaces
Snowflake workspaces allow users to access protected data directly in Snowflake without having to go through the Immuta Query Engine.
Typically, Immuta applies policies by forcing users to query through the Query Engine, which acts like a proxy in front of the database Immuta is protecting. However, Snowflake secure views make this process unnecessary. Instead, Immuta enforces policy logic on data and represents it as secure views in Snowflake. However, since secure views are static, creating a secure view for every unique user in your organization for every table in your organization would result in secure view bloat. Immuta projects address this problem by virtually grouping users and tables and equalizing users to the same level of access, ensuring that all members of the project see the same view of the data. Consequently, all members share one secure view.
Beyond interacting directly with Snowflake secure views in these workspaces, users can create derived data sources and collaborate with other project members at a common access level. Because these derived data sources will inherit all appropriate policies, that data can then be shared outside the project. Additionally, derived data sources use the credentials of the Immuta system Snowflake account, which will allow them to persist after a workspace is disconnected.
Additionally, derived data sources use the credentials of the Immuta system Snowflake account, which will allow them to persist after a workspace is disconnected.
Immuta enforces policy logic on data and represents it as secure views in Snowflake. Because projects group users and tables and equalize members to the same level of access, all members will see the same view of the data and, consequently, will only need one secure view. Changes to policies immediately propagate to relevant secure views.
Mapping Projects to Secure Views
Immuta projects are represented as Session Contexts within Snowflake. As they are linked to Snowflake, projects automatically create corresponding
- roles in Snowflake: IMMUTA_[project name]
- schemas in the Snowflake IMMUTA database: [project name]
- secure views in the project schema for any table in the project
If users switch projects, they simply change their Snowflake Session Context to the appropriate Immuta project. If users are not entitled to a data source contained by the project, they will not be able to access the Context in Snowflake until they have access to all tables in the project. If changes are made to a user's attributes, the changes will immediately propagate to the Snowflake context.
Using Immuta with an Existing Snowflake Account
The following steps allow Immuta to be used with existing Snowflake accounts.
Immuta is configured to integrate with the organization’s Snowflake account and (optionally) share a single sign on (such as Okta), allowing users in Immuta to map to the same users in Snowflake. (Alternatively, that mapping can be inferred by using the same usernames in both Snowflake and Immuta.)
CREATE_DATA_SOURCE permissions are granted to specific users to allow them to expose Snowflake table metadata and enforce policies.
If tags are used to drive policies, users can manually add tags when tables are imported, Immuta can automatically tag sensitive data (if Sensitive Data Detection is enabled), or users can pull tags from external catalogs that are mapped to the tables being exposed.
Policies are created and enforced on tables.
The CREATE_PROJECT permission is granted to specific users so they can create their own Immuta projects and create the appropriate Snowflake contexts. These users can drive what projects and hence what Snowflake contexts exist. Note: When users leave a project or a project is deleted, that Snowflake context will be removed from their Snowflake accounts.
The CREATE_DATA_SOURCE_IN_PROJECT permission is given to specific users so they can expose their derived tables in the project; the derived tables will inherit the policies, and then the data can be shared outside the project.
Users access data only through secure views in Snowflake (via Immuta projects), which significantly decreases the amount of role management for administrators in Snowflake. Organizations should also consider having a user in Snowflake who is able to create databases and make GRANTs on those databases and having separate users who are able to read and write from those tables.
- Few roles to manage in Snowflake; that complexity is pushed to Immuta, which is designed to simplify it.
- A small set of users has direct access to raw tables; most users go through secure views only, but raw database access can be segmented across departments.
- Policies are built by the individual database administrators within Immuta and are managed in a single location, and changes to policies are automatically propagated across thousands of tables’ secure views.
- Self-service access to data based on data policies.
- Users work in various contexts in Snowflake natively, based on their collaborators and their purpose, without fear of leaking data.
All policies are enforced natively in Snowflake without performance impact.
- Security is maintained through Snowflake primitives (roles and secure views).
- Performance and scalability is maintained (no proxy).
Policies can be driven by metadata, allowing massive scale policy enforcement with only a small set of actual policies.
- Derived tables can be shared back out through Immuta, improving collaboration.
- User access and removal are immediately reflected in secure views.
- Snowflake workspaces do not support differential privacy policies. Any Snowflake sources with differential privacy policies applied will not be created within the native Snowflake workspace.
- Native derived data sources can't be query-backed.
Data Sources in Immuta
A data source is how Data Owners expose their data across their organization to other Immuta users. Throughout this process, the data is not copied. Instead, Immuta uses metadata from the data source to determine how to expose the data. In this sense, a data source is a virtual representation of data that exists in a remote data storage technology.
When a data source is exposed, policies (written by Data Owners and Data Governors) are dynamically enforced on the data, appropriately redacting and masking information depending on the attributes of the user accessing the data. Once the data source is exposed and subscribed to, the data can be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and collaboration.
Once subscribed, Data Users interact with data through Immuta through different access patterns: HDFS, Spark, and SQL. Accessing data through Immuta ensures that users are only consuming policy-controlled data with thorough auditing.
Data Source User Roles
There are various roles users and groups can play relating to data sources, including
- Subscribers: Those who have access to the data source data. With the appropriate data accesses and attributes, these users/groups can view files, run SQL queries, and generate analytics against the data source data. All users/groups granted access to a data source (except for those with the ingest role) have subscriber status.
- Experts: Those who are knowledgeable about the data source data and can elaborate on it. They are also capable of managing the data source's documentation and Data Dictionary.
- Owners: Those who create and manage new data sources and their users, documentation, Data Dictionaries, and queries. They are also capable of ingesting data into their data sources as well as adding ingest users (if their data source is object-backed).
- Ingest: Those who are responsible for ingesting data for the data source. This role only applies to object-backed data sources (since query-backed data sources are ingested automatically). Ingest users cannot access any data once it's inside Immuta, but they are able to verify if their data was successfully ingested or not.
Data Source Types
Data sources fall in one of two categories: those that are backed by SQL technologies (query-backed data sources) and those that are not (object-backed data sources).
query-backed data sources: These data sources are accessible to subscribed Data Users through the Immuta Query Engine and appear as though they are Postgres tables.
object-backed data sources: These data sources are backed by data storage technologies that do not support SQL and can range from blob stores, to filesystems, to APIs.
Data attributes are information about the data within the data source. These attributes are then matched against policy logic to determine if a row or object should be visible to a specific user. This matching is usually done between the data attribute and the user attribute.
For example, in the policy
Only show rows where
Country='US' for everyone except when user is a member of group
the data attribute (
US in the
Country column) is matched against the user attribute
Finance group) to determine whether or not rows will be visible to the user accessing the data. In this case only
users who are a member of the Finance group will see all rows in the data source.
These user attributes give users access to various Immuta features and drive data source policies.
Permissions control what actions a user can take in Immuta, both API and UI actions. Permissions can be
added and removed from user accounts by a System Administrator (an Immuta user with the
USER_ADMIN permission); however,
the permissions themselves are managed by Immuta, and the actions associated with the permissions cannot be altered.
Groups allow System Administrators to group sets of users together. Users can belong to any number of groups and can be added or removed from groups at any time. Like attributes, groups can be used to restrict what data a set of users has access to.
Called "authorizations" prior to Immuta's 2.7 release, attributes are custom tags that are applied to users to restrict what data users can see. Attributes can be added manually or mapped in from LDAP or Active Directory.
The Data Dictionary provides information about the columns within the data source, including column names and value types. Users subscribed to the data source can post and reply to discussion threads by commenting on the Data Dictionary.
Dictionary columns are automatically generated when the data source is created if the remote storage technology supports SQL. Otherwise, Data Owners or Experts can create the entries for the Data Dictionary manually.
This feature monitors servers for schema and table changes, including when schemas and tables are added or removed, and notifies Data Owners when any changes are made.
When this feature is enabled by a Data Owner, Immuta detects when a new table has been added and automatically creates a new data source. Correspondingly, if a remote table is removed, that data source will be disabled in the console.
Data Owners or Governors can select which users will monitor schema changes, and if more than one user is selected as a monitor, one data source will be created for each of these users.
Table Evolution Detection
Data Owners can also enable Table Evolution Detection, which monitors when columns are added or removed and when column types are changed.
When new columns are added to the remote table, Immuta automatically applies the
New tag to these columns in the
data source, and, since these new columns could contain sensitive data, a seeded
New Column Added Global Policy
Data Owners can then review and approve these changes from the Requests tab of their profile page.
Approving column changes removes the
New tags from the data source.