Management of Global Policies, Tags, and Project Settings
Audience: Data Governors
Content Summary: This page describes Global Policy types, tags, purposes, acknowledgement statements, and project settings.
Global Policies can only be created by Data Governors and apply to all data sources across an organization. In contrast, Local Policies can be created by Data Owners or Data Governors and apply to a specific data source.
To access a data source, Immuta users must first be subscribed to that data source. A Subscription Policy determines who can request access and has one of four possible restriction levels:
- Anyone: Users will automatically be granted access (Least Restricted).
- Anyone Who Asks (and is Approved): Users will need to request access and be granted permission by the configured approvers (Moderately Restricted).
- Users with Specific Groups/Attributes: Only users with the specified groups/attributes will be able to see the data source and subscribe (Moderately Restricted).
- Individual Users You Select: The data source will not appear in search results; data owners must manually add/remove users (Most Restricted).
See Managing Users and Groups in a Data Source for details on managing Data Users.
Once a user is subscribed to a data source, the Data Policies that are applied to that data source determine what data the user sees. Data Policy types include masking, row redaction, minimization, time-based restrictions, differential privacy, and limiting to purpose.
Masking policies hide values in data. They offer various levels of utility while still preserving data privacy. To create masking policies on object-backed data sources, you must create Data Dictionary entries, and the data format must be csv, tsv, or json.
K-anonymity is measured by grouping records in a data source that contain the same values for a common set of quasi identifiers (QIs) - publicly known attributes (such as postal codes, dates of birth, or gender) that are consistently, but ambiguously, associated with an individual.
The k-anonymity of a data source is defined as the number of records within the least populated cohort, which means that the QIs of any single record cannot be distinguished from those of at least k - 1 other records. In this way, a record with QIs cannot be uniquely associated with any one individual in a data source, provided k is greater than 1.
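The definition above can be sketched in a few lines of Python. This is only an illustration of the measurement, not Immuta's implementation; the function name, column names, and records are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the least populated cohort of quasi-identifier values."""
    cohorts = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return min(cohorts.values()) if cohorts else 0

# Hypothetical records; column names and values are illustrative only.
records = [
    {"zip": "20740", "gender": "F", "diagnosis": "A"},
    {"zip": "20740", "gender": "F", "diagnosis": "B"},
    {"zip": "20740", "gender": "M", "diagnosis": "A"},
    {"zip": "20740", "gender": "M", "diagnosis": "C"},
    {"zip": "20742", "gender": "F", "diagnosis": "B"},
]

# The cohort ("20742", "F") holds a single record, so k is 1 for these two QIs.
print(k_anonymity(records, ["zip", "gender"]))
```

Here the single record in zip code 20742 is what keeps k at 1; measured on gender alone, the same records have k = 2.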
In Immuta, masking with K-Anonymization examines pairs of values across columns and hides groups that do not appear at least the specified number of times (k). For example, if one column contains street numbers and another contains street names, the group 123, "Main Street" probably would appear frequently while the group 123, "Diamondback Drive" probably would show up much less. Since the second group appears infrequently, the values could potentially identify someone, so this group would be masked.
After the fingerprint service identifies columns with a low number of distinct values, users will only be able to select those columns when building the policy. Users can either use a minimum group size (k) given by the fingerprint or manually select the value of k.
Note: The default cardinality cutoff for columns to qualify for k-anonymization is 500. For details about adjusting this setting, navigate to the App Settings Tutorial.
Masking Multiple Columns with K-Anonymization
Governors can write Global Data Policies using K-Anonymization in the Global Data Policy Builder.
When this Global Policy is applied to data sources, it will mask all columns matching the specified tag.
Applying k-anonymization over disjoint sets of columns in separate policies does not guarantee k-anonymization over their union.
If you select multiple columns to mask with K-Anonymization in the same policy, the policy is driven by how many times these values appear together. If the groups appear fewer than k times, they will be masked.
For example, if Policy A
Policy A: Mask with K-Anonymization the values in the columns gender and state requiring a group size of at least 2 for everyone
was applied to this data source
the values would be masked like this:
Note: Selecting many columns to mask with K-Anonymization increases the processing that must occur to calculate the policy, so saving the policy may take time.
However, if you select to mask the same columns with K-Anonymization in separate policies, Policy C and Policy D,
Policy C: Mask with K-Anonymization the values in the column gender requiring a group size of at least 2 for everyone
Policy D: Mask with K-Anonymization the values in the column state requiring a group size of at least 2 for everyone
the values in the columns will be masked separately instead of as groups. Therefore, the values in that same data source would be masked like this:
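As a rough illustration of why masking columns together differs from masking them in separate policies, the sketch below applies a simple group-count rule to hypothetical rows. The rows, the function name, and the choice of None as the masked value are assumptions made for the example; they are not taken from Immuta's implementation.

```python
from collections import Counter

def mask_groups(rows, columns, k):
    """Null out `columns` in every row whose combined values occur fewer than k times."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    masked = []
    for row in rows:
        new_row = dict(row)
        if counts[tuple(row[c] for c in columns)] < k:
            for c in columns:
                new_row[c] = None  # assumption: masked values are shown as nulls
        masked.append(new_row)
    return masked

# Hypothetical data source with two columns.
rows = [
    {"gender": "F", "state": "MD"},
    {"gender": "F", "state": "MD"},
    {"gender": "M", "state": "MD"},
    {"gender": "M", "state": "OH"},
    {"gender": "F", "state": "OH"},
]

# One policy over both columns: only the pair (F, MD) appears at least twice.
joint = mask_groups(rows, ["gender", "state"], k=2)

# Two policies, one per column: every single value appears at least twice,
# so nothing is masked even though three gender/state pairs are unique.
separate = mask_groups(mask_groups(rows, ["gender"], k=2), ["state"], k=2)
```

In this sample the joint policy masks three of the five rows, while the two separate policies leave every row untouched, which is why k-anonymization over disjoint sets of columns does not guarantee k-anonymization over their union.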
Using Randomized Response
This policy masks data by slightly randomizing the values in a column, preserving the utility of the data while preventing outsiders from inferring content of specific records.
For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers and apply k-anonymization to indirect identifiers to make it difficult to single out individuals. However, consider these survey participants, a cohort of male welders who share the same zip code:
All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences, and, even though direct identifiers have been removed and k-anonymization has been applied, outsiders could infer substance abuse for an individual if they knew a male welder in this zip code.
In this scenario, using randomized response would change some of the Y's in substance_abuse to N's and vice versa; consequently, outsiders couldn't be sure of the displayed value of substance_abuse given in any individual row, as they wouldn't know which rows had changed.
How the Randomization Works
Immuta applies a random number generator (RNG) that is seeded with some fixed attributes of the data source, column, backing technology, and the value of the high cardinality column, an approach that simulates cached randomness without having to actually cache anything.
For string data, the random number generator essentially flips a biased coin. If the coin comes up as tails, which it does with the frequency of the replacement rate configured in the policy, then the value is changed to any other possible value in the column, selected uniformly at random from among those values. If the coin comes up as heads, the true value is released.
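The biased coin flip and the seeded generator described above might look like the following sketch. It is not Immuta's code: the hashing scheme, the function names, and the seed parts (a data source name, a column name, and a key value) are hypothetical stand-ins for the fixed attributes mentioned in the text.

```python
import hashlib
import random

def seeded_rng(*parts):
    """Build an RNG whose seed is derived deterministically from fixed attributes."""
    digest = hashlib.sha256("|".join(map(str, parts)).encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def randomize_string(value, domain, replacement_rate, *seed_parts):
    """Release the true value, or with probability replacement_rate a different one."""
    rng = seeded_rng(*seed_parts)
    if rng.random() < replacement_rate:  # "tails": replace the value
        others = [v for v in domain if v != value]
        return rng.choice(others) if others else value
    return value  # "heads": release the true value

# Hypothetical row: key 1017 in column substance_abuse of data source health_survey.
shown = randomize_string("Y", ["Y", "N"], 0.2, "health_survey", "substance_abuse", 1017)
```

Because the seed depends only on fixed attributes, repeating the call for the same row returns the same displayed value, which gives the effect of cached randomness without storing anything.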
For numeric data, Immuta uses the RNG to add a random shift from a 0-centered Laplace distribution with the standard deviation specified in the policy configuration. For most purposes, knowing the distribution is not important, but the net effect is that on average the reported values should be the true value plus or minus the specified deviation value.
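A minimal sketch of the numeric case follows, assuming the same kind of hypothetical hash-derived seed as for string data. It relies on two standard facts: a Laplace distribution with scale b has standard deviation b times the square root of 2, and the difference of two independent Exp(1) draws follows a standard Laplace distribution. The function names are illustrative.

```python
import hashlib
import math
import random

def laplace_shift(std_dev, *seed_parts):
    """Draw a 0-centered Laplace shift with the requested standard deviation."""
    digest = hashlib.sha256("|".join(map(str, seed_parts)).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    scale = std_dev / math.sqrt(2)  # Laplace(0, b) has standard deviation b * sqrt(2)
    # The difference of two Exp(1) draws is a standard Laplace(0, 1) draw.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def randomize_number(value, std_dev, *seed_parts):
    """Return the true value plus a deterministic, seed-dependent random shift."""
    return value + laplace_shift(std_dev, *seed_parts)

# Hypothetical row: key 1017 in column age of data source health_survey.
reported_age = randomize_number(42, 5.0, "health_survey", "age", 1017)
```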
Preserving Data Utility
Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.
Additionally, randomized response gives deniability of record content, not dataset participation, so individual rows can be displayed (unlike differential privacy), which makes this policy easier for analysts to work with.
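To illustrate how aggregate utility survives, consider a column with exactly two values (such as Y and N) and a known replacement rate p. If the true share of Y is pi, the displayed share is q = pi * (1 - p) + (1 - pi) * p, which an analyst can invert. The helper below is a hypothetical sketch of that arithmetic, not an Immuta feature, and it only covers the two-value case.

```python
def estimate_true_share(observed_share, replacement_rate):
    """Invert q = pi * (1 - p) + (1 - pi) * p for a column with exactly two values."""
    if replacement_rate == 0.5:
        raise ValueError("a 50% replacement rate leaves no signal to recover")
    return (observed_share - replacement_rate) / (1 - 2 * replacement_rate)

# If 38% of displayed rows read "Y" under a 20% replacement rate,
# the estimated true share of "Y" is (0.38 - 0.20) / 0.60, or about 30%.
estimate = estimate_true_share(0.38, 0.20)
```

The estimate is only meaningful in aggregate; it says nothing about whether any particular row was changed.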
SQL Support Matrix
The SQL Support Matrix button in the Data Policies section allows users to view all masking policy types and details what is supported for each access pattern.
Row-Level Security Policies
These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.
Time-based restriction policies restrict access to rows/objects/files that fall within the time restrictions set in the policy. If a data source has time-based restriction policies, queries run against the data source by a user will only return rows/blobs with a date in their event-time column/attribute from within a certain range.
The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources. For S3 it can be retrieved by a metadata or tag on the S3 object, and for HDFS it is retrieved from the xattr on the file.
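The effect of a time-based restriction on query results can be sketched as a filter on the event-time column. The sketch assumes the range is a trailing window ending at the current time; the function name, column name, and rows are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def within_time_window(rows, event_time_column, window, now=None):
    """Keep only rows whose event time falls inside the trailing window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    return [row for row in rows if cutoff <= row[event_time_column] <= now]

# Hypothetical rows with a timezone-aware event-time column.
rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "event_time": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"id": 3, "event_time": datetime(2024, 3, 20, tzinfo=timezone.utc)},
]

# With "now" fixed at 15 March 2024 and a 30-day window, only row 2 qualifies.
visible = within_time_window(
    rows, "event_time", timedelta(days=30),
    now=datetime(2024, 3, 15, tzinfo=timezone.utc),
)
```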
Minimization policies restrict access to a limited percentage of the data, which is randomly sampled, but it is the same sample for all the users. For example, you could limit certain users to only 10% of the data. The data the user sees will always be the same, but new rows may be added as new data arrives in the system. This policy can only be applied to query-backed data sources.
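One way to get a sample that looks random yet is identical for every user is to derive each row's membership from a hash of a stable key. The source does not say how Immuta selects the sample, so the sketch below is an assumption; the function name and the integer row keys are hypothetical.

```python
import hashlib

def in_sample(row_key, percent):
    """Deterministically decide whether a row belongs to the visible sample."""
    digest = hashlib.sha256(str(row_key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100

# Roughly 10% of these hypothetical row keys are visible, and always the same ones.
visible = [key for key in range(1, 1001) if in_sample(key, 10)]
```

A row's membership depends only on its own key, so rows that arrive later can join the sample without changing which existing rows are visible.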
Differential privacy provides mathematical guarantees that you cannot pinpoint an individual (row) in the data. This anonymization applies the appropriate noise (if any) to the response based on the sensitivity of the query. For example, “average age” could be changed from 50.5 to 55 at query time. To do this, the Immuta SQL layer restricts queries run on the data to aggregate queries (AVG, SUM, COUNT, etc.) and prevents very sensitive queries from running at all. This policy type can only be applied to query-backed data sources.
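The idea of scaling noise to a query's sensitivity can be illustrated with the textbook Laplace mechanism for a COUNT query. The source does not describe the mechanism or parameters Immuta uses, so the privacy parameter epsilon, the function names, and the rows below are assumptions made for the example.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_count(rows, predicate, epsilon, rng=None):
    """Return a noisy count; adding or removing one row changes a count by at most 1."""
    rng = rng or random.Random()
    sensitivity = 1.0
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical rows: ages 0 through 99, counting those aged 50 or older.
rows = [{"age": age} for age in range(100)]
noisy = private_count(rows, lambda row: row["age"] >= 50, epsilon=1.0)
```

A smaller epsilon means more noise and stronger privacy; queries with higher sensitivity would need proportionally more noise.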
We encourage you to read our blog on this topic that dives into details of the theories behind this powerful anonymization technique.
In order to create this policy you must select a high cardinality column in the data. This is typically the primary key column, but could also be a column with many unique values. It is not recommended that you select a column with less than 90% unique values. Otherwise you could have invalid noise added to the responses.
It is also critical that you consider the latency tolerance on the data source when creating this policy. The latency tolerance drives how long differentially private query responses are cached. Set this window long enough for the underlying data to change to the point where the same query would return a statistically different result. The caching is done to avoid the privacy budget problem, in which a user asks similar questions consecutively in order to determine the real response.
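The caching behavior can be pictured as a time-limited cache keyed by the query: within the latency tolerance, a repeated query gets the same noisy answer rather than a fresh draw. The class below is a hypothetical sketch of that idea, not Immuta's cache; the clock is injectable only so the example is easy to follow.

```python
import time

class PrivateResponseCache:
    """Replay the same noisy answer for a query until the latency tolerance expires."""

    def __init__(self, latency_tolerance_seconds, clock=time.monotonic):
        self.ttl = latency_tolerance_seconds
        self.clock = clock
        self._entries = {}  # query text -> (time stored, noisy answer)

    def get_or_compute(self, query, compute):
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still inside the window: no new noise is drawn
        answer = compute()
        self._entries[query] = (now, answer)
        return answer

# Hypothetical usage with a fake clock and pre-drawn noisy answers.
ticks = [0.0]
answers = iter([55.0, 49.2])
cache = PrivateResponseCache(60, clock=lambda: ticks[0])
first = cache.get_or_compute("SELECT AVG(age) FROM t", lambda: next(answers))
```

Without the cache, a user could repeat the query many times and average the answers to cancel out the noise.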
Limit to Purpose
Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors create and manage purposes and their sub-purposes, which project owners then add to their project(s) and use to drive Data Policies.
Purposes can be constructed as a hierarchy, meaning that purposes can contain nested sub-purposes, much like tags in Immuta. This design allows more flexibility in managing purpose-based restriction policies and transparency in the relationships among purposes.
For example, consider this organization of the sub-purposes of Research:
Instead of creating separate purposes, which must then each be added to policies as they evolve, a Governor could write the following Global Policy:
Limit usage to purpose(s) Research for everyone on data sources tagged PHI.
Now, any user acting under the purpose Research or one of its sub-purposes, such as Research.MedicalClaims, will meet the criteria of this policy. Consequently, purpose hierarchies eliminate the need for a Governor to re-write these Global Policies when sub-purposes are added or removed. Furthermore, if new projects with new Research purposes are added, for example, the relevant Global Policy will automatically be enforced.
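The hierarchy check can be sketched as a prefix comparison on dotted purpose names. The function name is hypothetical, and the sketch assumes purposes are written with a dot between each level, as in Research.MedicalClaims.

```python
def satisfies_purpose(acting_purpose, required_purpose):
    """True when acting_purpose equals required_purpose or is nested beneath it."""
    acting = acting_purpose.split(".")
    required = required_purpose.split(".")
    return acting[: len(required)] == required

# A user acting under a sub-purpose meets a policy written for the parent purpose.
allowed = satisfies_purpose("Research.MedicalClaims", "Research")
```

Comparing whole path segments rather than raw string prefixes keeps an unrelated purpose such as ResearchOps from matching Research.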
Data Policy Conflicts
In some cases, two conflicting Global Data Policies may apply to a single data source. When this happens, the policy containing a tag deeper in the hierarchy will apply to the data source to resolve the conflict.
Consider the following Global Data Policies created by a Data Governor:
Data Policy 1: Mask columns tagged PII using a constant for everyone on data sources with columns tagged
Data Policy 2: Mask columns tagged PII.SSN by making null for everyone on data sources with columns tagged
If a Data Owner creates a data source and applies PII.Other tags, both of these Global Data Policies will apply. Instead of having a conflict, the policy containing a deeper tag in the hierarchy will apply:
In this example, Data Policy 2 cannot be applied to the data source. If Data Owners wanted to use Data Policy 2 on the data source instead, they would need to disable Data Policy 1.
Once enabled on a data source, Global Data Policies can be edited and disabled by Data Owners. See the Managing Policies Tutorial for instructions.
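The tag-depth rule described in this section can be sketched as picking, among the policies whose tag covers a column's tag, the one whose tag sits deepest in the hierarchy. The policy names, the single tag per policy, and the dotted tag paths below are hypothetical simplifications; Immuta's actual conflict logic may consider more than this.

```python
def resolve_conflict(policies, column_tag):
    """Among policies whose tag covers column_tag, pick the one with the deepest tag."""
    column_parts = column_tag.split(".")

    def covers(policy_tag):
        parts = policy_tag.split(".")
        return column_parts[: len(parts)] == parts

    applicable = [policy for policy in policies if covers(policy["tag"])]
    return max(applicable, key=lambda policy: policy["tag"].count("."), default=None)

# Hypothetical policies keyed by the tag they target.
policies = [
    {"name": "broad policy", "tag": "PII"},
    {"name": "specific policy", "tag": "PII.SSN"},
]

winner = resolve_conflict(policies, "PII.SSN")
```

For a column tagged PII.SSN, both hypothetical policies match and the one targeting the deeper tag is selected; for a column tagged PII.Other, only the broad policy matches.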
Staged Global Policies
Governors can create Staged Global Policies, which can then be safely reviewed and edited without affecting data sources. Once a policy is ready, Governors can activate it to immediately enforce the policy on relevant data sources.
Note: Policies that contain the circumstance "When selected by data owners" cannot be staged.
Global Data Policy Custom Certifications
When building a Global Data Policy, Governors can create custom certifications, which must then be acknowledged by Data Owners when the policy is applied to data sources.
When a Global Data Policy with a custom certification is cloned, the certification is also cloned. If the user who clones the policy and custom certification is not a Governor, the policy will only be applied to data sources that user owns.
HIPAA Safe Harbor Policy
HIPAA Safe Harbor requires that
- 18 direct identifiers are removed from data sources.
- Data Owners do not have actual knowledge that Data Users could re-identify individuals.
The HIPAA Safe Harbor policy is a Global Policy included in Immuta by default. When combined with Sensitive Data Detection, this policy automatically applies to relevant data sources. However, to fully comply with HIPAA Safe Harbor, Data Owners will need to certify that tags on data sources are accurate; after the policy is applied, multiple warnings indicate that certification is required, including a "Policy Certification Required" label on the data source and on the policy. Additionally, owners will receive a notification to certify the policy.
Note: The HIPAA Safe Harbor policy is staged by default and cannot be edited by any user. However, Governors can clone this policy and then edit the clone.
HIPAA Safe Harbor Policy Certification
The Data Owner and Data User certifications serve as official acknowledgements that the users and data comply with HIPAA Safe Harbor:
- Data Owner Certification: Data Owners certify that all 18 identifiers have been correctly tagged and that they have no knowledge that the information in the data sources could be used by Data Users to identify individuals.
- Data User Certification: Data Users agree to use the data only for the stated purpose of the project; refrain from sharing that data outside the project; not re-identify or take any steps to re-identify individuals' health information; notify the Project Owner or Governance team in the event that individuals have been identified or could be identified; and refrain from contacting individuals who might be identified.
California Consumer Privacy Act (CCPA) Policy
The CCPA policy is a Global Policy included in Immuta by default. When combined with Sensitive Data Detection, this policy automatically applies to relevant data sources.
CCPA sets forth two routes to achieve compliance:
- businesses processing consumer personal information abide by all applicable restrictions (e.g., purpose restrictions or consumer rights), and/or
- businesses transform consumer personal information into de-identified or aggregate data so that restrictions, such as consumer rights, become inapplicable.
Under CCPA, de-identification is successfully performed if data “cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer,” provided that an organization that uses de-identified information
- implements technical safeguards that prohibit re-identification of the consumer to whom the information may pertain,
- implements business processes that specifically prohibit re-identification of the information,
- implements business processes to prevent inadvertent release of de-identified information, and
- makes no attempt to re-identify the information.
Immuta’s CCPA de-identification policy was created to comply with this definition and consists of 4 main components (each of which addresses at least one prong of CCPA's de-identification test):
- a self-executing data policy that applies a de-identification technique that serves as a technical safeguard to prohibit re-identification of the consumer.
- certifications by the Data Owner. These serve as an official acknowledgement that the covered business has initially appropriately labeled consumer information and is not aware that the Data User is in position to re-identify consumers prior to the re-use of the data. This component is crucial to prevent inadvertent release of de-identified information.
- certifications by the Data User. These serve as official acknowledgements that the Data User is subject to business processes that prohibit re-identification and inadvertent release of de-identified information to third parties.
- functionalities to enable real-time monitoring and auditing of query-based access to data. These aim to deter and detect attempts to re-identify.
Note: The language used in certifications can be customized to meet specific needs of customers, such as when customers want to use specific language found in data-sharing agreements.
CCPA Policy Conditions
The data policy is made of four rules, as illustrated below.
The first rule ensures that access to data can only happen for two types of use cases: those that require access to de-identified data (Re-identification Prohibited.CCPA) and those that require access to identifying data (Use Case Outside De-identification). Data Users are then strictly segmented by use case through attribute-based access control and purpose acknowledgement.
The second rule nulls direct identifiers and undetermined identifiers for Data Users with access to de-identified data.
The third rule generalizes indirect identifiers with k-anonymization so that the re-identifiability probability is always equal to or below 5% for Data Users with access to de-identified data. Note: Immuta has analyzed industry standards and thresholds recommended by statistical methods experts and selected the most restrictive value of 5% for the maximum re-identifiability probability.
The fourth rule applies the first three rules to all data sources containing columns tagged Discovered.Identifier Indirect, or
Immuta's CCPA policy addresses both direct and indirect identifiers because robust de-identification requires considering all types of identifying attributes, and the identifiers are masked differently to maximize utility. With this combination of masking techniques, the data re-identification risk (the amount of re-identification possible for each data source) meets CCPA’s de-identification criteria.
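To relate the 5% threshold in the third rule to a group size, one common simplification treats the chance of singling out a record within a cohort of k indistinguishable records as 1/k. Under that assumption, which is not stated in the source and may differ from how Immuta computes re-identifiability, the smallest acceptable k follows directly. The function name is hypothetical.

```python
import math
from fractions import Fraction

def minimum_group_size(max_reidentification_probability):
    """Smallest k such that 1/k does not exceed the given probability."""
    return math.ceil(1 / Fraction(max_reidentification_probability))

# A 5% ceiling corresponds to cohorts of at least 20 records under the 1/k model.
print(minimum_group_size("0.05"))  # prints 20
```

Passing the probability as a string lets Fraction represent it exactly, which avoids floating-point rounding at the boundary.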
Note: The CCPA policy is staged by default and cannot be edited by any user. However, Governors can clone this policy and then edit the clone; customers must then check that the overall re-identification risk is still acceptable after the customization.
Policy Export and Import
This feature allows Data Owners and Governors to export and import policies as JSON files so they can seamlessly move policies from one system to another, as long as the systems have identical configurations.
Exporting policies also allows them to be tracked, compared, and approved in systems like Git. If users want to test specific policies in their development environments and get approval before moving these policies to their production environments, they could use the Policy Export and Import feature to allow for this approval workflow.
Once enabled on the App Settings page by an Application Administrator, the Import Policies and Export Policies buttons will be visible on the Policies page for users who have the appropriate permissions (generally, a Data Owner or Governor).
When Export Policies is clicked, a .zip file containing all relevant policies will be downloaded; each Global Policy and each data source will be separated into its own JSON file.
The files exported are determined based on the user performing the export. For example, Data Owners will only be able to export policies for data sources that they own and Restricted Global Policies that they've created. Governors, however, can export all policies.
Once the files are exported, in the destination system, import can be selected to open the import modal, which gives options to import all files, remove certain files from the import, and export the current policy state as a backup. If policies are found in the current system that are not found in the import, a warning will display with an option to delete those policies.
Since policy updates are asynchronous, certain policy states will not carry through the import/export process. These include
- Policy disable. Manual policy disables will not be preserved after an import.
- Policy conflicts. Immuta's policy conflict logic is not deterministic, so after an import of Global Policies, there is no guarantee the current enabled policy state will be the same as it was in the export.
If the state of the destination system does not match the exact state of the source system (tags, data sources, users, IAMs, purposes, etc.), there is a significant chance that policies will fail to be applied or will not be applied the same way as in the source system. These failures are reported, but, in general, import/export should not be attempted unless source and destination systems are identical.
The exported files contain the raw JSON format of a policy, not the simple policy language displayed in the UI, so there may be limits to how much users are able to use and understand comparisons of exported policies in Git or any other version-control workflow.
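One way to make exported policies easier to compare in version control is to rewrite each JSON file with sorted keys and consistent indentation before committing it, so that diffs reflect real changes rather than key order. The sketch below assumes only that the export is a .zip archive of JSON files, as described above; the function name, file names, and policy fields are hypothetical.

```python
import io
import json
import zipfile

def normalize_export(zip_bytes):
    """Return {file name: pretty-printed JSON with sorted keys} for each .json entry."""
    normalized = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith(".json"):
                document = json.loads(archive.read(name))
                normalized[name] = json.dumps(document, indent=2, sort_keys=True) + "\n"
    return normalized

# Build a tiny stand-in archive in memory; a real export would be read from disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example_policy.json", '{"type": "data", "name": "Example"}')

files = normalize_export(buffer.getvalue())
```

The normalized text still shows the raw policy JSON rather than the simple policy language from the UI, so it makes diffs more stable but not necessarily easier to interpret.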
Tags serve several functions: they can drive Local or Global Subscription and Data Policies, they can be used to generate Immuta Reports, and they can drive search results in the Immuta UI. Governors can create tags or import tags from external catalogs in the Governance UI. Data Owners and Governors can then apply these tags to or remove them from projects, data sources, and/or specific columns within the data sources.
Sensitive Data Detection
External Sensitive Data Detection
External Sensitive Data Detection is a license-driven feature that must be added for you before it is available in your Immuta instance.
When enabled on the App Settings page, this feature uses third party services to automatically identify and tag columns that contain sensitive data (PII, PHI, etc.) when the data source is created; this detection is based on an extremely small randomized sampling of underlying data, which is encrypted in transit, is used only for entity prediction, and remains confidential and managed by Immuta, subject to the same guarantees reviewed and agreed to in our license agreement.
During the fingerprint process External Sensitive Data Detection divides the classification of the data into specific tags: Immuta “Discovered” tags.
The Immuta application is pre-configured with a set of these tags that the service can return so that they can be used to write Global Policies before data sources even exist. Consequently, sensitive data is tagged and appropriate policies are enforced immediately upon data source creation.
Only Application Admins have the option to enable External Sensitive Data Detection on the App Settings page. However, users can disable auto-tagging on a data-source-by-data-source basis, and Governors can disable any unwanted “Discovered” tags in the Immuta application to prevent them from being used and auto-detected in the future.
Internal Sensitive Data Detection
When enabled on the App Settings page, this feature automatically identifies and tags columns that contain sensitive data (PII, PHI, etc.) when the data source is created; this detection is based on a small sample of underlying data, which remains in the users' network.
During the fingerprint process Internal Sensitive Data Detection divides the classification of the data into specific tags: Immuta “Discovered” tags.
The Immuta application is pre-configured with a set of these tags so that they can be used to write Global Policies before data sources even exist. Consequently, sensitive data is tagged and appropriate policies are enforced immediately upon data source creation.
Unlike External Sensitive Data Detection, this feature does not require a license. However, only Application Admins have the option to enable Internal Sensitive Data Detection on the App Settings page. Users can disable auto-tagging on a data-source-by-data-source basis, and Governors can disable any unwanted “Discovered” tags in the Immuta application to prevent them from being used and auto-detected in the future.
Project Purposes, Acknowledgement Statements, and Settings
The Data Governor is responsible for configuring project purposes, acknowledgement statements, and settings.
Purposes: Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors can create purposes for project owners to use or owners can create their own purposes when they create their project (if the Governor allows them to). However, only Governors can delete purposes.
Acknowledgement Statements: Projects containing purposes require owners and subscribers to acknowledge that they will only use the data for those purposes by affirming or rejecting acknowledgement statements. If users accept the statement, they become a project member. If they reject the acknowledgement statement, they are denied access to the project. Once acknowledged, data accessed under the provision of a project will be audited and the purposes will be noted. Immuta provides default acknowledgement statements, but Data Governors can customize these statements in the Purposes or Settings tabs. Acknowledgement statements ensure that project members are aware of (and agree to) all purpose-based restrictions before accessing the project's content. Each purpose is associated with its own acknowledgement statement, meaning that a project with multiple purposes (if allowed) would require users to accept more than one acknowledgement statement. Immuta keeps a record of whether each project member has agreed to the acknowledgement statement(s), and if so, records the purpose associated to the acknowledgement, the time of the acknowledgement, and the text of the acknowledgement itself. All purposes are associated with the default acknowledgement statement unless their statement has been customized.
Settings: Governors can also determine if purposes are required to create a project, if purposes can be customized by project owners or must be chosen from purposes created by the data governor, or if a project can have more than one purpose. These settings are adjusted in the Settings tab of the Governance page and include the following options:
- A purpose must be included in projects: Selecting this option will require that every project contain a purpose. Utilizing data within a project outside of the stated purposes is prohibited. Projects without purposes, however, have no set restrictions.
- All data sources require a purpose restriction: Selecting this option will require every data source to have a purpose restriction.
- A project can have more than one purpose: Selecting this option allows projects to have more than one purpose.
- A project's purpose can change: Selecting this option will allow a project’s purpose to change at any time during the life of the project. Only users who created the project can change the purpose.
- Projects can have custom purposes: Selecting this option will allow project owners to describe the purpose of their project themselves, rather than choosing from a list of purposes created by a Governor.