Data Policy Types

Once a user is subscribed to a data source, the data policies that are applied to that data source determine what data the user sees.

For all data policies, you must establish the conditions under which they will be enforced. Immuta allows you to append multiple conditions to the data. Those conditions are based on user attributes and groups (which can come from multiple identity management systems and be applied as conditions in the same policy) or the purposes users are acting under through Immuta projects.

Conditions can be directed as exclusionary or inclusionary, depending on the policy that's being enforced:

  • exclusionary condition example: Mask using hashing the values in columns tagged PII on all data sources for everyone except users in the group AUDIT.

  • inclusionary condition example: Only show rows where user is a member of a group that matches the value in the column tagged Department.

Policy support matrix

The table below outlines the policy types supported by each integration. Details about each of these policies are included in the policy types section.

*Supported with Caveats:

  • On Databricks data sources, joins will not be allowed on data protected with replace with NULL/constant policies.

  • Snowflake k-anonymization: This policy type is only supported if you are using the query engine, which is disabled by default. Reach out to your Immuta representative if you need to enable this policy type for your account.

  • Starburst (Trino):

    • K-anonymization, randomized response, and format preserving masking are only supported if you are using the query engine, which is disabled by default. Reach out to your Immuta representative if you need to enable this policy type for your account.

    • The Immuta function @iam for WHERE clause policies can block the creation of views.

Policy types

Inclusionary policies

For all policies except purpose-based restriction policies, inclusionary logic allows governors to vary policy actions with an Otherwise clause.

For example, governors could mask values using hashing for users acting under a specified purpose while masking those same values by making them null for everyone else who accesses the data.

This variation can be created by selecting everyone who, when available, from the condition dropdown menus and then completing the Otherwise clause.

Limit to purpose policies

Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors create and manage purposes and their sub-purposes, which project owners then add to their project(s) and use to drive Data Policies.

Purposes can be constructed as a hierarchy, meaning that purposes can contain nested sub-purposes, much like tags in Immuta. This design allows more flexibility in managing purpose-based restriction policies and transparency in the relationships among purposes.

For example, if the purpose Research included Marketing, Product, and Onboarding as sub-purposes, a governor could write the following global policy:

Limit usage to purpose(s) Research for everyone on data sources tagged PHI.

This hierarchy allows you to create this as a single purpose instead of creating separate purposes, which must then each be added to policies as they evolve.

Now, any user acting under the purpose or sub-purpose of Research - whether Research.Marketing or Research.Onboarding - will meet the criteria of this policy. Consequently, purpose hierarchies eliminate the need for a governor to rewrite these global policies when sub-purposes are added or removed. Furthermore, if new projects with new Research purposes are added, for example, the relevant global policy will automatically be enforced.

Refer to the data governor policy guide for a tutorial on purpose-based restrictions on data.

Masking policies

Masking policies hide values in data, providing various levels of utility while still preserving privacy.

Hashing

This policy masks the values with an irreversible sha256 hash, which is consistent for the same value throughout the data source, so you can count or track the specific values, but not know the true raw value.

Hashed values are different across data sources, so you cannot join on hashed values unless you enable masked joins on data sources within a project. Immuta prevents joins on hashed values to protect against link attacks where two data owners may have exposed data with the same masked column (a quasi-identifier), but their data combined by that masked value could result in a sensitive data leak.
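
The sketch below illustrates the idea of consistent but irreversible hashing. The salted SHA-256 helper and the per-data-source salt are hypothetical, not Immuta's implementation; they simply show why the same raw value can be counted within one data source yet cannot be joined across data sources.

import hashlib

# Illustrative only: salting with a per-data-source secret keeps the hash consistent
# within a data source (useful for counts) but different across data sources (no joins).
def mask(value: str, data_source_salt: str) -> str:
    return hashlib.sha256((data_source_salt + value).encode("utf-8")).hexdigest()

print(mask("alice@example.com", "salt-for-source-A"))  # identical on every query of source A
print(mask("alice@example.com", "salt-for-source-B"))  # different value for source B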

Replace with NULL

This policy makes values null, removing any utility of the data the policy applies to.

Replace with constant

With this policy, you can replace the values with the same constant value you choose, such as 'Redacted', removing any utility of that data.

Regular expression (regex)

This policy is similar to replacing with a constant, but it provides more utility because you can retain portions of the true value. When authoring the policy in Immuta, the regex and the replacement value do not need to be in single or double quotes.

The following regex rule would mask the final digits of an IP address:

Mask using a regex \d+$ the value in the columns ip_address for everyone.

In this case, the regular expression \d+$ breaks down as follows:

  • \d matches a digit (equal to [0-9])

  • + is a quantifier: it matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

  • $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

This ensures we capture the last digit(s) after the last . in the IP address. We can then enter the replacement for what we captured, which in this case is XXX. So the outcome of the policy would look like this: 164.16.13.XXX

This regex rule applies masking to telephone numbers variably depending on the presence of a dash (implying a prefix), space, or only digits:

Mask using a regex (\+?\d{0,3}[-\s]?)?\d{4} the value in the column tagged Discovered...Telephone Number for everyone.
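
As a quick illustration of what these two regex rules do to sample values, the snippet below applies them with Python's re module. It only demonstrates the patterns themselves; it is not how Immuta enforces the policy in the underlying platform, and the sample values are made up.

import re

# The \d+$ rule masks the trailing digits of an IP address.
print(re.sub(r"\d+$", "XXX", "164.16.13.255"))  # -> 164.16.13.XXX

# The telephone rule masks a trailing block of digits with an optional prefix.
print(re.sub(r"(\+?\d{0,3}[-\s]?)?\d{4}", "****", "+1-555-867-5309"))  # -> +1-555-****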

The Databricks Unity Catalog integration uses Spark's built-in regexp_replace function, which currently only supports global pattern flags set as global (g) and case-sensitive matching. Regex policies will not work on this platform unless these settings are configured accordingly: when authoring a regex global policy that will apply to Databricks Unity Catalog data sources, enable the Global pattern flag and leave Case insensitivity disabled.

Rounding

This is a technique to hide precision from numeric values while providing more utility than simply hashing. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year.
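
A minimal sketch of the rounding idea is shown below; the truncation helpers are illustrative only and are not Immuta's implementation, but they show how precision can be removed from dates and numeric values.

from datetime import datetime

# Remove time-of-day precision by truncating a timestamp to the day.
ts = datetime(2024, 5, 17, 14, 23, 55)
print(ts.replace(hour=0, minute=0, second=0, microsecond=0))  # 2024-05-17 00:00:00

# Remove precision from a geospatial coordinate by rounding to one decimal place.
print(round(39.7392, 1))  # 39.7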

With reversibility

Deprecation notice

Support for unmask requests has been deprecated.

This option masks the values using hashing, but allows users to submit an unmasking request to users who meet the exceptions of the policy.

Note: The user receiving the unmasking request must send the unmasked value to the requester.

With Reversible Masking, the raw values are switched out with consistent values to allow analysis without revealing the underlying sensitive data. The direct identifier is replaced with a token that can still be tracked or counted.

With format preserving masking

This option masks the value, but preserves the length and type of the value.

This option also allows users to submit an unmasking request to users who meet the exceptions of the policy.

Preserving the data format is important if the format has some relevance to the analysis at hand. For example, if you need to retain the integer column type or if the first 6 digits of a 12-digit number have an important meaning.

Custom function

This option uses functions native to the underlying database to transform the column. Single quotes enclosing the regex and escaping special characters are required. The following example masks telephone numbers variably depending on the presence of a dash (implying a prefix), space, or only digits:

REGEXP_REPLACE(@column, '(\\+?\\d{0,3}[-\\s]?)?\\d{4}', '****')

When authoring a global policy that uses a custom function, note the use of @column in the example above to specify the column to which the function should apply.

Limitations

  • The masking functions are executed against the remote database directly. A poorly written function could lead to poor quality results, data leaks, and performance hits.

  • Using custom functions can result in changes to the original data type. To prevent query errors, you must ensure that you cast the result back to the original type.

  • The function must be valid for the data type of the selected column. If it is not:

    • Local policies will error and show a message that the function is not valid.

    • Global policies will error and change to the default masking type (hashing for text and NULL for all others).

Conditionally masking

For all of the policies above, both at the local and global policy levels, you can conditionally mask the value based on a value in another column. This allows you to build a policy that looks something like: "Mask bank account number where country = 'USA'" instead of blindly stating that you want the bank account number masked always.

Note: When building conditional masking policies with custom SQL statements, avoid using a column that is masked using randomized response in the SQL statement, as this can lead to different behavior depending on your data platform and may produce results that are unexpected.
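
A minimal sketch of the effect of such a condition is shown below using pandas; the column names are hypothetical and this is not how Immuta implements the policy, only an illustration of masking one column based on the value of another.

import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "Canada", "USA"],
    "bank_account": ["111-222", "333-444", "555-666"],
})

# Mask bank_account only for rows where country = 'USA'; other rows stay untouched.
df.loc[df["country"] == "USA", "bank_account"] = "REDACTED"
print(df)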

With k-anonymization

Sample data is processed during computation of k-anonymization policies

When a k-anonymization policy is applied to a data source, the columns targeted by the policy are queried under a fingerprinting process that generates rules enforcing k-anonymity. The results of this query, which may contain data that is subject to regulatory constraints such as GDPR or HIPAA, are stored in Immuta's metadata database.

The location of the metadata database depends on your deployment:

  • Self-managed Immuta deployment: The metadata database is located in the server where you have your external metadata database deployed.

  • SaaS Immuta deployment: The metadata database is located in the AWS global segment you have chosen to deploy Immuta.

To ensure this process does not violate your organization's data localization regulations, you need to first activate this masking policy type before you can use it in your Immuta tenant. To enable k-anonymization for your account, see the k-anonymization section in the app settings how-to guide.

K-anonymity is measured by grouping records in a data source that contain the same values for a common set of quasi identifiers (QIs) - publicly known attributes (such as postal codes, dates of birth, or gender) that are consistently, but ambiguously, associated with an individual.

The k-anonymity of a data source is defined as the number of records within the least populated cohort, which means that the QIs of any single record cannot be distinguished from at least k other records. In this way, a record with QIs cannot be uniquely associated with any one individual in a data source, provided k is greater than 1.

In Immuta, masking with k-anonymization examines pairs of values across columns and hides groups that do not appear at least the specified number of times (k). For example, if one column contains street numbers and another contains street names, the group 123, "Main Street" probably would appear frequently while the group 123, "Diamondback Drive" probably would show up much less. Since the second group appears infrequently, the values could potentially identify someone, so this group would be masked.

After the fingerprint service identifies columns with a low number of distinct values, users will only be able to select those columns when building the policy. Users can either use a minimum group size (k) given by the fingerprint or manually select the value of k.

Note: The default cardinality cutoff for columns to qualify for k-anonymization is 500. For details about adjusting this setting, navigate to the App Settings Tutorial.

Masking multiple columns with k-anonymization

Governors can write global data policies using k-anonymization in the global data policy builder.

When this global policy is applied to data sources, it will mask all columns matching the specified tag.

Applying k-anonymization over disjoint sets of columns in separate policies does not guarantee k-anonymization over their union.

If you select multiple columns to mask with k-anonymization in the same policy, the policy is driven by how many times these values appear together. If the groups appear fewer than k times, they will be masked.

For example, if Policy A

Policy A: Mask with k-anonymization the values in the columns gender and state requiring a group size of at least 2 for everyone

was applied to this data source

Gender   State
Female   Ohio
Female   Florida
Female   Florida
Female   Arkansas
Male     Florida

the values would be masked like this:

Gender   State
Null     Null
Female   Florida
Female   Florida
Null     Null
Null     Null

Note: Selecting many columns to mask with k-anonymization increases the processing that must occur to calculate the policy, so saving the policy may take time.

However, if you select to mask the same columns with k-anonymization in separate policies, Policy C and Policy D,

Policy C: Mask with k-anonymization the values in the column gender requiring a group size of at least 2 for everyone

Policy D: Mask with k-anonymization the values in the column state requiring a group size of at least 2 for everyone

the values in the columns will be masked separately instead of as groups. Therefore, the values in that same data source would be masked like this:

Gender   State
Female   Null
Female   Florida
Female   Florida
Female   Null
Null     Florida
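
The grouping behavior in the Policy A example above can be sketched with pandas: count how often each (gender, state) pair occurs and null out the rows whose group is smaller than k. This is only an illustration of the masking logic, not Immuta's implementation.

import pandas as pd

df = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female", "Male"],
    "state":  ["Ohio", "Florida", "Florida", "Arkansas", "Florida"],
})

k = 2  # minimum group size required by the policy
group_sizes = df.groupby(["gender", "state"])["gender"].transform("size")

masked = df.copy()
masked.loc[group_sizes < k, :] = None  # groups appearing fewer than k times are masked
print(masked)  # only the two (Female, Florida) rows keep their values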

Using randomized response

This policy masks data by slightly randomizing the values in a column, preserving the utility of the data while preventing outsiders from inferring content of specific records.

For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers and apply k-anonymization to indirect identifiers to make it difficult to single out individuals. However, consider these survey participants, a cohort of male welders who share the same zip code:

participant_id   zip_code   gender   occupation   substance_abuse
...              ...        ...      ...          ...
880d0096         75002      Male     Welder       Y
f267334b         75002      Male     Welder       Y
bfdb43db         75002      Male     Welder       Y
260930ce         75002      Male     Welder       Y
046dc7fb         75002      Male     Welder       Y
...              ...        ...      ...          ...

All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences, and, even though direct identifiers have been removed and k-anonymization has been applied, outsiders could infer substance abuse for an individual if they knew a male welder in this zip code.

In this scenario, using randomized response would change some of the Y's in substance_abuse to N's and vice versa; consequently, outsiders couldn't be sure of the displayed value of substance_abuse given in any individual row, as they wouldn't know which rows had changed.

How the randomization works

Immuta applies a random number generator (RNG) that is seeded with some fixed attributes of the data source, column, backing technology, and the value of the high cardinality column, an approach that simulates cached randomness without having to actually cache anything.

For string data, the random number generator essentially flips a biased coin. If the coin comes up as tails, which it does with the frequency of the replacement rate configured in the policy, then the value is changed to any other possible value in the column, selected uniformly at random from among those values. If the coin comes up as heads, the true value is released.

For numeric data, Immuta uses the RNG to add a random shift from a 0-centered Laplace distribution with the standard deviation specified in the policy configuration. For most purposes, knowing the distribution is not important, but the net effect is that on average the reported values should be the true value plus or minus the specified deviation value.
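
A hedged sketch of this deterministic randomization is shown below; the seeding scheme, the 20% replacement rate, and the noise scale are all illustrative assumptions rather than Immuta's actual parameters.

import hashlib
import random

# Seed the RNG from fixed attributes of the data source and column plus the row's
# high-cardinality key, so the same row is always randomized the same way.
def seeded_rng(data_source: str, column: str, row_key: str) -> random.Random:
    seed = hashlib.sha256(f"{data_source}|{column}|{row_key}".encode("utf-8")).hexdigest()
    return random.Random(seed)

# String values: a biased coin decides whether to swap in another value from the column.
def randomize_string(value, other_values, rng, replacement_rate=0.2):
    if rng.random() < replacement_rate:
        return rng.choice(other_values)
    return value

# Numeric values: add 0-centered Laplace noise (an exponential magnitude with a random sign).
def randomize_numeric(value, rng, deviation=5.0):
    return value + deviation * rng.expovariate(1.0) * rng.choice([-1, 1])

rng = seeded_rng("health_survey", "substance_abuse", "880d0096")
print(randomize_string("Y", ["N"], rng))   # usually "Y", occasionally flipped to "N"
print(randomize_numeric(42.0, rng))        # 42 plus or minus noise around the deviation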

Preserving data utility

Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.

Additionally, randomized response gives deniability of record content, not dataset participation, so individual rows can be displayed.

Mixing masking policies on the same column

In some cases, you may want several different masking policies applied to the same column through Otherwise policies. To build these policies, select everyone who instead of everyone or everyone except. After you specify who the masking policy applies to, select how it applies to everyone else in the Otherwise condition.

You can add and remove tags in Otherwise conditions for global policies (unlike local policy Otherwise conditions); however, all tags or regular expressions included in the initial everyone who rule must be included in an everyone or everyone except rule in the additional clauses.

Complex data types: masking fields within struct columns (public preview)

Feature limitations

  • Masking struct and array columns is only available for Databricks data sources.

  • Immuta only supports Parquet and Delta table types.

Spark supports a class of data types called complex types, which can represent multiple data values in a single column. Immuta supports masking fields within array and struct columns:

  • array: an ordered collection of elements

  • struct: a collection of elements that are primitive or complex types

Without this feature enabled, the struct and array columns of a data source default to jsonb in the Data Dictionary, and the masking policies that users can apply to jsonb columns are limited. For example, if a user wanted to mask PII inside a struct column named patient, they would have to apply null masking to the entire column or use a custom function instead of just masking the name or address fields within it.

After Complex Data Types is enabled on the App Settings page, the column type for struct columns for new data sources will display as struct in the Data Dictionary. (For data sources that are already in Immuta, users can edit the data source and change the column types for the appropriate columns from jsonb to struct.) Once struct fields are available, they can be searched, tagged, and used in masking policies. For example, a user could tag name, ssn, and street as PII instead of the entire patient column.

After a global or local policy masks the columns containing PII, users who do not meet the exception specified in the policy will see these values masked.

Note: Immuta uses the > delimiter to indicate that a field is nested instead of the . delimiter, since field and column names themselves can include the . character.

Caveats

Struct columns with many fields

If users have struct columns with many fields, they will need to either

  • create the data source against a cluster running Spark 3 or

  • add spark.debug.maxToStringFields 1000 to their Spark 2 cluster's configuration.

To get column information about a data source, Immuta executes a DESCRIBE call for the table. In this call, Spark returns a simple string representation of the schema for each column in the table. For the patient column above, the simple string would look like this:

struct<name:string,ssn:string,age:int,address:struct<city:string,state:string,zipCode:string,street:text>>

Immuta then parses this string into the following format for the data source's dictionary:

{
  dataType: 'struct',
  children: [
    {
      name: 'name',
      dataType: 'text'
    },
    {
      name: 'ssn',
      dataType: 'text'
    },
    {
      name: 'age',
      dataType: 'integer'
    },
    {
      name: 'address',
      dataType: 'struct',
      children: [
        {
          name: 'city',
          dataType: 'text'
        },
        {
          name: 'state',
          dataType: 'text'
        },
        {
          name: 'zipCode',
          dataType: 'text'
        },
        {
          name: 'street',
          dataType: 'text'
        },
      ]
    }
  ]
}

However, if the struct contains more than 25 fields, Spark truncates the string, causing the parser to fail and fall back to jsonb. Immuta will attempt to avoid this failure by increasing the number of fields allowed in the server-side property setting, maxToStringFields; however, this only works with clusters on a Spark 3 runtime. The maxToStringFields configuration in Spark 2 cannot be set through the ODBC driver and can only be set through the Spark configuration on the cluster with spark.debug.maxToStringFields 1000 on cluster startup.

External masking

Deprecation notice: Support for this feature has been deprecated.

This feature allows Immuta to unmask data that is masked at rest in a remote database using a customer-provided encryption or masking algorithm. To do so,

  1. System Administrators build their own custom logic and security in an external REST service.

  2. System Administrators give Immuta access to the external REST service and configure tags that will be used by data owners to indicate that data is masked at rest in the remote database.

  3. Data owners apply these tags to columns that are masked (with encryption or another algorithm) in the remote database.

  4. Data owners or governors create data policies that allow Immuta to reach out to this external REST service to unmask data according to the specifications in the policy.

Immuta's External Masking feature expects data to be masked at rest by an external tool consistently on a per-cell basis in the remote database. Immuta then provides policy-based unmasking (and additional masking on top of this using standard masking policies).

Unmasking process

Immuta will only unmask externally masked data if two conditions are met:

  1. A masking policy is applied against that tagged column.

  2. The querying user is exempt from that policy.

When a user who is exempt from the policy restrictions queries that masked column using a filter, Immuta converts the literal being queried using the external algorithm provided. Consider the following example:

  • The social_security_number column is masked on-ingest and has the tag externally_masked_data applied to it.

  • This masking policy is applied to the data source in Immuta: Mask using hashing the values in the column tagged externally_masked_data except for users who belong to the group view_masked_values.

  • The querying user belongs to the view_masked_values group.

When the user above runs the query select * from table A where social_security_number = 220869988, Immuta converts 220869988 to the masked value using the provided algorithm to query the database and return matching rows.

Use equality queries only

Queries against masked values on-ingest should be equality queries only. For example, if an exempt user ran a query like select * from table A where social_security_number > 220869988, the results may not make sense (depending on the algorithm used for masking the data).
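
The sketch below shows why only equality predicates translate cleanly: a deterministic masking function (an HMAC here, purely as an illustrative stand-in for the customer-provided algorithm) preserves equality of values but not their ordering.

import hashlib
import hmac

KEY = b"customer-managed-key"  # hypothetical key for the external masking algorithm

def mask(value: str) -> str:
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

stored = mask("220869988")            # how the value sits, masked, in the remote table
print(mask("220869988") == stored)    # True: an equality filter can be rewritten safely
print(mask("220869989") > stored)     # ordering of masked values is meaningless for ranges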

Tutorials

To configure External Masking, see the App Settings Tutorial.

For an implementation guide, see the External Masking Interface page.

Row-level security policies

These policies hide entire rows or objects of data based on the policy being enforced; some of these policies require the data to be tagged as well.

Matching

These policies match a user attribute with a row/object/file attribute to determine if that row/object/file should be visible. This process uses a direct string match, so the user attribute would have to match the data attribute exactly in order for that row of data to be visible.

For example, to restrict access to insurance claims data to the state in which the user's home office is located, you could build a policy such as this:

Only show rows where user possesses an attribute in Office Location that matches the value in the column State for everyone except when user is a member of group Legal.

In this case, the Office Location is retrieved from the identity management system as a user attribute or group. If the user's attribute (Office Location) was Missouri, rows containing the value Missouri in the State column of the data source would be the only rows visible to that user.
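
A sketch of the matching behavior is shown below; the claims table, attribute values, and group check are hypothetical and only illustrate the direct string match the policy performs.

import pandas as pd

claims = pd.DataFrame({"claim_id": [1, 2, 3], "State": ["Missouri", "Ohio", "Missouri"]})

user_office_locations = ["Missouri"]  # attribute values pulled from the identity manager
user_in_legal_group = False           # members of group Legal see every row

if user_in_legal_group:
    visible = claims
else:
    # Direct string match between the user's attribute values and the State column.
    visible = claims[claims["State"].isin(user_office_locations)]

print(visible)  # only the Missouri rows for this user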

WHERE clause policy

This policy can be thought of as a table "view" created automatically for the user based on the condition of the policy. For example, in the policy below, users who are not members of the Admins group will only see taxi rides where passenger_count < 2.

Only show rows where public.us.taxis.passenger_count <2 for everyone except when user is a member of group Admins.

You can put any valid SQL WHERE clause in the policy. See the Custom WHERE clause functions reference guide for a list of custom functions.

WHERE clause policy requirement

All columns referenced in the policy must have fully qualified names. Any column names that are unqualified (just the column name) will default to a column of the data source the policy is being applied to (if one matches the name).

Note: When building row-level policies with custom SQL statements, avoid using a column that is masked using randomized response in the SQL statement, as this can lead to different behavior depending on whether you're using the Spark or Snowflake integration and may produce results that are unexpected.

Time-based restrictions

These policies restrict access to rows/objects/files that fall within the time restrictions set in the policy. If a data source has time-based restriction policies, queries run against the data source by a user will only return rows/blobs with a date in its event-time column/attribute from within a certain range.

The time window is based on the event time you select when creating the data source. This value will come from a date/time column in relational sources.

Minimization

These policies return a limited percentage of the data, which is randomly sampled, at query time, but it is the same sample for all users. For example, you could limit certain users to only 10% of the data. Immuta uses a hashing policy to return approximately 10% of the data, and the data returned will always be the same; however, the exact number of rows exposed depends on the distribution of high cardinality columns in the database and the hashing type available. Additionally, Immuta will adjust the data exposed when new rows are added or removed.

Best practice: row count

Immuta recommends you use a table with over 1,000 rows for the best results when using a data minimization policy.
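
The deterministic sampling idea can be sketched as below: hash a high-cardinality key into 100 buckets and keep only rows that land in the lowest 10 buckets, so every query sees the same approximate 10% of rows. The helper and column choice are illustrative, not Immuta's implementation.

import hashlib

def in_sample(key: str, percent: int = 10) -> bool:
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent

rows = [{"id": f"user-{i}"} for i in range(1000)]
sample = [r for r in rows if in_sample(r["id"])]
print(len(sample))  # roughly 100 of the 1,000 rows, identical on every run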

Masked columns as input for row-level policies

Public preview: This feature is currently in public preview and available to all accounts.

If a global masking policy applies to a column, you can still use that masked column in a global row-level policy.

Consider the following policy examples:

  • Masking policy: Mask values in columns tagged Country for everyone except users in group Admin.

  • Row-level policy: Only show rows where user possesses an attribute in OfficeLocation that matches the value in column tagged Country for everyone.

Both of these policies use the Country tag to restrict access. Therefore, the masking policy and the row-level policy would apply to data source columns with the tag Country for users who are not in the Admin group.

Limitations

  • This feature is only available for Snowflake and Databricks Unity Catalog integrations.

  • This feature is only supported for global data policies, not local data policies.

New column added policy

This policy pairs with schema monitoring to mask newly added columns to data sources until data owners review and approve these changes from the requests tab of their profile page.

When this policy is activated by a governor, it will automatically be enforced on data sources that have the New tag applied to them.

To learn how to activate this policy, navigate to the tutorial.
