Data Processing, Encryption, and Masking Practices
Audience: All Immuta Users
Content Summary: This page outlines Immuta's practices regarding policy decision and sample data, data encryption, and masking.
Policy Decision Data
Policy decision data is transmitted to ensure end users querying data are limited to the appropriate access as defined by the policies in Immuta.
In the Databricks integration, the user, data source information, and query are sent to Immuta through the Spark Plugin to determine which policies must be applied while the query is processed. Data that travels from Immuta to the Databricks cluster could include:
- user attributes.
- what columns to mask.
- the entire predicate itself (for row-level policies).
Policy decision data is transmitted in the following sequence:
- A user runs a query against data in their environment.
- The query is sent to the Immuta Web Service.
- The Web Service queries the Metadata Database to obtain the policy definition, which includes data source metadata (tags, column names, etc.) and user entitlements (groups and attributes).
- The policy information is transmitted to the remote data system for native policy enforcement.
- Query results are displayed based on what policy definition was applied.
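The sequence above can be sketched in a few lines of Python. This is an illustrative sketch only, not Immuta's implementation: the `METADATA_DB` dictionary, the `resolve_policy` and `enforce` functions, and the `"REDACTED"` placeholder are all hypothetical names chosen for the example. The point it shows is that only policy information (columns to mask and a row-level predicate) crosses from the policy service to the system that enforces it.

```python
# Hypothetical stand-in for the Metadata Database (all names are illustrative).
METADATA_DB = {
    "policies": {
        "sales.orders": {
            "masked_columns": ["email"],
            # Row-level policy: keep rows whose column equals a user attribute.
            "row_filter": {"column": "region", "user_attribute": "region"},
        }
    },
    "entitlements": {
        "alice": {"groups": ["analysts"], "attributes": {"region": "EMEA"}},
    },
}


def resolve_policy(user, data_source):
    """Web Service step: combine the policy definition with user entitlements."""
    policy = METADATA_DB["policies"][data_source]
    entitlements = METADATA_DB["entitlements"][user]
    row_filter = policy["row_filter"]
    allowed_value = entitlements["attributes"][row_filter["user_attribute"]]
    # Only policy information is returned; no table data is read here.
    return {
        "masked_columns": list(policy["masked_columns"]),
        "predicate": (row_filter["column"], allowed_value),
    }


def enforce(rows, decision):
    """Remote data system step: apply the decision to the query results."""
    column, allowed_value = decision["predicate"]
    visible = [row for row in rows if row[column] == allowed_value]
    return [
        {k: ("REDACTED" if k in decision["masked_columns"] else v) for k, v in row.items()}
        for row in visible
    ]


if __name__ == "__main__":
    table = [
        {"id": 1, "email": "a@example.com", "region": "EMEA"},
        {"id": 2, "email": "b@example.com", "region": "APAC"},
    ]
    print(enforce(table, resolve_policy("alice", "sales.orders")))
```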
Sample Raw Data
Sample data is processed and aggregated or reduced during Immuta's fingerprinting process and specific policy processes. Note: Data Owners can see sample data when editing a data source. However, this action requires the database password, and the small sample of data visible is only displayed in the UI and is not stored in Immuta.
- When enabled, statistical queries made during data source health checks are distilled into summary statistics, called fingerprints. The sample data processed for fingerprinting allows Immuta to track data source changes.
- During this process, statistical query results and data samples (which may contain PII) are temporarily held in memory during computation by the Fingerprint Service.
- The fingerprinting process checks for new tables through schema monitoring (when enabled) and captures summary statistics of changes to data sources, including when policies were applied, external views were created, or sensitive data elements were added.
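To make the idea of distilling a sample into summary statistics concrete, here is a minimal Python sketch. It is not the Fingerprint Service's actual logic: the choice of statistics, the `fingerprint` and `has_changed` function names, and the 10% tolerance are assumptions made for illustration. It shows that only the reduced statistics need to outlive the computation, while the raw sample stays in memory.

```python
import statistics


def fingerprint(sample):
    """Reduce a numeric sample to summary statistics (illustrative choice of stats).

    Assumes the sample contains at least one non-null value.
    """
    non_null = [v for v in sample if v is not None]
    return {
        "count": len(sample),
        "null_count": len(sample) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.fmean(non_null),
    }


def has_changed(old, new, tolerance=0.1):
    """Flag a change when the mean shifts by more than `tolerance`, relative to the old mean."""
    baseline = max(abs(old["mean"]), 1e-12)  # avoid division by zero
    return abs(new["mean"] - old["mean"]) / baseline > tolerance


if __name__ == "__main__":
    before = fingerprint([1, 2, 2, None, 5])
    after = fingerprint([10, 20, 20, None, 50])
    # The raw samples are no longer referenced; only the statistics remain.
    print(before, has_changed(before, after))
```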
Immuta does not sample data for row redaction policies. It pulls data only in two cases: samples to determine whether a column is a candidate for randomized response, and aggregates of user-defined cohorts for k-anonymization. Both datasets exist only in memory during the computation.
- Sample data is processed when k-anonymization or randomized response policies are applied to data sources.
- Sample data exists temporarily in memory in the Fingerprint Service during the computation.
- k-Anonymization Policies: At the time of its application, the columns of a k-anonymization policy are queried under a separate fingerprinting process that generates rules enforcing k-anonymity. The results of this query, which may contain PII, are temporarily held in memory by the Fingerprint Service. The final rules are stored in the Metadata Database as the policy definition for enforcement.
- Randomized Response Policies: If the list of substitution values for a categorical column is not part of the policy specification (e.g., when specified via the API), a list is obtained via query and merged into the policy definition in the Metadata Database.
- Raw data is processed for masking, producing either a distinct set of values or aggregated groups of values.
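The k-anonymization step described above (query the columns, derive rules, store only the rules) can be illustrated with a simplified Python sketch. This is an assumption-laden example, not Immuta's algorithm: it uses plain suppression of rare value combinations as the rule, and the names `k_anonymity_rules` and `apply_rules` are hypothetical. A production approach would typically also generalize values and handle the suppressed group itself, which this sketch does not.

```python
from collections import Counter


def k_anonymity_rules(rows, quasi_identifiers, k):
    """Return the value combinations that occur fewer than k times.

    The grouped counts exist only inside this function; only the resulting
    rule set (the combinations to suppress) is returned for storage.
    """
    counts = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return {combo for combo, n in counts.items() if n < k}


def apply_rules(rows, quasi_identifiers, suppressed):
    """Null out the quasi-identifier columns of rows matching a suppressed combination."""
    result = []
    for row in rows:
        combo = tuple(row[c] for c in quasi_identifiers)
        masked = dict(row)  # copy so the input rows are not modified
        if combo in suppressed:
            for c in quasi_identifiers:
                masked[c] = None
        result.append(masked)
    return result


if __name__ == "__main__":
    data = [
        {"age": 30, "zip": "A", "diagnosis": "x"},
        {"age": 30, "zip": "A", "diagnosis": "y"},
        {"age": 40, "zip": "B", "diagnosis": "z"},
    ]
    rules = k_anonymity_rules(data, ["age", "zip"], k=2)
    print(apply_rules(data, ["age", "zip"], rules))
```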
Encryption of Data at Rest
Immuta captures metadata and stores it in an internal PostgreSQL database. Customers can encrypt the volumes backing the database using an external Key Management Service to ensure that data is encrypted at rest.
Encryption of Data in Transit
To encrypt data in transit, Immuta uses the TLS protocol, which is configured by the customer.
Encryption Key Management
Immuta encrypts values with data encryption keys that are either system-generated or managed through an external key management service (KMS). Immuta recommends using a KMS to encrypt and decrypt data keys and supports the AWS Key Management Service. If no KMS is configured, Immuta generates a data encryption key on a user-defined rollover schedule, using the most recent data key to encrypt new values while preserving old data keys to decrypt old values.
Immuta employs three families of functions in its masking policies:
One-way Hashing: One-way (irreversible) hashing is performed via a salted SHA-256 hash. A consistent salt is used for values throughout the data source, so users can count or track specific values without revealing the true value. Because each data source has its own salt, hashed values differ across data sources, and users are unable to join on hashed values. Note: joining on masked values can be enabled in Immuta Projects.
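A minimal Python sketch of salted SHA-256 hashing illustrates why values stay countable within a data source but cannot be joined across data sources. How the salt and value are actually combined is not specified here, so prefixing the salt is an assumption, and `mask_hash` and `make_salt` are hypothetical names.

```python
import hashlib
import secrets


def make_salt():
    """Generate one random salt per data source (16 bytes is an assumed size)."""
    return secrets.token_bytes(16)


def mask_hash(value, salt):
    """Hash a string value with the data source's salt (salt prefix is an assumption)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    salt_a, salt_b = make_salt(), make_salt()
    # Same data source, same value: identical hashes, so counts and grouping work.
    print(mask_hash("alice@example.com", salt_a) == mask_hash("alice@example.com", salt_a))
    # Different data sources: different hashes, so a join on the column fails.
    print(mask_hash("alice@example.com", salt_a) == mask_hash("alice@example.com", salt_b))
```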
Reversible Masking: For reversible masking, values are encrypted using AES-256 CBC encryption. Encryption is performed using a cell-specific initialization vector. The resulting values can be unmasked by an authorized user. Note that this is dynamic encryption of individual fields as results are streamed to the querying system; Immuta does not modify records in the data store. See the External Masking documentation to learn how Immuta policies can work with third-party encryption/decryption services to reveal encrypted records under appropriate circumstances.
Reversible Format Preserving Masking: Format preserving masking maintains the format of the data while masking the value, and is achieved by initializing and applying the NIST standard method FF1 at the column level. The resulting values can be unmasked by an authorized user.
Immuta collects and stores the following kinds of metadata in Immuta's Metadata Database for policy enforcement. Further, policy information may be transmitted to data source host systems for enforcement purposes as part of a query, or to enable the host system to perform native enforcement.
Identity Management Information: Usernames, group information, and other kinds of personal identifiers may be stored and referenced for the purposes of performing authentication and access control and may be retained in audit logs. When such information is relevant for access determination under policy, it may be retained as part of the policy definition.
Schema Information: Data source metadata such as schema, column data types, and information about the host.
Fingerprints: When enabled, additional statistical queries made during the health check are distilled into summary statistics, called fingerprints. During this process, statistical query results and data samples (which may contain PII) are temporarily held in memory by the Fingerprint Service.
k-Anonymization Policies: At the time of its application, the columns of a k-anonymization policy are queried under a separate fingerprinting process that generates rules enforcing k-anonymity. The results of this query, which may contain PII, are temporarily held in memory by the Fingerprint Service. The final rules are stored for enforcement.
Randomized Response Policies: If the list of substitution values for a categorical column is not part of the policy specification (e.g., when specified via the API), a list is obtained via query and merged into the policy definition.
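As an illustration of how a substitution list feeds a randomized response policy, the following Python sketch derives the list from the distinct values of a column and then randomly replaces values. It is a generic sketch of the technique, not Immuta's implementation: the function names, the uniform choice among substitutions, and the replacement probability `p` are assumptions.

```python
import random


def substitution_values(column_values):
    """Stand-in for the query that obtains the distinct categorical values."""
    return sorted(set(column_values))


def randomized_response(value, substitutions, p, rng):
    """With probability p, replace the value with a uniformly chosen substitution."""
    if rng.random() < p:
        return rng.choice(substitutions)
    return value


if __name__ == "__main__":
    column = ["yes", "no", "no", "yes", "no"]
    subs = substitution_values(column)  # this list would be merged into the policy definition
    rng = random.Random()
    print([randomized_response(v, subs, p=0.3, rng=rng) for v in column])
```

Because the replacement is random, any single returned value is deniable, while aggregate proportions can still be estimated when `p` is known.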