Data Processing

Terminology: Local Region

The Local Region is the customer's operating region, which determines where an Immuta tenant is deployed and where the Immuta Metadata Database lives. Immuta SaaS can be deployed in a set of supported AWS regions.

To understand how Immuta processes data, it's imperative to understand the purpose of the Immuta components deployed in the Immuta Cloud infrastructure:

  • Immuta Tenant Metadata Database: The database specific to a customer's tenant that contains the tenant's metadata that powers the core functionality of Immuta, including policy data and attributes about data sources (tags, audit data, etc.).

  • Immuta Web Service: This component includes the Immuta UI and API and is responsible for all web-based user interaction with Immuta, metadata ingest, and the data fingerprinting process.

Data Categories

Immuta tenants are localized to the customer

The Immuta tenant and its components (Metadata Database and Web Service) are localized to the customer.

Data processed by Immuta falls into one of the following categories. Each category is described in more detail later on this page.

| Data Category | Definition | Storage |
| --- | --- | --- |
| Audit logs | Audit logs include details about data access, such as who subscribes to a data source, when they access the data, and the queries they've run. | This data is stored in the tenant's Metadata Database. |
| Identity management data | This data includes user account data, such as email addresses, names, and entitlements. | This data is stored in the tenant's Metadata Database, unless a customer has opted to use an external identity provider. |
| Data dictionary and data source metadata | This data includes column names, tags, free-text descriptions of columns, and health check results, such as row counts and high cardinality checks. Additionally, this data source metadata may include the schema, column data types, and information about the host. | This data is stored in the tenant's Metadata Database. |
| Fingerprints | This data includes summary statistics regarding changes to data sources, including when policies have been applied, when external views have been created, when sensitive data elements have been added, and when users have enabled checks for new tables through schema monitoring. | This data is stored in the tenant's Metadata Database. |
| Policy decision data | This data includes the metadata (such as usernames, group information, or other kinds of personal identifiers) sent to the Immuta Web Service to determine if a user has access. When such information is relevant for access determination, it may be retained as part of the policy definition. | This data is stored in the tenant's Metadata Database. |
| User metrics data | This data includes tenant metrics -- statistics about activities occurring within Immuta, such as how many policies, projects, or tags have been created and how many users are authenticated within Immuta -- and user metrics, such as the user, session, and event properties (user and session IDs, page views, and clicks). | This data is stored in a single, US-based region. |

TCP Connection

Immuta communicates with remote databases over a TCP connection.

Immuta Audit Logs

Audit data includes metadata (e.g., who subscribes to a data source, when they access data, potentially what SQL queries were run, etc.) that is generated by a variety of actions and processes in Immuta. The most common processes are illustrated in the diagram below.

All audit logs flow from the Web Service to the Metadata Database (local to the customer's region) and are stored for 90 days.

Immuta Identity Management Data

This process is only relevant to customers using an external identity provider service to manage user accounts in Immuta.

  1. The initial Immuta user account is created on the Immuta SaaS tenant, and this data is stored in the tenant's Metadata Database.

  2. A System Administrator configures an external IAM with Immuta.

  3. User account information is collected from the external IAM and stored in the tenant's Metadata Database.

Data Dictionary and Data Source Metadata

This data is processed to support data source creation, health checks, policy enforcement, and dictionary features.

  1. A System Administrator configures the integration in Immuta.

  2. A Data Owner registers data sources from their remote data platform with Immuta. Note: Data Owners can see sample data when editing a data source. However, this action requires the database password, and the small sample of data visible is only displayed in the UI and is not stored in Immuta.

  3. When a data source is created or updated, the Metadata Database pulls in and stores statistics about the data source, including row count and high cardinality calculations.

  4. The data source health check runs daily to ensure existing tables are still valid.

  5. If an external catalog is enabled, the daily health check will pull in data source attributes (e.g., tags and definitions) and store them in the Metadata Database.
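As a rough illustration of the statistics mentioned in step 3, the sketch below computes a row count and a high-cardinality flag for each column of a small in-memory table. The function name, the 0.9 threshold, and the definition of "high cardinality" as the ratio of distinct values to rows are assumptions made for this example; the page does not describe how Immuta actually calculates these checks.

```python
from typing import Any


def column_health_stats(rows: list[dict[str, Any]], threshold: float = 0.9) -> dict[str, Any]:
    """Compute a row count and per-column cardinality flags (hypothetical helper)."""
    row_count = len(rows)
    columns = list(rows[0].keys()) if rows else []
    stats: dict[str, Any] = {"row_count": row_count, "columns": {}}
    for col in columns:
        distinct = len({row[col] for row in rows})
        # Assumed definition: a column is "high cardinality" when nearly every value is unique.
        stats["columns"][col] = {
            "distinct": distinct,
            "high_cardinality": distinct / row_count >= threshold,
        }
    return stats


rows = [
    {"id": 1, "state": "OH"},
    {"id": 2, "state": "OH"},
    {"id": 3, "state": "MD"},
    {"id": 4, "state": "OH"},
]
stats = column_health_stats(rows)
```

Only the counts and flags in `stats` would need to be stored in this sketch, not the rows themselves.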

Policy Decision Data

Policy decision data is transmitted to ensure end users querying data are limited to the appropriate access as defined by the policies in Immuta.

Spark plugin

In the Databricks Spark integration, the user, data source information, and query are sent to Immuta through the Spark Plugin to determine what policies need to be applied while the query is being processed. Data that travels from Immuta to the Databricks cluster could include

  • user attributes.

  • what columns to mask.

  • the entire predicate itself (for row-level policies).

  1. A user runs a query against data in their environment.

  2. The query is sent to the Immuta Web Service.

  3. The Web Service queries the Metadata Database to obtain the policy definition, which includes data source metadata (tags, column names, etc.) and user entitlements (groups and attributes).

  4. The policy information is transmitted to the remote data system for policy enforcement.

  5. Query results are displayed based on what policy definition was applied.
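Steps 2 through 4 can be sketched as a lookup followed by a decision. In the example below, an in-memory dictionary stands in for the Metadata Database, and the `PolicyDecision` shape, the `decide` function, the `PII` tag rule, and the `privacy-officers` exemption are all hypothetical; they are not Immuta's API or policy model, only a minimal picture of how data source metadata and user entitlements combine into masked columns and a row-level predicate.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PolicyDecision:
    """Policy information sent to the remote data system (hypothetical shape)."""
    masked_columns: list[str] = field(default_factory=list)
    row_predicate: Optional[str] = None


# Stand-in for the tenant's Metadata Database: data source metadata and user entitlements.
METADATA_DB = {
    "sources": {
        "sales.customers": {
            "columns": {"email": ["PII"], "region": [], "total": []},
            "row_attribute": "region",  # assumed: rows are filtered on this column
        }
    },
    "users": {
        "alice": {"groups": ["analysts"], "attributes": {"region": "EMEA"}},
        "bob": {"groups": ["privacy-officers"], "attributes": {"region": "AMER"}},
    },
}


def decide(user: str, source: str) -> PolicyDecision:
    """Combine data source metadata and user entitlements into a policy decision."""
    src = METADATA_DB["sources"][source]
    entitlements = METADATA_DB["users"][user]
    # Assumed rule: columns tagged PII are masked unless the user is in an exempt group.
    exempt = "privacy-officers" in entitlements["groups"]
    masked = [] if exempt else sorted(c for c, tags in src["columns"].items() if "PII" in tags)
    # Assumed rule: users only see rows matching their own attribute value.
    attr = src["row_attribute"]
    value = entitlements["attributes"].get(attr)
    predicate = f"{attr} = '{value}'" if value is not None else "1 = 0"
    return PolicyDecision(masked_columns=masked, row_predicate=predicate)


alice = decide("alice", "sales.customers")
bob = decide("bob", "sales.customers")
```

Note that the decision carries the user's attribute value inside the predicate, which is consistent with the statement that user attributes and the entire predicate may travel to the remote system.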

Sample Raw Data

Sample data is processed and aggregated or reduced during Immuta's fingerprinting process and specific policy processes. Note: Data Owners can see sample data when editing a data source. However, this action requires the database password, and the small sample of data visible is only displayed in the UI and is not stored in Immuta.

Fingerprinting Process

In the Snowflake integration, statistical queries made during data source registration are distilled into summary statistics, called fingerprints. Fingerprinting allows Immuta to implement advanced privacy-enhancing masking and data policies.

During this process, query results return statistics (not data samples) about the data to Immuta (no PII is included). The fingerprinting process checks for new tables through schema monitoring (when enabled) and captures summary statistics of changes to data sources, including when policies were applied, external views were created, or sensitive data elements were added.
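To make the distinction between statistics and data samples concrete, here is a minimal sketch of a statistical query whose result contains only aggregates. SQLite stands in for the remote data platform, and the `fingerprint_column` function, the `visits` table, and the particular statistics chosen are assumptions for illustration; the page does not list which statistics Immuta's fingerprints contain.

```python
import sqlite3


def fingerprint_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Return aggregate statistics for one column; no individual rows leave the database."""
    # Identifiers are interpolated directly, so only pass trusted table and column names.
    query = (
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}"), AVG("{column}") '
        f'FROM "{table}"'
    )
    total, non_null, distinct, mean = conn.execute(query).fetchone()
    return {
        "row_count": total,
        "null_count": total - non_null,
        "distinct_count": distinct,
        "mean": mean,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (age INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?)", [(34,), (51,), (34,), (None,), (29,)])
fingerprint = fingerprint_column(conn, "visits", "age")
```

The aggregation runs inside the database, so the caller receives a single row of summary values rather than the underlying records.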

Policy Processes

Immuta does not sample data for row-level policies

Immuta does not sample data for row-level policies; Immuta only pulls samples of data to determine if a column is a candidate for randomized response and aggregates of user-defined cohorts for k-anonymization. Both datasets only exist in memory during the computation.

  1. Sample data is processed when k-anonymization or randomized response policies are applied to Snowflake data sources.

  2. Sample data exists temporarily in memory during the computation.

  3. Raw data is processed for masking, producing either a distinct set of values or aggregated groups of values.

  4. If either of the following policy types targets a column that contains PII, Immuta stores that PII in the Metadata Database in order to enforce the policy:

    • k-Anonymization Policies: At the time of its application, the columns of a k-anonymization policy are queried under a separate process that generates rules enforcing k-anonymity. The results of this query (which may contain PII) are stored in the Metadata Database as the policy definition for enforcement. Immuta requires that you opt in to use this masking policy type.

    • Randomized Response Policies: If the list of substitution values for a categorical column is not part of the policy specification (e.g., when specified via the API), a list is obtained via query (which may contain PII) and merged into the policy definition in the Metadata Database. Immuta requires that you opt in to use this masking policy type.
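The following sketch shows one simple way rules enforcing k-anonymity could be generated and why such rules may contain raw values. It uses plain suppression: value combinations that appear in at least k rows are kept, and all others are nulled out. The function names and sample data are hypothetical, and suppression is an assumed technique chosen for brevity; the page does not specify the algorithm Immuta uses.

```python
from collections import Counter
from typing import Optional

Row = dict[str, Optional[str]]


def build_k_anonymity_rules(rows: list[Row], columns: list[str], k: int) -> set[tuple]:
    """Return the value combinations that appear in at least k rows."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {combo for combo, n in counts.items() if n >= k}


def apply_rules(row: Row, columns: list[str], rules: set[tuple]) -> Row:
    """Null out the policy columns of any row whose combination is not in the rules."""
    combo = tuple(row[c] for c in columns)
    if combo in rules:
        return dict(row)
    return {**row, **{c: None for c in columns}}


rows = [
    {"zip": "43004", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "43004", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "43004", "age_band": "40-49", "diagnosis": "A"},
    {"zip": "21201", "age_band": "30-39", "diagnosis": "C"},
    {"zip": "21201", "age_band": "30-39", "diagnosis": "A"},
]
columns = ["zip", "age_band"]
rules = build_k_anonymity_rules(rows, columns, k=2)
masked = [apply_rules(r, columns, rules) for r in rows]
```

The `rules` set holds actual values from the data (here, ZIP codes and age bands), which is why a stored policy definition of this kind can contain PII. A fuller implementation would also need to handle the case where fewer than k rows end up suppressed, since those rows form their own small group.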

User Metrics Data

Immuta collects a variety of metrics and details about app usage that are stored in a single US-based region.

  1. Data about activity within the tenant is aggregated nightly.

  2. These aggregates produce metrics (the number of policies created, number of users authenticated, number of tags created, etc.). This data is stored in Immuta's data warehouse, which resides in a single, US-based region (AWS us-east-1).

  3. Telemetry Data (session ID, length, event properties, page views, etc.) is collected using Segment and Heap.
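The nightly aggregation in steps 1 and 2 amounts to reducing a stream of activity events to counts. The sketch below is a minimal version of that idea; the event types, field names, and the `aggregate_tenant_metrics` function are hypothetical and do not reflect Immuta's actual event schema.

```python
from collections import Counter


def aggregate_tenant_metrics(events: list[dict]) -> dict[str, int]:
    """Reduce one day's activity events to tenant-level counts."""
    counts = Counter(event["type"] for event in events)
    # Count each authenticated user once, however many times they logged in.
    authenticated = {e["user_id"] for e in events if e["type"] == "user_authenticated"}
    return {
        "policies_created": counts["policy_created"],
        "tags_created": counts["tag_created"],
        "users_authenticated": len(authenticated),
    }


events = [
    {"type": "user_authenticated", "user_id": "u1"},
    {"type": "policy_created", "user_id": "u1"},
    {"type": "user_authenticated", "user_id": "u2"},
    {"type": "user_authenticated", "user_id": "u1"},
    {"type": "tag_created", "user_id": "u2"},
    {"type": "tag_created", "user_id": "u2"},
]
metrics = aggregate_tenant_metrics(events)
```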
