
You are viewing documentation for Immuta version 2.8.


Immuta Data Flow

Audience: All Immuta users

Content Summary: This page details the data flow in Immuta, specifically how policy decisions are enforced at read-time through the Databricks, HDFS, S3, Snowflake, SQL, and SparkSQL access patterns.

Immuta HDFS Access Pattern


If row- and column-level controls are not needed, raw HDFS files can be exposed with file-level controls alone. This is an “HDFS data source” in Immuta.

  1. User authenticates on the cluster with their HDFS principal.

    • That principal is mapped to an Immuta user.

    • If it does not map to an Immuta user, the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the user is configured as an “ignored Immuta user,” the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the HDFS file is not protected by Immuta (e.g., it's not backing a Hive or Impala table protected by Immuta), then the normal ACL controls will take precedence.

  2. When the user attempts to read the raw file from HDFS, Immuta will check to see if that user is subscribed to that data source.

    • At the time the data source is exposed in Immuta, Immuta crawls HDFS for file metadata using the credentials that were used to create the data source. Only files that are accessible by that user will be available through Immuta.

    • It is possible to force a re-crawl of HDFS for metadata through the Immuta UI.

  3. If the user is subscribed and meets the policy, the user can read the file and that read is audited.

    • The decision is cached for a configurable amount of time. The default is 600 seconds.
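The fall-through in step 1 can be sketched as a simple decision function. This is an illustrative model, not Immuta's implementation; the user sets, the protected path, and the return labels are all hypothetical.

```python
def enforcement_path(principal, immuta_users, ignored_users, protected_paths, path):
    """Decide whether Immuta or native HDFS ACLs govern a read (step 1 above)."""
    if principal not in immuta_users:
        return "hdfs-acl"   # principal has no Immuta mapping
    if principal in ignored_users:
        return "hdfs-acl"   # explicitly ignored Immuta user
    if path not in protected_paths:
        return "hdfs-acl"   # file is not backing an Immuta-protected table
    return "immuta"         # subscription and policy check apply; read is audited

# Example: only /data/claims.csv is protected; bob has no Immuta account.
users = {"alice", "svc_etl"}
print(enforcement_path("alice", users, {"svc_etl"}, {"/data/claims.csv"}, "/data/claims.csv"))  # immuta
print(enforcement_path("bob", users, set(), {"/data/claims.csv"}, "/data/claims.csv"))          # hdfs-acl
```

Note that every branch except the last falls back to native ACLs: Immuta only enforces when the principal maps to a non-ignored Immuta user and the file backs a protected table.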

Immuta S3 Access Pattern

  1. User configures their S3 client to connect to Immuta.
  2. User queries the immuta bucket for data sources and blob metadata using the S3 client.
  3. Using a list-objects call, the user determines which content keys are available and the size of each object. This request is served from blob metadata stored in Immuta; no calls are made to the external storage system.
  4. User downloads blobs using the S3 client.
  5. Immuta streams the blobs from the underlying data source back to the client.
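The list-objects call in step 3 is an ordinary S3 ListObjectsV2 request directed at the Immuta endpoint. The following sketch only builds such a URL; the host immuta.example.com and the prefix are hypothetical placeholders, while the bucket name immuta comes from step 2.

```python
from urllib.parse import urlencode, urlunsplit

def list_objects_url(endpoint, bucket, prefix="", max_keys=1000):
    """Build a path-style S3 ListObjectsV2 URL against a given endpoint.
    The endpoint host used below is hypothetical; substitute your deployment."""
    query = urlencode({"list-type": 2, "prefix": prefix, "max-keys": max_keys})
    return urlunsplit(("https", endpoint, f"/{bucket}", query, ""))

url = list_objects_url("immuta.example.com", "immuta", prefix="my_source/")
print(url)  # https://immuta.example.com/immuta?list-type=2&prefix=my_source%2F&max-keys=1000
```

Because Immuta answers this request from its own blob metadata, any standard S3 client pointed at the Immuta endpoint can browse data sources without touching the backing storage.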

Notes:

  • Authentication (the API key) is cached for 15 seconds on each request.
  • On a get/head blob request, a data source with policies is cached for 15 seconds (alongside the API key cache) to reduce the overhead of bulk head requests, which many S3 tools issue to verify that objects actually exist.
  • All blobs are cached by blobId and user, with a TTL configurable via cache.defaultBlobCacheTTL. Range-requested blobs are never cached (requests for large blobs are always range requests, and caching each part could push everything else out of the cache).
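The blob-caching behavior in the last note can be sketched as a small TTL cache. This is a minimal illustrative model, not Immuta's code; the class and parameter names are invented, and only the keying by (blobId, user), the configurable TTL, and the range-request bypass come from the text above.

```python
import time

class BlobCache:
    """Sketch: blobs cached by (blob_id, user) with a configurable TTL;
    range-requested blobs are never cached."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}          # (blob_id, user) -> (data, stored_at)

    def put(self, blob_id, user, data, is_range_request=False, now=None):
        if is_range_request:      # never cache partial (range) responses
            return
        now = time.time() if now is None else now
        self._store[(blob_id, user)] = (data, now)

    def get(self, blob_id, user, now=None):
        now = time.time() if now is None else now
        entry = self._store.get((blob_id, user))
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        return None               # miss or expired

cache = BlobCache(ttl_seconds=60)
cache.put("blob1", "alice", b"payload", now=0)
cache.put("blob2", "alice", b"chunk", is_range_request=True, now=0)
print(cache.get("blob1", "alice", now=30))   # b'payload' (fresh)
print(cache.get("blob1", "alice", now=90))   # None (expired)
print(cache.get("blob2", "alice", now=1))    # None (range requests bypass the cache)
```

Keying on the user as well as the blob matters: two users subscribed to the same data source may see different policy-filtered content, so their cached blobs cannot be shared.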

Immuta SparkSQL Access Pattern


  1. User authenticates on the cluster with their HDFS principal.

    • That principal is mapped to an Immuta user.

    • If it does not map to an Immuta user, the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the user is configured as an “ignored Immuta user,” the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the HDFS file is not protected by Immuta (e.g., it's not backing a Hive or Impala table protected by Immuta), then the normal ACL controls will take precedence.

  2. The user executes a SparkSQL job through the Immuta SparkSQL session using the Hive metastore to build the query plan.

  3. The Immuta SparkSQL session checks in with Immuta to get the policy decision.

    • The decision is cached in the Immuta SparkSQL session for a configurable amount of time. The default is 600 seconds.
  4. The original SparkSQL query is rewritten based on the policy decision.

  5. The Spark job executes and is temporarily authorized to access the raw files in HDFS as that user and only within the session of this specific Spark job.

    • The user will never have raw access to the HDFS files backing the table outside of this flow once protected by Immuta.
  6. The job runs with the altered query plan with temporary access to the backing HDFS files and is audited.
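The decision caching in step 3 can be sketched as a per-session TTL cache. This is an illustrative model under stated assumptions, not Immuta's implementation; only the per-data-source caching and the 600-second default come from the text above.

```python
import time

class DecisionCache:
    """Sketch of step 3: cache the policy decision per data source for a
    configurable TTL (default 600 seconds)."""

    def __init__(self, fetch_decision, ttl_seconds=600):
        self.fetch = fetch_decision          # callable: data_source -> decision
        self.ttl = ttl_seconds
        self._cache = {}                     # data_source -> (decision, fetched_at)

    def decision(self, data_source, now=None):
        now = time.time() if now is None else now
        hit = self._cache.get(data_source)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                    # fresh: no round trip to Immuta
        value = self.fetch(data_source)      # stale or missing: ask Immuta
        self._cache[data_source] = (value, now)
        return value

calls = []
cache = DecisionCache(lambda ds: calls.append(ds) or "allow", ttl_seconds=600)
cache.decision("sales", now=0)
cache.decision("sales", now=599)   # within TTL: served from cache
cache.decision("sales", now=601)   # TTL expired: re-fetched
print(len(calls))                  # 2
```

The trade-off is the usual one for policy caches: a longer TTL reduces round trips from the Spark session to Immuta, but policy changes can take up to the TTL to reach running sessions.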

Immuta SQL Access Pattern


  1. User authenticates to the Immuta Query Engine with their Immuta SQL credentials.

  2. User executes a SQL SELECT statement against an Immuta virtual table.

  3. Immuta code in the Query Engine checks with the Immuta service to determine the policy decision (whether the user can query the table, which rows should be hidden, which columns should be masked, and so on).

    • Immuta manages the subscription; the user does not need to have access to the data outside of Immuta.

    • User attributes from the identity management system used to make the policy decision are cached for the configured latency tolerance on the data source.

  4. The original query is rewritten based on the policy decision.

  5. The query is converted to the native database syntax and executed as a query on the native database.

    • The query on the native database is executed using the credentials used to create the data source.

    • The user and query are fully audited in Immuta.

  6. Results are streamed back to the user through the Immuta Query Engine.
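The rewrite-and-execute flow in steps 4 and 5 can be sketched as follows. This is a deliberately simplified model, not Immuta's rewriter: the table, columns, masking-as-NULL strategy, and row filter are all illustrative, and an in-memory SQLite database stands in for the native database.

```python
import sqlite3

def rewrite_query(table, columns, masked, row_filter):
    """Sketch of step 4: null out masked columns and append the policy's
    row filter. All names below are illustrative."""
    select_list = ", ".join(
        f"NULL AS {c}" if c in masked else c for c in columns
    )
    return f"SELECT {select_list} FROM {table} WHERE {row_filter}"

# In-memory SQLite stands in for the native database (step 5).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, ssn TEXT, state TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                 [("Ann", "123-45-6789", "OH"), ("Bo", "987-65-4321", "NY")])

# Policy decision: mask ssn, show only OH rows.
sql = rewrite_query("patients", ["name", "ssn", "state"], {"ssn"}, "state = 'OH'")
print(conn.execute(sql).fetchall())   # [('Ann', None, 'OH')]
```

Because enforcement happens in the rewritten query itself, the native database only ever sees the already-restricted statement, executed under the data source owner's credentials rather than the end user's.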