
Immuta Data Flow

Audience: All Immuta users

Content Summary: This page details the data flow in Immuta, specifically how policy decisions are enforced at read time through the SQL, SparkSQL, HDFS, and Immuta Virtual Filesystem access patterns.

Immuta SQL Access Pattern

SQL Access Pattern

  1. User authenticates to the Immuta Query Engine with their Immuta SQL credentials.

  2. User executes a SQL SELECT statement against an Immuta virtual table.

  3. Immuta code in the Query Engine checks with the Immuta service to determine the policy decision (whether the user can query this table, which rows should be hidden, which columns should be masked, etc.).

    • Immuta manages the subscription; the user does not need to have access to the data outside of Immuta.

    • The user attributes from the identity management system that are used to make the policy decision are cached for the latency tolerance configured on the data source.

  4. The original query is rewritten based on the policy decision.

  5. The query is converted to the native database syntax and executed as a query on the native database.

    • The query on the native database is executed using the credentials used to create the data source.

    • The user and query are fully audited in Immuta.

  6. Results are streamed back to the user through the Immuta Query Engine.
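
The rewrite in steps 3 and 4 can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than Immuta's actual API: the `PolicyDecision` class, its fields, and the `rewrite_query` helper are hypothetical, and masking is shown as a plain `NULL` substitution even though a real deployment may mask differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PolicyDecision:
    """Hypothetical shape of a policy decision; not Immuta's real data model."""
    allowed: bool                                  # may the user query the table at all?
    row_filter: Optional[str] = None               # SQL predicate selecting visible rows
    masked_columns: set = field(default_factory=set)  # columns to hide from the user


def rewrite_query(table: str, columns: list, decision: PolicyDecision) -> str:
    """Rewrite a simple SELECT so that it honors the policy decision."""
    if not decision.allowed:
        raise PermissionError(f"user is not subscribed to {table}")
    # Masked columns are replaced with NULL but keep their original name.
    select_list = ", ".join(
        f"NULL AS {col}" if col in decision.masked_columns else col
        for col in columns
    )
    sql = f"SELECT {select_list} FROM {table}"
    # Hidden rows are excluded by appending the policy's row predicate.
    if decision.row_filter:
        sql += f" WHERE {decision.row_filter}"
    return sql
```

In this sketch the user only ever sees the rewritten statement's results; the rewritten SQL is what would then be translated to the native database syntax in step 5.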

Immuta SparkSQL Access Pattern

Spark Access Pattern

  1. User authenticates on the cluster with their HDFS principal.

    • That principal is mapped to an Immuta user.

    • If it does not map to an Immuta user, the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the user is configured as an “ignored Immuta user,” the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the HDFS file is not protected by Immuta (e.g., it's not backing a Hive or Impala table protected by Immuta), then the normal ACL controls will take precedence.

  2. The user executes a SparkSQL job through the Immuta SparkSQL context using the Hive metastore to build the query plan.

  3. The Immuta SparkSQL context checks in with Immuta to get the policy decision.

    • The decision is cached in the Immuta SparkSQL context for a configurable amount of time. The default is 600 seconds.

  4. The original SparkSQL query is rewritten based on the policy decision.

  5. The Spark job executes and is temporarily authorized to access the raw files in HDFS as that user and only within the context of this specific Spark job.

    • Once the table is protected by Immuta, the user never has raw access to the backing HDFS files outside of this flow.

  6. The job runs the altered query plan with temporary access to the backing HDFS files, and the job is audited.
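
The decision caching described in step 3 behaves like a time-to-live cache. The following is a minimal sketch under stated assumptions: `DecisionCache`, its `fetch` callback, and the `(user, table)` cache key are hypothetical names chosen for illustration, and only the 600-second default comes from the text above.

```python
import time

DEFAULT_TTL_SECONDS = 600  # default cache lifetime stated in the steps above


class DecisionCache:
    """Hypothetical TTL cache for policy decisions, keyed by (user, table)."""

    def __init__(self, fetch, ttl=DEFAULT_TTL_SECONDS, clock=time.monotonic):
        self._fetch = fetch      # callable(user, table) that asks the policy service
        self._ttl = ttl          # seconds a cached decision stays valid
        self._clock = clock      # injectable clock, which makes expiry testable
        self._entries = {}       # (user, table) -> (time stored, decision)

    def get(self, user, table):
        key = (user, table)
        now = self._clock()
        entry = self._entries.get(key)
        # Reuse the cached decision only while it is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        decision = self._fetch(user, table)
        self._entries[key] = (now, decision)
        return decision
```

One consequence of this design is that a policy change may take up to the configured lifetime to reach a running context, which is the trade-off a shorter or longer setting adjusts.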

Immuta HDFS Access Pattern

HDFS Access Pattern

If row- and column-level controls are not needed, raw HDFS files can be exposed using file-level controls alone. This is an “HDFS data source” in Immuta.

  1. User authenticates on the cluster with their HDFS principal.

    • That principal is mapped to an Immuta user.

    • If it does not map to an Immuta user, the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the user is configured as an “ignored Immuta user,” the normal ACL controls will take precedence and Immuta controls will be ignored.

    • If the HDFS file is not protected by Immuta (e.g., it's not backing a Hive or Impala table protected by Immuta), then the normal ACL controls will take precedence.

  2. When the user attempts to read the raw file from HDFS, Immuta will check to see if that user is subscribed to that data source.

    • At the time the data source is exposed in Immuta, Immuta crawls HDFS for file metadata using the credentials that were used to create the data source. Only files that are accessible by that user will be available through Immuta.

    • It is possible to force a re-crawl of HDFS for metadata through the Immuta UI.

  3. If the user is subscribed and meets the policy, the user can read the file and that read is audited.

    • The decision is cached for a configurable amount of time. The default is 600 seconds.
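
The branching in the steps above can be summarized as a small decision function. This is a sketch only: the `ReadRequest` fields and the outcome labels are hypothetical, and the `"deny"` outcome for a user who is not subscribed or does not meet the policy is an assumption, since the steps only state when a read is allowed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadRequest:
    """Hypothetical summary of the facts needed to decide one HDFS read."""
    immuta_user: Optional[str]  # None when the principal maps to no Immuta user
    is_ignored_user: bool       # configured as an "ignored Immuta user"
    file_protected: bool        # is the file protected by Immuta?
    subscribed: bool            # is the user subscribed to the data source?
    meets_policy: bool          # does the user satisfy the policy?


def decide_hdfs_read(req: ReadRequest) -> str:
    # In these three cases Immuta controls are ignored and normal ACLs apply.
    if req.immuta_user is None or req.is_ignored_user or not req.file_protected:
        return "defer_to_acl"
    # A subscribed user who meets the policy may read, and the read is audited.
    if req.subscribed and req.meets_policy:
        return "allow_and_audit"
    # Assumption: any other case is refused.
    return "deny"
```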

Immuta Virtual Filesystem Access Pattern

Filesystem Access Pattern

  1. User authenticates with Immuta when the filesystem is mounted on their client machine.

  2. The user attempts to read a virtual file from the Immuta Virtual Filesystem.

  3. The Immuta Virtual Filesystem checks in with Immuta to get the policy decision.

  4. If the user is subscribed and meets the data policy, they will be authorized to read the file.

  5. If the data source is configured to use a cache and the file is read within that data source’s latency tolerance period (i.e., the cache hasn’t expired), the file can be read from the cache.

    1. The cached file can exist on the client machine, encrypted on disk with AES-256 encryption. The decryption key is kept in-memory within the Virtual Filesystem, inaccessible to any user.

    2. The cached file can exist in-memory on the Immuta server, inaccessible to any user.

  6. If the file is not cached, Immuta decides where to load the data from.

    1. If the source is blob storage, Immuta reads the data directly.

    2. If the source is a relational database, Immuta executes a query through the Immuta Query Engine to populate the file.

    3. In both cases, all data policies are applied, either through the query or as the data streams back from blob storage.
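
Steps 5 and 6 amount to a cache check followed by a source-specific load. The sketch below assumes the user has already been authorized (steps 1 through 4) and leaves out cache encryption entirely. The function name, the `source_kind` labels, and the `read_blob`, `query_engine`, and `apply_policies` callbacks are all hypothetical stand-ins, not Immuta interfaces.

```python
import time


def read_virtual_file(name, source_kind, cache, latency_tolerance,
                      read_blob, query_engine, apply_policies,
                      clock=time.monotonic):
    """Return the policy-enforced bytes for one virtual file.

    cache maps a file name to (time stored, bytes); latency_tolerance is in seconds.
    """
    now = clock()
    cached = cache.get(name)
    # Step 5: serve from the cache while the latency tolerance has not expired.
    if cached is not None and now - cached[0] < latency_tolerance:
        return cached[1]
    # Step 6: otherwise load from the backing source.
    if source_kind == "blob":
        # Policies are applied as the data comes back from blob storage.
        data = apply_policies(read_blob(name))
    elif source_kind == "relational":
        # Policies are applied inside the query run by the query engine.
        data = query_engine(name)
    else:
        raise ValueError(f"unknown source kind: {source_kind}")
    cache[name] = (now, data)
    return data
```

Note that only policy-enforced data is stored in the cache in this sketch, so a cache hit never bypasses the policies applied on the original load.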