Immuta Data Access Patterns
Audience: Data Owners, Data Users, and System Administrators
Content Summary: The Immuta data control plane does not require users to learn a new API or language to access the data it exposes. Immuta plugs into existing tools and ongoing work while remaining invisible to downstream consumers by exposing data through four foundational access patterns: the Immuta Query Engine, HDFS, the Immuta Virtual Filesystem, and SparkSQL.
The Immuta Query Engine provides users with a basic Immuta PostgreSQL connection. The tables within this connection represent all the connected data across your organization. Those tables, however, are virtual: they remain empty until a query is run. At query time, the SQL is proxied through the virtual Immuta table down to the native database, and the policy is enforced automatically. The Immuta SQL connection can be used within any Business Intelligence (BI) tool or integrated directly into code for interactive analysis.
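Because the connection speaks PostgreSQL, ordinary PostgreSQL client code should work against it. The following is a minimal sketch, assuming the third-party psycopg2 driver is installed; the host, port, database name, credentials, and the `customers` table are hypothetical placeholders, not values taken from Immuta.

```python
from urllib.parse import quote


def build_dsn(host, port, database, user, password):
    """Build a libpq-style connection URI for a PostgreSQL-compatible endpoint."""
    # Percent-encode credentials so characters such as '@' or ':' do not break the URI.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, database
    )


def fetch_rows(dsn, sql, params=None):
    """Run one query and return all rows.

    The table is virtual, so nothing is read until this query executes.
    Policy enforcement happens on the Immuta side, not in this client code.
    """
    import psycopg2  # imported lazily; assumes `pip install psycopg2-binary`

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()


# Hypothetical values: replace with the connection details for your own account.
dsn = build_dsn("immuta.example.com", 5432, "immuta", "analyst@example.com", "s3cret")

# Uncomment once the placeholders point at a real endpoint:
# rows = fetch_rows(dsn, "SELECT * FROM customers LIMIT %s", (10,))
```

The same connection URI can usually be pasted into a BI tool's PostgreSQL connector, so no Immuta-specific client library is needed on the consumer side.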
Unlike the other access patterns, the Immuta HDFS access pattern is not virtual. The value of HDFS processing lies in bringing the code to the data, which requires Immuta policies to be enforced in place on the data in the HDFS data nodes. Because of this, the Immuta HDFS layer can only act on data stored in HDFS. However, you can still build complex subscription and granular access policies on objects stored in HDFS and retain all the rich audit capabilities that the virtual Immuta layers provide.
The Immuta Virtual Filesystem is a mountable filesystem that, like the Query Engine, is virtual. The files represent connected data from across your organization in a directory hierarchy, yet all the files are empty. Once a file is read, it is hydrated dynamically with data from the underlying storage technology, with the policies enforced automatically. Unlike the Query Engine, the Virtual Filesystem caches results, which removes load from the remote storage technology and reduces latency for the client. That cache’s time to live is configurable per data source exposed in Immuta.
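The hydrate-on-read behavior with a per-data-source time to live can be pictured with a small read-through cache. This is a conceptual sketch only, not Immuta's implementation: the `HydratingFileCache` class, the `fake_fetch` function, the `sales` data source, and the file path are all hypothetical names introduced for illustration.

```python
import time


class HydratingFileCache:
    """Toy read-through cache with a time to live set per data source."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable(source, path) -> bytes; stands in for remote storage
        self._ttl = ttl_seconds    # dict mapping data source name -> TTL in seconds
        self._clock = clock        # injectable clock, so expiry can be demonstrated without sleeping
        self._entries = {}         # (source, path) -> (expires_at, content)

    def read(self, source, path):
        now = self._clock()
        key = (source, path)
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: the remote store is not contacted
        # Cache miss or expired entry: hydrate the file from the remote store.
        content = self._fetch(source, path)
        self._entries[key] = (now + self._ttl[source], content)
        return content


calls = []


def fake_fetch(source, path):
    """Hypothetical remote read; records each call so the load is visible."""
    calls.append((source, path))
    return ("rows for " + path).encode()


now = [0.0]  # manually advanced fake clock
cache = HydratingFileCache(fake_fetch, {"sales": 60}, clock=lambda: now[0])

cache.read("sales", "/sales/2020.csv")  # miss: hydrated from the remote store
cache.read("sales", "/sales/2020.csv")  # hit: served from the cache
now[0] = 61.0                           # advance past the 60-second TTL
cache.read("sales", "/sales/2020.csv")  # expired: hydrated again
```

A shorter time to live keeps results fresher at the cost of more remote reads; a longer one does the opposite, which is why setting it per data source is useful.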
Users can access subscribed data sources within their Spark jobs by using Spark SQL with the ImmutaContext class. All tables are virtual and are not populated until a query is materialized. When a query is materialized, data from metastore-backed data sources, such as Hive and Impala, is read from the underlying files stored in HDFS using standard Spark libraries. All other data source types are accessed through the Query Engine, which proxies the query to the native database technology. Policies for each data source are enforced automatically.
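The routing rule described above can be summarized in a few lines. This is an illustrative sketch of the decision only, not the ImmutaContext API: the `read_path` function and its return labels are hypothetical, and the set of metastore-backed types contains only the two examples named in the text.

```python
# Assumption: Hive and Impala are the only metastore-backed types listed here
# because they are the two examples the text names; a real deployment may have more.
METASTORE_BACKED = {"hive", "impala"}


def read_path(source_type):
    """Return which path a materialized Spark SQL query would take for a data source type."""
    if source_type.lower() in METASTORE_BACKED:
        # Files are read directly from HDFS with standard Spark libraries.
        return "spark-hdfs"
    # Everything else is proxied through the Query Engine to the native database.
    return "query-engine"


for source_type in ("Hive", "Impala", "PostgreSQL"):
    print(source_type, "->", read_path(source_type))
```

In both branches the policies attached to the data source still apply; the routing only changes where the bytes are read from.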