Accessing Data
Once a Databricks securable is registered in Immuta as a data source and you are subscribed to that data source, you must access that data through SQL:
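For example, a subscribed data source can be queried like any other table. A minimal PySpark sketch, assuming hypothetical database and table names (use the names shown for your data source in Immuta):

```python
# Minimal sketch: querying a subscribed Immuta data source through Spark SQL.
# "spark" is the SparkSession provided in Databricks notebooks;
# my_database.my_table is a hypothetical data source name.
df = spark.sql("SELECT * FROM my_database.my_table LIMIT 10")
df.show()
```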
See the sections below for more guidance on accessing data using Delta Lake, Spark direct file reads, and user impersonation.
Delta Lake
The Delta Lake API does not go through the normal Spark execution path, so Immuta's Spark extensions cannot enforce policies on calls made through it. To ensure that Immuta retains control over what a user can access, the Delta Lake API is blocked.
Spark SQL can be used instead to give the same functionality with all of Immuta's data protections. See the Delta API reference guide for a list of corresponding Spark SQL calls to use.
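As an illustration of the substitution (not an exhaustive mapping; the table and column names are hypothetical), a blocked Delta API update can be expressed in Spark SQL:

```python
# Blocked on Immuta-enabled clusters: the Delta Lake API path, e.g.
#   from delta.tables import DeltaTable
#   DeltaTable.forName(spark, "my_database.my_table") \
#       .update(condition="id = 1", set={"value": "'updated'"})
# Equivalent Spark SQL, which goes through Immuta's Spark extensions
# and therefore through policy enforcement:
spark.sql("UPDATE my_database.my_table SET value = 'updated' WHERE id = 1")
```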
Spark direct file reads
In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths that back Immuta data sources. As a result, users who prefer to interact with their data using file paths, or who have existing workflows built around file paths, can continue using those workflows without rewriting their queries for Immuta.
When reading from a path in Spark, the Immuta Databricks Spark plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.
Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). The example below sketches both methods of reading data.
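A minimal sketch (the paths and partition column names are hypothetical; use the path that backs your data source):

```python
# Read an individual parquet file from a sub-directory of the data source path:
df_file = spark.read.parquet("dbfs:/path/to/table/part-00000.parquet")

# Read partitioned data directly from a partition sub-directory:
df_partition = spark.read.parquet("dbfs:/path/to/table/partition_col=value")

# Or read the whole path and select the partition with a where predicate:
df_where = spark.read.parquet("dbfs:/path/to/table").where("partition_col = 'value'")
```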
Limitations
Direct file reads for Immuta data sources only apply to data sources created from tables, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
In Databricks, multiple input paths are supported as long as they belong to the same data source.
CSV-backed tables are not currently supported.
Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:
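A minimal sketch, assuming a hypothetical Delta table path and partition column:

```python
# Not supported: loading a delta partition from a partition sub-directory, e.g.
#   spark.read.format("delta").load("dbfs:/path/to/delta_table/partition_col=value")
# Instead, load the table root and filter with a where predicate:
df = (spark.read.format("delta")
      .load("dbfs:/path/to/delta_table")
      .where("partition_col = 'value'"))
```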
User impersonation
User impersonation allows Databricks users to query data as another Immuta user. To impersonate another user, see the User impersonation page.