In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to interact with their data using file paths or who have existing workflows revolving around file paths can continue to use these workflows without rewriting those queries for Immuta.
When reading from a path in Spark, the Immuta Databricks plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.
Spark Direct File Reads in EMR
EMR uses the same integration as Databricks, but you will need to use the immuta
SparkSession just as you normally would to interact with Immuta data sources.
For example, instead of spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
, use immuta.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
.
Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where
predicate). Use the tabs below to view examples of reading data using these methods.
Read Data from an Individual Parquet File
To read from an individual file, load a partition file from a sub-directory:
Read Partitioned Data from a Sub-Directory
To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:
Alternatively, load a parquet partition using a where
predicate:
Direct file reads in Spark are also supported for object-backed Immuta data sources (such as S3 or Azure Blob data sources) using the is3a file system
:
Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
On Databricks, multiple input paths are supported as long as they belong to the same data source. However, for EMR only a single input path is supported.
CSV-backed tables are not currently supported.
Loading a delta
partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where
predicate: