Spark Direct File Reads

In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to interact with their data using file paths or who have existing workflows revolving around file paths can continue to use these workflows without rewriting those queries for Immuta.

When reading from a path in Spark, the Immuta Databricks Spark plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.

Read Data

Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). Use the tabs below to view examples of reading data using these methods.

To read from an individual file, load a partition file from a sub-directory:

spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")

To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:

spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01")

Alternatively, load a parquet partition using a where predicate:

spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table").where("partition_column=01")

Limitations

Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
In Databricks, multiple input paths are supported as long as they belong to the same data source.
CSV-backed tables are not currently supported.

Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:

# Not recommended by Spark and not supported in Immuta
spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table/partition_column=01")

# Recommended by Spark and supported in Immuta.
spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table").where("partition_column=01")

Last updated 6 months ago

Was this helpful?

Read Data

To read from an individual file, load a partition file from a sub-directory:

spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")

To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:

spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01")

Alternatively, load a parquet partition using a where predicate:

spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table").where("partition_column=01")

Limitations

Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.

If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.

In Databricks, multiple input paths are supported as long as they belong to the same data source.

CSV-backed tables are not currently supported.

Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:

# Not recommended by Spark and not supported in Immuta
spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table/partition_column=01")

# Recommended by Spark and supported in Immuta.
spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table").where("partition_column=01")