Spark Direct File Reads

In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark for file paths. As a result, users who prefer to work with file paths, or whose existing workflows are built around them, can keep those workflows without rewriting their queries for Immuta.

When a user reads from a path in Spark, the Immuta Databricks plugin queries the Immuta Web Service to find Databricks data sources that the current user can access and that are backed by data from the specified path. If a matching data source is found, the query plan maps to that Immuta data source and follows the existing code paths for policy enforcement.
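The lookup described above can be pictured with a small toy model. This is only an illustrative sketch, not Immuta's implementation: the `REGISTERED_SOURCES` mapping, the data source names, and the `find_data_source` function are all hypothetical, and the real plugin resolves data sources through the Immuta Web Service rather than a local dictionary.

```python
from typing import Dict, Optional

# Hypothetical registry: data source name -> storage path that backs it.
REGISTERED_SOURCES: Dict[str, str] = {
    "sales": "s3://my_bucket/path/to/my_parquet_table",
    "events": "s3://my_bucket/path/to/my_delta_table",
}


def find_data_source(requested_path: str, sources: Dict[str, str]) -> Optional[str]:
    """Return the first data source whose backing path contains requested_path."""
    requested = requested_path.rstrip("/")
    for name, root in sources.items():
        root = root.rstrip("/")
        # Match the table root itself or anything nested beneath it.
        # Appending "/" avoids matching sibling paths such as "..._table_v2".
        if requested == root or requested.startswith(root + "/"):
            return name
    return None
```

In this toy model, a file path under `my_parquet_table` resolves to the `sales` entry, and a path that no entry covers resolves to `None`, which corresponds to the case where no Immuta data source is found for the path.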

Read Data

Users can read data from individual Parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). The examples below show each method.

Read Data from an Individual Parquet File

To read from an individual file, load a partition file from a sub-directory:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")

Read Partitioned Data from a Sub-Directory

To read partitioned data, load a Parquet partition directly from its sub-directory:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01")

Alternatively, load the table path and select the Parquet partition using a where predicate:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table").where("partition_column=01")

Limitations

Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
In Databricks, multiple input paths are supported as long as they belong to the same data source.
CSV-backed tables are not currently supported.
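
As a rough sketch of the multiple-input-paths case, the helper below builds several partition paths under one table root and passes them to a single read. The function names `partition_paths` and `read_partitions` are hypothetical, the `spark` argument is assumed to be an active SparkSession, and whether the paths belong to the same Immuta data source is decided by Immuta, not by this code.

```python
from typing import List


def partition_paths(table_root: str, partition_dirs: List[str]) -> List[str]:
    """Join a table root with each partition sub-directory name."""
    root = table_root.rstrip("/")
    return [f"{root}/{d.strip('/')}" for d in partition_dirs]


def read_partitions(spark, table_root: str, partition_dirs: List[str]):
    """Load several partition sub-directories of one Parquet table in one read."""
    paths = partition_paths(table_root, partition_dirs)
    # PySpark's DataFrameReader.load accepts a list of paths.
    return spark.read.format("parquet").load(paths)
```

For example, `read_partitions(spark, "s3://my_bucket/path/to/my_parquet_table", ["partition_column=01", "partition_column=02"])` would request both partitions of the same table in one call.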

Loading a Delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, load the table path and use a where predicate:

# Not recommended by Spark and not supported in Immuta
spark.read.format("delta").load("s3://my_bucket/path/to/my_delta_table/partition_column=01")

# Recommended by Spark and supported in Immuta
spark.read.format("delta").load("s3://my_bucket/path/to/my_delta_table").where("partition_column=01")
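
Existing workflows that pass partition sub-directory paths could be adapted to the supported form with a small helper along these lines. This is a minimal sketch under stated assumptions: `split_partition_path` and `read_delta_partition` are hypothetical names, `spark` is assumed to be an active SparkSession, and the predicates are passed through exactly as they appear in the path, so string-typed partition values may need quoting before use.

```python
import re
from typing import List, Tuple

# A path segment of the form column=value, as used in partition directories.
_PARTITION_SEGMENT = re.compile(r"^[^/=]+=[^/=]+$")


def split_partition_path(path: str) -> Tuple[str, List[str]]:
    """Split trailing column=value segments off a path.

    Returns the table root and the partition predicates, in path order.
    """
    segments = path.rstrip("/").split("/")
    predicates: List[str] = []
    while segments and _PARTITION_SEGMENT.match(segments[-1]):
        predicates.insert(0, segments.pop())
    return "/".join(segments), predicates


def read_delta_partition(spark, path: str):
    """Read a Delta table from its root, filtering by any partitions in the path."""
    root, predicates = split_partition_path(path)
    df = spark.read.format("delta").load(root)
    if predicates:
        # Combine nested partitions (a=1/b=2) into a single where predicate.
        df = df.where(" AND ".join(predicates))
    return df
```

With this sketch, a path ending in `my_delta_table/partition_column=01` is read as the table root `my_delta_table` filtered by `partition_column=01`, which matches the recommended pattern above.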