
S3 Access in Spark and Databricks

Audience: Data Owners, Data Users, and System Administrators

Content Summary: The S3 access pattern lets you take a standard Amazon S3 client library (such as Boto3 in Python) and point it at Immuta instead of S3 to access your data. The Databricks and Spark integration uses a file system (is3a) that retrieves your API key and communicates with Immuta as if it were talking directly to S3, allowing users to access object-backed data sources (S3, Azure Blob storage, and persisted data sources) through Immuta's s3p endpoint.
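As a minimal sketch of the client-side pattern, the snippet below points Boto3 at Immuta rather than Amazon S3. The endpoint URL, the way the API key maps onto the S3 credential fields, and the bucket and prefix names are all illustrative assumptions, not documented Immuta values; check your deployment for the actual s3p endpoint and credential handling.

```python
import boto3

# Illustrative placeholders -- substitute your own Immuta hostname and API key.
IMMUTA_S3P_ENDPOINT = "https://immuta.example.com/s3p"  # assumed endpoint path
IMMUTA_API_KEY = "your-immuta-api-key"

# Point a standard S3 client at Immuta instead of Amazon S3. How the API key
# maps onto the S3 credential fields is deployment-specific; this only shows
# the general pattern of redirecting an S3 SDK to Immuta's s3p endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url=IMMUTA_S3P_ENDPOINT,
    aws_access_key_id=IMMUTA_API_KEY,
    aws_secret_access_key=IMMUTA_API_KEY,
)

# List objects exposed by an Immuta data source (names are placeholders).
response = s3.list_objects_v2(Bucket="immuta", Prefix="my_persisted_source/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```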

Note

This mechanism never goes to S3 directly. To access S3 directly, you must expose an S3-backed table or view in the Databricks Metastore as a source, or use native workspaces or scratch paths.
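For the direct route mentioned above, a hedged sketch of registering an S3-backed table in the Databricks Metastore might look like the following; the database, table, bucket, and path names are placeholders, and the cluster is assumed to already have its own credentials for reaching the bucket.

```python
# Illustrative only: register an S3-backed external table in the Databricks
# Metastore, which can then be exposed as a source or queried directly.
# The database, table, bucket, and path names below are placeholders, and the
# cluster is assumed to already have credentials for the bucket (for example,
# via an instance profile).
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING PARQUET
    LOCATION 's3a://my-company-bucket/events/'
""")
```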

Accessing Object-Backed Data Sources in Spark or Databricks

No Configuration Changes

No configuration changes are necessary to use the is3a file system with Spark or Databricks.

  1. Create a Persisted data source.

  2. Select Upload Files from the dropdown menu on the data source Overview page.

    (Screenshot: Upload Files)

  3. Optionally, add folders in the Directory field of the Upload modal.

    (Screenshot: Basic Information)

  4. Click Add Files to Upload to upload your files, and then click Save.

    (Screenshot: Uploaded Files)

  5. In Databricks or Spark, write queries that access this data by referencing the S3 path (shown in the Basic Information section of the Upload Files modal above), but with the URL scheme is3a, as in the sketch below.

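The following is a minimal sketch, assuming the Basic Information section showed a path such as s3://immuta/my_persisted_source/my_folder/ (a placeholder); only the scheme changes to is3a.

```python
# Reference the same path shown in the Basic Information section, but swap the
# scheme to is3a so the request goes through Immuta rather than directly to S3.
# The bucket, folder, and file format below are placeholders for illustration.
df = spark.read.csv("is3a://immuta/my_persisted_source/my_folder/", header=True)
df.show()
```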

Limitations

This access pattern is only available for object-backed data sources. Consequently, all the standard limitations that apply to object-backed data sources in Immuta apply here.