You are viewing documentation for Immuta version 2020.2.

For the latest version, view our documentation for Immuta SaaS or the latest self-hosted version.

S3 Data Source Creation Tutorial

Audience: Data Owners

Content Summary: Amazon S3 differs from other object-backed storage technologies in that users can query structured data stored within it using Spark.

This guide outlines how to configure an Amazon S3 data source in Immuta. To explore data source creation guides for other storage technologies, see the Object-backed and Query-backed guides.

Step 1: Select Authentication Method

You must choose an authentication method before connecting an S3 data source. The following methods are supported:

  • AWS Access Key: Connect to a private S3 bucket using an access key pair.
  • AWS Instance Role: Connect to a private S3 bucket with no credentials, instead leveraging an IAM Role that has been assigned to the Immuta EC2 instance. The Data Owner must possess the CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE permission to proceed. By default, this option is disabled. For instructions on enabling this option, please contact your Immuta Support Professional.
  • No Authentication: Connect to a public S3 bucket with no credentials.

Step 2: Enter Connection Parameters

  1. Fill out the following fields in the Connection Information window:

    • AWS Access Key Id: Your AWS public access key.
    • AWS Secret Access Key: Your AWS secret access key.
    • Assume AWS IAM Role: Enable only if you want to connect using an IAM Role's access as opposed to an individual user's access.
    • AWS IAM Role ARN: If Assume AWS IAM Role is enabled, this is the ARN of the IAM Role that will be used (log in to the AWS Console or contact your AWS Admin for this information).
    • AWS Region: The region that contains the S3 bucket that you wish to expose.
  2. Click Verify Credentials.
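
The AWS IAM Role ARN field expects a full ARN rather than a bare role name. As a minimal sketch, assuming the standard IAM role ARN layout (arn:<partition>:iam::<12-digit account ID>:role/<optional path/><role name>), a quick format check before pasting the value might look like the following. The helper name looks_like_role_arn, the role name, and the account ID are illustrative and not part of Immuta.

```python
import re

# Standard IAM role ARN layout; the partition may be aws, aws-cn, or aws-us-gov.
ROLE_ARN_PATTERN = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+"
)

def looks_like_role_arn(value: str) -> bool:
    """Return True if value has the shape of an IAM role ARN (hypothetical helper)."""
    return ROLE_ARN_PATTERN.fullmatch(value) is not None

# A full ARN passes the check; a bare role name does not.
print(looks_like_role_arn("arn:aws:iam::123456789012:role/immuta-s3-reader"))
print(looks_like_role_arn("immuta-s3-reader"))
```

This only checks the shape of the string. Whether the role exists and can be assumed is still confirmed by Verify Credentials.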

IAM Policy Elements

To connect to an S3 bucket, the IAM role you use must have several policy elements. These elements are included in a Statement array and must be present for every bucket you wish to connect with that IAM Role.

  • s3:ListBucket on the bucket you want to expose.
  • s3:GetObject on /* of the bucket you want to expose.
  • s3:GetObjectTagging on /* of the bucket you want to expose.

Optionally, you can give the element s3:ListAllMyBuckets on arn:aws:s3:::*. This allows Immuta to present you with a list of buckets to choose from at data source creation time. Without this, you'll have to type in your bucket name.

To update metadata from S3 as changes occur, additional policy elements are required; these are listed in Step 3.

Below is an example policy that can be used with the buckets test-bucket-1 and test-bucket-2. You still need to create one data source per bucket, but this single policy (applied to your IAM role) covers both buckets. The policy does not include the optional s3:ListAllMyBuckets element.

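    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::test-bucket-1",
                    "arn:aws:s3:::test-bucket-2"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::test-bucket-1/*",
                    "arn:aws:s3:::test-bucket-2/*"
                ]
            }
        ]
    }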
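
If you manage many buckets, the policy elements listed above can also be generated rather than written by hand. The following is a minimal Python sketch under that assumption: build_s3_read_policy is a hypothetical helper, not an Immuta or AWS API, and it emits only the three required elements (s3:ListBucket on each bucket, s3:GetObject and s3:GetObjectTagging on /* of each bucket).

```python
import json

def build_s3_read_policy(bucket_names):
    """Build an IAM policy document with the required read elements (hypothetical helper)."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Bucket-level permission: list the contents of each bucket.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
            # Object-level permissions: read objects and their tags.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectTagging"],
                "Resource": object_arns,
            },
        ],
    }

policy = build_s3_read_policy(["test-bucket-1", "test-bucket-2"])
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console as the role's policy. Add s3:ListAllMyBuckets separately if you want Immuta to list buckets for you.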

Step 3: Configure S3

  1. Complete the following fields in the S3 Configuration window:

    • S3 Bucket: the bucket you want your data source to reference
    • S3 Prefix: the prefix within the specified bucket. Only blobs from within the specified prefix will be ingested and returned.
    • Additional Options:
      • Ingest Metadata Automatically: when enabled, Immuta automatically monitors S3 for updates to the data. If not enabled, there will only be a one-time metadata pull from S3, and no additions or deletions will be captured after the pull completes. That means even if you change S3 attributes, those changes will not be reflected in the data source policy. This feature creates SQS queues containing bucket updates. Note that even if you do not enable this option, you can still direct Immuta to re-crawl the data manually whenever you like.

  2. If you select Ingest Metadata Automatically, enter the name of the SQS queue in the SQS Queue field.

    These AWS IAM permissions are required:

    • sqs:DeleteMessageBatch: a general permission on SQS.
    • sqs:ReceiveMessage: a general permission on SQS.
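
To illustrate how the S3 Prefix field from this step narrows a data source, the sketch below filters a list of object keys the way a prefix match does. It is a plain string comparison in Python, not Immuta's ingestion code; the sample keys and the function name keys_under_prefix are made up for illustration.

```python
def keys_under_prefix(keys, prefix):
    """Return only the keys that start with prefix (plain string match)."""
    return [key for key in keys if key.startswith(prefix)]

# Hypothetical object keys in a bucket.
sample_keys = [
    "logs/2020/01/events.csv",
    "logs/2020/02/events.csv",
    "images/cat.png",
]

# Only the two keys under logs/2020/ are kept.
print(keys_under_prefix(sample_keys, "logs/2020/"))
```

Because S3 prefixes are plain string prefixes rather than directories, a trailing slash matters: "logs/2020" would also match a key such as "logs/2020-archive/old.csv".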

Advanced Options

Option 1: Select Data Format

Object-backed data sources can hold blobs of any format (images, videos, etc.), but many blobs share a common structured format. If your blobs are comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format allows you to create masking policies for the data source.

Option 2: Select Event Time

By default, Immuta will use the write time of the blob for the event time. However, write time is not always an accurate way to represent event time. Should you want to provide a customized event time, you can do that via blob attributes. In S3 these are stored as metadata. You can specify the key of the metadata/xattr that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.
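
Before writing a custom event time into blob metadata, it can help to confirm that the value parses as ISO 8601. The following Python sketch does that with the standard library only. The function name parse_event_time is hypothetical, and requiring a UTC offset is an assumption made here for safety rather than a documented Immuta rule.

```python
from datetime import datetime, timezone

def parse_event_time(raw_value: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalize it to UTC (hypothetical helper)."""
    parsed = datetime.fromisoformat(raw_value)
    # Assumption: reject timestamps without an offset, since they are ambiguous.
    if parsed.tzinfo is None:
        raise ValueError("event time must include a UTC offset")
    return parsed.astimezone(timezone.utc)

event_time = parse_event_time("2015-11-15T05:13:32+00:00")
print(event_time.isoformat())  # 2015-11-15T05:13:32+00:00
```

A value that raises ValueError here is unlikely to be usable as an event time and is worth correcting before ingestion.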

Option 3: Configure Tags and Features

  • How would you like to tag your data?: This step is optional and provides the ability to "auto-tag" based on attributes of the data rather than the manual entry you do in step 3. You can pull the tags from the folder name or metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API.

  • Note that when using metadata in S3, Immuta assumes the x-amz-meta- namespace prefix, so you do not need to include that in your keys.
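
S3 returns user-defined metadata as HTTP headers prefixed with x-amz-meta-, while the keys you enter in Immuta omit that prefix. The sketch below shows the relationship in Python; the header values and the function name user_metadata_keys are illustrative, and this is not Immuta's own code.

```python
AMZ_META_PREFIX = "x-amz-meta-"

def user_metadata_keys(headers):
    """Map S3 response headers to bare user-metadata keys (prefix removed)."""
    return {
        name[len(AMZ_META_PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(AMZ_META_PREFIX)
    }

# Hypothetical headers from an S3 object response.
headers = {
    "Content-Type": "text/csv",
    "x-amz-meta-event-time": "2015-11-15T05:13:32+00:00",
    "x-amz-meta-department": "finance",
}
print(user_metadata_keys(headers))
```

In this example the keys to enter in Immuta would be event-time and department, not the full header names.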

To learn more about S3 tags and metadata attributes see the official AWS documentation: