S3 Data Source Creation Tutorial

Audience: Data Owners

Content Summary: Amazon S3 differs from other object-backed storage technologies in that users can query structured data stored within it using Spark.

This guide outlines how to configure an Amazon S3 data source in Immuta. To explore data source creation guides for other storage technologies, see the Object-backed and Query-backed guides.

Step 1: Select Authentication Method

You must choose an authentication method before connecting an S3 data source. The following methods are supported:

  • AWS Access Key: Connect to a private S3 bucket using an access key pair.
  • AWS Instance Role: Connect to a private S3 bucket with no credentials, instead leveraging an IAM Role that has been assigned to the Immuta EC2 instance. The Data Owner must possess the CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE permission to proceed. By default, this option is disabled. For instructions on enabling this option, please contact your Immuta Support Professional.
  • No Authentication: Connect to a public S3 bucket with no credentials.

S3 Connection Information

Step 2: Enter Connection Parameters

  1. Fill out the following fields in the Connection Information window:

    • AWS Access Key Id: Your AWS public access key.
    • AWS Secret Access Key: Your AWS secret access key.
    • Assume AWS IAM Role: Enable only if you want to connect using an IAM Role's access as opposed to an individual user's access.
    • AWS IAM Role ARN: If Assume AWS IAM Role is enabled, this is the ARN of the IAM Role that will be used (log in to the AWS Console or contact your AWS Admin for this information).
    • AWS Region: The region that contains the S3 bucket that you wish to expose.
  2. Click Verify Credentials.
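
Before clicking Verify Credentials, it can help to sanity-check the values you are about to enter. The following is a minimal sketch, not part of Immuta: the `S3ConnectionParams` class, the `validate` function, and the ARN pattern are hypothetical names and assumptions introduced here for illustration. It only checks that required fields are filled in and that the role value looks like a full ARN rather than a bare role name.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed shape of an IAM role ARN: arn:<partition>:iam::<12-digit account id>:role/<role name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


@dataclass
class S3ConnectionParams:
    """Hypothetical container mirroring the Connection Information fields."""
    access_key_id: str
    secret_access_key: str
    region: str
    assume_iam_role: bool = False
    iam_role_arn: Optional[str] = None


def validate(params: S3ConnectionParams) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = []
    if not params.access_key_id:
        problems.append("AWS Access Key Id is required.")
    if not params.secret_access_key:
        problems.append("AWS Secret Access Key is required.")
    if not params.region:
        problems.append("AWS Region is required.")
    if params.assume_iam_role:
        # The role field expects a full ARN, not just the role's name.
        if not params.iam_role_arn or not ROLE_ARN_PATTERN.match(params.iam_role_arn):
            problems.append("AWS IAM Role ARN must be a full role ARN.")
    return problems
```

For example, passing `iam_role_arn="immuta-s3-reader"` (a role name rather than an ARN) with `assume_iam_role=True` produces one problem, while `arn:aws:iam::123456789012:role/immuta-s3-reader` passes.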

IAM Policy Elements

To connect to an S3 bucket, the IAM role you use must have several policy elements. These elements are included in a Statement array and must be present for every bucket you wish to connect with that IAM Role.

  • s3:ListBucket on the bucket you want to expose.
  • s3:GetObject on /* of the bucket you want to expose.
  • s3:GetObjectTagging on /* of the bucket you want to expose.

Optionally, you can grant s3:ListAllMyBuckets on arn:aws:s3:::*. This allows Immuta to present you with a list of buckets to choose from when you create the data source. Without it, you'll have to type in your bucket name.

To update metadata from S3 as changes occur, additional policy elements are required (see Step 3).

Below is an example policy that can be used with the buckets test-bucket-1 and test-bucket-2. Note that you would have to create one data source per bucket, but this policy (applied to your IAM role) covers both buckets. This policy does not include the optional s3:ListAllMyBuckets element.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                "arn:aws:s3:::test-bucket-1",
                "arn:aws:s3:::test-bucket-2"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::test-bucket-1/*",
                "arn:aws:s3:::test-bucket-2/*"
            ]
        }
    ]
}
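
If you manage several buckets, a policy like the one above can be generated rather than written by hand. The sketch below is illustrative only: `build_s3_read_policy` is a hypothetical helper, and it emits just the three required actions listed earlier (plus the optional s3:ListAllMyBuckets statement on request), so its output is not identical to the example policy.

```python
import json
from typing import Dict, List


def build_s3_read_policy(buckets: List[str], allow_bucket_listing: bool = False) -> Dict:
    """Build an IAM policy document granting the read access described above."""
    statements = [
        {
            # Bucket-level permission: applies to the bucket ARN itself.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
        },
        {
            # Object-level permissions: apply to /* under each bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectTagging"],
            "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
        },
    ]
    if allow_bucket_listing:
        # Optional: lets the bucket picker list buckets instead of requiring a typed name.
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": ["arn:aws:s3:::*"],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(build_s3_read_policy(["test-bucket-1", "test-bucket-2"]), indent=4))
```

Serializing with `json.dumps` guarantees valid JSON, which avoids hand-editing mistakes such as trailing commas.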

Step 3: Configure S3

  1. Complete the following fields in the S3 Configuration window:

    • S3 Bucket: the bucket you want your data source to reference
    • Additional Options:
      • Ingest Metadata Automatically: When enabled, Immuta automatically monitors S3 for updates to the data. If not enabled, there will only be a one-time metadata pull from S3, and no additions or deletions will be captured after the pull completes. This means that even if you change S3 attributes, those changes will not be reflected in the data source policy. This feature creates an SQS queue containing bucket updates. AWS allows only one such queue per bucket, so if you already have a queue attached to notifications for the bucket, you will be unable to use this feature. Note that even if you do not enable this option, you can still direct Immuta to re-crawl the data manually whenever you would like.

    S3 Bucket Configuration

If you select Ingest Metadata Automatically, these additional AWS IAM policy elements are required:

  • s3:GetBucketNotification: on all of the buckets you want to tie to an SQS queue.
  • s3:PutBucketNotification: on all of the buckets you want to tie to an SQS queue.
  • sqs:CreateQueue: a general permission on SQS.
  • sqs:DeleteMessageBatch: a general permission on SQS.
  • sqs:DeleteQueue: a general permission on SQS.
  • sqs:GetQueueAttributes: a general permission on SQS.
  • sqs:ReceiveMessage: a general permission on SQS.
  • sqs:SetQueueAttributes: a general permission on SQS.
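
The additional elements above can be expressed as two extra policy statements. This is a sketch under stated assumptions: `build_ingest_statements` is a hypothetical helper, and using "*" as the SQS resource is an assumption based on the "general permission on SQS" wording; your AWS Admin may prefer a narrower resource.

```python
from typing import Dict, List

# SQS actions listed above as general permissions on SQS.
SQS_ACTIONS = [
    "sqs:CreateQueue",
    "sqs:DeleteMessageBatch",
    "sqs:DeleteQueue",
    "sqs:GetQueueAttributes",
    "sqs:ReceiveMessage",
    "sqs:SetQueueAttributes",
]


def build_ingest_statements(buckets: List[str], sqs_resource: str = "*") -> List[Dict]:
    """Return the extra statements needed for Ingest Metadata Automatically."""
    return [
        {
            # Notification permissions are scoped to each bucket tied to a queue.
            "Effect": "Allow",
            "Action": ["s3:GetBucketNotification", "s3:PutBucketNotification"],
            "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
        },
        {
            # Assumption: "*" stands in for "a general permission on SQS".
            "Effect": "Allow",
            "Action": list(SQS_ACTIONS),
            "Resource": [sqs_resource],
        },
    ]
```

These statements would be appended to the Statement array of the policy already attached to your IAM role.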

Advanced Options

Option 1: Select Data Format

Data Format

Object-backed data sources can hold blobs of any format (images, videos, etc.), but many share common structured formats. If your blobs are comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format allows you to create masking policies for the data source.
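
To see why the format matters for masking, consider the sketch below. It is not how Immuta implements masking; `mask_field` and its format names ("csv", "tsv", "json") are hypothetical and exist only to show that a field can be located and replaced once the blob's structure is known.

```python
import csv
import io
import json

DELIMITERS = {"csv": ",", "tsv": "\t"}


def mask_field(blob: str, data_format: str, field: str, mask: str = "REDACTED") -> str:
    """Replace the value of `field` in a structured blob with `mask`."""
    if data_format == "json":
        record = json.loads(blob)
        if field in record:
            record[field] = mask
        return json.dumps(record)
    if data_format not in DELIMITERS:
        # Without a known structure there is no way to find the field.
        raise ValueError(f"unsupported data format: {data_format}")
    delimiter = DELIMITERS[data_format]
    reader = csv.DictReader(io.StringIO(blob), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if field in row:
            row[field] = mask
        writer.writerow(row)
    return out.getvalue()
```

An image or video blob offers no such field boundaries, which is why masking is only available when a structured format is specified.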

Option 2: Select Event Time

By default, Immuta will use the write time of the blob for the event time. However, write time is not always an accurate way to represent event time. Should you want to provide a customized event time, you can do that via blob attributes. In S3 these are stored as metadata. You can specify the key of the metadata/xattr that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.
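
The event time selection described above can be sketched as follows. `resolve_event_time` is a hypothetical function, and falling back to the write time when the key is missing or unparseable is an assumption for illustration, not documented Immuta behavior.

```python
from datetime import datetime, timezone
from typing import Mapping


def resolve_event_time(
    metadata: Mapping[str, str], write_time: datetime, event_time_key: str
) -> datetime:
    """Prefer an ISO 8601 date stored under `event_time_key`; otherwise use the write time."""
    raw = metadata.get(event_time_key)
    if raw is None:
        return write_time
    try:
        # Parses values such as 2015-11-15T05:13:32+00:00.
        return datetime.fromisoformat(raw)
    except ValueError:
        # Assumption: an unparseable value falls back to the write time.
        return write_time


if __name__ == "__main__":
    written = datetime(2016, 1, 1, tzinfo=timezone.utc)
    print(resolve_event_time({"event-date": "2015-11-15T05:13:32+00:00"}, written, "event-date"))
```

Here "event-date" is an example metadata key; use whichever key your blobs actually carry.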

S3 Event Time

Option 3: Configure Tags and Features

  • How would you like to tag your data?: This step is optional and provides the ability to "auto-tag" based on attributes of the data rather than the manual entry you do in step 3. You can pull the tags from the folder name or metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API.

  • Select any attributes you would like to extract as features: Object-backed data sources are not accessible via the Immuta Query Engine. However, you can pass Immuta features about the data (essentially extra metadata) and that metadata can be queried through the Immuta Query Engine. Those features can be populated just like the tags (via folder name or metadata/xattrs).

  • Note that when using metadata in S3, Immuta assumes the x-amz-meta- namespace prefix, so you do not need to include that in your keys.
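
The two sources for tags and features mentioned above, folder names and metadata, can be illustrated with a short sketch. Both helpers are hypothetical and are not Immuta's implementation; the only fact relied on is that S3 sends user-defined metadata as headers prefixed with x-amz-meta-.

```python
from typing import Dict, List, Mapping

# S3 prefixes user-defined metadata headers with this namespace.
META_PREFIX = "x-amz-meta-"


def strip_meta_prefix(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only user-defined metadata and drop the namespace prefix from each key."""
    return {
        key[len(META_PREFIX):]: value
        for key, value in headers.items()
        if key.lower().startswith(META_PREFIX)
    }


def folder_tags(object_key: str) -> List[str]:
    """Derive candidate tags from the folder portion of an object key."""
    folders = object_key.split("/")[:-1]  # the last segment is the object name
    return [name for name in folders if name]
```

For an object stored at sales/2015/report.csv with a header x-amz-meta-department, the candidate tags would be "sales" and "2015", and the metadata key you enter would simply be "department".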

    S3 Tags and Features

To learn more about S3 tags and metadata attributes, see the official AWS documentation:

* [Object Tagging](http://docs.aws.amazon.com/AmazonS3/latest/dev/object-tagging.html)
* [Object Metadata](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-metadata)