S3 Data Source Creation Tutorial
Audience: Data Owners
Content Summary: Amazon S3 differs from other object-backed storage technologies in that users can query structured data stored within it using Spark.
This guide outlines how to configure an Amazon S3 data source in Immuta. To explore data source creation guides for other storage technologies, see the Object-backed and Query-backed guides.
Step 1: Select Authentication Method
You must choose an authentication method before connecting an S3 data source. The following methods are supported:
AWS Access Key
: Connect to a private S3 bucket using an access key pair.
AWS Instance Role
: Connect to a private S3 bucket with no credentials, instead leveraging an IAM Role that has been assigned to the Immuta EC2 instance. The Data Owner must possess the CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE permission to proceed. By default, this option is disabled. For instructions on enabling this option, please contact your Immuta Support Professional.
No Authentication
: Connect to a public S3 bucket with no credentials.
Step 2: Enter Connection Parameters
- Fill out the following fields in the Connection Information window:
  - AWS Access Key Id: Your AWS public access key.
  - AWS Secret Access Key: Your AWS secret access key.
  - Assume AWS IAM Role: Enable only if you want to connect using an IAM Role's access as opposed to an individual user's access.
  - AWS IAM Role ARN: If Assume AWS IAM Role is enabled, this is the ARN of the IAM Role that will be used (log in to the AWS Console or contact your AWS Admin for this information).
  - AWS Region: The region that contains the S3 bucket that you wish to expose.
- Click Verify Credentials.
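Before pasting a value into the AWS IAM Role ARN field, it can help to confirm that it is a full role ARN rather than a bare role name. The following is a minimal sketch, not part of Immuta; the helper name `parse_role_arn`, the sample account ID, and the sample role name are hypothetical, and the pattern assumes the standard `arn:<partition>:iam::<account-id>:role/<name>` layout.

```python
import re

# Assumed layout: arn:<partition>:iam::<12-digit account id>:role/<optional path/><role name>
ROLE_ARN_PATTERN = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/([\w+=,.@/-]+)$"
)


def parse_role_arn(arn):
    """Return (partition, account_id, role_path_and_name), or None if the ARN is malformed."""
    match = ROLE_ARN_PATTERN.match(arn.strip())
    if match is None:
        return None
    return match.groups()


# Hypothetical values for illustration only.
print(parse_role_arn("arn:aws:iam::123456789012:role/immuta-s3-reader"))
print(parse_role_arn("immuta-s3-reader"))  # a bare role name is not an ARN
```

This only checks the shape of the string. Whether the role exists and can be assumed is still determined when you click Verify Credentials.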
IAM Policy Elements
To connect to an S3 bucket, the IAM role you use must have several policy elements. These elements are included in a Statement array and must be present for every bucket you wish to connect with that IAM Role.
- s3:ListBucket on the bucket you want to expose.
- s3:getObject on /* of the bucket you want to expose.
- s3:getObjectTagging on /* of the bucket you want to expose.
Optionally, you can give the element s3:ListAllMyBuckets on arn:aws:s3:::*. This allows Immuta to present you with a list of buckets to choose from at data source creation time. Without this, you'll have to type in your bucket name.
In order to update metadata from S3 as changes occur, additional policy elements are required.
Below is an example policy that can be used with the buckets test-bucket-1 and test-bucket-2. Note that you must create a data source per bucket, but this policy (applied to your IAM role) is applicable to both buckets. This policy does not include the optional s3:ListAllMyBuckets element.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::test-bucket-1",
        "arn:aws:s3:::test-bucket-2"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:getObject",
        "s3:getObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::test-bucket-1/*",
        "arn:aws:s3:::test-bucket-2/*"
      ]
    }
  ]
}
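If you manage many buckets, a policy of this shape can be generated rather than written by hand. The sketch below is illustrative only: the function name `build_s3_read_policy` and its `allow_bucket_listing` flag are hypothetical, and it emits just the three required elements listed above plus, optionally, s3:ListAllMyBuckets.

```python
import json


def build_s3_read_policy(bucket_names, allow_bucket_listing=False):
    """Build an IAM policy document covering every bucket in bucket_names."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    statements = [
        # Bucket-level element: applies to the bucket ARN itself.
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
        # Object-level elements: apply to /* within each bucket.
        {
            "Effect": "Allow",
            "Action": ["s3:getObject", "s3:getObjectTagging"],
            "Resource": object_arns,
        },
    ]
    if allow_bucket_listing:
        # Optional element that lets the bucket list be shown at creation time.
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": ["arn:aws:s3:::*"],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_s3_read_policy(["test-bucket-1", "test-bucket-2"])
print(json.dumps(policy, indent=2))
```

Serializing with `json.dumps` also avoids hand-editing mistakes such as trailing commas, which make a policy document invalid JSON.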
Step 3: Configure S3
- Complete the following fields in the S3 Configuration window:
  - S3 Bucket: the bucket you want your data source to reference
  - S3 Prefix: the prefix within the specified bucket. Only blobs from within the specified prefix will be ingested and returned.
  - Additional Options:
    - Ingest Metadata Automatically: when enabled, this automatically monitors S3 for new updates to the data. If not enabled, there will only be a one-time metadata pull from S3. No additions or deletions will be captured after the pull completes. That means even if you change S3 attributes, those changes will not be reflected in the data source policy. This feature creates SQS queues containing bucket updates. Note that even if you do not enable this option, you can still direct Immuta to re-crawl the data manually whenever you would like.
- If you select Ingest Metadata Automatically, enter the name of the SQS queue in the SQS Queue field. These AWS IAM permissions are required:
  - sqs:DeleteMessageBatch: a general permission on SQS.
  - sqs:ReceiveMessage: a general permission on SQS.
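The two SQS permissions above can be expressed as one more entry in the policy's Statement array. This is a rough sketch under assumptions: the helper name `build_sqs_statement` and the queue ARN are hypothetical, and this guide does not state which resource scope the permissions need, so the resource is left as a parameter for you to confirm with your AWS Admin.

```python
import json


def build_sqs_statement(resource):
    """Build a Statement entry granting the two SQS permissions named above."""
    return {
        "Effect": "Allow",
        "Action": ["sqs:DeleteMessageBatch", "sqs:ReceiveMessage"],
        # Assumption: the required scope is not specified in this guide.
        "Resource": [resource],
    }


# Hypothetical queue ARN for illustration only.
statement = build_sqs_statement("arn:aws:sqs:us-east-1:123456789012:immuta-bucket-updates")
print(json.dumps(statement, indent=2))
```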
Advanced Options
Set Refresh Interval
This setting determines how frequently Immuta re-indexes the data in Amazon S3. If this data source is configured to point to a bucket that will not change, you can safely leave this set to 0. However, if data flows into this bucket constantly, you should decide how often updates are actually necessary for the consumer, because once the interval threshold is met, a re-indexing of the data source will be triggered.
Select Data Format
While object-backed data sources can be in any format (images, videos, etc.), some will have common formats. If your blobs are comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.
Select Event Time
By default, Immuta will use the write time of the blob for the event time. However, write time is not always an accurate way to represent event time. Should you want to provide a customized event time, you can do that via blob attributes. In S3 these are stored as metadata. You can specify the key of the metadata/xattr that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.
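To check that a metadata value is a usable ISO 8601 timestamp before relying on it, you can parse it locally. The sketch below illustrates the idea and is not Immuta's implementation: the function name `resolve_event_time`, the metadata key `event-time`, and the fallback to write time on a missing or malformed value are all assumptions.

```python
from datetime import datetime, timezone


def resolve_event_time(metadata, event_time_key, write_time):
    """Return the event time from blob metadata, falling back to the write time."""
    raw_value = metadata.get(event_time_key)
    if raw_value is None:
        return write_time
    try:
        # Parses values such as 2015-11-15T05:13:32+00:00.
        return datetime.fromisoformat(raw_value)
    except ValueError:
        # Assumed behavior: ignore malformed values and keep the write time.
        return write_time


write_time = datetime(2015, 11, 16, 9, 0, 0, tzinfo=timezone.utc)
metadata = {"event-time": "2015-11-15T05:13:32+00:00"}  # hypothetical key
print(resolve_event_time(metadata, "event-time", write_time))
print(resolve_event_time({}, "event-time", write_time))
```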
Configure Tags
How would you like to tag your data?
: This step is optional and provides the ability to "auto-tag" based on attributes of the data rather than the manual entry you do in step 3. You can pull the tags from the folder name or metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API. Additionally, this metadata can be used to drive policies.
To learn more about S3 tags and metadata attributes, see the official AWS documentation.
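To preview what folder-based tagging might yield for a given object key, the folder names can be derived locally. This is a hypothetical sketch of the idea only; the function name `tags_from_folders` and the choice to drop the configured prefix are assumptions, and Immuta's actual tagging logic may differ.

```python
def tags_from_folders(object_key, prefix=""):
    """Return the folder names in an S3 object key, excluding the file name."""
    if prefix and object_key.startswith(prefix):
        # Assumption: folders that are part of the configured prefix are not tags.
        object_key = object_key[len(prefix):]
    parts = object_key.split("/")
    # The last part is the object name; empty parts come from doubled slashes.
    return [part for part in parts[:-1] if part]


# Hypothetical key and prefix for illustration only.
print(tags_from_folders("sales/2015/emea/report.csv", prefix="sales/"))
```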
Add Data Source Tags
Adding tags to a data source allows searches by tag and the application of Global Policies based on those tags. Tags that begin with "Discovered" are automatically applied by Immuta at data source creation if Sensitive Data Detection is enabled.