Amazon Athena Data Source Considerations
Audience: Data Owners
Content Summary: This guide details considerations regarding Amazon Athena data sources. Be sure to understand the general concepts behind query-backed data sources prior to reading this page.
Athena differs from most other query-backed data sources in that query execution through Athena requires knowledge of and access to source data files and query output files rather than simply a connection to the database. Both source data files and query output files reside in Amazon S3. Review Amazon's Athena documentation for full details.
Immuta and Amazon Athena
Keep the following points in mind when using Athena with Immuta:
- You must create an Athena database in your AWS account that this data source will reference.
- Immuta must be aware of the S3 bucket in which results are stored. Note that results are stored in the bucket configured in Immuta, not in the location you may see in the Athena UI on the AWS console.
- Immuta must be aware of the folder within that bucket in which query results for this data source will be stored.
- You must provide Immuta with AWS credentials capable of running the query. Full details are below, but the credentials must be able to
  - read and list the contents of the bucket(s) that back the Athena database.
  - read, list the contents of, and write to the bucket and path where query results will be stored.
  - read metadata information from AWS Glue.
  - create, execute, and delete queries within AWS Athena.
- Athena writes query results to S3. Neither Athena nor Immuta automatically deletes query results. Per Amazon's recommendation, you should set up S3 lifecycle policies to manage query result retention.
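As an illustration of the retention point above, the following is a minimal sketch of an S3 lifecycle configuration that expires objects under the query results path. The rule ID, the 30-day retention period, and the helper name `build_expiration_rule` are assumptions chosen for the example; they are not values prescribed by Immuta or Amazon.

```python
import json


def build_expiration_rule(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": "expire-athena-query-results",  # hypothetical rule name
                "Filter": {"Prefix": prefix},          # only objects under this path
                "Status": "Enabled",
                "Expiration": {"Days": days},          # delete after this many days
            }
        ]
    }


# Placeholder results path and an assumed 30-day retention period.
lifecycle = build_expiration_rule("your-results-path/", 30)
print(json.dumps(lifecycle, indent=2))
```

If you use boto3, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, with `Bucket` set to your query results bucket. Be aware that this call replaces any lifecycle configuration already present on the bucket.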
Connection Information
The fields outlined below are required to create your Athena data source.
Authentication Method
Immuta supports AWS IAM Instance Profiles for authentication or the direct entry of an AWS Access Key Id and AWS Secret Access Key. If your Immuta instance is configured to allow Instance Profile authentication and your Immuta profile has the CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE permission, you will have the option to choose the authentication method.
By default, Immuta is not configured to allow Instance Profile authentication. Please contact your Immuta Support Professional for details should you need to enable it.
If you do not have the option to select an authentication method, you are using the AWS Access Key method. If you do have the option, you can choose between the following:
- AWS Access Key
- AWS Instance Role
Further Options
- AWS Access Key Id: the AWS access key ID you wish to use to create this data source (not available or required when using AWS Instance Role authentication)
- AWS Secret Access Key: the secret access key associated with the above Access Key Id (not available or required when using AWS Instance Role authentication)
- AWS Region: the region in which the Athena database and S3 buckets are located
- Database: the name of the Athena database from which you wish to create the data source
- Query Results Bucket: the S3 bucket in which query results will be stored. Immuta will specify this value in each query sent to Athena. The bucket must exist.
- Query Results Directory Path: the path in the S3 bucket where Athena will store query results. This folder will be created if it does not already exist.
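Athena accepts a query's output location as a single `s3://` URI. The sketch below shows how the Query Results Bucket and Query Results Directory Path values could combine into such a URI. How Immuta joins these values internally is not documented here, so the helper `query_output_location` and its normalization of slashes are assumptions for illustration only.

```python
def query_output_location(bucket: str, directory_path: str) -> str:
    """Join a bucket name and directory path into an s3:// URI ending in '/'."""
    if not bucket:
        raise ValueError("bucket is required")
    # Drop leading/trailing slashes so the URI has no empty path segments.
    path = directory_path.strip("/")
    return f"s3://{bucket}/{path}/" if path else f"s3://{bucket}/"


# Placeholder values matching the policy example in this guide.
location = query_output_location("your-bucket-query-results", "/your-results-path")
print(location)  # s3://your-bucket-query-results/your-results-path/
```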
IAM Role for Data Source IAM User
Note that these requirements are due to the implementation of Athena and the Simba ODBC driver. AWS does not currently support resource specifications other than * for Athena and Glue.
The following policy must be edited for your data locations. It uses these placeholder values:
- your-bucket-query-results for Query Results Bucket
- your-results-path for Query Results Directory Path
- your-source-bucket as the S3 bucket backing the Athena table
- your-source-path as the path prefix within that bucket where the files backing the Athena table are stored
Note: While path information is included in the example below, it is recommended that paths not be used in the resource restrictions. Additionally, single-bucket source data is the only tested configuration. Athena databases with source data in multiple buckets may work, but would require that additional resources be specified in the policy below anywhere your-source is referenced.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "glue:GetTableVersions",
        "glue:GetPartitions",
        "athena:GetQueryResults",
        "athena:DeleteNamedQuery",
        "athena:GetNamedQuery",
        "athena:ListQueryExecutions",
        "athena:GetExecutionEngine",
        "athena:StopQueryExecution",
        "athena:GetExecutionEngines",
        "athena:RunQuery",
        "glue:GetTables",
        "athena:GetNamespace",
        "athena:GetQueryExecutions",
        "glue:GetDatabases",
        "athena:GetCatalogs",
        "glue:GetTable",
        "athena:ListNamedQueries",
        "glue:GetDatabase",
        "athena:GetNamespaces",
        "glue:GetPartition",
        "athena:CreateNamedQuery",
        "athena:CancelQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetTables",
        "athena:GetTable",
        "athena:BatchGetNamedQuery",
        "athena:BatchGetQueryExecution",
        "athena:GetQueryResultsStream"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::your-bucket-query-results/your-results-path*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucketMultipartUploads",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-query-results",
        "arn:aws:s3:::your-source-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::your-source-bucket/your-source-path/*"
    }
  ]
}
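Because the policy above must be edited by hand for each data location, it may help to see the substitution spelled out. The following is a minimal sketch that fills the four placeholders into the three S3 statements; the Athena and Glue statement is omitted because its resource is * and needs no editing. The function name `build_s3_statements` and the sample bucket and path names are hypothetical.

```python
import json


def build_s3_statements(results_bucket, results_path, source_bucket, source_path):
    """Return the three S3 statements of the policy with placeholders filled in."""
    results_arn = f"arn:aws:s3:::{results_bucket}"
    source_arn = f"arn:aws:s3:::{source_bucket}"
    return [
        {   # read and write query result objects
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject",
                       "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"],
            "Resource": f"{results_arn}/{results_path}*",
        },
        {   # list and locate both buckets
            "Effect": "Allow",
            "Action": ["s3:ListBucketMultipartUploads", "s3:ListBucket",
                       "s3:GetBucketLocation"],
            "Resource": [results_arn, source_arn],
        },
        {   # read the files backing the Athena table
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListMultipartUploadParts"],
            "Resource": f"{source_arn}/{source_path}/*",
        },
    ]


# Hypothetical bucket and path names, for illustration only.
statements = build_s3_statements("my-results", "athena/out", "my-data", "tables/sales")
print(json.dumps(statements, indent=2))
```

The resulting list can be placed in the "Statement" array alongside the unchanged Athena and Glue statement. For source data spread across multiple buckets, which is an untested configuration, the second and third statements would need additional resources.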