Amazon Athena Data Source Considerations

Audience: Data Owners

Content Summary: This guide details considerations regarding Amazon Athena data sources. Be sure to understand the general concepts behind query-backed data sources before reading this page.

Athena differs from most other query-backed data sources in that query execution through Athena requires knowledge of and access to source data files and query output files rather than simply a connection to the database. Both source data files and query output files reside in Amazon S3. Review Amazon's Athena documentation for full details.

Immuta and Amazon Athena

When using Athena with Immuta, keep the following points in mind:

  1. You must create an Athena database in your AWS account that this data source will reference.
  2. Immuta must be aware of the S3 bucket in which results are stored. Note that the storage location is determined by the value you provide to Immuta, not by what you may see in the Athena UI on the AWS console.
  3. Immuta must be aware of the folder within that bucket in which query results for this data source will be stored.
  4. You must provide Immuta with AWS credentials capable of running the query. Full details are below, but the credentials must be able to
    • Read and list the contents of the bucket(s) that back the Athena database.
    • Read, list the contents of, and write to the bucket and path where query results will be stored.
    • Read metadata information from AWS Glue.
    • Create, execute, and delete queries within Amazon Athena.
  5. Athena writes query results to S3. Neither Athena nor Immuta automatically deletes query results. Per Amazon's recommendation, you should set up S3 lifecycle policies to manage query result retention.
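
As an illustration of the lifecycle recommendation in point 5, the following minimal sketch builds an S3 lifecycle configuration that expires objects under the query results path. The function name, the rule ID, and the 30-day retention period are assumptions chosen for the example; they do not come from Immuta or Amazon guidance, so pick a retention period that suits your own requirements.

```python
import json


def build_results_expiration_rule(results_path: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that expires query results.

    results_path is the Query Results Directory Path; days is a
    hypothetical retention period chosen by the data owner.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    # Normalize the prefix: no leading slash, one trailing slash.
    prefix = results_path.strip("/")
    if prefix:
        prefix += "/"
    return {
        "Rules": [
            {
                "ID": "expire-athena-query-results",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    config = build_results_expiration_rule("your-results-path", days=30)
    print(json.dumps(config, indent=2))
```

If you manage buckets with boto3, a dictionary of this shape can be passed as the LifecycleConfiguration argument of the S3 client's put_bucket_lifecycle_configuration call, with the Query Results Bucket as the Bucket argument. The same rule can also be created in the S3 console.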

Connection Information

The fields outlined below are required to create your Athena data source.

Authentication Method

Immuta supports AWS IAM Instance Profiles for authentication or the direct entry of an AWS Access Key Id and AWS Secret Access Key. If your Immuta instance is configured to allow Instance Profile authentication and your Immuta profile has the permission CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE, you will have the option to choose an authentication method.

By default, Immuta is not configured to allow Instance Profile authentication. Please contact your Immuta Support Professional for details should you need to enable it.

If you do not have the option to select an authentication method, you are using the AWS Access Key method. If you do have the option, you can choose between the following:

  1. AWS Access Key
  2. AWS Instance Role

Further Options

  1. AWS Access Key Id: the AWS Access Key Id you wish to use to create this data source (not available or required when using AWS Instance Role authentication)
  2. AWS Secret Access Key: the secret access key associated with the above Access Key Id (not available or required when using AWS Instance Role authentication)
  3. AWS Region: the region in which the Athena database and S3 buckets are located
  4. Database: the name of the Athena database from which you wish to create the data source
  5. Query Results Bucket: the S3 bucket in which query results will be stored. Immuta will specify this value in each query sent to Athena. The bucket must exist.
  6. Query Results Directory Path: the path in the S3 bucket where Athena will store query results. This folder will be created if it does not already exist.
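
To make the relationship between the last two fields concrete, the sketch below combines a bucket and a directory path into the S3 URI form that Athena accepts as a query output location. This is an assumption about how the two values fit together, not a description of Immuta's internal implementation, and the function name is hypothetical.

```python
def query_results_location(bucket: str, directory_path: str) -> str:
    """Combine the Query Results Bucket and Directory Path into an S3 URI."""
    if not bucket:
        raise ValueError("bucket must not be empty")
    # Drop stray slashes so the URI has exactly one separator per segment.
    path = directory_path.strip("/")
    return f"s3://{bucket}/{path}/" if path else f"s3://{bucket}/"


if __name__ == "__main__":
    print(query_results_location("your-bucket-query-results", "your-results-path"))
```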

IAM Policy for Data Source IAM User

Note that these requirements are due to the implementation of Athena and the Simba ODBC driver. AWS does not currently support resource specifications other than * for Athena and Glue.

The following policy must be edited for your data locations. It uses these placeholder values:

  • your-bucket-query-results for Query Results Bucket
  • your-results-path for Query Results Directory Path
  • your-source-bucket as the S3 bucket backing the Athena table
  • your-source-path as the path prefix within that bucket where the files backing the Athena table are stored

Note: While path information is included in the example below, it is recommended that you not use paths in the resource restrictions. Additionally, single-bucket source data is the only tested configuration. Athena databases with source data in multiple buckets may work, but would require that additional resources be specified in the policy below anywhere your-source-bucket is referenced.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "glue:GetTableVersions",
                "glue:GetPartitions",
                "athena:GetQueryResults",
                "athena:DeleteNamedQuery",
                "athena:GetNamedQuery",
                "athena:ListQueryExecutions",
                "athena:GetExecutionEngine",
                "athena:StopQueryExecution",
                "athena:GetExecutionEngines",
                "athena:RunQuery",
                "glue:GetTables",
                "athena:GetNamespace",
                "athena:GetQueryExecutions",
                "glue:GetDatabases",
                "athena:GetCatalogs",
                "glue:GetTable",
                "athena:ListNamedQueries",
                "glue:GetDatabase",
                "athena:GetNamespaces",
                "glue:GetPartition",
                "athena:CreateNamedQuery",
                "athena:CancelQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetTables",
                "athena:GetTable",
                "athena:BatchGetNamedQuery",
                "athena:BatchGetQueryExecution",
                "athena:GetQueryResultsStream"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::your-bucket-query-results/your-results-path*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-query-results",
                "arn:aws:s3:::your-source-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::your-source-bucket/your-source-path/*"
        }
    ]
}
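
Editing the placeholders by hand is easy to get wrong, so one option is to substitute them with a small script. The following is a minimal sketch that assumes you have the policy above available as text; the function name and the bucket and path values in the demo are hypothetical. It replaces the four placeholders and then parses the result to confirm it is still valid JSON.

```python
import json

# Placeholder names as they appear in the policy above.
PLACEHOLDERS = (
    "your-bucket-query-results",
    "your-results-path",
    "your-source-bucket",
    "your-source-path",
)


def fill_policy(template: str, values: dict) -> dict:
    """Replace each placeholder in the policy text and return the parsed policy."""
    missing = [name for name in PLACEHOLDERS if name not in values]
    if missing:
        raise KeyError("missing values for: " + ", ".join(missing))
    text = template
    for name in PLACEHOLDERS:
        # Strip stray slashes so paths do not produce "//" in the ARN.
        text = text.replace(name, values[name].strip("/"))
    # json.loads raises ValueError if the edited text is not valid JSON.
    return json.loads(text)


if __name__ == "__main__":
    # One statement excerpted from the policy above, used only as a demo.
    excerpt = """
    {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListMultipartUploadParts"],
        "Resource": "arn:aws:s3:::your-source-bucket/your-source-path/*"
    }
    """
    filled = fill_policy(
        excerpt,
        {
            "your-bucket-query-results": "example-athena-results",  # hypothetical
            "your-results-path": "immuta/results",  # hypothetical
            "your-source-bucket": "example-source-data",  # hypothetical
            "your-source-path": "tables/sales",  # hypothetical
        },
    )
    print(json.dumps(filled, indent=4))
```

The sketch uses plain string replacement, so it assumes that none of your real bucket or path names contains one of the placeholder strings.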