Audience: System Administrators
Content Summary: This simple deployment guide familiarizes users with Immuta on EMR. It is intended only for deploying clusters for non-production purposes, such as demos or proofs of concept. For more robust deployments, please see the main installation guide for Immuta on EMR.
Deprecation notice
Support for this integration has been deprecated.
AWS CLI (v1.16.x or greater) installed in a bash environment.
The CLI should be configured to use a role that is able to fully manage EMR, IAM, and S3 resources. This can be a user role in a local environment or an instance role on an EC2 instance.
Resource IDs for your chosen AWS VPC subnet and EMR-managed security groups.
Be sure that your master and worker security groups are configured for bi-directional communication with your Immuta instance.
An instance of Immuta that is reachable from your chosen AWS VPC.
A username and password for the Immuta archives site. You can get these from your Immuta support professional.
First, download the quickstart script:
Next, run the script. Note that you will be prompted for input variables. If a variable is not required, you can press enter to use the displayed default value.
See below for an example of the script being run and prompting for variables. Note that any input in the example is simply for demonstration purposes; you will need to provide your own values.
The immuta-emr-quickstart.sh script will prompt the user for input variables to configure the AWS resources required for the cluster. These variables are represented by the environment variables listed below. Exporting these environment variables prior to running the script will skip the prompts.
CLUSTER_NAME
Optional. The name of the EMR cluster to be created.
Default: immuta-quickstart.
EMR_VERSION
Optional. The EMR version of the cluster. Currently supported versions are 5.17.0 through 5.23.0.
Default: 5.23.0.
IMMUTA_VERSION
Optional. The full Immuta version to be installed on the cluster.
Default: 2024.1.13_20240624.
IMMUTA_INSTANCE_URL
Required. The URL of the Immuta instance that will drive policies on the cluster.
AWS_REGION
Optional. The AWS Region that the cluster will run in.
Default: us-east-1.
INSTANCE_COUNT
Optional. The number of instances (master + worker) in the cluster.
Default: 3.
INSTANCE_TYPE
Optional. The type of instance for cluster nodes.
Default: m5.xlarge.
AWS_KEY_NAME
Required. The name of the SSH keypair in AWS that will be used to connect to the cluster.
AWS_SUBNET_ID
Required. The ID string of the subnet that the cluster will run in.
SERVICE_SECURITY_GROUP
Required. The ID string of the security group for the cluster's EMR services.
MASTER_SECURITY_GROUP
Required. The ID string of the security group for the cluster's master node.
WORKER_SECURITY_GROUP
Required. The ID string of the security group for the cluster's worker nodes.
ARCHIVE_USERNAME
Required. The username for Immuta Archives.
ARCHIVE_PASSWORD
Required. The password for Immuta Archives.
BOOTSTRAP_BUCKET
Optional. The S3 bucket where bootstrap artifacts will be stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
Default: immuta-emr-bootstrap-$AWS_ACCOUNT_ID-$AWS_REGION.
DATA_BUCKET
Optional. The S3 bucket where partitioned data is stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
Default: immuta-emr-data-$AWS_ACCOUNT_ID-$AWS_REGION.
KADMIN_PASSWORD
Optional. The Kerberos admin password that will be used to create Kerberos principals on the cluster's dedicated internal KDC.
Default: random.
HDFS_SYSTEM_TOKEN
Optional. The HDFS System Token that the cluster will use to securely communicate with the Immuta instance. You should generate this value in the Immuta Configuration UI before creating your cluster.
Default: random.
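As a minimal sketch of skipping the prompts, the required variables can be exported before the script is run. Every value below is a placeholder, and the helper missing_required_vars is a hypothetical convenience that is not part of the quickstart script; it only reports required variables that are still unset.

```shell
# Placeholder values only; substitute the values for your own environment.
export IMMUTA_INSTANCE_URL="https://immuta.example.com"
export AWS_KEY_NAME="my-keypair"
export AWS_SUBNET_ID="subnet-0123456789abcdef0"
export SERVICE_SECURITY_GROUP="sg-0000000000000000a"
export MASTER_SECURITY_GROUP="sg-0000000000000000b"
export WORKER_SECURITY_GROUP="sg-0000000000000000c"
export ARCHIVE_USERNAME="archive-user"
export ARCHIVE_PASSWORD="archive-password"

# Hypothetical helper: print the name of each required variable that is
# unset or empty, one per line. Prints nothing when all are set.
missing_required_vars() {
  for name in IMMUTA_INSTANCE_URL AWS_KEY_NAME AWS_SUBNET_ID \
              SERVICE_SECURITY_GROUP MASTER_SECURITY_GROUP \
              WORKER_SECURITY_GROUP ARCHIVE_USERNAME ARCHIVE_PASSWORD; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}
```

If missing_required_vars prints nothing, running immuta-emr-quickstart.sh should prompt only for the optional variables, which can be accepted at their defaults.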
You will need to copy the immuta.keytab and krb5.conf files from the cluster and upload them to your Immuta instance using the Immuta Configuration UI.
The quickstart bootstrap automatically seeds the cluster with three user principals for you to use while familiarizing yourself with the Immuta platform and data policies: owner, consumer1, and consumer2. The default Kerberos password for these users is immuta-quickstart.
You can associate these users with your Immuta users by following this guide. Note that only the owner principal will have access to the data in your chosen S3 data bucket, so this is the principal that you should use to create your data sources in Immuta.
Audience: System Administrators
Content Summary: This tutorial will guide you through the process of spinning up an Amazon Elastic MapReduce (EMR) cluster with Immuta's Hadoop and Spark security plugins installed.
Deprecation notice
Support for this integration has been deprecated.
This tutorial contains examples using the AWS CLI. These examples are conceptual in nature and will require modification to adapt to your exact deployment needs. If you wish to quickly familiarize yourself with Immuta's EMR integration, please visit the Quickstart Installation Guide for Immuta on AWS EMR.
This deployment is tested and known to work on the EMR releases listed below.
5.17.0
5.18.0
5.19.0
5.20.0
5.21.0
5.22.0
5.23.0
5.24.0
5.25.0
5.26.0
5.27.0
5.28.0
5.29.0
5.30.0
5.31.0
5.32.0
In addition to the EMR cluster itself, Immuta requires a handful of additional AWS resources in order to function properly.
In order to bootstrap the EMR cluster with Immuta's software bundle and startup scripts, you will need to create an S3 bucket to hold these artifacts.
In this guide, the bucket is referenced by the placeholder $BOOTSTRAP_BUCKET. Replace this placeholder with a unique bucket name of your choosing. The bucket must contain all artifacts listed below. These artifacts can be found at Immuta Downloads.
Immuta's Spark integration relies on an IAM role policy that has access to the S3 buckets where your sensitive data is stored. Note that the EC2 Instance Roles for your EMR cluster should not have access to these buckets. Immuta will broker access to the data in these buckets to authorized users.
Modify the JSON data below to include the correct name of your data bucket(s), and save as immuta_data_iam_policy.json.
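The original policy document is not reproduced in this text. As a rough sketch only, a read-oriented S3 policy for a single data bucket could be generated as follows. The function name render_data_policy and the action list (s3:ListBucket, s3:GetObject) are assumptions, not Immuta's published policy; adjust both to your deployment.

```shell
# Print an S3 access policy for one data bucket (bucket name is argument 1).
# ASSUMPTION: the action list below is illustrative, not Immuta's exact policy.
render_data_policy() {
  bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
}
```

For example, `render_data_policy my-data-bucket > immuta_data_iam_policy.json` would write the document to the file name used in this guide.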
If you are leveraging Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder.
Now you can run the following command to create the Immuta IAM user policy.
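A minimal sketch of that command, wrapped in a function so it can be reviewed before it is run. The default policy name immuta_emr_data_policy is a hypothetical placeholder; the `--query` and `--output` options simply print the new policy's ARN, which is needed when attaching the policy later.

```shell
# Create the managed policy from the saved document and print its ARN.
# "immuta_emr_data_policy" is a hypothetical default name; pass your own as $1.
create_data_policy() {
  aws iam create-policy \
    --policy-name "${1:-immuta_emr_data_policy}" \
    --policy-document file://immuta_data_iam_policy.json \
    --query 'Policy.Arn' --output text
}
```

Calling `create_data_policy` from the directory that contains immuta_data_iam_policy.json issues the request with the default name.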
The IAM role that brokers access to S3 data must be able to assume the cluster node instance roles, and vice versa. Since this is a cycle, you will need to create both roles with generic trust policies, and then update them after both roles are created.
Create a file called immuta_data_role_trust_policy_generic.json as seen below.
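The original file contents are not included in this text. One plausible form of a generic trust policy, assuming it only needs to trust the EC2 service until the real principals exist, is sketched here; the function name is hypothetical.

```shell
# Print a generic trust policy that lets the EC2 service assume the role.
# ASSUMPTION: trusting ec2.amazonaws.com is a stand-in until the role-to-role
# trust can be added later in this guide.
render_generic_trust_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}
```

Redirect the output to immuta_data_role_trust_policy_generic.json to save it under the name this guide uses.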
After creating the immuta_data_role_trust_policy_generic.json file from above, run the following command to create the Immuta data IAM role. Note that you will be using the generic IAM role trust policy that you created in the previous step. This will be updated when both the data and instance IAM roles are created.
Next you will need to attach the IAM policy that allows access to your protected data in S3.
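The create and attach steps described above can be sketched as two small functions. The role name immuta_emr_data_role comes from this guide; the policy ARN is whatever `aws iam create-policy` returned for your data policy, and the function names are hypothetical.

```shell
# Create the data role using the generic trust policy saved earlier.
create_data_role() {
  aws iam create-role \
    --role-name immuta_emr_data_role \
    --assume-role-policy-document file://immuta_data_role_trust_policy_generic.json
}

# Attach the data access policy to the role; $1 is the policy ARN.
attach_data_policy() {
  aws iam attach-role-policy \
    --role-name immuta_emr_data_role \
    --policy-arn "$1"
}
```

A typical sequence would be `create_data_role` followed by `attach_data_policy "$POLICY_ARN"`, where POLICY_ARN is a variable you set yourself.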
Modify the JSON data below to include the correct name of your bootstrap bucket, and save as immuta_emr_instance_policy.json.
Note that the above policy is derived from the Minimal EMR role for EC2 (instance profile) policy described in Amazon's Best Practices for Securing Amazon EMR guide. You may need to tune this policy based on your organization's environment and needs.
After creating the immuta_emr_instance_policy.json file from above, run the following command to create the Immuta EMR Instance policy.
The node instance IAM role must be able to assume the IAM role that brokers access to S3 data, and vice versa. Assuming you have already created the immuta_emr_data_role, create a JSON file called instance_role_trust_policy.json as shown below.
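The original JSON is not included in this text. A sketch of a trust policy that satisfies the requirement above, assuming the instance role should trust both the EC2 service and the data role, might look like this. The function name is hypothetical and the AWS account ID is supplied as the first argument.

```shell
# Print a trust policy for the instance role; $1 is the 12-digit AWS account ID.
# ASSUMPTION: trusting both the EC2 service and immuta_emr_data_role is one way
# to express the mutual-assume requirement; verify against your own setup.
render_instance_trust_policy() {
  account_id="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com",
        "AWS": "arn:aws:iam::${account_id}:role/immuta_emr_data_role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}
```

For example, `render_instance_trust_policy 123456789012 > instance_role_trust_policy.json`.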
Now you can create the instance role with the policy document from above.
Next you will need to attach the IAM policy that allows access to required resources for your cluster.
After creating the role and policy for the Immuta instances, you can create the Immuta EC2 Instance Profile.
After creating the Instance Profile, you can attach the newly created Role.
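The two instance profile steps can be sketched together. The default names immuta_emr_instance_profile and immuta_emr_instance_role are hypothetical placeholders, since the names used by the original commands are not shown in this text.

```shell
# Create an instance profile and add the instance role to it.
# $1 = instance profile name, $2 = instance role name (both hypothetical defaults).
create_instance_profile() {
  profile="${1:-immuta_emr_instance_profile}"
  role="${2:-immuta_emr_instance_role}"
  aws iam create-instance-profile --instance-profile-name "$profile" &&
  aws iam add-role-to-instance-profile \
    --instance-profile-name "$profile" \
    --role-name "$role"
}
```

The second command runs only if the first succeeds, so a failed profile creation does not leave a confusing follow-up error.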
Now that both the data and instance IAM roles are created, you can update the trust policy of the data IAM role to include the instance role.
Create a file called data_role_trust_policy.json as shown below.
Now you can update the trust policy of the data IAM role.
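A minimal sketch of the update, using the role and file names given in this guide; only the function name is an addition.

```shell
# Replace the data role's generic trust policy with the final one.
update_data_role_trust() {
  aws iam update-assume-role-policy \
    --role-name immuta_emr_data_role \
    --policy-document file://data_role_trust_policy.json
}
```

Run `update_data_role_trust` from the directory that contains data_role_trust_policy.json.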
Navigate to the App Settings page and generate an Immuta HDFS System Token. Copy the value generated by Immuta, and create a new secret in AWS Secrets Manager as shown below.
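The original command is not included in this text, so the secret name and the format of the secret value that the Immuta bootstrap expects are not known here. As a sketch under that caveat, a secret can be created with the AWS CLI as follows; both arguments are supplied by you, and the function name is hypothetical.

```shell
# Store the HDFS System Token in AWS Secrets Manager.
# $1 = secret name (ASSUMPTION: use the name your bootstrap configuration expects)
# $2 = token value copied from the Immuta App Settings page
create_token_secret() {
  secret_name="$1"
  token="$2"
  aws secretsmanager create-secret \
    --name "$secret_name" \
    --secret-string "$token"
}
```

Avoid typing the token directly on the command line where shell history is retained; reading it into a variable first is one option.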
Complete the JSON template below and save as ec2_attributes.json. You may remove keys where you would like to use default values.
When choosing security groups for your master and worker nodes, be sure that they provide bi-directional access between the nodes and your Immuta instance.
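The original template is not reproduced in this text. A sketch using key names accepted by the `--ec2-attributes` option of `aws emr create-cluster` is shown below. The environment variable names are borrowed from the quickstart guide for convenience, INSTANCE_PROFILE is a hypothetical variable, and the function name is an assumption; all variables must be set before the function is called.

```shell
# Print an ec2_attributes.json document from environment variables.
# ASSUMPTION: the variable names are illustrative; set them before calling.
render_ec2_attributes() {
  cat <<EOF
{
  "KeyName": "${AWS_KEY_NAME}",
  "SubnetId": "${AWS_SUBNET_ID}",
  "InstanceProfile": "${INSTANCE_PROFILE}",
  "EmrManagedMasterSecurityGroup": "${MASTER_SECURITY_GROUP}",
  "EmrManagedSlaveSecurityGroup": "${WORKER_SECURITY_GROUP}",
  "ServiceAccessSecurityGroup": "${SERVICE_SECURITY_GROUP}"
}
EOF
}
```

For example, `render_ec2_attributes > ec2_attributes.json` once the variables are exported.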
Immuta requires a custom configuration file for Hadoop services to be passed in to the cluster. The required configurations are displayed below. Modify the JSON data to match your environment and save as cluster_configuration.json.
Next, create a file called bootstrap_actions.json to configure the Immuta bootstrap action. If you have any additional bootstrap actions to run outside of Immuta, they should be added here as well.
If you wish to deploy a kerberized cluster, create a kerberos_attributes.json file with your desired Kerberos configurations. Note that although Kerberos is not strictly required, a cluster without it should not be considered secure for production.
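As a minimal sketch for a cluster-dedicated KDC, the file needs at least a realm and a KDC admin password; Realm and KdcAdminPassword are key names accepted by the `--kerberos-attributes` option of `aws emr create-cluster`. The function name is hypothetical, and the password must not contain characters that need JSON escaping, since no escaping is performed here.

```shell
# Print a kerberos_attributes.json document.
# $1 = Kerberos realm (for example EC2.INTERNAL), $2 = KDC admin password.
# NOTE: no JSON escaping is done; avoid quotes and backslashes in the password.
render_kerberos_attributes() {
  realm="$1"
  kdc_admin_password="$2"
  cat <<EOF
{
  "Realm": "${realm}",
  "KdcAdminPassword": "${kdc_admin_password}"
}
EOF
}
```

For example, `render_kerberos_attributes EC2.INTERNAL "$KADMIN_PASSWORD" > kerberos_attributes.json`, where KADMIN_PASSWORD is a variable you set yourself.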
You will need to create a security configuration before creating the EMR cluster so that Immuta's EMRFS integration can leverage the IAM role you created to access data in S3.
First, create a security_configuration.json file with your desired security settings. A basic example with a cluster-dedicated KDC for Kerberos is shown below. Note that you are allowing the following system users to use the data IAM role: hadoop, hive, and immuta_emr. Data Owners must also have access to this data to use the Immuta Query Engine. This example grants access to any user in the fictional data_owners group. See the official AWS Documentation for more details on configuring IAM roles for EMRFS.
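The original example is not included in this text. A sketch that matches the description above (cluster-dedicated KDC, the three system users, and the fictional data_owners group) is given here, following the EMR security configuration JSON layout. The function name and the 24-hour ticket lifetime are assumptions; the data role ARN is passed as the first argument.

```shell
# Print a security configuration with a cluster-dedicated KDC and EMRFS role
# mappings. $1 = ARN of the data IAM role.
# ASSUMPTION: TicketLifetimeInHours of 24 is an illustrative value.
render_security_configuration() {
  data_role_arn="$1"
  cat <<EOF
{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": { "TicketLifetimeInHours": 24 }
    }
  },
  "AuthorizationConfiguration": {
    "EmrFsConfiguration": {
      "RoleMappings": [
        {
          "Role": "${data_role_arn}",
          "IdentifierType": "User",
          "Identifiers": ["hadoop", "hive", "immuta_emr"]
        },
        {
          "Role": "${data_role_arn}",
          "IdentifierType": "Group",
          "Identifiers": ["data_owners"]
        }
      ]
    }
  }
}
EOF
}
```

Redirect the output to security_configuration.json to save it under the name this guide uses.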
Next, create your security configuration with the following command.
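A minimal sketch of that command; the default configuration name immuta-security-configuration is a hypothetical placeholder, and the function name is an addition.

```shell
# Register the security configuration with EMR.
# $1 = configuration name (hypothetical default shown).
create_security_configuration() {
  aws emr create-security-configuration \
    --name "${1:-immuta-security-configuration}" \
    --security-configuration file://security_configuration.json
}
```

The name chosen here is the value to pass to `--security-configuration` when creating the cluster.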
Finally, you can now spin up an EMR cluster with Immuta's security plugins.
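The original command is not reproduced in this text. A sketch that ties together the files created in this guide is shown below. The application list, the EMR_DefaultRole service role, the default values, the environment variable names, and the function name are all assumptions to adapt; omit `--kerberos-attributes` if you are not deploying a kerberized cluster.

```shell
# Launch an EMR cluster using the JSON files prepared in this guide.
# ASSUMPTION: variable names, defaults, application list, and service role
# are illustrative and should be adapted to your deployment.
create_immuta_cluster() {
  aws emr create-cluster \
    --name "${CLUSTER_NAME:-immuta-emr}" \
    --release-label "emr-${EMR_VERSION:-5.23.0}" \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --instance-type "${INSTANCE_TYPE:-m5.xlarge}" \
    --instance-count "${INSTANCE_COUNT:-3}" \
    --service-role EMR_DefaultRole \
    --ec2-attributes file://ec2_attributes.json \
    --configurations file://cluster_configuration.json \
    --bootstrap-actions file://bootstrap_actions.json \
    --kerberos-attributes file://kerberos_attributes.json \
    --security-configuration "${SECURITY_CONFIGURATION:-immuta-security-configuration}"
}
```

Run `create_immuta_cluster` from the directory that holds the JSON files, after setting any of the variables you want to override.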
To protect the Immuta user's AWS credentials as well as the kadmin password (if using Kerberos), it is recommended to overwrite the secret values that were created during the cluster deployment process. If you leave the secret values in AWS Secrets Manager, cluster users may be able to assume the instance role of the EMR nodes and read these values.
It is safe to remove these values after the cluster has finished bootstrapping. The example below overwrites the relevant secrets with null values.
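The original example is not included in this text. As a sketch, each secret can be overwritten with `aws secretsmanager put-secret-value`. The function name is hypothetical, and the secret names are passed as arguments because only immuta-kerberos-secret is named in this guide.

```shell
# Overwrite each named secret with the string "null".
# Arguments: one or more secret names or ARNs.
scrub_secrets() {
  for secret_id in "$@"; do
    aws secretsmanager put-secret-value \
      --secret-id "$secret_id" \
      --secret-string "null" || return 1
  done
}
```

For example, `scrub_secrets immuta-kerberos-secret` overwrites the Kerberos secret; add the names of any other secrets created for your deployment.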
Note that if you are using an external KDC without a cross-realm trust (no KDC on the cluster), you should put the kadmin password back into the immuta-kerberos-secret. This is required to clean up the Immuta service principals that will have been created on the external KDC.