Configuration Guides

Amazon EMR

Audience: System Administrators
Content Summary: This tutorial guides you through spinning up an Amazon Elastic MapReduce (EMR) cluster with Immuta's Hadoop and Spark security plugins installed.

Deprecation notice

Support for this integration has been deprecated.

Introduction

This tutorial contains examples using the AWS CLI. These examples are conceptual in nature and will require modification to adapt to your exact deployment needs. If you wish to quickly familiarize yourself with Immuta's EMR integration, please visit the Quickstart Installation Guide for Immuta on AWS EMR.

Supported EMR Versions

This deployment is tested and known to work on the EMR releases listed below.

5.17.0
5.18.0
5.19.0
5.20.0
5.21.0
5.22.0
5.23.0
5.24.0
5.25.0
5.26.0
5.27.0
5.28.0
5.29.0
5.30.0
5.31.0
5.32.0

Create Prerequisite AWS Resources

In addition to the EMR cluster itself, Immuta requires a handful of additional AWS resources in order to function properly.

Immuta Bootstrap Bucket

In order to bootstrap the EMR cluster with Immuta's software bundle and startup scripts, you will need to create an S3 bucket to hold these artifacts.

In this guide, the bucket is referenced by the placeholder $BOOTSTRAP_BUCKET. Replace this placeholder with a unique bucket name of your choosing. The bucket must contain all artifacts listed below. These artifacts can be found at Immuta Downloads.

s3://$BOOTSTRAP_BUCKET/immuta-bootstrap
s3://$BOOTSTRAP_BUCKET/immuta-bootstrap.tar.gz
s3://$BOOTSTRAP_BUCKET/immuta_bundle-$IMMUTA_VERSION.tar.gz
s3://$BOOTSTRAP_BUCKET/install.sh
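
As a rough sketch, the upload of these artifacts could be scripted as follows. It assumes the four files have already been downloaded into the current directory under the names listed above. The function name and the AWS_CLI override (set it to echo to preview the commands without running them) are illustrative and not part of Immuta's tooling.

```shell
# Sketch: copy the four bootstrap artifacts to s3://<bucket>/.
# Assumption: the artifacts are in the current directory with the names used in this guide.
# AWS_CLI is a hypothetical override; AWS_CLI=echo prints the commands instead of running them.
upload_bootstrap_artifacts() {
    local bucket="$1" version="$2" artifact
    local cli="${AWS_CLI:-aws}"
    for artifact in \
        "immuta-bootstrap" \
        "immuta-bootstrap.tar.gz" \
        "immuta_bundle-${version}.tar.gz" \
        "install.sh"
    do
        # Stop at the first failed upload so a partial bucket is noticed early.
        "$cli" s3 cp "./${artifact}" "s3://${bucket}/${artifact}" || return 1
    done
}
```

For example, upload_bootstrap_artifacts "$BOOTSTRAP_BUCKET" "$IMMUTA_VERSION" copies each file to the matching key. If the bucket does not exist yet, it can be created first with aws s3 mb s3://$BOOTSTRAP_BUCKET.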

Immuta Data IAM Role

Immuta's Spark integration relies on an IAM role policy that has access to the S3 buckets where your sensitive data is stored. Note that the EC2 Instance Roles for your EMR cluster should not have access to these buckets. Immuta will broker access to the data in these buckets to authorized users.

Create Immuta Data IAM Policy

Modify the JSON data below to include the correct name of your data bucket(s), and save as immuta_data_iam_policy.json.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        }
    ]
}

If you are leveraging Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::$WORKSPACE_BUCKET",
                "arn:aws:s3:::$WORKSPACE_BUCKET/*"
            ]
        }
    ]
}

Now you can run the following command to create the Immuta data IAM policy.

aws iam create-policy \
    --policy-name immuta_emr_data_policy \
    --policy-document file://immuta_data_iam_policy.json
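
Rather than editing the bucket placeholders by hand, the first (read-only) policy document above could be rendered from shell variables. The following is a minimal sketch under that assumption. The function name render_data_policy is hypothetical, and it handles exactly two data buckets, as in the example.

```shell
# Sketch: print the read-only data policy with the two bucket names filled in.
# Assumption: exactly two data buckets, mirroring $DATA_BUCKET_1 and $DATA_BUCKET_2 above.
render_data_policy() {
    local bucket_1="$1" bucket_2="$2"
    # An unquoted heredoc expands ${...} but performs no globbing, so "*" is kept literally.
    cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::${bucket_1}",
                "arn:aws:s3:::${bucket_2}",
                "arn:aws:s3:::${bucket_1}/*",
                "arn:aws:s3:::${bucket_2}/*"
            ]
        }
    ]
}
EOF
}
```

Redirecting the output, for example render_data_policy "$DATA_BUCKET_1" "$DATA_BUCKET_2" > immuta_data_iam_policy.json, produces the file that the create-policy command expects.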

Create Immuta Data IAM Role

The IAM role that brokers access to S3 data must be able to assume the cluster node instance roles, and vice versa. Since this is a cycle, you will need to create both roles with generic trust policies and then update them after both roles have been created.

Create a file called immuta_data_role_trust_policy_generic.json as seen below.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/EMR_EC2_DefaultRole"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

After creating the immuta_data_role_trust_policy_generic.json file from above, run the following command to create the Immuta data IAM role. Note that you will be using the generic IAM role trust policy that you created in the previous step. This will be updated when both the data and instance IAM roles are created.

aws iam create-role \
  --role-name immuta_emr_data_role \
  --assume-role-policy-document "file://immuta_data_role_trust_policy_generic.json"

Next you will need to attach the IAM policy that allows access to your protected data in S3.

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/immuta_emr_data_policy \
    --role-name immuta_emr_data_role
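
The $AWS_ACCOUNT_ID placeholder appears in every role and policy ARN in this guide. One possible way to resolve it and build those ARNs is sketched below. Both helper names are hypothetical, and current_account_id assumes the AWS CLI is configured with credentials for the target account.

```shell
# Sketch: helpers for the IAM ARNs used throughout this guide (names are illustrative).

# Build an IAM ARN from an account ID, a resource type (role or policy), and a name.
iam_arn() {
    printf 'arn:aws:iam::%s:%s/%s\n' "$1" "$2" "$3"
}

# Print the account ID of the credentials currently configured for the AWS CLI.
current_account_id() {
    aws sts get-caller-identity --query Account --output text
}
```

With these in place, AWS_ACCOUNT_ID="$(current_account_id)" followed by iam_arn "$AWS_ACCOUNT_ID" policy immuta_emr_data_policy yields the value passed to --policy-arn above.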

Create Immuta Instance IAM Policy

Modify the JSON data below to include the correct name of your bootstrap bucket, and save as immuta_emr_instance_policy.json.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "ec2:Describe*",
                "elasticmapreduce:Describe*",
                "elasticmapreduce:ListBootstrapActions",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListSteps"
            ]
        },
        {
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:*:$AWS_ACCOUNT_ID:AWS-ElasticMapReduce-*",
            "Action": [
                "sqs:CreateQueue",
                "sqs:DeleteQueue",
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:PurgeQueue",
                "sqs:ReceiveMessage"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Object"
            ],
            "Resource": [
                "arn:aws:s3:::$BOOTSTRAP_BUCKET/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::$BOOTSTRAP_BUCKET"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "secretsmanager:*",
            "Resource": [
                "arn:aws:secretsmanager:$AWS_REGION:$AWS_ACCOUNT_ID:secret:immuta-emr-secret-??????",
                "arn:aws:secretsmanager:$AWS_REGION:$AWS_ACCOUNT_ID:secret:immuta-kerberos-secret-??????"
            ]
        }
    ]
}

Note that the above policy is derived from the Minimal EMR role for EC2 (instance profile) policy described in Amazon's Best Practices for Securing Amazon EMR guide. You may need to tune this policy based on your organization's environment and needs.

After creating the immuta_emr_instance_policy.json file from above, run the following command to create the Immuta EMR Instance policy.

aws iam create-policy \
    --policy-name immuta_emr_instance_policy \
    --policy-document file://immuta_emr_instance_policy.json

Create Immuta Instance IAM Role

The node instance IAM role must be able to assume the IAM role that brokers access to S3 data, and vice versa. Assuming you have already created the immuta_emr_data_role, create a JSON file called instance_role_trust_policy.json as shown below.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
            "Service": "ec2.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

Now you can create the instance role with the policy document from above.

aws iam create-role \
  --role-name immuta_emr_instance_role \
  --assume-role-policy-document "file://instance_role_trust_policy.json"

Next you will need to attach the IAM policy that allows access to required resources for your cluster.

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/immuta_emr_instance_policy \
    --role-name immuta_emr_instance_role

Create Immuta EMR Instance Profile

After creating the role and policy for the Immuta instances, you can create the Immuta EC2 Instance Profile.

aws iam create-instance-profile \
    --instance-profile-name immuta_emr_instance_profile

After creating the Instance Profile, you can attach the newly created Role.

aws iam add-role-to-instance-profile \
    --instance-profile-name immuta_emr_instance_profile \
    --role-name immuta_emr_instance_role

Update Immuta Data IAM Role Trust Policy

Now that both the data and instance IAM roles are created, you can update the trust policy of the data IAM role to include the instance role.

Create a file called data_role_trust_policy.json as shown below.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_instance_role"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

Now you can update the trust policy of the data IAM role.

aws iam update-assume-role-policy \
  --role-name immuta_emr_data_role \
  --policy-document "file://data_role_trust_policy.json"

Immuta HDFS System Token in AWS Secrets Manager

Navigate to the App Settings page and generate an Immuta HDFS System Token. Copy the value generated by Immuta, and create a new secret in AWS Secrets Manager as shown below.

aws secretsmanager create-secret \
    --name immuta-emr-secret \
    --secret-string "$HDFS_SYSTEM_TOKEN"

Create EMR Cluster

EC2 Attributes Configuration File

Complete the JSON template below and save as ec2_attributes.json. You may remove keys where you would like to use default values.

When choosing security groups for your master and worker nodes, be sure that they provide bi-directional access between the nodes and your Immuta instance.

{
  "ServiceAccessSecurityGroup": "string",
  "AvailabilityZone": "string",
  "AdditionalSlaveSecurityGroups": ["string", ...],
  "EmrManagedMasterSecurityGroup": "string",
  "KeyName": "<the name of your SSH public key stored in AWS>",
  "InstanceProfile": "immuta_emr_instance_profile",
  "SubnetId": "string",
  "AdditionalMasterSecurityGroups": ["string", ...],
  "AvailabilityZones": ["string", ...],
  "EmrManagedSlaveSecurityGroup": "string"
}

Cluster Configuration File

Immuta requires a custom configuration file for Hadoop services to be passed in to the cluster. The required configurations are displayed below. Modify the JSON data to match your environment and save as cluster_configuration.json.

[
   {
      "Classification":"hdfs-site",
      "Properties":{
         "dfs.namenode.inode.attributes.provider.class":"com.immuta.hadoop.ImmutaInodeAttributeProvider",
         "dfs.namenode.acls.enabled":"true",
         "immuta.extra.name.node.plugin.config":"file:///opt/immuta/hadoop/name-node-conf.xml"
      },
      "Configurations":[]
   },
   {
      "Classification":"emrfs-site",
      "Properties":{
         "fs.s3.customAWSCredentialsProvider":"com.immuta.emr.ImmutaEMRAWSCredentialsProvider"
      },
      "Configurations":[]
   },
   {
      "Classification":"core-site",
      "Properties":{
         "immuta.permission.users.to.ignore":"hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,hadoop",
         "fs.immuta.impl":"com.immuta.hadoop.ImmutaFileSystem",
         "hadoop.proxyuser.immuta_emr.groups":"*",
         "hadoop.proxyuser.immuta_emr.users":"*",
         "hadoop.proxyuser.immuta_emr.hosts":"*",
         "hadoop.proxyuser.immuta.groups":"*",
         "hadoop.proxyuser.immuta.users":"*",
         "hadoop.proxyuser.immuta.hosts":"*",
         "immuta.cluster.name":"my_cluster",
         "immuta.spark.partition.generator.user":"immuta_emr",
         "immuta.credentials.dir":"/user",
         "immuta.base.url":"https://immuta.mycompany.com"
      },
      "Configurations":[]
   },
   {
      "Classification":"hadoop-env",
      "Properties":{},
      "Configurations":[
         {
            "Classification":"export",
            "Properties":{
               "HADOOP_CLASSPATH":"$HADOOP_CLASSPATH:/opt/immuta/hadoop/lib/immuta-inode-attribute-provider.jar:/opt/immuta/hadoop/lib/immuta-hadoop-filesystem.jar:/opt/immuta/hadoop/lib/immuta-emrfs-credential-provider.jar",
               "JAVA_HOME":"/usr/lib/jvm/java-1.8.0"
            },
            "Configurations":[]
         }
      ]
   },
   {
      "Classification":"hive-site",
      "Properties":{
         "hive.server2.enable.doAs":"true",
         "hive.security.metastore.authorization.auth.reads": "false",
         "hive.compute.query.using.stats": "true"
      },
      "Configurations":[]
   },
   {
      "Classification": "capacity-scheduler",
      "Properties": {
         "yarn.scheduler.capacity.root.default.default-node-label-expression": "CORE",
         "yarn.scheduler.capacity.root.immuta_spark.default-node-label-expression": "CORE",
         "yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity": "30",
         "yarn.scheduler.capacity.root.queues": "default,immuta_spark",
         "yarn.scheduler.capacity.root.immuta_spark.accessible-node-labels.CORE.capacity": "70",
         "yarn.scheduler.capacity.root.immuta_spark.maximum-applications": "100",
         "yarn.scheduler.capacity.root.immuta_spark.maximum-am-resource-percent": "0.1",
         "yarn.scheduler.capacity.root.immuta_spark.capacity": "0",
         "yarn.scheduler.capacity.root.default.capacity": "100"
      },
      "Configurations": []
   }
]

Immuta Bootstrap Configuration File

Next, create a file called bootstrap_actions.json to configure the Immuta bootstrap action. If you have any additional bootstrap actions to run outside of Immuta, they should be added here as well.

[
  {
    "Path": "s3://$BOOTSTRAP_BUCKET/immuta-bootstrap",
    "Args": [
        "--immuta-instance-url=https://immuta.mycompany.com",
        "--immuta-secret-name=immuta-emr-secret",
        "--immuta-user-name=immuta_emr",
        "--immuta-bootstrap-archive=s3://$BOOTSTRAP_BUCKET/immuta_bootstrap.tar.gz",
        "--immuta-software-bundle=s3://$BOOTSTRAP_BUCKET/immuta_bundle.tar.gz",
        "--immuta-install-script=s3://$BOOTSTRAP_BUCKET/install.sh",
        "--kerberos",
        "--kerberos-secret-name immuta-kerberos-secret"
    ],
    "Name": "Immuta Bootstrap"
  }
]

(Optional) Kerberos Attributes Configuration File

If you wish to deploy a kerberized cluster, create a kerberos_attributes.json file with your desired Kerberos configurations. Note that although Kerberos is not strictly required, a cluster without Kerberos should not be considered secure for production.

{
  "Realm": "EC2.INTERNAL",
  "KdcAdminPassword": "secret"
}

Security Configuration

You will need to create a security configuration before creating the EMR cluster so that Immuta's EMRFS integration can leverage the IAM role you created to access data in S3.

First, create a security_configuration.json file with your desired security settings. A basic example with a cluster-dedicated KDC for Kerberos is shown below. Note that you are allowing the following system users to use the data IAM role: hadoop, hive, and immuta_emr. Data Owners must also have access to this data to use the Immuta Query Engine. This example grants access to any user in the fictional data_owners group. See the official AWS Documentation for more details on configuring IAM roles for EMRFS.

{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": {
        "TicketLifetimeInHours": 24
      }
    }
  },
  "AuthorizationConfiguration": {
    "EmrFsConfiguration": {
      "RoleMappings": [
        {
          "Role": "arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
          "IdentifierType": "User",
          "Identifiers": ["hadoop","hive","immuta_emr"]
        },
        {
          "Role": "arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
          "IdentifierType": "Group",
          "Identifiers": ["data_owners"]
        }
      ]
    }
  }
}

Next, create your security configuration with the following command.

aws emr create-security-configuration \
    --name immuta_emr_security_configuration \
    --security-configuration file://security_configuration.json

Create EMR Cluster Command

Finally, you can now spin up an EMR cluster with Immuta's security plugins.

aws emr create-cluster \
    --name immuta-emr \
    --release-label emr-5.28.0 \
    --configurations file://cluster_configuration.json \
    --ec2-attributes file://ec2_attributes.json \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions file://bootstrap_actions.json \
    --kerberos-attributes file://kerberos_attributes.json \
    --security-configuration immuta_emr_security_configuration \
    --service-role EMR_DefaultRole
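
The create-cluster command returns as soon as the request is accepted, not when the cluster is ready. A hedged sketch for blocking until the cluster is up is shown below. It relies on the AWS CLI's emr wait cluster-running waiter and on describe-cluster. The function name and the AWS_CLI override (set it to echo for a dry run) are illustrative, and the cluster ID is whatever create-cluster printed.

```shell
# Sketch: wait for an EMR cluster to come up, then print its state.
# Assumption: cluster_id is the ClusterId returned by "aws emr create-cluster".
# AWS_CLI is a hypothetical override; AWS_CLI=echo prints the commands instead of running them.
wait_for_cluster() {
    local cluster_id="$1"
    local cli="${AWS_CLI:-aws}"
    # The waiter polls until the cluster is up and exits non-zero if it terminates instead.
    "$cli" emr wait cluster-running --cluster-id "$cluster_id" &&
    "$cli" emr describe-cluster --cluster-id "$cluster_id" \
        --query 'Cluster.Status.State' --output text
}
```

Because EMR runs bootstrap actions before a cluster enters its running state, a successful return here is a reasonable signal that the Immuta bootstrap action has completed.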

Remove Secrets

To protect the Immuta user's AWS credentials and the kadmin password (if using Kerberos), it is recommended that you overwrite the secret values that were created during the cluster deployment process. If you leave the secret values in AWS Secrets Manager, cluster users may be able to assume the instance role of the EMR nodes and read these values.

It is safe to remove these values after the cluster has finished bootstrapping. The example below overwrites the relevant secrets with null values.

aws secretsmanager put-secret-value \
    --secret-id immuta-emr-secret \
    --secret-string null
aws secretsmanager put-secret-value \
    --secret-id immuta-kerberos-secret \
    --secret-string null

Note that if you are using an external KDC without a cross-realm trust (no KDC on the cluster), you should put the kadmin password back into the immuta-kerberos-secret. This is required to clean up the Immuta service principals that will have been created on the external KDC.

Quickstart

Audience: System Administrators
Content Summary: This simple deployment guide familiarizes users with Immuta on EMR. It is only meant for deploying clusters for non-production purposes, such as demos or proofs of concept. For more robust deployments, please see the main installation guide for Immuta on EMR.

Deprecation notice

Support for this integration has been deprecated.

Installation Prerequisites

AWS Resources

AWS CLI (v1.16.x or greater) installed in a bash environment.
- The CLI should be configured to use a role that is able to fully manage EMR, IAM, and S3 resources. This can be a user role in a local environment or an instance role on an EC2 instance.
Resource IDs for your chosen AWS VPC subnet and EMR-managed security groups.
- Be sure that your master and worker security groups are configured for bi-directional communication with your Immuta instance.

Immuta Resources

An instance of Immuta that is reachable from your chosen AWS VPC.
A username and password for the Immuta archives site. You can get these from your Immuta support professional.

Run the Immuta EMR Quickstart Script

First, download the quickstart script:

Next, run the script. Note that you will be prompted for input variables. If a variable is not required, you can press enter to use the displayed default value.

See below for an example of the script being run and prompting for variables. Note that any input in the example is simply for demonstration purposes; you will need to provide your own values.

$ ./immuta-emr-quickstart.sh

* Enter Cluster Name [immuta-quickstart]:

* Enter EMR Version [Default: 5.23.0]:

* Enter Immuta Version [Default: 2024.1.13_20240624]:

* Enter Immuta Instance URL [REQUIRED]: https://immuta.mycompany.com

* Enter AWS Region [us-east-1]:

* Enter Instance Count [3]:

* Enter Instance Type [m5.xlarge]:

* Enter AWS Key Name for SSH [REQUIRED]: my-aws-key

* Enter AWS Subnet ID [REQUIRED]: subnet-xxxxxx

* Enter EMR Service Managed Security Group ID [REQUIRED]: sg-xxxxxxxxxxxx

* Enter EMR Master Node Managed Security Group ID [REQUIRED]: sg-yyyyyyyyyyy

* Enter EMR Worker Node Managed Security Group ID [REQUIRED]: sg-zzzzzzzzzzz

* Enter Immuta Archive Username [REQUIRED]: abjgksdthghjksgslkjaghsdfsj

* Enter Immuta Archive Password [REQUIRED]: gjw4a8906y423432r93hf3f03rhfqfq470ty3

* Enter Bootstrap Bucket Name. If the bucket does not exist, it will be created with default permissions [immuta-emr-bootstrap-<account id>-us-east-1]:

* Enter Data Bucket Name. If the bucket does not exist, it will be created with default permissions [immuta-emr-data-<account id>-us-east-1]:

* Enter Kerberos Admin Password [Default: <generated>]:

* Enter HDFS System Token [Default: <generated>]:

< Cluster creation begins>
...

Input Variables

The immuta-emr-quickstart.sh script will prompt the user for input variables to configure the AWS resources required for the cluster. These variables are represented by the environment variables listed below. Exporting these environment variables prior to running the script will skip the prompts.

CLUSTER_NAME
- Optional. The name of the EMR cluster to be created.
- Default: immuta-quickstart.
EMR_VERSION
- Optional. The EMR version of the cluster. Current supported versions are 5.17.0 - 5.23.0.
- Default: 5.23.0.
IMMUTA_VERSION
- Optional. The full Immuta version to be installed on the cluster.
- Default: 2024.1.13_20240624.
IMMUTA_INSTANCE_URL
- Required. The URL of the Immuta instance that will drive policies on the cluster.
AWS_REGION
- Optional. The AWS Region that the cluster will run in.
- Default: us-east-1.
INSTANCE_COUNT
- Optional. The number of instances (master + worker) in the cluster.
- Default: 3.
INSTANCE_TYPE
- Optional. The type of instance for cluster nodes.
- Default: m5.xlarge.
AWS_KEY_NAME
- Required. The name of the SSH keypair in AWS that will be used to connect to the cluster.
AWS_SUBNET_ID
- Required. The ID string of the subnet that the cluster will run in.
SERVICE_SECURITY_GROUP
- Required. The ID string of the security group for the cluster's EMR services.
MASTER_SECURITY_GROUP
- Required. The ID string of the security group for the cluster's master node.
WORKER_SECURITY_GROUP
- Required. The ID string of the security group for the cluster's worker nodes.
ARCHIVE_USERNAME
- Required. The username for Immuta Archives.
ARCHIVE_PASSWORD
- Required. The password for Immuta Archives.
BOOTSTRAP_BUCKET
- Optional. The S3 bucket where bootstrap artifacts will be stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
- Default: immuta-emr-bootstrap-$AWS_ACCOUNT_ID-$AWS_REGION.
DATA_BUCKET
- Optional. The S3 bucket where partitioned data is stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
- Default: immuta-emr-data-$AWS_ACCOUNT_ID-$AWS_REGION.
KADMIN_PASSWORD
- Optional. The Kerberos admin password that will be used to create Kerberos principals on the cluster's dedicated internal KDC.
- Default: random.
HDFS_SYSTEM_TOKEN
- Optional. The HDFS System Token that the cluster will use to securely communicate with the Immuta instance. You should generate this value in the Immuta Configuration UI before creating your cluster.
- Default: random.
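
To skip the prompts, the variables can be exported before the script is run. The sketch below uses the placeholder values from the example session above, and the require_vars helper is hypothetical (it is not part of the quickstart script). It simply checks that the named variables are set.

```shell
# Placeholder values for illustration only; substitute your own.
export IMMUTA_INSTANCE_URL="https://immuta.mycompany.com"
export AWS_KEY_NAME="my-aws-key"
export AWS_SUBNET_ID="subnet-xxxxxx"
export SERVICE_SECURITY_GROUP="sg-xxxxxxxxxxxx"
export MASTER_SECURITY_GROUP="sg-yyyyyyyyyyy"
export WORKER_SECURITY_GROUP="sg-zzzzzzzzzzz"
# ARCHIVE_USERNAME and ARCHIVE_PASSWORD are also required; export them from a
# secure source rather than hard-coding them in a script.

# Hypothetical helper: returns non-zero if any named variable is unset or empty.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical invocation would be require_vars IMMUTA_INSTANCE_URL AWS_KEY_NAME AWS_SUBNET_ID ARCHIVE_USERNAME ARCHIVE_PASSWORD && ./immuta-emr-quickstart.sh, so the script only starts once the required inputs are present.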

Post-installation

Copy Kerberos Resources to Immuta Instance

You will need to copy the immuta.keytab and krb5.conf files from the cluster and upload them to your Immuta instance using the Immuta Configuration UI.

scp -i my-aws-key.pem hadoop@ip-x-x-x-x.ec2.internal:/etc/krb5.conf .
scp -i my-aws-key.pem hadoop@ip-x-x-x-x.ec2.internal:~/.keytabs/immuta.keytab .

Associate Quickstart Principals with Immuta Users

The quickstart bootstrap automatically seeds the cluster with three user principals for you to use while familiarizing yourself with the Immuta platform and data policies: owner, consumer1, and consumer2. The default Kerberos password for these users is immuta-quickstart.

You can associate these users with your Immuta users by following this guide. Note that only the owner principal will have access to the data in your chosen S3 data bucket, so this is the principal that you should use to create your data sources in Immuta.

Cloudera Hadoop

Audience: System Administrators
Content Summary: Installation of the components necessary for the use of the Immuta Hadoop Integration depends on the version of Hadoop. This section contains guides for installing Cloudera Hadoop.

Section Contents

Prerequisites: Outlines the prerequisites required to successfully use installation components on your CDH cluster.
Cloudera Hadoop Installation Guide
Performance Optimization: Describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.
Run as a Non-Default User: By default, the Immuta Partition servers will run as the immuta user. For clusters configured to use Kerberos, this means that you must have an immuta principal available for Cloudera Manager to provision the service. If for some reason you do not have an immuta principal available, you can change the user that the Immuta partition servers run as. This page describes the configuration changes that are needed to change the principal(s) that Immuta uses.
Log Analysis: Details how to use the immuta_hdfs_log_analyzer tool to troubleshoot slowdowns in your CDH cluster.
Upgrading: Details how to upgrade the Immuta Parcel and Service on your CDH cluster.
Disable or Uninstall: Outlines steps to effectively disable and/or uninstall the Immuta components from your CDH cluster.

Prerequisites

Audience: System Administrators
Content Summary: The Immuta CDH integration installation consists of the following components:
Immuta NameNode plugin
Immuta Hadoop Filesystem plugin
Immuta Spark 1.6 Partition Service (DEPRECATED)
Immuta Spark 2 Partition Service
This page outlines the prerequisites required to successfully use these components on your CDH cluster.

This installation process has been verified to work with the following CDH versions:

5.9.x
5.12.x
5.13.x
5.14.x
5.15.x
5.16.x
6.1.x
6.2.x
6.3.x

Set Up

Before installing Immuta on your CDH cluster, the following steps need to be completed:

Enable HDFS Extended Attributes

Immuta requires that HDFS Extended Attributes are enabled.

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

Enable Access Control Lists

and ensure the checkbox is checked.

Generate an Immuta System API Key

Export Cluster Configuration (Optional but Recommended)

curl -u ${ADMIN_USER}:${ADMIN_PASSWORD} "http://${CM_HOST}/api/v12/clusters/${CLUSTER_NAME}/export" > export.json

Before sending the exported JSON file, it is recommended to look over the configurations and redact any information that you consider too sensitive to share externally. Cloudera Manager will automatically redact known passwords; however, there may be sensitive values embedded in your configuration that Cloudera Manager does not know about. An example of this may be configuration of a third-party cluster application that requires passwords or API keys in its cluster configuration.

Download the Immuta Parcel and CSD Artifacts

Needed Artifacts

Begin by downloading the Immuta Parcel and CSD for your Cloudera Distribution. A complete installation will require 3 files:

IMMUTA-<VERSION>_<DATESTAMP>-<CDH_VERSION>-spark2-public-<LINUX_DISTRIBUTION>.parcel
- The .parcel file is the Immuta CDH parcel.
- For versions that support it, Spark 1 is included in this parcel.
IMMUTA-<VERSION>_<DATESTAMP>-<CDH_VERSION>-spark2-public-<LINUX_DISTRIBUTION>.parcel.sha
- The .parcel.sha file contains a SHA1 hash of the Immuta .parcel file for integrity verification by Cloudera Manager.
IMMUTA-<VERSION>_<DATESTAMP>-<CDH_VERSION>-spark2-public.jar
- The .jar file is the Custom Service Descriptor (CSD) for the Immuta service in Cloudera Manager.

The variables above are defined as:

<VERSION> is like "2024.1.13"
<DATESTAMP> is the compiled date in the format "YYYYMMDD"
<CDH_VERSION> must match your CDH version, like "5.16.2"
<LINUX_DISTRIBUTION> is either "el7" or "el6".
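
The naming scheme and the integrity check can be sketched in shell as shown below. This is a minimal illustration, not Immuta or Cloudera tooling. The function names are hypothetical, sha1sum is assumed to be available (as it is on typical Linux hosts), and the .parcel.sha file is assumed to hold the lowercase hex digest as its first whitespace-separated field.

```shell
# Sketch: build the expected parcel file name from its four components.
parcel_name() {
    # Arguments: VERSION DATESTAMP CDH_VERSION LINUX_DISTRIBUTION
    printf 'IMMUTA-%s_%s-%s-spark2-public-%s.parcel\n' "$1" "$2" "$3" "$4"
}

# Sketch: compare a parcel's SHA1 digest with the one stored in <parcel>.sha.
# Returns 0 on a match and non-zero otherwise.
verify_parcel() {
    local parcel="$1" expected actual
    # Take only the first field in case the .sha file also lists a file name.
    expected="$(awk '{print $1; exit}' "${parcel}.sha")" || return 1
    actual="$(sha1sum "$parcel" | awk '{print $1}')" || return 1
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

For instance, parcel_name 2024.1.13 20240624 5.16.2 el7 gives the file name to look for on the archives site, and verify_parcel on the downloaded file offers a quick check before it is handed to Cloudera Manager.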

The Immuta Archives Site

All artifacts are divided up by subdirectories in the form of [Immuta Release]/[CDH Version].

Installation

Audience: System Administrators
Content Summary: The Immuta CDH integration installation consists of the following components:
Immuta NameNode plugin
Immuta Hadoop Filesystem plugin
Immuta Spark 2 Vulcan service
This page outlines the installation steps required to successfully deploy these components on your CDH cluster.
Prerequisites: Follow the Prerequisites guide to prepare for installation.

Installation

Begin installation by transferring the Immuta .parcel and its associated .parcel.sha files to your Cloudera Manager node and placing them in /opt/cloudera/parcel-repo. Once copied, ensure that both the owner and the group of the files are set to cloudera-scm.

chown -R cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo

Next, transfer the Immuta CSD (.jar file) to /opt/cloudera/csd, and ensure both its owner and group permissions are set to cloudera-scm as well.

chown -R cloudera-scm:cloudera-scm /opt/cloudera/csd

You will need to restart the Cloudera Manager server in order for the CSD to be picked up. Use whichever of the following commands matches your init system (systemd or SysV init):

systemctl restart cloudera-scm-server

service cloudera-scm-server restart

Follow Cloudera's instructions for distributing and activating the IMMUTA parcel.

Once the parcel has been successfully activated, you can add the IMMUTA service:

From Cloudera Manager, select Add Service.
Choose Immuta.
Click Continue.
Select the nodes to install the services on. Your options are:
- For maximum redundancy, choose all.
- Choose a single node.
- Choose a few nodes. Set up a Load Balancer in front of the instances to distribute load. Contact Immuta support for more details.
Proceed to the end of the workflow.

Configure HDFS

After adding the Immuta service to your CDH cluster, there is some configuration that needs to be completed.

NameNode-Only Configuration

Warning

The following settings should only be written to the configuration on the NameNode. Setting these values on DataNodes will have security implications, so be sure that they are set in the NameNode only section of Cloudera Manager. For optimal performance, only set these configuration options in the NameNode Role Config Group that controls the namespace where Immuta data resides.

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml

and, using "View as XML", add or set values similar to the following:

<property>
    <name>dfs.namenode.authorization.provider.class</name>
    <value>com.immuta.hadoop.ImmutaAuthorizationProvider</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.fallback.class</name>
    <value>org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.allow.fallback</name>
    <value>false</value>
    <final>true</final>
</property>
<property>
    <name>immuta.system.api.key</name>
    <value>0ec28d3f-a8a2-4960-b653-d7ccfe4803b3</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.users.to.ignore</name>
    <value>hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,livy,immuta</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.paths.to.enforce</name>
    <value>*</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.source.cache.enabled</name>
    <value>false</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.
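
Before pasting a snippet into a safety valve, you can check this recommendation mechanically. The sketch below is illustrative only: the helper name non_final_properties is hypothetical (it is not part of Immuta or Cloudera Manager), and it assumes the snippet is a bare sequence of <property> elements like the one shown above.

```python
import xml.etree.ElementTree as ET
from typing import List


def non_final_properties(snippet: str) -> List[str]:
    """Return the names of properties in a snippet that are not marked final."""
    # A safety-valve snippet has no single root element, so wrap it first.
    root = ET.fromstring(f"<configuration>{snippet}</configuration>")
    missing = []
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        final = (prop.findtext("final") or "").strip().lower()
        if final != "true":
            missing.append(name)
    return missing


# Hypothetical input: the second property is missing its <final> element.
example = """
<property>
    <name>immuta.permission.allow.fallback</name>
    <value>false</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.paths.to.enforce</name>
    <value>*</value>
</property>
"""
print(non_final_properties(example))
```

An empty list means every property in the snippet carries <final>true</final>.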

Shared Configuration

The following configuration items should be configured for both the NameNode processes and the DataNode processes. These configurations are used both by the Immuta FileSystem and the Immuta NameNode plugin. For example:

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.base.url</name>
    <value>https://immuta.hostname</value>
    <final>true</final>
</property>
<property>
    <name>immuta.spark.partition.generator.user</name>
    <value>immuta</value>
    <final>true</final>
</property>
<property>
    <name>immuta.credentials.dir</name>
    <value>/user</value>
    <final>true</final>
</property>
<property>
    <name>immuta.visibility.cache.timeout.seconds</name>
    <value>600</value>
    <final>true</final>
</property>
<property>
    <name>fs.immuta.impl</name>
    <value>com.immuta.hadoop.ImmutaFileSystem</value>
    <final>true</final>
</property>
<property>
    <name>hadoop.proxyuser.immuta.hosts</name>
    <value>*</value>
    <final>true</final>
</property>
<property>
    <name>hadoop.proxyuser.immuta.users</name>
    <value>*</value>
    <final>true</final>
</property>
<property>
    <name>hadoop.proxyuser.immuta.groups</name>
    <value>*</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Make sure that user directories underneath immuta.credentials.dir are readable only by the owner of the directory. If the user's directory doesn't exist and we create it, we will set the permissions to 700.
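
To audit existing directories, one option is to scan a directory listing for entries that grant access to group or other. The sketch below assumes listing text in the format printed by hdfs dfs -ls (permission string first, path last); the sample listing and the helper names are made up for illustration, and paths containing spaces are not handled.

```python
from typing import List


def is_owner_only(permission: str) -> bool:
    """True if an ls-style permission string grants nothing to group or other."""
    # The string looks like "drwx------": one type character, then nine mode bits.
    if len(permission) < 10:
        raise ValueError(f"unexpected permission string: {permission!r}")
    return permission[4:10] == "------"


def exposed_directories(ls_output: str) -> List[str]:
    """Return directory paths in the listing that group or other can access."""
    exposed = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip the "Found N items" header and anything that is not a directory.
        if len(fields) < 8 or not fields[0].startswith("d"):
            continue
        if not is_owner_only(fields[0]):
            exposed.append(fields[-1])
    return exposed


# Hypothetical listing of the directory configured as immuta.credentials.dir.
sample = """Found 2 items
drwx------   - alice supergroup          0 2020-02-03 02:00 /user/alice
drwxr-xr-x   - bob   supergroup          0 2020-02-03 02:00 /user/bob
"""
print(exposed_directories(sample))
```

Any path this reports is readable by someone other than its owner and should be tightened to 700.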

Enable TLS for the Immuta Vulcan Service

You can enable TLS on the Immuta Vulcan service by configuring it to use a keystore in JKS format.

Server-side TLS Configuration

Under the Immuta service of Cloudera Manager, Configuration tab, search for key:

Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.secure.partition.generator.keystore</name>
    <value>/etc/immuta/keystore.jks</value>
    <final>true</final>
</property>
<property>
    <name>immuta.secure.partition.generator.keystore.password</name>
    <value>secure_password</value>
    <final>true</final>
</property>
<property>
    <name>immuta.secure.partition.generator.keymanager.password</name>
    <value>secure_password</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Detailed Explanation:

immuta.secure.partition.generator.keystore
- Specifies the path to the Immuta Vulcan service keystore.
- Example: /etc/immuta/keystore.jks
immuta.secure.partition.generator.keystore.password
- Specifies the password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file.
- Example: secure_password
immuta.secure.partition.generator.keymanager.password
- Specifies the KeyManager password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file. This is not always necessary.
- Example: secure_password

Best Practice: Secure Keystore with File Permissions

Immuta recommends using file permissions to secure the keystore from improper access:

chown immuta:immuta /etc/immuta/keystore.jks
chmod 600 /etc/immuta/keystore.jks
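
To confirm that the permissions took effect, you can inspect the file mode. This is a minimal sketch for POSIX hosts; keystore_mode_ok is a hypothetical helper name, and the demonstration uses a temporary file rather than the real keystore.

```python
import os
import stat
import tempfile


def keystore_mode_ok(path: str) -> bool:
    """True if the file has exactly mode 600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


# Demonstration against a throwaway file instead of /etc/immuta/keystore.jks.
with tempfile.NamedTemporaryFile() as tmp:
    os.chmod(tmp.name, 0o644)
    too_open = keystore_mode_ok(tmp.name)
    os.chmod(tmp.name, 0o600)
    locked_down = keystore_mode_ok(tmp.name)

print(too_open, locked_down)
```

On the cluster, you would call keystore_mode_ok("/etc/immuta/keystore.jks") as a user that can stat the file.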

Client-side TLS Configuration

You must also set the following properties under the following client sections:

For Spark 2, under the Immuta service of Cloudera Manager, Configuration tab, search for key:

Immuta Client Advanced Configuration Snippet (Safety Valve) for immuta-conf/session/generator.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.secure.partition.generator.keystore</name>
    <value>true</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Detailed Explanation:

immuta.secure.partition.generator.keystore
- Set to true to enable TLS
- Default: true

Impala Configuration

You must give the service principal that the Immuta Web Service is configured to use permission to delegate in Impala. To accomplish this, add the Immuta Web Service principal to authorized_proxy_user_config in the Impala daemon command line arguments.

Under the Impala service of Cloudera Manager, Configuration tab, search for key:

Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve)

and add/set the value(s) similar to:

-authorized_proxy_user_config=<IMMUTA_SERVICE_PRINCIPAL>=*

If the authorized_proxy_user_config parameter is already present for other services, append the Immuta configuration value to the end:

-authorized_proxy_user_config=hue=*;<IMMUTA_SERVICE_PRINCIPAL>=*
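
The append rule can be sketched as a small string transformation. The helper below is hypothetical and only models the semicolon-delimited format shown above; the principal name immuta is a stand-in for <IMMUTA_SERVICE_PRINCIPAL>.

```python
PREFIX = "-authorized_proxy_user_config="


def add_proxy_user(flag: str, principal: str) -> str:
    """Return the flag with `<principal>=*` appended if it is not already present."""
    if not flag:
        # No existing flag: create it with the single Immuta entry.
        return f"{PREFIX}{principal}=*"
    if not flag.startswith(PREFIX):
        raise ValueError("not an authorized_proxy_user_config flag")
    entries = [e for e in flag[len(PREFIX):].split(";") if e]
    if not any(e.split("=", 1)[0] == principal for e in entries):
        entries.append(f"{principal}=*")
    return PREFIX + ";".join(entries)


print(add_proxy_user("-authorized_proxy_user_config=hue=*", "immuta"))
```

Existing entries are preserved in order, and calling the helper twice does not add a duplicate entry.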

Spark 2 Configuration

No additional configuration is required.

Note: Immuta will work with any Spark 2 version you may have already installed on your cluster.

Immuta Vulcan Service Configuration

The Immuta Vulcan service requires the same system API key that is configured for the Immuta NameNode plugin. Be sure that the value of immuta.system.api.key is consistent across your configuration.

For Spark 2, under the IMMUTA service of Cloudera Manager, Configuration section, search for key:

Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.system.api.key</name>
    <value>0ec28d3f-a8a2-4960-b653-d7ccfe4803b3</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.
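
Because the same key must appear in more than one place, a quick comparison of the two snippets can catch a mismatch before the services are restarted. The following is a sketch under the assumption that you have both snippets available as text; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

KEY_NAME = "immuta.system.api.key"


def property_value(snippet: str, name: str) -> Optional[str]:
    """Return the value of the named property in a snippet, or None if absent."""
    root = ET.fromstring(f"<configuration>{snippet}</configuration>")
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def api_keys_match(hdfs_site: str, generator_xml: str) -> bool:
    """True if both snippets define immuta.system.api.key with the same value."""
    namenode_key = property_value(hdfs_site, KEY_NAME)
    vulcan_key = property_value(generator_xml, KEY_NAME)
    return namenode_key is not None and namenode_key == vulcan_key


# Both snippets reuse the example key from this page.
snippet = """
<property>
    <name>immuta.system.api.key</name>
    <value>0ec28d3f-a8a2-4960-b653-d7ccfe4803b3</value>
    <final>true</final>
</property>
"""
print(api_keys_match(snippet, snippet))
```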

Immuta Web Service Configuration

This configuration is generally unnecessary, because the same values can be set through the Application Settings of the Web UI. If an Immuta representative recommends it, the example YAML snippet below can be used as an alternative to the Immuta Configuration UI.

client:
    kerberosRealm: YOURCOMPANY.COM
plugins:
    hdfsHandler:
        hdfsSystemToken: 0ec28d3f-a8a2-4960-b653-d7ccfe4803b3
kerberos:
    ticketRefreshInterval: 43200000
    username: immuta
    keyTabPath: /etc/immuta/immuta.keytab
    krbConfigPath: /etc/krb5.conf
    krbBinPath: /usr/bin/

Detailed Explanation:

client
- kerberosRealm
  - Specifies the default realm to use for Kerberos authentication.
  - Example: YOURCOMPANY.COM
plugins
- hdfsHandler
  - hdfsSystemToken
    - Token used by the NameNode plugin to authenticate with the Immuta REST API. This must equal the value set in immuta.system.api.key. Use the value of HDFS_SYSTEM_TOKEN generated earlier.
    - Example: 0ec28d3f-a8a2-4960-b653-d7ccfe4803b3
kerberos
- ticketRefreshInterval
  - Time in milliseconds to wait between kinit executions. This should be lower than the ticket refresh interval required by the Kerberos server.
  - Default: 43200000
- username
  - User principal used for kinit.
  - Default: immuta
- keyTabPath
  - The path to the keytab file on disk to be used for kinit.
  - Default: /etc/immuta/immuta.keytab
- krbConfigPath
  - The path to the krb5 configuration file on disk.
  - Default: /etc/krb5.conf
- krbBinPath
  - The path to the Kerberos installation binary directory.
  - Default: /usr/bin/
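
The default ticketRefreshInterval is easier to reason about in hours. The sketch below converts it and compares it with a ticket lifetime; the 24-hour lifetime is an assumed example value, not something this guide specifies, and refresh_is_safe is a hypothetical helper.

```python
from datetime import timedelta

DEFAULT_REFRESH_MS = 43200000  # default ticketRefreshInterval from the snippet above


def refresh_is_safe(refresh_ms: int, ticket_lifetime: timedelta) -> bool:
    """True if kinit runs more often than the assumed ticket lifetime."""
    return timedelta(milliseconds=refresh_ms) < ticket_lifetime


interval = timedelta(milliseconds=DEFAULT_REFRESH_MS)
print(interval)  # 12:00:00
print(refresh_is_safe(DEFAULT_REFRESH_MS, timedelta(hours=24)))
```

The default therefore corresponds to one kinit every 12 hours.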

Performance Optimization

Audience: System Administrators
Content Summary: This page describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.

Overview

Immuta operates within a locked operation in the NameNode when granting / denying permissions based on Immuta policies. This section contains configuration and strategies to prevent RPC queue latency, threads waiting, or other issues on cluster-wide file permission checks.

Deployment Architecture

Isolated HDFS Namespace

Best Practice: NameNode Plugin Configuration

Immuta recommends only configuring the NameNode Plugin to check permissions on the NameNode(s) that oversee the data that you want to protect.

For example, say that you currently have a federated HDFS NameNode architecture with three Nameservices - nameservice1, nameservice2, and nameservice3. The HDFS federation in this example is distributed across these nameservices as described below.

nameservice1: /data, /tmp, /user
nameservice2: /data2
nameservice3: /data3

Suppose you know that all the sensitive data that you want to protect with Immuta is located under /data3. To achieve optimum performance in this case, you can go ahead and add the Immuta NameNode-only configuration (hdfs-site.xml) to the role config group for nameservice3, and leave it out of nameservice1 and nameservice2. The public / client Immuta configuration (core-site.xml) should still be configured cluster-wide. See Immuta CDH Integration Installation for more details about these configuration groupings.
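
The reasoning in this example can be sketched as a longest-prefix lookup from protected paths to nameservices. This is only a model of the example layout above, not how HDFS federation resolves paths internally; the function names and the sample path /data3/sales are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set

# Mount layout taken from the example federation above.
MOUNTS: Dict[str, str] = {
    "/data": "nameservice1",
    "/tmp": "nameservice1",
    "/user": "nameservice1",
    "/data2": "nameservice2",
    "/data3": "nameservice3",
}


def nameservice_for(path: str, mounts: Dict[str, str] = MOUNTS) -> Optional[str]:
    """Return the nameservice whose mount point is the longest prefix of path."""
    best = None
    for mount in mounts:
        # Match whole path components so that /data does not claim /data3.
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    return mounts[best] if best is not None else None


def nameservices_needing_plugin(protected: Iterable[str]) -> Set[str]:
    """Return the nameservices that oversee at least one protected path."""
    found = {nameservice_for(p) for p in protected}
    found.discard(None)
    return found


print(nameservices_needing_plugin(["/data3/sales"]))
```

Only the nameservices returned here would receive the NameNode-only configuration.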

One caveat to take into consideration here is that Immuta's Vulcan service requires the Immuta NameNode Plugin to oversee user credentials that are stored in /user/<username> by default. Vulcan also stores some configuration under /user/immuta by default. This is a problem because /user resides under nameservice1, and the goal is to only operate the Immuta NameNode Plugin on nameservice3.

A simple solution to this problem is to create a new directory for these credentials, /data3/immuta_creds for example, and configure the NameNode Plugin and the Vulcan service to use this directory instead of /user. Changing this requires the configuration modifications listed below.

HDFS - Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
- Set immuta.generated.api.key.dir and immuta.credentials.dir to /data3/immuta_creds.
Immuta - Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml
- Set immuta.meta.store.token.dir to /data3/immuta_creds/immuta/tokens.
- Set immuta.meta.store.remote.token.dir to /data3/immuta_creds/immuta/remotetokens.
- Set immuta.configuration.id.file.config to hdfs://nameservice3/data3/immuta_creds/immuta/config_id.

Note that you will need to manually create the /data3/immuta_creds/immuta directory and set the permissions such that only the immuta user can read / write in that directory. The /data3/immuta_creds directory should also be world writable to allow user directories to be created the first time that they interact with Immuta on the cluster.
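
Since all five values derive from one base directory, it can help to generate them together so they stay consistent. The sketch below reproduces the values listed above for /data3/immuta_creds on nameservice3; relocated_settings is a hypothetical helper, not an Immuta tool.

```python
import posixpath
from typing import Dict


def relocated_settings(base_dir: str, nameservice: str) -> Dict[str, str]:
    """Derive the credential-related settings for a relocated base directory."""
    immuta_dir = posixpath.join(base_dir, "immuta")
    return {
        # core-site.xml (cluster-wide)
        "immuta.generated.api.key.dir": base_dir,
        "immuta.credentials.dir": base_dir,
        # session/generator.xml (Vulcan server)
        "immuta.meta.store.token.dir": posixpath.join(immuta_dir, "tokens"),
        "immuta.meta.store.remote.token.dir": posixpath.join(immuta_dir, "remotetokens"),
        "immuta.configuration.id.file.config": f"hdfs://{nameservice}{immuta_dir}/config_id",
    }


settings = relocated_settings("/data3/immuta_creds", "nameservice3")
for name, value in settings.items():
    print(f"{name} = {value}")
```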

Configuration

Essential Performance Tuning Settings

immuta.permission.paths.to.enforce
- Description: A comma delimited list of paths to enforce when checking permissions on HDFS files. This ensures that API calls to the Immuta web service are only made when permissions are being checked on the paths that you specify in this configuration. This also means that you can only create data sources against data that lives under these paths, and the Immuta Workspace must be under one of these paths as well. Alternatively, immuta.permission.paths.to.ignore can be set to a list of paths that you know do not contain Immuta data - then API calls will never be made against those paths. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce properties at the same time is unsupported.
immuta.permission.groups.to.enforce
- Description: A comma delimited list of groups that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, then fallback authorizations will apply to everyone by default, unless they are in a group on this list. If a user is on both the enforce list and the ignore list, then their permissions will be checked with Immuta (i.e., the enforce configuration item takes precedence). This may improve NameNode performance by only making permission check API calls for the subset of users who fall under Immuta enforcement.
immuta.permission.source.cache.enabled
- Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources. For performance optimization, it is best to enable this cache to act as a "backup" to immuta.permission.paths.to.enforce.
immuta.permission.source.cache.timeout.seconds
- Description: The time between calls to sync/cache all paths that back Immuta data sources in HDFS. You can increase this value to further reduce the number of API calls made from the NameNode.
immuta.permission.workspace.base.path.override
- Description: This configuration item can be set so that the NameNode does not have to retrieve the Immuta HDFS workspace base path periodically from the Immuta API.
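
The interaction between the enforce and ignore path lists can be pictured with a small decision function. This is a conceptual model only, not the plugin's actual logic: it assumes that * means all paths (as in the example hdfs-site.xml value), that every path is checked when neither list is set, and it rejects setting both lists because that combination is unsupported.

```python
from typing import List, Optional


def parse_paths(value: Optional[str]) -> List[str]:
    """Split a comma-delimited configuration value into clean path entries."""
    return [p.strip() for p in (value or "").split(",") if p.strip()]


def is_under(path: str, prefix: str) -> bool:
    """True if path equals prefix or lies beneath it (whole components only)."""
    prefix = prefix.rstrip("/")
    if prefix == "":
        return True  # the prefix was "/"
    return path == prefix or path.startswith(prefix + "/")


def needs_immuta_check(path: str, enforce: Optional[str] = None,
                       ignore: Optional[str] = None) -> bool:
    """Model whether a permission check on path would call the Immuta web service."""
    enforce_paths = parse_paths(enforce)
    ignore_paths = parse_paths(ignore)
    if enforce_paths and ignore_paths:
        raise ValueError("set paths.to.enforce or paths.to.ignore, not both")
    if enforce_paths:
        return any(p == "*" or is_under(path, p) for p in enforce_paths)
    if ignore_paths:
        return not any(is_under(path, p) for p in ignore_paths)
    return True  # assumption: with neither list set, every path is checked


print(needs_immuta_check("/data3/sales/part-0", enforce="/data3"))
print(needs_immuta_check("/tmp/job", enforce="/data3"))
```

Under this model, narrowing the enforce list reduces how many permission checks result in an API call.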

Advanced Cache and Network Settings

There are also a wide variety of cache and network settings that can be used to fine-tune performance. You can refer to the Configuration Guide for details on each of these items.

immuta.permission.source.cache.timeout.seconds
immuta.permission.source.cache.retries
immuta.permission.request.initial.delay.milliseconds
immuta.permission.request.socket.timeout
immuta.no.data.source.cache.timeout.seconds
immuta.hive.impala.cache.timeout.seconds
immuta.canisee.cache.timeout.seconds
immuta.data.source.cache.timeout.seconds
immuta.canisee.metastore.cache.timeout.seconds
immuta.canisee.non.user.cache.timeout.seconds
immuta.canisee.num.retries
immuta.project.user.cache.timeout.seconds
immuta.project.cache.timeout.seconds
immuta.project.forbidden.cache.timeout.seconds
immuta.permission.system.details.retries

Debugging Suspected Performance Issues

See Immuta Log Analysis Tool for CDH Deployments for instructions on how to identify performance issues in the Immuta NameNode Plugin.

Running as a Non-Default User

Audience: System Administrators
Content Summary: By default, the Immuta Partition servers will run as the immuta user. For clusters configured to use Kerberos, this means that you must have an immuta principal available for Cloudera Manager to provision the service. If for some reason you do not have an immuta principal available, you can change the user that the Immuta partition servers run as.
This page describes the configuration changes needed to change the principal(s) that Immuta uses. The same principal can be used for both services, but that is not required. Whichever principals you choose, make sure that every configuration option on each service refers to them consistently.

Partition Server Configuration

The Immuta Spark Partition Servers are components that run on your CDH cluster. The following sections will walk you through configuring the various CDH components so that the Spark Partition Servers can run as a non-default user.

In the configuration for the Immuta service, make the following updates:

System User: Set to the system user that will be running Immuta.
System Group: Set to the primary group of the user that will be running Immuta.
Kerberos Principal: Set to the Kerberos principal of the user that will be running Immuta.

In the configuration for HDFS, make the following updates:

Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
- Set immuta.spark.partition.generator.user to the principal configured as the Kerberos Principal in the Immuta service.

Immuta Web Service

The Immuta Web Service uses the configured Kerberos principal to impersonate users when running queries against various Kerberos-enabled databases. If you are using a non-default Kerberos principal for the Immuta Web Service, be sure to update the following values.

In the configuration for HDFS, enter the following for Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:

hadoop.proxyuser.<immuta service principal>.hosts
- Description: The configuration that specifies the hosts from which the Immuta service principal is allowed to proxy (impersonate) other users. Make sure to enter the appropriate principal in place of <immuta service principal>.
- Value: *
hadoop.proxyuser.<immuta service principal>.users
- Description: The configuration that allows the Immuta service principal to proxy end-users. Make sure to enter the appropriate principal in place of <immuta service principal>.
- Value: *
hadoop.proxyuser.<immuta service principal>.groups
- Description: The configuration that allows the Immuta service principal to proxy user groups. Make sure to enter the appropriate principal in place of <immuta service principal>.
- Value: *

If the principal for the Immuta Web Service is different from the principal used by the Immuta Partition Server, then be sure to add the Web Service principal to immuta.permission.users.to.ignore. In the HDFS configuration section for NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml ensure that the user principal running the Immuta Web Service is included in the comma-separated list of users set for immuta.permission.users.to.ignore.
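
To avoid typos when substituting the principal into three property names, the entries can be generated. The sketch below emits the three proxyuser properties in the same XML shape used elsewhere in this guide; proxyuser_properties is a hypothetical helper and svc_immuta is a made-up principal name.

```python
from xml.sax.saxutils import escape


def proxyuser_properties(principal: str) -> str:
    """Build core-site.xml proxyuser properties for the given principal."""
    blocks = []
    for suffix in ("hosts", "users", "groups"):
        blocks.append(
            "<property>\n"
            f"    <name>hadoop.proxyuser.{escape(principal)}.{suffix}</name>\n"
            "    <value>*</value>\n"
            "    <final>true</final>\n"
            "</property>"
        )
    return "\n".join(blocks)


xml_snippet = proxyuser_properties("svc_immuta")
print(xml_snippet)
```

The output can be pasted into the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml using "View as XML".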

Upgrade Cloudera Hadoop

Audience: System Administrators
Content Summary: This page details how to upgrade the Immuta Parcel and Service on your CDH cluster.
Prerequisites: Follow the Immuta CDH Integration Prerequisites to prepare for upgrading.

Upgrade the Parcel

Transfer the Immuta .parcel and its associated .parcel.sha to your Cloudera Manager node and place them in /opt/cloudera/parcel-repo. Once copied, ensure that both files have owner cloudera-scm and group cloudera-scm.
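
If you want to confirm that the transfer did not corrupt the parcel, you can compare it against its hash file. The sketch below assumes the .parcel.sha file holds the hex SHA-1 digest of the parcel as its first whitespace-separated token; verify that assumption against your files before relying on it. The function names are hypothetical.

```python
import hashlib


def sha1_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parcel_matches_sha(parcel_path: str, sha_path: str) -> bool:
    """True if the parcel's SHA-1 equals the first token in the hash file."""
    with open(sha_path) as fh:
        fields = fh.read().split()
    if not fields:
        return False  # an empty hash file cannot match anything
    return sha1_of(parcel_path) == fields[0].lower()
```

Call parcel_matches_sha with the paths of the two files you copied into /opt/cloudera/parcel-repo.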

Once the Immuta parcel and its SHA (hash) file are in the parcel repo, you can distribute and activate the updated parcel. (Activating the new parcel will automatically deactivate an older version.) To do so,

In Cloudera Manager, select the Parcels icon in the upper right corner.
Click Check for New Parcels.
Make sure the location filter has your on-cluster parcel repo selected.
Locate the IMMUTA parcel, and then find the row corresponding to the version you are upgrading to. Click Distribute.
Wait for the parcel to finish distribution. Once finished, the action button for that row should say Activate.
Click the Activate button to activate the parcel.

You have successfully upgraded your Immuta parcel.

Upgrade the Immuta Partition Service

The first step in upgrading your Immuta Partition Service CSD is copying the .jar file to your Cloudera Manager node, placing it in /opt/cloudera/csd. The file must have ownership cloudera-scm and group cloudera-scm.

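You will need to restart Cloudera Manager in order for the CSD to be picked up. Use the command that matches the init system on your Cloudera Manager host.

On hosts managed by systemd:

systemctl restart cloudera-scm-server

On hosts that use SysV-style init scripts:

service cloudera-scm-server restart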

Finally, restart the IMMUTA service in Cloudera Manager.

Disable/Uninstall Cloudera Hadoop

Audience: System Administrators
Content Summary: This page outlines steps to effectively disable and/or uninstall the Immuta components from your CDH cluster. The disable portions of this document detail how to deactivate the Immuta components without removing the components. For a complete uninstall, follow these steps and then proceed to remove all Immuta-related settings, configuration, and any Immuta Kerberos principals from your cluster.

NameNode

These changes will require a cluster restart

The changes detailed below affect HDFS; therefore, a cluster restart is required to fully implement these changes.

Steps to Disable

The Immuta Authorization Provider must be removed from the NameNode configuration.

Navigate to the Cloudera Manager Overview page.
Click on the HDFS service.
Click on the Configuration tab.

In the search bar, enter

dfs.namenode.authorization.provider.class

Click on the minus [-] sign that appears on the right of the entry corresponding to dfs.namenode.authorization.provider.class. This restores the setting to the CDH default.
Click the Save Changes button at the bottom of the screen.

Steps to Uninstall

Warning

You may have non-default settings that are completely unrelated to Immuta! You may also have non-default settings that are currently related to Immuta that will need to be altered to another non-default custom setting specific to your installation. Your CDH Admins will know which settings this applies to. Do not blanket revert settings to their defaults unless you are certain the CDH defaults are appropriate for your cluster.

To uninstall, instead of only reverting the Immuta Authorization Provider, all Immuta customized settings can be removed from the NameNode configuration.

Navigate to the Cloudera Manager Overview page.
Click on the HDFS service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
All settings under
```
Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
```
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of HDFS (Service-Wide).
All settings under
```
NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
```
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of NameNode Default Group.
Click the Save Changes button at the bottom of the screen.

YARN

If fully uninstalling, Immuta's components need to be removed from YARN's classpath.

These changes will require a cluster restart

The changes detailed below affect YARN; therefore, a cluster restart is required to fully implement these changes.

Steps to Uninstall

Navigate to the YARN service.
Click on the Configuration tab.
In the search bar, enter
```
yarn.application.classpath
```
Click on the minus [-] sign that appears on the right of any entries that reference IMMUTA. For example, there may be records for jars such as immuta-group-mapping.jar or immuta-hadoop-filesystem.jar or similar.
Click the Save Changes button at the bottom of the screen.

Hive

These settings may be applied either system-wide (via core-site.xml) or to specific target systems such as Hive or Impala. Be sure to locate all setting locations.

These changes will require a Hive service restart

The Hive service will need to be restarted for the changes below to take effect.

Steps to Disable

Navigate to the Hive service.
Click on the Configuration tab.

In the search bar, enter

Hive Service Advanced Configuration Snippet (Safety Valve) for core-site.xml

Click on the minus [-] sign that appears to the right of the entry corresponding to hadoop.security.group.mapping. This restores the setting to the CDH default.
Click the Save Changes button at the bottom of the screen.

Steps to Uninstall

Warning

Navigate to the Hive service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
All settings under
```
HiveServer2 Advanced Configuration Snippet (Safety Valve) for core-site.xml
```
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of HiveServer2 Default Group.
Click the Save Changes button at the bottom of the screen.

Impala

These settings may be applied either system-wide (via core-site.xml) or to specific target systems such as Hive or Impala. Be sure to locate all setting locations.

These changes will require an Impala service restart

The Impala service will need to be restarted for the changes below to take effect.

Steps to Disable

Navigate to the Impala service.
Click on the Configuration tab.

In the search bar, enter

Impala Daemon Advanced Configuration Snippet (Safety Valve) for core-site.xml

Click on the minus [-] sign that appears on the right of the entry corresponding to hadoop.security.group.mapping. This restores the setting to the CDH default.
Click the Save Changes button at the bottom of the screen.

Steps to Uninstall

Warning

Navigate to the Impala service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
The "immuta" proxy user from
```
Impala Command Line Argument Advanced Configuration Snippet (Safety Valve)
```
can be removed. Simply delete the "immuta=*" (and any leading or trailing ;) from the -authorized_proxy_user_config= value, leaving any other values in place. It may also be done by clicking the revert arrow icon to the right of Impala (Service-Wide) if the default is appropriate.
All settings under
```
Impala Daemon Advanced Configuration Snippet (Safety Valve) for core-site.xml
```
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of Impala Daemon Default Group.
If using Kerberos principal short names was only done in support of ImmutaGroupsMapping for use in native workspaces, that setting can also be reverted. In the search bar, enter
```
load_auth_to_local_rules
```
Simply uncheck the checkbox to the left of "Impala (Service-Wide)".
Click the Save Changes button at the bottom of the screen.
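
Removing the Immuta entry from the -authorized_proxy_user_config value, including any stray semicolon, can be sketched as follows. The helper is hypothetical and models only the semicolon-delimited format; it returns an empty string when no entries remain, meaning the argument can be dropped entirely.

```python
PREFIX = "-authorized_proxy_user_config="


def remove_proxy_user(flag: str, user: str = "immuta") -> str:
    """Return the flag without the entry for `user`, keeping all other entries."""
    if not flag.startswith(PREFIX):
        raise ValueError("not an authorized_proxy_user_config flag")
    entries = [e for e in flag[len(PREFIX):].split(";") if e]
    kept = [e for e in entries if e.split("=", 1)[0] != user]
    # Rejoining the kept entries avoids leading or trailing semicolons.
    return PREFIX + ";".join(kept) if kept else ""


print(remove_proxy_user("-authorized_proxy_user_config=hue=*;immuta=*"))
```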

Spark

Most of the current Spark controls are now set through the IMMUTA service and will be removed through the subsequent step of stopping and disabling that service. These instructions are primarily for legacy Spark 1.6 installs that may still contain settings from the Spark 1.6 Configuration.

These changes will require a Spark service restart

The Spark service will need to be restarted for the changes below to take effect.

Steps to Uninstall

Navigate to the Spark service.
Click on the Configuration tab.

In the search bar, enter

Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf

Remove any references to IMMUTA or "immuta" in the configuration options. Particularly look for the options defined in Spark 1.6 Configuration.

Then go back to the search bar, and enter

Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh

Remove any references to IMMUTA or "immuta" in the environment variables. Particularly look for the environment settings defined in Spark 1.6 Configuration.
Click the Save Changes button at the bottom of the screen.

Sentry

If your installation leveraged the Immuta HDFS Native Workspace and ImmutaGroupsMapping, Immuta was likely configured as a Sentry admin. When uninstalling, this can be removed.

These changes will require a Sentry service restart

The Sentry service will need to be restarted for the changes below to take effect.

Steps to Uninstall

Warning

Navigate to the Sentry service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
The "immuta" user can be removed from any place specified, but particularly the
```
sentry.service.admin.group
```
should be removed. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon.
Click the Save Changes button at the bottom of the screen.

Immuta Partition/"Vulcan" Service

Steps to Disable

1 - Stop the Immuta Partition/"Vulcan" Service

Navigate to the Cloudera Manager Overview page.
Click on the down arrow next to the IMMUTA service.
Click Stop.
Confirm that you want to stop the service.

2 - Remove the Immuta Service

Navigate to the Cloudera Manager Overview page.
Click on the down arrow next to the IMMUTA service.
Click Delete.
Confirm that you want to delete the service.

Steps to Uninstall

Complete both 1 and 2 in the previous "Disable" section.

3 - Deactivate and Remove the Immuta Parcel

You may need to restart the cluster before you can fully remove these parcels

If the parcel was in active use, a cluster restart is likely needed before Cloudera Manager will let you do the following steps to remove and delete these parcels.

Navigate to the Cloudera Manager Overview page.
Click on the package icon on the top right hand side of the page near the search bar.
Find the "Distributed, Activated" Immuta Parcel(s) and click the Deactivate button.
Click Confirm.
Once deactivated, go back to the Immuta Parcel(s) and select the "down arrow" beside the "Activate" button, and select Remove from Hosts.
Click Confirm.
Once not distributed, go back to the Immuta Parcel(s) and select the "down arrow" beside the "Distribute" button, and select Delete.
Click Delete.

Restart the Cluster

To commit all previous settings, issue a restart of the CDH cluster.

Log Analysis Tool

Audience: System Administrators
Content Summary: This page details how to use the immuta_hdfs_log_analyzer tool to troubleshoot slowdowns in your CDH cluster.

Overview

Sub-optimal configuration of the Immuta HDFS NameNode plugin may cause cluster-wide slowdowns under certain conditions. The NameNode plugin contains a variety of cache settings to limit the number of network calls that occur within the NameNode's locked permission checking operation. If these settings are configured properly, there will be little to no impact on the performance of HDFS operations.

You can use the immuta_hdfs_log_analyzer command-line utility to track the number of API calls coming from the NameNode plugin to the Immuta Web Service.

Usage

You can download the log analysis tool:

It can be invoked like so:

./immuta_hdfs_log_analyzer [-s START_TIME] [-e END_TIME] [-g {MINUTES,HOURS,DAYS}] [-t TIME_FORMAT] <file>

Options

START_TIME (-s, --start-time): Timestamp for the beginning of the period to analyze.
END_TIME (-e, --end-time): Timestamp for the end of the period to analyze.
GRANULARITY (-g, --granularity): Defines time buckets for analysis. Can be MINUTES, HOURS or DAYS.
TIME_FORMAT (-t, --time-format): The format to use for timestamps. This should match the timestamp format in the Immuta Web Service logs.

Output

$ ./immuta_hdfs_log_analyzer \
    -s "2020-02-03T02:00:00.000000Z" \
    -e "2020-02-03T08:00:00.000000Z" \
    -g HOURS \
    immuta.log
2020-02-03T02:00:00.000000Z -- HDFS API Calls: 641, Mean ResponseTime: 8.0 ms, Max ResponseTime: 76 ms
2020-02-03T03:00:00.000000Z -- HDFS API Calls: 368, Mean ResponseTime: 6.0 ms, Max ResponseTime: 79 ms
2020-02-03T04:00:00.000000Z -- HDFS API Calls: 407, Mean ResponseTime: 7.0 ms, Max ResponseTime: 63 ms
2020-02-03T05:00:00.000000Z -- HDFS API Calls: 440, Mean ResponseTime: 8.0 ms, Max ResponseTime: 89 ms
2020-02-03T06:00:00.000000Z -- HDFS API Calls: 491, Mean ResponseTime: 10.0 ms, Max ResponseTime: 70 ms
2020-02-03T07:00:00.000000Z -- HDFS API Calls: 481, Mean ResponseTime: 15.0 ms, Max ResponseTime: 422 ms
2020-02-03T08:00:00.000000Z -- HDFS API Calls: 321, Mean ResponseTime: 6.0 ms, Max ResponseTime: 78 ms
HDFS API Calls: 3149
Other API Calls: 10398

If you are able to correlate time buckets from this tool's output to periods of slow cluster performance, you may need to adjust configuration for the Immuta HDFS NameNode plugin.
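
To make that correlation easier, the per-bucket lines can be filtered for unusually slow responses. The following is a minimal sketch, not part of the Immuta tooling: the function name slow_buckets and the 100 ms limit are illustrative, and the field positions assume the exact line layout shown in the sample output above.

```shell
# Print "<bucket start> <max ms>" for every time bucket whose
# Max ResponseTime exceeds the limit (in ms) passed as $1.
# Reads immuta_hdfs_log_analyzer output on stdin. Only per-bucket lines
# (second field "--") are considered, so the summary lines are skipped.
slow_buckets() {
    awk -v limit="$1" '$2 == "--" && $13 + 0 > limit { print $1, $13 }'
}

# Example using lines copied from the sample output.
slow_buckets 100 <<'EOF'
2020-02-03T06:00:00.000000Z -- HDFS API Calls: 491, Mean ResponseTime: 10.0 ms, Max ResponseTime: 70 ms
2020-02-03T07:00:00.000000Z -- HDFS API Calls: 481, Mean ResponseTime: 15.0 ms, Max ResponseTime: 422 ms
HDFS API Calls: 3149
EOF
```

With the sample lines, only the 07:00 bucket is reported. In practice you would pipe the analyzer's output into the function instead of using a here-document.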

Installation

Audience: System Administrators
Content Summary: The Immuta CDH integration installation consists of the following components:
Immuta NameNode plugin
Immuta Hadoop Filesystem plugin
Immuta Spark 2 Vulcan service
This page outlines the installation steps required to successfully deploy these components on your CDH cluster.
Prerequisites: Complete the installation prerequisites before you begin.

Installation

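First, transfer the Immuta parcel to /opt/cloudera/parcel-repo on the Cloudera Manager host, and ensure its owner and group are set to cloudera-scm.

chown -R cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo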

Next, transfer the Immuta CSD (.jar file) to /opt/cloudera/csd, and ensure both its owner and group permissions are set to cloudera-scm as well.

chown -R cloudera-scm:cloudera-scm /opt/cloudera/csd

You will need to restart the Cloudera Manager server for the CSD to be picked up. On hosts that use systemd:

systemctl restart cloudera-scm-server

On hosts that use SysV init scripts:

service cloudera-scm-server restart

Follow Cloudera's instructions for distributing and activating the IMMUTA parcel.

Once the parcel has been successfully activated, you can add the IMMUTA service:

In Cloudera Manager, select Add Service.
Choose Immuta.
Click Continue.
Select the nodes to install the services on. Your options are:
- For maximum redundancy, choose all.
- Choose a single node.
- Choose a few nodes. Set up a Load Balancer in front of the instances to distribute load. Contact Immuta support for more details.
Proceed to the end of the workflow.

Configure HDFS

After adding the Immuta service to your CDH cluster, there is some configuration that needs to be completed.

If your cluster is configured with Kerberos, note that the default configuration expects to run Immuta services using the immuta principal. If you need to use a different Kerberos principal, see for detailed instructions on how to configure that. After running through these steps, note that you may need to manually run the Create Immuta User Home Directory command from the Actions menu for the Immuta service.

For more details on Immuta's HDFS configuration, please see .

NameNode-Only Configuration

Warning

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>dfs.namenode.authorization.provider.class</name>
    <value>com.immuta.hadoop.ImmutaAuthorizationProvider</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.fallback.class</name>
    <value>org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.allow.fallback</name>
    <value>false</value>
    <final>true</final>
</property>
<property>
    <name>immuta.system.api.key</name>
    <value>0ec28d3f-a8a2-4960-b653-d7ccfe4803b3</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.users.to.ignore</name>
    <value>hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,livy,immuta</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.paths.to.enforce</name>
    <value>*</value>
    <final>true</final>
</property>
<property>
    <name>immuta.permission.source.cache.enabled</name>
    <value>false</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

See for details about each individual configuration value.

Shared Configuration

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.base.url</name>
    <value>https://immuta.hostname</value>
    <final>true</final>
</property>
<property>
    <name>immuta.spark.partition.generator.user</name>
    <value>immuta</value>
    <final>true</final>
</property>
<property>
    <name>immuta.credentials.dir</name>
    <value>/user</value>
    <final>true</final>
</property>
<property>
    <name>immuta.visibility.cache.timeout.seconds</name>
    <value>600</value>
    <final>true</final>
</property>
<property>
    <name>fs.immuta.impl</name>
    <value>com.immuta.hadoop.ImmutaFileSystem</value>
    <final>true</final>
</property>
<property>
    <name>hadoop.proxyuser.immuta.hosts</name>
    <value>*</value>
    <final>true</final>
</property>
<property>
    <name>hadoop.proxyuser.immuta.users</name>
    <value>*</value>
    <final>true</final>
</property>
<property>
    <name>hadoop.proxyuser.immuta.groups</name>
    <value>*</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

See for details about each individual configuration value.

Enable TLS for the Immuta Vulcan Service

You can enable TLS on the Immuta Vulcan service by configuring it to use a keystore in JKS format.

Server-side TLS Configuration

Under the Immuta service of Cloudera Manager, Configuration tab, search for key:

Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.secure.partition.generator.keystore</name>
    <value>/etc/immuta/keystore.jks</value>
    <final>true</final>
</property>
<property>
    <name>immuta.secure.partition.generator.keystore.password</name>
    <value>secure_password</value>
    <final>true</final>
</property>
<property>
    <name>immuta.secure.partition.generator.keymanager.password</name>
    <value>secure_password</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Detailed Explanation:

immuta.secure.partition.generator.keystore
- Specifies the path to the Immuta Vulcan service keystore.
- Example: /etc/immuta/keystore.jks
immuta.secure.partition.generator.keystore.password
- Specifies the password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file.
- Example: secure_password
immuta.secure.partition.generator.keymanager.password
- Specifies the KeyManager password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file. This is not always necessary.
- Example: secure_password

Best Practice: Secure Keystore with File Permissions

Immuta recommends using file permissions to secure the keystore from improper access:

chown immuta:immuta /etc/immuta/keystore.jks
chmod 600 /etc/immuta/keystore.jks
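
After applying these permissions, you may want to confirm that they took effect. The sketch below is one possible check, not an Immuta-provided tool: is_owner_only is a hypothetical helper name, and the demonstration runs against a temporary file rather than the real keystore.

```shell
# Succeed only if the file's mode string is exactly -rw------- (0600).
# ls -l is used instead of stat because stat's flags differ across platforms.
is_owner_only() {
    mode=$(ls -ld "$1" | cut -c1-10)
    [ "$mode" = "-rw-------" ]
}

# Demonstration on a temporary file. On a cluster node you would pass
# /etc/immuta/keystore.jks instead.
demo=$(mktemp)
chmod 600 "$demo"
if is_owner_only "$demo"; then
    echo "permissions OK"
else
    echo "permissions too open"
fi
rm -f "$demo"
```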

Client-side TLS Configuration

You must also set the following properties under the following client sections:

For Spark 2, under the Immuta service of Cloudera Manager, Configuration tab, search for key:

Immuta Client Advanced Configuration Snippet (Safety Valve) for immuta-conf/session/generator.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.secure.partition.generator.keystore</name>
    <value>true</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Detailed Explanation:

immuta.secure.partition.generator.keystore
- Set to true to enable TLS
- Default: true

Impala Configuration

Under the Impala service of Cloudera Manager, Configuration tab, search for key:

Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve)

and add/set the value(s) similar to:

-authorized_proxy_user_config=<IMMUTA_SERVICE_PRINCIPAL>=*

If the authorized_proxy_user_config parameter is already present for other services, append the Immuta configuration value to the end:

-authorized_proxy_user_config=hue=*;<IMMUTA_SERVICE_PRINCIPAL>=*
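
The append rule described above can be sketched as a small helper. This is only an illustration: append_proxy_user is a hypothetical function name, and immuta is used as the principal because it is the default principal mentioned earlier in this guide.

```shell
# Build an authorized_proxy_user_config value.
# $1: the existing value (may be empty), $2: the Immuta service principal.
append_proxy_user() {
    if [ -n "$1" ]; then
        # Existing entries are kept and separated from the new one by ";".
        printf '%s;%s=*\n' "$1" "$2"
    else
        printf '%s=*\n' "$2"
    fi
}

append_proxy_user 'hue=*' immuta   # prints hue=*;immuta=*
append_proxy_user '' immuta        # prints immuta=*
```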

Spark 2 Configuration

No additional configuration is required.

Note: Immuta will work with any Spark 2 version you may have already installed on your cluster.

Immuta Vulcan Service Configuration

For Spark 2, under the IMMUTA service of Cloudera Manager, Configuration section, search for key:

Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml

and, using "View as XML", add/set the value(s) similar to:

<property>
    <name>immuta.system.api.key</name>
    <value>0ec28d3f-a8a2-4960-b653-d7ccfe4803b3</value>
    <final>true</final>
</property>

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Immuta Web Service Configuration

The Immuta Web Service needs to be configured to support the HDFS plugin. You can set this configuration using the .

client:
    kerberosRealm: YOURCOMPANY.COM
plugins:
    hdfsHandler:
        hdfsSystemToken: 0ec28d3f-a8a2-4960-b653-d7ccfe4803b3
kerberos:
    ticketRefreshInterval: 43200000
    username: immuta
    keyTabPath: /etc/immuta/immuta.keytab
    krbConfigPath: /etc/krb5.conf
    krbBinPath: /usr/bin/

Detailed Explanation:

client
- kerberosRealm
  - Specifies the default realm to use for Kerberos authentication.
  - Example: YOURCOMPANY.COM
plugins
- hdfsHandler
  - hdfsSystemToken
    - Token used by the NameNode plugin to authenticate with the Immuta REST API. This must equal the value set in immuta.system.api.key. Use the value of HDFS_SYSTEM_TOKEN generated earlier.
    - Example: 0ec28d3f-a8a2-4960-b653-d7ccfe4803b3
kerberos
- ticketRefreshInterval
  - Time in milliseconds to wait between kinit executions. This should be lower than the ticket refresh interval required by the Kerberos server.
  - Default: 43200000
- username
  - User principal used for kinit.
  - Default: immuta
- keyTabPath
  - The path to the keytab file on disk to be used for kinit.
  - Default: /etc/immuta/immuta.keytab
- krbConfigPath
  - The path to the krb5 configuration file on disk.
  - Default: /etc/krb5.conf
- krbBinPath
  - The path to the Kerberos installation binary directory.
  - Default: /usr/bin/
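
Because ticketRefreshInterval is expressed in milliseconds, it can help to derive the value from hours when choosing a setting shorter than your Kerberos server's ticket lifetime. A minimal sketch, with hours_to_ms as an illustrative helper name:

```shell
# Convert a whole number of hours to milliseconds.
hours_to_ms() {
    echo $(( $1 * 60 * 60 * 1000 ))
}

hours_to_ms 12   # prints 43200000, the default ticketRefreshInterval
```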

Additionally, you must upload a keytab for the immuta user as well as a krb5.conf configuration file to the Immuta Web Service. This can also be done via the .

Amazon EMR

Audience: System Administrators
Content Summary: This tutorial will guide you through the process of spinning up an Amazon Elastic Map Reduce cluster with Immuta's Hadoop and Spark security plugins installed.

Deprecation notice

Support for this integration has been deprecated.

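Introduction

This tutorial contains examples using the AWS CLI. These examples are conceptual in nature and will require modification to adapt to your exact deployment needs. If you wish to quickly familiarize yourself with Immuta's EMR integration, please visit the Quickstart Installation Guide for Immuta on AWS EMR.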

Supported EMR Versions

This deployment is tested and known to work on the EMR releases listed below.

5.17.0
5.18.0
5.19.0
5.20.0
5.21.0
5.22.0
5.23.0
5.24.0
5.25.0
5.26.0
5.27.0
5.28.0
5.29.0
5.30.0
5.31.0
5.32.0

Create Prerequisite AWS Resources

In addition to the EMR cluster itself, Immuta requires a handful of additional AWS resources in order to function properly.

Immuta Bootstrap Bucket

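In order to bootstrap the EMR cluster with Immuta's software bundle and startup scripts, you will need to create an S3 bucket to hold these artifacts.

In this guide, the bucket is referenced by the placeholder $BOOTSTRAP_BUCKET. Replace this placeholder with a unique bucket name of your choosing. The bucket must contain all artifacts listed below. These artifacts can be found at Immuta Downloads.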

s3://$BOOTSTRAP_BUCKET/immuta-bootstrap
s3://$BOOTSTRAP_BUCKET/immuta-bootstrap.tar.gz
s3://$BOOTSTRAP_BUCKET/immuta_bundle-$IMMUTA_VERSION.tar.gz
s3://$BOOTSTRAP_BUCKET/install.sh

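Immuta Data IAM Role

Immuta's Spark integration relies on an IAM role policy that has access to the S3 buckets where your sensitive data is stored. Note that the EC2 Instance Roles for your EMR cluster should not have access to these buckets. Immuta will broker access to the data in these buckets to authorized users.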

Create Immuta Data IAM Policy

Modify the JSON data below to include the correct name of your data bucket(s), and save as immuta_data_iam_policy.json.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        }
    ]
}
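
If you keep the policy above as a template, the bucket placeholders can be filled in with sed rather than by hand. The sketch below is one way to do that under stated assumptions: render_data_policy and the bucket names are illustrative, and the template is expected to use the literal placeholders $DATA_BUCKET_1 and $DATA_BUCKET_2 exactly as shown.

```shell
# Replace the $DATA_BUCKET_1 and $DATA_BUCKET_2 placeholders read from stdin.
# $1 and $2 are the real bucket names. S3 bucket names cannot contain "|"
# or "&", so "|" is a safe sed delimiter here.
render_data_policy() {
    sed -e 's|\$DATA_BUCKET_1|'"$1"'|g' \
        -e 's|\$DATA_BUCKET_2|'"$2"'|g'
}

# Example on two lines taken from the policy's Resource list.
render_data_policy my-data-bucket-a my-data-bucket-b <<'EOF'
"arn:aws:s3:::$DATA_BUCKET_1",
"arn:aws:s3:::$DATA_BUCKET_2/*"
EOF
```

To produce the real file, redirect a saved copy of the template into the function and write the result to immuta_data_iam_policy.json.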

If you are leveraging Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::$WORKSPACE_BUCKET",
                "arn:aws:s3:::$WORKSPACE_BUCKET/*"
            ]
        }
    ]
}

Now you can run the following command to create the Immuta data IAM policy.

aws iam create-policy \
    --policy-name immuta_emr_data_policy \
    --policy-document file://immuta_data_iam_policy.json

Create Immuta Data IAM Role

Create a file called immuta_data_role_trust_policy_generic.json as seen below.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/EMR_EC2_DefaultRole"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

Now you can create the data IAM role with the trust policy from above.

aws iam create-role \
  --role-name immuta_emr_data_role \
  --assume-role-policy-document "file://immuta_data_role_trust_policy_generic.json"

Next you will need to attach the IAM policy that allows access to your protected data in S3.

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/immuta_emr_data_policy \
    --role-name immuta_emr_data_role
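
The policy ARN used above is built from your AWS account ID and the policy name. The following sketch shows that composition. The helper name policy_arn is illustrative and 123456789012 is a stand-in account ID, not a real value.

```shell
# Build the ARN of a customer-managed IAM policy.
# $1: AWS account ID, $2: policy name.
policy_arn() {
    printf 'arn:aws:iam::%s:policy/%s\n' "$1" "$2"
}

# On a workstation with the AWS CLI configured, the account ID can be read with:
#   aws sts get-caller-identity --query Account --output text
AWS_ACCOUNT_ID=123456789012
policy_arn "$AWS_ACCOUNT_ID" immuta_emr_data_policy
```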

Create Immuta Instance IAM Policy

Modify the JSON data below to include the correct name of your bootstrap bucket, and save as immuta_emr_instance_policy.json.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "ec2:Describe*",
                "elasticmapreduce:Describe*",
                "elasticmapreduce:ListBootstrapActions",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListSteps"
            ]
        },
        {
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:*:$AWS_ACCOUNT_ID:AWS-ElasticMapReduce-*",
            "Action": [
                "sqs:CreateQueue",
                "sqs:DeleteQueue",
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:PurgeQueue",
                "sqs:ReceiveMessage"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Object"
            ],
            "Resource": [
                "arn:aws:s3:::$BOOTSTRAP_BUCKET/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::$BOOTSTRAP_BUCKET"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "secretsmanager:*",
            "Resource": [
                "arn:aws:secretsmanager:$AWS_REGION:$AWS_ACCOUNT_ID:secret:immuta-emr-secret-??????",
                "arn:aws:secretsmanager:$AWS_REGION:$AWS_ACCOUNT_ID:secret:immuta-kerberos-secret-??????"
            ]
        }
    ]
}

After creating the immuta_emr_instance_policy.json file from above, run the following command to create the Immuta EMR Instance policy.

aws iam create-policy \
    --policy-name immuta_emr_instance_policy \
    --policy-document file://immuta_emr_instance_policy.json

Create Immuta Instance IAM Role

Create a file called instance_role_trust_policy.json as shown below.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
            "Service": "ec2.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

Now you can create the instance role with the policy document from above.

aws iam create-role \
  --role-name immuta_emr_instance_role \
  --assume-role-policy-document "file://instance_role_trust_policy.json"

Next you will need to attach the IAM policy that allows access to required resources for your cluster.

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/immuta_emr_instance_policy \
    --role-name immuta_emr_instance_role

Create Immuta EMR Instance Profile

After creating the role and policy for the Immuta instances, you can create the Immuta EC2 Instance Profile.

aws iam create-instance-profile \
    --instance-profile-name immuta_emr_instance_profile

After creating the Instance Profile, you can attach the newly created Role.

aws iam add-role-to-instance-profile \
    --instance-profile-name immuta_emr_instance_profile \
    --role-name immuta_emr_instance_role

Update Immuta Data IAM Role Trust Policy

Now that both the data and instance IAM roles are created, you can update the trust policy of the data IAM role to include the instance role.

Create a file called data_role_trust_policy.json as shown below.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_instance_role"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

Now you can update the trust policy of the data IAM role.

aws iam update-assume-role-policy \
  --role-name immuta_emr_data_role \
  --policy-document "file://data_role_trust_policy.json"

Immuta HDFS System Token in AWS Secrets Manager

Navigate to the App Settings page and generate an Immuta HDFS System Token. Copy the value generated by Immuta, and create a new secret in AWS Secrets Manager as shown below.

aws secretsmanager create-secret \
    --name immuta-emr-secret \
    --secret-string "$HDFS_SYSTEM_TOKEN"

Create EMR Cluster

EC2 Attributes Configuration File

Complete the JSON template below and save as ec2_attributes.json. You may remove keys where you would like to use default values.

When choosing security groups for your master and worker nodes, be sure that they provide bi-directional access between the nodes and your Immuta instance.

{
  "ServiceAccessSecurityGroup": "string",
  "AvailabilityZone": "string",
  "AdditionalSlaveSecurityGroups": ["string", ...],
  "EmrManagedMasterSecurityGroup": "string",
  "KeyName": "<the name of your SSH public key stored in AWS>",
  "InstanceProfile": "immuta_emr_instance_profile",
  "SubnetId": "string",
  "AdditionalMasterSecurityGroups": ["string", ...],
  "AvailabilityZones": ["string", ...],
  "EmrManagedSlaveSecurityGroup": "string"
}

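Cluster Configuration File

Save the JSON below as cluster_configuration.json, replacing example values such as immuta.base.url and immuta.cluster.name with the values for your environment.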

[
   {
      "Classification":"hdfs-site",
      "Properties":{
         "dfs.namenode.inode.attributes.provider.class":"com.immuta.hadoop.ImmutaInodeAttributeProvider",
         "dfs.namenode.acls.enabled":"true",
         "immuta.extra.name.node.plugin.config":"file:///opt/immuta/hadoop/name-node-conf.xml"
      },
      "Configurations":[]
   },
   {
      "Classification":"emrfs-site",
      "Properties":{
         "fs.s3.customAWSCredentialsProvider":"com.immuta.emr.ImmutaEMRAWSCredentialsProvider"
      },
      "Configurations":[]
   },
   {
      "Classification":"core-site",
      "Properties":{
         "immuta.permission.users.to.ignore":"hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,hadoop",
         "fs.immuta.impl":"com.immuta.hadoop.ImmutaFileSystem",
         "hadoop.proxyuser.immuta_emr.groups":"*",
         "hadoop.proxyuser.immuta_emr.users":"*",
         "hadoop.proxyuser.immuta_emr.hosts":"*",
         "hadoop.proxyuser.immuta.groups":"*",
         "hadoop.proxyuser.immuta.users":"*",
         "hadoop.proxyuser.immuta.hosts":"*",
         "immuta.cluster.name":"my_cluster",
         "immuta.spark.partition.generator.user":"immuta_emr",
         "immuta.credentials.dir":"/user",
         "immuta.base.url":"https://immuta.mycompany.com"
      },
      "Configurations":[]
   },
   {
      "Classification":"hadoop-env",
      "Properties":{},
      "Configurations":[
         {
            "Classification":"export",
            "Properties":{
               "HADOOP_CLASSPATH":"$HADOOP_CLASSPATH:/opt/immuta/hadoop/lib/immuta-inode-attribute-provider.jar:/opt/immuta/hadoop/lib/immuta-hadoop-filesystem.jar:/opt/immuta/hadoop/lib/immuta-emrfs-credential-provider.jar",
               "JAVA_HOME":"/usr/lib/jvm/java-1.8.0"
            },
            "Configurations":[]
         }
      ]
   },
   {
      "Classification":"hive-site",
      "Properties":{
         "hive.server2.enable.doAs":"true",
         "hive.security.metastore.authorization.auth.reads": "false",
         "hive.compute.query.using.stats": "true"
      },
      "Configurations":[]
   },
   {
      "Classification": "capacity-scheduler",
      "Properties": {
         "yarn.scheduler.capacity.root.default.default-node-label-expression": "CORE",
         "yarn.scheduler.capacity.root.immuta_spark.default-node-label-expression": "CORE",
         "yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity": "30",
         "yarn.scheduler.capacity.root.queues": "default,immuta_spark",
         "yarn.scheduler.capacity.root.immuta_spark.accessible-node-labels.CORE.capacity": "70",
         "yarn.scheduler.capacity.root.immuta_spark.maximum-applications": "100",
         "yarn.scheduler.capacity.root.immuta_spark.maximum-am-resource-percent": "0.1",
         "yarn.scheduler.capacity.root.immuta_spark.capacity": "0",
         "yarn.scheduler.capacity.root.default.capacity": "100"
      },
      "Configurations": []
   }
]

Immuta Bootstrap Configuration File

Next, create a file called bootstrap_actions.json to configure the Immuta bootstrap action. If you have any additional bootstrap actions to run outside of Immuta, they should be added here as well.

[
  {
    "Path": "s3://$BOOTSTRAP_BUCKET/immuta-bootstrap",
    "Args": [
        "--immuta-instance-url=https://immuta.mycompany.com",
        "--immuta-secret-name=immuta-emr-secret",
        "--immuta-user-name=immuta_emr",
        "--immuta-bootstrap-archive=s3://$BOOTSTRAP_BUCKET/immuta_bootstrap.tar.gz",
        "--immuta-software-bundle=s3://$BOOTSTRAP_BUCKET/immuta_bundle.tar.gz",
        "--immuta-install-script=s3://$BOOTSTRAP_BUCKET/install.sh",
        "--kerberos",
        "--kerberos-secret-name immuta-kerberos-secret"
    ],
    "Name": "Immuta Bootstrap"
  }
]

(Optional) Kerberos Attributes Configuration File

If your cluster uses Kerberos, save the JSON below as kerberos_attributes.json.

{
  "Realm": "EC2.INTERNAL",
  "KdcAdminPassword": "secret"
}

Security Configuration

You will need to create a security configuration before creating the EMR cluster so that Immuta's EMRFS integration can leverage the IAM role you created to access data in S3. Save the JSON below as security_configuration.json.

{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": {
        "TicketLifetimeInHours": 24
      }
    }
  },
  "AuthorizationConfiguration": {
    "EmrFsConfiguration": {
      "RoleMappings": [
        {
          "Role": "arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
          "IdentifierType": "User",
          "Identifiers": ["hadoop","hive","immuta_emr"]
        },
        {
          "Role": "arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
          "IdentifierType": "Group",
          "Identifiers": ["data_owners"]
        }
      ]
    }
  }
}

Next, create your security configuration with the following command.

aws emr create-security-configuration \
    --name immuta_emr_security_configuration \
    --security-configuration file://security_configuration.json

Create EMR Cluster Command

Finally, you can now spin up an EMR cluster with Immuta's security plugins.

aws emr create-cluster \
    --name immuta-emr \
    --release-label emr-5.28.0 \
    --configurations file://cluster_configuration.json \
    --ec2-attributes file://ec2_attributes.json \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions file://bootstrap_actions.json \
    --kerberos-attributes file://kerberos_attributes.json \
    --security-configuration immuta_emr_security_configuration \
    --service-role EMR_DefaultRole

Remove Secrets

It is safe to remove the secret values stored in AWS Secrets Manager after the cluster has finished bootstrapping. The example below overwrites the relevant secrets with null values.

aws secretsmanager put-secret-value \
    --secret-id immuta-emr-secret \
    --secret-binary null
aws secretsmanager put-secret-value \
    --secret-id immuta-kerberos-secret \
    --secret-string null