Native Workspace Configuration for EMR

Audience: System Administrators

Content Summary: This page describes how to configure Native Workspaces for Immuta-enabled EMR clusters. The Native S3 Workspace requires an Immuta-bootstrapped EMR cluster. For more information about EMR deployments, please see the main installation guide.

Overview

This workspace allows native access to data on the cluster without going through the Immuta SparkSession or the Immuta Query Engine.

Accessing Data

Users can access the directory and database created for the workspace only when acting under the project. The Immuta Spark SQL Session applies policies to the data, so any data written to the workspace already complies with the restrictions of the equalized project, in which all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the Immuta Spark SQL Session to copy data into the workspace.

Workspace Configuration Options:

  • EMR HDFS
  • EMR S3

Available Data Source Types:

  • Amazon S3 (EMR S3)

Immuta App Settings

The native workspace must be enabled from the App Settings page.

IAM Role Configuration

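Immuta integrates with EMRFS to control access to sensitive data stored in S3. To configure this, you must create an IAM Role Policy for Immuta as described in the main EMR Installation Guide. To leverage Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder. In the example policy below, the first statement grants read-only access to the data buckets and the second grants full control of the workspace bucket. Replace the $DATA_BUCKET_1, $DATA_BUCKET_2, and $WORKSPACE_BUCKET placeholders with your own bucket names.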

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::$WORKSPACE_BUCKET",
                "arn:aws:s3:::$WORKSPACE_BUCKET/*"
            ]
        }
    ]
}
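If you manage several clusters, it can help to generate this policy instead of editing it by hand. The following is a minimal Python sketch, not part of Immuta or AWS tooling: the function name `build_immuta_data_policy` and the example bucket names are hypothetical, and the output simply mirrors the structure of the example policy above.

```python
import json


def build_immuta_data_policy(data_buckets, workspace_bucket):
    """Build a policy document shaped like the example above.

    data_buckets: bucket names that need read-only access.
    workspace_bucket: bucket that the Immuta data IAM role fully controls.
    Both arguments are illustrative; substitute your own bucket names.
    """
    # Bucket-level ARNs first, then object-level ARNs, as in the example.
    read_resources = [f"arn:aws:s3:::{bucket}" for bucket in data_buckets]
    read_resources += [f"arn:aws:s3:::{bucket}/*" for bucket in data_buckets]

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only access to the data buckets.
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:Head*", "s3:List*"],
                "Resource": read_resources,
            },
            {
                # Full control of the workspace bucket and its objects.
                "Effect": "Allow",
                "Action": ["s3:*"],
                "Resource": [
                    f"arn:aws:s3:::{workspace_bucket}",
                    f"arn:aws:s3:::{workspace_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    policy = build_immuta_data_policy(
        ["example-data-1", "example-data-2"], "example-workspace"
    )
    print(json.dumps(policy, indent=4))
```

The printed JSON can then be reviewed and attached to the Immuta data IAM role by whatever process you already use to manage IAM policies.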

Hive Configuration

To let users query workspace data natively through Hive, you need to set additional properties in hive-site so that Hive has access to the Immuta System API Key.

For maximum security in production deployments, you should store the System API Key in a JCEKS file that Hive can access. Set the location of this file in immuta.hadoop.security.credential.provider.path, as in the following example.

[
   {
      "Classification": "hive-site",
      "Properties": {
         "hive.server2.enable.doAs": "true",
         "hive.security.metastore.authorization.auth.reads": "false",
         "hive.compute.query.using.stats": "true",
         "immuta.hadoop.security.credential.provider.path": "/home/hive/immuta_provider.jceks"
      },
      "Configurations": []
   }
]
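The same configuration can be produced programmatically, which may be convenient when the JCEKS path differs between environments. The following Python sketch is an illustration only: the function names and the output file name are hypothetical, and the property values are copied from the example above.

```python
import json


def build_hive_site_configuration(credential_provider_path):
    """Return an EMR configuration list with the hive-site settings above.

    credential_provider_path: location of the JCEKS file that holds the
    Immuta System API Key (for example, /home/hive/immuta_provider.jceks).
    """
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                # Property values are strings, as in the example above.
                "hive.server2.enable.doAs": "true",
                "hive.security.metastore.authorization.auth.reads": "false",
                "hive.compute.query.using.stats": "true",
                "immuta.hadoop.security.credential.provider.path": credential_provider_path,
            },
            "Configurations": [],
        }
    ]


def write_configuration(path, configuration):
    """Write the configuration list to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(configuration, handle, indent=3)
        handle.write("\n")


if __name__ == "__main__":
    config = build_hive_site_configuration("/home/hive/immuta_provider.jceks")
    # Hypothetical output file name.
    write_configuration("hive-site-configuration.json", config)
```

A file written this way could, for instance, be supplied through the `--configurations` option of `aws emr create-cluster`, alongside any other classifications your deployment requires.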

Create a Cloudera or EMR Workspace

  1. Navigate to the Policies tab and enable Project Equalization by switching the Project Equalization slider to on.
  2. Scroll to the Native Workspace section and click Create.
  3. Select the Cloudera or EMR Workspace Configuration from the dropdown menu.
  4. Select the Cluster Name from the subsequent dropdown menu.
  5. Optionally, edit the Workspace Directory field or add a Hive Connection (if available).
  6. Click Create to enable the workspace.