Native Workspace Configuration for EMR
Audience: System Administrators
Content Summary: This page describes how to configure Native Workspaces for Immuta-enabled EMR clusters. The Native S3 Workspace requires an Immuta-bootstrapped EMR cluster. For more information about EMR deployments, please see the main installation guide.
Overview
This workspace allows native access to data on the cluster without going through the Immuta SparkSession or Immuta Query Engine.
Accessing Data
Users can access only the directory and database created for the workspace when acting under the project. The Immuta Spark SQL Session applies policies to the data, so any data written to the workspace is already compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the Immuta Spark SQL Session to copy data into the workspace.
Workspace Configuration Options:
- EMR HDFS
- EMR S3
Available Data Source Types:
- Amazon S3 (EMR S3)
Immuta App Settings
The native workspace must be enabled from the App Settings page.
IAM Role Configuration
Immuta integrates with EMRFS to control access to sensitive data stored in S3. To configure this, you must create an IAM Role Policy for Immuta as described in the main EMR Installation Guide. To leverage Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:Head*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::$DATA_BUCKET_1",
        "arn:aws:s3:::$DATA_BUCKET_2",
        "arn:aws:s3:::$DATA_BUCKET_1/*",
        "arn:aws:s3:::$DATA_BUCKET_2/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::$WORKSPACE_BUCKET",
        "arn:aws:s3:::$WORKSPACE_BUCKET/*"
      ]
    }
  ]
}
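If you manage several clusters, it may help to generate this policy document rather than edit it by hand. The following is a minimal Python sketch, not part of Immuta or AWS tooling: the function name `build_immuta_data_policy` and the example bucket names are hypothetical, and the sketch assumes the workspace is a whole bucket (for a workspace folder, you would narrow the second statement's resources to that prefix).

```python
import json


def build_immuta_data_policy(data_buckets, workspace_bucket):
    """Build the policy shown above as a dict (hypothetical helper).

    Data buckets get read-only actions; the workspace bucket gets s3:*.
    """
    if not data_buckets:
        raise ValueError("at least one data bucket is required")

    # Each data bucket needs two ARNs: the bucket itself and the objects in it.
    bucket_arns = [f"arn:aws:s3:::{name}" for name in data_buckets]
    object_arns = [f"arn:aws:s3:::{name}/*" for name in data_buckets]

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:Head*", "s3:List*"],
                "Resource": bucket_arns + object_arns,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:*"],
                "Resource": [
                    f"arn:aws:s3:::{workspace_bucket}",
                    f"arn:aws:s3:::{workspace_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Example bucket names are placeholders, not real resources.
    policy = build_immuta_data_policy(
        ["example-data-1", "example-data-2"], "example-workspace"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can then be reviewed and attached to the Immuta data IAM role through whatever process you already use for IAM changes.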
Hive Configuration
For users to query workspace data natively via Hive, you need to set additional configuration in hive-site so that Hive has access to the Immuta System API Key.
For maximum security in production deployments, you should store the System API Key in a JCEKS file for Hive to access. The location of this file should be set in immuta.hadoop.security.credential.provider.path.
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.server2.enable.doAs": "true",
      "hive.security.metastore.authorization.auth.reads": "false",
      "hive.compute.query.using.stats": "true",
      "immuta.hadoop.security.credential.provider.path": "/home/hive/immuta_provider.jceks"
    },
    "Configurations": []
  }
]
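The hive-site classification above can also be produced programmatically, which keeps the JCEKS location in one place if it differs between environments. This is a minimal Python sketch under stated assumptions: the function name `build_hive_site_config` is hypothetical, and the default path is simply the one used in the example above.

```python
import json


def build_hive_site_config(jceks_path="/home/hive/immuta_provider.jceks"):
    """Return the hive-site classification shown above (hypothetical helper).

    Only the JCEKS location is parameterized; the other properties are
    copied unchanged from the example configuration.
    """
    if not jceks_path:
        raise ValueError("jceks_path must not be empty")

    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.server2.enable.doAs": "true",
                "hive.security.metastore.authorization.auth.reads": "false",
                "hive.compute.query.using.stats": "true",
                "immuta.hadoop.security.credential.provider.path": jceks_path,
            },
            "Configurations": [],
        }
    ]


if __name__ == "__main__":
    # Write the JSON to stdout so it can be saved to a configurations file.
    print(json.dumps(build_hive_site_config(), indent=2))
```

The output is ordinary EMR configuration JSON, so it can be supplied wherever you already provide cluster configurations at creation time.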
Create a Cloudera or EMR Workspace
- Navigate to the Policies tab and enable Project Equalization by clicking the Project Equalization slider to on.
- Scroll to the Native Workspace section and click Create.
- Select the Cloudera or EMR Workspace Configuration from the dropdown menu.
- Select the Cluster Name from the subsequent dropdown menu.
- Opt to edit the Workspace Directory field or add a Hive Connection (if available).
- Click Create to enable the workspace.