Getting Started With EMR and Immuta
What Immuta Provides
Immuta provides fine-grained access control for Spark workloads on EMR backed by S3, which means you can redact rows of data and mask columns or cells using several anonymization techniques, all on the fly and without making new copies of the data. This enables multi-tenancy on your clusters and data-centric policy enforcement, rather than managing entitlements to individual EMR clusters and maintaining anonymized copies of your data in S3.
Understanding This Guide
How to Configure EMR for Immuta: This section describes the steps required to configure the Immuta plugin on your EMR cluster.
Creating Policies on Data in S3: To protect data in S3 at a granular level, Immuta requires a schema, typically provided by Hive and optionally backed by AWS Glue. This section outlines those steps.
Accessing Protected Data: Once data is protected, how do you use it? This section outlines those steps.
How to Configure EMR for Immuta
To enforce fine-grained controls on S3 data in EMR, the Immuta plugin must be installed on the EMR cluster and configured to communicate with Immuta, which determines the policies to enforce.
You will also need to configure the EMR cluster so that no user other than the Immuta user (an IAM role) has access to the S3 data. This blocks direct access to S3 unless it is granted through Immuta policies. Note that your EMR cluster must be kerberized; otherwise users can sudo to other users and potentially bypass any protections provided by Immuta and/or IAM roles.
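As an illustration, locking S3 down to the Immuta role alone can be done with a bucket policy along these lines. This is a minimal sketch, not Immuta's prescribed configuration; the account ID, role name, and bucket name are hypothetical placeholders for your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllExceptImmutaRole",
      "Effect": "Deny",
      "NotPrincipal": {
        "AWS": "arn:aws:iam::123456789012:role/immuta-emr-role"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-protected-bucket",
        "arn:aws:s3:::my-protected-bucket/*"
      ]
    }
  ]
}
```

With a policy like this in place, all S3 requests that do not come from the Immuta role are denied, so users can only reach the data through Immuta-enforced policies.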
You can quickly deploy and explore Immuta on EMR using our Quickstart Installation Guide for Immuta on AWS EMR.
Complete technical steps can be found in our EMR Installation Guide.
Creating Policies on Data in S3
To enforce fine-grained controls on data, Immuta must be provided a schema. This is done through Hive tables, which can optionally be backed by AWS Glue.
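For example, a Hive external table over files in S3 might be defined as follows. The database, column, and bucket names here are hypothetical, and your file format and layout will differ:

```sql
-- Hypothetical Hive external table over CSV data in S3.
CREATE EXTERNAL TABLE customer_data (
  customer_id BIGINT,
  name        STRING,
  email       STRING,
  state       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-protected-bucket/customer_data/';
```

The table itself stores no data; it supplies the schema Immuta needs to enforce row- and column-level policies on the underlying S3 objects.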
Once a Hive table is created, it can be exposed in Immuta to enforce policies and grant access to the S3 data backing it. Note that Immuta supports ephemeral clusters: although you expose the table using a specific cluster IP, it will continue to work with new clusters in the future.
Details on exposing Hive tables in Immuta can be found here.
Details on building policies can be found here.
Accessing Protected Data
Now that you have a schema and policies defined, you can execute Spark jobs on the cluster against the S3 data.
First, you must associate the Hadoop principal with the Immuta user account, which lets Immuta know which policies to enforce for the user logged into the EMR cluster. Details on how to configure that can be found here.
Once a principal is assigned, the user can begin using spark-submit, spark-shell, or pyspark to work with the protected data. Details on that workflow can be found here.
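A typical session on the cluster might look like the following sketch. The principal, script, and table names are hypothetical; the key point is that users authenticate as their kerberized principal and then use standard Spark tooling, with policies applied transparently on read:

```shell
# Authenticate as your kerberized user (hypothetical principal).
kinit alice@EXAMPLE.COM

# Submit a batch job, or work interactively with spark-shell / pyspark.
spark-submit my_job.py
```

And inside a PySpark job or shell, protected tables are read like any other table:

```python
# Hypothetical PySpark job; the table name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("immuta-example").getOrCreate()

# Row redaction and column/cell masking are applied before
# the data reaches your job; no code changes are required.
df = spark.table("customer_data")
df.show()
```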