Deploying Immuta on Google Dataproc

Audience: System Administrators

Content Summary: This tutorial guides you through spinning up a Google Dataproc cluster with Immuta's Hadoop and Spark security plugins installed. It contains examples that use the Google Cloud CLI; these examples are conceptual in nature and will require modification to fit your exact deployment needs.

Supported Dataproc Versions

Immuta currently supports Dataproc 1.3 and 1.4.

Kerberos Considerations

Immuta should only be installed on kerberized Hadoop clusters. The examples in this guide use a cluster-dedicated KDC for the sake of simplicity. For production deployments, you will likely want to establish a cross-realm trust with an external KDC or Active Directory; for detailed instructions, see the official Dataproc documentation.

Create Prerequisite GCP Resources

In addition to the Dataproc cluster itself, Immuta requires a handful of additional GCP resources in order to function properly.

Immuta Artifacts Bucket

In order to initialize the Dataproc cluster with Immuta's software bundle and startup scripts, you will need to create a Google Storage Bucket to hold these artifacts.

In this guide, the bucket is referenced by the placeholder $IMMUTA_ARTIFACTS_BUCKET; substitute a unique bucket name of your choosing. The bucket must contain all of the artifacts listed below, which can be found at Immuta Downloads.

gs://$IMMUTA_ARTIFACTS_BUCKET/immuta_bundle-$IMMUTA_VERSION.tar.gz
gs://$IMMUTA_ARTIFACTS_BUCKET/immuta_initialization_actions.sh
gs://$IMMUTA_ARTIFACTS_BUCKET/immuta_kerberos_actions.sh
gs://$IMMUTA_ARTIFACTS_BUCKET/immuta_startup_script.sh
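
A minimal sketch of creating the bucket and uploading the artifacts with the gsutil CLI follows. The bucket name and region are placeholders; adjust the local file names to match the artifacts you downloaded.

# Create the artifacts bucket (placeholder name and region)
gsutil mb -l us-east1 gs://$IMMUTA_ARTIFACTS_BUCKET

# Upload the artifacts downloaded from Immuta Downloads
gsutil cp immuta_bundle-$IMMUTA_VERSION.tar.gz immuta_*.sh gs://$IMMUTA_ARTIFACTS_BUCKET/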

Immuta Cluster Service Account

To initialize your Dataproc cluster and install the Immuta software, you will need a service account with a role that has read access to all objects in your $IMMUTA_ARTIFACTS_BUCKET, as well as the permissions listed below.

cloudkms.cryptoKeyVersions.get
cloudkms.cryptoKeyVersions.useToDecrypt
cloudkms.cryptoKeys.get
dataproc.agents.create
dataproc.agents.delete
dataproc.agents.get
dataproc.agents.list
dataproc.agents.update
dataproc.tasks.lease
dataproc.tasks.listInvalidatedLeases
dataproc.tasks.reportStatus
logging.logEntries.create
monitoring.metricDescriptors.create
monitoring.metricDescriptors.get
monitoring.metricDescriptors.list
monitoring.monitoredResourceDescriptors.get
monitoring.monitoredResourceDescriptors.list
monitoring.timeSeries.create
resourcemanager.projects.get
storage.buckets.get
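
A minimal sketch of creating this service account with the gcloud CLI is shown below. It assumes the placeholder project my-project and the account name used in the example environment file later in this guide; the custom role's permission list is abbreviated and should contain the full list above.

# Create the cluster service account (name matches the example env file below)
gcloud iam service-accounts create immuta-dataproc-limited \
    --display-name "Immuta Dataproc cluster service account"

# Create a custom role with the permissions listed above (abbreviated here)
gcloud iam roles create immutaDataprocCluster \
    --project my-project \
    --permissions dataproc.agents.create,dataproc.agents.get,dataproc.tasks.lease,logging.logEntries.create,storage.buckets.get

# Bind the custom role to the service account
gcloud projects add-iam-policy-binding my-project \
    --member serviceAccount:immuta-dataproc-limited@my-project.iam.gserviceaccount.com \
    --role projects/my-project/roles/immutaDataprocCluster

# Grant read access to all objects in the artifacts bucket
gsutil iam ch \
    serviceAccount:immuta-dataproc-limited@my-project.iam.gserviceaccount.com:objectViewer \
    gs://$IMMUTA_ARTIFACTS_BUCKET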

Immuta Data Service Account

Immuta's Spark integration relies on a service account that has a role with read access to the Google Storage buckets where your sensitive data is stored. Note that the cluster service account should not have access to these buckets. Immuta will broker access to the data in these buckets to authorized users.

To grant access to data, you should modify the access control lists (ACLs) of your data buckets to grant read access to the data service account.
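
For example, assuming a hypothetical data bucket named my-sensitive-data and a hypothetical data service account named immuta-data@my-project.iam.gserviceaccount.com, the grants might look like this:

# Grant read access to existing objects in the data bucket
gsutil -m acl ch -r \
    -u immuta-data@my-project.iam.gserviceaccount.com:R \
    gs://my-sensitive-data

# Grant read access to objects created later via the default object ACL
gsutil defacl ch \
    -u immuta-data@my-project.iam.gserviceaccount.com:R \
    gs://my-sensitive-data

If your data buckets use uniform bucket-level access, object ACLs are unavailable; grant an equivalent read-only IAM role on the bucket instead.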

The service account used while developing this tutorial also had the permissions listed below.

compute.machineTypes.get
compute.machineTypes.list
compute.networks.get
compute.networks.list
compute.projects.get
compute.regions.get
compute.regions.list
compute.zones.get
compute.zones.list
dataproc.agents.create
dataproc.agents.delete
dataproc.agents.get
dataproc.agents.list
dataproc.agents.update
dataproc.clusters.create
dataproc.clusters.delete
dataproc.clusters.get
dataproc.clusters.list
dataproc.clusters.update
dataproc.clusters.use
dataproc.jobs.cancel
dataproc.jobs.create
dataproc.jobs.delete
dataproc.jobs.get
dataproc.jobs.list
dataproc.jobs.update
dataproc.operations.delete
dataproc.operations.get
dataproc.operations.list
dataproc.tasks.lease
dataproc.tasks.listInvalidatedLeases
dataproc.tasks.reportStatus
dataproc.workflowTemplates.create
dataproc.workflowTemplates.delete
dataproc.workflowTemplates.get
dataproc.workflowTemplates.instantiate
dataproc.workflowTemplates.instantiateInline
dataproc.workflowTemplates.list
dataproc.workflowTemplates.update
firebase.projects.get
logging.logEntries.create
monitoring.metricDescriptors.create
monitoring.metricDescriptors.get
monitoring.metricDescriptors.list
monitoring.monitoredResourceDescriptors.get
monitoring.monitoredResourceDescriptors.list
monitoring.timeSeries.create
resourcemanager.projects.get
storage.buckets.create
storage.buckets.delete
storage.buckets.get
storage.buckets.getIamPolicy
storage.buckets.list
storage.buckets.setIamPolicy
storage.buckets.update
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.getIamPolicy
storage.objects.list
storage.objects.setIamPolicy
storage.objects.update

You will also need to generate a keyfile for the data service account and upload it to your $IMMUTA_ARTIFACTS_BUCKET prior to cluster creation. Be sure to grant the cluster service account read access to this keyfile, and delete the keyfile from the Google Storage Bucket once the cluster has finished initializing.
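
A sketch of these steps, using the placeholder account and keyfile names from this guide, might look like the following:

# Generate a keyfile for the (hypothetical) data service account
gcloud iam service-accounts keys create your-keyfile.json \
    --iam-account immuta-data@my-project.iam.gserviceaccount.com

# Upload it to the artifacts bucket and allow the cluster service account to read it
gsutil cp your-keyfile.json gs://$IMMUTA_ARTIFACTS_BUCKET/
gsutil acl ch \
    -u immuta-dataproc-limited@my-project.iam.gserviceaccount.com:R \
    gs://$IMMUTA_ARTIFACTS_BUCKET/your-keyfile.json

# Once the cluster has finished initializing, remove the keyfile from the bucket
gsutil rm gs://$IMMUTA_ARTIFACTS_BUCKET/your-keyfile.json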

KMS Resources

You will need to create an encryption keyring to hold the encryption key for the cluster's Kerberos admin password and the Immuta System API key.

In this example, the keyring is called immuta-secret-keyring and is located in the us-east1 region. You may substitute different values if you wish.

gcloud kms keyrings create immuta-secret-keyring --location us-east1

Now, add an encryption key to your newly created keyring.

gcloud kms keys create immuta-secret-key --keyring immuta-secret-keyring --location us-east1 --purpose encryption

Next, use the newly created key to create encrypted files for the Kerberos admin password and the Immuta system API key. Be sure to generate the API key in the Immuta Configuration UI first.

echo "<your kerberos admin password>" | \
gcloud kms encrypt \
    --location=us-east1  \
    --keyring=immuta-secret-keyring \
    --key=immuta-secret-key \
    --plaintext-file=- \
    --ciphertext-file=kerberos-password.encrypted

echo "<your Immuta system api key>" | \
gcloud kms encrypt \
    --location=us-east1  \
    --keyring=immuta-secret-keyring \
    --key=immuta-secret-key \
    --plaintext-file=- \
    --ciphertext-file=api-key.encrypted

Once the encrypted files are created, upload them to your $IMMUTA_ARTIFACTS_BUCKET in Google Storage, ensuring that your cluster service account has read access. Also note that these files should be removed from the bucket once the cluster is created and initialized.
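
For example, with the placeholder names used earlier:

# Upload the encrypted secrets to the artifacts bucket
gsutil cp kerberos-password.encrypted api-key.encrypted gs://$IMMUTA_ARTIFACTS_BUCKET/

# Grant the cluster service account read access if it does not already
# have bucket-level read access
gsutil acl ch \
    -u immuta-dataproc-limited@my-project.iam.gserviceaccount.com:R \
    gs://$IMMUTA_ARTIFACTS_BUCKET/kerberos-password.encrypted \
    gs://$IMMUTA_ARTIFACTS_BUCKET/api-key.encrypted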

Create Dataproc Cluster

Now you can create your cluster. For ease of managing configuration, copy the file below, insert your desired values, and save as immuta-dataproc-env.sh.

#!/bin/bash

# Essential variables
export CLUSTER_NAME=immuta-dataproc
export CLUSTER_PROJECT_NAME=my-project
export CLUSTER_REGION=us-east1
export CLUSTER_SVC_ACCOUNT=immuta-dataproc-limited@${CLUSTER_PROJECT_NAME}.iam.gserviceaccount.com
export CLUSTER_ZONE=us-east1-c
export IMMUTA_ARTIFACTS_BUCKET=my-immuta-artifacts
export IMMUTA_BASE_URL=https://immuta.mycompany.com
export IMMUTA_KEYFILE_URI=gs://${IMMUTA_ARTIFACTS_BUCKET}/your-keyfile.json
export IMMUTA_KEYRING=immuta-secret-keyring
export IMMUTA_KMS_KEY=immuta-secret-key

# VM Configuration
export NUM_WORKERS=2
export NUM_MASTERS=1
export WORKER_MACHINE_TYPE=n1-standard-2
export MASTER_MACHINE_TYPE=n1-standard-2
export WORKER_BOOT_DISK_SIZE=50GB
export MASTER_BOOT_DISK_SIZE=50GB
export IMAGE_VERSION=1.3.34-debian9

# Other Configuration
export IMMUTA_ACCESS_KEY_CRYPTO_LOCATION=${CLUSTER_REGION}
export IMMUTA_ACCESS_KEY_URI=gs://${IMMUTA_ARTIFACTS_BUCKET}/api-key.encrypted
export IMMUTA_BUNDLE_URI=gs://${IMMUTA_ARTIFACTS_BUCKET}/immuta_bundle-2.6.0-hadoop-2.9.2-public.tar.gz
export IMMUTA_INSTALL_SCRIPT_URI=gs://${IMMUTA_ARTIFACTS_BUCKET}/install.sh
export IMMUTA_PARTITION_SERVICE_USER=immuta
export INIT_ACTIONS=gs://${IMMUTA_ARTIFACTS_BUCKET}/immuta_initialization_actions.sh,gs://${IMMUTA_ARTIFACTS_BUCKET}/immuta_kerberos_actions.sh
export KERBEROS_INIT_PASSWORD=$(cat /dev/urandom | LC_ALL=C tr -dc 'a-zA-Z0-9' | fold -w 30 | head -n 1)
export KERBEROS_KMS_KEY=projects/${CLUSTER_PROJECT_NAME}/locations/${IMMUTA_ACCESS_KEY_CRYPTO_LOCATION}/keyRings/${IMMUTA_KEYRING}/cryptoKeys/${IMMUTA_KMS_KEY}
export KERBEROS_PASSWORD_URI=gs://${IMMUTA_ARTIFACTS_BUCKET}/kerberos-password.encrypted
export STARTUP_SCRIPT_URI=gs://${IMMUTA_ARTIFACTS_BUCKET}/immuta_startup_script.sh

Finally, run the code block below to create your cluster. Feel free to tweak any of the configuration flags as you see fit.

source immuta-dataproc-env.sh && \
gcloud dataproc clusters create ${CLUSTER_NAME} \
  --bucket ${IMMUTA_ARTIFACTS_BUCKET} \
  --region ${CLUSTER_REGION} \
  --zone ${CLUSTER_ZONE} \
  --service-account ${CLUSTER_SVC_ACCOUNT} \
  --num-workers ${NUM_WORKERS} \
  --num-masters ${NUM_MASTERS} \
  --image-version=${IMAGE_VERSION} \
  --worker-boot-disk-size=${WORKER_BOOT_DISK_SIZE} \
  --master-boot-disk-size=${MASTER_BOOT_DISK_SIZE} \
  --master-machine-type=${MASTER_MACHINE_TYPE} \
  --worker-machine-type=${WORKER_MACHINE_TYPE} \
  --tags='hive,jdwp,kdc' \
  --kerberos-root-principal-password-uri ${KERBEROS_PASSWORD_URI} \
  --kerberos-kms-key=${KERBEROS_KMS_KEY} \
  --initialization-actions ${INIT_ACTIONS} \
  --properties ^--^hive:hive.security.metastore.authorization.auth.reads=false\
--core:fs.immuta.impl=com.immuta.hadoop.ImmutaFileSystem\
--core:immuta.permission.users.to.ignore=hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,hadoop\
--core:immuta.credentials.dir=/user\
--core:immuta.spark.partition.generator.user=${IMMUTA_PARTITION_SERVICE_USER}\
--core:immuta.base.url=${IMMUTA_BASE_URL}\
--core:immuta.cluster.name=${CLUSTER_NAME}\
--core:hadoop.proxyuser.${IMMUTA_PARTITION_SERVICE_USER}.hosts=\*\
--core:hadoop.proxyuser.${IMMUTA_PARTITION_SERVICE_USER}.users=\*\
--core:hadoop.proxyuser.${IMMUTA_PARTITION_SERVICE_USER}.groups=\*\
--hadoop-env:HADOOP_CLASSPATH=\${HADOOP_CLASSPATH}:/opt/immuta/hadoop/lib/immuta-inode-attribute-provider.jar\
--hdfs:dfs.namenode.inode.attributes.provider.class=com.immuta.hadoop.ImmutaInodeAttributeProvider\
--hdfs:immuta.extra.name.node.plugin.config=file:///opt/immuta/hadoop/name-node-conf.xml\
--hdfs:dfs.namenode.acls.enabled=true\
--hive:hive.server2.enable.doAs=true\
--hive:hive.compute.query.using.stats=true\
--hive:google.cloud.auth.service.account.json.keyfile=/etc/hive/keyfile.json\
--yarn:yarn.resourcemanager.webapp.methods-allowed=GET,HEAD,OPTIONS \
    --metadata ^--^startup-script-url=${STARTUP_SCRIPT_URI}\
--immuta_bundle_uri=${IMMUTA_BUNDLE_URI}\
--immuta_install_script_uri=${IMMUTA_INSTALL_SCRIPT_URI}\
--immuta_install_location=/opt/immuta/hadoop\
--immuta_partition_service_user=${IMMUTA_PARTITION_SERVICE_USER}\
--immuta_partition_service_principal=${IMMUTA_PARTITION_SERVICE_USER}\
--immuta_partition_service_keytab=/opt/immuta/hadoop/${IMMUTA_PARTITION_SERVICE_USER}.keytab\
--immuta_access_key_uri=${IMMUTA_ACCESS_KEY_URI}\
--immuta_keyfile_location=/etc/hive/keyfile.json\
--immuta_keyring=${IMMUTA_KEYRING}\
--immuta_kms_key=${IMMUTA_KMS_KEY}\
--immuta_access_key_crypto_location=${IMMUTA_ACCESS_KEY_CRYPTO_LOCATION}\
--immuta_keyfile_uri=${IMMUTA_KEYFILE_URI}\
--immuta_kerberos_helper_script_uri=gs://${IMMUTA_ARTIFACTS_BUCKET}/kerberos_helper.sh\
--immuta_setup_kerberos_principals=${IMMUTA_PARTITION_SERVICE_USER}:${KERBEROS_INIT_PASSWORD}

Post-creation Tasks

Once your cluster is up, you will want to delete or restrict the permissions on the sensitive objects in your artifacts bucket. This includes the encrypted passwords as well as the keyfile for your data service account.
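
A sketch of removing those objects, using the placeholder names from this guide:

gsutil rm \
    gs://$IMMUTA_ARTIFACTS_BUCKET/kerberos-password.encrypted \
    gs://$IMMUTA_ARTIFACTS_BUCKET/api-key.encrypted \
    gs://$IMMUTA_ARTIFACTS_BUCKET/your-keyfile.json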

You will also need to copy the Immuta user keytab (located at /opt/immuta/hadoop/immuta.keytab in this example) and the /etc/krb5.conf file from the master node of your cluster and upload them in the Immuta Configuration UI.
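
One way to retrieve these files is with gcloud compute scp, assuming the default Dataproc naming convention in which the master node is ${CLUSTER_NAME}-m; note that the keytab may be readable only by root or the immuta user, so you may first need to adjust its permissions or copy it with sudo on the master node.

source immuta-dataproc-env.sh && \
gcloud compute scp \
    ${CLUSTER_NAME}-m:/opt/immuta/hadoop/immuta.keytab \
    ${CLUSTER_NAME}-m:/etc/krb5.conf \
    . --zone ${CLUSTER_ZONE}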