Hadoop Clusters

Audience: Data Owners and System Administrators
Content Summary: This guide augments the documentation on Spark, focusing on how and when you should use the Immuta Spark integration on your cluster.

Why Use Immuta On Your Cluster

When you create Hive or Impala tables from your data in HDFS, you may need policies that restrict who can see specific rows and columns. This becomes complex on a Hadoop cluster because you need to protect not only the Hive and/or Impala tables, but also the data that backs those tables.

For example, although SparkSQL references Hive or Impala tables, it does not read any data through them. For performance reasons, it reads the data directly from HDFS. This means that any protections you set on those Hive or Impala tables through Sentry or Ranger will not be applied to the raw file reads in SparkSQL. In fact, for SparkSQL to work, those files need to be completely open to anyone running SparkSQL jobs.

Immuta enforces policy controls not only on the Hive and Impala tables, but also on the backing files in HDFS.

Workflow

If you want to enforce row- and column-level controls on data in HDFS, you must associate some structure with that data. You do this by creating tables in Hive or Impala from the data in HDFS. Once those tables are created, you can expose them as data sources in Immuta as you would with any other database.

The difference, though, is that Immuta will not only enforce the controls through the Immuta Query Engine, but will also dynamically lock down the backing files in HDFS. That means if anyone tries to read those files, they will be denied access. In order to read these files, users can use SparkSQL and the ImmutaSparkSession (Spark 2.4).

Tip: The user principal used to expose the data from Impala/Hive/HDFS will not be impacted by Immuta security on the underlying files; it will fall back to the underlying permissions (such as ACLs).

Immuta Spark Session

The ImmutaSession class (Spark 2.4) is a subclass of SparkSession. Users can access subscribed data sources within their Spark jobs by using SparkSQL. Immuta enforces SparkSQL controls on data platforms that support batch processing workloads. Standard Spark libraries access data from metastore-backed data sources (like Hive and Impala) to retrieve the data from the underlying files stored in HDFS, while Immuta dynamically unlocks the files in HDFS and enforces row-level and column-level controls within the Spark job.

General Spark Access

If you do not need row- and column-level controls but still want to restrict access to files, you can do this with Immuta HDFS data sources. You can expose the HDFS directories in Immuta as data sources and enforce file-level controls based on directory structure or extra attributes on those files. In this case, HDFS reads work as usual and data is read with the Immuta policies enforced.

Policy Fallback

You can also set ACL (or Ranger/Sentry) controls on tables and HDFS files. If an Immuta policy is set on that data, it is enforced first; if not, access falls back to the ACL/Sentry/Ranger controls on that data. You can also exclude users (such as admins) from Immuta policies if you want to.

Please refer to our Installation Guide for details on combined installs with Immuta and Sentry. The two must be installed in a specific sequence.

Securing Hive and Impala without Sentry

Although Cloudera recommends using the Sentry service to secure access to Hive and Impala, CDH cluster administrators can lock down this access without running the Sentry service. See the Security without Sentry Guide for details on this alternative to using Sentry.

Data Sharing

We recommend giving each user private scratch space for writes so that they do not write to public locations in HDFS. This prevents users from inadvertently sharing data or job outputs with other users. Once that data is in their scratch space, users with CREATE_DATA_SOURCE permission can expose that data, either by exposing a Hive or Impala table created from it (if row/column controls are needed) or by exposing the raw HDFS files as an Immuta data source.
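The private scratch space described above could be provisioned along the following lines. This is a minimal sketch, not an Immuta-provided tool: the /scratch/<user> layout, the SCRATCH_ROOT and DRY_RUN variables, and the run and make_scratch helpers are all assumptions made for illustration. Only the hadoop fs -mkdir, -chown, and -chmod subcommands are standard.

```shell
#!/usr/bin/env bash
# Sketch: create a private HDFS scratch directory for one user.
# Assumption: scratch directories live under /scratch/<user> (override with SCRATCH_ROOT).
# With DRY_RUN=1 (the default) commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

make_scratch() {
  local user="$1"
  local dir="${SCRATCH_ROOT:-/scratch}/${user}"
  run hadoop fs -mkdir -p "$dir"
  run hadoop fs -chown "$user" "$dir"
  # 700: only the owning user can list, read, or write the directory
  run hadoop fs -chmod 700 "$dir"
}
```

Calling make_scratch alice prints the three commands for review. Setting DRY_RUN=0 executes them, which typically requires running as an HDFS superuser because of the chown step.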

You may want to allow only privileged users to have CREATE_DATA_SOURCE permission so that the appropriate policies can be applied before the data is exposed.

Securing Hive and Impala without Sentry

Audience: System Administrators
Content Summary: Immuta offers both fine- and coarse-grained protection for Hive and Impala tables for users who access data via the Immuta Query Engine or the Spark Integration. However, additional protections are required to ensure that users cannot gain unauthorized access to data by connecting to Hive or Impala directly. Cloudera recommends using the Sentry service to secure access to Hive and Impala. As an alternative, this guide details steps that CDH cluster administrators can take to lock down Hive and Impala access without running the Sentry service.

Each section in this guide is a required step to ensure that access to Hive and Impala is secured.

Restricting Access to Hive

After installing Immuta on your cluster, users will still be able to connect to Hive via the hive shell, beeline, or JDBC/ODBC connections. To prevent users from circumventing Immuta and gaining unauthorized access to data, you can leverage HDFS access control lists (ACLs) without running Sentry.

Enable HDFS Access Control Lists in Cloudera Manager

See the official Cloudera Documentation to complete this step.

Enable Hive Impersonation in Cloudera Manager

To leverage ACLs to secure Hive, Hive impersonation must be enabled. To enable Hive impersonation in Cloudera Manager, set both hive.server2.enable.impersonation and hive.server2.enable.doAs to true in the Hive service configuration.
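After changing the setting, you may want to confirm that it reached the deployed configuration. The helper below is a rough sketch that reads a property from a hive-site.xml file. The hive_prop function name is hypothetical, the file location on your cluster is an assumption you need to fill in, and the parsing assumes the common layout in which the <value> element sits on the line directly after the matching <name> element.

```shell
#!/usr/bin/env bash
# Sketch: print the value of one property from a hive-site.xml file.
# Assumes <name> and <value> are on consecutive lines, which is the usual
# layout but is not guaranteed for hand-edited files.

hive_prop() {
  local file="$1" name="$2"
  # -F treats the dots in the property name literally
  grep -F -A1 "<name>${name}</name>" "$file" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

For example, hive_prop /path/to/hive-site.xml hive.server2.enable.doAs should print true once impersonation is enabled. An empty result means the property was not found in that file.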

Configure Access Control Lists

Group in this context refers to Linux groups, not Sentry groups.

You must configure ACLs for each HDFS location where Hive data will be stored, restricting access to the hive and impala users and to data owners that belong to a particular group. You can accomplish this by running the commands below.

hadoop fs -setfacl -m other::--- /user/hive/warehouse
hadoop fs -setfacl -m user::rwx /user/hive/warehouse
hadoop fs -setfacl -m group::rwx /user/hive/warehouse
hadoop fs -setfacl -m group:hive:rwx /user/hive/warehouse
hadoop fs -setfacl -m group:examplegroup:rwx /user/hive/warehouse

In this example, we are allowing members of the hive group and examplegroup to select from and insert into tables in Hive. Note that the hive group contains only the hive and impala users, while examplegroup contains the privileged users who would be considered potential data owners in Immuta.

By default, Hive stores data in HDFS under /user/hive/warehouse. However, you can change this directory in the above example if you are using a different data storage location on your cluster.
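If Hive data lives in several locations, or you need to grant access to more than one data-owner group, the same commands can be generated from parameters. The sketch below only prints commands; the acl_commands function name and its argument order are assumptions for illustration. It emits the same setfacl entries as the example above and ends with a getfacl call so that you can inspect the resulting ACL.

```shell
#!/usr/bin/env bash
# Sketch: print the ACL commands for one warehouse directory.
# Usage: acl_commands <hdfs-dir> <group> [<group> ...]
# Nothing is executed here; pipe the output to bash to apply it.

acl_commands() {
  local dir="$1"; shift
  echo "hadoop fs -setfacl -m other::--- ${dir}"
  echo "hadoop fs -setfacl -m user::rwx ${dir}"
  echo "hadoop fs -setfacl -m group::rwx ${dir}"
  local group
  for group in "$@"; do
    echo "hadoop fs -setfacl -m group:${group}:rwx ${dir}"
  done
  # Show the resulting ACL for review
  echo "hadoop fs -getfacl ${dir}"
}
```

Running acl_commands /user/hive/warehouse hive examplegroup prints the five commands from the example followed by the getfacl check. After reviewing the output, append | bash to apply it.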

Restricting Access to Impala

After installing Immuta on your cluster, users will still be able to connect to Impala via impala-shell or JDBC/ODBC connections. To prevent users from circumventing Immuta and gaining unauthorized access to data, you can leverage policy configuration files for Impala without running Sentry.

Create Policy Configuration File

Group in this context refers to Linux groups, not Sentry groups.

The policy configuration file that will drive Impala's security must be in .ini format. The example below will grant users in group examplegroup the ability to read and write data in the default database. You can add additional groups and roles that correspond to different databases or tables.

[groups]
examplegroup = example_insert_role, example_select_role

[roles]
example_insert_role = server=server1->db=default->table=*->action=insert
example_select_role = server=server1->db=default->table=*->action=select

This policy configuration file assigns the group called examplegroup to the roles example_insert_role and example_select_role, which grant insert and select (write and read) privileges on all tables in the default database.
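To produce a file of the same shape for a different group or database, a small generator can help avoid typos in the role strings. This is only a sketch: the policy_ini function and the <group>_insert_role / <group>_select_role naming scheme are assumptions, and it handles a single group and a single database. Merge the output by hand if you need several groups in one file.

```shell
#!/usr/bin/env bash
# Sketch: print a policy file granting one Linux group insert and select
# on every table of one database.
# Usage: policy_ini <server-name> <group> <database>

policy_ini() {
  local server="$1" group="$2" db="$3"
  cat <<EOF
[groups]
${group} = ${group}_insert_role, ${group}_select_role

[roles]
${group}_insert_role = server=${server}->db=${db}->table=*->action=insert
${group}_select_role = server=${server}->db=${db}->table=*->action=select
EOF
}
```

For example, policy_ini server1 analysts sales > /tmp/policy.ini writes a policy for a hypothetical analysts group on a hypothetical sales database.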

See the official Impala documentation for a detailed guide on policy configuration files. Note that while the guide mentions Sentry, running the Sentry service is not required to leverage policy configuration files.

Next, place the policy configuration file (we will call it policy.ini) in HDFS. The policy file should be owned by the impala user, and should only be accessible by the impala user. See below for an example.

hadoop fs -copyFromLocal /tmp/policy.ini /user/impala/
hadoop fs -chown impala:impala /user/impala/policy.ini
hadoop fs -chmod o-rwx /user/impala/policy.ini
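Note that chmod o-rwx removes access for "other" only and leaves the owner and group bits as they were. The check below is a minimal sketch for confirming that result from a listing. The other_has_access function name is hypothetical, and it expects the ten-character permission string from the first column of hadoop fs -ls output.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) if the "other" class still has any permission.
# $1 is a permission string such as "-rw-r-----".

other_has_access() {
  local mode="$1"
  # Characters 8-10 of the string (offset 7, length 3) are the "other" bits
  [ "${mode:7:3}" != "---" ]
}
```

One way to use it is to capture the first column with mode="$(hadoop fs -ls /user/impala/policy.ini | awk 'END {print $1}')" and then run other_has_access "$mode" && echo "policy file is still readable by others".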

Configure Impala to use Policy Configuration File

You can configure Impala to use your new policy file by navigating to Impala's configuration in Cloudera Manager and adding the snippet below to the Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve).

-server_name=server1
-authorization_policy_file=/user/impala/policy.ini

You must restart the Impala service in Cloudera Manager to implement the policy changes. Note that server_name should correspond to the server that you define in your policy roles. Also note that each key-value pair should be placed on its own line in the configuration snippet.
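Because a mismatch between -server_name and the server named in the policy roles is easy to miss, a quick consistency check before restarting can be useful. The function below is a sketch under stated assumptions: check_server_name is a hypothetical helper, it reads the policy file from standard input, and it treats every line containing server= as a role definition.

```shell
#!/usr/bin/env bash
# Sketch: verify that every role in a policy file (read from stdin)
# refers to the expected server name.
# Usage: check_server_name <server-name> < policy.ini

check_server_name() {
  local server="$1"
  local mismatches
  # Keep role lines whose server is not "<server>->"; "|| true" keeps the
  # assignment from failing when there are no mismatches.
  mismatches="$(grep 'server=' | grep -v -F "server=${server}->" || true)"
  if [ -n "$mismatches" ]; then
    echo "roles not matching -server_name=${server}:" >&2
    echo "$mismatches" >&2
    return 1
  fi
}
```

For example, hadoop fs -cat /user/impala/policy.ini | check_server_name server1 returns success only when all roles name server1.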
