1 of 9

Overviews

Hadoop Clusters

Audience: Data Owners and System Administrators
Content Summary: This guide augments the documentation on Spark, focusing on how and when you should use the Immuta Spark integration on your cluster.

Why Use Immuta On Your Cluster

When you create Hive or Impala tables from your data in HDFS, it may require policies restricting who can see specific rows and columns. This becomes complex on a Hadoop cluster because not only do you need to protect the Hive and/or Impala tables, but you also need to protect the data that back those tables.

For example, when you run SparkSQL, although it does reference Hive or Impala tables, it does not actually read any data from them. For performance reasons it reads the data directly from HDFS. This means that any protections you set on those Hive or Impala tables through Sentry or Ranger will not be applied to the raw file reads in SparkSQL. And in fact, those files need to be completely open to anyone running SparkSQL jobs.

Immuta enforces policy controls not only on the Hive and Impala tables, but also the backing files in HDFS.

Workflow

Should you want to enforce row and column level controls on data in HDFS, you must associate some structure to that data. This is done by creating tables in Hive or Impala from that data in HDFS. Once those tables are created, you can then expose them as data sources in Immuta like you normally would any other database.

The difference, though, is that Immuta will not only enforce the controls through the Immuta Query Engine, but will also dynamically lock down the backing files in HDFS. That means if anyone tries to read those files, they will be denied access. In order to read these files, users can use SparkSQL and the ImmutaSparkSession (Spark 2.4).

Tip: The user principal used to expose the data from Impala/HIVE/HDFS will not be impacted by Immuta security on the underlying files; it will fall back to the underlying permissions (such as ACLs).

Immuta Spark Session

The ImmutaSession class (Spark 2.4) is a subclass of SparkSession. Users can access subscribed data sources within their Spark jobs by using SparkSQL. Immuta enforces SparkSQL controls on data platforms that support batch processing workloads. Standard Spark libraries access data from metastore-backed data sources (like Hive and Impala) to retrieve the data from the underlying files stored in HDFS, while Immuta dynamically unlocks the files in HDFS and enforces row-level and column-level controls within the Spark job.

General Spark Access

Should you not care about row and column level controls, but still want to restrict access to files, you can do this with Immuta HDFS data sources. You can expose the HDFS directories in Immuta as data sources and enforce file-level controls based on directory structure or extra attributes on those files. In this case, HDFS reads work as usual and data is read with the Immuta policies enforced.

Policy Fallback

It is possible to also set ACL (or Ranger/Sentry) controls on tables and HDFS files as well. If an Immuta policy is set on that data, it will be enforced first, but if not, it will fall back to the ACL/Sentry/Ranger controls on that data. You can in fact exclude users (like admins) from Immuta policies should you desire to do so.

Please refer to our Installation Guide for details on combined installs with Immuta and Sentry. There are requirements on what sequence you install both.

Securing Hive and Impala without Sentry

Although Cloudera recommends using the Sentry service to secure access to Hive and Impala, CDH cluster administrators can lock down this access without running the Sentry service. See the Security without Sentry Guide for details on this alternative to using Sentry.

It is recommended that you provide write scratch space to your users that is private to them, avoiding write to public locations in HDFS. This avoids the issue of users inadvertently sharing data or data outputs from their jobs with other users. Once that data is in their scratch space, users with CREATE_DATA_SOURCE permission can expose that data, either by exposing a Hive or Impala table created from it (if row/column controls are needed) or by exposing the raw HDFS files as an Immuta data source.

You may want to only allow privileged users have CREATE_DATA_SOURCE permission so the appropriate policies can be applied before the data is exposed.

Securing Hive and Impala without Sentry

Audience: System Administrators
Content Summary: Immuta offers both fine- and coarse-grained protection for Hive and Impala tables for users who access data via the Immuta Query Engine or the Spark Integration. However, additional protections are required to ensure that users cannot gain unauthorized access to data by connecting to Hive or Impala directly. Cloudera recommends using the Sentry service to secure access to Hive and Impala. As an alternative, this guide details steps that CDH cluster administrators can take to lock down Hive and Impala access without running the Sentry service.

Each section in this guide is a required step to ensure that access to Hive and Impala is secured.

Restricting Access to Hive

After installing Immuta on your cluster, users will still be able to connect to Hive via the hive shell, beeline, or JDBC/ODBC connections. To prevent users from circumventing Immuta and gaining unauthorized access to data, you can leverage HDFS Access control lists (ACLs) without running Sentry.

Enable HDFS Access Control Lists in Cloudera Manager

See the official Cloudera Documentation to complete this step.

Enable Hive Impersonation in Cloudera Manager

In order to leverage ACLs to secure Hive, Hive impersonation must be enabled. To enable Hive impersonation in Cloudera manager, set hive.server2.enable.impersonation, hive.server2.enable.doAs to true in the Hive service configuration.

Configure Access Control Lists

Group in this context refers to Linux groups, not Sentry groups.

You must configure ACLs for each location in HDFS that Hive data will be stored in to restrict access to hive, impala, and data owners that belong to a particular group. You can accomplish this by running the commands below.

hadoop fs -setfacl -m other::--- /user/hive/warehouse
hadoop fs -setfacl -m user::rwx /user/hive/warehouse
hadoop fs -setfacl -m group::rwx /user/hive/warehouse
hadoop fs -setfacl -m group:hive:rwx /user/hive/warehouse
hadoop fs -setfacl -m group:examplegroup:rwx /user/hive/warehouse

In this example, we are allowing members of the hive and examplegroup to select & insert on tables in hive. Note that the hive group only contains the hive and impala users, while examplegroup contains the privileged users who would be considered potential data owners in Immuta.

By default, Hive stores data in HDFS under /user/hive/warehouse. However, you can change this directory in the above example if you are using a different data storage location on your cluster.

Restricting Access to Impala

After installing Immuta on your cluster, users will still be able to connect to Impala via impala-shell or JDBC/ODBC connections. To prevent users from circumventing Immuta and gaining unauthorized access to data, you can leverage policy configuration files for Impala without running Sentry.

Create Policy Configuration File

Group in this context refers to Linux groups, not Sentry groups.

The policy configuration file that will drive Impala's security must be in .ini format. The example below will grant users in group examplegroup the ability to read and write data in the default database. You can add additional groups and roles that correspond to different databases or tables.

[groups]
examplegroup = example_insert_role, example_select_role

[roles]
example_insert_role = server=server1->db=default->table=*->action=insert
example_select_role = server=server1->db=default->table=*->action=select

This policy configuration file assigns the group called examplegroup to the roles example_insert_role and example_select_role, which grant insert and select (read and write) privileges on all tables in the default database.

See the official Impala documentation for a detailed guide on policy configuration files. Note that while the guide mentions Sentry, running the Sentry service is not required to leverage policy configuration files.

Next, place the policy configuration file (we will call it policy.ini) in HDFS. The policy file should be owned by the impala user, and should only be accessible by the impala user. See below for an example.

hadoop fs -copyFromLocal /tmp/policy.ini /user/impala/
hadoop fs -chown impala:impala /user/impala/policy.ini
hadoop fs -chmod o-rwx /user/impala/policy.ini

Configure Impala to use Policy Configuration File

You can configure Impala to leverage your new policy file by navigating to Impala's configuration in Cloudera Manager and modifying Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve) with the snippet below.

-server_name=server1
-authorization_policy_file=/user/impala/policy.ini

You must restart the Impala service in Cloudera Manager to implement the policy changes. Note that server_name should correspond to the server that you define in your policy roles. Also note that each key-value pair should be placed on its own line in the configuration snippet.

HDFS Access Pattern

Audience: Data Owners and Data Users
Content Summary: Immuta integrates with your Hadoop cluster to provide policy-compliant access to data sources directly through HDFS. This page instructs how to access data through the HDFS integration, which only enforces file-level controls on data. For more information on installing and configuring the Immuta Hadoop plugin, see the installation tutorial. There is also a Spark SQL integration should you need to enforce row-level and column-level controls on data.
The Immuta Hadoop plugin can also be integrated with an existing kerberos setup to allow users to access HDFS data using their existing kerberos principals, with data access and policy enforcement managed by Immuta.

Immuta HDFS Principal

When Immuta is installed on the cluster, users can only access data through HDFS using the HDFS principal that has been set for them in Immuta. This principal can only be set by an Immuta Administrator or imported from an external Identity Manager, but Immuta users can view their principal via the profile page.

Authentication

In order to access data through Immuta's HDFS Integration, you must be authenticated as the user or principal that is assigned to your Immuta HDFS principal.

For clusters secured with kerberos, you must successfully kinit with your Immuta HDFS principal before attempting to access data.
For insecure clusters, you must be logged in to the cluster as the system user that is assigned to your HDFS principal.

Accessing Data

Immuta's HDFS integration allows you to access data two different ways:

The immuta:/// namespace allows you to access files in relation to the Immuta data source that it is part of. For example, if you want to access a file called december_report.csv that is part of an Immuta data source called reports, you can access it with the following path:
immuta:///immuta/reports/december_report.csv
Note that the path to the file is relative to the Immuta data source that it falls under, not the real path in HDFS. Also, immuta:/// is restricted to only paths that a user can see - files that the user is not authorized for will not be visible.
The HDFS integration also allows users to access data using native HDFS paths. Authorized data source subscribers can access the file december_report.csv through its native path in HDFS:
hdfs:///actual/path/in/hdfs/december_report.csv
Note that in order for a user to access data using hdfs:/// paths, there must be a hdfs:///user/<user>/ directory where <user> corresponds to the user's Immuta HDFS principal. Also, hdfs:/// paths will allow users to see locations of all files, but they will only be able to read files that they have access to in Immuta.

Both methods of accessing data will be audited and compliant with data source policies. If users are not subscribed to or are policy-restricted by the data source that a file in HDFS falls under, they will not be able to access the file using either namespace.

HDFS User Impersonation

Immuta users with the IMPERSONATE_HDFS_USER permission can create HDFS, Hive, and Impala data sources as any HDFS user (provided that they have the proper credentials). For more information, see the tutorial for creating a data source.

Legacy S3 Access Pattern

Audience: Data Owners, Data Users, and System Administrators
Content Summary: Immuta supports an S3-style REST API, which allows you to communicate with Immuta the same way you would with S3. Consequently, Immuta easily integrates with tools you may already be using to work with S3.

S3 as a Filesystem

In this integration, Immuta implements a single bucket (with data sources broken up as sub-directories under that bucket), since some S3 tools only support the new virtual-hosted style requests.

The three APIs (outlined below) used in this integration support basic AWS functionality; the requests and responses for each are identical to those in S3.

GET Bucket

This request returns the bucket configured within Immuta.

Method

Path

Successful Status Code

GET Bucket Contents

This request returns the contents of the given bucket.

Method

Path

Successful Status Code

GET Object

This request returns a stream from the requested object within Immuta.

Method

Path

Successful Status Code

Example Request:

curl \
    --request GET \
    --header "Authorization: AWS <API KEY>:immuta" \
    https://demo.immuta.com/s3p/immuta/my_data_source/path/to/file/myfile.json

Example: HTTP Request and Response

GET Bucket Example Request:

curl \
    --request GET \
    --header "Authorization: AWS <API KEY>:immuta" \
    https://demo.immuta.com/s3p/immuta?delimiter=/&prefix=my_data_source/path/to/file

Note: There is a single file in the requested directory.

GET Bucket Example Response:

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01/">
    <IsTruncated>false</IsTruncated>
    <Marker></Marker>
    <Name>immuta</Name>
    <Prefix>my_data_source/path/to/file</Prefix>
    <MaxKeys>1000</MaxKeys>
    <Delimiter>/</Delimiter>
    <Contents>
        <Key>my_data_source/path/to/file/myfile.json</Key>
        <LastModified>2018-11-05T21:25:04.000Z</LastModified>
        <ETag>5b0810c82a69a70e552cece19b20585fc94b67fe4eaa8b</ETag>
        <Size>389</Size>
        <StorageClass>STANDARD</StorageClass>
        <Owner>
            <ID>Immuta</ID>
            <DisplayName>Immuta</DisplayName>
        </Owner>
    </Contents>
</ListBucketResult>

Example: Using Boto 3 to Download Objects

Boto 3 is the official Amazon Web Services client SDK for Python and is widely used by developers for accessing S3 objects. With Immuta's S3 integration, Immuta users can use boto3 to download policy-enforced files or tables.

The first step is to create a Session object that points to your Immuta endpoint and is authenticated with a user-specific API Key.

import boto3

session = boto3.session.Session()

s3_client = session.client(
    service_name = 's3',
    aws_access_key_id = '<YOUR_USER_API_KEY>',
    aws_secret_access_key = 'immuta',
    endpoint_url = 'https://<YOUR_IMMUTA_URL>:443/s3p'
)

To find out what objects are available for download, you can list the objects in the immuta bucket. To filter down to a particular data source, pass in a Prefix that corresponds to the SQL table name of your Immuta data source.

bucket_contents = s3_client.list_objects(
    Bucket = 'immuta',
    Delimiter = '/',
    Prefix = '<SQL_TABLE_NAME>'
).get("Contents")

print(bucket_contents[0])
    {
        'Key': '<SQL_TABLE_NAME>/<SINGLE_OBJECT_KEY>',
        'ETag': 'aa0492082b95c5d8bb90377a006e...',
        'StorageClass': 'STANDARD',
        'Owner': {'DisplayName': 'Immuta', 'ID': 'Immuta'}
    }

Once you have an object key, you can use the download_file method to download the object to your local development environment.

s3_client.download_file(
    Bucket = "immuta",
    Key = "<SQL_TABLE_NAME>/<SINGLE_OBJECT_KEY>",
    Filename = "<OUTPUT_FILE_PATH>"
)

Spark Access Pattern

Audience: Data Owners and Data Users
Content Summary: Users can access subscribed data sources within their Spark jobs by using SparkSQL with the ImmutaSession class (Spark 2.4). Immuta enforces SparkSQL controls on data platforms that support batch processing workloads. Through this process, all tables are virtual and empty until a query is materialized.
When a query is materialized, standard Spark libraries access data from metastore-backed data sources (like Hive and Impala) to retrieve the data from the underlying files stored in HDFS. Other data source types access data using the Query Engine, which proxies the query to the native database technology and automatically enforces policies for each data source.
Security of data sources is enforced both server-side and client-side. Server-side security is provided by an external partitioning service and client-side security is provided by a Java SecurityManager to moderate access to sensitive information.

Spark Integration Specific to CDH and EMR

The Spark integration is only supported by CDH and EMR integrations.

Section Contents

Using the Immuta SparkSession (Spark 2)
Leveraging Data on Other Clusters and Databases
Spark Policy Enforcement and Deployment

Using the Immuta SparkSession (Spark 2)

Audience: Data Users
Content Summary: This page outlines how to use the Immuta SparkSession with spark-submit, spark-shell, and pyspark.
Immuta SparkSession Background: For Spark 2, the Immuta SparkSession must be used in order to access Immuta data sources. Once the Immuta Spark Installation has been completed on your Spark cluster, then you are able to use the special Immuta Spark interfaces that are detailed below. For data platforms that support batch processing workloads, the Immuta SparkSession allows users to query data sources the same way that they query Hive tables with Spark SQL.
When querying metastore-backed data sources, such as Hive and Impala, the Immuta Session accesses the data directly in HDFS. Other data source types will pass through the Query Engine. In order to take advantage of the performance gains provided by directly acting on the files in HDFS in your Spark jobs, you must create Immuta data sources for metastore-backed data sources with tables that are persisted in HDFS.
For guidance on querying data sources across multiple clusters and/or remote databases, see Leveraging Data on Other Clusters and Databases.

Using the Immuta SparkSession

With spark-submit

Launch the special immuta-spark-submit interface, and submit jobs just like you would with spark-submit:

immuta-spark-submit <job>

With spark-shell

First, launch the special immuta-spark-shell interface:

immuta-spark-shell

Then, Use the immuta variable just like you would spark:

immuta.catalog.listTables().show()
val df = immuta.table("my_immuta_datasource")
df.show()
val df2 = immuta.sql("SELECT * FROM my_immuta_datasource")
df2.show()

Next, use the immuta format to specify partition information:

val df3 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "300")
    .option("numPartitions", "3")
    .load()
df3.show()

The immuta format also supports query pushdown:

val df4 = immuta.read.format("immuta")
    .option("dbtable", "(SELECT * FROM my_immuta_datasource) as my_immuta_datasource")
    .load()
df4.show()

Finally, specify the fetch size:

val df5 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("fetchsize", "500").load()
df5.show()

With pyspark

First, launch the special immuta-pyspark interface:

immuta-pyspark

Then, use the immuta variable just like you would spark:

immuta.catalog.listTables()
df = immuta.table("my_immuta_datasource")
df.show()
df2 = immuta.sql("SELECT * FROM my_immuta_datasource")
df2.show()

Finally, use the immuta format to specify partition information:

df3 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "300")
    .option("numPartitions", "3")
    .load()
df3.show()

The immuta format also supports query pushdown:

df4 = immuta.read.format("immuta")
    .option("dbtable", "(SELECT * FROM my_immuta_datasource) as my_immuta_datasource")
    .load()
df4.show()
To specify the fetch size:

df5 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("fetchsize", "500").load()
df5.show()

Leveraging Data on Other Clusters and Databases

Audience: Data Users
Content Summary: Immuta's Spark integration can help you leverage data in tables across different clusters and databases in your organization, without having to make permanent copies of the data. This page illustrates the process of running efficient cross-technology joins in Spark.
The code examples on this page are written in Scala using the immuta session variable in Spark 2.4. If you are using Spark 1.6, you can repeat these steps with the ImmutaContext variable, ic.

Prerequisites

An Immuta data source for each database table that you wish to join. For guidance on creating these data sources, please refer to this tutorial.
A working Immuta HDFS/Spark plugin installation on one of your clusters. This is also the cluster that your spark jobs will run on. For guidance on installing the Immuta plugin, please refer to the Hadoop Installation Guide.

Cross-cluster Joins

When joining data across clusters, the most efficient approach is to focus queries on narrower windows of data to eliminate overhead. Although Immuta is not permanently rewriting the data, it still must transport data across a network from a different cluster. For this reason, users are encouraged to avoid overly broad queries.

Suppose you wish to run the query below, where sales refers to an Immuta data source on Cluster A and customer refers to an Immuta data source denoted by Database B. Also assume that the Immuta Spark plugin has been successfully installed on Cluster A.

To eliminate overhead, you join data and calculate sales totals for customers within their first month of registration. The following query calculates first-month sales for customers who registered in April 2018:

SELECT
   s.customer_id, c.id, c.registration_date,
   sum(s.sale_price) total_sales
FROM
    sales s, customer c
WHERE
    s.customer_id = c.id
    and s.sale_date < 20180501
    and s.sale_date > 20180331
    and c.registration_date < 20180501
    and c.registration_date > 20180331
GROUP BY
    s.customer_id, c.id, c.registration_date
ORDER BY
    c.id

Step 1: Load Tables into Spark DataFrames

To maximize the efficiency of the cross-cluster join query, the first step is to load a partitioned portion of the data into a Spark DataFrame. This will reduce the overhead of the join query, and allow Immuta to calculate an ideal query plan.

First, load the desired sales data from the local Cluster A into a DataFrame named salesDF by passing the desired query to immuta.sql():

val salesQuery = """SELECT
customer_id, sale_price, sale_date, region_id
FROM sales
WHERE sale_date < 20180501 and sale_date > 20180331"""

val salesDF = immuta.sql(salesQuery)

Then, load customer data from remote Database B into a DataFrame named customerDF. The syntax to set up the remote DataFrame is a little bit different since the user needs to pass in the partitioning configuration. Note that the user defines partitions on the region_id column, which is an integer between 1000 and 2000.

Note: When choosing a partition column, it is important to find a column with a generally even distribution across a known range of values. If you are expecting a large volume of data to be returned from the remote cluster, you can increase the number of partitions to break up the transfers into smaller payloads.

val customerQuery = """(SELECT
id, region_id, registration_date
FROM customer WHERE registration_date < 20180501 and registration_date > 20180331)
as customer_tmp"""

val customerReader = immuta.read.format("immuta")
.option("dbtable", customerQuery)
.option("partitionColumn", "region_id")
.option("lowerBound", "1000")
.option("upperBound", "2000")
.option("numPartitions", "3")
val customerDF = customerReader.load()

If you do not partition your query and the remote data is larger than a single executor can handle (which is very typical for most workloads), the full local-cluster portion of the query will run. Then, one-by-one each Spark executor will attempt to execute the remote query and fail due to memory limitations. Thus, the time to failure of a non-partitioned query is extremely long. For more information, please contact your Immuta Support Professional.

Step 2: Register Temporary Views of Filtered Data

Now that you have defined the filtered and partitioned DataFrames, register them as temporary views that will be used in the join query:

salesDF.createOrReplaceTempView("sales_tmp")
customerDF.createOrReplaceTempView("customer_tmp")

Immuta recognizes these temporary views as queryable tables for the current session. Below is an example of viewing the queryable Immuta tables in the Spark CLI:

scala> immuta.sql("show tables").show()                                                                                                                                            +--------+---------------+-----------+
|database|      tableName|isTemporary|
+--------+---------------+-----------+
|  immuta|       customer|      false|
|  immuta|          sales|      false|
|        |   customer_tmp|       true|
|        |      sales_tmp|       true|
+--------+---------------+-----------+

Step 3: Run the Join Query

Finally, leverage the newly-created temporary views to run the cross-cluster join query:

val joinQuery="""SELECT
s.customer_id, c.registration_date,
sum(s.sale_price) total_sales
FROM sales_tmp s, customer_tmp c
WHERE s.customer_id = c.id
GROUP BY s.customer_id, c.id, c.registration_date
ORDER BY c.id"""

val joinDF = immuta.sql(joinQuery)

The following is a possible output in the Spark CLI:

scala> joinDF.show()
+-----------+-----------------+-----------+
|customer_id|registration_date|total_sales|
+-----------+-----------------+-----------+
|   00000001|         20180427|    1005.40|
|   00000002|         20180411|      80.82|
|   00000003|         20180412|       9.00|
|   00000004|         20180409|     768.09|
|   00000005|         20180421|     534.20|
|   00000006|         20180429|    3218.28|
|   00000007|         20180403|    1076.20|
|   00000008|         20180422|     632.45|
|   00000009|         20180417|      76.50|
|   00000010|         20180428|     598.12|
|   00000011|         20180425|       9.99|
|   00000012|         20180405|      54.90|
|   00000013|         20180410|    2602.97|
|   00000014|         20180416|      16.02|
|   00000015|         20180413|     576.90|
|   00000016|         20180419|      12.39|
|   00000017|         20180401|    2280.92|
|   00000018|         20180418|     209.71|
|   00000019|         20180414|    1140.46|
|   00000020|         20180416|     342.89|
+-----------+-----------------+-----------+
only showing top 20 rows

Spark Policy Enforcement and Deployment

Audience: Data Owners and Data Users
Content Summary: This page details the components of Immuta's Spark ecosystem and policy enforcement.

Spark Policy Enforcement

In Immuta's Spark plugins, policies are enforced at query time much like the .

Outside of Databricks, Immuta's Spark ecosystem is composed of

Immuta SparkSession
Vulcan Service
Immuta SecurityManager
Immuta NameNode Plugin (optional, HDFS)

All of these components work in conjunction to apply and enforce Immuta policies on data sources queried through Spark.

In Databricks, Immuta's Spark policy enforcement is driven by Spark plugins that operate on a normal SparkSession (i.e., no ImmutaSparkSession class or object).

Immuta SparkSession/Immuta Context (non-Databricks)

The Immuta SparkSession is the client-side plugin in the Immuta Spark ecosystem. This plugin is an extension of the open-source SparkSession, but Immuta's SparkSession and the open-source SparkSession have two differences:

Immuta's external and session catalogs
Immuta's logical replanning

The replanning in ImmutaSparkSession occurs in the QueryExecution class. Immuta has an internal version of that class that replaces the different stages of the plan (logical, analyzed, optimized, sparkPlan, and executedPlan) with policy-enforced versions, and the QueryExecution object and resulting SparkPlan (physical plan) trigger audit calls. Additionally, Immuta's implementation of QueryExecution provides a layer of security within the JVM itself to make sure that any sensitive information needed by physical plans is used or stored so that it can be protected by the SecurityManager.

Several other Spark internals are implemented in Immuta to organize code in a way that the SecurityManager can prevent access to fields or methods that expose sensitive information.

Non-Databricks Deployments

In non-Databricks deployments, users will have to use a different object in their code (an instance of ImmutaSparkSession) than the normal SparkSession object to run Immuta Spark jobs. Creating this object is simple, only requiring a 1-2 line change in most existing scripts.

SparkSession Modifications in Databricks

In Databricks deployments, Immuta's plugins operate in a more transparent manner than outside of Databricks. Immuta leverages SparkSessionExtensions in Databricks to update the different planning phases in Spark and add Immuta's policies to the target SparkSession objects. This means that in Databricks users do not have to use a different object to interact with Immuta data sources; they simply connect to an Immuta-enabled cluster and do their work as usual.

Immuta updates the Analyzer, Hive Client, and physical planning strategy to ensure that policies are enforced on any user-generated plans and that the user's view of available data sources represents only what they are allowed to see in Immuta.

ODBC/JDBC Queries

In Databricks, Spark is the execution layer for any ODBC/JDBC connections to the cluster. This means that when Immuta's plugins are installed, ODBC/JDBC queries submitted to the cluster go through Immuta's plugins during execution. This provides a great deal of functionality for users who wish to connect BI tools directly to the cluster and still have their view of Immuta's data. However, when exposing data sources in Immuta from an Immuta-enabled Databricks cluster, the API token provided to Immuta for exposing the Databricks data source must belong to either an administrative user in Databricks or a privileged user specified in the Immuta configuration on the Databricks cluster.

Plan Analysis and Execution

To make the Immuta Spark ecosystem as user-friendly as possible, Immuta's Spark implementation resolves relations by reaching out to the Immuta Web Service instead of resolving relations in the Hive Metastore directly. All queryable Immuta data sources are available to Immuta's Spark plugins.

Cluster-native data sources (Hive, Impala, or Databricks) will be queried by accessing files directly from storage that compose the Metastore table, which is the same type of query execution that occurs in open source Spark when accessing a table in the Hive Metastore.

Any non-cluster queryable data source in Immuta will be queried from the user's Spark application via JDBC through the Immuta Query Engine. Users can provide query partition information similar to what is available via the JDBC data source in Spark to distribute their query to the Query Engine.

In JDBC data sources, policies are enforced at the Query Engine layer. In cluster data sources, policies are enforced through the following steps:

Plan modification during analysis to include policies using functions/expressions for masking and filters for row-level policies.
Restrictions to field/method access through the Immuta SecurityManager.

In Databricks

Restrictions to storage configuration access via the Immuta SecurityManager. User code cannot access credentials for S3, ADL gen 2, etc. directly, and those configurations are only loadable by the ImmutaSecureFileSystemWrapper class.
Restrictions to the use of AWS instance roles via the Immuta SecurityManager.

Outside Databricks

Partition and file access token generation in the Vulcan Service.
Token validation and filesystem access enforcement in the Immuta NameNode plugin (HDFS).
Token validation and remote object store proxying/enforcement in the Vulcan Service (S3/ADL/etc).

Plan Modifications

When a user attempts to query any Hive or Impala data source through the Immuta SparkSession, the Immuta catalogs first replace the relation in the user's plan with the proper plan that the data source represents. For example, if the user attempts the query (immuta is an instance of ImmutaSparkSession)

and the customer_purchases data source is composed of this query

and, in Immuta, these columns were selected to expose in this data source

id
first_name
last_name
age
country
ssn
product_id
department
purchase_date

the resulting Spark logical plan would look like this:

After the data source is resolved, the policies specific to the user will be applied to the logical plan. If the policy has masking or filters (row-level, minimization, time filter, etc.), those filters will be applied to all corresponding underlying tables in the plan. For example, consider the following Immuta policies:

Mask using hashing the column ssn for everyone.
Only show rows where user is a member of group that matches the value in the column department for everyone.

The plan would be modified (assume the current user is in the "Toys" and "Home Goods" groups):

In this example, the masked columns (such as ssn) are aliased to their original name after masking is applied. This means that transformations, filters, or functions applied to those columns will be applied to the masked columns. Additionally, filters on the plan are applied before any user transformations or filters, so a user's query cannot modify or subvert the policies applied to the plan.

Immuta does not attempt to change or block optimizations to the Spark Plan via the Catalyst Optimizer.

Query Engine and Spark Difference

Spark policies are applied at the lowest possible level in the Spark plan for security reasons, which may lead to different results when applying policies to a Spark plan rather than a Query Engine plan. For instance, in the Query Engine a user may be able to compute a column and then generate a masking policy on that computed column. For security reasons, this is not possible in Spark, so the query may be blocked outright.

Field Protections via SecurityManager

Immuta has an implementation of the Java SecurityManager construct, which is required when running Spark jobs with the Immuta SparkSession. When a user's Immuta Spark job starts, it communicates with the Immuta Vulcan Service to get an access token, which can be exchanged for partition information during job planning and execution.

The Vulcan Service checks whether the user's job is running with the SecurityManager enabled; if so, it is allowed to retrieve partitions and access tokens during job execution to temporarily access the underlying data for the table. This data is stored in HDFS or a cloud object store (such as S3 or ADL). During job execution, the SecurityManager restricts when file access tokens can be used and which classes can use them. These restrictions prevent users from attempting to access data outside an approved Immuta Spark plan with policies applied.

The SecurityManager also prevents users from making changes to Spark plans that the Immuta SparkSession has generated. This means that once policies have been applied, users cannot attempt to modify the plan and remove policies that are being enforced via the plan modifications.

Vulcan Service

The Vulcan Service serves administrative functions in the Spark ecosystem and is only deployed outside of Databricks. The Service has these major responsibilities in Immuta's Spark ecosystem:

Compute partition information for Immuta Spark Jobs
Service administrative requests for Immuta Hadoop Native Workspaces
Act as a proxy to remote storage (S3, Google Storage, etc.) for Immuta Spark jobs

Compute Partition Information for Immuta Spark Jobs

Immuta users do not have access to the underlying data files (like Parquet or ORC files) for the Hive Metastore tables that make up Immuta data sources on-cluster. For this reason, the user's Spark application cannot generate partition information directly because it cannot read file metadata from HDFS or remote storage.

Consequently, the user's Spark job must request partition information from the Vulcan Service, which must be configured in such a way that it can access all raw data that may be the target of Immuta data sources. This configuration should include

Running the service as a kerberos principal that is specified in HDFS NameNode configuration as the Immuta Vulcan user. If this configuration is incorrect, the service will fail to start, as the service will not have access to the locations in HDFS that it requires. This access is granted dynamically by the Immuta NameNode plugin.
Running the service with S3/Google Storage credentials that have access to the underlying data in remote storage. This configuration should be written in a way that users cannot access the configuration files, but the Vulcan Service user can. Typically this is done by configuring sensitive information in generator.xml on the CLASSPATH for Vulcan and only giving the OS user running the Vulcan service access to that file.

Service Administrative Requests for Immuta Hadoop Native Workspaces

The Vulcan Service serves all native workspace management requests on Hadoop Clusters. These requests include

Workspace creation
Workspace deletion
Derived data source creation from a directory
Determining if directory contains supported files (ORC/Parquet)

The Vulcan Service must have access to create Metastore databases to create Immuta native workspace databases and have access in storage (HDFS is handled via the NameNode plugin) to create directories in the configured workspace locations.

Act as a Proxy to Remote Storage for Immuta Spark Jobs

The Vulcan Service acts as a proxy to remote storage when Immuta Spark jobs read data from Metastore-based data sources. As mentioned above, the Vulcan Service must have access to credentials for reading data from remote storage to fulfill requests from Immuta Spark jobs to read that data. The Vulcan Service acts as a proxy with very minimal overhead when reading from remote storage.

The user must present Vulcan with a temporary access token for any target files being read. These temporary tokens are generated by Vulcan during partition generation and protected by the SecurityManager so that users cannot access them directly. The token presented to Vulcan grants access to the raw data via Vulcan's storage proxy endpoints. Vulcan opens a stream to the target object in storage and passes that stream's content back to the client until they are finished reading.

Note: The client will read all bytes needed from Vulcan, but Vulcan may read more data from storage than the client needed into its buffers. This may produce warning messages in the Vulcan logs but those are expected, as Vulcan cannot predict the number of bytes needed by the client.

Immuta Jobs Co-Located with Non-Immuta Jobs

The way Immuta is deployed allows a cluster to service both Immuta and non-Immuta workloads. Although it is recommended that those workloads are segregated, in many cases that is not feasible. However, because of the way Immuta jobs are executed (outside of Databricks), it is clear when a user is attempting to use Immuta and when they are not because of the immuta- prefixed scripts that are analogous to the out-of-the-box Spark scripts for starting different spark toolsets. (For example, immuta-pyspark instead of pyspark and immuta-spark-submit instead of spark-submit.)

These scripts are required because Immuta packages a full deployment of Spark's binaries to override the target Spark classes needed by Immuta's plugins to operate securely. The immuta- prefixed scripts set up environment variables needed by Immuta to execute properly and set other required configuration items that are not the default global values for Spark.

Note: This does not apply to Databricks. Once a Databricks cluster is Immuta-enabled/configured, Immuta is in the execution path for all jobs, regardless of whether the executing user is an Immuta user.

Spark Policy Enforcement and Deployment

Audience: Data Owners and Data Users
Content Summary: This page details the components of Immuta's Spark ecosystem and policy enforcement.

Spark Policy Enforcement

In Immuta's Spark plugins, policies are enforced at query time much like the .

Outside of Databricks, Immuta's Spark ecosystem is composed of

Immuta SparkSession
Vulcan Service
Immuta SecurityManager
Immuta NameNode Plugin (optional, HDFS)

All of these components work in conjunction to apply and enforce Immuta policies on data sources queried through Spark.

In Databricks, Immuta's Spark policy enforcement is driven by Spark plugins that operate on a normal SparkSession (i.e., no ImmutaSparkSession class or object).

Immuta SparkSession/Immuta Context (non-Databricks)

Immuta's external and session catalogs
Immuta's logical replanning

Several other Spark internals are implemented in Immuta to organize code in a way that the SecurityManager can prevent access to fields or methods that expose sensitive information.

Non-Databricks Deployments

SparkSession Modifications in Databricks

Since Immuta runs transparently on the cluster and does not require a separate SparkSession object, any administrative users who need to run jobs against raw data on cluster must set the proper Spark configuration parameters to bypass Immuta's plan resolution plugins. See the for the Databricks installation for more information.

ODBC/JDBC Queries

See Immuta's for more details.

Plan Analysis and Execution

In JDBC data sources, policies are enforced at the Query Engine layer. In cluster data sources, policies are enforced through the following steps:

Plan modification during analysis to include policies using functions/expressions for masking and filters for row-level policies.
Restrictions to field/method access through the Immuta SecurityManager.

In Databricks

Restrictions to storage configuration access via the Immuta SecurityManager. User code cannot access credentials for S3, ADL gen 2, etc. directly, and those configurations are only loadable by the ImmutaSecureFileSystemWrapper class.
Restrictions to the use of AWS instance roles via the Immuta SecurityManager.

Outside Databricks

Partition and file access token generation in the Vulcan Service.
Token validation and filesystem access enforcement in the Immuta NameNode plugin (HDFS).
Token validation and remote object store proxying/enforcement in the Vulcan Service (S3/ADL/etc).

Plan Modifications

immuta.sql("SELECT * FROM customer_purchases WHERE age BETWEEN 18 AND 24 AND product_id = 15")

and the customer_purchases data source is composed of this query

SELECT * FROM customer JOIN purchase where customer.id = purchase.customer_id

and, in Immuta, these columns were selected to expose in this data source

id
first_name
last_name
age
country
ssn
product_id
department
purchase_date

the resulting Spark logical plan would look like this:

'Project [*]
+- 'Filter ((('age >= 18) && ('age <= 24)) && ('purchase.product_id = 15))
   +- 'Project ['id, 'first_name, 'last_name, 'age, 'country, 'ssn, 'product_id, 'department, 'purchase_date]
      +- 'Join Inner, ('customer.id = 'purchase.customer_id)
         :- 'UnresolvedRelation `customer`
         +- 'UnresolvedRelation `purchase`

Mask using hashing the column ssn for everyone.
Only show rows where user is a member of group that matches the value in the column department for everyone.

The plan would be modified (assume the current user is in the "Toys" and "Home Goods" groups):

'Project [*]
+- 'Filter ((('age >= 18) && ('age <= 24)) && ('product_id = 15))
   +- 'Project ['id, 'first_name, 'last_name, 'country, 'ssn, 'product_id, 'department, 'purchase_date]
      +- 'Join Inner, ('customer.id = 'purchase.customer_id)
         :- 'Project ['id, 'first_name, 'last_name, 'age, 'country, 'immuta_hash('ssn) AS ssn#0]
         :  +- 'UnresolvedRelation `customer`
         +- 'Project ['customer_id, 'product_id, 'department, 'purchase_date]
            +- 'Filter (('department = Toys) || ('department = Home Goods))
               +- 'UnresolvedRelation `purchase`

Immuta does not attempt to change or block optimizations to the Spark Plan via the Catalyst Optimizer.

Query Engine and Spark Difference

Field Protections via SecurityManager

Vulcan Service

The Vulcan Service serves administrative functions in the Spark ecosystem and is only deployed outside of Databricks. The Service has these major responsibilities in Immuta's Spark ecosystem:

Compute partition information for Immuta Spark Jobs
Service administrative requests for Immuta Hadoop Native Workspaces
Act as a proxy to remote storage (S3, Google Storage, etc.) for Immuta Spark jobs

Compute Partition Information for Immuta Spark Jobs

Running the service as a kerberos principal that is specified in HDFS NameNode configuration as the Immuta Vulcan user. If this configuration is incorrect, the service will fail to start, as the service will not have access to the locations in HDFS that it requires. This access is granted dynamically by the Immuta NameNode plugin.
Running the service with S3/Google Storage credentials that have access to the underlying data in remote storage. This configuration should be written in a way that users cannot access the configuration files, but the Vulcan Service user can. Typically this is done by configuring sensitive information in generator.xml on the CLASSPATH for Vulcan and only giving the OS user running the Vulcan service access to that file.

Service Administrative Requests for Immuta Hadoop Native Workspaces

The Vulcan Service serves all native workspace management requests on Hadoop Clusters. These requests include

Workspace creation
Workspace deletion
Derived data source creation from a directory
Determining if directory contains supported files (ORC/Parquet)

Act as a Proxy to Remote Storage for Immuta Spark Jobs

Immuta Jobs Co-Located with Non-Immuta Jobs

However, even when a non-Immuta user is executing a non-Immuta Spark job, it is possible that the Immuta NameNode plugin is still in the execution path for that job. Please see our for the Immuta NameNode plugin to minimize the overhead or impact on non-Immuta users in a Hadoop cluster (such as setting up ignored paths in HDFS or dynamically determining non-Immuta users or paths).

Overviews

Hadoop Clusters

Why Use Immuta On Your Cluster

Workflow

Immuta Spark Session

General Spark Access

Policy Fallback

Securing Hive and Impala without Sentry

Data Sharing

Securing Hive and Impala without Sentry

Restricting Access to Hive

Enable HDFS Access Control Lists in Cloudera Manager

Enable Hive Impersonation in Cloudera Manager

Configure Access Control Lists

Restricting Access to Impala

Create Policy Configuration File

Configure Impala to use Policy Configuration File

HDFS Access Pattern

Immuta HDFS Principal

Authentication

Accessing Data

HDFS User Impersonation

Legacy S3 Access Pattern

S3 as a Filesystem

GET Bucket

GET Bucket Contents

GET Object

Example: HTTP Request and Response

Example: Using Boto 3 to Download Objects

Spark Access Pattern

Section Contents

Using the Immuta SparkSession (Spark 2)

Using the Immuta SparkSession

With spark-submit

With spark-shell

With pyspark

Leveraging Data on Other Clusters and Databases

Prerequisites

Cross-cluster Joins

Step 1: Load Tables into Spark DataFrames

Step 2: Register Temporary Views of Filtered Data

Step 3: Run the Join Query

Spark Policy Enforcement and Deployment

Spark Policy Enforcement

Immuta SparkSession/Immuta Context (non-Databricks)

SparkSession Modifications in Databricks

Plan Analysis and Execution

Plan Modifications

Query Engine and Spark Difference

Field Protections via SecurityManager

Vulcan Service

Compute Partition Information for Immuta Spark Jobs

Service Administrative Requests for Immuta Hadoop Native Workspaces

Act as a Proxy to Remote Storage for Immuta Spark Jobs

Immuta Jobs Co-Located with Non-Immuta Jobs

Legacy S3 Access Pattern

S3 as a Filesystem

GET Bucket

GET Bucket Contents

GET Object

Example: HTTP Request and Response

Example: Using Boto 3 to Download Objects

HDFS Access Pattern

Immuta HDFS Principal

Authentication

Accessing Data

HDFS User Impersonation

Hadoop Clusters

Why Use Immuta On Your Cluster

Workflow

Immuta Spark Session

General Spark Access

Policy Fallback

Securing Hive and Impala without Sentry

Data Sharing

Spark Access Pattern

Section Contents

Using the Immuta SparkSession (Spark 2)

Using the Immuta SparkSession

With spark-submit