1 of 30

Appendix

Audience: All Immuta Users
Content Summary:: A collection of detailed technical documents.

Section Contents

If you do not see the content below in the left navigation pane, you need to log in.

Glossary
Policy as Code: API Reference Guides
Project Workspaces:
- HDFS Workspaces
- Managing Hadoop Workspaces
Query Engine

Additional Integrations

Overviews

Hadoop Clusters

Audience: Data Owners and System Administrators
Content Summary: This guide augments the documentation on Spark, focusing on how and when you should use the Immuta Spark integration on your cluster.

Why Use Immuta On Your Cluster

When you create Hive or Impala tables from your data in HDFS, it may require policies restricting who can see specific rows and columns. This becomes complex on a Hadoop cluster because not only do you need to protect the Hive and/or Impala tables, but you also need to protect the data that back those tables.

For example, when you run SparkSQL, although it does reference Hive or Impala tables, it does not actually read any data from them. For performance reasons it reads the data directly from HDFS. This means that any protections you set on those Hive or Impala tables through Sentry or Ranger will not be applied to the raw file reads in SparkSQL. And in fact, those files need to be completely open to anyone running SparkSQL jobs.

Immuta enforces policy controls not only on the Hive and Impala tables, but also the backing files in HDFS.

Workflow

Should you want to enforce row and column level controls on data in HDFS, you must associate some structure to that data. This is done by creating tables in Hive or Impala from that data in HDFS. Once those tables are created, you can then expose them as data sources in Immuta like you normally would any other database.

The difference, though, is that Immuta will not only enforce the controls through the Immuta Query Engine, but will also dynamically lock down the backing files in HDFS. That means if anyone tries to read those files, they will be denied access. In order to read these files, users can use SparkSQL and the ImmutaSparkSession (Spark 2.4).

Tip: The user principal used to expose the data from Impala/HIVE/HDFS will not be impacted by Immuta security on the underlying files; it will fall back to the underlying permissions (such as ACLs).

Immuta Spark Session

The ImmutaSession class (Spark 2.4) is a subclass of SparkSession. Users can access subscribed data sources within their Spark jobs by using SparkSQL. Immuta enforces SparkSQL controls on data platforms that support batch processing workloads. Standard Spark libraries access data from metastore-backed data sources (like Hive and Impala) to retrieve the data from the underlying files stored in HDFS, while Immuta dynamically unlocks the files in HDFS and enforces row-level and column-level controls within the Spark job.

General Spark Access

Should you not care about row and column level controls, but still want to restrict access to files, you can do this with Immuta HDFS data sources. You can expose the HDFS directories in Immuta as data sources and enforce file-level controls based on directory structure or extra attributes on those files. In this case, HDFS reads work as usual and data is read with the Immuta policies enforced.

Policy Fallback

It is possible to also set ACL (or Ranger/Sentry) controls on tables and HDFS files as well. If an Immuta policy is set on that data, it will be enforced first, but if not, it will fall back to the ACL/Sentry/Ranger controls on that data. You can in fact exclude users (like admins) from Immuta policies should you desire to do so.

Please refer to our Installation Guide for details on combined installs with Immuta and Sentry. There are requirements on what sequence you install both.

Securing Hive and Impala without Sentry

Although Cloudera recommends using the Sentry service to secure access to Hive and Impala, CDH cluster administrators can lock down this access without running the Sentry service. See the Security without Sentry Guide for details on this alternative to using Sentry.

It is recommended that you provide write scratch space to your users that is private to them, avoiding write to public locations in HDFS. This avoids the issue of users inadvertently sharing data or data outputs from their jobs with other users. Once that data is in their scratch space, users with CREATE_DATA_SOURCE permission can expose that data, either by exposing a Hive or Impala table created from it (if row/column controls are needed) or by exposing the raw HDFS files as an Immuta data source.

You may want to only allow privileged users have CREATE_DATA_SOURCE permission so the appropriate policies can be applied before the data is exposed.

Securing Hive and Impala without Sentry

Audience: System Administrators
Content Summary: Immuta offers both fine- and coarse-grained protection for Hive and Impala tables for users who access data via the Immuta Query Engine or the Spark Integration. However, additional protections are required to ensure that users cannot gain unauthorized access to data by connecting to Hive or Impala directly. Cloudera recommends using the Sentry service to secure access to Hive and Impala. As an alternative, this guide details steps that CDH cluster administrators can take to lock down Hive and Impala access without running the Sentry service.

Each section in this guide is a required step to ensure that access to Hive and Impala is secured.

Restricting Access to Hive

After installing Immuta on your cluster, users will still be able to connect to Hive via the hive shell, beeline, or JDBC/ODBC connections. To prevent users from circumventing Immuta and gaining unauthorized access to data, you can leverage HDFS Access control lists (ACLs) without running Sentry.

Enable HDFS Access Control Lists in Cloudera Manager

See the official Cloudera Documentation to complete this step.

Enable Hive Impersonation in Cloudera Manager

In order to leverage ACLs to secure Hive, Hive impersonation must be enabled. To enable Hive impersonation in Cloudera manager, set hive.server2.enable.impersonation, hive.server2.enable.doAs to true in the Hive service configuration.

Configure Access Control Lists

Group in this context refers to Linux groups, not Sentry groups.

You must configure ACLs for each location in HDFS that Hive data will be stored in to restrict access to hive, impala, and data owners that belong to a particular group. You can accomplish this by running the commands below.

hadoop fs -setfacl -m other::--- /user/hive/warehouse
hadoop fs -setfacl -m user::rwx /user/hive/warehouse
hadoop fs -setfacl -m group::rwx /user/hive/warehouse
hadoop fs -setfacl -m group:hive:rwx /user/hive/warehouse
hadoop fs -setfacl -m group:examplegroup:rwx /user/hive/warehouse

In this example, we are allowing members of the hive and examplegroup to select & insert on tables in hive. Note that the hive group only contains the hive and impala users, while examplegroup contains the privileged users who would be considered potential data owners in Immuta.

By default, Hive stores data in HDFS under /user/hive/warehouse. However, you can change this directory in the above example if you are using a different data storage location on your cluster.

Restricting Access to Impala

After installing Immuta on your cluster, users will still be able to connect to Impala via impala-shell or JDBC/ODBC connections. To prevent users from circumventing Immuta and gaining unauthorized access to data, you can leverage policy configuration files for Impala without running Sentry.

Create Policy Configuration File

Group in this context refers to Linux groups, not Sentry groups.

The policy configuration file that will drive Impala's security must be in .ini format. The example below will grant users in group examplegroup the ability to read and write data in the default database. You can add additional groups and roles that correspond to different databases or tables.

[groups]
examplegroup = example_insert_role, example_select_role

[roles]
example_insert_role = server=server1->db=default->table=*->action=insert
example_select_role = server=server1->db=default->table=*->action=select

This policy configuration file assigns the group called examplegroup to the roles example_insert_role and example_select_role, which grant insert and select (read and write) privileges on all tables in the default database.

See the official Impala documentation for a detailed guide on policy configuration files. Note that while the guide mentions Sentry, running the Sentry service is not required to leverage policy configuration files.

Next, place the policy configuration file (we will call it policy.ini) in HDFS. The policy file should be owned by the impala user, and should only be accessible by the impala user. See below for an example.

hadoop fs -copyFromLocal /tmp/policy.ini /user/impala/
hadoop fs -chown impala:impala /user/impala/policy.ini
hadoop fs -chmod o-rwx /user/impala/policy.ini

Configure Impala to use Policy Configuration File

You can configure Impala to leverage your new policy file by navigating to Impala's configuration in Cloudera Manager and modifying Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve) with the snippet below.

-server_name=server1
-authorization_policy_file=/user/impala/policy.ini

You must restart the Impala service in Cloudera Manager to implement the policy changes. Note that server_name should correspond to the server that you define in your policy roles. Also note that each key-value pair should be placed on its own line in the configuration snippet.

HDFS Access Pattern

Audience: Data Owners and Data Users
Content Summary: Immuta integrates with your Hadoop cluster to provide policy-compliant access to data sources directly through HDFS. This page instructs how to access data through the HDFS integration, which only enforces file-level controls on data. For more information on installing and configuring the Immuta Hadoop plugin, see the installation tutorial. There is also a Spark SQL integration should you need to enforce row-level and column-level controls on data.
The Immuta Hadoop plugin can also be integrated with an existing kerberos setup to allow users to access HDFS data using their existing kerberos principals, with data access and policy enforcement managed by Immuta.

Immuta HDFS Principal

When Immuta is installed on the cluster, users can only access data through HDFS using the HDFS principal that has been set for them in Immuta. This principal can only be set by an Immuta Administrator or imported from an external Identity Manager, but Immuta users can view their principal via the profile page.

Authentication

In order to access data through Immuta's HDFS Integration, you must be authenticated as the user or principal that is assigned to your Immuta HDFS principal.

For clusters secured with kerberos, you must successfully kinit with your Immuta HDFS principal before attempting to access data.
For insecure clusters, you must be logged in to the cluster as the system user that is assigned to your HDFS principal.

Accessing Data

Immuta's HDFS integration allows you to access data two different ways:

The immuta:/// namespace allows you to access files in relation to the Immuta data source that it is part of. For example, if you want to access a file called december_report.csv that is part of an Immuta data source called reports, you can access it with the following path:
immuta:///immuta/reports/december_report.csv
Note that the path to the file is relative to the Immuta data source that it falls under, not the real path in HDFS. Also, immuta:/// is restricted to only paths that a user can see - files that the user is not authorized for will not be visible.
The HDFS integration also allows users to access data using native HDFS paths. Authorized data source subscribers can access the file december_report.csv through its native path in HDFS:
hdfs:///actual/path/in/hdfs/december_report.csv
Note that in order for a user to access data using hdfs:/// paths, there must be a hdfs:///user/<user>/ directory where <user> corresponds to the user's Immuta HDFS principal. Also, hdfs:/// paths will allow users to see locations of all files, but they will only be able to read files that they have access to in Immuta.

Both methods of accessing data will be audited and compliant with data source policies. If users are not subscribed to or are policy-restricted by the data source that a file in HDFS falls under, they will not be able to access the file using either namespace.

HDFS User Impersonation

Immuta users with the IMPERSONATE_HDFS_USER permission can create HDFS, Hive, and Impala data sources as any HDFS user (provided that they have the proper credentials). For more information, see the tutorial for creating a data source.

Legacy S3 Access Pattern

Audience: Data Owners, Data Users, and System Administrators
Content Summary: Immuta supports an S3-style REST API, which allows you to communicate with Immuta the same way you would with S3. Consequently, Immuta easily integrates with tools you may already be using to work with S3.

S3 as a Filesystem

In this integration, Immuta implements a single bucket (with data sources broken up as sub-directories under that bucket), since some S3 tools only support the new virtual-hosted style requests.

The three APIs (outlined below) used in this integration support basic AWS functionality; the requests and responses for each are identical to those in S3.

GET Bucket

This request returns the bucket configured within Immuta.

Method

Path

Successful Status Code

GET Bucket Contents

This request returns the contents of the given bucket.

Method

Path

Successful Status Code

GET Object

This request returns a stream from the requested object within Immuta.

Method

Path

Successful Status Code

Example Request:

curl \
    --request GET \
    --header "Authorization: AWS <API KEY>:immuta" \
    https://demo.immuta.com/s3p/immuta/my_data_source/path/to/file/myfile.json

Example: HTTP Request and Response

GET Bucket Example Request:

curl \
    --request GET \
    --header "Authorization: AWS <API KEY>:immuta" \
    https://demo.immuta.com/s3p/immuta?delimiter=/&prefix=my_data_source/path/to/file

Note: There is a single file in the requested directory.

GET Bucket Example Response:

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01/">
    <IsTruncated>false</IsTruncated>
    <Marker></Marker>
    <Name>immuta</Name>
    <Prefix>my_data_source/path/to/file</Prefix>
    <MaxKeys>1000</MaxKeys>
    <Delimiter>/</Delimiter>
    <Contents>
        <Key>my_data_source/path/to/file/myfile.json</Key>
        <LastModified>2018-11-05T21:25:04.000Z</LastModified>
        <ETag>5b0810c82a69a70e552cece19b20585fc94b67fe4eaa8b</ETag>
        <Size>389</Size>
        <StorageClass>STANDARD</StorageClass>
        <Owner>
            <ID>Immuta</ID>
            <DisplayName>Immuta</DisplayName>
        </Owner>
    </Contents>
</ListBucketResult>

Example: Using Boto 3 to Download Objects

Boto 3 is the official Amazon Web Services client SDK for Python and is widely used by developers for accessing S3 objects. With Immuta's S3 integration, Immuta users can use boto3 to download policy-enforced files or tables.

The first step is to create a Session object that points to your Immuta endpoint and is authenticated with a user-specific API Key.

import boto3

session = boto3.session.Session()

s3_client = session.client(
    service_name = 's3',
    aws_access_key_id = '<YOUR_USER_API_KEY>',
    aws_secret_access_key = 'immuta',
    endpoint_url = 'https://<YOUR_IMMUTA_URL>:443/s3p'
)

To find out what objects are available for download, you can list the objects in the immuta bucket. To filter down to a particular data source, pass in a Prefix that corresponds to the SQL table name of your Immuta data source.

bucket_contents = s3_client.list_objects(
    Bucket = 'immuta',
    Delimiter = '/',
    Prefix = '<SQL_TABLE_NAME>'
).get("Contents")

print(bucket_contents[0])
    {
        'Key': '<SQL_TABLE_NAME>/<SINGLE_OBJECT_KEY>',
        'ETag': 'aa0492082b95c5d8bb90377a006e...',
        'StorageClass': 'STANDARD',
        'Owner': {'DisplayName': 'Immuta', 'ID': 'Immuta'}
    }

Once you have an object key, you can use the download_file method to download the object to your local development environment.

s3_client.download_file(
    Bucket = "immuta",
    Key = "<SQL_TABLE_NAME>/<SINGLE_OBJECT_KEY>",
    Filename = "<OUTPUT_FILE_PATH>"
)

Spark Access Pattern

Audience: Data Owners and Data Users
Content Summary: Users can access subscribed data sources within their Spark jobs by using SparkSQL with the ImmutaSession class (Spark 2.4). Immuta enforces SparkSQL controls on data platforms that support batch processing workloads. Through this process, all tables are virtual and empty until a query is materialized.
When a query is materialized, standard Spark libraries access data from metastore-backed data sources (like Hive and Impala) to retrieve the data from the underlying files stored in HDFS. Other data source types access data using the Query Engine, which proxies the query to the native database technology and automatically enforces policies for each data source.
Security of data sources is enforced both server-side and client-side. Server-side security is provided by an external partitioning service and client-side security is provided by a Java SecurityManager to moderate access to sensitive information.

Spark Integration Specific to CDH and EMR

The Spark integration is only supported by CDH and EMR integrations.

Section Contents

Using the Immuta SparkSession (Spark 2)
Leveraging Data on Other Clusters and Databases
Spark Policy Enforcement and Deployment

Using the Immuta SparkSession (Spark 2)

Audience: Data Users
Content Summary: This page outlines how to use the Immuta SparkSession with , , and .
Immuta SparkSession Background: For Spark 2, the Immuta SparkSession must be used in order to access Immuta data sources. Once the Immuta Spark Installation has been completed on your Spark cluster, then you are able to use the special Immuta Spark interfaces that are detailed below. For data platforms that support batch processing workloads, the Immuta SparkSession allows users to query data sources the same way that they query Hive tables with Spark SQL.
When querying metastore-backed data sources, such as Hive and Impala, the Immuta Session accesses the data directly in HDFS. Other data source types will pass through the . In order to take advantage of the performance gains provided by directly acting on the files in HDFS in your Spark jobs, you must create Immuta data sources for metastore-backed data sources with tables that are persisted in HDFS.
For guidance on querying data sources across multiple clusters and/or remote databases, see .

Using the Immuta SparkSession

With spark-submit

Launch the special immuta-spark-submit interface, and submit jobs just like you would with spark-submit:

With spark-shell

First, launch the special immuta-spark-shell interface:

Then, Use the immuta variable just like you would spark:

Next, use the immuta format to specify partition information:

The immuta format also supports query pushdown:

Finally, specify the fetch size:

With pyspark

First, launch the special immuta-pyspark interface:

Then, use the immuta variable just like you would spark:

Finally, use the immuta format to specify partition information:

The immuta format also supports query pushdown:

Leveraging Data on Other Clusters and Databases

Audience: Data Users
Content Summary: Immuta's Spark integration can help you leverage data in tables across different clusters and databases in your organization, without having to make permanent copies of the data. This page illustrates the process of running efficient cross-technology joins in Spark.
The code examples on this page are written in Scala using the immuta session variable in Spark 2.4. If you are using Spark 1.6, you can repeat these steps with the ImmutaContext variable, ic.

Prerequisites

An Immuta data source for each database table that you wish to join. For guidance on creating these data sources, please refer to this tutorial.
A working Immuta HDFS/Spark plugin installation on one of your clusters. This is also the cluster that your spark jobs will run on. For guidance on installing the Immuta plugin, please refer to the Hadoop Installation Guide.

Cross-cluster Joins

When joining data across clusters, the most efficient approach is to focus queries on narrower windows of data to eliminate overhead. Although Immuta is not permanently rewriting the data, it still must transport data across a network from a different cluster. For this reason, users are encouraged to avoid overly broad queries.

Suppose you wish to run the query below, where sales refers to an Immuta data source on Cluster A and customer refers to an Immuta data source denoted by Database B. Also assume that the Immuta Spark plugin has been successfully installed on Cluster A.

To eliminate overhead, you join data and calculate sales totals for customers within their first month of registration. The following query calculates first-month sales for customers who registered in April 2018:

SELECT
   s.customer_id, c.id, c.registration_date,
   sum(s.sale_price) total_sales
FROM
    sales s, customer c
WHERE
    s.customer_id = c.id
    and s.sale_date < 20180501
    and s.sale_date > 20180331
    and c.registration_date < 20180501
    and c.registration_date > 20180331
GROUP BY
    s.customer_id, c.id, c.registration_date
ORDER BY
    c.id

Step 1: Load Tables into Spark DataFrames

To maximize the efficiency of the cross-cluster join query, the first step is to load a partitioned portion of the data into a Spark DataFrame. This will reduce the overhead of the join query, and allow Immuta to calculate an ideal query plan.

First, load the desired sales data from the local Cluster A into a DataFrame named salesDF by passing the desired query to immuta.sql():

val salesQuery = """SELECT
customer_id, sale_price, sale_date, region_id
FROM sales
WHERE sale_date < 20180501 and sale_date > 20180331"""

val salesDF = immuta.sql(salesQuery)

Then, load customer data from remote Database B into a DataFrame named customerDF. The syntax to set up the remote DataFrame is a little bit different since the user needs to pass in the partitioning configuration. Note that the user defines partitions on the region_id column, which is an integer between 1000 and 2000.

Note: When choosing a partition column, it is important to find a column with a generally even distribution across a known range of values. If you are expecting a large volume of data to be returned from the remote cluster, you can increase the number of partitions to break up the transfers into smaller payloads.

val customerQuery = """(SELECT
id, region_id, registration_date
FROM customer WHERE registration_date < 20180501 and registration_date > 20180331)
as customer_tmp"""

val customerReader = immuta.read.format("immuta")
.option("dbtable", customerQuery)
.option("partitionColumn", "region_id")
.option("lowerBound", "1000")
.option("upperBound", "2000")
.option("numPartitions", "3")
val customerDF = customerReader.load()

If you do not partition your query and the remote data is larger than a single executor can handle (which is very typical for most workloads), the full local-cluster portion of the query will run. Then, one-by-one each Spark executor will attempt to execute the remote query and fail due to memory limitations. Thus, the time to failure of a non-partitioned query is extremely long. For more information, please contact your Immuta Support Professional.

Step 2: Register Temporary Views of Filtered Data

Now that you have defined the filtered and partitioned DataFrames, register them as temporary views that will be used in the join query:

salesDF.createOrReplaceTempView("sales_tmp")
customerDF.createOrReplaceTempView("customer_tmp")

Immuta recognizes these temporary views as queryable tables for the current session. Below is an example of viewing the queryable Immuta tables in the Spark CLI:

scala> immuta.sql("show tables").show()                                                                                                                                            +--------+---------------+-----------+
|database|      tableName|isTemporary|
+--------+---------------+-----------+
|  immuta|       customer|      false|
|  immuta|          sales|      false|
|        |   customer_tmp|       true|
|        |      sales_tmp|       true|
+--------+---------------+-----------+

Step 3: Run the Join Query

Finally, leverage the newly-created temporary views to run the cross-cluster join query:

val joinQuery="""SELECT
s.customer_id, c.registration_date,
sum(s.sale_price) total_sales
FROM sales_tmp s, customer_tmp c
WHERE s.customer_id = c.id
GROUP BY s.customer_id, c.id, c.registration_date
ORDER BY c.id"""

val joinDF = immuta.sql(joinQuery)

The following is a possible output in the Spark CLI:

scala> joinDF.show()
+-----------+-----------------+-----------+
|customer_id|registration_date|total_sales|
+-----------+-----------------+-----------+
|   00000001|         20180427|    1005.40|
|   00000002|         20180411|      80.82|
|   00000003|         20180412|       9.00|
|   00000004|         20180409|     768.09|
|   00000005|         20180421|     534.20|
|   00000006|         20180429|    3218.28|
|   00000007|         20180403|    1076.20|
|   00000008|         20180422|     632.45|
|   00000009|         20180417|      76.50|
|   00000010|         20180428|     598.12|
|   00000011|         20180425|       9.99|
|   00000012|         20180405|      54.90|
|   00000013|         20180410|    2602.97|
|   00000014|         20180416|      16.02|
|   00000015|         20180413|     576.90|
|   00000016|         20180419|      12.39|
|   00000017|         20180401|    2280.92|
|   00000018|         20180418|     209.71|
|   00000019|         20180414|    1140.46|
|   00000020|         20180416|     342.89|
+-----------+-----------------+-----------+
only showing top 20 rows

Spark Policy Enforcement and Deployment

Audience: Data Owners and Data Users
Content Summary: This page details the components of Immuta's Spark ecosystem and policy enforcement.

Spark Policy Enforcement

In Immuta's Spark plugins, policies are enforced at query time much like the .

Outside of Databricks, Immuta's Spark ecosystem is composed of

Immuta SparkSession
Vulcan Service
Immuta SecurityManager
Immuta NameNode Plugin (optional, HDFS)

All of these components work in conjunction to apply and enforce Immuta policies on data sources queried through Spark.

In Databricks, Immuta's Spark policy enforcement is driven by Spark plugins that operate on a normal SparkSession (i.e., no ImmutaSparkSession class or object).

Immuta SparkSession/Immuta Context (non-Databricks)

The Immuta SparkSession is the client-side plugin in the Immuta Spark ecosystem. This plugin is an extension of the open-source SparkSession, but Immuta's SparkSession and the open-source SparkSession have two differences:

Immuta's external and session catalogs
Immuta's logical replanning

The replanning in ImmutaSparkSession occurs in the QueryExecution class. Immuta has an internal version of that class that replaces the different stages of the plan (logical, analyzed, optimized, sparkPlan, and executedPlan) with policy-enforced versions, and the QueryExecution object and resulting SparkPlan (physical plan) trigger audit calls. Additionally, Immuta's implementation of QueryExecution provides a layer of security within the JVM itself to make sure that any sensitive information needed by physical plans is used or stored so that it can be protected by the SecurityManager.

Several other Spark internals are implemented in Immuta to organize code in a way that the SecurityManager can prevent access to fields or methods that expose sensitive information.

Non-Databricks Deployments

In non-Databricks deployments, users will have to use a different object in their code (an instance of ImmutaSparkSession) than the normal SparkSession object to run Immuta Spark jobs. Creating this object is simple, only requiring a 1-2 line change in most existing scripts.

SparkSession Modifications in Databricks

In Databricks deployments, Immuta's plugins operate in a more transparent manner than outside of Databricks. Immuta leverages SparkSessionExtensions in Databricks to update the different planning phases in Spark and add Immuta's policies to the target SparkSession objects. This means that in Databricks users do not have to use a different object to interact with Immuta data sources; they simply connect to an Immuta-enabled cluster and do their work as usual.

Immuta updates the Analyzer, Hive Client, and physical planning strategy to ensure that policies are enforced on any user-generated plans and that the user's view of available data sources represents only what they are allowed to see in Immuta.

ODBC/JDBC Queries

In Databricks, Spark is the execution layer for any ODBC/JDBC connections to the cluster. This means that when Immuta's plugins are installed, ODBC/JDBC queries submitted to the cluster go through Immuta's plugins during execution. This provides a great deal of functionality for users who wish to connect BI tools directly to the cluster and still have their view of Immuta's data. However, when exposing data sources in Immuta from an Immuta-enabled Databricks cluster, the API token provided to Immuta for exposing the Databricks data source must belong to either an administrative user in Databricks or a privileged user specified in the Immuta configuration on the Databricks cluster.

Plan Analysis and Execution

To make the Immuta Spark ecosystem as user-friendly as possible, Immuta's Spark implementation resolves relations by reaching out to the Immuta Web Service instead of resolving relations in the Hive Metastore directly. All queryable Immuta data sources are available to Immuta's Spark plugins.

Cluster-native data sources (Hive, Impala, or Databricks) will be queried by accessing files directly from storage that compose the Metastore table, which is the same type of query execution that occurs in open source Spark when accessing a table in the Hive Metastore.

Any non-cluster queryable data source in Immuta will be queried from the user's Spark application via JDBC through the Immuta Query Engine. Users can provide query partition information similar to what is available via the JDBC data source in Spark to distribute their query to the Query Engine.

In JDBC data sources, policies are enforced at the Query Engine layer. In cluster data sources, policies are enforced through the following steps:

Plan modification during analysis to include policies using functions/expressions for masking and filters for row-level policies.
Restrictions to field/method access through the Immuta SecurityManager.

In Databricks

Restrictions to storage configuration access via the Immuta SecurityManager. User code cannot access credentials for S3, ADL gen 2, etc. directly, and those configurations are only loadable by the ImmutaSecureFileSystemWrapper class.
Restrictions to the use of AWS instance roles via the Immuta SecurityManager.

Outside Databricks

Partition and file access token generation in the Vulcan Service.
Token validation and filesystem access enforcement in the Immuta NameNode plugin (HDFS).
Token validation and remote object store proxying/enforcement in the Vulcan Service (S3/ADL/etc).

Plan Modifications

When a user attempts to query any Hive or Impala data source through the Immuta SparkSession, the Immuta catalogs first replace the relation in the user's plan with the proper plan that the data source represents. For example, if the user attempts the query (immuta is an instance of ImmutaSparkSession)

and the customer_purchases data source is composed of this query

and, in Immuta, these columns were selected to expose in this data source

id
first_name
last_name
age
country
ssn
product_id
department
purchase_date

the resulting Spark logical plan would look like this:

After the data source is resolved, the policies specific to the user will be applied to the logical plan. If the policy has masking or filters (row-level, minimization, time filter, etc.), those filters will be applied to all corresponding underlying tables in the plan. For example, consider the following Immuta policies:

Mask using hashing the column ssn for everyone.
Only show rows where user is a member of group that matches the value in the column department for everyone.

The plan would be modified (assume the current user is in the "Toys" and "Home Goods" groups):

In this example, the masked columns (such as ssn) are aliased to their original name after masking is applied. This means that transformations, filters, or functions applied to those columns will be applied to the masked columns. Additionally, filters on the plan are applied before any user transformations or filters, so a user's query cannot modify or subvert the policies applied to the plan.

Immuta does not attempt to change or block optimizations to the Spark Plan via the Catalyst Optimizer.

Query Engine and Spark Difference

Spark policies are applied at the lowest possible level in the Spark plan for security reasons, which may lead to different results when applying policies to a Spark plan rather than a Query Engine plan. For instance, in the Query Engine a user may be able to compute a column and then generate a masking policy on that computed column. For security reasons, this is not possible in Spark, so the query may be blocked outright.

Field Protections via SecurityManager

Immuta has an implementation of the Java SecurityManager construct, which is required when running Spark jobs with the Immuta SparkSession. When a user's Immuta Spark job starts, it communicates with the Immuta Vulcan Service to get an access token, which can be exchanged for partition information during job planning and execution.

The Vulcan Service checks whether the user's job is running with the SecurityManager enabled; if so, it is allowed to retrieve partitions and access tokens during job execution to temporarily access the underlying data for the table. This data is stored in HDFS or a cloud object store (such as S3 or ADL). During job execution, the SecurityManager restricts when file access tokens can be used and which classes can use them. These restrictions prevent users from attempting to access data outside an approved Immuta Spark plan with policies applied.

The SecurityManager also prevents users from making changes to Spark plans that the Immuta SparkSession has generated. This means that once policies have been applied, users cannot attempt to modify the plan and remove policies that are being enforced via the plan modifications.

Vulcan Service

The Vulcan Service serves administrative functions in the Spark ecosystem and is only deployed outside of Databricks. The Service has these major responsibilities in Immuta's Spark ecosystem:

Compute partition information for Immuta Spark Jobs
Service administrative requests for Immuta Hadoop Native Workspaces
Act as a proxy to remote storage (S3, Google Storage, etc.) for Immuta Spark jobs

Compute Partition Information for Immuta Spark Jobs

Immuta users do not have access to the underlying data files (like Parquet or ORC files) for the Hive Metastore tables that make up Immuta data sources on-cluster. For this reason, the user's Spark application cannot generate partition information directly because it cannot read file metadata from HDFS or remote storage.

Consequently, the user's Spark job must request partition information from the Vulcan Service, which must be configured in such a way that it can access all raw data that may be the target of Immuta data sources. This configuration should include

Running the service as a kerberos principal that is specified in HDFS NameNode configuration as the Immuta Vulcan user. If this configuration is incorrect, the service will fail to start, as the service will not have access to the locations in HDFS that it requires. This access is granted dynamically by the Immuta NameNode plugin.
Running the service with S3/Google Storage credentials that have access to the underlying data in remote storage. This configuration should be written in a way that users cannot access the configuration files, but the Vulcan Service user can. Typically this is done by configuring sensitive information in generator.xml on the CLASSPATH for Vulcan and only giving the OS user running the Vulcan service access to that file.

Service Administrative Requests for Immuta Hadoop Native Workspaces

The Vulcan Service serves all native workspace management requests on Hadoop Clusters. These requests include

Workspace creation
Workspace deletion
Derived data source creation from a directory
Determining if directory contains supported files (ORC/Parquet)

The Vulcan Service must have access to create Metastore databases to create Immuta native workspace databases and have access in storage (HDFS is handled via the NameNode plugin) to create directories in the configured workspace locations.

Act as a Proxy to Remote Storage for Immuta Spark Jobs

The Vulcan Service acts as a proxy to remote storage when Immuta Spark jobs read data from Metastore-based data sources. As mentioned above, the Vulcan Service must have access to credentials for reading data from remote storage to fulfill requests from Immuta Spark jobs to read that data. The Vulcan Service acts as a proxy with very minimal overhead when reading from remote storage.

The user must present Vulcan with a temporary access token for any target files being read. These temporary tokens are generated by Vulcan during partition generation and protected by the SecurityManager so that users cannot access them directly. The token presented to Vulcan grants access to the raw data via Vulcan's storage proxy endpoints. Vulcan opens a stream to the target object in storage and passes that stream's content back to the client until they are finished reading.

Note: The client will read all bytes needed from Vulcan, but Vulcan may read more data from storage than the client needed into its buffers. This may produce warning messages in the Vulcan logs but those are expected, as Vulcan cannot predict the number of bytes needed by the client.

Immuta Jobs Co-Located with Non-Immuta Jobs

The way Immuta is deployed allows a cluster to service both Immuta and non-Immuta workloads. Although it is recommended that those workloads are segregated, in many cases that is not feasible. However, because of the way Immuta jobs are executed (outside of Databricks), it is clear when a user is attempting to use Immuta and when they are not because of the immuta- prefixed scripts that are analogous to the out-of-the-box Spark scripts for starting different spark toolsets. (For example, immuta-pyspark instead of pyspark and immuta-spark-submit instead of spark-submit.)

These scripts are required because Immuta packages a full deployment of Spark's binaries to override the target Spark classes needed by Immuta's plugins to operate securely. The immuta- prefixed scripts set up environment variables needed by Immuta to execute properly and set other required configuration items that are not the default global values for Spark.

Note: This does not apply to Databricks. Once a Databricks cluster is Immuta-enabled/configured, Immuta is in the execution path for all jobs, regardless of whether the executing user is an Immuta user.

Configuration Guides

Amazon EMR

Audience: System Administrators
Content Summary: This tutorial will guide you through the process of spinning up an Amazon Elastic Map Reduce cluster with Immuta's Hadoop and Spark security plugins installed.

Deprecation notice

Support for this integration has been deprecated.

Introduction

This tutorial contains examples using the AWS CLI. These examples are conceptual in nature and will require modification to adapt to your exact deployment needs. If you wish to quickly familiarize yourself with Immuta's EMR integration, please visit the .

Supported EMR Versions

This deployment is tested and known to work on the EMR releases listed below.

5.17.0
5.18.0
5.19.0
5.20.0
5.21.0
5.22.0
5.23.0
5.24.0
5.25.0
5.26.0
5.27.0
5.28.0
5.29.0
5.30.0
5.31.0
5.32.0

Create Prerequisite AWS Resources

In addition to the EMR cluster itself, Immuta requires a handful of additional AWS resources in order to function properly.

Immuta Bootstrap Bucket

In order to bootstrap the EMR cluster with Immuta's software bundle and startup scripts, you will need to create an S3 bucket to hold these artifacts.

Immuta Data IAM Role

Immuta's Spark integration relies on an IAM role policy that has access to the S3 buckets where your sensitive data is stored. Note that the EC2 Instance Roles for your EMR cluster should not have access to these buckets. Immuta will broker access to the data in these buckets to authorized users.

Create Immuta Data IAM Policy

Modify the JSON data below to include the correct name of your data bucket(s), and save as immuta_data_iam_policy.json.

If you are leveraging Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder.

Now you can run the following command to create the Immuta IAM user policy.

Create Immuta Data IAM Role

The IAM role that brokers access to S3 data must be able to assume the cluster node instance roles, and vice versa. Since this a cycle, you will need to create both roles with generic trust policies, and then update them after both roles are created.

Create a file called immuta_data_role_trust_policy_generic.json as seen below.

After creating the immuta_data_role_trust_policy_generic.json file from above, run the following command to create the Immuta data IAM role. Note that you will be using the generic IAM role trust policy that you created in the previous step. This will be updated when both the data and instance IAM roles are created.

Next you will need to attach the IAM policy that allows access to your protected data in S3.

Create Immuta Instance IAM Policy

Modify the JSON data below to include the correct name of your bootstrap bucket, and save as immuta_emr_instance_policy.json.

After creating the immuta_emr_instance_policy.json file from above, run the following command to create the Immuta EMR Instance policy.

Create Immuta Instance IAM Role

The node instance IAM role must be able to assume the IAM role that brokers access to S3 data, and vice versa. Assuming you have already created the immuta_emr_data_role, create a JSON file called instance_role_trust_policy.json as shown below.

Now you can create the instance role with the policy document from above.

Next you will need to attach the IAM policy that allows access to required resources for your cluster.

Create Immuta EMR Instance Profile

After creating the role and policy for the Immuta instances, you can create the Immuta EC2 Instance Profile.

After creating the Instance Profile, you can attach the newly created Role.

Update Immuta Data IAM Role Trust Policy

Now that both the data and instance IAM roles are created, you can update the trust policy of the data IAM role to include the instance role.

Create a file called data_role_trust_policy.json as shown below.

Now you can update the trust policy of the data IAM role.

Immuta HDFS System Token in AWS Secrets Manager

Create EMR Cluster

EC2 Attributes Configuration File

Complete the JSON template below and save as ec2_attributes.json. You may remove keys where you would like to use default values.

When choosing security groups for your master and worker nodes, be sure that they provide bi-directional access between the nodes and your Immuta instance.

Cluster Configuration File

Immuta requires a custom configuration file for Hadoop services to be passed in to the cluster. The required configurations are displayed below. Modify the JSON data to match your environment and save as cluster_configuration.json.

Immuta Bootstrap Configuration File

Next, create a file called bootstrap_actions.json to configure the Immuta bootstrap action. If you have any additional bootstrap actions to run outside of Immuta, they should be added here as well.

(Optional) Kerberos Attributes Configuration File

If you wish to deploy a kerberized cluster, create a kerberos_attributes.json file with your desired Kerberos configurations. Note that although not strictly required, a cluster without Kerberos should be considered secure for production.

Security Configuration

You will need to create a security configuration before creating the EMR cluster so that Immuta's EMRFS integration can leverage the IAM role you created to access data in S3.

Next, create your security configuration with the following command.

Create EMR Cluster Command

Finally, you can now spin up an EMR cluster with Immuta's security plugins.

Remove Secrets

To ensure protection of the Immuta user's AWS credentials as well as the kadmin password (if using Kerberos), it is recommended to overwrite the secret values that were created during cluster deployment process. If you leave the secret values in AWS Secrets Manager, cluster users may be able to assume the instance role of the EMR nodes and read these values.

It is safe to remove these values after the cluster has finished bootstrapping. The example below overwrites the relevant secrets with null values.

Note that if you are using an external KDC without a cross-realm trust (no KDC on the cluster), you should put the kadmin password back into the immuta-kerberos-secret. This is required to clean up the Immuta services principals that will have been created on the external KDC.

Quickstart

Audience: System Administrators
Content Summary: This simple deployment guide familiarizes users with Immuta on EMR. This guide is only meant to be deploy clusters for non-production purposes, such as demos or proof-of-concept. For more robust deployments, please see the for Immuta on EMR.

Deprecation notice

Support for this integration has been deprecated.

Installation Prerequisites

AWS Resources

(v1.16.x or greater) installed in a bash environment.
- The CLI should be configured to use a role that is able to fully manage EMR, IAM, and S3 resources. This can be a user role in a local environment or an instance role on an EC2 instance.
Resource IDs for your chosen and .
- Be sure that your master and worker security groups are configured for bi-directional communication with your Immuta instance.

Immuta Resources

An instance of Immuta that is reachable from your chosen AWS VPC.
A username and password for the Immuta archives site. You can get these from your Immuta support professional.

Run the Immuta EMR Quickstart Script

First, download the quickstart script:

Next, run the script. Note that you will be prompted for input variables. If a variable is not required, you can press enter to use the displayed default value.

See below for an example of the script being run and prompting for variables. Note that any input in the example is simply for demonstration purposes; you will need to provide your own values.

Input Variables

The immuta-emr-quickstart.sh script will prompt the user for input variables to configure the AWS resources required for the cluster. These variables are represented by the environment variables listed below. Exporting these environment variables prior to running the script will skip the prompts.

CLUSTER_NAME
- Optional. The name of the EMR cluster to be created.
- Default: immuta-quickstart.
EMR_VERSION
- Optional. The EMR version of the cluster. Current supported versions are 5.17.0 - 5.23.0.
- Default: 5.23.0.
IMMUTA_VERSION
- Optional. The full Immuta version to be installed on the cluster.
- Default: 2024.1.13_20240624.
IMMUTA_INSTANCE_URL
- Required. The URL of the Immuta instance that will drive policies on the cluster.
AWS_REGION
- Optional. The AWS Region that the cluster will run in.
- Default: us-east-1.
INSTANCE_COUNT
- Optional. The number of instances (master + worker) in the cluster.
- Default: 3.
INSTANCE_TYPE
- Optional. The type of instance for cluster nodes.
- Default: m5.xlarge.
AWS_KEY_NAME
- Required. The name of the SSH keypair in AWS that will be used to connect to the cluster.
AWS_SUBNET_ID
- Required. The ID string of the subnet that the cluster will run in.
SERVICE_SECURITY_GROUP
- Required. The ID string of the security group for the cluster's EMR services.
MASTER_SECURITY_GROUP
- Required. The ID string of the security group for the cluster's master node.
WORKER_SECURITY_GROUP
- Required. The ID string of the security group for the cluster's worker nodes.
ARCHIVE_USERNAME
- Required. The username for Immuta Archives.
ARCHIVE_PASSWORD
- Required. The password for Immuta Archives.
BOOTSTRAP_BUCKET
- Optional. The S3 bucket where bootstrap artifacts will be stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
- Default: immuta-emr-bootstrap-$AWS_ACCOUNT_ID-$AWS_REGION.
DATA_BUCKET
- Optional. The S3 bucket where partitioned data is stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
- Default: immuta-emr-data-$AWS_ACCOUNT_ID-$AWS_REGION.
KADMIN_PASSWORD
- Optional. The Kerberos admin password that will be used to create Kerberos principals on the cluster's dedicated internal KDC.
- Default: random.
HDFS_SYSTEM_TOKEN
- Default: random.

Post-installation

Copy Kerberos Resources to Immuta Instance

Associate Quickstart Principals with Immuta Users

The quickstart bootstrap automatically seeds the cluster with three user principals for you to use while familiarizing yourself with the Immuta platform and data policies: owner, consumer1, and consumer2. The default Kerberos password for these users is immuta-quickstart.

Cloudera Hadoop

Audience: System Administrators
Content Summary: Installation of the components necessary for the use of the Immuta Hadoop Integration depends on the version of Hadoop. This section contains guides for installing Cloudera Hadoop.

Section Contents

: Outlines the prerequisites required to successfully use installation components on your CDH cluster.
: Describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.
: By default, the Immuta Partition servers will run as the immuta user. For clusters configured to use Kerberos, this means that you must have an immuta principal available for Cloudera Manager to provision the service. If for some reason you do not have an immuta principal available, you can change the user that the Immuta partition servers run as. This page describes the configuration changes that are needed to change the principal(s) that Immuta uses.
: Details how to use the immuta_hdfs_log_analyzer tool to troubleshoot slowdowns in your CDH cluster.
: Details how to upgrade the Immuta Parcel and Service on your CDH cluster.
: Outlines steps to effectively disable and/or uninstall the Immuta components from your CDH cluster.

Prerequisites

Audience: System Administrators
Content Summary: The Immuta CDH integration installation consists of the following components:
Immuta NameNode plugin
Immuta Hadoop Filesystem plugin
Immuta Spark 1.6 Partition Service (DEPRECATED)
Immuta Spark 2 Partition Service
This page outlines the prerequisites required to successfully use these components on your CDH cluster.

This installation process has been verified to work with the following CDH versions:

5.9.x
5.12.x
5.13.x
5.14.x
5.15.x
5.16.x
6.1.x
6.2.x
6.3.x

Set Up

Before installing Immuta onto your CDH cluster, the following steps need completed:

Enable HDFS Extended Attributes

Immuta requires that HDFS Extended Attributes are enabled.

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

and, ensure the Checkbox is checked.

Generate an Immuta System API Key

Export Cluster Configuration (Optional but Recommended)

Before sending the exported JSON file, it is recommended to look over the configurations and redact any information that you consider too sensitive to share externally. Cloudera Manager will automatically redact known passwords; however, there may be sensitive values embedded in your configuration that Cloudera Manager does not know about. An example of this may be configuration of a third-party cluster application that requires passwords or API keys in its cluster configuration.

Download the Immuta Parcel and CSD Artifacts

Needed Artifacts

Begin by downloading the Immuta Parcel and CSD for your Cloudera Distribution. A complete installation will require 3 files:

IMMUTA-<VERSION>_<DATESTAMP>-<CDH_VERSION>-spark2-public-<LINUX_DISTRIBUTION>.parcel
- The .parcel file is the Immuta CDH parcel.
- For versions that support it, Spark 1 is included in this parcel.
IMMUTA-<VERSION>_<DATESTAMP>-<CDH_VERSION>-spark2-public-<LINUX_DISTRIBUTION>.parcel.sha
- The .parcel.sha file contains a SHA1 hash of the Immuta .parcel file for integrity verification by Cloudera Manager.
IMMUTA-<VERSION>_<DATESTAMP>-<CDH_VERSION>-spark2-public.jar
- The .jar file is the Custom Service Descriptor (CSD) for the Immuta service in Cloudera Manager.

The variables above are defined as:

<VERSION> is like "2024.1.13"
<DATESTAMP> is the compiled date in the format "YYYYMMDD"
<CDH_VERSION> must match your CDH version, like "5.16.2"
<LINUX_DISTRIBUTION> is either "el7" or "el6".

The Immuta Archives Site

All artifacts are divided up by subdirectories in the form of[Immuta Release]/[CDH Version].

Installation

Audience: System Administrators
Content Summary: The Immuta CDH integration installation consists of the following components:
Immuta NameNode plugin
Immuta Hadoop Filesystem plugin
Immuta Spark 2 Vulcan service
This page outlines the installation steps required to successfully deploy these components on your CDH cluster.
Prerequisites: Follow the to prepare for installation.

Installation

Begin installation by transferring the Immuta .parcel and its associated .parcel.sha files to your Cloudera Manager node and placing them in /opt/cloudera/parcel-repo. Once copied, ensure files have both their owner and group permissions set to cloudera-scm

Next, transfer the Immuta CSD (.jar file) to /opt/cloudera/csd, and ensure both its owner and group permissions are set to cloudera-scm as well.

You will need to restart the Cloudera Manager server in order for the CSD to be picked up:

Follow Cloudera's instructions for distributing and activating the IMMUTA parcel.

Once the parcel has been successfully activated, you can add the IMMUTA service:

From the Cloudera Manager select Add Service.
Choose Immuta.
Click Continue.
Select nodes to install the services on. Your options are
- For maximum redundancy, choose all.
- Choose a single node.
- Choose a few nodes. Set up a Load Balancer in front of the instances to distribute load. Contact Immuta support for more details.
Proceed to the end of the workflow.

Configure HDFS

After adding the Immuta service to your CDH cluster, there is some configuration that needs to be completed.

NameNode-Only Configuration

Warning

The following settings should only be written to the configuration on the NameNode. Setting these values on DataNodes will have security implications, so be sure that they are set in the NameNode only section of Cloudera Manager. For optimal performance, only set these configuration options in the NameNode Role Config Group that controls the namespace where Immuta data resides.

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

and, using "View as XML", add/set the value(s) similar to:

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Shared Configuration

The following configuration items should be configured for both the NameNode processes and the DataNode processes. These configurations are used both by the Immuta FileSystem and the Immuta NameNode plugin. For example:

Under the HDFS service of Cloudera Manager, Configuration tab, search for key:

and, using "View as XML", add/set the value(s) similar to:

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Make sure that user directories underneath immuta.credentials.dir are readable only by the owner of the directory. If the user's directory doesn't exist and we create it, we will set the permissions to 700.

Enable TLS for the Immuta Vulcan Service

You can enable TLS on the Immuta Vulcan service by configuring it to use a keystore in JKS format.

Server-side TLS Configuration

Under the Immuta service of Cloudera Manager, Configuration tab, search for key:

and, using "View as XML", add/set the value(s) similar to:

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Detailed Explanation:

immuta.secure.partition.generator.keystore
- Specifies the path to the Immuta Vulcan service keystore.
- Example: /etc/immuta/keystore.jks
immuta.secure.partition.generator.keystore.password
- Specifies the password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file.
- Example: secure_password
immuta.secure.partition.generator.keystore.password
- Specifies the password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file.
- Example: secure_password
immuta.secure.partition.generator.keymanager.password
- Specifies the KeyManager password for the Immuta Vulcan service keystore. This password will be a publicly available piece of information, but file permissions should be used to make sure that only the user running the service can read the keystore file. This is not always necessary.
- Example: secure_password

Best Practice: Secure Keystore with File Permissions

Immuta recommends using file permissions to secure the keystore from improper access:

Client-side TLS Configuration

You must also set the following properties under the following client sections:

For Spark 2, under the Immuta service of Cloudera Manager, Configuration tab, search for key:

and, using "View as XML", add/set the value(s) similar to:

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Detailed Explanation:

immuta.secure.partition.generator.keystore
- Set to true to enable TLS
- Default: true

Impala Configuration

You must give the service principal that the Immuta Web Service is configured to use permission to delegate in Impala. To accomplish this, add the Immuta Web Service principal to authorized_proxy_user_config in the Impala daemon command line arguments.

Under the Impala service of Cloudera Manager, Configuration tab, search for key:

and add/set the value(s) similar to:

If the authorized_proxy_user_config parameter is already present for other services, append the Immuta configuration value to the end:

Spark 2 Configuration

No additional configuration is required.

Note: Immuta will work with any Spark 2 version you may have already installed on your cluster.

Immuta Vulcan Service Configuration

The Immuta Vulcan service requires the same system API key that is configured for the Immuta NameNode plugin. Be sure that the value of immuta.system.api.key is consistent across your configuration.

For Spark 2, under the IMMUTA service of Cloudera Manager, Configuration section, search for key:

and, using "View as XML", add/set the value(s) similar to:

Best Practice: Configuration Values

Immuta recommends that all Immuta configuration values be marked final.

Immuta Web Service Configuration

Though generally unnecessary given the configuration through the Application Settings of the Web UI, below is an example YAML snippet that can be used as an alternative to the Immuta Configuration UI if recommended by an Immuta representative.

Detailed Explanation:

client
- kerberosRealm
  - Specifies the default realm to use for Kerberos authentication.
  - Example: YOURCOMPANY.COM
plugins
- hdfsHandler
  - hdfsSystemToken
    Token used by NameNode plugin to authenticate with the Immuta REST API. This must equal the value set in immuta.system.api.key. Use the value of HDFS_SYSTEM_TOKEN generated earlier.
    Example: 0ec28d3f-a8a2-4960-b653-d7ccfe4803b3
kerberos
- ticketRefreshInterval
  - Time in milliseconds to wait between kinit executions. This should be lower than the ticket refresh interval required by the Kerberos server.
  - Default: 43200000
- username
  - User principal used for kinit.
  - Default: immuta
- keyTabPath
  - The path to the keytab file on disk to be used for kinit.
  - Default: /etc/immuta/immuta.keytab
- krbConfigPath
  - The path to the krb5 configuration file on disk.
  - Default: /etc/krb5.conf
- krbBinPath
  - The path to the Kerberos installation binary directory.
  - Default: /usr/bin/

Performance Optimization

Audience: System Administrators
Content Summary: This page describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.

Overview

Immuta operates within a locked operation in the NameNode when granting / denying permissions based on Immuta policies. This section contains configuration and strategies to prevent RPC queue latency, threads waiting, or other issues on cluster-wide file permission checks.

Deployment Architecture

Isolated HDFS Namespace

Best Practice: NameNode Plugin Configuration

Immuta recommends only configuring the NameNode Plugin to check permissions on the NameNode(s) that oversee the data that you want to protect.

For example, say that you currently have a federated HDFS NameNode architecture with three Nameservices - nameservice1, nameservice2, and nameservice3. The HDFS federation in this example is distributed across these nameservices as described below.

nameservice1: /data, /tmp/, /user
nameservice2: /data2
nameservice3: /data3

Suppose you know that all the sensitive data that you want to protect with Immuta is located under /data3. To achieve optimum performance in this case, you can go ahead and add the Immuta NameNode-only configuration (hdfs-site.xml) to the role config group for nameservice3, and leave it out of nameservice1 and nameservice2. The public / client Immuta configuration (core-site.xml) should still be configured cluster-wide. See for more details about these configuration groupings.

One caveat to take into consideration here is that Immuta's Vulcan service requires the Immuta NameNode Plugin to oversee user credentials that are stored in /user/<username> by default. Vulcan also stores some configuration under /user/immuta by default. This is a problem because /user resides under nameservice1, and the goal is to only operate the Immuta NameNode Plugin on nameservice3.

A simple solution to this problem is to create a new directory for these credentials, /data3/immuta_creds for example, and configure the NameNode Plugin and the Vulcan service to use this directory instead of /user. Changing this requires the configuration modifications listed below.

HDFS - Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
- Set immuta.generated.api.key.dir and immuta.credentials.dir to /data3/immuta_creds.
Immuta - Immuta Spark 2 Vulcan Server Advanced Configuration Snippet(Safety Valve) for session/generator.xml
- Set immuta.meta.store.token.dir to /data3/immuta_creds/immuta/tokens.
- Set immuta.meta.store.remote.token.dir to /data3/immuta_creds/immuta/remotetokens.
- Set immuta.configuration.id.file.config to hdfs://nameservice3/data3/immuta_creds/immuta/config_id.

Note that you will need to manually create the /data3/immuta_creds/immuta directory and set the permissions such that only the immuta user can read / write in that directory. The /data3/immuta_creds directory should also be world writable to allow user directories to be created the first time that they interact with Immuta on the cluster.

Configuration

Essential Performance Tuning Settings

immuta.permission.paths.to.enforce
- Description: A comma delimited list of paths to enforce when checking permissions on HDFS files. This ensures that API calls to the Immuta web service are only made when permissions are being checked on the paths that you specify in this configuration. This also means that you can only create data sources against data that lives under these paths, and the Immuta Workspace must be under one of these paths as well. Alternatively, immuta.permission.paths.to.ignore can be set to a list of paths that you know do not contain Immuta data - then API calls will never be made against those paths. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce properties at the same time is unsupported.
immuta.permission.groups.to.enforce
- Description: A comma delimited list of groups that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, then fallback authorizations will apply to everyone by default, unless they are in a group on this list. If a user is on both the enforce list and the ignore list, then their permissions will be checked with Immuta (i.e., the enforce configuration item takes precedence). This may improve NameNode performance by only making permission check API calls for the subset of users who fall under Immuta enforcement.
immuta.permission.source.cache.enabled
- Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources. For performance optimization, it is best to enable this cache to act as a "backup" to immuta.permission.paths.to.enforce.
immuta.permission.source.cache.enabled
- Description: The time between calls to sync/cache all paths that back Immuta data sources in HDFS. You can increase this value to further reduce the number of API calls made from the NameNode.
immuta.permission.workspace.base.path.override
- Description: This configuration item can be set so that the NameNode does not have to retrieve the Immuta HDFS workspace base path periodically from the Immuta API.

Debugging Suspected Performance Issues

Upgrade Cloudera Hadoop

Audience: System Administrators
Content Summary: This page details how to upgrade the Immuta Parcel and Service on your CDH cluster.
Prerequisites: Follow the to prepare for upgrading.

Upgrade the Parcel

Transfer the Immuta .parcel and its associated .parcel.sha to your Cloudera Manager node and place them in /opt/cloudera/parcel-repo. Once copied, ensure files must have ownership cloudera-scm and group cloudera-scm.

Once the Immuta parcel and its SHA (hash) file are in the parcel repo, you can distribute and activate the updated parcel. (Activating the new parcel will automatically deactivate an older version.) To do so,

In Cloudera Manager, select the Parcels icon in the upper right corner.
Click Check for New Parcels.
Make sure the location filter has your on-cluster parcel repo selected.
Locate the IMMUTA parcel, and then find the row corresponding to the version you are upgrading to. Click Distribute.
Wait for the parcel to finish distribution. Once finished, the action button for that row should say Activate.
Click the Activate button to activate the parcel.

You have successfully upgraded your Immuta parcel.

Upgrade the Immuta Partition Service

The first step in upgrading your Immuta Partition Service CSD is copying the .jar file to your Cloudera Manager node, placing it in /opt/cloudera/csd. The file must have ownership cloudera-scm and group cloudera-scm.

You will need to restart Cloudera Manager in order for the CSD to be picked up:

Finally, restart the IMMUTA service in Cloudera Manager.

Disable/Uninstall Cloudera Hadoop

Audience: System Administrators
Content Summary: This page outlines steps to effectively disable and/or uninstall the Immuta components from your CDH cluster. The disable portions of this document detail how to deactivate the Immuta components without removing the components. For a complete uninstall, follow these steps and then proceed to remove all Immuta-related settings, configuration, and any Immuta Kerberos principals from your cluster.

NameNode

These changes will require a cluster restart

The changes detailed below affect HDFS; therefore, a cluster restart is required to fully implement these changes.

Steps to Disable

The Immuta Authorization Provider must be removed from the NameNode configuration.

Navigate to the Cloudera Manager Overview page.
Click on the HDFS service.
Click on the Configuration tab.
In the search bar, enter
Click on the minus [-] sign that appears on the right of the entry corresponding to dfs.namenode.authorization.provider.class. This will restore to the CDH default.
Click the Save Changes button at the bottom of the screen.

Steps to Uninstall

Warning

You may have non-default settings that are completely unrelated to Immuta! You may also have non-default settings that are currently related to Immuta that will need to be altered to another non-default custom setting specific to your installation. Your CDH Admins will know which settings this applies to. Do not blanket revert settings to their defaults unless you are certain the CDH defaults are appropriate for your cluster.

To uninstall, instead of only reverting the Immuta Authorization Provider, all Immuta customized settings can be removed from the NameNode configuration.

Navigate to the Cloudera Manager Overview page.
Click on the HDFS service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
All settings under
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of HDFS (Service-Wide).
All settings under
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking revert arrow icon to the right of NameNode Default Group.
Click the Save Changes button at the bottom of the screen.

YARN

If fully uninstalling, Immuta's components need to be removed from YARN's classpath.

These changes will require a cluster restart

The changes detailed below affect HDFS; therefore, a cluster restart is required to fully implement these changes.

Steps to Uninstall

Navigate to the YARN service.
Click on the Configuration tab.
In the search bar, enter
Click on the minus [-] sign that appears on the right of any entries that reference IMMUTA. For example, there may be records for jars such as immuta-group-mapping.jar or immuta-hadoop-filesystem.jar or similar.
Click the Save Changes button at the bottom of the screen.

Hive

These settings may be applied either system-wide (via core-site.xml) or to specific target systems such as Hive or Impala. Be sure to locate all setting locations.

These changes will require a Hive service restart

The Hive service will need to be restarted for the changes below to take effect.

Steps to Disable

Navigate to the Hive service.
Click on the Configuration tab.
In the search bar, enter
Click on the minus [-] sign that appears to the right of the entry corresponding to hadoop.security.group.mapping. This will restore to the CDH default.
Click the Save Changes button at the bottom of the screen.

Steps to Uninstall

Warning

Navigate to the Hive service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
All settings under
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of HiveServer2 Default Group.
Click the Save Changes button at the bottom of the screen.

Impala

These settings may be applied either system-wide (via core-site.xml) or to specific target systems such as Hive or Impala. Be sure to locate all setting locations.

These changes will require an Impala service restart

The Impala service will need restarted in order for the changes below to take effect.

Steps to Disable

Navigate to the Impala service.
Click on the Configuration tab.
In the search bar, enter
Click on the minus [-] sign that appears on the right of the entry corresponding to hadoop.security.group.mapping. This will restore to the CDH default.
Click the Save Changes button at the bottom of the screen.

Steps to Uninstall

Warning

Navigate to the Impala service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
The "immuta" proxy user from
can be removed. Simply delete the "immuta=*" (and any leading or trailing ;) from the -authorized_proxy_user_config= value, leaving any other values in place. It may also be done by clicking the revert arrow icon to the right of Impala (Service-Wide) if the default is appropriate.
All settings under
can be reverted. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon to the right of Impala Daemon Default Group.
If using Kerberos principal short names was only done in support of ImmutaGroupsMapping for use in native workspaces, that setting can also be reverted. In the search bar, enter
Simply uncheck the checkbox to the left of "Impala (Service-Wide)".
Click the Save Changes button at the bottom of the screen.

Spark

These changes will require a Spark service restart

The Spark service will need to be restarted for the changes below to take effect.

Steps to Uninstall

Navigate to the Spark service.
Click on the Configuration tab.
In the search bar, enter
Remove any references to IMMUTA or "immuta" in the configuration options. Particularly look for the options defined in Spark 1.6 Configuration.
Then go back to the search bar, and enter
Remove any references to IMMUTA or "immuta" in the environment variables. Particularly look for the environment settings defined in Spark 1.6 Configuration.
Click the Save Changes button at the bottom of the screen.

Sentry

If your installation leveraged the Immuta HDFS Native Workspace and ImmutaGroupsMapping, Immuta was likely configured as a Sentry admin. When uninstalling, this can be removed.

These changes will require a Sentry service restart

The Sentry service will need to be restarted for the changes below to take effect.

Steps to Uninstall

Warning

Navigate to the Sentry service.
Click on the Configuration tab.
Near the bottom of the left side navigation pane, select Non-Default. This will list all settings that are not presently set to the defaults.
The "immuta" user can be removed from any place specified, but particularly the
should be removed. Click the minus [-] sign that appears on the right of the individual entries, or - if you are certain your cluster should operate on the CDH defaults - all settings can be reverted by clicking the revert arrow icon.
Click the Save Changes button at the bottom of the screen.

Immuta Immuta Partition/"Vulcan" Service

Steps to Disable

1 - Stop the Immuta Partition/"Vulcan" Service

Navigate to the Cloudera Manager Overview page.
Click on the down arrow next to the IMMUTA service.
Click Stop.
Confirm that you want to stop the service.

2 - Remove the Immuta Service

Navigate to the Cloudera Manager Overview page.
Click on the down arrow next to the IMMUTA service.
Click Delete.
Confirm that you want to delete the service.

Steps to Uninstall

Complete both 1 and 2 in the previous "Disable" section.

3 - Deactivate and Remove the Immuta Parcel

You may need to restart the cluster before you can fully remove these parcels

If the parcel was in active use, a cluster restart is likely needed before Cloudera Manager will let you do the following steps to remove and delete these parcels.

Navigate to the Cloudera Manager Overview page.
Click on the package icon on the top right hand side of the page near the search bar.
Find the "Distributed, Activated" Immuta Parcel(s) and click the Deactivate button.
Click Confirm.
Once deactivated, go back to the Immuta Parcels(s) and select the "down arrow" beside the "Activate" button, and select Remove from Hosts.
Click Confirm.
Once not distributed, go back to the Immuta Parcels(s) and select the "down arrow" beside the "Distribute" button, and select Delete.
Click Delete.

Restart the Cluster

To commit all previous settings, issue a restart of the CDH cluster.

Glossary

Attributes

Called "authorizations" prior to Immuta's 2.7 release, attributes are custom tags that can be added to a user to restrict what data the user can see. When creating a policy on a data source, Data Owners can apply the policy to any user that possesses an attribute. Attributes can be added manually as well as mapped in from an .

Blob Handler

Blob handlers are the tools used to access the backend data platform and stream blobs of data through Immuta.

Data Attributes

A data attribute is data about your data. These attributes can be used to match against to decide if a row/object should be visible to a given user. This matching is usually done between the data attribute and .

Data attributes are typically part of the data being exposed as a column or metadata attribute. For example, a may have a column called access, which is used in policy logic to match against a user attribute to determine if they can see the given row.

Data Dictionary

The Data Dictionary provides information about columns within a , including column names and types, and users subscribed to the data source can comment on the Data Dictionary.

Dictionary columns are generated automatically when the data source is created if the remote data platform supports SQL. Otherwise, the Data Owner or can create the entries manually.

Data Experts

Data Experts are those who are knowledgeable about the data source data and can elaborate on it. They are responsible for managing the data source's documentation and the .

Data Source

A data source is how you virtually expose data across your enterprise to Immuta consumers. When you expose a data source you are not copying the data; you are using metadata from the data source to tell Immuta how to expose it. No raw data is moved to an end user (or into the Immuta cache) until it is fetched by that user. The Immuta caching layer is configurable to reduce load on your exposed databases, and with the cost of RAM dropping, building a virtual data lake with desired data flowing in and out through the Immuta caching layer will reduce infrastructure cost, database load, and data latency.

From a technical perspective, a data source is an abstraction to data living in a remote data storage technology. When you expose a data source, it becomes an authoritative view to that remote data without having to pass around connection strings or API guides. Policy enforcement and access is maintained through Immuta based on the settings provided by the data source creator, who is known within Immuta as the Data Owner. Once exposed and subscribed to, the data will be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and sharing.

Data Source Minimization

Minimization policies expose a percentage of the data source to querying users. This percentage is configurable by the Data Owner and is based on a column with high cardinality.

Data Visibility

As metadata for blobs/rows is ingested into Immuta, the data is tagged with a visibility marking which is an arbitrary JSON object that the Data Owner defines. The visibility for data sources can be prescribed by selecting one or many columns to use as the visibility.

Fingerprinting

Data Fingerprints capture summary statistics from data sources so that the user can view how that data changes over time or how the data changes when policies affecting that data source are changed.

Immuta pulls a sample of a data source through a Postgres proxy and it exists, temporarily, in the fingerprint container. Immuta then distills this data down to a series of summary statistics and pushes those statistics back to the Immuta Metadata Database. Those summary statistics of a data source are captured when a data source is created, when a policy is applied or changed, or when a user manually updates the data source fingerprint from the Policies tab. The user can then track changes in that data.

Groups

Handlers

Identity Managers

Identity managers (IAMs) authenticate Immuta users and control their access to data. Out of the box, Immuta supports several configurable identity managers:

Immuta Identity Manager (Built-in)
Active Directory
LDAP
PKI
OAuth2
Okta (SAML)

Immuta also offers support for custom IAM plugins, so you can use the Immuta API to implement your own identity manager.

Integration

This term refers to how Immuta users can consume and interact with data through Immuta. Accessing data through Immuta ensures that users are only consuming policy-controlled data with thorough auditing.

Permissions

Permission Details

APPLICATION_ADMIN: Gives the user access to administrative actions for the configuration of Immuta. These actions include
- Adding external IAMs.
- Adding ODBC drivers.
- Adding external catalogs.
- Configuring email settings.
USER_ADMIN: Gives the user access to administrative actions for managing users in Immuta. These include
AUDIT: Gives the user access to the audit logs.
CREATE_DATA_SOURCE_IN_PROJECT: Gives the user the ability to create data sources within a project.
CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE: When creating an S3 data source, this allows the user to the handler to assume an AWS Role when ingesting data.
CREATE_FILTER: Gives the user the ability to create and save a search filter.
FETCH_POLICY_INFO: Gives the user access to an endpoint that returns visibilities, masking information, and filters for a given data source.
IMPERSONATE_HDFS_USER: When creating an HDFS data source, this allows the user to enter any HDFS user name to use when accessing data.

Policies

Policies are fine-grained security controls Data Owners apply when creating data sources. Columns can be masked and rows hidden, or certain blobs of data can be hidden from certain users and particular fields in the content of the blobs can be masked, if the blob is a known format. The creator of the data source determines the logic behind what is hidden from whom, and the logic can be as complex as desired.

Policy Handler

Projects

Projects are logical groupings of data, members, and discussions based on business goals. Projects can also capture the purpose of the work and audit data access.

Project Purposes

Purpose-Based Restrictions

Sensitive Data Discovery

Deprecation notice: Support for this feature has been deprecated.

This feature can be enabled by an Application Admin to automatically identify and tag columns that contain sensitive data when a new data source is created. The Immuta application is pre-configured with a set of Discovered tags that can be used to write Global Policies proactively.

Subscription

Users can subscribe to a data source by requesting access through the Immuta UI or be added to the data source by the Data Owner.

Subscription Policies

A Subscription Policy refers to how open a data source is to potential subscribers and can have one of four possible restriction levels:

Anyone: Users will automatically be granted access (Least Restricted).
Anyone Who Asks (and is Approved): Users will need to request access and be granted permission by a configured approver (Moderately Restricted).
Users with Specific Groups/Attributes: Only users with the specified groups/attributes will be able to see the data source and subscribe (Moderately Restricted).
Individual Users You Select: The data source will not appear in search results, Data Owners must manually add/remove users (Most Restricted).

Time-Based Restrictions

Time-based restrictions only expose data within a defined time range, which is set by the Data Owner and is based on the event time column of the data source.

User Attributes

User attributes are used to drive data source policies as well as to give users access to certain Immuta features.

Immuta Query Engine

Audience: Data Users
Content Summary: This page explains the Immuta Query Engine.

Data Sources

Once subscribed to a data source, one of the mechanisms the user has for accessing data in that source is the Immuta SQL connection. The connection is a regular PostgreSQL connection, and the data sources look like PostgreSQL tables, but these tables abstract the true underlying database technology, allowing users to visit a single place for all of their data. However, the data still resides in its original location, not within Immuta. When a user queries the Immuta database, the query is transformed to sync with the underlying data platform for any data source. When the data is returned from the original location, Immuta applies policies based on the querying user's attributes and forwards the data on to that user.

Policies

All policy types are supported by the Immuta Query Engine. See the Subscription Policies or Data Policies overview for details about policy types.

Analytic Tools

Users can add their SQL connection to their analytic tools, such as Excel, Tableau, RStudio, etc. to query protected data through the Query Engine.

Once you’ve hooked in your BI tool and have subscribed to data sources in Immuta, you will see available tables in the Immuta database — these tables are the exposed Immuta data sources. All of the Immuta data sources look and feel just like PostgreSQL tables (when they are actually a proxy for the real data source). This means you can execute cross-database technology queries since to you, they are just PostgreSQL tables.

You will only be allowed to run queries on data sources you are subscribed to and against data you have appropriate entitlements to view.

Projects

To access data sources in the context of a project, Immuta users can also obtain unique SQL credentials for each project that they are a member of.

These credentials will only provide access to the data sources in their respective projects and allow Data Governors to enforce purpose-based restrictions on project data.

See Creating Project Based SQL Connections for more information.

Managing SQL Accounts

Audience: Data Users
Content Summary: This page outlines how to create your SQL account, view your SQL connection information, update your password, and create SQL connections for projects.

Creating an Immuta User SQL Account

Navigate to your .
Click on the SQL Credentials tab. Then Create Account.
Fill out the required fields following these guidelines:
- Usernames must be unique and can only consist of lowercase letters, numbers, and underscores ( _ ).
- Usernames cannot start with a number and can be no longer than 63 characters.
- Passwords should not contain braces ( { ).
Your SQL password will never be displayed in the Immuta console, so be sure to store it in a secure location.
Click the Create button at the bottom of the center pane.

Note: Rather than requiring users to enter this user/password for authentication when accessing a data source, System Administrators may force users to use PKI or LDAP authentication.

Viewing SQL Connection Information

Assuming an Immuta SQL Account has already been created, navigate to your .
Click on the SQL Credentials tab.
Click the dropdown menu button in the top right corner of the page, and then select the Copy to Clipboard button to copy the SQL connection information.

This connection can be used to connect your favorite BI tool to the Immuta database, where all the Immuta Data Sources you’re subscribed to will be available to you as tables.

Updating an Immuta user SQL password

Click on the SQL Credentials tab.
Click the dropdown menu button in the top right corner of the page, and then select the Change Password button.
Change the password and click Save.

Creating Project Based SQL Connections

Navigate to the Project details page.
Click SQL Connection in the right menu under Credentials.
A modal window will display with the requested connection information. Please make sure to store these credentials somewhere secure. If you misplace them, you will have to generate a new account and re-authenticate all services connected to Immuta via this account.
When done, click the Close button.

Project SQL accounts are unique to each project, and only provide access to the data sources in that project. Note that project SQL credentials cannot be retrieved from Immuta if they are lost. Credentials can only be re-generated using the instructions above. When a user generates new SQL credentials for a project, any existing SQL credentials for that project the user may have had are revoked.

Query Engine Authentication

Audience: System Administrators
Content Summary: By default, users authenticate with the Query Engine using credentials that they create in their Immuta profile.
It is possible to configure Immuta so that users authenticate with the Query Engine using external systems such as LDAP, Kerberos, or PKI. Any valid should be possible, though not all have been tested. In order to use external authentication methods with the Query Engine, you must configure an IAM system with the supported action linkPostgresAccount. When the IAM is configured with linkPostgresAccount, Immuta attaches a special role to the user in the Query Engine of the format <IAM ID>_user. For example, given the following IAM configuration,
Immuta creates accounts in the Query Engine for users belonging to that IAM and assign the role myOrgLDAPIAM_user to them.
This page describes authentication methods that are fully supported by Immuta.

Query Engine Authentication Methods

Each Query Engine authentication method outlined below makes use of the IAM-specific role to target users for authentication. The configuration that is added to pg_hba.conf needs to come at the beginning of the file before the catch-all hostssl immuta all 0.0.0.0/0 md5 that is used to authenticate users using the built-in SQL account management.

Built-in Authentication

The built-in Query Engine authentication does not require any additional configuration. Users authenticate using a username and password configured on their Immuta profile page.

See for more information on how these accounts are managed.

LDAP Authentication

Users can authenticate with the Query Engine using their LDAP IAM credentials, but the Query Engine must be configured using the PostgreSQL ldap authentication method.

Simple Bind

In this method, a DN pattern is specified as a username prefix and suffix. When a user makes an authentication attempt, the prefix, username, and suffix are concatenated and used as DN along with the supplied password to bind as the user. On a successful bind the user authentication to the Query Engine succeeds.

Simple Bind Example:

Search and Bind

In this method, PostgreSQL performs an LDAP search using the given ldapbasedn and ldapsearchattribute with the given username to find the user DN to bind with. This method can be useful when users exist in multiple OUs such that a single prefix and suffix will not satisfy all valid user DN patterns.

Search and Bind Example:

When using search and bind, a bind username and password may be required if your LDAP server does not allow anonymous search.

Kerberos Authentication

Users can authenticate with the Query Engine using Kerberos credentials by using the PostgreSQL gss authentication method.

Configuring the Query Engine with Kerberos

Before users can authenticate against the Query Engine using Kerberos, the PostgreSQL configuration must be updated with a keytab file. Kerberos principals must be generated for postgres/<host>@<REALM> for each Query Engine server and any replication load balancers that may be in use.

Generate a keytab for these service principals, copy it to each Query Engine host, and set the path to the keytab in postgresql.conf as krb_server_keyfile. For example,

Be sure that the keytab file is owned by the immutaqe user.

Configuring User Authentication for Kerberos

The configuration for Kerberos authentication should always have include_realm=0, and the krb_realm will need to be set. When users connect to the Query Engine they will need to present valid Kerberos credentials. If the Kerberos credentials match an Immuta user, authentication to the Query Engine succeeds.

PKI Authentication

PKI authentication makes use of certificates signed by a trusted authority to perform authentication.

An example pg_hba.conf configuration is below.

By default, the cn attribute from the certificate will be compared to the database username. If the Immuta user ID is something else, such as an email address, then

Define a user mapping in pg_ident.conf:
Then, in order to use the mapping in pg_hba.conf, add map=pki-map:

Query Engine User Impersonation

Audience: Users with the IMPERSONATE_USER permission and System Administrators
Content Summary: This page outlines how to use the IMPERSONATE_USER permission for Immuta instances using the .

Query Engine Must Be Enabled

If the on the App Settings page, Query Engine User Impersonation will be unavailable.

Overview

There are two general use cases for user impersonation:

The Project Path: The user wants multiple users to use the same dashboard and needs everyone to see the same data. An Immuta project is created and equalized. Then it is exposed to a PostgreSQL connection for projects; this gives the project a single connection for all the users to impersonate. A dashboard can then be created with the project's connection. After this creation multiple users can see the same data with the correct policies enforced.
The Impersonation Path: The IMPERSONATE_USER permission allows a user to identify themselves while watching a dashboard that is not their own. An identifier of the user requesting the data is presented with a special, sensitive access token. With this information the data on the dashboard can be personalized to the person viewing it, while still remaining a multi-user connection.

The tutorial below illustrates the Impersonation Path.

Configuration and Usage

A User Admin .
As a user with the IMPERSONATE_USER permission, connect your analytic tool to Immuta's Query Engine using the .
In your Immuta Query Engine session, enter the iamid that is associated with the Immuta user account you want to impersonate.
The iamid is the name of the Identity and Access Management (IAM) provider that the Immuta user you want to impersonate is associated with.
For example, if using the iamid of "Okta", the full SQL command would be
Note: The iamid is a case-sensitive value.
Enter the userid that is associated with the Immuta user account you want to impersonate.
The userid could be an email address (if using Immuta's built-in identity manager - or bim), or it could be a shortened form of the username like a sAMAccountName in Active Directory.
For example, to specify a userid of jdoe, run
Note: The userid is a case-sensitive value.
In certain cases, it may be necessary to convert a shortened form of the username, like a sAMAccountName, to an email address in order to match it to an Immuta account. To handle this special case, Immuta has a capability that augments the userid by a specified template.
For example, a sAMAccountName of jdoe can be converted into an email address at mycompany.com using a string template that substitutes the value of {userid} with the userid provided. The resulting value would be jdoe@mycompany.com.
Now that your Immuta Query Engine session is configured to impersonate the desired Immuta user, your queries will be executed as the impersonated user as long as your session remains active.

Caveats

Once impersonation is set, all subsequent SQL calls will be made as the impersonated user.
User impersonation lasts the duration of the SQL connection. To stop impersonating a user, simply close the connection.
It is not possible to switch impersonated users within a single SQL connection. Each connection supports at most one impersonation setting. After user impersonation has been enabled, attempts to set a different user to impersonate will fail.

Project Workspaces

Managing Project Workspaces in Hadoop

Audience: Data Users
Content Summary: This page details how to use the Immuta project workspace in Hive and Impala.

Writing Data to the Workspace

You can write data to a project workspace within an ImmutaSparkSession. Note that you must be acting within the context of a project in order to write to that project's workspace.

In the example below, the consumer1 user is acting under the project Taxi Research, which contains purpose-restricted Impala data sources: NYC Taxi Trip and NYC Taxi Fare. This user will query these data sources from the ImmutaSparkSession and write the resulting DataFrame to parquet files in the Taxi Research workspace at /user/immuta/workspace/Taxi_Research.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.cloudera4
      /_/

Using Python version 2.7.5 (default, Jul 13 2018 13:06:57)
SparkSession available as 'spark'.
ImmutaSparkSession available as immuta.
>>> immuta.sql("show tables").show()
+--------+-------------+-----------+
|database|    tableName|isTemporary|
+--------+-------------+-----------+
|  immuta|nyc_taxi_fare|      false|
|  immuta|nyc_taxi_trip|      false|
+--------+-------------+-----------+

>>> my_df = immuta.sql("SELECT medallion, total_amount FROM nyc_taxi_fare LIMIT 500")
>>> my_df.write.parquet("/user/immuta/workspace/Taxi_Research/fare_total_sample")

The user can then verify that the data was written:

consumer1@worker0 ~]$ hadoop fs -ls /user/immuta/workspace/Taxi_Research/fare_total_sample
Found 2 items
-rw-r--r--   1 consumer1 immuta          0 2019-08-13 15:26 /user/immuta/workspace/Taxi_Research/fare_total_sample/_SUCCESS
-rw-r--r--   1 consumer1 immuta      30554 2019-08-13 15:26 /user/immuta/workspace/Taxi_Research/fare_total_sample/part-00000-0e4c5d79-1231-4fc3-b470-7814fb06e688-c000.snappy.parquet

Creating a Derived Data Source from Workspace Data

Data written to the project workspace can be easily exposed as a derived data source within the project.

Continuing from the example in the previous section, consumer1 can log in to the Immuta Web UI and start creating a derived data source by navigating to the Overview tab and clicking Create under Create Derived Data Source.

A modal will appear, prompting the user to select the data source(s) that the data was derived from. In this case, the data was derived from NYC Taxi Fare.

Next, the user will need to enter the path where the data is stored and select the desired backing technology of the derived data source. In this case, the data is stored under /user/immuta/workspace/Taxi_Research/fare_total_sample.

After the derived data source is created, other members of the project will be able to subscribe to it in the Immuta Web UI and query the data from the project workspace.

Creating a Table in Hive or Impala from Workspace Data without a Derived Data Source

Although best practices dictate that new tables in a project workspace database should be created via derived data sources, users can opt to manually create working tables in the database using Hive or Impala. In this case, users can leverage CREATE TABLE or CREATE EXTERNAL TABLE statements. An example for creating the fare_total_sample table using this method is below.

CREATE EXTERNAL TABLE fare_total_sample
(
  medallion        VARCHAR(32),
  total_amount     FLOAT
)
STORED AS parquet
LOCATION 'hdfs:///user/immuta/workspace/Taxi_Research/fare_total_sample/';

Querying Workspace Data Natively from Hive or Impala

The native workspace enables users to query data from an Immuta project natively from Hive or Impala, as opposed to using the Immuta Query Engine or the ImmutaSparkSession.

Immuta will manage the Sentry permissions for project users, allowing them to access a database in the Hive Metastore that corresponds to their active project context. In the example below, a project user connects directly to Impala and queries a derived data source table in the taxi_research project database. Note that this is only possible when the user is acting under the Taxi Research project context.

[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > show databases;
Query: show databases
+------------------------+----------------------------------------------+
| name                   | comment                                      |
+------------------------+----------------------------------------------+
| _impala_builtins       | System database for Impala builtin functions |
| default                | Default Hive database                        |
| taxi_research          | immuta_project_1                             |
+------------------------+----------------------------------------------+
Fetched 5 row(s) in 0.04s
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > use taxi_research;
Query: use taxi_research
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > show tables;
Query: show tables
+-------------------+
| name              |
+-------------------+
| fare_total_sample |
+-------------------+
Fetched 1 row(s) in 0.00s
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > select * from fare_total_sample limit 5;
Query: select * from fare_total_sample limit 5
Query submitted at: 2019-08-28 20:43:10 (Coordinator: http://worker0.hadoop.cs-26-workspace-d.immuta.io:25000)
Query progress can be monitored at: http://worker0.hadoop.cs-26-workspace-d.immuta.io:25000/query_plan?query_id=f54099a8f9f8aead:8cdee1ab00000000
+------------------------------------------------------------------+-------------------+
| medallion                                                        | total_amount      |
+------------------------------------------------------------------+-------------------+
| 4714f2b55cd230d030f55a11b88174c6e74ecd56c41d1fbc3a995d5283b86cae | 12                |
| 8d75d46bcc719be8c308713943f8c643443aab9a55c745f24112b81794e974be | 16                |
| 389c10e71c67883ea16c33921eb10e65249d58485533df220cf6033fb11416ad | 7.650000095367432 |
| fc3f6d10604836f7268e1173bde76e72ddae96af5b0afb1f376a92a835888fb2 | 6.5               |
| df85d6b4bd150460ccb3e9403f2b8c5a99616c1bd78287054cbb23cd5ce99a41 | 10.10000038146973 |
+------------------------------------------------------------------+-------------------+
Fetched 5 row(s) in 0.04s

HDFS Project Workspaces

Overview

This workspace allows native access to data on cluster without having to go through the Immuta SparkSession or Immuta Query Engine. Within a project, users can enable an HDFS Native Workspace, which creates a workspace directory in HDFS (and a corresponding database in the Hive metastore) where users can write files.

After a project owner creates a workspace, users will only be able to access this HDFS directory and database when acting under the project, and they should use the SparkSQL session to copy data into the workspace. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access.

Once derived data is ready to be shared outside the workspace, it can be exposed as a derived data source in Immuta. At that point, the derived data source will inherit policies appropriately, and it will then be available through Immuta outside the project and can be used in future project workspaces by different teams in a compliant way.

Administrators

Administrators can opt to configure where all Immuta projects are kept in HDFS (default is /user/immuta/workspace). Note: If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Administrators can place a configuration value in the cluster configuration (core-site.xml) to mark that cluster as unavailable for use as a workspace.

Project Owners

Once a project is equalized, project owners can enable a workspace for the project.
- If more than one cluster is configured, Immuta will prompt for which to use.
- Once enabled, the full URI of where that workspace is located will display on the project page.
- Project owners can also add connection information for Hive and/or Impala to allow Hive or Impala workspace sources to be created. The connection information provided and the Kerberos credentials configured for Immuta will be used for each derived Hive or Impala data source. The connection string for Hive or Impala will be displayed on the project page with the full URI.
Project owners can disable the workspace at any time.
- When disabled, the workspace will not allow reading/writing from project members any longer.
- Data sources living in this directory will still exist and their access will not be changed. (Subscribed users will still have access as usual.)
- All data in this directory will still exist, regardless of whether it belongs to a data source or not.
- Project owners can purge all data in the workspace after it has been disabled. Project Owners can
  - Purge all non-data-source data only.
  - Purge all data (including data source data).
    When purging all data source data, sources can either be disabled or fully deleted.

Project Members

When a user is acting under the project context, Immuta will provide them read/write access to the project HDFS directory (using HDFS ACLs). If there are Immuta data sources already exposed in that directory, the user will bypass the namenode plugin if acting under the project for the data in that directory.
Once a user is not acting under the project, all access to that directory will be revoked and they can only access data in that project as official Immuta data sources, if any exist.
When users with the CREATE_DATA_SOURCE_IN_PROJECT permission create a derived data source with workspace enabled, they will be prompted with a modified create data source workflow:
- The user will select the directory (starting with the project root directory) of the data they are exposing.
- If the directory contains parquet or ORC files, then Hive, Impala, and HDFS will be an option for the data source; otherwise, only HDFS will be available.
- Users will not be asked for the connection information because the Immuta user connection will be used to create the data source, which will ensure join pushdown and that the data source will work even when the user isn’t acting in the project. Note: Hive or Impala workspace sources are only available if the Project Owner added Hive or Impala connection information to the workspace.
- If Hive or Impala is selected as the data source type, Immuta will infer schema/partitions from files and generate create table statements for Hive.
- Once the data source is created, policy inheritance will take effect.

Note: To avoid data source collisions, Immuta will not allow HDFS and Hive/Impala data sources to be backed from the same location in HDFS.

Appendix

Section Contents

Additional Integrations

Overviews

Hadoop Clusters

Why Use Immuta On Your Cluster

Workflow

Immuta Spark Session

General Spark Access

Policy Fallback

Securing Hive and Impala without Sentry

Data Sharing

Securing Hive and Impala without Sentry

Restricting Access to Hive

Enable HDFS Access Control Lists in Cloudera Manager

Enable Hive Impersonation in Cloudera Manager

Configure Access Control Lists

Restricting Access to Impala

Create Policy Configuration File

Configure Impala to use Policy Configuration File

HDFS Access Pattern

Immuta HDFS Principal

Authentication

Accessing Data

HDFS User Impersonation

Legacy S3 Access Pattern

S3 as a Filesystem

GET Bucket

GET Bucket Contents

GET Object

Example: HTTP Request and Response

Example: Using Boto 3 to Download Objects

Spark Access Pattern

Section Contents

Using the Immuta SparkSession (Spark 2)

Using the Immuta SparkSession

With spark-submit

With spark-shell

With pyspark

Leveraging Data on Other Clusters and Databases

Prerequisites

Cross-cluster Joins

Step 1: Load Tables into Spark DataFrames

Step 2: Register Temporary Views of Filtered Data

Step 3: Run the Join Query

Spark Policy Enforcement and Deployment

Spark Policy Enforcement

Immuta SparkSession/Immuta Context (non-Databricks)

SparkSession Modifications in Databricks

Plan Analysis and Execution

Plan Modifications

Query Engine and Spark Difference

Field Protections via SecurityManager

Vulcan Service

Compute Partition Information for Immuta Spark Jobs

Service Administrative Requests for Immuta Hadoop Native Workspaces

Act as a Proxy to Remote Storage for Immuta Spark Jobs

Immuta Jobs Co-Located with Non-Immuta Jobs

Configuration Guides

Amazon EMR

Introduction

Supported EMR Versions

Create Prerequisite AWS Resources

Immuta Bootstrap Bucket

Immuta Data IAM Role

Create Immuta Data IAM Policy

Create Immuta Data IAM Role

Create Immuta Instance IAM Policy

Create Immuta Instance IAM Role

Create Immuta EMR Instance Profile

Update Immuta Data IAM Role Trust Policy

Immuta HDFS System Token in AWS Secrets Manager

Create EMR Cluster

EC2 Attributes Configuration File

Cluster Configuration File

Immuta Bootstrap Configuration File

(Optional) Kerberos Attributes Configuration File

Security Configuration

Create EMR Cluster Command

Remove Secrets