Managing Native Project Workspaces in Hadoop

Audience: Data Users

Content Summary: This page details how to use the Immuta project workspace in HDFS, Hive, and Impala. For a general overview of the workspace, as well as details on how to set one up, please see the Projects Guide.

Writing Data to the Workspace

You can write data to a project workspace within an ImmutaSparkSession. Note that you must be acting within the context of a project in order to write to that project's workspace.

In the example below, the consumer1 user is acting under the project Taxi Research, which contains purpose-restricted Impala data sources: NYC Taxi Trip and NYC Taxi Fare. This user will query these data sources from the ImmutaSparkSession and write the resulting DataFrame to parquet files in the Taxi Research workspace at /user/immuta/workspace/Taxi_Research.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.cloudera4
      /_/

Using Python version 2.7.5 (default, Jul 13 2018 13:06:57)
SparkSession available as 'spark'.
ImmutaSparkSession available as 'immuta'.
>>> immuta.sql("show tables").show()
+--------+-------------+-----------+
|database|    tableName|isTemporary|
+--------+-------------+-----------+
|  immuta|nyc_taxi_fare|      false|
|  immuta|nyc_taxi_trip|      false|
+--------+-------------+-----------+

>>> my_df = immuta.sql("SELECT medallion, total_amount FROM nyc_taxi_fare LIMIT 500")
>>> my_df.write.parquet("/user/immuta/workspace/Taxi_Research/fare_total_sample")

The user can then verify that the data was written:

[consumer1@worker0 ~]$ hadoop fs -ls /user/immuta/workspace/Taxi_Research/fare_total_sample
Found 2 items
-rw-r--r--   1 consumer1 immuta          0 2019-08-13 15:26 /user/immuta/workspace/Taxi_Research/fare_total_sample/_SUCCESS
-rw-r--r--   1 consumer1 immuta      30554 2019-08-13 15:26 /user/immuta/workspace/Taxi_Research/fare_total_sample/part-00000-0e4c5d79-1231-4fc3-b470-7814fb06e688-c000.snappy.parquet
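The paths above follow a predictable layout: in this example the workspace directory name is the project name with spaces replaced by underscores. The helper below is a hypothetical sketch of that convention, not an Immuta API; always confirm the actual workspace path shown for your project in the Immuta Web UI.

```python
# Hypothetical helper illustrating the path convention seen in the example
# above (assumption: spaces in the project name become underscores).
def workspace_path(project_name, base="/user/immuta/workspace"):
    """Return the assumed HDFS workspace path for a project."""
    return "{}/{}".format(base, project_name.replace(" ", "_"))

print(workspace_path("Taxi Research"))
# /user/immuta/workspace/Taxi_Research
```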

Creating a Derived Data Source from Workspace Data

Data written to the project workspace can be easily exposed as a derived data source within the project.

Continuing from the example in the previous section, consumer1 can log in to the Immuta Web UI and start creating a derived data source by navigating to the Overview tab and clicking Create under Create Derived Data Source.

A modal will appear, prompting the user to select the data source(s) that the data was derived from. In this case, the data was derived from NYC Taxi Fare.

Next, the user will need to enter the path in HDFS where the data is stored and select the desired backing technology of the derived data source. In this case, the data is stored under /user/immuta/workspace/Taxi_Research/fare_total_sample.

After the derived data source is created, other members of the project will be able to subscribe to it in the Immuta Web UI and query the data from the project workspace.

Creating a Table in Hive or Impala from Workspace Data without a Derived Data Source

Although best practices dictate that new tables in a project workspace database should be created via derived data sources, users can opt to manually create working tables in the database using Hive or Impala. In this case, users can leverage CREATE TABLE or CREATE EXTERNAL TABLE statements. An example for creating the fare_total_sample table using this method is below.

CREATE EXTERNAL TABLE fare_total_sample
(
  medallion        VARCHAR(32),
  total_amount     FLOAT
)
STORED AS parquet
LOCATION 'hdfs:///user/immuta/workspace/Taxi_Research/fare_total_sample/';
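When writing such statements by hand, the column types in the DDL must stay consistent with the Parquet schema that Spark wrote. The following is a minimal sketch, not an official Immuta or Spark utility: it renders a `CREATE EXTERNAL TABLE` statement from the `(column, dtype)` pairs returned by a Spark DataFrame's `dtypes` attribute. The type mapping is an assumption covering common cases; extend it to match your data.

```python
# Assumed Spark-dtype -> Hive/Impala type mapping (illustrative only).
SPARK_TO_HIVE = {
    "string": "STRING",
    "float": "FLOAT",
    "double": "DOUBLE",
    "int": "INT",
    "bigint": "BIGINT",
}

def external_table_ddl(table, columns, location):
    """Render a CREATE EXTERNAL TABLE statement for Parquet data in HDFS.

    columns: list of (name, spark_dtype) pairs, e.g. my_df.dtypes.
    """
    cols = ",\n".join(
        "  {:<16} {}".format(name, SPARK_TO_HIVE.get(dtype, dtype.upper()))
        for name, dtype in columns
    )
    return (
        "CREATE EXTERNAL TABLE {}\n(\n{}\n)\n"
        "STORED AS parquet\nLOCATION '{}';"
    ).format(table, cols, location)

print(external_table_ddl(
    "fare_total_sample",
    [("medallion", "string"), ("total_amount", "float")],
    "hdfs:///user/immuta/workspace/Taxi_Research/fare_total_sample/",
))
```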

Querying Workspace Data Natively from Hive or Impala

The native workspace enables users to query data from an Immuta project natively from Hive or Impala, as opposed to using the Immuta Query Engine or the ImmutaSparkSession.

Immuta will manage the Sentry permissions for project users, allowing them to access a database in the Hive Metastore that corresponds to their active project context. In the example below, a project user connects directly to Impala and queries a derived data source table in the taxi_research project database. Note that this is only possible when the user is acting under the Taxi Research project context.

[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > show databases;
Query: show databases
+------------------------+----------------------------------------------+
| name                   | comment                                      |
+------------------------+----------------------------------------------+
| _impala_builtins       | System database for Impala builtin functions |
| default                | Default Hive database                        |
| taxi_research          | immuta_project_1                             |
+------------------------+----------------------------------------------+
Fetched 3 row(s) in 0.04s
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > use taxi_research;
Query: use taxi_research
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > show tables;
Query: show tables
+-------------------+
| name              |
+-------------------+
| fare_total_sample |
+-------------------+
Fetched 1 row(s) in 0.00s
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > select * from fare_total_sample limit 5;
Query: select * from fare_total_sample limit 5
Query submitted at: 2019-08-28 20:43:10 (Coordinator: http://worker0.hadoop.cs-26-workspace-d.immuta.io:25000)
Query progress can be monitored at: http://worker0.hadoop.cs-26-workspace-d.immuta.io:25000/query_plan?query_id=f54099a8f9f8aead:8cdee1ab00000000
+------------------------------------------------------------------+-------------------+
| medallion                                                        | total_amount      |
+------------------------------------------------------------------+-------------------+
| 4714f2b55cd230d030f55a11b88174c6e74ecd56c41d1fbc3a995d5283b86cae | 12                |
| 8d75d46bcc719be8c308713943f8c643443aab9a55c745f24112b81794e974be | 16                |
| 389c10e71c67883ea16c33921eb10e65249d58485533df220cf6033fb11416ad | 7.650000095367432 |
| fc3f6d10604836f7268e1173bde76e72ddae96af5b0afb1f376a92a835888fb2 | 6.5               |
| df85d6b4bd150460ccb3e9403f2b8c5a99616c1bd78287054cbb23cd5ce99a41 | 10.10000038146973 |
+------------------------------------------------------------------+-------------------+
Fetched 5 row(s) in 0.04s

CDH Configurations for Sentry Object Ownership

Enabling Sentry Object Ownership allows Immuta users in a workspace to DROP any tables that they have created in the workspace. Note that if Object Ownership is not enabled as detailed below, or your CDH version is older than CDH 5.16, any workspace tables created directly in Hive or Impala can only be dropped by Sentry administrators.

Follow the steps below to configure an Immuta project workspace with Sentry Object Ownership. For additional information on Sentry Object Ownership, see the official Cloudera documentation.

  1. Install and configure Sentry.
  2. Navigate to the Sentry Configuration in Cloudera Manager to turn Sentry Object Ownership on.

    1. Search for sentry.db.policy.store.owner.as.privilege.
    2. Set the value to ALL privileges using the radio button.

      • Make sure you do not select ALL privileges with GRANT, as this option would allow users to circumvent Immuta by sharing data via GRANT statements.
    3. Save your changes and restart the Sentry service.

  3. Revoke all other privileges from roles that Immuta users belong to. It is important that the only databases these users can access directly through Hive or Impala are project workspace databases.