1 of 3

Project Workspaces

Managing Project Workspaces in Hadoop

Audience: Data Users
Content Summary: This page details how to use the Immuta project workspace in Hive and Impala.

Writing Data to the Workspace

You can write data to a project workspace within an ImmutaSparkSession. Note that you must be acting within the context of a project in order to write to that project's workspace.

In the example below, the consumer1 user is acting under the project Taxi Research, which contains purpose-restricted Impala data sources: NYC Taxi Trip and NYC Taxi Fare. This user will query these data sources from the ImmutaSparkSession and write the resulting DataFrame to parquet files in the Taxi Research workspace at /user/immuta/workspace/Taxi_Research.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.cloudera4
      /_/

Using Python version 2.7.5 (default, Jul 13 2018 13:06:57)
SparkSession available as 'spark'.
ImmutaSparkSession available as immuta.
>>> immuta.sql("show tables").show()
+--------+-------------+-----------+
|database|    tableName|isTemporary|
+--------+-------------+-----------+
|  immuta|nyc_taxi_fare|      false|
|  immuta|nyc_taxi_trip|      false|
+--------+-------------+-----------+

>>> my_df = immuta.sql("SELECT medallion, total_amount FROM nyc_taxi_fare LIMIT 500")
>>> my_df.write.parquet("/user/immuta/workspace/Taxi_Research/fare_total_sample")

The user can then verify that the data was written:

consumer1@worker0 ~]$ hadoop fs -ls /user/immuta/workspace/Taxi_Research/fare_total_sample
Found 2 items
-rw-r--r--   1 consumer1 immuta          0 2019-08-13 15:26 /user/immuta/workspace/Taxi_Research/fare_total_sample/_SUCCESS
-rw-r--r--   1 consumer1 immuta      30554 2019-08-13 15:26 /user/immuta/workspace/Taxi_Research/fare_total_sample/part-00000-0e4c5d79-1231-4fc3-b470-7814fb06e688-c000.snappy.parquet

Creating a Derived Data Source from Workspace Data

Data written to the project workspace can be easily exposed as a derived data source within the project.

Continuing from the example in the previous section, consumer1 can log in to the Immuta Web UI and start creating a derived data source by navigating to the Overview tab and clicking Create under Create Derived Data Source.

A modal will appear, prompting the user to select the data source(s) that the data was derived from. In this case, the data was derived from NYC Taxi Fare.

Next, the user will need to enter the path where the data is stored and select the desired backing technology of the derived data source. In this case, the data is stored under /user/immuta/workspace/Taxi_Research/fare_total_sample.

After the derived data source is created, other members of the project will be able to subscribe to it in the Immuta Web UI and query the data from the project workspace.

Creating a Table in Hive or Impala from Workspace Data without a Derived Data Source

Although best practices dictate that new tables in a project workspace database should be created via derived data sources, users can opt to manually create working tables in the database using Hive or Impala. In this case, users can leverage CREATE TABLE or CREATE EXTERNAL TABLE statements. An example for creating the fare_total_sample table using this method is below.

CREATE EXTERNAL TABLE fare_total_sample
(
  medallion        VARCHAR(32),
  total_amount     FLOAT
)
STORED AS parquet
LOCATION 'hdfs:///user/immuta/workspace/Taxi_Research/fare_total_sample/';

Querying Workspace Data Natively from Hive or Impala

The native workspace enables users to query data from an Immuta project natively from Hive or Impala, as opposed to using the Immuta Query Engine or the ImmutaSparkSession.

Immuta will manage the Sentry permissions for project users, allowing them to access a database in the Hive Metastore that corresponds to their active project context. In the example below, a project user connects directly to Impala and queries a derived data source table in the taxi_research project database. Note that this is only possible when the user is acting under the Taxi Research project context.

[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > show databases;
Query: show databases
+------------------------+----------------------------------------------+
| name                   | comment                                      |
+------------------------+----------------------------------------------+
| _impala_builtins       | System database for Impala builtin functions |
| default                | Default Hive database                        |
| taxi_research          | immuta_project_1                             |
+------------------------+----------------------------------------------+
Fetched 5 row(s) in 0.04s
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > use taxi_research;
Query: use taxi_research
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > show tables;
Query: show tables
+-------------------+
| name              |
+-------------------+
| fare_total_sample |
+-------------------+
Fetched 1 row(s) in 0.00s
[worker0.hadoop.cs-26-workspace-d.immuta.io:21000] > select * from fare_total_sample limit 5;
Query: select * from fare_total_sample limit 5
Query submitted at: 2019-08-28 20:43:10 (Coordinator: http://worker0.hadoop.cs-26-workspace-d.immuta.io:25000)
Query progress can be monitored at: http://worker0.hadoop.cs-26-workspace-d.immuta.io:25000/query_plan?query_id=f54099a8f9f8aead:8cdee1ab00000000
+------------------------------------------------------------------+-------------------+
| medallion                                                        | total_amount      |
+------------------------------------------------------------------+-------------------+
| 4714f2b55cd230d030f55a11b88174c6e74ecd56c41d1fbc3a995d5283b86cae | 12                |
| 8d75d46bcc719be8c308713943f8c643443aab9a55c745f24112b81794e974be | 16                |
| 389c10e71c67883ea16c33921eb10e65249d58485533df220cf6033fb11416ad | 7.650000095367432 |
| fc3f6d10604836f7268e1173bde76e72ddae96af5b0afb1f376a92a835888fb2 | 6.5               |
| df85d6b4bd150460ccb3e9403f2b8c5a99616c1bd78287054cbb23cd5ce99a41 | 10.10000038146973 |
+------------------------------------------------------------------+-------------------+
Fetched 5 row(s) in 0.04s

HDFS Project Workspaces

Overview

This workspace allows native access to data on cluster without having to go through the Immuta SparkSession or Immuta Query Engine. Within a project, users can enable an HDFS Native Workspace, which creates a workspace directory in HDFS (and a corresponding database in the Hive metastore) where users can write files.

After a project owner creates a workspace, users will only be able to access this HDFS directory and database when acting under the project, and they should use the SparkSQL session to copy data into the workspace. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access.

Once derived data is ready to be shared outside the workspace, it can be exposed as a derived data source in Immuta. At that point, the derived data source will inherit policies appropriately, and it will then be available through Immuta outside the project and can be used in future project workspaces by different teams in a compliant way.

Administrators

Administrators can opt to configure where all Immuta projects are kept in HDFS (default is /user/immuta/workspace). Note: If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Administrators can place a configuration value in the cluster configuration (core-site.xml) to mark that cluster as unavailable for use as a workspace.

Project Owners

Once a project is equalized, project owners can enable a workspace for the project.
- If more than one cluster is configured, Immuta will prompt for which to use.
- Once enabled, the full URI of where that workspace is located will display on the project page.
- Project owners can also add connection information for Hive and/or Impala to allow Hive or Impala workspace sources to be created. The connection information provided and the Kerberos credentials configured for Immuta will be used for each derived Hive or Impala data source. The connection string for Hive or Impala will be displayed on the project page with the full URI.
Project owners can disable the workspace at any time.
- When disabled, the workspace will not allow reading/writing from project members any longer.
- Data sources living in this directory will still exist and their access will not be changed. (Subscribed users will still have access as usual.)
- All data in this directory will still exist, regardless of whether it belongs to a data source or not.
- Project owners can purge all data in the workspace after it has been disabled. Project Owners can
  - Purge all non-data-source data only.
  - Purge all data (including data source data).
    When purging all data source data, sources can either be disabled or fully deleted.

Project Members

When a user is acting under the project context, Immuta will provide them read/write access to the project HDFS directory (using HDFS ACLs). If there are Immuta data sources already exposed in that directory, the user will bypass the namenode plugin if acting under the project for the data in that directory.
Once a user is not acting under the project, all access to that directory will be revoked and they can only access data in that project as official Immuta data sources, if any exist.
When users with the CREATE_DATA_SOURCE_IN_PROJECT permission create a derived data source with workspace enabled, they will be prompted with a modified create data source workflow:
- The user will select the directory (starting with the project root directory) of the data they are exposing.
- If the directory contains parquet or ORC files, then Hive, Impala, and HDFS will be an option for the data source; otherwise, only HDFS will be available.
- Users will not be asked for the connection information because the Immuta user connection will be used to create the data source, which will ensure join pushdown and that the data source will work even when the user isn’t acting in the project. Note: Hive or Impala workspace sources are only available if the Project Owner added Hive or Impala connection information to the workspace.
- If Hive or Impala is selected as the data source type, Immuta will infer schema/partitions from files and generate create table statements for Hive.
- Once the data source is created, policy inheritance will take effect.

Note: To avoid data source collisions, Immuta will not allow HDFS and Hive/Impala data sources to be backed from the same location in HDFS.