Audience: Project members
Content Summary: This page explains Databricks workspaces, which allow users to access and write to protected data directly in Databricks.
See the Pre-Configuration Checklist for details on prerequisites and see the Configuration page for installation instructions.
Databricks project workspaces allow users to access data on the cluster without having to go through the Immuta SparkSession. Built on Immuta Projects and Project Equalization, Databricks project workspaces are a space where every project member has the same level of access to data. This equalized access allows collaboration without concerns about data leaks. Not only can project members collaborate on data, but they can also write protected data back to Immuta.
Users can only access the directory and database created for the workspace while acting under the project. The Immuta SparkSQL session applies policies to the data, so any data written to the workspace is already compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the SparkSQL session to copy data into the workspace.
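For example, a project member acting under the equalized project might copy policy-enforced data into the workspace database. A minimal sketch, assuming hypothetical source and workspace database names:

```python
# Sketch only: table and database names below are hypothetical examples.
# Read through the Immuta SparkSQL session, so project policies are applied.
df = spark.table("immuta.claims_data")

# Copy the (already policy-compliant) data into the workspace database,
# making it visible to every member of the equalized project.
df.write.saveAsTable("my_project_workspace.claims_data_copy")
```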
Immuta currently supports the s3a schema for Amazon S3. When using Databricks on Amazon S3, either an S3 key pair with access to the workspace bucket/prefix must be specified in the additional configuration, or an instance role with that access must be applied to the cluster.
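If an instance role is not used, the standard Hadoop s3a credential properties are the usual way to carry the key pair. A minimal sketch of Spark cluster configuration entries, assuming placeholder credentials; the exact placement in your additional configuration may differ:

```
spark.hadoop.fs.s3a.access.key <YOUR_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key <YOUR_SECRET_ACCESS_KEY>
```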
Immuta currently supports the abfss schema for Azure General Purpose V2 Storage Accounts. This includes support for Azure Data Lake Storage Gen2. When configuring Immuta workspaces for Databricks on Azure, the Azure Databricks Workspace ID must be provided. More information about how to determine the Workspace ID for your workspace can be found in the Databricks documentation. It is also important that the additional configuration file, containing credentials for the Azure Storage container that holds Immuta workspaces, is included on any cluster that will use Immuta workspaces.
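As a sketch, the standard Hadoop ABFS account key property is one common way such credentials are supplied; the storage account name and key below are placeholders, and your additional configuration file may expect a different form:

```
fs.azure.account.key.<storage-account>.dfs.core.windows.net <YOUR_ACCOUNT_KEY>
```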
Immuta currently supports the gs schema for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine. Databricks automatically provisions and auto-scales drivers and executors as pods on Google Kubernetes Engine, so Google Cloud Platform admin users can view and monitor those Kubernetes resources in the Google Cloud Platform console.
Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable, and the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED property cannot be set to true to expose the DBFS FUSE mount.
Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported (see the staging sketch after this list).
Stage third-party libraries in DBFS: Installing libraries from Google Storage is not supported.
Install third-party libraries as cluster-scoped: Notebook-scoped libraries have limited support. See the Databricks Libraries page for more details.
Maven library installation is only supported in Databricks Runtime 8.1+.
Set sensitive Immuta configuration values directly in immuta_conf.xml: Do not use environment variables to set sensitive Immuta properties. /databricks/spark/conf/spark-env.sh is mounted as read-only, so Immuta is unable to edit that file to remove environment variables and keep them from being visible to end users.
Use /immuta-scratch directly: The IMMUTA_LOCAL_SCRATCH_DIR property is unavailable.
Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.
The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.
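Because init scripts and installation artifacts must live in Google Storage rather than DBFS on this platform, staging them is typically just a copy into a bucket. A minimal sketch, assuming hypothetical bucket and file names:

```
# Bucket and file names are hypothetical examples.
gsutil cp immuta-cluster-init-script.sh gs://my-immuta-artifacts/scripts/
gsutil cp immuta_conf.xml gs://my-immuta-artifacts/conf/
```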
To write data back to a table in Databricks through an Immuta workspace, use one of the following supported provider types for your table format:
avro
csv
delta
orc
parquet
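Any of these provider types can be passed to the DataFrame writer's format when saving a table in the workspace database. A minimal sketch, assuming hypothetical database and table names:

```python
# The format must be one of the supported provider types:
# avro, csv, delta, orc, parquet. Names below are hypothetical.
df = spark.table("my_project_workspace.staged_results")
df.write.format("orc").saveAsTable("my_project_workspace.results_orc")
```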
Audience: Project members
Content Summary: This page outlines prerequisites and provides an overview of the integration process for Databricks project workspaces.
See the Databricks Project Workspaces page for information on the utility of project workspaces and see the Configuration page for installation instructions.
External IDs have been mapped in Immuta for Databricks.
Cluster configuration: Before creating a workspace, the cluster must send its configuration to Immuta. To do this, run a simple query on the cluster (e.g., show tables); otherwise, an error message will occur when you attempt to create a workspace.
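Any trivial query is enough to make the cluster report its configuration; for example, from a notebook cell:

```python
# Forces the cluster to send its configuration to Immuta before workspace creation.
spark.sql("show tables").show()
```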
An Immuta user with the CREATE_PROJECT permission creates a project with Databricks data sources.
The Immuta Project Owner enables Project Equalization, which balances every Project Member's access to the data to be the same.
The Immuta Project Owner creates the Databricks workspace, which automatically generates a subfolder in the root path specified by the Application Admin and a remote database associated with the project.
The Immuta Project Members query equalized data within the context of the project, collaborate, and write data back to Immuta, all within Databricks.
The Immuta Project Members use their newly written data to create derived data sources. These derived data sources inherit the necessary Immuta policies to be securely shared outside of the project.
Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Administrators can place a configuration value in the cluster configuration (core-site.xml) to mark that cluster as unavailable for use as a workspace.
When acting in the workspace project, users can read data using calls like spark.read.parquet("immuta:///some/path/to/a/workspace").
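Writes work the same way against the workspace path. A short sketch, with hypothetical paths:

```python
# Paths are hypothetical examples; both calls go through the immuta:/// scheme.
df = spark.read.parquet("immuta:///some/path/to/a/workspace")
df.write.parquet("immuta:///some/path/to/a/workspace/output")
```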
To write Delta Lake data to a workspace and then expose that Delta table as a data source in Immuta, you must specify a table (rather than a directory) in the workspace when creating the derived data source.
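In practice, that means saving the Delta data as a table rather than writing it to a bare directory. A minimal sketch with hypothetical names:

```python
# Save as a *table* (not a directory) so it can back a derived data source.
# Database and table names are hypothetical.
df.write.format("delta").saveAsTable("my_project_workspace.derived_delta_table")
```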
Audience: Project Owners and members
Content Summary: This tutorial details how to create and delete a Databricks workspace.
Databricks Cluster Configuration
Before creating a workspace, the cluster must send its configuration to Immuta. To do this, run a simple query on the cluster (e.g., show tables); otherwise, an error message will occur when you attempt to create a workspace.
Navigate to the Policies tab and enable Project Equalization by clicking the Project Equalization slider to on.
Scroll to the Native Workspace section and click Create.
Select Databricks from the Workspace Configuration dropdown menu.
Opt to edit the sub-directory in the Workspace Directory field; this sub-directory auto-populates as the project name.
Enter the Workspace Database Name.
Click Create to enable the workspace.
Scroll to the Native Workspace section on the Policies tab and click the toggle to disable the workspace.
Click Delete in the Native Workspace section.
Choose one of the following options in the modal:
Purge Generic Workspace Data: permanently deletes the workspace data, while preserving the data used by derived data sources. Note: If you created a derived data source that references a view on top of a table in Databricks that isn't a derived data source, that table will be deleted and the derived data source will break.
Purge Everything & Delete Derived Data Sources: permanently deletes the data and purges all derived data sources.
Click Delete.