You are viewing documentation for Immuta version 2022.5.

For the latest version, view our documentation for Immuta SaaS or the latest self-hosted version.

Databricks Project Workspaces Overview

Audience: Project members

Content Summary: This page explains Databricks workspaces, which allow users to access and write to protected data directly in Databricks.

See the Pre-Configuration Checklist for details on prerequisites and see the Configuration page for installation instructions.

Overview

Databricks project workspaces allow users to access data on the cluster without going through the Immuta SparkSession. Built on Immuta Projects and Project Equalization, a Databricks project workspace is a space where every project member has the same level of access to data. This equalized access lets members collaborate without risking data leaks. Project members can also write protected data back to Immuta.

Users can access the directory and database created for the workspace only when acting under the project. The Immuta Spark SQL Session applies policies to the data, so any data written to the workspace already complies with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the Spark SQL session to copy data into the workspace.

Supported Cloud Providers

Amazon Web Services

Immuta currently supports the s3a scheme for Amazon S3. When using Databricks with Amazon S3, either specify an S3 key pair that has access to the workspace bucket/prefix in the additional configuration, or apply an instance role with that access to the cluster.
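
As a rough illustration of the key-pair option, the sketch below renders a Hadoop-style XML configuration fragment. It assumes the additional configuration is a Hadoop-style XML file, which this page does not state. `fs.s3a.access.key` and `fs.s3a.secret.key` are standard Hadoop S3A property names; the helper name `hadoop_conf_xml` and the placeholder credential values are hypothetical.

```python
import xml.etree.ElementTree as ET


def hadoop_conf_xml(properties: dict) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; real keys must have access to the workspace bucket/prefix.
conf = hadoop_conf_xml(
    {
        "fs.s3a.access.key": "EXAMPLE_ACCESS_KEY",
        "fs.s3a.secret.key": "EXAMPLE_SECRET_KEY",
    }
)
print(conf)
```

If an instance role is applied to the cluster instead, no key pair would need to appear in the configuration.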

Microsoft Azure

Immuta currently supports the abfss scheme for Azure General Purpose V2 Storage Accounts, which includes support for Azure Data Lake Gen 2. When configuring Immuta workspaces for Databricks on Azure, you must provide the Azure Databricks Workspace ID. See the Databricks documentation for how to determine the Workspace ID for your workspace. In addition, any cluster that will use Immuta workspaces must include the additional configuration file, with credentials for the Azure Storage container that holds the Immuta workspaces.

Google Cloud Platform

Immuta currently supports the gs scheme for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine. Databricks automatically provisions and autoscales drivers and executors to pods on Google Kubernetes Engine, so Google Cloud Platform admin users can view and monitor the Google Kubernetes resources in the Google Cloud Platform.
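
The three storage URI prefixes named in this section (s3a, abfss, gs) can be checked before a workspace location is used. The following is a minimal sketch, not part of Immuta: the function name `workspace_provider` and the example URIs are hypothetical, and the check only inspects the URI prefix and bucket/container part.

```python
from urllib.parse import urlsplit

# URI prefixes this page lists as supported, mapped to their cloud provider.
SUPPORTED_SCHEMES = {
    "s3a": "Amazon Web Services",
    "abfss": "Microsoft Azure",
    "gs": "Google Cloud Platform",
}


def workspace_provider(uri: str) -> str:
    """Return the cloud provider for a workspace URI, or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported prefix {parts.scheme!r} in {uri!r}")
    if not parts.netloc:
        raise ValueError(f"missing bucket or container in {uri!r}")
    return SUPPORTED_SCHEMES[parts.scheme]


# Hypothetical locations, one per provider.
print(workspace_provider("s3a://example-bucket/workspaces/project"))
print(workspace_provider("abfss://container@account.dfs.core.windows.net/workspaces"))
print(workspace_provider("gs://example-bucket/workspaces/project"))
```

A location such as `dbfs:/workspaces/project` would be rejected with a `ValueError`, since `dbfs` is not one of the listed prefixes.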

Caveats and Limitations

  • Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable, and the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED property cannot be set to true to expose the DBFS FUSE mount.
  • Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported.
  • Stage third-party libraries in DBFS: Installing libraries from Google Storage is not supported.
  • Install third-party libraries as cluster-scoped: Notebook-scoped libraries have limited support. See the Databricks Libraries page for more details.
  • Maven library installation is only supported in Databricks Runtime 8.1+.
  • /databricks/spark/conf/spark-env.sh is mounted as read-only:

    • Set sensitive Immuta configuration values directly in immuta_conf.xml: Do not use environment variables to set sensitive Immuta properties. Because spark-env.sh is read-only, Immuta cannot edit the file to remove environment variables and keep them from being visible to end users.
    • Use /immuta-scratch directly: The IMMUTA_LOCAL_SCRATCH_DIR property is unavailable.
  • Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.

  • The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.

Writing Data Back to Databricks: Supported Metastore Providers

To write data back to a table in Databricks through an Immuta workspace, use one of the following supported provider types for your table format:

  • avro
  • csv
  • delta
  • orc
  • parquet
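
A write-back statement can be guarded so that only the provider types listed above are used. The sketch below builds a standard Spark SQL `CREATE TABLE ... USING ... AS SELECT` statement; the helper `create_table_sql` and the database, table, and source names are hypothetical, and it assumes the workspace database can be addressed by name from the Spark SQL session.

```python
import re

# Provider types this page lists as supported for write-back.
SUPPORTED_PROVIDERS = ("avro", "csv", "delta", "orc", "parquet")

# Conservative check for plain SQL identifiers (letters, digits, underscores).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_table_sql(database: str, table: str, provider: str, select_sql: str) -> str:
    """Build a CREATE TABLE ... USING <provider> AS <select> statement."""
    provider = provider.lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider type: {provider!r}")
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"invalid identifier: {name!r}")
    return f"CREATE TABLE {database}.{table} USING {provider} AS {select_sql}"


# Hypothetical workspace database and source table.
statement = create_table_sql(
    "my_workspace_db", "summary", "parquet", "SELECT * FROM source_db.source_table"
)
print(statement)
```

In a notebook, the resulting string could be passed to `spark.sql(statement)`; a provider outside the list, such as `json`, raises a `ValueError` before anything is submitted.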