Databricks Access Pattern

Audience: Data Owners and Data Users

Content Summary: This page provides an overview of the Databricks access pattern. For installation instructions, see the Databricks Installation Introduction and the Databricks Quick Integration Guide.

Overview

The Immuta Databricks integration allows you to protect access to tables and manage row-, column-, and cell-level controls without enabling table ACLs or credential passthrough. Like other integrations, policies are applied to the plan that Spark builds for a user's query and enforced live on-cluster.

Using Immuta with Databricks

Mapping Users

Usernames in Immuta must match usernames in Databricks. It is best practice to use the same identity manager for Immuta that you use for Databricks (Immuta supports all common identity manager protocols); however, for Immuta SaaS users, the easiest approach is to ensure usernames match between the two systems.

Configuring Tables

You should use a Databricks administrator account to register tables with Immuta using the UI or API. However, you should not test Immuta policies using a Databricks administrator account, as administrators are able to bypass controls. See the Testing the Integration section below for more details.

Ideally, you should register entire databases and run schema monitoring jobs through the Python script provided during data source registration.

Testing the Integration

Test the integration on an Immuta-enabled cluster with a user that is not a Databricks administrator. To illustrate table access and policy controls, we will use two example accounts: Bob (test account) and Emily (administrator account).

Table Access

The administrator, Emily, controls who has access to specific tables in Databricks. The analyst, Bob, will see only the empty immuta database until he gains access to tables, either through a Subscription Policy Emily sets in Immuta or by being manually added to the data source by Emily. For example, if Emily registers a database called fruit with tables banana, kiwi, and apple, then once Bob has subscribed to those tables through Immuta, he will see the fruit database and its tables and be able to query them. Note: If Bob tries to query those tables before he is subscribed, the query will be blocked.
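
As an illustration (a sketch only; the exact error surfaced to Bob varies by Databricks runtime), Bob's session before and after subscription might look like this in Python:

# Before Bob is subscribed, only the empty immuta database is visible:
spark.sql("SHOW DATABASES").show()        # immuta
spark.sql("SELECT * FROM fruit.kiwi")     # blocked with an access error

# After Emily's Subscription Policy grants Bob access:
spark.sql("SHOW DATABASES").show()        # immuta, fruit
spark.sql("SELECT * FROM fruit.kiwi").show()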

The immuta Database

All tables registered in Immuta also appear in the immuta database, providing a single database that contains every table. In our example, Bob would see fruit.banana, fruit.kiwi, and fruit.apple, and in the immuta database he would see immuta.fruit_banana, immuta.fruit_kiwi, and immuta.fruit_apple.

The immuta database will also contain tables that are not in Databricks: if Emily had Athena tables registered with Immuta, they would appear in the immuta database and would be queryable through Databricks. (Immuta automatically configures the JDBC connection.)
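
Continuing the example, listing the consolidated database shows the database_table naming convention (a sketch; output abbreviated):

spark.sql("SHOW TABLES IN immuta").show()
# fruit_banana, fruit_kiwi, fruit_apple, plus any non-Databricks sources
# (for example, registered Athena tables)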

Fine-grained Access Control

Once Bob is subscribed to the fruit tables, Emily can apply fine-grained access controls, such as restricting rows or masking columns with advanced anonymization techniques, to manage what Bob can see in each table. More details can be found on the Data Policies page, including an overview of masking struct and array columns in Databricks.
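
For instance, assuming a hypothetical masking policy on a hypothetical column named origin in fruit.kiwi, Bob's query still runs, but the protected values come back masked:

spark.sql("SELECT name, origin FROM fruit.kiwi").show()
# "origin" is returned masked (nulled, hashed, etc., depending on the policy);
# unprotected columns such as "name" are unchanged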

Note: Immuta recommends building Global Policies rather than Local Policies, as they allow organizations to easily manage policies as a whole and capture system state in a more deterministic manner.

Accessing Data

All access to protected data must go through Spark SQL; the examples below show equivalent queries from each supported language.

Python

# Immuta enforces policy on the plan Spark builds for this query
df = spark.sql("select * from fruit.kiwi")

Scala

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
val sqlDF = spark.sql("SELECT * FROM fruit.kiwi")

SQL

%sql
select * from fruit.kiwi

R

library(SparkR)
df <- SparkR::sql("SELECT * FROM fruit.kiwi")

Note: With R, you must load the SparkR library in a cell before accessing the data.

Native Databricks Workspaces

Databricks workspaces allow users to access and write to protected data directly in Databricks without having to go through the Immuta Query Engine.

Typically, Immuta applies policies by forcing users to query through the Query Engine, which acts as a proxy in front of the database Immuta is protecting. Within an equalized project, however, this proxy is unnecessary: Immuta instead enforces policy logic within the execution flow of the query inside Databricks.

  • When working in the workspace project, users can read data using calls like spark.read.parquet("immuta:///some/path/to/a/workspace").
  • If you want to write Delta Lake data to a workspace and expose that Delta table as a data source in Immuta, you must specify a table (rather than a directory) when creating the derived data source in the workspace; see the sketch after this list.
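
A minimal sketch of both operations, assuming hypothetical workspace paths and a hypothetical workspace_db.derived table name:

# Read protected data directly from the workspace path
df = spark.read.parquet("immuta:///some/path/to/a/workspace")

# Write Delta Lake data back into the workspace, then expose it as a table
# (not a bare directory) so it can back a derived data source in Immuta
df.write.format("delta").save("immuta:///some/path/to/a/workspace/derived")
spark.sql(
    "CREATE TABLE workspace_db.derived USING DELTA "
    "LOCATION 'immuta:///some/path/to/a/workspace/derived'"
)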

Amazon Web Services

Immuta currently supports the s3a scheme for Amazon S3. When using Databricks on Amazon S3, you must either specify an S3 key pair with access to the workspace bucket/prefix in the additional configuration, or apply an instance role with that access to the cluster.
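
For the key-pair option, the additional configuration includes the standard Hadoop S3A credential properties in the cluster's Spark config, for example (placeholder values):

spark.hadoop.fs.s3a.access.key <YOUR_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key <YOUR_SECRET_ACCESS_KEY>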

Microsoft Azure

Immuta currently supports the abfss scheme for Azure General Purpose V2 storage accounts; this includes support for Azure Data Lake Storage Gen2. When configuring Immuta workspaces for Databricks on Azure, the Azure Databricks workspace ID must be provided; see the Databricks documentation for how to determine the workspace ID for your workspace. The additional configuration file, containing credentials for the Azure Storage container that holds Immuta workspaces, must also be included on any cluster that will use Immuta workspaces.
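
For example, workspace paths use the abfss scheme, and the cluster needs a storage account key for the container that holds Immuta workspaces (placeholder values; the account-key property is the standard ABFS Hadoop setting):

abfss://<container>@<storage-account>.dfs.core.windows.net/<workspace-path>
spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net <STORAGE_ACCOUNT_KEY>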

Google Cloud Platform

Immuta currently supports the gs scheme for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine (GKE). Databricks automatically provisions and autoscales drivers and executors as pods on GKE, so Google Cloud Platform administrators can view and monitor these Kubernetes resources in the Google Cloud console.

Caveats and Limitations

  • Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable and cannot be exposed by setting the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED property to true.
  • Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported (see the sketch after this list).
  • Stage third-party libraries in DBFS: Installing libraries from Google Storage is not supported.
  • Install third-party libraries as cluster-scoped: Notebook-scoped libraries are not supported.
  • Maven library installation is only supported in Databricks Runtime 8.1+.
  • /databricks/spark/conf/spark-env.sh is mounted as read-only:

    • Set sensitive Immuta configuration values directly in immuta_conf.xml: Do not use environment variables to set sensitive Immuta properties. Because the spark-env.sh file is read-only, Immuta cannot remove environment variables from it, so any sensitive values set there would remain visible to end users.
    • Use /immuta-scratch directly: The IMMUTA_LOCAL_SCRATCH_DIR property is unavailable.
  • Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.
  • The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.
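
As one illustration of the staging requirements above, a cluster-spec fragment pointing the init script at Google Storage rather than DBFS might look like the following sketch (bucket and script names are placeholders):

"init_scripts": [
  { "gcs": { "destination": "gs://<your-bucket>/immuta/immuta-init-script.sh" } }
]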

Databricks SQL Analytics Instances (Public Preview)

Immuta also supports using Databricks SQL Analytics natively, and users can configure multiple native Databricks SQL Analytics integrations in a single instance of Immuta. Note: The SQL Analytics endpoint that Immuta connects to must be continuously running.

For an overview of this access pattern, see the Native Databricks SQL Analytics Integration page. For details about enabling this integration, see this tutorial.