Skip to content

Databricks Installation Introduction

Audience: System Administrators

Content Summary: This page provides an overview of the native Databricks integration and both installation methods.

Note

This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.

Prerequisites

  • Databricks instance: Premium tier workspace and Cluster access control enabled
  • Databricks instance has network level access to Immuta instance
  • Access to Immuta releases
  • Permissions and access to download (outside Internet access) or transfer files to the host machine

Recommended Databricks Workspace Configurations:

Note: Azure Databricks authenticates users with Azure AD. Be sure to configure your Immuta instance with an IAM that uses the same user ID as does Azure AD. Immuta's Spark security plugin will look to match this user ID between the two systems. See this Azure Active Directory page for details.

Supported Databricks Runtimes

Immuta supports these Databricks Runtimes:

  • 5.5
  • 6.4
  • 7.3
  • 7.4
  • 7.5
  • 7.6
  • 8.0
  • 8.1
  • 8.2
  • 8.3
  • 8.4
  • 9.1 (only supported on Python/SQL clusters)

Supported Clusters

Immuta supports both high concurrency and standard clusters. However, the languages supported by these cluster types differ.

  • High Concurrency Cluster Supported Languages:

    • Python
    • SQL
    • R (requires advanced configuration; work with your Immuta support professional to use R)
  • Standard Cluster Supported Languages:

    • Python
    • SQL
    • R (requires advanced configuration; work with your Immuta support professional to use R)
    • Scala (requires advanced configuration; work with your Immuta support professional to use Scala)

Databricks Installation Overview

Users Who Can Read Raw Tables On-Cluster

  • If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.

  • If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.

The Immuta Databricks integration injects an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin creates an "immuta" database that is available for querying and intercepts all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta instance and applied before returning the results to the user.

The Databricks cluster init script provided by Immuta downloads the Immuta artifacts onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for policy enforcement.

The cluster init script uses environment variables in order to

  • Determine the location of the required artifacts for downloading.
  • Authenticate with the service/storage containing the artifacts.

Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.

Installation Methods

There are two installation options for Databricks. Click a link below to navigate to a tutorial for your chosen method:

  • Simplified Configuration: The steps to enable the integration with this method include
    1. Adding the integration on the App Settings page.
    2. Downloading or automatically pushing cluster policies to your Databricks workspace.
    3. Creating or restarting your cluster.
  • Manual Configuration: The steps to enable the integration with this method include
    1. Downloading and configuring Immuta artifacts.
    2. Staging Immuta artifacts somewhere the cluster can read from during its startup procedures.
    3. Protecting Immuta environment variables with Databricks Secrets.
    4. Creating and configuring the cluster to start with the init script and load Immuta into its SparkSQL environment.

Debugging Immuta Installation Issues

For easier debugging of the Immuta Databricks installation, enable cluster init script logging. In the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.

For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.

Using the Validation and Debugging Notebook

The Validation and Debugging Notebook (immuta-validation.ipynb) is packaged with other Databricks release artifacts (for manual installations), or it can be downloaded from the App Settings page when configuring native Databricks through the Immuta UI. This notebook is designed to be used by or under the guidance of an Immuta Support Professional.

Benchmark Suite

  1. Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
  2. Click the arrow next to your name and select Import.
  3. Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.