
Simplified Databricks Configuration

Audience: System Administrators

Content Summary: This guide details the simplified installation method for enabling native access to Databricks with Immuta policies enforced.

Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.

1 - Add the Native Integration on the App Settings Page

  1. Log in to Immuta and click the App Settings icon in the left sidebar.
  2. Scroll to the System API Key subsection under HDFS and click Generate Key.


  3. Click Save and then Confirm.

  4. Scroll to the Native Integrations section, and click + Add a Native Integration.
  5. Select Databricks Integration from the dropdown menu.
  6. Complete the Hostname field.


  7. Select your configured Immuta IAM from the dropdown menu.

  8. Choose one of the following options for your data access model:
    • Protected until made available by policy: All tables are hidden until a user is granted access through an Immuta policy. This is how most databases work and follows the principle of least privilege, but it also means you must register all tables with Immuta.
    • Available until protected by policy: All tables are open until explicitly registered and protected by Immuta. This model suits environments where most tables are non-sensitive, since you can pick and choose which to protect.
  9. Select the Storage Access Type from the dropdown menu.
  10. Opt to add any Additional Hadoop Configuration Files.
  11. Click Add Native Integration.
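Taken together, steps 4–11 collect a handful of settings. As a rough sketch, they can be pictured as a configuration record like the following (the field names here are illustrative only, not Immuta's actual schema):

```python
# Hypothetical summary of the values gathered in the steps above.
# Field names are illustrative, not Immuta's internal representation.
native_integration = {
    "type": "Databricks Integration",
    "hostname": "your-workspace.cloud.databricks.com",  # step 6
    "immuta_iam": "bim",                                # step 7 (example IAM id)
    "access_model": "protected",   # "protected" = hidden until policy applies;
                                   # "available" = open until protected (step 8)
    "storage_access_type": "AWS",  # step 9
    "additional_hadoop_config_files": [],               # step 10 (optional)
}
```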

2 - Configure Cluster Policies

Several cluster policies are available on the App Settings page when configuring this integration. Use the tabs below to read more about each of these cluster policies before continuing with the tutorial.

Python & SQL

This is the most performant policy configuration.

In this configuration, Immuta is able to rely on Databricks-native security controls, reducing overhead. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes. This Immuta cluster configuration relies on Py4J security being enabled.

Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library. Additionally, only Python and SQL are available as supported languages.

Both Standard and High Concurrency cluster modes are supported.
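As an illustration of what this policy controls, a cluster under it would pin Spark settings along these lines. The keys are Databricks settings, but the values are a sketch only; the cluster policy generated on the App Settings page is authoritative:

```python
# Illustrative Spark conf for the Python & SQL policy: only Python and SQL
# are allowed, and process isolation plus Py4J security stay enabled.
python_sql_conf = {
    "spark.databricks.repl.allowedLanguages": "python,sql",
    "spark.databricks.pyspark.enableProcessIsolation": "true",
    "spark.databricks.pyspark.enablePy4JSecurity": "true",
}
```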

For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.

Python & SQL & R

Compared to the Python & SQL cluster policy, this configuration trades some additional overhead for added support of the R language.

In this configuration, you are able to rely on the Databricks-native security controls. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes.

Like the Python & SQL configuration, Py4J security is enabled for the Python & SQL & R configuration. However, because R has been added, Immuta enables the SecurityManager in addition to Py4J security to provide further guarantees. For example, by default all actions in R execute as the root user; among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To address these security issues, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user with limited filesystem and network access, and it installs the Immuta SecurityManager, which prevents users from bypassing policies and protects against the above vulnerabilities from within the JVM.

Consequently, the cost of introducing R is that the SecurityManager incurs a small increase in performance overhead; however, average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library.

When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

Both Standard and High Concurrency cluster modes are supported.

For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.

Python & SQL & R with Library Support

In addition to support for Python, SQL, and R, this configuration adds support for additional Python libraries and utilities by disabling Databricks-native Py4j security.

This configuration does not rely on Databricks-native Py4j security to secure the cluster, while process isolation is still enabled to secure filesystem and network access from within Python processes. On an Immuta-enabled cluster, once Py4J security is disabled the Immuta SecurityManager is installed to prevent nefarious actions from Python in the JVM. Disabling Py4J security also allows for expanded Python library support, including many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs.

By default, all actions in R will execute as the root user. Among other things, this permits access to the entire filesystem (including sensitive configuration data). And without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To properly support the use of the R language, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user. This user has limited filesystem and network access. The Immuta SecurityManager is also installed to prevent users from bypassing policies and protects against the above vulnerabilities from within the JVM.

The SecurityManager will incur a small increase in performance overhead; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

A homogeneous cluster is recommended for configurations where Py4J security is disabled. If all users have the same level of authorization, there would not be any data leakage, even if a nefarious action were taken.

Both Standard and High Concurrency cluster modes are supported.
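For contrast with the Python & SQL policy, here is a sketch of the settings this configuration changes: Py4J security is off, so the Immuta SecurityManager carries the enforcement. Values are illustrative only; the generated policy is authoritative:

```python
# Illustrative Spark conf for the "with Library Support" policy: process
# isolation stays on, but Py4J security is disabled to widen library support.
library_support_conf = {
    "spark.databricks.repl.allowedLanguages": "python,sql,r",
    "spark.databricks.pyspark.enableProcessIsolation": "true",  # still on
    "spark.databricks.pyspark.enablePy4JSecurity": "false",     # disabled
}
```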

For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.

Scala

Scala-only with a standard cluster.

Where Scala language support is needed, this configuration can be used in the standard cluster mode (high concurrency unavailable).

According to Databricks’ cluster type support documentation, Scala clusters are intended for single users only. However, nothing inherently prevents a Scala cluster from being configured for multiple users. Even with the Immuta SecurityManager enabled, there are limitations to user isolation within a Scala job.

For a secure configuration, it is recommended that clusters intended for Scala workloads are limited to Scala jobs only and are made homogeneous through the use of project equalization or externally via convention/cluster ACLs. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
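A sketch of the constraints this configuration implies (illustrative only; the cluster policy generated on the App Settings page is authoritative):

```python
# Illustrative shape of the Scala policy: Standard mode only, with Scala
# among the allowed languages. Values are placeholders, not the real policy.
scala_policy = {
    "cluster_mode": "Standard",  # High Concurrency is unavailable for Scala
    "spark_conf": {
        "spark.databricks.repl.allowedLanguages": "scala,python,sql",
    },
}
```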

For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.

  1. Click Configure Cluster Policies.


  2. Select one or more cluster policies in the matrix by clicking the Select button(s).

  3. Opt to make changes to these cluster policies by clicking Additional Policy Changes and editing the text field.


  4. Use one of the two Installation Types described in the tabs below to apply the policies to your cluster:

    Automatically Push Cluster Policies

    This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.

    1. Select the Automatically Push Cluster Policies radio button.
    2. Enter your Admin Token. This token must be for a user who can create cluster policies in Databricks.


    3. Click Apply Policies.

    Manually Push Cluster Policies

    This option allows you to manually push the cluster policies to the configured Databricks workspace: you download the generated files and upload them to the workspace yourself.

    1. Select the Manually Push Cluster Policies radio button.


    2. Click Download Init Script.

    3. Follow the steps in the Instructions to upload the init script to DBFS section.


    4. Click Download Policies, and then manually add these Cluster Policies in Databricks.

  5. Opt to click Download the Benchmarking Suite to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which requires both an Immuta and a non-Immuta cluster to generate test data and perform queries.

  6. Click Close, and then click Save and Confirm.
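For orientation when pushing policies manually, a downloaded cluster policy uses Databricks’ cluster-policy definition format, in which "fixed" entries pin an attribute to a value. A minimal hypothetical sketch — the init script path and conf values below are placeholders, not the actual generated policy:

```python
import json

# Hypothetical cluster-policy definition in Databricks' policy JSON format.
# "fixed" pins an attribute so cluster creators cannot change it.
policy_definition = {
    "spark_conf.spark.databricks.repl.allowedLanguages": {
        "type": "fixed",
        "value": "python,sql",
    },
    "init_scripts.0.dbfs.destination": {
        "type": "fixed",
        "value": "dbfs:/immuta/immuta_cluster_init_script.sh",  # placeholder path
    },
}

print(json.dumps(policy_definition, indent=2))
```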

3 - Add Policies to Your Cluster

  1. Create a cluster in Databricks by following the Databricks documentation.
  2. In the Policy dropdown, select the Cluster Policies you pushed or manually added from Immuta.


  3. Select a Cluster Mode: Immuta supports both High Concurrency and Standard clusters in Databricks.

  4. Opt to adjust Autopilot Options and Worker Type settings: The default values provided here may be more than what is necessary for non-production or smaller use cases. To reduce resource usage, you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
  5. Opt to configure the Instances tab in the Advanced Options section:

    • IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
  6. Click Create Cluster.
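The cluster creation above can also be scripted through the Databricks Clusters API (POST /api/2.0/clusters/create), attaching the Immuta cluster policy via its policy_id. A sketch of such a request body, with placeholder names and ids (the runtime and node type are examples, not requirements):

```python
# Hypothetical Clusters API request body; all ids and names are placeholders.
create_cluster_request = {
    "cluster_name": "immuta-protected-cluster",   # placeholder name
    "policy_id": "ABC123",                        # id of the pushed Immuta policy
    "spark_version": "11.3.x-scala2.12",          # example Databricks runtime
    "node_type_id": "i3.xlarge",                  # example worker type
    "autoscale": {"min_workers": 1, "max_workers": 2},
    "autotermination_minutes": 30,                # inactivity timeout
}
```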

4 - Query Immuta Data

When the Immuta-enabled Databricks cluster has been successfully started, Immuta will create an immuta database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the immuta database. Additionally, when configuring a Databricks cluster you can hide immuta from any calls to SHOW DATABASES so that users aren't misled or confused by its presence. For more details, see the Hiding the immuta Database in Databricks page.

  1. Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster.

  2. See the Databricks Data Source Creation guide for a detailed walkthrough of creating Databricks data sources in Immuta.
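The Can Attach To grant from step 1 can also be made through the Databricks Permissions API for clusters (PATCH /api/2.0/permissions/clusters/{cluster_id}). A sketch of the request body, with a placeholder user:

```python
# Hypothetical Permissions API request body; the user name is a placeholder.
grant_request = {
    "access_control_list": [
        {
            "user_name": "analyst@example.com",     # placeholder user
            "permission_level": "CAN_ATTACH_TO",    # lets the user use the cluster
        }
    ]
}
```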

Example Queries

Below are some example queries that can be run to obtain data from an Immuta-configured data source.

%sql
show tables in immuta;

%sql
select * from immuta.my_data_source limit 5;