Customizing the Integration

You can customize the Databricks Spark integration settings using these components Immuta provides:

Cluster policies

Immuta provides cluster policies that set the Spark environment variables and configuration on your Databricks cluster once you apply a policy to that cluster. Immuta generates these cluster policies, but you must apply them to your cluster manually. The Configure a Databricks Spark integration guide includes instructions for generating and applying these cluster policies. Each cluster policy is described below.
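As a rough sketch of the shape of a Databricks cluster policy (the attribute paths follow Databricks' cluster policy definition format, but the values below are illustrative only and are not the policy Immuta actually generates), a policy that pins a Spark environment variable and a Spark configuration value might look like this:

{
  "spark_env_vars.IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS": {
    "type": "fixed",
    "value": "false"
  },
  "spark_conf.spark.databricks.repl.allowedLanguages": {
    "type": "fixed",
    "value": "python,sql"
  }
}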

Python and SQL

This is the most performant policy configuration.

In this configuration, Immuta is able to rely on Databricks-native security controls, reducing overhead. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes. This Immuta cluster configuration relies on Py4J security being enabled. Consequently, the following Databricks features are unsupported:

  • Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier)

  • dbutils.fs

  • Databricks Connect client library

For full details on Databricks’ best practices in configuring clusters, read their governance documentation.

Python, SQL, and R

Additional overhead: Compared to the Python and SQL cluster policy, this configuration trades some additional overhead for added support of the R language.

In this configuration, you are able to rely on the Databricks-native security controls. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes.

As in the Python and SQL configuration, Py4J security is enabled for the Python, SQL, and R configuration. However, because R has been added, Immuta enables the Security Manager in addition to Py4J security to provide further security guarantees. For example, by default all actions in R execute as the root user; among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptables restrictions, a user may freely access the cluster's cloud storage credentials. To address these security issues, Immuta's initialization script wraps the R and Rscript binaries so that each command launches as a temporary, non-privileged user with limited filesystem and network access, and installs the Immuta Security Manager, which prevents users from bypassing policies and protects against the vulnerabilities above from within the JVM.

Consequently, the cost of introducing R is a small increase in performance overhead from the Security Manager; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

The following Databricks features are unsupported when this cluster policy is applied:

  • Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier)

  • dbutils.fs

  • Databricks Connect client library

For full details on Databricks’ best practices in configuring clusters, read their governance documentation.

Python, SQL, and R with library support

Py4J security disabled: In addition to support for Python, SQL, and R, this configuration adds support for additional Python libraries and utilities by disabling Databricks-native Py4J security.

This configuration does not rely on Databricks-native Py4J security to secure the cluster; process isolation is still enabled to secure filesystem and network access from within Python processes. On an Immuta-enabled cluster, once Py4J security is disabled, the Immuta Security Manager is installed to prevent nefarious actions from Python in the JVM. Disabling Py4J security also allows for expanded Python library support, including many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs.

By default, all actions in R execute as the root user. Among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptables restrictions, a user may freely access the cluster's cloud storage credentials. To properly support the use of the R language, Immuta's initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user with limited filesystem and network access. The Immuta Security Manager is also installed to prevent users from bypassing policies and to protect against the above vulnerabilities from within the JVM.

The Security Manager will incur a small increase in performance overhead; average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

A homogeneous cluster is recommended for configurations where Py4J security is disabled. If all users have the same level of authorization, there is no risk of data leakage, even if a nefarious action were taken.

For full details on Databricks’ best practices in configuring clusters, read their governance documentation.

Scala

Scala clusters: This configuration is for Scala-only clusters.

Where Scala language support is needed, this configuration can be used in the Custom access mode.

According to Databricks’ cluster type support documentation, Scala clusters are intended for single users only. However, nothing inherently prevents a Scala cluster from being configured for multiple users. Even with the Immuta Security Manager enabled, there are limitations to user isolation within a Scala job.

For a secure configuration, it is recommended that clusters intended for Scala workloads are limited to Scala jobs only and are made homogeneous through the use of project equalization or externally via convention/cluster ACLs. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

For full details on Databricks’ best practices in configuring clusters, read their governance documentation.

Sparklyr

Single-user clusters recommended: Like Databricks, Immuta recommends single-user clusters for sparklyr when user isolation is required. A single-user cluster can either be a job cluster or a cluster with credential passthrough enabled. Note: spark-submit jobs are not currently supported.

Two cluster types can be configured with sparklyr: Single-User Clusters (recommended) and Multi-User Clusters (discouraged).

  • Single-User Clusters: Credential Passthrough (required on Databricks) allows a single-user cluster to be created. This setting automatically configures the cluster to assume the role of the attached user when reading from storage. Because Immuta requires that raw data is readable by the cluster, the instance profile associated with the cluster should be used rather than a role assigned to the attached user.

  • Multi-User Clusters: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, it is not recommended to deploy a multi-user cluster. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).

Single-user cluster configuration

1 - Enable sparklyr

In addition to the configuration for an Immuta cluster with R, add this environment variable to the Environment Variables section of the cluster:

IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true

This configuration makes changes to the iptables rules on the cluster to allow the sparklyr client to connect to the required ports on the JVM used by the sparklyr backend service.

2 - Set up a sparklyr connection in Databricks

  1. Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so library(sparklyr) in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from CRAN. Additionally, loading library(DBI) will allow you to execute SQL queries.

  2. Set up a sparklyr connection:

    sc <- spark_connect(method = "databricks")

  3. Pass the connection object to execute queries:

    dbGetQuery(sc, "show tables in immuta")

3 - Configure a single-user cluster

Add the following items to the Spark Config section of the cluster:

spark.databricks.passthrough.enabled true

spark.databricks.pyspark.trustedFilesystems com.databricks.s3a.S3AFileSystem,shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,com.databricks.adl.AdlFileSystem,shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem,shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem,shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem,org.apache.hadoop.fs.ImmutaSecureFileSystemWrapper

spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.InstanceProfileCredentialsProvider

The spark.databricks.pyspark.trustedFilesystems setting is required to allow Immuta's wrapper FileSystem (used in conjunction with the Security Manager for data security purposes) to be used with credential passthrough. Additionally, the InstanceProfileCredentialsProvider must be configured so that the cluster continues to use its instance profile for data access, rather than a role associated with the attached user.

Multi-user cluster configuration

Avoid deploying multi-user clusters with sparklyr configuration

It is possible, but not recommended, to deploy sparklyr on a multi-user cluster, because Immuta cannot guarantee user isolation in a multi-user sparklyr configuration.

The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta native workspaces.

  1. Add the following environment variables to the Environment Variables section of your cluster configuration:

    IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true
    
    IMMUTA_SPARK_REQUIRE_EQUALIZATION=true
    
    IMMUTA_SPARK_CURRENT_USER_SCIM_FALLBACK=false

  2. Add the following items to the Spark Config section:

    immuta.spark.acl.assume.not.privileged true
    
    immuta.api.key=<user’s API key>

Limitations

Immuta's integration with sparklyr does not currently support:

  • spark-submit jobs

  • UDFs

Spark environment variables

The Spark environment variables reference guide lists the settings these variables control; you can set them in your cluster policy before attaching it to your cluster.

Additional Hadoop configuration file (optional)

In some cases it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration to allow Spark to read data.

For example, when accessing external tables stored in Azure Data Lake Gen2, Spark must have credentials to access the target containers or filesystems in Azure Data Lake Gen2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access Azure Data Lake Gen2.

To use an additional Hadoop configuration file, set the IMMUTA_INIT_ADDITIONAL_CONF_URI Spark environment variable to be the full URI to this file.
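As a sketch (the file name, URI, and storage account name below are hypothetical), the additional file is a standard Hadoop configuration XML document containing the credential — here using the standard ABFS account key property — and the environment variable points to wherever that file is stored:

<configuration>
  <property>
    <name>fs.azure.account.key.examplestorageaccount.dfs.core.windows.net</name>
    <value>EXAMPLE-STORAGE-ACCOUNT-KEY</value>
  </property>
</configuration>

IMMUTA_INIT_ADDITIONAL_CONF_URI=s3://example-config-bucket/immuta/immuta_additional_conf.xml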

Configurable settings

Data source settings

Protected and unprotected tables

Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.

Databricks non-privileged users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The limited enforcement scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.

  • Protected until made available by policy: This setting means all tables are hidden until a user is granted access through an Immuta policy. This is how most databases work: it assumes least-privileged access, and it means you will have to register all of your tables with Immuta.

  • Available until protected by policy: This setting means all tables are open until explicitly registered and protected by Immuta. This makes sense if most of your tables are non-sensitive and you can pick and choose which to protect. This setting allows both non-Immuta reads and non-Immuta writes:

    • IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS: Immuta users with regular (non-privileged) Databricks roles may SELECT from tables that are not registered in Immuta. This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL. When non-Immuta reads are enabled through the cluster policy, Immuta users will see all databases and tables when they run show databases or show tables. However, this does not mean they will be able to query all of them.

    • IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES: Immuta users with regular (non-privileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta. With non-Immuta writes enabled through the cluster policy, users on the cluster can mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s project workspaces.

The Configure a Databricks Spark integration guide includes instructions for applying these settings to your cluster.
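For example, a cluster that should allow both non-Immuta reads and non-Immuta writes would include environment variables along these lines (shown for illustration; follow the configuration guide for the authoritative procedure):

IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS=true
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES=true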

Ephemeral overrides

In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.

Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.

When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta for all applicable data sources, so that the current cluster is used as compute for all of that user's subsequent metadata operations against those data sources.

For more details about ephemeral overrides and how to configure or disable them, see the Ephemeral overrides page.

Restricting users' access with Immuta projects

Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is simply for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; most often, however, a project represents an Immuta purpose for which the data has been approved to be used, which restricts access to data and streamlines team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.

When users work within the context of a project, they will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose. Consider adjusting the following project settings to suit your organization's needs:

  • Project UDFs (web service and on-cluster caches): Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project. To allow use of project UDFs in Spark jobs, raise the caching on-cluster and lower the cache timeouts for the Immuta Web Service. Otherwise, caching could cause dissonance among the requests and calls to multiple web workers when users try to change their project contexts.

  • Preventing users from changing projects within a session: If your compliance requirements restrict users from changing projects within a session, you can block the use of Immuta's project UDFs on a Databricks Spark cluster. To do so, configure the IMMUTA_SPARK_DATABRICKS_DISABLED_UDFS Spark environment variable.

Databricks features

This section describes how Immuta interacts with common Databricks features.

Change data feed

Databricks users can see the change data feed (CDF) on queried tables if they are allowed to read raw data and meet specific qualifications. Immuta does not support applying policies to the changed data, and the CDF cannot be read for streaming queries or for data source tables when the user does not have access to the raw data in Databricks.

The CDF can be read if the querying user is allowed to read the raw data and ONE of the following statements is true:

  • the table is in the current workspace

  • the table is in a scratch path

  • non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting

  • non-Immuta reads are enabled AND the table is not part of an Immuta data source

Databricks trusted libraries

The trusted libraries feature allows Databricks cluster administrators to avoid Immuta Security Manager errors when using third-party libraries. An administrator can specify an installed library as trusted, which will enable that library's code to bypass the Immuta Security Manager. This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through that would otherwise be blocked by the Security Manager.

The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:

  • Library source is Upload, DBFS or DBFS/S3 and the Library Type is Jar.

  • Library source is Maven.

When users install third-party libraries, those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta. See the Install a trusted library guide to add a trusted library to your configuration.
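As a minimal sketch (the jar path below is hypothetical, and this assumes the variable accepts a list of artifact URIs), trusting an uploaded jar amounts to listing its URI in the trusted-library environment variable on the cluster:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/FileStore/jars/my-trusted-lib.jar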

Limitations

  • Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.

  • Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either

    • waiting for library installation to complete before running any third-party library commands or

    • executing a Spark query. This will force Immuta to wait for any trusted libraries to complete installation before proceeding.

  • When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta Security Manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util library, the user will not be able to directly use the file-util library to read a sensitive file that is normally protected by the Immuta Security Manager.

    In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:

    1. Add the transitive dependency jar paths to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS Spark environment variable. In the driver log4j logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for a log message similar to the following:

      INFO LibraryDownloadManager: Downloaded library dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar as
      local file /local_disk0/tmp/addedFile8569165920223626894slf4j_api_1_7_25-784af.jar
    2. In the above example, where slf4j is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.
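    Continuing the sketch from the trusted libraries section above, and assuming the variable accepts a comma-separated list of URIs, the resulting value would include both the trusted library and its transitive dependency (the first jar path below is hypothetical; the second comes from the log message above):

      IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/FileStore/jars/my-trusted-lib.jar,dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar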

External catalogs

Connect any of the supported external catalogs to your Databricks Spark integration so that data owners can tag their data.

External metastores

Immuta supports the use of external metastores in local or remote mode:

  • Local mode: The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.

  • Remote mode: Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.

For more details about these deployment modes, see how to set up Databricks clusters to connect to an existing external Apache Hive metastore.

Configure external Hive metastore

Download the metastore jars and point to them as specified in Databricks documentation. Metastore jars must end up on the cluster's local disk at this explicit path: /databricks/hive_metastore_jars.

If using DBR 7.x with Hive 2.3.x, either

  • Set spark.sql.hive.metastore.version to 2.3.7 and spark.sql.hive.metastore.jars to builtin or

  • Download the metastore jars and set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/* as before.
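For example, the first option corresponds to the following entries in the cluster's Spark Config section:

spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin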

Configure AWS Glue Data Catalog

To use AWS Glue Data Catalog as the metastore for Databricks, see the Databricks documentation.

Notebook-scoped libraries on machine learning clusters

Users on Databricks Runtimes 8+ can manage notebook-scoped libraries with %pip commands.

However, this functionality differs from the support for Databricks trusted libraries, and Python libraries are not supported as trusted libraries. Code from libraries installed with %pip will be denied access to sensitive resources by the Immuta Security Manager.

Scratch paths

Scratch paths are cluster-specific remote file paths that Databricks users are allowed to read from and write to directly, without restriction. The creator of a Databricks cluster designates the set of remote file paths that act as scratch paths when they configure the cluster. Scratch paths are useful when non-sensitive data needs to be written out to a specific location from a Databricks cluster protected by Immuta.

To configure a scratch path, use the IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS Spark environment variable.
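For example (the bucket and paths below are hypothetical, and this assumes the variable accepts a comma-separated list of remote paths):

IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=s3://example-bucket/scratch,s3://example-bucket/tmp-output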
