Customizing the Integration
You can customize the Databricks Spark integration settings using these components Immuta provides:
Cluster policies
Immuta provides cluster policies that set the Spark environment variables and configuration on your Databricks cluster once you apply that policy to your cluster. These policies generated by Immuta must be applied to your cluster manually. The Configure a Databricks Spark integration guide includes instructions for generating and applying these cluster policies. Each cluster policy is described below.
Spark environment variables
The Spark environment variables reference guide lists the various possible settings controlled by these variables that you can set in your cluster policy before attaching it to your cluster.
Additional Hadoop configuration file (optional)
In some cases it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration to allow Spark to read data.
For example, when accessing external tables stored in Azure Data Lake Gen2, Spark must have credentials to access the target containers or filesystems in Azure Data Lake Gen2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access Azure Data Lake Gen2.
To use an additional Hadoop configuration file, set the IMMUTA_INIT_ADDITIONAL_CONF_URI Spark environment variable to the full URI of this file.
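As a rough illustration of the Azure Data Lake Gen2 case above, the sketch below renders a storage account key as a Hadoop-style XML configuration document. It assumes the additional file uses the standard Hadoop <configuration> XML layout, which this page does not state. The storage account name, the key, and the file URI are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_hadoop_conf_xml(properties: dict[str, str]) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical storage account name and placeholder key; substitute real values.
STORAGE_ACCOUNT = "examplestorageaccount"
conf_xml = build_hadoop_conf_xml(
    {
        f"fs.azure.account.key.{STORAGE_ACCOUNT}.dfs.core.windows.net": "<storage-account-key>",
    }
)

# Placeholder only: the variable expects the full URI of the uploaded file.
spark_env_vars = {
    "IMMUTA_INIT_ADDITIONAL_CONF_URI": "<full-uri-to-the-configuration-file>",
}

print(conf_xml)
```

Because the file holds a credential, it would normally be stored somewhere end users cannot read it.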
Configurable settings
Data source settings
Protected and unprotected tables
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.
Databricks non-privileged users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The limited enforcement scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
Protected until made available by policy: This setting means all tables are hidden until a user is granted access through an Immuta policy. This is how most databases work, and it assumes least-privileged access. It also means you will have to register all tables with Immuta if this is disabled.
Available until protected by policy: This setting means all tables are open until explicitly registered and protected by Immuta. This makes sense if most of your tables are non-sensitive and you can pick and choose which to protect. This setting allows both non-Immuta reads and non-Immuta writes:
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS: Immuta users with regular (non-privileged) Databricks roles may SELECT from tables that are not registered in Immuta. This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL. When non-Immuta reads are enabled through the cluster policy, Immuta users will see all databases and tables when they run show databases or show tables. However, this does not mean they will be able to query all of them.
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES: Immuta users with regular (non-privileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta. With non-Immuta writes enabled through the cluster policy, users on the cluster can mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s project workspaces.
The Configure a Databricks Spark integration guide includes instructions for applying these settings to your cluster.
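For orientation, the two variables above might appear in a Databricks cluster policy definition roughly as sketched here. The spark_env_vars.<NAME> attribute path with a fixed type follows the Databricks cluster policy format. The value "true" is an assumption about how these variables are enabled, and Immuta generates the real policy for you, so treat this only as a picture of the shape.

```python
import json

# Assumption: both variables are switched on with the string "true".
NON_IMMUTA_ACCESS = {
    "IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS": "true",
    "IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES": "true",
}


def to_policy_fragment(env_vars: dict[str, str]) -> dict[str, dict[str, str]]:
    """Pin each environment variable to a fixed value in a cluster policy definition."""
    return {
        f"spark_env_vars.{name}": {"type": "fixed", "value": value}
        for name, value in env_vars.items()
    }


policy_fragment = to_policy_fragment(NON_IMMUTA_ACCESS)
print(json.dumps(policy_fragment, indent=2))
```

Enabling only the reads entry would leave writes to non-Immuta spaces blocked, which may suit clusters where mixing policy-enforced and unprotected data is a concern.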
Ephemeral overrides
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta for all applicable data sources to use the current cluster as compute for all subsequent metadata operations for that user against the applicable data sources.
For more details about ephemeral overrides and how to configure or disable them, see the Ephemeral overrides page.
Restricting users' access with Immuta projects
Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; however, most often this is an Immuta purpose for which the data has been approved to be used and will restrict access to data and streamline team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.
When users are working within the context of a project, they will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose. Consider adjusting the following project settings to suit your organization's needs:
Project UDFs (web service and on-cluster caches): Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project. To allow use of project UDFs in Spark jobs, raise the caching on-cluster and lower the cache timeouts for the Immuta Web Service. Otherwise, caching could cause dissonance among the requests and calls to multiple web workers when users try to change their project contexts.
Preventing users from changing projects within a session: If your compliance requirements restrict users from changing projects within a session, you can block the use of Immuta's project UDFs on a Databricks Spark cluster. To do so, configure the IMMUTA_SPARK_DATABRICKS_DISABLED_UDFS Spark environment variable.
Databricks features
This section describes how Immuta interacts with common Databricks features.
Change data feed
Databricks users can see the change data feed (CDF) on queried tables if they are allowed to read raw data and meet specific qualifications. Immuta does not support applying policies to the changed data, and the CDF cannot be read for data source tables if the user does not have access to the raw data in Databricks or for streaming queries.
The CDF can be read if the querying user is allowed to read the raw data and ONE of the following statements is true:
the table is in the current workspace
the table is in a scratch path
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting
non-Immuta reads are enabled AND the table is not part of an Immuta data source
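The rule above can be read as a single boolean check. The following is a minimal sketch of that logic only, not Immuta's implementation. The class and field names are hypothetical, and the streaming-query restriction mentioned earlier is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdfRequest:
    """Hypothetical inputs describing one CDF read attempt."""

    can_read_raw_data: bool
    table_in_current_workspace: bool
    table_in_scratch_path: bool
    non_immuta_reads_enabled: bool
    intersects_inactive_workspace: bool  # workspace the user is not acting under
    is_immuta_data_source: bool


def can_read_cdf(req: CdfRequest) -> bool:
    """Return True when raw data access holds and at least one listed condition is met."""
    if not req.can_read_raw_data:
        return False
    return (
        req.table_in_current_workspace
        or req.table_in_scratch_path
        or (req.non_immuta_reads_enabled and not req.intersects_inactive_workspace)
        or (req.non_immuta_reads_enabled and not req.is_immuta_data_source)
    )


# Example: an unregistered table on a cluster with non-Immuta reads enabled.
example = CdfRequest(
    can_read_raw_data=True,
    table_in_current_workspace=False,
    table_in_scratch_path=False,
    non_immuta_reads_enabled=True,
    intersects_inactive_workspace=True,
    is_immuta_data_source=False,
)
print(can_read_cdf(example))
```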
Databricks trusted libraries
Security vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine if the risk does not apply to your environment or use case.
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. An administrator can specify an installed library as trusted, which will enable that library's code to bypass the Immuta security manager. This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through that otherwise would have been blocked by the Security Manager.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
Library source is Upload, DBFS, or DBFS/S3 and the Library Type is Jar.
Library source is Maven.
When users install third-party libraries, those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta. See the Install a trusted library guide to add a trusted library to your configuration.
Limitations
Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
waiting for library installation to complete before running any third-party library commands or
executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util library, the user will not be able to directly use the file-util library to read a sensitive file that is normally protected by the Immuta security manager.
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
Add the transitive dependency jar paths to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS Spark environment variable. In the driver log4j logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for the log message that reports the source jar location of the transitive dependency. For example, where slf4j is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.
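For the installation-timing limitation above, one way to wait for library installation is to poll the Databricks Libraries API cluster-status endpoint (GET /api/2.0/libraries/cluster-status) until every library reports INSTALLED. The sketch below assumes a caller-supplied fetch_status function that performs the HTTP request and returns the parsed JSON. That function, the timeout, and the polling interval are hypothetical choices, not part of Immuta or Databricks.

```python
import time
from typing import Callable


def all_libraries_installed(payload: dict) -> bool:
    """True when every entry in library_statuses reports INSTALLED."""
    statuses = payload.get("library_statuses", [])
    return all(entry.get("status") == "INSTALLED" for entry in statuses)


def wait_for_libraries(
    fetch_status: Callable[[], dict],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_status until all libraries are installed or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        if all_libraries_installed(fetch_status()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A library that ends in a FAILED state never becomes INSTALLED, so this sketch returns False for it once the timeout passes rather than raising an error.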
External catalogs
Connect any of these supported external catalogs to work with your Databricks Spark integration so data owners can tag their data.
External metastores
Immuta supports the use of external metastores in local or remote mode:
Local mode: The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.
Remote mode: Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.
For more details about these deployment modes, see how to set up Databricks clusters to connect to an existing external Apache Hive metastore.
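To make the difference between the two modes concrete, the sketch below builds the Spark configuration entries each mode typically relies on. The property names are the ones used in the Databricks documentation for external Apache Hive metastores. The helper function names are hypothetical, the host, port, JDBC URL, driver, and credentials are placeholders, and other required settings (such as the metastore version and jars) are omitted.

```python
def local_mode_conf(jdbc_url: str, driver: str, user: str, password: str) -> dict[str, str]:
    """Local mode: the cluster's metastore client talks to the database over JDBC."""
    return {
        "spark.hadoop.javax.jdo.option.ConnectionURL": jdbc_url,
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": driver,
        "spark.hadoop.javax.jdo.option.ConnectionUserName": user,
        "spark.hadoop.javax.jdo.option.ConnectionPassword": password,
    }


def remote_mode_conf(host: str, port: int) -> dict[str, str]:
    """Remote mode: the metastore client talks to a metastore service over Thrift."""
    return {"spark.hadoop.hive.metastore.uris": f"thrift://{host}:{port}"}


# Placeholder values for illustration only.
local_conf = local_mode_conf(
    jdbc_url="jdbc:mysql://<metastore-db-host>:3306/<database>",
    driver="org.mariadb.jdbc.Driver",
    user="<user>",
    password="<password>",
)
remote_conf = remote_mode_conf("metastore.example.com", 9083)
print(remote_conf)
```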
Notebook-scoped libraries on machine learning clusters
Users on Databricks Runtimes 8+ can manage notebook-scoped libraries with %pip commands.
However, this functionality differs from the support for Databricks trusted libraries, and Python libraries are not supported as trusted libraries. The Immuta Security Manager will deny the code of libraries installed with %pip access to sensitive resources.
Scratch paths
Scratch paths are cluster-specific remote file paths that Databricks users are allowed to directly read from and write to without restriction. The creator of a Databricks cluster specifies the set of remote file paths that are designated as scratch paths on that cluster when they configure a Databricks cluster. Scratch paths are useful for scenarios where non-sensitive data needs to be written out to a specific location using a Databricks cluster protected by Immuta.
To configure a scratch path, use the IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS Spark environment variable.