Customizing the Integration
Last updated
Was this helpful?
Last updated
Was this helpful?
You can customize the Databricks Spark integration settings using these components Immuta provides:
Immuta provides cluster policies that set the and configuration on your Databricks cluster once you apply that policy to your cluster. These policies generated by Immuta must be applied to your cluster manually. The includes instructions for generating and applying these cluster policies. Each cluster policy is described below.
In some cases it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration
to allow Spark to read data.
For example, when accessing external tables stored in Azure Data Lake Gen2, Spark must have credentials to access the target containers or filesystems in Azure Data Lake Gen2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access Azure Data Lake Gen2.
Generally, Immuta prevents users from seeing data unless they are explicitly given access, which blocks access to raw sources in the underlying databases.
Databricks non-privileged users will only see sources to which they are subscribed in Immuta, and this can present problems if organizations have a data lake full of non-sensitive data and Immuta removes access to all of it. The limited enforcement scope feature addresses this challenge by allowing Immuta users to access any tables that are not protected by Immuta (i.e., not registered as a data source or a table in a native workspace). Although this is similar to how privileged users in Databricks operate, non-privileged users cannot bypass Immuta controls.
Protected until made available by policy: This setting means all tables are hidden until a user is granted access through an Immuta policy. This is how most databases work and assumes least privileged access and also means you will have to register all tables with Immuta if this is disabled.
Available until protected by policy: This setting means all tables are open until explicitly registered and protected by Immuta. This makes sense if most of your tables are non-sensitive and you can pick and choose which to protect. This setting allows both non-Immuta reads and non-Immuta writes:
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations.
When a user runs a Spark job in Databricks, the Immuta plugin automatically submits ephemeral overrides for that user to Immuta for all applicable data sources to use the current cluster as compute for all subsequent metadata operations for that user against the applicable data sources.
Immuta projects combine users and data sources under a common purpose. Sometimes this purpose is for a single user to organize their data sources or to control an entire schema of data sources through a single projects screen; however, most often this is an Immuta purpose for which the data has been approved to be used and will restrict access to data and streamline team collaboration. Consequently, data owners can restrict access to data for a specified purpose through projects.
When a user is working within the context of a project, data users will only see the data in that project. This helps to prevent data leaks when users collaborate. Users can switch project contexts to access various data sources while acting under the appropriate purpose. Consider adjusting the following project settings to suit your organization's needs:
This section describes how Immuta interacts with common Databricks features.
The CDF can be read if the querying user is allowed to read the raw data and ONE of the following statements is true:
the table is in the current workspace
the table is in a scratch path
non-Immuta reads are enabled AND the table does not intersect with a workspace under which the current user is not acting
non-Immuta reads are enabled AND the table is not part of an Immuta data source
Security vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile
that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine if the risk does not apply to your environment or use case.
The trusted libraries feature allows Databricks cluster administrators to avoid . An administrator can specify an installed library as trusted, which will enable that library's code to bypass the Immuta security manager. This feature does not impact Immuta's ability to apply policies; trusting a library only allows code through that otherwise would have been blocked by the Security Manager.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
Library source
is Upload
, DBFS
or DBFS/S3
and the Library Type
is Jar
.
Library source
is Maven
.
Limitations
Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...
) is not supported.
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
waiting for library installation to complete before running any third-party library commands or
executing a Spark query. This will force Immuta to wait for any trusted Immuta libraries to complete installation before proceeding.
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or using the Databricks Libraries API. Only libraries specifically listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS
environment variable will be trusted by Immuta, which does not include installed transitive dependencies. This effectively means that any code paths that include a class from a transitive dependency but do not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency of a file-util
library, the user will not be able to directly use the file-util
library to read a sensitive file that is normally protected by the Immuta security manager.
In many cases, it is not a problem if dependent libraries aren't trusted because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library needs to be trusted, there is a workaround:
In the above example, where slf4j
is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar
to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS
environment variable and restart your cluster.
Local mode: The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.
Remote mode: Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.
Scratch paths are cluster-specific remote file paths that Databricks users are allowed to directly read from and write to without restriction. The creator of a Databricks cluster specifies the set of remote file paths that are designated as scratch paths on that cluster when they configure a Databricks cluster. Scratch paths are useful for scenarios where non-sensitive data needs to be written out to a specific location using a Databricks cluster protected by Immuta.
Where Scala language support is needed, this configuration can be used in the Custom .
According to Databricks’ cluster type support documentation, Scala clusters are intended for . However, nothing inherently prevents a Scala cluster from being configured for multiple users. Even with the Immuta Security Manager enabled, there are limitations to user isolation within a Scala job.
For a secure configuration, it is recommended that clusters intended for Scala workloads are limited to Scala jobs only and are made homogeneous through the use of or externally via convention/cluster ACLs. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)
For full details on Databricks’ best practices in configuring clusters, read their .
: Credential Passthrough (required on Databricks) allows a single-user cluster to be created. This setting automatically configures the cluster to assume the role of the attached user when reading from storage. Because Immuta requires that raw data is readable by the cluster, the instance profile associated with the cluster should be used rather than a role assigned to the attached user.
: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, it is not recommended to deploy a multi-user cluster. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).
The lists the various possible settings controlled by these variables that you can set in your cluster policy before attaching it to your cluster.
To use an additional Hadoop configuration file, set the to be the full URI to this file.
: Immuta users with regular (non-privileged) Databricks roles may SELECT
from tables that are not registered in Immuta. This setting does not allow reading data directly with commands like spark.read.format("x"). Users are still required to read data and query tables using Spark SQL. When non-Immuta reads are enabled through the cluster policy, Immuta users will see all databases and tables when they run show databases or show tables. However, this does not mean they will be able to query all of them.
: Immuta users with regular (non-privileged) Databricks roles can run DDL commands and data-modifying commands against tables or spaces that are not registered in Immuta. With non-Immuta writes enabled through the cluster policy, users on the cluster can mix any policy-enforced data they may have access to via any registered data sources in Immuta with non-Immuta data and write the ensuing result to a non-Immuta write space where it would be visible to others. If this is not a desired possibility, the cluster should instead be configured to only use Immuta’s project workspaces.
The includes instructions for applying these settings to your cluster.
For more details about ephemeral overrides and how to configure or disable them, see the .
Project UDFs (web service and on-cluster caches): Immuta caches a mapping of user accounts and users' current projects in the Immuta Web Service and on-cluster. When users change their project with UDFs instead of the Immuta UI, Immuta invalidates all the caches on-cluster (so that everything changes immediately) and the cluster submits a request to change the project context to a web worker. Immediately after that request, another call is made to a web worker to refresh the current project. To allow use of project UDFs in Spark jobs, . Otherwise, caching could cause dissonance among the requests and calls to multiple web workers when users try to change their project contexts.
Preventing users from changing projects within a session: If your compliance requirements restrict users from changing projects within a session, you can block the use of Immuta's project UDFs on a Databricks Spark cluster. To do so, .
Databricks users can see the on queried tables if they are allowed to read raw data and meet specific qualifications. Immuta does not support applying policies to the changed data, and the CDF cannot be read for data source tables if the user does not have access to the raw data in Databricks or for .
When users install third-party libraries, those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta. See the to add a trusted library to your configuration.
Add the transitive dependency jar paths to the . In the driver log4j
logs, Databricks outputs the source jar locations when it installs transitive dependencies. In the cluster driver logs, look for a log message similar to the following:
Connect any of these to work with your Databricks Spark integration so data owners can tag their data.
Immuta supports the use of external metastores in :
For more details about these deployment modes, see .
Download the metastore jars and point to them as specified in . Metastore jars must end up on the cluster's local disk at this explicit path: /databricks/hive_metastore_jars
.
To use AWS Glue Data Catalog as the metastore for Databricks, see the .
Users on Databricks Runtimes 8+ can manage notebook-scoped libraries with .
However, this functionality differs from the , and Python libraries are not supported as trusted libraries. The Immuta Security Manager will deny the code of libraries installed with %pip
access to sensitive resources.
To configure a scratch path, use the .