Databricks Libraries
This page provides an overview of Immuta's Databricks Trusted Libraries feature and its support for Notebook-Scoped Libraries on Machine Learning Clusters.
Databricks Libraries and Immuta's Security Manager
The Immuta security manager blocks users from executing code that could allow them to gain access to sensitive data by only allowing select code paths to access sensitive files and methods. These select code paths give Immuta's code access to sensitive resources while blocking end users from accessing those resources directly.
Similarly, when users install third-party libraries, those libraries will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.
Databricks Trusted Libraries
The trusted libraries feature allows Databricks cluster administrators to avoid Immuta security manager errors when using third-party libraries. An administrator can specify an installed library as "trusted," which will enable that library's code to bypass the Immuta security manager. Contact your Immuta support professional for custom security configurations for your libraries.
This feature does not impact Immuta's ability to apply policies; trusting a library only allows its code past security manager checks that would previously have blocked it.
Security vulnerability
Using this feature could create a security vulnerability, depending on the third-party library. For example, if a library exposes a public method named readProtectedFile that displays the contents of a sensitive file, then trusting that library would allow end users access to that file. Work with your Immuta support professional to determine whether this risk applies to your environment or use case.
Databricks Libraries API: Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.
The following types of libraries are supported when installing a third-party library using the Databricks UI or the Databricks Libraries API:
- Library source is Upload, DBFS, or DBFS/S3 and the Library Type is Jar.
- Library source is Maven.
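For illustration, the sketch below installs a Maven library on a running cluster through the Databricks Libraries API (POST /api/2.0/libraries/install) using Python's requests. The workspace URL, token, cluster ID, and Maven coordinate are all hypothetical placeholders, not values from this guide.

```python
# Minimal sketch: install a Maven library via the Databricks Libraries API.
# All identifiers below are placeholders for your own workspace values.
import requests

DATABRICKS_HOST = "https://example.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "dapiXXXXXXXX"                                    # hypothetical access token

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": "0123-456789-abcdef",  # hypothetical cluster ID
        "libraries": [
            # A Maven library source; Immuta can trust it via a
            # maven:/group.id:artifact-id:version URI (see Troubleshooting).
            {"maven": {"coordinates": "org.example:trusted-lib:2.1.0"}}
        ],
    },
)
resp.raise_for_status()
```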
Limitations
Databricks installs libraries right after a cluster has started, but there is no guarantee that library installation will complete before a user's code is executed. If a user executes code before a trusted library installation has completed, Immuta will not be able to identify the library as trusted. This can be solved by either
- waiting for library installation to complete before running any third-party library commands, or
- executing a Spark query first, which forces Immuta to wait for any trusted libraries to finish installing before proceeding (see the sketch below this list).
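If you choose the second option, a minimal notebook sketch looks like this, assuming a standard Databricks Python notebook where spark is predefined:

```python
# Run any Spark query first so Immuta waits for trusted-library
# installation to complete before third-party code executes.
spark.sql("SELECT 1").collect()  # result unused; the query just forces the wait

# It is now safe to run commands that call into the trusted third-party library.
```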
When installing a library using Maven as a library source, Databricks will also install any transitive dependencies for the library. However, those transitive dependencies are installed behind the scenes and will not appear as installed libraries in either the Databricks UI or the Databricks Libraries API. Only libraries explicitly listed in the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable are trusted by Immuta; installed transitive dependencies are not. This effectively means that any code path that includes a class from a transitive dependency but does not include a class from a trusted third-party library can still be blocked by the Immuta security manager. For example, if a user installs a trusted third-party library that has a transitive dependency on a file-util library, the user will not be able to use the file-util library directly to read a sensitive file that is normally protected by the Immuta security manager.

In many cases it is not a problem if dependent libraries aren't trusted, because code paths where the trusted library calls down into dependent libraries will still be trusted. However, if the dependent library itself needs to be trusted, there is a workaround:
Add the transitive dependency's jar path to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable. When Databricks installs transitive dependencies, it writes their source jar locations to the driver log4j logs, so check the cluster driver logs for those messages. For example, if slf4j is the transitive dependency, you would add the path dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar to the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS environment variable and restart your cluster.
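Continuing the example, the cluster environment variable would then list both the trusted library and its transitive dependency. This is an illustrative sketch only: the Maven coordinate is a hypothetical placeholder, and it assumes the variable takes a comma-separated list of URIs; consult the installation guide for the exact format.

```
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/org.example:trusted-lib:2.1.0,dbfs:/FileStore/jars/maven/org/slf4j/slf4j-api-1.7.25.jar
```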
Troubleshooting
In case of failure, check the driver logs for details. Some possible causes of failure include:
- One of the Immuta-configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library. For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version.
- Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.
Configuration
For details about configuring trusted libraries, navigate to the installation guide.
Notebook-Scoped Libraries on Machine Learning Clusters
Users on Databricks runtimes 8+ can manage notebook-scoped libraries with %pip commands.
However, this functionality differs from Immuta's trusted libraries feature, and Python libraries are still not supported as trusted libraries. The Immuta security manager will deny code in libraries installed with %pip access to sensitive resources.
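For instance, on a supported runtime a notebook-scoped install is a single cell; numpy and the pinned version below are arbitrary examples:

```python
# Databricks notebook cell (DBR 8+): install a notebook-scoped library.
# numpy is an arbitrary example package; the pinned version is illustrative.
%pip install numpy==1.24.4
```

The library is then importable in later cells of that notebook session only; because %pip libraries are not trusted, any of their code paths that touch protected resources are still blocked by the security manager.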
Configuration
No additional configuration is needed to enable this feature. Users only need to be running on clusters with DBR 8+.