Skip to content

Python & SQL & R

Audience: System Administrators

Content Summary: This page describes the Python & SQL & R cluster policy.

Additional Overhead

In relation to the Python & SQL cluster policy, this configuration trades some additional overhead for added support of the R language.

In this configuration, you are able to rely on the Databricks-native security controls. The key security control here is the enablement of process isolation. This prevents users from obtaining unintentional access to the queries of other users. In other words, masked and filtered data is consistently made accessible to users in accordance with their assigned attributes.

Like the Python & SQL configuration, Py4j security is enabled for the Python & SQL & R configuration. However, because R has been added Immuta enables the SecurityManager, in addition to Py4j security, to provide more security guarantees. For example, by default all actions in R execute as the root user; among other things, this permits access to the entire filesystem (including sensitive configuration data), and, without iptable restrictions, a user may freely access the cluster’s cloud storage credentials. To address these security issues, Immuta’s initialization script wraps the R and Rscript binaries to launch each command as a temporary, non-privileged user with limited filesystem and network access and installs the Immuta SecurityManager, which prevents users from bypassing policies and protects against the above vulnerabilities from within the JVM.

Consequently, the cost of introducing R is that the SecurityManager incurs a small increase in performance overhead; however, average latency will vary depending on whether the cluster is homogeneous or heterogeneous. (In homogeneous clusters, all users are at the same level of groups/authorizations; this is enforced externally, rather than directly by Immuta.)

Many Python ML classes (such as LogisticRegression, StringIndexer, and DecisionTreeClassifier) and dbutils.fs are unfortunately not supported with Py4J security enabled. Users will also be unable to use the Databricks Connect client library.

When users install third-party Java/Scala libraries, they will be denied access to sensitive resources by default. However, cluster administrators can specify which of the installed Databricks libraries should be trusted by Immuta.

Both Standard and High Concurrency cluster modes are supported.

For full details on Databricks’ best practices in configuring clusters, please read their governance documentation.