Databricks Spark Application Configuration
Audience: System Administrators
Content Summary: This page outlines configuration options for Immuta-enabled Databricks clusters. Databricks Administrators should place the desired configuration in the immuta_conf.xml file.
Environment Variable Overrides
Properties in the config file can be overridden during installation using environment variables. The variable names are the config names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of the cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
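The same convention applies to the other properties described on this page. For instance (illustrative value only, not a recommendation), immuta.spark.acl.privileged.timeout.seconds could be overridden with:
IMMUTA_SPARK_ACL_PRIVILEGED_TIMEOUT_SECONDS=7200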
Spark Application Configuration
- immuta.spark.acl.enabled
  - Default: true
  - Description: Immuta Access Control List (ACL). Controls whether Databricks users are blocked from accessing non-Immuta tables. Ignored if Databricks Table ACLs are enabled (i.e., spark.databricks.acl.dfAclsEnabled=true).
- immuta.spark.acl.whitelist
  - Description: Comma-separated list of Databricks usernames who may access raw tables when the Immuta ACL is in use. (See the example configuration after this list.)
- immuta.spark.acl.privileged.timeout.seconds
  - Default: 3600
  - Description: The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is whitelisted in immuta.spark.acl.whitelist.
- immuta.spark.acl.assume.not.privileged
  - Default: false
  - Description: Session property that overrides privileged user status when the Immuta ACL is in use. This should only be used in R scripts associated with spark-submit jobs.
- immuta.spark.resolve.raw.tables.enabled
  - Default: true
  - Description: Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Note that this property is not set in immuta_conf.xml. Administrators or whitelisted users can set immuta.spark.session.resolve.raw.tables.enabled to false to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies are applied to a table even for an administrative user if that admin is also an Immuta user; however, if they run set immuta.spark.session.resolve.raw.tables.enabled=false, they will see only raw data (not Immuta data policy-enforced data). (See the example after this list.)
- immuta.spark.session.resolve.raw.tables.enabled
  - Default: true
  - Description: Same as above, but a session property that allows users to toggle this functionality. Ignored if immuta.spark.resolve.raw.tables.enabled=false.
- immuta.spark.databricks.local.scratch.dir.enabled
  - Default: true
  - Description: Creates a world-readable/writable scratch directory on local disk to facilitate the use of dbutils and 3rd party libraries that may write to local disk. Its location is non-configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR. Note: Sensitive data should not be stored at this location.
- immuta.spark.databricks.py4j.strict.enabled
  - Default: true
  - Description: Disable to allow the use of the dbutils API in Python. Note: This setting should only be disabled for customers who employ a homogeneous access pattern (i.e., all users have the same level of data access).
- immuta.spark.databricks.scratch.paths
  - Description: Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db for the default location). To create a scratch path to a location or a database stored at that location, configure
    <property>
      <name>immuta.spark.databricks.scratch.paths</name>
      <value>s3://path/to/the/dir</value>
    </property>
    To create a scratch path to a database created using the default location, configure
    <property>
      <name>immuta.spark.databricks.scratch.paths</name>
      <value>s3://path/to/the/dir, dbfs:/user/hive/warehouse/any_db_name.db</value>
    </property>
- immuta.spark.databricks.scratch.paths.create.db.enabled
  - Default: false
  - Description: Enables non-privileged users to create or drop scratch databases.
- immuta.spark.databricks.filesystem.blacklist
  - Default: hdfs
  - Description: A list of filesystem protocols that this instance of Immuta will not support for workspaces. This is useful in cases where a filesystem is available to a cluster but should not be used on that cluster.
- immuta.spark.acl.workspace.enabled
  - Default: true
  - Description: Enables enforcement of workspace operations in Databricks.
- immuta.spark.require.equalization
  - Default: false
  - Description: Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages.
- immuta.user.context.class
  - Default: com.immuta.spark.OSUserContext
  - Description: The class name of the UserContext that will be used to determine the current user in immuta-spark-hive. The default implementation gets the OS user running the JVM for the Spark application.
- immuta.spark.databricks.jar.uri
  - Default: file:///databricks/jars/immuta-spark-hive.jar
  - Description: The location of immuta-spark-hive.jar on the filesystem for Databricks. This should not need to change unless a customer needs a custom initialization script that places immuta-spark-hive in a non-standard location.
- immuta.spark.databricks.submit.tag.job
  - Default: true
  - Description: Denotes whether the Spark job that "tags" a Databricks cluster as being associated with Immuta will be run.
- immuta.spark.databricks.dbfs.mount.enabled
  - Default: false
  - Description: Exposes the DBFS FUSE mount located at /dbfs. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.
- immuta.user.mapping.iamid
  - Default: bim
  - Description: Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to bim but should be updated to reflect an actual production IAM.
- immuta.ephemeral.host.override
  - Default: true
  - Description: Set this to false if ephemeral overrides should not be enabled for Spark. When true, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.
- immuta.ephemeral.host.override.httpPath
  - Description: This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
- immuta.ephemeral.table.path.check.enabled
  - Default: true
  - Description: When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table's underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.
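As a sketch of how the properties above are placed in immuta_conf.xml (the whitelisted usernames and IAM ID below are placeholder values, not defaults or recommendations):
<property>
  <name>immuta.spark.acl.whitelist</name>
  <value>admin1@mycompany.com,admin2@mycompany.com</value>
</property>
<property>
  <name>immuta.user.mapping.iamid</name>
  <value>okta</value>
</property>
Session properties such as immuta.spark.session.resolve.raw.tables.enabled are not placed in immuta_conf.xml; an administrator or whitelisted user can instead run the set command at runtime, for example from a SQL cell:
%sql set immuta.spark.session.resolve.raw.tables.enabled=false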
Accessing DBFS in Databricks
To allow general access to remote storage locations (e.g., S3) for non-sensitive data, opt to enable DBFS FUSE Mount or DBUtils with scratch paths.
1 - DBFS FUSE Mount
DBFS FUSE Mount Limitation
This feature cannot be used in environments with E2 Private Link enabled.
To enable the DBFS FUSE mount, set this configuration: immuta.spark.databricks.dbfs.mount.enabled=true.
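Following the environment variable override convention described above, the same setting can also be supplied in the Environment Variables section of the cluster configuration:
IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED=true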
This feature (provided by Databricks) mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other's files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.
For example,
%sh echo "I'm creating a new file in DBFS" > /dbfs/my/newfile.txt
In Python,
%python
with open("/dbfs/my/newfile.txt", "w") as f:
    f.write("I'm creating a new file in DBFS")
Note: This solution also works in R and Scala.
Mounting a Bucket
- Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount (see the sketch after this list).
- Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.
- Mounting must be performed from a non-Immuta cluster.
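As a minimal sketch (the bucket name and mount point are placeholders, and any required credentials or instance-profile setup is omitted), a bucket could be mounted with dbutils from a non-Immuta cluster and then accessed through the FUSE mount:
%python
# Run once from a non-Immuta cluster; the mount persists for all clusters in the workspace.
dbutils.fs.mount("s3a://my-extra-bucket", "/mnt/my-extra-bucket")
Once mounted, files in the bucket are reachable under /dbfs/mnt/my-extra-bucket on clusters where the FUSE mount is enabled.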
2 - Scala DBUtils (and %fs magic) with Scratch Paths
To support %fs magic and Scala DBUtils with scratch paths, configure
<property>
<name>immuta.spark.databricks.scratch.paths</name>
<value>s3://my-bucket/my/scratch/path</value>
</property>
Scratch paths will work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example,
%fs put -f s3://my-bucket/my/scratch/path/mynewfile.txt "I'm creating a new file in S3"
%scala dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")
DBUtils in Python
To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.
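With that setting in place, dbutils can be called from Python against configured scratch paths, mirroring the Scala example above (the path below is the same placeholder scratch path used earlier):
%python
# Assumes s3://my-bucket/my/scratch/path has been configured as a scratch path.
dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")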
Example Workflow
This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.
%python
import os
import shutil
s3ScratchFile = "s3://some-bucket/path/to/scratch/file"
localScratchDir = os.environ['IMMUTA_LOCAL_SCRATCH_DIR']
localScratchFile = localScratchDir + "/myfile.txt"
localScratchFileCopy = localScratchDir + "/myfile_copy.txt"
- Get the file from remote storage:
  dbutils.fs.cp(s3ScratchFile, "file://" + localScratchFile)
- Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:
  shutil.copy(localScratchFile, localScratchFileCopy)
  with open(localScratchFileCopy, "a") as f:
      f.write("Some appended file content")
- Write the new file back to remote storage:
  dbutils.fs.cp("file://" + localScratchFileCopy, s3ScratchFile)