Spark Environment Variables
This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks administrators should place the desired configuration in the Spark environment variables.
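As a minimal sketch of what such a configuration could look like, the Python snippet below renders a few of the variables described on this page as KEY=value lines, one per variable. The helper name render_spark_env and the chosen values are illustrative assumptions and are not part of Immuta or Databricks. Confirm the exact format your workspace expects before pasting the output into a cluster configuration.

```python
from typing import Mapping


def render_spark_env(settings: Mapping[str, object]) -> str:
    """Render settings as KEY=value lines, one per variable (hypothetical helper)."""
    lines = []
    for name, value in sorted(settings.items()):
        # Booleans are written in lowercase to match the true/false values on this page.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name}={value}")
    return "\n".join(lines)


# Example values only: adjust them for your own cluster.
example = {
    "IMMUTA_SPARK_AUDIT_ALL_QUERIES": False,
    "IMMUTA_SPARK_ACL_PRIVILEGED_TIMEOUT_SECONDS": 3600,
    "IMMUTA_SPARK_DATABRICKS_LOG_LEVEL": "DEBUG",
}

print(render_spark_env(example))
```

Sorting the names keeps the rendered block stable between runs, which makes configuration changes easier to review.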
IMMUTA_INIT_ADDITIONAL_CONF_URI
If you add additional Hadoop configuration during the integration setup, this variable sets the path to that file.
The additional Hadoop configuration is where sensitive configuration goes for remote filesystems (if you are using a secret key pair to access S3, for example).
IMMUTA_EPHEMERAL_HOST_OVERRIDE
Default value: true
Set this to false if ephemeral overrides should not be enabled for Spark. When true, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.
IMMUTA_EPHEMERAL_HOST_OVERRIDE_HTTPPATH
This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
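The interaction between the two ephemeral override variables can be sketched as follows. This is an illustration of the precedence described above, not Immuta's implementation: the function name, its parameters, and the fallback to the data source's own httpPath when nothing is detected are assumptions.

```python
from typing import Mapping, Optional


def resolve_ephemeral_http_path(
    env: Mapping[str, str],
    source_http_path: str,
    detected_cluster_http_path: Optional[str],
) -> str:
    """Pick the httpPath for an ephemeral data source (illustrative logic only)."""
    # IMMUTA_EPHEMERAL_HOST_OVERRIDE defaults to true when unset.
    enabled = env.get("IMMUTA_EPHEMERAL_HOST_OVERRIDE", "true").strip().lower() == "true"
    if not enabled:
        return source_http_path
    # A static path, when configured, replaces automatic detection.
    static_path = env.get("IMMUTA_EPHEMERAL_HOST_OVERRIDE_HTTPPATH", "").strip()
    if static_path:
        return static_path
    # Assumption: keep the source's path if no cluster path was detected.
    return detected_cluster_http_path or source_http_path


# Placeholder paths, used only to show the three outcomes.
print(resolve_ephemeral_http_path({}, "source/path", "cluster/path"))
print(resolve_ephemeral_http_path(
    {"IMMUTA_EPHEMERAL_HOST_OVERRIDE_HTTPPATH": "static/path"}, "source/path", "cluster/path"))
print(resolve_ephemeral_http_path(
    {"IMMUTA_EPHEMERAL_HOST_OVERRIDE": "false"}, "source/path", "cluster/path"))
```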
IMMUTA_EPHEMERAL_TABLE_PATH_CHECK_ENABLED
Default value: true
When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI
IMMUTA_SPARK_ACL_ALLOWLIST
This is a comma-separated list of Databricks users who can access any table or view in the cluster metastore without restriction.
IMMUTA_SPARK_ACL_PRIVILEGED_TIMEOUT_SECONDS
Default value: 3600
The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is listed in IMMUTA_SPARK_ACL_ALLOWLIST.
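To make the caching behavior concrete, here is a minimal sketch of a time-limited cache for privileged status. The class, the parse_user_list helper, the is_admin callback, and the user names are all hypothetical. The sketch only illustrates the idea of reusing an answer until the timeout elapses and does not reflect how Immuta stores this state.

```python
import time
from typing import Callable, Dict, Set, Tuple


def parse_user_list(raw: str) -> Set[str]:
    """Split a comma-separated value into user names, ignoring blanks."""
    return {name.strip() for name in raw.split(",") if name.strip()}


class PrivilegedStatusCache:
    """Caches each user's privileged status for ttl_seconds (hypothetical class)."""

    def __init__(
        self,
        allowlist: Set[str],
        is_admin: Callable[[str], bool],
        ttl_seconds: float = 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._allowlist = allowlist
        self._is_admin = is_admin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def is_privileged(self, user: str) -> bool:
        now = self._clock()
        cached = self._entries.get(user)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh: reuse the cached answer
        # Privileged means admin or on the allowlist, as described above.
        status = user in self._allowlist or self._is_admin(user)
        self._entries[user] = (status, now)
        return status


# Usage with placeholder user names and a stand-in admin lookup.
cache = PrivilegedStatusCache(
    allowlist=parse_user_list("alice@example.com, bob@example.com"),
    is_admin=lambda user: user == "admin@example.com",
)
print(cache.is_privileged("alice@example.com"))
print(cache.is_privileged("carol@example.com"))
```

A longer timeout means fewer lookups, but a user who loses admin rights stays privileged on the cluster until the cached entry expires.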
IMMUTA_SPARK_AUDIT_ALL_QUERIES
Default value: false
Enables auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS
Default value: false
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES
Default value: false
IMMUTA_SPARK_DATABRICKS_ALLOWED_IMPERSONATION_USERS
This is a comma-separated list of Databricks users who are allowed to impersonate Immuta users.
IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED
Default value: false
Exposes the DBFS FUSE mount located at /dbfs. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.
IMMUTA_SPARK_DATABRICKS_DISABLED_UDFS
IMMUTA_SPARK_DATABRICKS_JAR_URI
Default value: file:///databricks/jars/immuta-spark-hive.jar
The location of immuta-spark-hive.jar on the filesystem for Databricks. This should not need to change unless a custom initialization script places immuta-spark-hive in a non-standard location.
IMMUTA_SPARK_DATABRICKS_LOCAL_SCRATCH_DIR_ENABLED
Default value: true
Creates a world-readable and world-writable scratch directory on local disk to facilitate the use of dbutils and third-party libraries that may write to local disk. Its location is non-configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR. Note: Sensitive data should not be stored at this location.
IMMUTA_SPARK_DATABRICKS_LOG_LEVEL
Default value: INFO
The SLF4J log level to apply to Immuta's Spark plugins.
IMMUTA_SPARK_DATABRICKS_LOG_STDOUT_ENABLED
Default value: false
If true, writes logging output to stdout (the console) as well as to the log4j-active.txt file (the default in Databricks).
IMMUTA_SPARK_DATABRICKS_SCRATCH_DATABASE
Additionally, this configuration will only display the scratch databases that are configured and will not validate that the configured databases exist in the Metastore. Therefore, it is up to the Databricks administrator to properly set this value and keep it current.
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS
Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db for the default location).
To create a scratch path to a location or a database stored at that location, configure
To create a scratch path to a database created using the default location,
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS_CREATE_DB_ENABLED
Default value: false
Enables non-privileged users to create or drop scratch databases.
IMMUTA_SPARK_DATABRICKS_SINGLE_IMPERSONATION_USER
Default value: false
When true, this configuration prevents users from changing their impersonation user once it has been set for a given Spark session. This configuration should be set when the BI tool or other service allows users to submit arbitrary SQL or issue SET commands.
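The effect of this setting can be pictured with a small sketch. The ImpersonationSession class below is hypothetical and stands in for per-session state. Treating a repeated request for the same user as allowed is an assumption, since the text above only rules out changing the user.

```python
from typing import Optional


class ImpersonationSession:
    """Tracks the impersonation user for one Spark session (hypothetical class)."""

    def __init__(self, single_impersonation_user: bool = False) -> None:
        self._single = single_impersonation_user
        self.impersonation_user: Optional[str] = None

    def set_impersonation_user(self, user: str) -> None:
        already_set = self.impersonation_user is not None
        # With the setting enabled, the first value chosen is locked in for the session.
        if self._single and already_set and user != self.impersonation_user:
            raise PermissionError("impersonation user cannot be changed in this session")
        self.impersonation_user = user


# Placeholder user names, used only to show the locked-in behavior.
session = ImpersonationSession(single_impersonation_user=True)
session.set_impersonation_user("analyst@example.com")
try:
    session.set_impersonation_user("other@example.com")
except PermissionError as err:
    print(f"rejected: {err}")
print(session.impersonation_user)
```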
IMMUTA_SPARK_DATABRICKS_SUBMIT_TAG_JOB
Default value: true
Denotes whether to run the Spark job that "tags" a Databricks cluster as being associated with Immuta.
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS
IMMUTA_SPARK_NON_IMMUTA_TABLE_CACHE_SECONDS
Default value: 3600
The number of seconds Immuta caches whether a table has been exposed as a data source in Immuta. This setting only applies when IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES or IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS is enabled.
IMMUTA_SPARK_REQUIRE_EQUALIZATION
Default value: false
Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages.
IMMUTA_SPARK_RESOLVE_RAW_TABLES_ENABLED
Default value: true
Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or allowlisted users can set IMMUTA_SPARK_RESOLVE_RAW_TABLES_ENABLED to false to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.
IMMUTA_SPARK_SESSION_RESOLVE_RAW_TABLES_ENABLED
Default value: true
IMMUTA_SPARK_SHOW_IMMUTA_DATABASE
Default value: true
This shows the immuta database in the configured Databricks cluster. When set to false, Immuta will no longer show this database when a SHOW DATABASES query is performed. However, queries can still be performed against tables in the immuta database using the Immuta-qualified table name (e.g., immuta.my_schema_my_table) regardless of whether or not this feature is enabled.
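The listing behavior can be illustrated with a short sketch. The visible_databases function and the database names other than immuta are hypothetical. The sketch only models which names appear in a listing and says nothing about query access, which per the text above is unaffected.

```python
from typing import Iterable, List


def visible_databases(all_databases: Iterable[str], show_immuta_database: bool = True) -> List[str]:
    """Return the database names a listing would show (illustrative only)."""
    if show_immuta_database:
        return list(all_databases)
    # Hide only the immuta database; every other name is listed unchanged.
    return [name for name in all_databases if name.lower() != "immuta"]


# Placeholder database names.
databases = ["default", "immuta", "sales_scratch"]
print(visible_databases(databases, show_immuta_database=True))
print(visible_databases(databases, show_immuta_database=False))
```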
IMMUTA_SPARK_VERSION_VALIDATE_ENABLED
Default value: true
Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to true, if versions are incompatible, that information will be logged to the Databricks driver logs and the cluster will not be usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), this check can be set to false so that the versions are not logged as incompatible and the cluster remains usable.
IMMUTA_USER_MAPPING_IAMID
Default value: bim
Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (bim) but should be updated to reflect an actual production IAM.