# Spark Environment Variables

This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks administrators should place the desired configuration in the Spark environment variables.

## IMMUTA\_INIT\_ADDITIONAL\_CONF\_URI

If you add additional Hadoop configuration during the integration setup, this variable sets the path to that file.

The additional Hadoop configuration file is where sensitive settings for remote filesystems belong (for example, a secret key pair used to access S3).
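
For illustration, a minimal example assuming a hypothetical S3 location for the additional configuration file:

```bash
IMMUTA_INIT_ADDITIONAL_CONF_URI=s3://example-bucket/immuta/additional-hadoop-conf.xml
```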

## IMMUTA\_EPHEMERAL\_HOST\_OVERRIDE

**Default value**: `true`

When `true`, Immuta automatically overrides the httpPath of ephemeral data sources with the httpPath of the Databricks cluster running the user's Spark application. Set this to `false` to disable ephemeral overrides for Spark.

## IMMUTA\_EPHEMERAL\_HOST\_OVERRIDE\_HTTPPATH

Set this variable to a static httpPath to use for ephemeral overrides instead of relying on automatic detection of the Databricks httpPath.
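
For example, with a hypothetical static httpPath:

```bash
IMMUTA_EPHEMERAL_HOST_OVERRIDE_HTTPPATH=sql/protocolv1/o/0/0000-000000-example000
```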

## IMMUTA\_EPHEMERAL\_TABLE\_PATH\_CHECK\_ENABLED

**Default value**: `true`

When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. *Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.*

## IMMUTA\_INIT\_ALLOWED\_CALLING\_CLASSES\_URI

A URI that points to a valid calling class file, which is an Immuta artifact you download during the [Databricks Spark configuration](/latest/configuration/integrations/databricks/databricks-spark/how-to-guides/configuration.md) process.
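
For illustration, assuming the downloaded artifact has been uploaded to a hypothetical DBFS location:

```bash
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=dbfs:/immuta/allowedCallingClasses.json
```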

## IMMUTA\_SPARK\_ACL\_ALLOWLIST

This is a comma-separated list of Databricks users who can access any table or view in the cluster metastore without restriction.
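
For example, to allowlist two users:

```bash
IMMUTA_SPARK_ACL_ALLOWLIST=edixon@example.com,dakota@example.com
```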

## IMMUTA\_SPARK\_ACL\_PRIVILEGED\_TIMEOUT\_SECONDS

**Default value**: `3600`

The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is allowlisted in `IMMUTA_SPARK_ACL_ALLOWLIST`.

## IMMUTA\_SPARK\_AUDIT\_ALL\_QUERIES

**Default value**: `false`

Enables auditing of all queries run on a Databricks cluster, regardless of whether the queries touch Immuta-protected data.
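
For example, to audit every query on the cluster:

```bash
IMMUTA_SPARK_AUDIT_ALL_QUERIES=true
```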

## IMMUTA\_SPARK\_DATABRICKS\_ALLOW\_NON\_IMMUTA\_READS

**Default value**: `false`

Allows non-privileged users to `SELECT` from tables that are not protected by Immuta. See the [Customizing the integration guide](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/customizing-the-integration.md#protected-and-unprotected-tables) for details about this feature.

## IMMUTA\_SPARK\_DATABRICKS\_ALLOW\_NON\_IMMUTA\_WRITES

**Default value**: `false`

Allows non-privileged users to run DDL commands and data-modifying commands against tables or spaces that are not protected by Immuta. See the [Customizing the integration guide](/latest/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/customizing-the-integration.md#protected-and-unprotected-tables) for details about this feature.
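
For example, to allow both unprotected reads and unprotected writes on a cluster:

```bash
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS=true
IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES=true
```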

## IMMUTA\_SPARK\_DATABRICKS\_ALLOWED\_IMPERSONATION\_USERS

This is a comma-separated list of Databricks users who are allowed to impersonate Immuta users:

```json
"spark_env_vars.IMMUTA_SPARK_DATABRICKS_ALLOWED_IMPERSONATION_USERS": {
  "type": "fixed",
  "value": "edixon@example.com,dakota@example.com"
}
```

## IMMUTA\_SPARK\_DATABRICKS\_DBFS\_MOUNT\_ENABLED

**Default value**: `false`

Exposes the DBFS FUSE mount located at `/dbfs`. Granular permissions are not possible, so all users will have read/write access to all objects therein. *Note: Raw, unfiltered source data should never be stored in DBFS.*

## IMMUTA\_SPARK\_DATABRICKS\_DISABLED\_UDFS

Blocks one or more Immuta [user-defined functions (UDFs)](/latest/configuration/integrations/databricks/databricks-spark/how-to-guides/project-udfs.md) from being used on an Immuta cluster. This should be a Java regular expression that matches the set of UDFs to block by name (excluding the `immuta` database). For example, to block all project UDFs, set this to `^.*_projects?$`. For a list of functions, see the [project UDFs page](/latest/governance/author-policies-for-data-access-control/projects-and-purpose-based-access-control/writing-to-projects/reference-guides/project-udfs.md#available-functions).
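
For example, to block all project UDFs using the regular expression above:

```bash
IMMUTA_SPARK_DATABRICKS_DISABLED_UDFS=^.*_projects?$
```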

## IMMUTA\_SPARK\_DATABRICKS\_JAR\_URI

**Default value**: `file:///databricks/jars/immuta-spark-hive.jar`

The location of `immuta-spark-hive.jar` on the Databricks filesystem. This should not need to change unless a custom initialization script places `immuta-spark-hive.jar` in a non-standard location.
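
For illustration, pointing at a hypothetical non-standard location:

```bash
IMMUTA_SPARK_DATABRICKS_JAR_URI=file:///databricks/jars/custom/immuta-spark-hive.jar
```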

## IMMUTA\_SPARK\_DATABRICKS\_LOCAL\_SCRATCH\_DIR\_ENABLED

**Default value**: `true`

Creates a world-readable/writable scratch directory on local disk to facilitate the use of `dbutils` and third-party libraries that may write to local disk. Its location is not configurable and is stored in the environment variable `IMMUTA_LOCAL_SCRATCH_DIR`. *Note: Sensitive data should not be stored at this location.*
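
As a sketch, a notebook `%sh` cell could write a temporary file (hypothetical file name) to the directory exposed through `IMMUTA_LOCAL_SCRATCH_DIR`:

```bash
# Write an intermediate file to the Immuta-managed local scratch directory.
echo "intermediate output" > "$IMMUTA_LOCAL_SCRATCH_DIR/tmp_output.txt"
```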

## IMMUTA\_SPARK\_DATABRICKS\_LOG\_LEVEL

**Default value**: `INFO`

The SLF4J log level to apply to Immuta's Spark plugins.
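
For example, to increase verbosity while troubleshooting:

```bash
IMMUTA_SPARK_DATABRICKS_LOG_LEVEL=DEBUG
```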

## IMMUTA\_SPARK\_DATABRICKS\_LOG\_STDOUT\_ENABLED

**Default value**: `false`

If `true`, logging output is written to stdout (the console) in addition to the `log4j-active.txt` file (the Databricks default).

## IMMUTA\_SPARK\_DATABRICKS\_SCRATCH\_DATABASE

This configuration is a comma-separated list of additional databases that will appear as scratch databases when a `SHOW DATABASES` query is run. This configuration improves performance by circumventing the Metastore (which would otherwise be queried for every database's metadata) when determining what to display for a `SHOW DATABASES` query; it does not affect access to the scratch databases. Instead, use [`IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS`](#immuta_spark_databricks_scratch_paths) to control read and write access to the underlying database paths.

Additionally, this configuration will only display the scratch databases that are configured and will not validate that the configured databases exist in the Metastore. Therefore, it is up to the Databricks administrator to properly set this value and keep it current.
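
For example, pairing a hypothetical scratch database with its underlying path (see the next section):

```bash
IMMUTA_SPARK_DATABRICKS_SCRATCH_DATABASE=my_scratch_db
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=dbfs:/user/hive/warehouse/my_scratch_db.db
```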

## IMMUTA\_SPARK\_DATABRICKS\_SCRATCH\_PATHS

Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by including its location in this list (or `dbfs:/user/hive/warehouse/<db_name>.db` for a database in the default location).

To create a scratch path to a location or a database stored at that location, configure:

```bash
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=s3://path/to/the/dir
```

To additionally create a scratch path to a database created in the default location, configure:

```bash
IMMUTA_SPARK_DATABRICKS_SCRATCH_PATHS=s3://path/to/the/dir,dbfs:/user/hive/warehouse/any_db_name.db
```

## IMMUTA\_SPARK\_DATABRICKS\_SCRATCH\_PATHS\_CREATE\_DB\_ENABLED

**Default value**: `false`

Enables non-privileged users to create or drop scratch databases.

## IMMUTA\_SPARK\_DATABRICKS\_SINGLE\_IMPERSONATION\_USER

**Default value**: `false`

When `true`, this configuration prevents users from changing their impersonation user once it has been set for a given Spark session. Set this when a BI tool or other service allows users to submit arbitrary SQL or issue `SET` commands.
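
For example, pairing this setting with a hypothetical BI service account that is allowed to impersonate Immuta users:

```bash
IMMUTA_SPARK_DATABRICKS_ALLOWED_IMPERSONATION_USERS=bi-service@example.com
IMMUTA_SPARK_DATABRICKS_SINGLE_IMPERSONATION_USER=true
```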

## IMMUTA\_SPARK\_DATABRICKS\_SUBMIT\_TAG\_JOB

**Default value**: `true`

Denotes whether to run the Spark job that "tags" a Databricks cluster as being associated with Immuta.

## IMMUTA\_SPARK\_DATABRICKS\_TRUSTED\_LIB\_URIS

A comma-separated list of [Databricks trusted library](/latest/configuration/integrations/databricks/databricks-spark/how-to-guides/installation.md) URIs.
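
For example, with a hypothetical trusted library stored in DBFS:

```bash
IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/immuta/libs/my-library.jar
```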

## IMMUTA\_SPARK\_NON\_IMMUTA\_TABLE\_CACHE\_SECONDS

**Default value**: `3600`

The number of seconds Immuta caches whether a table has been exposed as a data source in Immuta. This setting only applies when `IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_WRITES` or `IMMUTA_SPARK_DATABRICKS_ALLOW_NON_IMMUTA_READS` is enabled.

## IMMUTA\_SPARK\_REQUIRE\_EQUALIZATION

**Default value**: `false`

Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via `spark.databricks.repl.allowedLanguages`.
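
A sketch of an equalized Scala-only cluster; note that `spark.databricks.repl.allowedLanguages` belongs in the cluster's Spark config rather than the environment variables:

```bash
# Spark environment variable:
IMMUTA_SPARK_REQUIRE_EQUALIZATION=true
# In the cluster's Spark config (shown as a comment for reference):
# spark.databricks.repl.allowedLanguages scala
```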

## IMMUTA\_SPARK\_RESOLVE\_RAW\_TABLES\_ENABLED

**Default value**: `true`

Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or allowlisted users can set `IMMUTA_SPARK_RESOLVE_RAW_TABLES_ENABLED` to `false` to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.

## IMMUTA\_SPARK\_SESSION\_RESOLVE\_RAW\_TABLES\_ENABLED

**Default value**: `true`

Same as the [IMMUTA\_SPARK\_RESOLVE\_RAW\_TABLES\_ENABLED](#immuta_spark_resolve_raw_tables_enabled) variable, but this is a session property that allows users to toggle this functionality. If users run `set immuta.spark.session.resolve.raw.tables.enabled=false`, they will see raw data only (not Immuta data policy-enforced data). *Note: This property is not set in `immuta_conf.xml`.*

## IMMUTA\_SPARK\_SHOW\_IMMUTA\_DATABASE

**Default value**: `true`

This shows the `immuta` database in the configured Databricks cluster. When set to `false`, Immuta will no longer show this database when a `SHOW DATABASES` query is performed. However, queries can still be run against tables in the `immuta` database using the Immuta-qualified table name (e.g., `immuta.my_schema_my_table`) regardless of this setting.

## IMMUTA\_SPARK\_VERSION\_VALIDATE\_ENABLED

**Default value**: `true`

Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to `true`, if versions are incompatible, that information is logged to the Databricks driver logs and the cluster is not usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), set this to `false` so that the versions are not flagged as incompatible, leaving the cluster usable.

## IMMUTA\_USER\_MAPPING\_IAMID

**Default value**: `bim`

Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (`bim`) but should be updated to reflect an actual production IAM.
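
For example, assuming a hypothetical production IAM with ID `okta-prod` configured in Immuta:

```bash
IMMUTA_USER_MAPPING_IAMID=okta-prod
```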

