
Databricks Spark Application Configuration

Audience: System Administrators

Content Summary: This page outlines configuration details for Immuta-enabled Databricks clusters. Databricks Administrators should place the desired configuration in the immuta_conf.xml file.

Configuration Details

Environment Variable Overrides

Properties in the config file can be overridden during installation using environment variables. Each variable name is the property name in upper case, with every period (.) replaced by an underscore (_). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
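
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the stated convention, not part of Immuta; the helper names to_env_var and env_overrides are hypothetical, and the second property value is an example.

```python
def to_env_var(property_name: str) -> str:
    """Apply the rule stated above: upper-case the name and replace '.' with '_'."""
    return property_name.upper().replace(".", "_")


def env_overrides(properties: dict) -> dict:
    """Map {property name: value} to {environment variable name: value}."""
    return {to_env_var(name): value for name, value in properties.items()}


if __name__ == "__main__":
    overrides = env_overrides(
        {
            "immuta.base.url": "https://immuta.mycompany.com",
            "immuta.spark.acl.enabled": "true",  # example value only
        }
    )
    # Print lines in the NAME=value form used in the cluster's
    # Environment Variables section.
    for name, value in overrides.items():
        print(f"{name}={value}")
```

The printed lines can be pasted into the Environment Variables section of the cluster configuration.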

  • immuta.ephemeral.host.override

    • Default: true

    • Description: Set this to false if ephemeral overrides should not be enabled for Spark. When true, this will automatically override ephemeral data source httpPaths with the httpPath of the Databricks cluster running the user's Spark application.

  • immuta.ephemeral.host.override.httpPath

    • Description: This configuration item can be used if automatic detection of the Databricks httpPath should be disabled in favor of a static path to use for ephemeral overrides.
  • immuta.ephemeral.table.path.check.enabled

    • Default: true

    • Description: When querying Immuta data sources in Spark, the metadata from the Metastore is compared to the metadata for the target source in Immuta to validate that the source being queried exists and is queryable on the current cluster. This check typically validates that the target (database, table) pair exists in the Metastore and that the table’s underlying location matches what is in Immuta. This configuration can be used to disable location checking if that location is dynamic or changes over time. Note: This may lead to undefined behavior if the same table names exist in multiple workspaces but do not correspond to the same underlying data.

  • immuta.spark.acl.enabled

    • Default: true

    • Description: Immuta Access Control List (ACL). Controls whether Databricks users are blocked from accessing non-Immuta tables. Ignored if Databricks Table ACLs are enabled (i.e., spark.databricks.acl.dfAclsEnabled=true).

  • immuta.spark.acl.whitelist

    • Description: Comma-separated list of Databricks usernames who may access raw tables when the Immuta ACL is in use.
  • immuta.spark.acl.privileged.timeout.seconds

    • Default: 3600

    • Description: The number of seconds to cache privileged user status for the Immuta ACL. A privileged Databricks user is an admin or is whitelisted in immuta.spark.acl.whitelist.

  • immuta.spark.acl.assume.not.privileged

    • Default: false

    • Description: Session property that overrides privileged user status when the Immuta ACL is in use. This should only be used in R scripts associated with spark-submit jobs.

  • immuta.spark.audit.all.queries

    • Default: false

    • Description: Enables auditing all queries run on a Databricks cluster, regardless of whether users touch Immuta-protected data or not.

  • immuta.spark.databricks.allow.non.immuta.reads

    • Default: false

    • Description: Allows non-privileged users to SELECT from tables that are not protected by Immuta. See Limited Enforcement in Databricks for details about this feature.

  • immuta.spark.databricks.allow.non.immuta.writes

    • Default: false

    • Description: Allows non-privileged users to run DDL commands and data-modifying commands against tables or spaces that are not protected by Immuta. See Limited Enforcement in Databricks for details about this feature.

  • immuta.spark.databricks.dbfs.mount.enabled

    • Default: false

    • Description: Exposes the DBFS FUSE mount located at /dbfs. Granular permissions are not possible, so all users will have read/write access to all objects therein. Note: Raw, unfiltered source data should never be stored in DBFS.

  • immuta.spark.databricks.filesystem.blacklist

    • Default: hdfs

    • Description: A list of filesystem protocols that this instance of Immuta will not support for workspaces. This is useful in cases where a filesystem is available to a cluster but should not be used on that cluster.

  • immuta.spark.databricks.jar.uri

    • Default: file:///databricks/jars/immuta-spark-hive.jar

    • Description: The location of immuta-spark-hive.jar on the filesystem for Databricks. This should not need to change unless a custom initialization script that places immuta-spark-hive in a non-standard location is necessary.

  • immuta.spark.databricks.local.scratch.dir.enabled

    • Default: true

    • Description: Creates a world-readable/writable scratch directory on local disk to facilitate the use of dbutils and third-party libraries that may write to local disk. Its location is not configurable and is stored in the environment variable IMMUTA_LOCAL_SCRATCH_DIR. Note: Sensitive data should not be stored at this location.

  • immuta.spark.databricks.log.level

    • Default: INFO

    • Description: The SLF4J log level to apply to Immuta's Spark plugins.

  • immuta.spark.databricks.log.stdout.enabled

    • Default: false

    • Description: If true, writes logging output to stdout/the console as well as the log4j-active.txt file (default in Databricks).

  • immuta.spark.databricks.py4j.strict.enabled

    • Default: true

    • Description: Disable to allow the use of the dbutils API in Python. Note: This setting should only be disabled for customers who employ a homogeneous access pattern (i.e., all users have the same level of data access).

  • immuta.spark.databricks.scratch.paths

    • Description: Comma-separated list of remote paths that Databricks users are allowed to directly read/write. These paths amount to unprotected "scratch spaces." You can create a scratch database by configuring its specified location (or configure dbfs:/user/hive/warehouse/<db_name>.db for the default location).

      To create a scratch path to a location or a database stored at that location, configure:

      <property>
          <name>immuta.spark.databricks.scratch.paths</name>
          <value>s3://path/to/the/dir</value>
      </property>
      

      To also create a scratch path to a database created using the default location, configure:

      <property>
          <name>immuta.spark.databricks.scratch.paths</name>
          <value>s3://path/to/the/dir, dbfs:/user/hive/warehouse/any_db_name.db</value>
      </property>
      
  • immuta.spark.databricks.scratch.paths.create.db.enabled

    • Default: false

    • Description: Enables non-privileged users to create or drop scratch databases.

  • immuta.spark.databricks.spark.3.preview

    • Default: false

    • Description: Enables Databricks Runtime 7 (Spark 3).

  • immuta.spark.databricks.submit.tag.job

    • Default: true

    • Description: Denotes whether Immuta runs the Spark job that "tags" a Databricks cluster as being associated with Immuta.

  • immuta.spark.databricks.trusted.lib.uris

  • immuta.spark.non.immuta.table.cache.seconds

    • Default: 3600

    • Description: The number of seconds Immuta caches whether a table has been exposed as a source in Immuta. This setting only applies when immuta.spark.databricks.allow.non.immuta.writes or immuta.spark.databricks.allow.non.immuta.reads is enabled.

  • immuta.spark.require.equalization

    • Default: false

    • Description: Requires that users act through a single, equalized project. A cluster should be equalized if users need to run Scala jobs on it, and it should be limited to Scala jobs only via spark.databricks.repl.allowedLanguages.

  • immuta.spark.resolve.raw.tables.enabled

    • Default: true

    • Description: Enables use of the underlying database and table name in queries against a table-backed Immuta data source. Administrators or whitelisted users can set immuta.spark.session.resolve.raw.tables.enabled to false to bypass resolving raw databases or tables as Immuta data sources. This is useful if an admin wants to read raw data but is also an Immuta user. By default, data policies will be applied to a table even for an administrative user if that admin is also an Immuta user.

  • immuta.spark.session.resolve.raw.tables.enabled

    • Default: true

    • Description: Same as immuta.spark.resolve.raw.tables.enabled, but a session property that allows users to toggle this functionality. If users run set immuta.spark.session.resolve.raw.tables.enabled=false, they will see raw data only (not data with Immuta data policies enforced). Note: This property is not set in immuta_conf.xml.

  • immuta.spark.version.validate.enabled

    • Default: true

    • Description: Immuta checks the versions of its artifacts to verify that they are compatible with each other. When set to true, if versions are incompatible, that information will be logged to the Databricks driver logs and the cluster will not be usable. If a configuration file or the jar artifacts have been patched with a new version (and the artifacts are known to be compatible), this check can be set to false so that the versions are not logged as incompatible and the cluster remains usable.

  • immuta.user.context.class

    • Default: com.immuta.spark.OSUserContext

    • Description: The class name of the UserContext that will be used to determine the current user in immuta-spark-hive. The default implementation gets the OS user running the JVM for the Spark application.

  • immuta.user.mapping.iamid

    • Default: bim

    • Description: Denotes which IAM in Immuta should be used when mapping the current Spark user's username to a userid in Immuta. This defaults to Immuta's internal IAM (bim) but should be updated to reflect an actual production IAM.
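
When several of the properties above need to be set together, the property blocks for immuta_conf.xml can be generated rather than typed by hand. The following Python script is a minimal sketch, assuming the file accepts the same property/name/value layout shown in the scratch path examples. The helper names render_property and render_properties are hypothetical, and the usernames and the IAM ID "okta" are placeholder values, not defaults.

```python
from xml.sax.saxutils import escape


def render_property(name: str, value: str) -> str:
    """Render one <property> block in the layout used on this page."""
    return (
        "<property>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <value>{escape(value)}</value>\n"
        "</property>"
    )


def render_properties(settings: dict) -> str:
    """Render one block per setting, in insertion order."""
    return "\n".join(render_property(n, v) for n, v in settings.items())


if __name__ == "__main__":
    # Placeholder values for illustration only.
    whitelisted_users = ["alice@example.com", "bob@example.com"]
    settings = {
        # Comma-separated list, as described for immuta.spark.acl.whitelist.
        "immuta.spark.acl.whitelist": ",".join(whitelisted_users),
        "immuta.spark.acl.privileged.timeout.seconds": "3600",
        "immuta.user.mapping.iamid": "okta",  # hypothetical IAM ID
    }
    print(render_properties(settings))
```

The escape call keeps characters such as & and < in a value from producing malformed XML. Review the generated blocks before adding them to immuta_conf.xml.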