
Databricks Installation Guide

Audience: System Administrators

Content Summary: This guide provides instructions to enable native access to Databricks with Immuta protection through installation of a plugin within the target cluster.

Prerequisites

  • Databricks instance
  • Databricks instance has network level access to Immuta instance
  • Access to Immuta releases
  • (Azure only) Azure Databricks authenticates users with Azure AD. Be sure to configure your Immuta instance with an IAM that uses the same user ID as does Azure AD. Immuta's Spark security plugin will look to match this user ID between the two systems. See this Azure Active Directory page for details.

Supported Cluster Configurations

High Concurrency Clusters

Configuration Requirements

spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.databricks.repl.allowedLanguages python,sql,r

Language Support

  • Python
  • SQL
  • R

Caveats

  • Configuration drives whether this cluster resolves only Immuta data sources or if it resolves both Immuta data sources and raw tables.

  • If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.

  • If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.

  • The spark conf immuta.spark.acl.assume.not.privileged needs to be set to true when using an R spark-submit job.

    • Set in SparkR with sparkR.session(sparkConfig = list(immuta.spark.acl.assume.not.privileged="true")).

Standard Clusters

Configuration Requirements

spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.databricks.repl.allowedLanguages python,sql,scala,r

Language Support

  • Python
  • SQL
  • R
  • Scala

Caveats

  • Configuration drives whether this cluster resolves only Immuta data sources or if it resolves both Immuta data sources and raw tables.

  • If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.

  • If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the immuta.spark.acl.whitelist configuration to become ignored users.

  • The spark conf immuta.spark.acl.assume.not.privileged needs to be set to true when using an R spark-submit job.

    • Set in SparkR with sparkR.session(sparkConfig = list(immuta.spark.acl.assume.not.privileged="true")).
  • Immuta recommends that a cluster running Scala jobs should

    • ONLY run Scala jobs via spark.databricks.repl.allowedLanguages = scala.

    • Require equalization in Immuta via the environment variable IMMUTA_SPARK_REQUIRE_EQUALIZATION=true. For more information about this recommendation, see Scala Cluster Security Details.

Installation Overview

The Immuta Databricks integration works by injecting an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin will make an "immuta" database available for querying and will intercept all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta instance and applied prior to returning the results to the user.

The following is a brief overview of the steps that will need to be completed to enable the integration.

  1. Download and configure Immuta artifacts
    • The plugin, cluster init script, and configuration
  2. Stage Immuta artifacts
    • Place the aforementioned files somewhere the cluster can read from during its startup procedures
  3. Protect Immuta environment variables with Databricks secrets
  4. Create and configure the cluster
    • Configure a cluster to start with the init script and load Immuta into its SparkSQL environment
  5. Query Immuta Data
    • Grant users permission to query data and set up sources within Immuta.

1. Download and configure Immuta artifacts

  1. Navigate to the Immuta releases page
  2. Scroll to the "All Archives" section and click "here"
  3. Alternatively, you can go directly to the Immuta archives and use the credentials provided on the releases page
  4. Navigate to the Databricks folder for your Immuta version. Ex: https://archives.immuta.com/hadoop/databricks/2020.2.1/
  5. Download the .jar file (Immuta plugin), the init script that loads the plugin at cluster startup, and the two supporting configuration files:
    immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar
    immuta_cluster_init_script.sh
    allowedCallingClasses.json
    obscuredCommands.yaml
    
  6. Create a new file titled immuta_conf.xml and fill it out according to the template below. Be sure to replace the specified values with your own.

    <configuration>
        <property>
            <name>immuta.system.api.key</name>
            <value>[Immuta api key]</value>
        </property>
        <property>
            <name>immuta.base.url</name>
            <value>[URL for Immuta instance]</value>
        </property>
        <property>
            <name>immuta.user.context.class</name>
            <value>com.immuta.spark.databricks.DatabricksUserContext</value>
        </property>
        <property>
            <name>immuta.user.mapping.iamid</name>
            <value>bim</value>
        </property>
        <property>
            <name>immuta.spark.resolve.raw.tables.enabled</name>
            <value>true</value>
        </property>
        <property>
            <name>immuta.spark.acl.enabled</name>
            <value>true</value>
        </property>
        <property>
            <name>immuta.spark.acl.workspace.enabled</name>
            <value>true</value>
        </property>
        <property>
            <name>immuta.spark.acl.whitelist</name>
            <value></value>
        </property>
        <property>
            <name>immuta.spark.acl.privileged.timeout.seconds</name>
            <value>3600</value>
        </property>
        <property>
            <name>immuta.spark.require.equalization</name>
            <value>false</value>
        </property>
    </configuration>
    
    Danger

    Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.

    • immuta.system.api.key: Obtain this value from the Immuta Configuration UI under "HDFS" > "System API Key". You will need to be a user with the APPLICATION_ADMIN role to complete this action.
    • immuta.base.url: The full URL for the target Immuta instance Ex: https://immuta.mycompany.com
    • immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta app settings page within the "Identity Management" section.
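
The template above can also be generated rather than written by hand. The following is a minimal sketch, assuming Python 3 is available on the administrator's workstation; the function name render_immuta_conf and the placeholder API key are illustrative and not part of Immuta's tooling. Only a few properties are shown; the remaining ones from the template follow the same pattern.

```python
import xml.etree.ElementTree as ET


def render_immuta_conf(properties):
    """Render a dict of property names and values as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; substitute your own API key and Immuta URL.
conf_xml = render_immuta_conf({
    "immuta.system.api.key": "REPLACE_WITH_API_KEY",
    "immuta.base.url": "https://immuta.mycompany.com",
    "immuta.user.context.class": "com.immuta.spark.databricks.DatabricksUserContext",
    "immuta.user.mapping.iamid": "bim",
    "immuta.spark.resolve.raw.tables.enabled": "true",
})
print(conf_xml)
```

The resulting string can be saved as immuta_conf.xml and staged with the other artifacts.
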
Environment Variable Overrides

Properties in the config file can be overridden during installation using environment variables. The variable names are the config names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com
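
The naming rule is mechanical, so it can be expressed in a line of code. This is a small illustrative sketch; the helper name conf_to_env_var is an assumption made for this example and is not something the init script exposes.

```python
def conf_to_env_var(conf_name):
    """Convert a config property name to its environment variable override name."""
    return conf_name.upper().replace(".", "_")


print(conf_to_env_var("immuta.base.url"))
# IMMUTA_BASE_URL
print(conf_to_env_var("immuta.spark.require.equalization"))
# IMMUTA_SPARK_REQUIRE_EQUALIZATION
```
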

2. Stage Immuta artifacts

When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. In order to do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:

  • Host files in AWS/S3 and provide access by the cluster
  • Host files in Azure ADL Gen 1 or Gen 2 and provide access by the cluster
  • Host files on an HTTPS server accessible by the cluster
  • Host files in DBFS (Not recommended for production)

These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. In order for the init script to find these files, a URI will have to be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.

AWS/S3

URI Structure: s3://[bucket]/[path]

  1. Create an instance profile for clusters by following Databricks documentation.
  2. Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.

Authenticating with Access Keys or Session Tokens (Optional)

If you wish to authenticate using access keys, add the following items to the cluster's environment variables:

IMMUTA_INIT_AWS_SECRET_ACCESS_KEY=<aws secret key>
IMMUTA_INIT_AWS_ACCESS_KEY_ID=<aws access key id>

If you've assumed a role and received a session token, that can be added here as well:

IMMUTA_INIT_AWS_SESSION_TOKEN=<aws session token>

Azure

ADL Gen 2

URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]

Upload the configuration file, JSON file, and JAR file to an ADL gen 2 blob container.

Environment Variables:

IMMUTA_INIT_AZCOPY_CRED_TYPE=SharedKey
IMMUTA_INIT_ACCOUNT_NAME=<ADLg2 account name>
IMMUTA_INIT_ACCOUNT_KEY=<ADLg2 account key>

ADL Gen 1

URI Structure: adl://[account].azuredatalakestore.net/[path]

Upload the configuration file, JSON file, and JAR file to ADL gen 1.

Environment Variables:

If authenticating as an AD user,

IMMUTA_INIT_AZURE_AD_USER=<azure AD username>
IMMUTA_INIT_AZURE_PASSWORD=<azure AD password>

If authenticating using a service principal,

IMMUTA_INIT_AZURE_SERVICE_PRINCIPAL=<azure service principal>
IMMUTA_INIT_AZURE_PASSWORD=<azure service principal password>
IMMUTA_INIT_AZURE_TENANT=<tenant ID where principal was created>

HTTPS

URI Structure: http(s)://[host](:port)/[path]

Artifacts are available for download from Immuta using basic authentication. Archives and your basic authentication credentials can be found here.

Environment Variables (Optional)

IMMUTA_INIT_HTTPS_USER=<basic auth username>
IMMUTA_INIT_HTTPS_PASSWORD=<basic auth password>

# Note: Credentials can also be included as part of the artifact URI. For example,
IMMUTA_INIT_JAR_URI=https://user:password@download.immuta.com/path/to/file

DBFS

Warning

DBFS does not support access control. Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.

URI Structure: /dbfs/[path]

Upload the artifacts directly to DBFS using the Databricks CLI.

Since any user has access to everything in DBFS,

  1. The artifacts can be stored anywhere in DBFS.
  2. It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
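
Because each staging method has its own URI structure, a quick check before cluster startup can catch a mistyped URI. The sketch below is illustrative only: it maps a URI to the staging method it appears to belong to, based solely on the URI structures listed in this section. The function name and the example bucket, container, and account names are hypothetical, and the check says nothing about whether the init script can actually reach the location.

```python
from urllib.parse import urlparse

# URI schemes taken from the "URI Structure" lines in this section.
SCHEME_TO_METHOD = {
    "s3": "AWS/S3",
    "abfs": "ADL Gen 2",
    "abfss": "ADL Gen 2",
    "adl": "ADL Gen 1",
    "http": "HTTPS",
    "https": "HTTPS",
}


def staging_method(uri):
    """Return the staging method that matches the URI, or raise ValueError."""
    if uri.startswith("/dbfs/"):
        return "DBFS"
    scheme = urlparse(uri).scheme
    if scheme not in SCHEME_TO_METHOD:
        raise ValueError("Unsupported artifact URI: %r" % uri)
    return SCHEME_TO_METHOD[scheme]


# Hypothetical locations, for illustration only.
print(staging_method("s3://my-bucket/immuta/immuta_conf.xml"))
print(staging_method("abfss://artifacts@myaccount.dfs.core.windows.net/immuta/immuta_conf.xml"))
print(staging_method("/dbfs/immuta/my-cluster/immuta_conf.xml"))
```
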

3. Protect Immuta environment variables with Databricks secrets

It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the immuta-spark-hive .jar file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.

Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret

MY_SECRET_ENV_VAR=super_secret_stuff

they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section:

MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}

Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
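
When many variables are involved, it can help to build and check the secret references programmatically. The following is a rough sketch that relies only on the {{secrets/<scope>/<key>}} form shown above; the helper names are hypothetical, and the pattern assumes scope and key names contain no slashes or braces.

```python
import re

# Matches the {{secrets/<scope>/<key>}} form shown in this section.
SECRET_REF = re.compile(r"^\{\{secrets/([^/{}]+)/([^/{}]+)\}\}$")


def secret_ref(scope, key):
    """Build the reference string to paste into the Environment Variables section."""
    return "{{secrets/" + scope + "/" + key + "}}"


def parse_secret_ref(value):
    """Return (scope, key) if value is a secret reference, otherwise None."""
    match = SECRET_REF.match(value)
    return match.groups() if match else None


print("MY_SECRET_ENV_VAR=" + secret_ref("my_secrets", "my_secret_env_var"))
# MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}

# A plain-text value is not a reference and can be flagged for review.
print(parse_secret_ref("super_secret_stuff"))
# None
```
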

We recommend that ANY SENSITIVE environment variables listed in this guide's artifact staging and cluster configuration instructions be replaced with secrets.

4. Create and configure the cluster

Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users so that other users cannot create clusters that are not Immuta-enabled.

  1. Create a cluster in Databricks by following the Databricks documentation.
    • "Cluster Mode": Set this to "High Concurrency" unless you are using Scala
    • "Autopilot Options" & "Worker Type": The default values provided here may be more than what is necessary for non-production or smaller use-cases. To reduce resource usage you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
  2. Configure the "Instances" tab
    • "IAM Role" (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
  3. Configure the "Spark" tab
    • "Spark Config": Add the configuration relevant to your cluster type. See Supported Cluster Configurations for details.
    • "Environment Variables": Add the environment variables necessary for your configuration. Remember that these variables should be protected with Databricks secrets as mentioned above.
      # Specify the URI to the artifacts that were hosted in the previous steps
      # The URI must adhere to the supported types for each service mentioned above
      IMMUTA_INIT_JAR_URI=<Full URI to immuta-spark-hive.jar>
      IMMUTA_INIT_CONF_URI=<Full URI to Immuta configuration file>
      
      # (OPTIONAL)
      # Specify an additional configuration file to be added to the spark.sparkContext.hadoopConfiguration.
      # This file allows administrators to add sensitive configuration needed by the SparkSession that
      # should not be viewable by users.
      # Further explanation of this variable as well as examples are provided below.
      IMMUTA_INIT_ADDITIONAL_CONF_URI=<full URI to additional configuration file>
      IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=<full URI to allowedCallingClasses.json>
      IMMUTA_INIT_OBSCURED_COMMANDS_URI=<full URI to obscuredCommands.yaml>
      
  4. Configure the "Init Scripts" tab
    • "Destination": Specify the service you used to host the Immuta artifacts
    • "File Path": Specify the full URI to the immuta_cluster_init_script.sh
    • "Add" the new key/value to the configuration
  5. Configure the "Permissions" tab
    • "Who has access": Users or groups will need to have the permission "Can Attach To" in order to execute queries against Immuta configured data sources
  6. (Re)start the cluster.
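
For administrators who script cluster creation instead of using the UI, the settings from the steps above can be collected into a single specification. The sketch below only assembles and prints a JSON document; it makes no API call. The field names spark_conf, spark_env_vars, and init_scripts reflect an assumption about the Databricks Clusters API and should be verified against the Databricks API reference. The cluster name, bucket, region, and secret scope are hypothetical, and required fields such as the runtime version and node type are deliberately omitted.

```python
import json

# JVM options from the High Concurrency "Configuration Requirements" above.
JAVA_OPTIONS = " ".join([
    "-Djava.security.manager=com.immuta.security.ImmutaSecurityManager",
    "-Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json",
    "-Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
])

cluster_spec = {
    "cluster_name": "immuta-enabled",  # hypothetical name
    "spark_conf": {
        "spark.executor.extraJavaOptions": JAVA_OPTIONS,
        "spark.driver.extraJavaOptions": JAVA_OPTIONS,
        "spark.databricks.repl.allowedLanguages": "python,sql,r",
    },
    "spark_env_vars": {
        # Hypothetical secret scope "immuta"; values are resolved at cluster startup.
        "IMMUTA_INIT_JAR_URI": "{{secrets/immuta/jar_uri}}",
        "IMMUTA_INIT_CONF_URI": "{{secrets/immuta/conf_uri}}",
        "IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI": "{{secrets/immuta/allowed_classes_uri}}",
        "IMMUTA_INIT_OBSCURED_COMMANDS_URI": "{{secrets/immuta/obscured_commands_uri}}",
    },
    "init_scripts": [
        # Hypothetical bucket and region.
        {"s3": {"destination": "s3://my-bucket/immuta/immuta_cluster_init_script.sh",
                "region": "us-east-1"}},
    ],
}

payload = json.dumps(cluster_spec, indent=2)
print(payload)
```

Keeping the specification in version control makes it easier to confirm that every Immuta-enabled cluster is created with the same Spark configuration.
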

Additional Hadoop Configuration File (Optional)

As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration in order to read the data composing Immuta data sources.

As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.

To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI environment variable referenced in the Create and configure the cluster section to be the full URI to this file.

The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.

Amazon S3

IAM Role for S3 Access

S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.

<configuration>
    <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>[AWS access key ID]</value>
    </property>
    <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>[AWS secret key]</value>
    </property>
</configuration>

Azure Data Lake Gen 2

<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].dfs.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>

Azure Data Lake Gen 1

ADL Prefix

Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.

<configuration>
    <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>https://login.microsoftonline.com/[directory ID]/oauth2/token</value>
    </property>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.adl.oauth2.credential</name>
        <value>[client secret from Azure]</value>
    </property>
    <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>[client ID from Azure]</value>
    </property>
</configuration>

Azure Blob Storage

<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].blob.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>
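
The example files above use square-bracket placeholders such as [storage account key], which are easy to overlook in property names. The following is a minimal sketch of a pre-staging check, assuming that real names and values never contain text in square brackets; the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Any [bracketed text] is treated as a template placeholder left unreplaced.
PLACEHOLDER = re.compile(r"\[[^\]]+\]")


def unreplaced_placeholders(xml_text):
    """Return (property name, placeholder) pairs that still contain a template marker."""
    found = []
    for prop in ET.fromstring(xml_text).findall("property"):
        name = prop.findtext("name", default="")
        value = prop.findtext("value", default="")
        for text in (name, value):
            found.extend((name, marker) for marker in PLACEHOLDER.findall(text))
    return found


template = """
<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].dfs.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>
"""

for name, marker in unreplaced_placeholders(template):
    print("Unreplaced placeholder %s in property %s" % (marker, name))
```

An empty result means no bracketed placeholders remain; it does not confirm that the credentials themselves are valid.
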

5. Query Immuta data

When the Immuta-enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.

Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the "immuta" database.

The following SQL query can be run as an administrator within a notebook to give the user access to the "immuta" database:

%sql
GRANT SELECT,READ_METADATA ON DATABASE immuta TO `user@company.com`
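
When several users need access, the same statement can be generated for each of them. This is an illustrative sketch with a hypothetical helper name; it only builds the SQL strings. In a Databricks Python notebook, an administrator could then pass each string to spark.sql, where spark is the session the notebook provides.

```python
def grant_statements(users, database="immuta"):
    """Build one GRANT statement per user, mirroring the SQL shown above."""
    statements = []
    for user in users:
        if "`" in user:
            # Backticks delimit the principal, so reject names that contain one.
            raise ValueError("Invalid user name: %r" % user)
        statements.append(
            "GRANT SELECT, READ_METADATA ON DATABASE %s TO `%s`" % (database, user)
        )
    return statements


for statement in grant_statements(["user@company.com", "analyst@company.com"]):
    print(statement)
    # In a notebook: spark.sql(statement)
```
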

Below are some example queries that can be run to obtain data from an Immuta configured data source.

%sql
show tables in immuta;
%sql
select * from immuta.my_data_source limit 5;

Creating a Databricks Data Source

See the Databricks Data Source Creation guide for a detailed walkthrough.

Databricks to Immuta User Mapping

By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and the connected Immuta instance must match.

It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of immuta.user.mapping.iamid given in the immuta_conf.xml created and hosted in the previous steps must be updated to be the targeted IAM ID configured within the Immuta instance. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.

Debugging Immuta Installation Issues

For easier debugging of the Immuta Databricks installation, enable cluster init script logging. In the cluster page in Databricks for the target cluster, under Advanced Options -> Logging, change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.

For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.

Further Reading

Cluster Init Script

The Databricks cluster init script provided by Immuta downloads the previously mentioned Immuta artifacts (the configuration file and immuta-spark-hive .jar) onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH in order to use Immuta for policy enforcement.

The cluster init script uses environment variables in order to

  • Determine the location of the required artifacts for downloading.
  • Authenticate with the service/storage containing the artifacts.

Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.