Manual Databricks Configuration


This guide details the manual installation method for enabling access to Databricks with Immuta policies enforced. Before proceeding, ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.

Databricks Unity Catalog: If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.

The immuta_conf.xml file is no longer required

The immuta_conf.xml file that was previously used to configure the Databricks Spark integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. However, you can still deploy an immuta_conf.xml file to set properties if you wish.

The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional immuta_conf.xml file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.

If you have an existing immuta_conf.xml file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method will ensure that any property defaults changed in upcoming Immuta releases are propagated to your environment.

1 - Download and Configure Immuta Artifacts

  1. Navigate to the Immuta GitHub repository.

  2. Scroll to the release that corresponds to your Immuta version.

  3. Download the .jar file (the Immuta plugin) as well as the other artifacts listed below, which load the plugin at cluster startup.

    allowedCallingClasses.json
    immuta-benchmark-suite.dbc
    immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar
    immuta_cluster_init_script.sh
    obscuredCommands.yaml

    The immuta-benchmark-suite.dbc is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which requires both an Immuta-enabled cluster and a non-Immuta cluster to generate test data and perform queries. Note: Use Spark 2 with Databricks Runtime prior to 7.x and Spark 3 with Databricks Runtime 7.x or later; attempting to use an incompatible jar and Databricks Runtime will fail.

  4. Specify the following properties as Spark environment variables or in the optional immuta_conf.xml file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the configuration property names in all upper case, with underscores (_) in place of periods (.). For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of the cluster configuration: IMMUTA_BASE_URL=https://immuta.mycompany.com

    • immuta.system.api.key: Obtain this value from the Immuta Configuration UI under HDFS > System API Key. You will need to be a user with the APPLICATION_ADMIN role to complete this action. Generating a key will destroy any previously generated HDFS keys, which will cause previously integrated HDFS systems to lose access to your Immuta console. The key is only shown once when generated.

    • immuta.base.url: The full URL for the target Immuta tenant (e.g., https://immuta.mycompany.com).

    • immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown on the Immuta App Settings page within the Identity Management section. See Databricks to Immuta User Mapping for more details.

Environment variables with Google Cloud Platform

Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml.
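For reference, a minimal optional immuta_conf.xml might look like the following sketch. The format mirrors the Hadoop-style configuration files shown later in this guide, and all values are placeholders:

<configuration>
    <!-- Full URL of the target Immuta tenant -->
    <property>
        <name>immuta.base.url</name>
        <value>https://immuta.mycompany.com</value>
    </property>
    <!-- System API key generated in the Immuta Configuration UI -->
    <property>
        <name>immuta.system.api.key</name>
        <value>[system API key]</value>
    </property>
    <!-- Only needed when users authenticate with an IAM other than Immuta's built-in IAM -->
    <property>
        <name>immuta.user.mapping.iamid</name>
        <value>[IAM ID from the App Settings page]</value>
    </property>
</configuration>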

2 - Stage Immuta Artifacts

When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded or created in the previous step. To do this, those artifacts must be hosted somewhere your Databricks instance can access. Any of the following methods can be used:

  • Host files in AWS/S3 and provide access by the cluster.

  • Host files in Azure ADL Gen 1 or Gen 2 and provide access by the cluster.

  • Host files on an HTTPS server accessible by the cluster.

  • Host files in DBFS (not recommended for production).

These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. In order for the init script to find these files, a URI must be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.

AWS/S3

URI Structure: s3://[bucket]/[path]

  1. Create an instance profile for clusters by following the Databricks documentation.

  2. Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.

Authenticating with Access Keys or Session Tokens (Optional)

If you wish to authenticate using access keys, add the following items to the cluster's environment variables:

IMMUTA_INIT_AWS_SECRET_ACCESS_KEY=<aws secret key>
IMMUTA_INIT_AWS_ACCESS_KEY_ID=<aws access key id>

If you've assumed a role and received a session token, that can be added here as well:

IMMUTA_INIT_AWS_SESSION_TOKEN=<aws session token>
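
Any S3 upload method works for staging the artifacts; as an illustrative sketch using the AWS CLI (the bucket name and path are placeholders), the files could be uploaded so that the instance profile role from step 1 can read them:

# Stage the Immuta artifacts in S3
aws s3 cp immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar s3://my-immuta-bucket/immuta/
aws s3 cp allowedCallingClasses.json s3://my-immuta-bucket/immuta/
aws s3 cp obscuredCommands.yaml s3://my-immuta-bucket/immuta/
aws s3 cp immuta_cluster_init_script.sh s3://my-immuta-bucket/immuta/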

Azure

ADL Gen 2

URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]

Upload the configuration file, JSON file, and JAR file to an ADL Gen 2 blob container.

Environment Variables:

If you want to authenticate using an account key, add the following to your cluster's environment variables:

IMMUTA_INIT_AZCOPY_CRED_TYPE=SharedKey
IMMUTA_INIT_ACCOUNT_NAME=<ADLg2 account name>
IMMUTA_INIT_ACCOUNT_KEY=<ADLg2 account key>

If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:

IMMUTA_INIT_AZURE_SAS_TOKEN=<SAS token>
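
Any upload method works for staging the artifacts; as a sketch using the azcopy utility (the account, container, path, and SAS token are placeholders):

# Upload an artifact to the ADL Gen 2 container with azcopy, authenticating with a SAS token
azcopy copy "immuta_cluster_init_script.sh" \
    "https://myaccount.dfs.core.windows.net/mycontainer/immuta/immuta_cluster_init_script.sh?<SAS token>"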

ADL Gen 1

URI Structure: adl://[account].azuredatalakestore.net/[path]

Upload the configuration file, JSON file, and JAR file to ADL Gen 1.

Environment Variables:

If authenticating as a Microsoft Entra ID user,

IMMUTA_INIT_AZURE_AD_USER=<Microsoft Entra ID username>
IMMUTA_INIT_AZURE_PASSWORD=<Microsoft Entra ID password>

If authenticating using a service principal,

IMMUTA_INIT_AZURE_SERVICE_PRINCIPAL=<azure service principal>
IMMUTA_INIT_AZURE_PASSWORD=<azure service principal password>
IMMUTA_INIT_AZURE_TENANT=<tenant ID where principal was created>

HTTPS

URI Structure: http(s)://[host](:port)/[path]

Artifacts are available for download from Immuta using basic authentication. Your basic authentication credentials can be obtained from your Immuta support professional.

Environment Variables (Optional)

IMMUTA_INIT_HTTPS_USER=<basic auth username>
IMMUTA_INIT_HTTPS_PASSWORD=<basic auth password>
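
Before wiring the credentials into the cluster, you can sanity-check that an artifact is reachable; a sketch using curl, with placeholder host, path, and credentials:

# Fetch one artifact over HTTPS with basic authentication
curl -u "<basic auth username>:<basic auth password>" \
    -O "https://artifacts.example.com/immuta/immuta_cluster_init_script.sh"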

DBFS

DBFS does not support access control

Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.

URI Structure: dbfs:/[path]

Upload the artifacts directly to DBFS using the Databricks CLI, as sketched below. Since any user has access to everything in DBFS:

  1. The artifacts can be stored anywhere in DBFS.

  2. It's best to have a cluster-specific place for your artifacts in DBFS if you are testing, to avoid accidentally overwriting or reusing someone else's artifacts.
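
For example, with the legacy Databricks CLI (command syntax varies by CLI version; the target path is a placeholder):

# Copy the artifacts to a cluster-specific DBFS path
databricks fs cp immuta_cluster_init_script.sh dbfs:/immuta/my-cluster/immuta_cluster_init_script.sh
databricks fs cp allowedCallingClasses.json dbfs:/immuta/my-cluster/allowedCallingClasses.json
databricks fs cp obscuredCommands.yaml dbfs:/immuta/my-cluster/obscuredCommands.yaml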

3 - Protect Immuta Environment Variables with Databricks Secrets

It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify the Immuta configuration or the immuta-spark-hive.jar file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.

Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret:

MY_SECRET_ENV_VAR=super_secret_stuff

they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section:

MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}

Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
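
As a sketch using the legacy Databricks CLI (syntax differs in newer CLI versions), the scope and secret from this example could be created like so:

# Create the secret scope, then add the secret referenced above
databricks secrets create-scope --scope my_secrets
# `put` opens an editor in which to enter the secret value
databricks secrets put --scope my_secrets --key my_secret_env_var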

Best practice: Replace sensitive variables with secrets

Immuta recommends that any sensitive environment variables listed in the various artifact deployment instructions be replaced with secrets.

4 - Create and Configure the Cluster

Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta-enabled clusters.

Create a cluster in Databricks by following the Databricks documentation, then configure it as follows:

  1. Select the Custom Access mode.

  2. Opt to adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use cases. To reduce resource usage, you can enable/disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.

  3. In the Advanced Options section, click the Instances tab.

    • IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS/S3 section.)

  4. Click the Spark tab. In the Spark Config field, add your configuration.

    • Cluster Configuration Requirements:

      spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager \
          -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json \
          -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
      spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager \
          -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json \
          -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
      spark.databricks.repl.allowedLanguages python,sql
      spark.databricks.pyspark.enableProcessIsolation true
      spark.databricks.isv.product Immuta
  5. In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be protected with Databricks secrets as mentioned above.

    # Specify the URI to the artifacts that were hosted in the previous steps.
    # The URI must adhere to the supported types for each service mentioned above.
    IMMUTA_INIT_JAR_URI=<full URI to immuta-spark-hive.jar>
    IMMUTA_INIT_CONF_URI=<full URI to Immuta configuration file>
    IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=<full URI to allowedCallingClasses.json>
    IMMUTA_INIT_OBSCURED_COMMANDS_URI=<full URI to obscuredCommands.yaml>
    
    # (OPTIONAL)
    # Specify an additional configuration file to be added to spark.sparkContext.hadoopConfiguration.
    # This file allows administrators to add sensitive configuration needed by the SparkSession that
    # should not be viewable by users.
    # Further explanation of this variable as well as examples are provided below.
    IMMUTA_INIT_ADDITIONAL_CONF_URI=<full URI to additional configuration file>
  6. Click the Init Scripts tab and set the following configurations:

    • Destination: Specify the service you used to host the Immuta artifacts.

    • File Path: Specify the full URI to the immuta_cluster_init_script.sh.

    • Add the new key/value to the configuration.

  7. Click the Permissions tab and configure the following setting:

    • Who has access: Users or groups will need the Can Attach To permission to execute queries against Immuta-configured data sources.

  8. (Re)start the cluster.
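
Putting these settings together, the Environment Variables section of a cluster whose artifacts were staged in S3 might look like the following sketch. The variable names follow the upper-case naming convention from step 4 of the download section, the secret references use the syntax described above, and the bucket, scope, and key names are placeholders:

# All values are placeholders; sensitive values are referenced from Databricks secrets
IMMUTA_SYSTEM_API_KEY={{secrets/my_secrets/immuta_system_api_key}}
IMMUTA_BASE_URL=https://immuta.mycompany.com
IMMUTA_INIT_JAR_URI=s3://my-immuta-bucket/immuta/immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=s3://my-immuta-bucket/immuta/allowedCallingClasses.json
IMMUTA_INIT_OBSCURED_COMMANDS_URI=s3://my-immuta-bucket/immuta/obscuredCommands.yaml
# Only needed when authenticating to S3 with access keys rather than an instance profile
IMMUTA_INIT_AWS_ACCESS_KEY_ID={{secrets/my_secrets/aws_access_key_id}}
IMMUTA_INIT_AWS_SECRET_ACCESS_KEY={{secrets/my_secrets/aws_secret_access_key}}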

Additional Hadoop Configuration File (Optional)

As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration in order to read the data composing Immuta data sources.

As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.

The additional configuration file looks very similar to the Immuta configuration file referenced above. To use an additional Hadoop configuration file, set the IMMUTA_INIT_ADDITIONAL_CONF_URI environment variable referenced in the Create and Configure the Cluster section to the full URI of this file. Some example configuration files for accessing different storage layers are below.

Amazon S3

IAM role for S3 access

S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.

<configuration>
    <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>[AWS access key ID]</value>
    </property>
    <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>[AWS secret key]</value>
    </property>
</configuration>

Azure Data Lake Gen 2

<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].dfs.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>

Azure Data Lake Gen 1

ADL prefix: Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.

<configuration>
    <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>https://login.microsoftonline.com/[directory ID]/oauth2/token</value>
    </property>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.adl.oauth2.credential</name>
        <value>[client secret from Azure]</value>
    </property>
    <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>[client ID from Azure]</value>
    </property>
</configuration>

Azure Blob Storage

<configuration>
    <property>
        <name>fs.azure.account.key.[storage account name].blob.core.windows.net</name>
        <value>[storage account key]</value>
    </property>
</configuration>

5 - Register Data

Register Databricks securables in Immuta.

6 - Query Immuta Data

When the Immuta-enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.

Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the immuta database.

The following SQL query can be run as an administrator within a notebook to give a user access to the immuta database:

%sql
GRANT SELECT,READ_METADATA ON DATABASE immuta TO `user@company.com`

Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the immuta database.

%sql
select * from immuta.my_data_source limit 5;

%sql
select * from my_data_source limit 5;

Creating a Databricks Data Source

See the Databricks Data Source Creation guide for a detailed walkthrough.

Databricks to Immuta User Mapping

By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin checks the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and in the connected Immuta tenant must match.

It is possible within Immuta for multiple users to share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, update the value of immuta.user.mapping.iamid created and hosted in the previous steps to the targeted IAM ID configured within the Immuta tenant. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.
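
Following the upper-case naming convention from step 4 of the download section, this property can also be set as a cluster environment variable; the value is a placeholder:

# Map Databricks usernames against a specific IAM configured in Immuta
IMMUTA_USER_MAPPING_IAMID=<IAM ID from the App Settings page>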
