How-to Guides

Project UDFs Cache Settings

This page outlines the configuration for setting up project UDFs, which allow users to set their current project in Immuta through Spark. For details about the specific functions available and how to use them, see the Use Project UDFs (Databricks) page.

Use project UDFs in Databricks Spark

Immuta caches information pertaining to a user's current project, and not all of those caches are invalidated outside of Databricks. Consequently, this feature should only be used in Databricks.

  1. Lower the web service cache timeout in Immuta:

    1. Click the App Settings icon and scroll to the HDFS Cache Settings section.

    2. Lower the Cache TTL of HDFS user names (ms) to 0.

    3. Click Save.

  2. Raise the cache timeout on your Databricks cluster: In the Spark environment variables section, set the IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS and IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS variables to high values (such as 10000).

    Note: These caches will be invalidated on cluster when a user calls immuta.set_current_project, so they can effectively be cached permanently on cluster to avoid periodically reaching out to the web service.
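
    For example, the Spark environment variables section of the cluster configuration might contain the following (10000 seconds is only an illustrative high value):

    IMMUTA_CURRENT_PROJECT_CACHE_TIMEOUT_SECONDS=10000
    IMMUTA_PROJECT_CACHE_TIMEOUT_SECONDS=10000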

Install a Trusted Library

Databricks Libraries API: Installing trusted libraries outside of the Databricks Libraries API (e.g., ADD JAR ...) is not supported.

  1. In the Databricks Clusters UI, install your third-party library .jar or Maven artifact with Library Source Upload, DBFS, DBFS/S3, or Maven. Alternatively, use the Databricks libraries API.

  2. In the Databricks Clusters UI, add the IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS property as a Spark environment variable and set it to your artifact's URI. To specify more than one trusted library, comma delimit the URIs:

For Maven artifacts, the URI is maven:/<maven_coordinates>, where <maven_coordinates> is the Coordinates field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For example, if the installed artifact's Coordinates field is com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644, you would add the following Spark environment variable:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644

For jar artifacts, the URI is the Source field found when clicking on the installed artifact on the Libraries tab in the Databricks Clusters UI. For artifacts installed from DBFS or S3, this ends up being the original URI to your artifact. For uploaded artifacts, Databricks will rename your .jar and put it in a directory in DBFS. For example, if the installed artifact's Source field is dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar, you would add the following Spark environment variable:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar
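
To trust more than one library, combine the URIs with commas. A hypothetical combination of the two examples above, shown only to illustrate the comma-delimited format:

IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/com.github.immuta.hadoop.immuta-spark-third-party-maven-lib-test:2020-11-17-144644,dbfs:/immuta/bstabile/jars/immuta-spark-third-party-lib-test.jar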

  3. Restart the cluster.

  4. Once the cluster is up, execute a command in a notebook. If the trusted library installation is successful, you should see driver log messages like this:

TrustedLibraryUtils: Successfully found all configured Immuta configured trusted libraries in Databricks.
TrustedLibraryUtils: Wrote trusted libs file to [/databricks/immuta/immutaTrustedLibs.json]: true.
TrustedLibraryUtils: Added trusted libs file with 1 entries to spark context.
TrustedLibraryUtils: Trusted library installation complete.

Troubleshooting

This page provides guidelines for troubleshooting issues with the Databricks Spark integration and resolving Py4J security and Databricks trusted library errors.

Debugging the integration

For easier debugging of the Databricks Spark integration, follow the recommendations below.

  • Enable cluster init script logging:

    • In the cluster page in Databricks for the target cluster, navigate to Advanced Options -> Logging.

    • Change the Destination from NONE to DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto the end of the provided path.

  • View the Spark UI on your target Databricks cluster: On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.

Using the validation and debugging notebook

The validation and debugging notebook is designed to be used by or under the guidance of an Immuta support professional. Reach out to your Immuta representative for assistance.

  1. Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.

  2. Click the arrow next to your name and select Import.

  3. Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.

Py4J security error

  • Error Message: py4j.security.Py4JSecurityException: Constructor <> is not allowlisted

  • Explanation: This error indicates you are being blocked by Py4J security rather than the Immuta Security Manager. Py4J security is strict and generally ends up blocking many ML libraries.

  • Solution: Turn off Py4J security on the offending cluster by setting IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false in the environment variables section. Additionally, because there are limitations to the security mechanisms Immuta employs on-cluster when Py4J security is disabled, ensure that all users on the cluster have the same level of access to data, as users could theoretically see (policy-enforced) data that other users have queried.
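
    For example, add this to the Spark environment variables section of the offending cluster:

    IMMUTA_SPARK_DATABRICKS_PY4J_STRICT_ENABLED=false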

Databricks trusted library errors

Check the driver logs for details. Some possible causes of failure include:

  • One of the Immuta-configured trusted library URIs does not point to a Databricks library. Check that you have configured the correct URI for the Databricks library.

  • For trusted Maven artifacts, the URI must follow this format: maven:/group.id:artifact-id:version (for example, IMMUTA_SPARK_DATABRICKS_TRUSTED_LIB_URIS=maven:/my.group.id:my-package-id:1.2.3).

  • Databricks failed to install a library. Any Databricks library installation errors will appear in the Databricks UI under the Libraries tab.


Manually Update Your Databricks Cluster

If a Databricks cluster needs to be manually updated to reflect changes in the Immuta init script or cluster policies, you can remove and set up your integration again to get the updated policies and init script.

  1. Log in to Immuta as an Application Admin.

  2. Click the App Settings icon in the left sidebar and scroll to the Integration Settings section.

  3. Your existing Databricks Spark integration should be listed here; expand it and note the configuration values. Now select Remove to remove your integration.

  4. Click Add Integration and select Databricks Integration to add a new integration.

  5. Enter your Databricks Spark integration settings again as configured previously.

  6. Click Add Integration to add the integration, and then select Configure Cluster Policies to set up the updated cluster policies and init script.

  7. Select the cluster policies you wish to use for your Immuta-enabled Databricks clusters.

  8. Automatically push cluster policies and the init script (recommended) or manually update your cluster policies.

    • Automatically push cluster policies

      1. Select Automatically Push Cluster Policies and enter your privileged Databricks access token. This token must have privileges to write to cluster policies.

      2. Select Apply Policies to push the cluster policies and init script again.

      3. Click Save and Confirm to deploy your changes.

  • Manually update cluster policies

    1. Download the init script and the new cluster policies to your local computer.

    2. Click Save and Confirm to save your changes in Immuta.

    3. Log in to your Databricks workspace with your administrator account to set up cluster policies.

    4. Get the path you will upload the init script (`immuta_cluster_init_script_proxy.sh`) to by opening one of the cluster policy `.json` files and looking for the `defaultValue` of the field `init_scripts.0.dbfs.destination`. This should be a DBFS path in the form of `dbfs:/immuta-plugin/hostname/immuta_cluster_init_script_proxy.sh`.

    5. Click Data in the left pane to upload your init script to DBFS to the path you found above.

    6. To find your existing cluster policies you need to update, click Compute in the left pane and select the Cluster policies tab.

    7. Edit each of these cluster policies that were configured before and overwrite the contents of the JSON with the new cluster policy JSON you downloaded.

  9. Restart any Databricks clusters using these updated policies for the changes to take effect.

Configure a Databricks Spark Integration

Permissions

  • APPLICATION_ADMIN Immuta permission

  • CAN MANAGE Databricks privilege on the cluster

    Requirements

    • A Databricks workspace with the Premium tier, which includes cluster policies (required to configure the Spark integration)

    • A cluster that uses one of these supported Databricks Runtimes:

      • 11.3 LTS

      • 14.3 (private preview)

    • Supported languages

      • Python

      • R (not supported for Databricks Runtime 14.3)

      • Scala (not supported for Databricks Runtime 14.3)

      • SQL

    • A Databricks cluster that is one of these supported compute types:

      • All-purpose compute

      • Job compute

    • Custom access mode

    • A Databricks workspace and cluster with the ability to directly make HTTP calls to the Immuta web service. The Immuta web service also must be able to connect to and perform queries on the Databricks cluster, and to call Databricks workspace APIs.

    Prerequisites

    • Enable OAuth M2M authentication (recommended) or personal access tokens.

    • Disable Photon by setting runtime_engine to STANDARD using the Clusters API; a sketch of such a request appears after this list. Immuta does not support clusters with Photon enabled. Photon is enabled by default on compute running Databricks Runtime 9.1 LTS or newer and must be manually disabled before setting up the integration with Immuta.

    • Restrict the set of Databricks principals who have CAN MANAGE privileges on Databricks clusters where the Spark plugin is installed. This is to prevent editing environment variables or Spark configuration, editing cluster policies, or removing the Spark plugin from the cluster, all of which would cause the Spark plugin to stop working.

    • If Databricks Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the Databricks Spark integration to create an Immuta-enabled cluster. See the configure cluster policies section below for guidance.

    • If Databricks Unity Catalog is not enabled in your Databricks workspace, you must disable Unity Catalog in your Immuta tenant before proceeding with your configuration of Databricks Spark:

      1. Navigate to the App Settings page and click Integration Settings.

      2. Uncheck the Enable Unity Catalog checkbox.
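
    As referenced in the Photon prerequisite above, runtime_engine can be changed with the Databricks Clusters API (for example, a POST to /api/2.0/clusters/edit). The following is only a sketch: the edit endpoint expects your cluster's full existing specification, and the cluster_id, spark_version, node_type_id, and worker count below are placeholders.

     {
       "cluster_id": "<your-cluster-id>",
       "spark_version": "11.3.x-scala2.12",
       "node_type_id": "<your-node-type>",
       "num_workers": 2,
       "runtime_engine": "STANDARD"
     }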

    Add the integration on the app settings page

    1. Click the App Settings icon in Immuta.

    2. Navigate to HDFS > System API Key and click Generate Key.

    3. Click Save and then Confirm. If you do not save and confirm, the system API key will not be saved.

    4. Scroll to the Integration Settings section.

    5. Click + Add Native Integration and select Databricks Spark Integration from the dropdown menu.

    6. Complete the Hostname field.

    7. Enter a Unique ID for the integration. The unique ID is used to name cluster policies clearly, which is important when managing several Databricks Spark integrations. As cluster policies are workspace-scoped, but multiple integrations might be made in one workspace, this ID lets you distinguish between different sets of cluster policies.

    8. Select the identity manager that should be used when mapping the current Spark user to their corresponding identity in Immuta from the Immuta IAM dropdown menu. This should be set to reflect the identity manager you use in Immuta (such as Entra ID or Okta).

    9. Choose an Access Model. The Protected until made available by policy option disallows reading and writing tables not protected by Immuta, whereas the Available until protected by policy option allows it.

    Behavior change in Immuta v2025.1 and newer

    If a table is registered in Immuta and does not have a subscription policy applied to it, that data will be visible to users in Databricks, even if the Protected until made available by policy setting is enabled.

    If you have enabled this setting, author an "Allow individually selected users" global subscription policy that applies to all data sources.

    10. Select the Storage Access Type from the dropdown menu.

    11. Opt to add any Additional Hadoop Configuration Files.

    12. Click Add Native Integration, and then click Save and Confirm. This will restart the application and save your Databricks Spark integration. (It is normal for this restart to take some time.)

    The Databricks Spark integration will not do anything until your cluster policies are configured, so even though your integration is saved, continue to the next section to configure your cluster policies so the Spark plugin can manage authorization on the Databricks cluster.

    Configure cluster policies

    1. Click Configure Cluster Policies.

    2. Select one or more cluster policies in the matrix. Clusters running Immuta with Databricks Runtime 14.3 can only use Python and SQL. You can make changes to the policy by clicking Additional Policy Changes and editing the environment variables in the text field or by downloading it. See the Spark environment variables reference guide for information about each variable and its default value. Some common settings are linked below:

      1. Audit all queries

      2. Scratch paths

      3. User impersonation (you can also prevent users from changing impersonation in a session)

    3. Select your Databricks Runtime.

    4. Use one of the two installation types described below to apply the policies to your cluster:

      • Automatically push cluster policies: This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.

        1. Select the Automatically Push Cluster Policies radio button.

        2. Enter your Admin Token. This token must be for a user who has the required Databricks privilege. This will give Immuta temporary permission to push the cluster policies to the configured Databricks workspace and overwrite any cluster policy templates previously applied to the workspace.

        3. Click Apply Policies.

      • Manually push cluster policies: Enabling this option allows you to manually push the cluster policies and the init script to the configured Databricks workspace.

        1. Select the Manually Push Cluster Policies radio button.

        2. Click Download Init Script and set the Immuta plugin init script as a cluster-scoped init script in Databricks by following the Databricks documentation.

        3. Click Download Policies, and then manually add this cluster policy to your Databricks workspace.

          1. Ensure that the init_scripts.0.workspace.destination in the policy matches the file path to the init script you configured above.

          2. The Immuta cluster policy references Databricks Secrets for several of the sensitive fields. These secrets must be manually created if the cluster policy is not automatically pushed. Use the Databricks API or CLI to push the proper secrets.

    5. Click Close, and then click Save and Confirm.

    6. Apply the cluster policy generated by Immuta to the cluster with the Spark plugin installed by following the Databricks documentation.

    Map users and grant them access to the cluster

    1. Map external user IDs from Databricks to Immuta.

    2. Give users the Can Attach To permission on the cluster.

Run R and Scala spark-submit Jobs on Databricks

This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.

R spark-submit

    Prerequisites

    Before you can run spark-submit jobs on Databricks, complete the following steps.

    1. Initialize the Spark session:

      1. Enter these settings into the R submit script to allow the R script to access Immuta data sources, scratch paths, and workspace tables: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false". (A minimal example script is sketched after this list.)

      2. Once the script is written, upload the script to a location in dbfs/S3/ABFS to give the Databricks cluster access to it.

    2. Because of how some user properties are populated in Databricks, load the SparkR library in a separate cell before attempting to use any SparkR functions.
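
    The following is a minimal sketch of an R submit script that applies the settings above. The data source name is a placeholder for any Immuta data source exposed to the cluster, and the showDF call and explicit session stop are illustrative only:

    library(SparkR)

    # Initialize the Spark session with the Immuta settings described above
    sparkR.session(sparkConfig = list(
      "immuta.spark.acl.assume.not.privileged" = "true",
      "spark.hadoop.immuta.databricks.config.update.service.enabled" = "false"
    ))

    # Query an Immuta data source (placeholder name)
    df <- sql("SELECT * FROM immuta.<YOUR DATASOURCE>")
    showDF(df)

    # Stop the Spark session at the end of the job
    sparkR.session.stop()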

    Create the R spark-submit Job

    To create the R spark-submit job,

    1. Go to the Databricks jobs page.

    2. Create a new job, and select Configure spark-submit.

    3. Set up the parameters:

     [
     "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
     "dbfs:/path/to/script.R",
     "arg1", "arg2", "..."
     ]

      Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.

    4. Edit the cluster configuration, and change the Databricks Runtime to be a supported version.

    5. Configure the Environment Variables section as you normally would for an Immuta cluster.

    Scala spark-submit

    Prerequisites

    Before you can run spark-submit jobs on Databricks you must initialize the Spark session with the settings outlined below.

    1. Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".

      Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.

    2. The spark submit job needs to be launched using a different classloader which will point at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:

    package com.example.job

    import java.net.URLClassLoader
    import java.io.File

    import org.apache.spark.sql.SparkSession

    object ImmutaSparkSubmitExample {
    def main(args: Array[String]): Unit = {
        val jarDir = new File("/databricks/immuta/jars/")
        val urls = jarDir.listFiles.map(_.toURI.toURL)

        // Configure a new ClassLoader which will load jars from the additional jars directory
        val cl = new URLClassLoader(urls)
        val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
        val job = jobClass.newInstance
        jobClass.getMethod("runJob").invoke(job)
    }
    }

    class ImmutaSparkSubmitExample {

    def getSparkSession(): SparkSession = {
        SparkSession.builder()
        .appName("Example Spark Submit")
        .enableHiveSupport()
        .config("immuta.spark.acl.assume.not.privileged", "true")
        .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
        .getOrCreate()
    }

    def runJob(): Unit = {
        val spark = getSparkSession
        try {
        val df = spark.table("immuta.<YOUR DATASOURCE>")

        // Run Immuta Spark queries...

        } finally {
        spark.stop()
        }
    }
    }

    Create the Scala spark-submit Job

    To create the Scala spark-submit job,

    1. Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.

    2. Select Configure spark-submit, and configure the parameters:

     [
     "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
     "--class","org.youorg.package.MainClass",
     "dbfs:/path/to/code.jar",
     "arg1", "arg2", "..."
     ]

      Note: Pass the fully qualified class name of the class whose main function will be used as the entry point for your code in the --class parameter.

      Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks) assuming the cluster is configured with access to that path.

    3. Edit the cluster configuration, and change the Databricks Runtime to a supported version.

    4. Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" (where dbfs:/path/to/code.jar is the path to your jar) so that the jar is uploaded to all the cluster nodes.

    Caveats

    • The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.

    • Privileged users (Databricks admins and allowlisted users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.

    • There is an option to use the immuta.api.key setting with an Immuta API key generated on the Immuta profile page.

    • Currently when an API key is generated it invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set the immuta.api.key on all the clusters or use a specified job user for the submit job.


    DBFS Access

    This page outlines how to enable access to DBFS in Databricks for non-sensitive data. Databricks administrators should place the desired configuration in the Spark environment variables.

    DBFS FUSE mount

    This Databricks feature mounts DBFS to the local cluster filesystem at /dbfs. Although disabled when using process isolation, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.

    DBFS FUSE mount limitation: This feature cannot be used in environments with E2 Private Link enabled.

    For example,

    %sh echo "I'm creating a new file in DBFS" > /dbfs/my/newfile.txt

    In Python,

    %python
    with open("/dbfs/my/newfile.txt", "w") as f:
      f.write("I'm creating a new file in DBFS")

    Note: This solution also works in R and Scala.

    Enable DBFS FUSE mount

    To enable the DBFS FUSE mount, set this configuration in the Spark environment variables: IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED=true.

    Mounting a bucket

    • Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount. Mounting must be performed from a non-Immuta cluster.

    • Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.

    Scala DBUtils (and %fs magic) with scratch paths

    Scratch paths will work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example,

    %fs put -f s3://my-bucket/my/scratch/path/mynewfile.txt "I'm creating a new file in S3"
    %scala dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")

    Configure Scala DBUtils (and %fs magic) with scratch paths

    To support %fs magic and Scala DBUtils with scratch paths, configure the following property:

    <property>
       <name>immuta.spark.databricks.scratch.paths</name>
       <value>s3://my-bucket/my/scratch/path</value>
    </property>

    Configure DBUtils in Python

    To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.

    Example workflow

    This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.

    1. Get the file from remote storage:

    %python
    import os

    s3ScratchFile = "s3://some-bucket/path/to/scratch/file"
    localScratchDir = os.environ.get("IMMUTA_LOCAL_SCRATCH_DIR")
    localScratchFile = "{}/myfile.txt".format(localScratchDir)
    dbutils.fs.cp(s3ScratchFile, "file://{}".format(localScratchFile))

    2. Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:

    %python
    import shutil

    localScratchFileCopy = "{}/myfile_copy.txt".format(localScratchDir)
    shutil.copy(localScratchFile, localScratchFileCopy)
    with open(localScratchFileCopy, "a") as f:
        f.write("Some appended file content")

    3. Write the new file back to remote storage:

    %python
    dbutils.fs.cp("file://{}".format(localScratchFileCopy), s3ScratchFile)
