Run spark-submit Jobs on Databricks

Audience: System Administrators

Content Summary: This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.

Language Support

R and Scala are supported, but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit jobs are not supported by the Databricks Spark integration.

Using R in a Notebook

Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions.

R spark-submit

Prerequisites

Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.

  1. Initialize the Spark session by adding these settings to the R submit script: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".

    This will enable the R script to access Immuta data sources, scratch paths, and workspace tables.

  2. Once the script is written, upload it to a location in DBFS, S3, or ABFS that the Databricks cluster can access.

Create the R spark-submit Job

To create the R spark-submit job,

  1. Go to the Databricks jobs page.

  2. Create a new job, and select Configure spark-submit.

  3. Set up the parameters:

     [
     "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
     "dbfs:/path/to/script.R",
     "arg1", "arg2", "..."
     ]

    Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.

  4. Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).

  5. Configure the Environment Variables section as you normally would for an Immuta cluster.

Scala spark-submit

Prerequisites

Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.

  1. Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".

    Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.

  2. The spark-submit job must be launched with a separate classloader that points at the designated user JARs directory. The following Scala template launches your submit code using a separate classloader:

    package com.example.job
    
    import java.net.URLClassLoader
    import java.io.File
    
    import org.apache.spark.sql.SparkSession
    
    object ImmutaSparkSubmitExample {
        def main(args: Array[String]): Unit = {
            val jarDir = new File("/databricks/immuta/jars/")
            val urls = jarDir.listFiles.map(_.toURI.toURL)
    
            // Configure a new ClassLoader which will load jars from the additional jars directory
            val cl = new URLClassLoader(urls)
            val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
            val job = jobClass.newInstance
            jobClass.getMethod("runJob").invoke(job)
        }
    }
    
    class ImmutaSparkSubmitExample {
    
        def getSparkSession(): SparkSession = {
            SparkSession.builder()
                .appName("Example Spark Submit")
                .enableHiveSupport()
                .config("immuta.spark.acl.assume.not.privileged", "true")
                .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
                .getOrCreate()
        }
    
        def runJob(): Unit = {
            val spark = getSparkSession()
            try {
                val df = spark.table("immuta.<YOUR DATASOURCE>")
    
                // Run Immuta Spark queries...
    
            } finally {
                // Stop the session so the cluster can terminate
                spark.stop()
            }
        }
    }
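The template above calls jarDir.listFiles directly, which returns null when the directory is missing or unreadable and would then fail with a NullPointerException. The following is a minimal sketch of a more defensive way to collect the JAR URLs. The object name JarDirLoader and its methods are hypothetical helpers, not part of Immuta or Databricks, and filtering on the .jar extension is an assumption the original template does not make.

```scala
import java.io.File
import java.net.{URL, URLClassLoader}

// Hypothetical helper (not an Immuta API): collects jar URLs defensively.
object JarDirLoader {
  // Return a URL for every .jar file in `dir`, sorted by file name.
  // File.listFiles returns null for a missing or unreadable directory,
  // so wrap it in Option and fall back to an empty array.
  def jarUrls(dir: File): Array[URL] =
    Option(dir.listFiles)
      .getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.endsWith(".jar"))
      .sortBy(_.getName)
      .map(_.toURI.toURL)

  // Build a classloader over those jars, failing fast with a readable
  // message instead of a NullPointerException.
  def loaderFor(dir: File): URLClassLoader = {
    val urls = jarUrls(dir)
    require(urls.nonEmpty, s"No jars found in ${dir.getPath}")
    new URLClassLoader(urls)
  }
}
```

In the template, `val cl = JarDirLoader.loaderFor(new File("/databricks/immuta/jars/"))` could stand in for the jarDir, urls, and cl lines, assuming only JAR files need to be on the classpath.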

Create the Scala spark-submit Job

To create the Scala spark-submit job,

  1. Build your JAR and upload it to a location in DBFS, S3, or ABFS that the cluster can access.

  2. On the Databricks jobs page, create a new job, select Configure spark-submit, and configure the parameters:

     [
     "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
     "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
     "--class","org.yourorg.package.MainClass",
     "dbfs:/path/to/code.jar",
     "arg1", "arg2", "..."
     ]

    Note: In the --class parameter, specify the fully-qualified name of the class whose main function will be used as the entry point for your code.

    Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.

  3. Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).

  4. Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" section (where dbfs:/path/to/code.jar is the path to your JAR) so that the JAR is uploaded to all the cluster nodes.
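The parameter list in step 2 repeats the same long JVM option string for the driver and the executors, and step 4 repeats the JAR path. The following sketch assembles both from a single place to reduce copy-and-paste errors. The object SparkSubmitParams and its methods are hypothetical helpers; the option strings and the environment variable name are copied from this guide.

```scala
// Hypothetical helper (not an Immuta or Databricks API).
object SparkSubmitParams {
  // JVM options from this guide, shared by the driver and the executors.
  val immutaJavaOptions: String = Seq(
    "-Djava.security.manager=com.immuta.security.ImmutaSecurityManager",
    "-Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json",
    "-Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service"
  ).mkString(" ")

  // Build the spark-submit parameter list in the order shown in step 2.
  def build(mainClass: String, jarPath: String, jobArgs: Seq[String]): Seq[String] =
    Seq(
      "--conf", s"spark.driver.extraJavaOptions=$immutaJavaOptions",
      "--conf", s"spark.executor.extraJavaOptions=$immutaJavaOptions",
      "--conf", "spark.databricks.repl.allowedLanguages=python,sql,scala,r",
      "--class", mainClass,
      jarPath
    ) ++ jobArgs

  // Environment variable from step 4, pointing at the same JAR path.
  def envVars(jarPath: String): Map[String, String] =
    Map("IMMUTA_INIT_ADDITIONAL_JARS_URI" -> jarPath)

  // Render the parameters as a JSON array. Only backslashes and double
  // quotes are escaped, which is sufficient for the values used here.
  def toJson(params: Seq[String]): String =
    params
      .map(p => "\"" + p.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
      .mkString("[", ",", "]")
}
```

For example, `SparkSubmitParams.toJson(SparkSubmitParams.build("com.example.job.ImmutaSparkSubmitExample", "dbfs:/path/to/code.jar", Seq("arg1", "arg2")))` produces a JSON array that could be pasted into the parameters field, and `envVars` is called with the same path so the two values cannot drift apart.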

Caveats

  • The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.

  • Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.

  • Optionally, you can use the immuta.api.key setting with an Immuta API key generated on the Immuta Profile Page.

  • Currently, generating an API key invalidates the previous key. This can cause issues if a user runs multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set immuta.api.key on all the clusters, or use a specified job user for the submit job.
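The settings named in this guide can be kept in one place so that every job and cluster uses the same values. The sketch below gathers the two required settings and adds immuta.api.key only when a key is supplied. The object ImmutaSessionSettings and the environment variable IMMUTA_API_KEY are hypothetical names, and this guide only states that the key should be set on the clusters, so whether it may also be applied through the session builder is an assumption to confirm with your Immuta support professional.

```scala
// Hypothetical helper (not an Immuta API): one source of truth for settings.
object ImmutaSessionSettings {
  // The two settings this guide requires for spark-submit jobs.
  val required: Map[String, String] = Map(
    "immuta.spark.acl.assume.not.privileged" -> "true",
    "spark.hadoop.immuta.databricks.config.update.service.enabled" -> "false"
  )

  // Add immuta.api.key only when a non-empty key is supplied;
  // surrounding whitespace is trimmed.
  def withApiKey(apiKey: Option[String]): Map[String, String] =
    apiKey.map(_.trim).filter(_.nonEmpty) match {
      case Some(key) => required + ("immuta.api.key" -> key)
      case None      => required
    }
}

// Assumed usage inside a job; IMMUTA_API_KEY is a hypothetical variable name:
//   val settings = ImmutaSessionSettings.withApiKey(sys.env.get("IMMUTA_API_KEY"))
//   val builder = settings.foldLeft(SparkSession.builder()) {
//     case (b, (k, v)) => b.config(k, v)
//   }
```

Reusing one manually generated key this way avoids the invalidation problem described above, because no cluster generates a fresh key of its own.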

Copyright © 2014-2024 Immuta Inc. All rights reserved.