# Run R and Scala spark-submit Jobs on Databricks

This guide illustrates how to run R and Scala `spark-submit` jobs on Databricks, including prerequisites and caveats.

## R `spark-submit`

### Prerequisites

Before you can run `spark-submit` jobs on Databricks, complete the following steps.

1. Initialize the Spark session:
   1. Add these settings to the Spark session initialization in your R submit script so that the script can access Immuta data sources, scratch paths, and workspace tables: `immuta.spark.acl.assume.not.privileged="true"` and `spark.hadoop.immuta.databricks.config.update.service.enabled="false"`.
   2. Once the script is written, upload it to a location in DBFS, S3, or ABFS that the Databricks cluster can access.
2. Because of how some user properties are populated in Databricks, load the SparkR library in a separate cell before attempting to use any SparkR functions. (A sketch of a complete submit script follows these steps.)
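
For reference, the following is a minimal sketch of what such an R submit script might look like, assuming a SparkR workflow; the application name and data source name are placeholders:

```r
# Load SparkR before calling any SparkR functions (see step 2 above)
library(SparkR)

# Initialize the Spark session with the Immuta settings from step 1
sparkR.session(
  appName = "Example R Spark Submit",
  sparkConfig = list(
    "immuta.spark.acl.assume.not.privileged" = "true",
    "spark.hadoop.immuta.databricks.config.update.service.enabled" = "false"
  )
)

# Query an Immuta data source (replace the placeholder with your data source name)
df <- tableToDF("immuta.<YOUR DATASOURCE>")
showDF(df)

# Stop the session when the job is finished
sparkR.session.stop()
```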

### Create the R `spark-submit` Job

To create the R `spark-submit` job:

1. Go to the Databricks jobs page.
2. Create a new job, and select **Configure spark-submit**.
3. Set up the parameters:

   ```
    [
    "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
    "dbfs:/path/to/script.R",
    "arg1", "arg2", "..."
    ]
   ```

   *Note: The path `dbfs:/path/to/script.R` can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.*
4. Edit the cluster configuration, and change the Databricks Runtime to be a [supported version](https://documentation.immuta.com/SaaS/configuration/integrations/databricks/reference-guides/databricks/installation-and-compliance#system-requirements).
5. Configure the [Spark environment variables](https://documentation.immuta.com/SaaS/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/configuration) section as you normally would for an Immuta cluster.

## Scala `spark-submit`

### Prerequisites

Before you can run `spark-submit` jobs on Databricks, you must initialize the Spark session with the settings outlined below.

1. Configure the Spark session with `immuta.spark.acl.assume.not.privileged="true"` and `spark.hadoop.immuta.databricks.config.update.service.enabled="false"`.

   *Note: Stop your Spark session (`spark.stop()`) at the end of your job or the cluster will not terminate.*
2. The `spark-submit` job must be launched with a separate classloader that points at the designated user JARs directory. The following Scala template shows how to launch your submit code through such a classloader:

   ```scala
   package com.example.job

   import java.net.URLClassLoader
   import java.io.File

   import org.apache.spark.sql.SparkSession

   object ImmutaSparkSubmitExample {
     def main(args: Array[String]): Unit = {
       val jarDir = new File("/databricks/immuta/jars/")
       val urls = jarDir.listFiles.map(_.toURI.toURL)

       // Configure a new ClassLoader which will load jars from the additional jars directory
       val cl = new URLClassLoader(urls)
       val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
       val job = jobClass.newInstance
       jobClass.getMethod("runJob").invoke(job)
     }
   }

   class ImmutaSparkSubmitExample {

     def getSparkSession(): SparkSession = {
       SparkSession.builder()
         .appName("Example Spark Submit")
         .enableHiveSupport()
         .config("immuta.spark.acl.assume.not.privileged", "true")
         .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
         .getOrCreate()
     }

     def runJob(): Unit = {
       val spark = getSparkSession
       try {
         val df = spark.table("immuta.<YOUR DATASOURCE>")

         // Run Immuta Spark queries...

       } finally {
         spark.stop()
       }
     }
   }
   ```

### Create the Scala `spark-submit` Job

To create the Scala `spark-submit` job:

1. Build your JAR and upload it to a location in DBFS, S3, or ABFS that the cluster can access.
2. Select **Configure spark-submit**, and configure the parameters:

   ```
    [
    "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
    "--class","org.youorg.package.MainClass",
    "dbfs:/path/to/code.jar",
    "arg1", "arg2", "..."
    ]
   ```

   *Note: In the `--class` parameter, specify the fully qualified class name of the class whose `main` function will be used as the entry point for your code.*

   *Note: The path `dbfs:/path/to/code.jar` can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.*
3. Edit the cluster configuration, and change the Databricks Runtime to a [supported version](https://documentation.immuta.com/SaaS/configuration/integrations/databricks/reference-guides/databricks/installation-and-compliance#system-requirements).
4. Include `IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar` in the **Environment Variables** section (where `dbfs:/path/to/code.jar` is the path to your JAR) so that the JAR is uploaded to all of the cluster nodes.

## Caveats

* User mapping works differently than it does for notebooks because `spark-submit` clusters are not configured with access to the Databricks SCIM API. Instead, the cluster tags are read to determine the cluster creator, and that user is matched to an Immuta user.
* Because the setting `immuta.spark.acl.assume.not.privileged="true"` is used, privileged users (Databricks admins and allowlisted users) must be tied to an Immuta user and granted access through Immuta before they can access data through `spark-submit` jobs.
* Another option is to use the `immuta.api.key` setting with an Immuta API key generated on the Immuta profile page.
* Currently, generating an API key invalidates the previous one. This can cause issues if a user runs multiple clusters in parallel, because each cluster will generate a new API key for that Immuta user. To avoid this, manually generate the API key in Immuta and set `immuta.api.key` on all of the clusters, or use a dedicated job user for the submit job. (A sketch of this configuration follows this list.)
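
For illustration, a rough sketch of supplying a manually generated key from an R submit script, assuming `immuta.api.key` can be passed as a Spark configuration value when the session is created (the key value is a placeholder):

```r
library(SparkR)

# Supplying a manually generated, shared API key avoids each cluster
# generating (and invalidating) its own key for the same Immuta user.
sparkR.session(
  sparkConfig = list(
    "immuta.api.key" = "<YOUR IMMUTA API KEY>",
    "immuta.spark.acl.assume.not.privileged" = "true",
    "spark.hadoop.immuta.databricks.config.update.service.enabled" = "false"
  )
)
```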
