# Run R and Scala spark-submit Jobs on Databricks

This guide illustrates how to run R and Scala `spark-submit` jobs on Databricks, including prerequisites and caveats.

## R `spark-submit`

### Prerequisites

Before you can run `spark-submit` jobs on Databricks, complete the following steps.

1. Initialize the Spark session:
   1. Enter these settings into the R submit script to allow the R script to access Immuta data sources, scratch paths, and workspace tables: `immuta.spark.acl.assume.not.privileged="true"` and `spark.hadoop.immuta.databricks.config.update.service.enabled="false"`.
   2. Once the script is written, upload it to a location in DBFS, S3, or ABFS that the Databricks cluster can access.
2. Because of how some user properties are populated in Databricks, load the SparkR library in a separate cell before attempting to use any SparkR functions.

### Create the R `spark-submit` Job

To create the R `spark-submit` job:

1. Go to the Databricks jobs page.
2. Create a new job, and select **Configure spark-submit**.
3. Set up the parameters:

   ```
    [
    "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
    "dbfs:/path/to/script.R",
    "arg1", "arg2", "..."
    ]
   ```

   *Note: The path `dbfs:/path/to/script.R` can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.*
4. Edit the cluster configuration, and change the Databricks Runtime to a [supported version](/saas/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/installation-and-compliance.md#system-requirements).
5. Configure the [Spark environment variables](/saas/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/configuration.md) section as you normally would for an Immuta cluster.

## Scala `spark-submit`

### Prerequisites

Before you can run `spark-submit` jobs on Databricks, complete the following steps.

1. Configure the Spark session with `immuta.spark.acl.assume.not.privileged="true"` and `spark.hadoop.immuta.databricks.config.update.service.enabled="false"`.

   *Note: Stop your Spark session (`spark.stop()`) at the end of your job or the cluster will not terminate.*
2. Launch the `spark-submit` job with a separate classloader that points at the designated user JARs directory. You can use the following Scala template to launch your submit code this way:

   ```scala
   package com.example.job

   import java.io.File
   import java.net.URLClassLoader

   import org.apache.spark.sql.SparkSession

   object ImmutaSparkSubmitExample {
     def main(args: Array[String]): Unit = {
       val jarDir = new File("/databricks/immuta/jars/")
       val urls = jarDir.listFiles.map(_.toURI.toURL)

       // Configure a new ClassLoader which will load jars from the additional jars directory
       val cl = new URLClassLoader(urls)
       val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
       // Class.newInstance is deprecated; instantiate through the no-arg constructor instead
       val job = jobClass.getDeclaredConstructor().newInstance()
       jobClass.getMethod("runJob").invoke(job)
     }
   }

   class ImmutaSparkSubmitExample {

     // Build (or reuse) a session with the settings required for spark-submit jobs
     def getSparkSession(): SparkSession = {
       SparkSession.builder()
         .appName("Example Spark Submit")
         .enableHiveSupport()
         .config("immuta.spark.acl.assume.not.privileged", "true")
         .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
         .getOrCreate()
     }

     def runJob(): Unit = {
       val spark = getSparkSession()
       try {
         val df = spark.table("immuta.<YOUR DATASOURCE>")

         // Run Immuta Spark queries...

       } finally {
         // Always stop the session so the cluster can terminate
         spark.stop()
       }
     }
   }
   ```
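
The template above calls `runJob` without arguments, so values such as `"arg1", "arg2"` in the job parameters are not forwarded to the job. The following is a minimal sketch of one way to pass them through the reflective call, not part of the Immuta template. `ArgEchoJob`, `ClassLoaderLaunch`, `jarUrls`, and `launch` are hypothetical names used only for illustration, the stand-in job does not use Spark, and filtering on the `.jar` extension is an assumption.

```scala
import java.io.File
import java.net.{URL, URLClassLoader}

// Hypothetical stand-in for the job class in the template; no Spark is needed here.
class ArgEchoJob {
  def runJob(args: Array[String]): String = args.mkString(",")
}

// Hypothetical helper that mirrors the template's launch logic.
object ClassLoaderLaunch {
  // Lists JAR URLs in a directory. Returns an empty array when the directory
  // does not exist, because File.listFiles returns null in that case.
  // Assumption: only files ending in ".jar" are wanted.
  def jarUrls(dir: File): Array[URL] =
    Option(dir.listFiles)
      .getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".jar"))
      .map(_.toURI.toURL)

  // Loads the named class through a separate URLClassLoader and invokes
  // runJob(Array[String]) on a new instance, forwarding the arguments.
  def launch(jarDir: File, className: String, args: Array[String]): AnyRef = {
    val cl = new URLClassLoader(jarUrls(jarDir), getClass.getClassLoader)
    val jobClass = cl.loadClass(className)
    val job = jobClass.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    // The array is passed as a single reflective argument, not spread as varargs.
    jobClass.getMethod("runJob", classOf[Array[String]]).invoke(job, args)
  }
}
```

In a `main` method, this could be called as `ClassLoaderLaunch.launch(new File("/databricks/immuta/jars/"), classOf[ArgEchoJob].getName, args)`, with your own job class in place of `ArgEchoJob`.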

### Create the Scala `spark-submit` Job

To create the Scala `spark-submit` job:

1. Build your JAR and upload it to a location in DBFS, S3, or ABFS that the cluster can access.
2. Select **Configure spark-submit**, and configure the parameters:

   ```
    [
    "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service",
    "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r",
    "--class","org.yourorg.package.MainClass",
    "dbfs:/path/to/code.jar",
    "arg1", "arg2", "..."
    ]
   ```

   *Note: Set the `--class` parameter to the fully qualified name of the class whose `main` function is the entry point for your code.*

   *Note: The path `dbfs:/path/to/code.jar` can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.*
3. Edit the cluster configuration, and change the Databricks Runtime to a [supported version](/saas/configuration/integrations/databricks/databricks-spark/reference-guides/databricks/installation-and-compliance.md#system-requirements).
4. Include `IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar` in the "Environment Variables" section (where `dbfs:/path/to/code.jar` is the path to your JAR) so that the JAR is uploaded to all the cluster nodes.
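
The R and Scala parameter arrays in this guide repeat the same long `--conf` strings, which are easy to mistype when edited by hand. As a minimal sketch, the array can be generated instead. `SparkSubmitParameters`, `build`, and `toJson` are hypothetical names, not part of Immuta or Databricks; the option values are copied from the arrays above.

```scala
// Hypothetical helper for generating the spark-submit parameter array.
object SparkSubmitParameters {
  // JVM options copied from the parameter arrays in this guide.
  val immutaJavaOptions: String = Seq(
    "-Djava.security.manager=com.immuta.security.ImmutaSecurityManager",
    "-Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json",
    "-Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service"
  ).mkString(" ")

  // Builds the ordered parameter list. Pass mainClass = None for an R script,
  // which takes no --class parameter.
  def build(resource: String, mainClass: Option[String], jobArgs: Seq[String]): Seq[String] = {
    val conf = Seq(
      "--conf", s"spark.driver.extraJavaOptions=$immutaJavaOptions",
      "--conf", s"spark.executor.extraJavaOptions=$immutaJavaOptions",
      "--conf", "spark.databricks.repl.allowedLanguages=python,sql,scala,r"
    )
    val cls = mainClass.toSeq.flatMap(c => Seq("--class", c))
    conf ++ cls ++ (resource +: jobArgs)
  }

  // Renders the list as a JSON array. Only backslashes and double quotes are
  // escaped; control characters are not handled in this sketch.
  def toJson(params: Seq[String]): String =
    params
      .map(p => "\"" + p.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
      .mkString("[", ",", "]")
}
```

For example, `SparkSubmitParameters.toJson(SparkSubmitParameters.build("dbfs:/path/to/code.jar", Some("com.example.job.ImmutaSparkSubmitExample"), Seq("arg1", "arg2")))` produces a single-line JSON array that can be pasted into the parameters field.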

## Caveats

* The user mapping works differently from notebooks because `spark-submit` clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
* Because the setting `immuta.spark.acl.assume.not.privileged="true"` is used, privileged users (Databricks admins and allowlisted users) must be tied to an Immuta user and given access through Immuta before they can access data through `spark-submit` jobs.
* Optionally, you can use the `immuta.api.key` setting with an Immuta API key generated on the Immuta profile page.
* Currently, generating an API key invalidates the previous key. This can cause issues if a user runs multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set `immuta.api.key` on all the clusters, or use a specified job user for the submit job.


