Run spark-submit Jobs on Databricks
Audience: System Administrators
Content Summary: This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.
Language Support
R and Scala are supported but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit jobs are not supported by the Databricks Spark integration.
Using R in a Notebook
Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions.
R spark-submit
Prerequisites
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
Initialize the Spark session by entering these settings into the R submit script: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false". This will enable the R script to access Immuta data sources, scratch paths, and workspace tables.
Once the script is written, upload it to a location in dbfs/S3/ABFS to give the Databricks cluster access to it.
Create the R spark-submit Job
To create the R spark-submit job:
Go to the Databricks jobs page.
Create a new job, and select Configure spark-submit.
Set up the parameters:
Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
Configure the Environment Variables section as you normally would for an Immuta cluster.
Scala spark-submit
Prerequisites
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.
The spark-submit job needs to be launched using a different classloader that points at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:
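A minimal sketch of such a launcher, under stated assumptions and not a confirmed copy of any official template: the names SubmitLauncher and SubmitJob and the directory /path/to/user/jars are hypothetical placeholders (substitute the user JARs directory that applies to your cluster), and the query inside runJob is only a stand-in for real job logic. It builds a URLClassLoader over the JARs in that directory, runs the job class through it by reflection, applies the two session settings described above, and stops the session in a finally block.

```scala
import java.io.File
import java.net.URLClassLoader

import org.apache.spark.sql.SparkSession

// Entry point to name in the --class parameter (hypothetical name).
object SubmitLauncher {
  // ASSUMPTION: placeholder for the designated user JARs directory.
  val UserJarsDir = "/path/to/user/jars"

  def main(args: Array[String]): Unit = {
    // Collect the JARs in the directory; listFiles() returns null if the path is missing.
    val jarUrls = Option(new File(UserJarsDir).listFiles())
      .getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".jar"))
      .map(_.toURI.toURL)

    // Separate classloader over the user JARs (its parent is the system classloader).
    val loader = new URLClassLoader(jarUrls)

    // Load the job class through that loader and invoke runJob() reflectively.
    val jobClass = loader.loadClass(classOf[SubmitJob].getName)
    val job = jobClass.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    jobClass.getMethod("runJob").invoke(job)
  }
}

// Job logic, instantiated by SubmitLauncher through the separate classloader.
class SubmitJob {
  def runJob(): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-submit-example") // hypothetical application name
      .config("immuta.spark.acl.assume.not.privileged", "true")
      .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
      .getOrCreate()
    try {
      // Stand-in for the real job: replace with your own queries.
      spark.sql("SHOW DATABASES").show()
    } finally {
      // Stop the session so the cluster can terminate.
      spark.stop()
    }
  }
}
```

With this layout, the --class parameter would name SubmitLauncher, and the JAR containing both classes would presumably also need to be present in the user JARs directory so that the separate classloader can find SubmitJob.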
Create the Scala spark-submit Job
To create the Scala spark-submit job:
Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.
Select Configure spark-submit, and configure the parameters:
Note: The --class parameter takes the fully-qualified class name of the class whose main function will be used as the entry point for your code.
Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" section (where dbfs:/path/to/code.jar is the path to your JAR) so that the JAR is uploaded to all the cluster nodes.
Caveats
The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs because the setting immuta.spark.acl.assume.not.privileged="true" is used.
There is an option of using the immuta.api.key setting with an Immuta API key generated on the Immuta Profile Page.
Currently, when an API key is generated, it invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set the immuta.api.key on all the clusters, or use a specified job user for the submit job.