Audience: System Administrators
Content Summary: This guide illustrates how to run R and Scala
spark-submit
jobs on Databricks, including prerequisites and caveats.
Language Support
R and Scala are supported, but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit
jobs are not supported by the Databricks Spark integration.
Using R in a Notebook
Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions.
spark-submit
Before you can run spark-submit
jobs on Databricks you must initialize the Spark session with the settings outlined below.
Initialize the Spark session by entering these settings into the R submit script immuta.spark.acl.assume.not.privileged="true"
and spark.hadoop.immuta.databricks.config.update.service.enabled="false"
.
This will enable the R script to access Immuta data sources, scratch paths, and workspace tables.
Once the script is written, upload the script to a location in dbfs/S3/ABFS
to give the Databricks cluster access to it.
spark submit
JobTo create the R spark-submit
job,
Go to the Databricks jobs page.
Create a new job, and select Configure spark-submit.
Set up the parameters:
Note: The path dbfs:/path/to/script.R
can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to be a supported version (5.5, 6.4, 7.3, or 7.4).
Configure the Environment Variables section as you normally would for an Immuta cluster.
Before you can run spark-submit
jobs on Databricks you must initialize the Spark session with the settings outlined below.
Configure the Spark session with immuta.spark.acl.assume.not.privileged="true"
and spark.hadoop.immuta.databricks.config.update.service.enabled="false"
.
Note: Stop your Spark session (spark.stop()
) at the end of your job or the cluster will not terminate.
The spark submit job needs to be launched using a different classloader which will point at the designated user JARs directory. The following Scala template can be used to handle launching your submit code using a separate classloader:
spark-submit
JobTo create the Scala spark-submit
job,
Build and upload your JAR to dbfs/S3/ABFS
where the cluster has access to it.
Select Configure spark-submit, and configure the parameters:
Note: The fully-qualified class name of the class whose main
function will be used as the entry point for your code in the --class
parameter.
Note: The path dbfs:/path/to/code.jar
can be in S3 or ABFS (on Azure Databricks) assuming the cluster is configured with access to that path.
Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar
in the "Environment Variables" (where dbfs:/path/to/code.jar
is the path to your jar) so that the jar is uploaded to all the cluster nodes.
The user mapping works differently from notebooks because spark-submit
clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit
jobs because the setting immuta.spark.acl.assume.not.privileged="true"
is used.
There is an option of using the immuta.api.key
setting with an Immuta API key generated on the Immuta Profile Page.
Currently when an API key is generated it invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set the immuta.api.key
on all the clusters or use a specified job user for the submit job.