# Sparklyr

{% hint style="info" %}
**Single-user clusters recommended**

Like Databricks, Immuta recommends single-user clusters for sparklyr when user isolation is required. A single-user cluster can be either a job cluster or a cluster with credential passthrough enabled. *Note: spark-submit jobs are not currently supported.*
{% endhint %}

Two cluster types can be configured with sparklyr: Single-User Clusters (recommended) and Multi-User Clusters (discouraged).

* [**Single-User Clusters**](#single-user-cluster-configuration): Credential passthrough (required on Databricks) allows a single-user cluster to be created. By default, this setting configures the cluster to assume the role of the attached user when reading from storage. However, because Immuta requires that the cluster itself can read raw data, the cluster should use its instance profile rather than a role assigned to the attached user.
* [**Multi-User Clusters**](#multi-user-cluster-configuration): Immuta does not recommend deploying a multi-user sparklyr cluster because it cannot guarantee user isolation on one. To eliminate the risk of a data leak, all users must act under the same set of attributes, groups, and purposes with respect to their data access. Consequently, every multi-user sparklyr cluster must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).

## Single-User Cluster Configuration

### 1 - Enable sparklyr

In addition to the configuration for an Immuta cluster with R, add this environment variable to the **Environment Variables** section of the cluster:

```conf
IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true
```

This configuration modifies the iptables rules on the cluster so that the sparklyr client can connect to the required ports on the JVM used by the sparklyr backend service.

### 2 - Set Up a sparklyr Connection in Databricks

1. Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so `library(sparklyr)` in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from `CRAN`. Additionally, loading `library(DBI)` will allow you to execute SQL queries.
2. Set up a sparklyr connection:

   ```r
   sc <- spark_connect(method = "databricks")
   ```
3. Pass the connection object to execute queries:

   ```r
   dbGetQuery(sc, "show tables in immuta")
   ```

### 3 - Configure a Single-User Cluster

Add the following items to the **Spark Config** section of the cluster:

```conf
spark.databricks.passthrough.enabled true

spark.databricks.pyspark.trustedFilesystems com.databricks.s3a.S3AFileSystem,shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,com.databricks.adl.AdlFileSystem,shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem,shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem,shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem,org.apache.hadoop.fs.ImmutaSecureFileSystemWrapper

spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.InstanceProfileCredentialsProvider
```

The `trustedFilesystems` setting is required to allow Immuta’s wrapper FileSystem (used in conjunction with the `ImmutaSecurityManager` for data security purposes) to be used with credential passthrough. Additionally, the `InstanceProfileCredentialsProvider` must be configured so that the cluster continues to use its instance profile for data access, rather than a role associated with the attached user.
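
Before restarting a cluster, it can help to sanity-check that these three settings are present. The following is a minimal Python sketch, not part of Immuta or Databricks: `parse_spark_config` and `single_user_problems` are hypothetical helper names, and the sketch assumes the Spark Config contents are available as plain text with one whitespace-separated key and value per line. The sample text uses a shortened trusted filesystems list for readability; a real cluster should use the full value shown above.

```python
# Hypothetical helpers for checking the single-user sparklyr settings.
REQUIRED_SINGLE_USER = {
    "spark.databricks.passthrough.enabled": "true",
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
}
TRUSTED_KEY = "spark.databricks.pyspark.trustedFilesystems"
IMMUTA_WRAPPER = "org.apache.hadoop.fs.ImmutaSecureFileSystemWrapper"


def parse_spark_config(text):
    """Parse 'key value' lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def single_user_problems(settings):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    for key, expected in REQUIRED_SINGLE_USER.items():
        if settings.get(key) != expected:
            problems.append(f"{key} should be {expected!r}")
    trusted = [fs.strip() for fs in settings.get(TRUSTED_KEY, "").split(",")]
    if IMMUTA_WRAPPER not in trusted:
        problems.append(f"{TRUSTED_KEY} must list {IMMUTA_WRAPPER}")
    return problems


# Shortened sample for illustration only (assumption: not the full list).
sample = """
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.trustedFilesystems com.databricks.s3a.S3AFileSystem,org.apache.hadoop.fs.ImmutaSecureFileSystemWrapper
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.InstanceProfileCredentialsProvider
"""
print(single_user_problems(parse_spark_config(sample)))  # prints []
```

An empty list only means the three settings from this section are present; it does not verify the instance profile itself or any other cluster setting.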

## Multi-User Cluster Configuration

{% hint style="warning" %}
**Avoid deploying multi-user clusters with sparklyr configuration**

It is possible, but not recommended, to deploy a multi-user cluster with sparklyr. Immuta cannot guarantee user isolation in a multi-user sparklyr configuration.
{% endhint %}

The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta workspaces.

1. Add the following environment variables to the **Environment Variables** section of your cluster configuration:

   ```conf
   IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true

   IMMUTA_SPARK_REQUIRE_EQUALIZATION=true

   IMMUTA_SPARK_CURRENT_USER_SCIM_FALLBACK=false
   ```
2. Add the following items to the **Spark Config** section:

   ```conf
   immuta.spark.acl.assume.not.privileged true

   immuta.api.key=<user’s API key>
   ```
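
If the cluster is managed as a JSON specification rather than through the UI, the two steps above could be expressed as one fragment. The sketch below is illustrative only: `multi_user_cluster_fragment` is a hypothetical helper, and it assumes the specification uses the `spark_env_vars` and `spark_conf` field names from the Databricks Clusters API. The API key is passed in as an argument so that it does not need to be hard-coded.

```python
import json


def multi_user_cluster_fragment(api_key):
    """Return the sparklyr-related part of a multi-user cluster spec.

    Hypothetical helper: the keys and values mirror the settings listed
    in the two steps above; everything else about the cluster is omitted.
    """
    if not api_key:
        raise ValueError("an Immuta API key is required")
    return {
        "spark_env_vars": {
            "IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED": "true",
            "IMMUTA_SPARK_REQUIRE_EQUALIZATION": "true",
            "IMMUTA_SPARK_CURRENT_USER_SCIM_FALLBACK": "false",
        },
        "spark_conf": {
            "immuta.spark.acl.assume.not.privileged": "true",
            "immuta.api.key": api_key,
        },
    }


# Placeholder value; substitute the real key at deployment time.
fragment = multi_user_cluster_fragment("<user's API key>")
print(json.dumps(fragment, indent=2))
```

The fragment would still need to be merged into a complete cluster specification, alongside the base configuration for an Immuta cluster with R.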

## Limitations

Immuta’s integration with sparklyr does not currently support

* spark-submit jobs,
* UDFs, or
* Databricks Runtimes 5, 6, or 7.
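
The runtime limitation can be checked programmatically before enabling sparklyr. This is a rough Python sketch under one assumption: the runtime is identified by a version string whose first dot-separated component is the major version, such as `7.3.x-scala2.12`. The function name `is_excluded_runtime` is hypothetical.

```python
# Major Databricks Runtime versions listed as unsupported on this page.
EXCLUDED_MAJOR_VERSIONS = {5, 6, 7}


def is_excluded_runtime(version_string):
    """Return True if the major version is one this page lists as unsupported.

    Assumes the string starts with '<major>.', for example '7.3.x-scala2.12'.
    """
    major = int(version_string.split(".", 1)[0])
    return major in EXCLUDED_MAJOR_VERSIONS


print(is_excluded_runtime("7.3.x-scala2.12"))   # prints True
print(is_excluded_runtime("10.4.x-scala2.12"))  # prints False
```

A `False` result only means the runtime is not on the exclusion list above; it does not confirm that a given runtime is otherwise supported by Immuta.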

