# Simplified Databricks Spark Configuration

This guide details the simplified installation method for enabling access to Databricks with Immuta policies enforced.

Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the [Installation introduction](https://documentation.immuta.com/2024.3/integrations/databricks-spark/how-to-guides/configuration/..#prerequisites) before you begin.

{% hint style="warning" %}
**Databricks Unity Catalog**: If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
{% endhint %}

## 1 - Add the Integration on the App Settings Page

1. Log in to Immuta and click the **App Settings** icon in the left sidebar.
2. Scroll to the **System API Key** subsection under **HDFS** and click **Generate Key**.
3. Click **Save** and then **Confirm**.
4. Scroll to the **Integration Settings** section.
5. Click **+ Add Integration** and select **Databricks Integration** from the dropdown menu.
6. Complete the **Hostname** field.
7. Enter a **Unique ID** for the integration. By default, your Immuta tenant URL populates this field. This ID is used to tie the set of cluster policies to your Immuta tenant and allows multiple Immuta tenants to access the same Databricks workspace without cluster policy conflicts.
8. Select your configured **Immuta IAM** from the dropdown menu.
9. Choose one of the following options for your data access model:
   * **Protected until made available by policy**: All tables are hidden until a user is granted access through an Immuta policy. This follows the least-privilege model used by most databases, but it means you must register all of your tables with Immuta.
   * **Available until protected by policy**: All tables are open until explicitly registered and protected by Immuta. This model works well when most of your tables are non-sensitive and you want to choose which ones to protect.
10. Select the **Storage Access Type** from the dropdown menu.
11. Opt to add any **Additional Hadoop Configuration Files**.
12. Click **Add Integration**.

## 2 - Configure Cluster Policies

Several cluster policies are available on the App Settings page when configuring this integration:

* [Python & SQL](https://documentation.immuta.com/2024.3/integrations/databricks-spark/reference-guides/configuration-settings/cluster-policies/python-sql)
* [Python & SQL & R](https://documentation.immuta.com/2024.3/integrations/databricks-spark/reference-guides/configuration-settings/cluster-policies/python-sql-r)
* [Python & SQL & R with Library Support](https://documentation.immuta.com/2024.3/integrations/databricks-spark/reference-guides/configuration-settings/cluster-policies/python-sql-r-lib-support)
* [Scala](https://documentation.immuta.com/2024.3/integrations/databricks-spark/reference-guides/configuration-settings/cluster-policies/scala)
* [Sparklyr](https://documentation.immuta.com/2024.3/integrations/databricks-spark/reference-guides/configuration-settings/cluster-policies/sparklyr)

Click a link above to read more about each of these cluster policies before continuing with the tutorial.

1. Click **Configure Cluster Policies**.
2. Select one or more cluster policies in the matrix by clicking the **Select** button(s).
3. Opt to check the **Enable Unity Catalog** checkbox to generate cluster policies that enable Unity Catalog on your cluster. This option is only available when Databricks Runtime 11.3 is selected.
4. Opt to make changes to these cluster policies by clicking **Additional Policy Changes** and editing the text field.
5. Use one of the two Installation Types described below to apply the policies to your cluster:
   * **Automatically push cluster policies:** This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace.
     1. Select the **Automatically Push Cluster Policies** radio button.
     2. Enter your **Admin Token**. This token must be for a user who can create cluster policies in Databricks.
     3. Click **Apply Policies**.
   * **Manually push cluster policies:** This option lets you manually push the cluster policies to the configured Databricks workspace by downloading several files and uploading them to the workspace yourself.
     1. Select the **Manually Push Cluster Policies** radio button.
     2. Click **Download Init Script**.
     3. Follow the steps in the **Instructions to upload the init script to DBFS** section.
     4. Click **Download Policies**, and then manually add these Cluster Policies in Databricks.
6. Opt to click **Download the Benchmarking Suite** to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook; generating test data and running the queries requires both an Immuta-enabled and a non-Immuta cluster.
7. Click **Close**, and then click **Save** and **Confirm**.
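Databricks cluster policies are JSON documents that map attribute paths to constraint objects, and edits made under **Additional Policy Changes** can be thought of as entries layered over the generated template. The sketch below illustrates that merge with hypothetical keys and values; the actual templates Immuta generates will differ.

```python
import json

# Simplified stand-in for a generated cluster policy template.
# Databricks policy definitions map attribute paths to constraint objects.
template = {
    "spark_version": {"type": "fixed", "value": "11.3.x-scala2.12"},
    "autotermination_minutes": {"type": "fixed", "value": 120},
}

# Hypothetical edits entered under Additional Policy Changes.
overrides = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge"]},
}

# Later keys win, mirroring an edit-over-template workflow.
merged = {**template, **overrides}
print(json.dumps(merged, indent=2))
```

Keeping overrides in a separate document like this makes it easy to see exactly which constraints deviate from the Immuta-generated defaults.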

## 3 - Add Policies to Your Cluster

1. Create a cluster in Databricks by following the [Databricks documentation](https://docs.databricks.com/clusters/create.html).
2. In the **Policy** dropdown, select the Cluster Policies you pushed or manually added from Immuta.
3. Select the **Custom** Access mode.
4. Opt to adjust **Autopilot Options** and **Worker Type** settings: the default values may exceed what is necessary for non-production or smaller use cases. To reduce resource usage, you can enable or disable autoscaling, limit the size and number of workers, and lower the inactivity timeout.
5. Opt to configure the **Instances** tab in the **Advanced Options** section:
   * **IAM Role** (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the [AWS](https://documentation.immuta.com/2024.3/integrations/databricks-spark/how-to-guides/manual#authenticating-with-access-keys-or-session-tokens-optional) section.)
6. Click **Create Cluster**.
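If you prefer to script cluster creation, the same settings can be expressed as a Databricks Clusters API payload that references the Immuta-generated policy by ID. A hedged sketch, in which the cluster name, policy ID, and node type are placeholders for your own values:

```json
{
  "cluster_name": "immuta-enabled-cluster",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "policy_id": "ABC1234DEF567890",
  "apply_policy_default_values": true,
  "autoscale": { "min_workers": 1, "max_workers": 2 },
  "autotermination_minutes": 30
}
```

Setting `apply_policy_default_values` to `true` asks Databricks to fill unspecified attributes from the policy's defaults, which keeps the payload small and the policy authoritative.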

## 4 - Register data

[Register Databricks securables in Immuta](https://documentation.immuta.com/2024.3/data-and-integrations/registering-metadata/register-data-sources/query-backed-tutorial).

## 5 - Query Immuta Data

When the Immuta-enabled Databricks cluster has started successfully, Immuta creates an `immuta` database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the `immuta` database. Additionally, when configuring a Databricks cluster, you can hide the `immuta` database from calls to `SHOW DATABASES` so that users aren't confused by its presence. For more details, see the [<mark style="color:blue;">Hiding the</mark> <mark style="color:blue;">`immuta`</mark> <mark style="color:blue;">Database in Databricks</mark>](https://documentation.immuta.com/2024.3/integrations/databricks-spark/how-to-guides/hide-immuta-database) page.

1. Before users can query an Immuta data source, an administrator must give the user `Can Attach To` permissions on the cluster.
2. See the [Databricks Data Source Creation guide](https://documentation.immuta.com/2024.3/data-and-integrations/registering-metadata/register-data-sources/query-backed-tutorial) for a detailed walkthrough of creating Databricks data sources in Immuta.

### Example Queries

Below are example queries you can run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names, as in the first example. Instead, you can run queries like the second example, which does not reference the [<mark style="color:blue;">`immuta`</mark> <mark style="color:blue;">database</mark>](https://documentation.immuta.com/2024.3/integrations/databricks-spark/how-to-guides/hide-immuta-database).

```sql
%sql
SELECT * FROM immuta.my_data_source LIMIT 5;
```

```sql
%sql
SELECT * FROM my_data_source LIMIT 5;
```
