# Create a Data Source

For a complete list of supported databases, see the [Immuta Support Matrix](https://documentation.immuta.com/2024.2/releases/support-matrix#databases).

{% hint style="info" %}
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
{% endhint %}

{% hint style="info" %}
**Redshift data sources**

* Redshift Spectrum data sources must be registered [via the Immuta CLI or V2 API](https://documentation.immuta.com/2024.2/developer-guides/the-immuta-cli) using [this payload](https://documentation.immuta.com/2024.2/developer-guides/api-intro/immuta-v2-api/request-payload-examples#redshift-spectrum-data-sources).
* Registering Redshift datashares as Immuta data sources is unsupported.
{% endhint %}

## Requirements

* `CREATE_DATA_SOURCE` Immuta permission
* The Snowflake user registering data sources must have the following privileges on all securables:
  * `USAGE` on all databases and schemas with registered data sources.
  * `REFERENCES` on all tables and views registered in Immuta.
  * `SELECT` on all tables and views registered in Immuta[^1].

{% hint style="warning" %}
**Snowflake imported databases**

Immuta does not support Snowflake tables from imported databases. Instead, create a view of the table and register that view as a data source.
{% endhint %}

* Databricks Spark integration requirements: Ensure that at least one of the following is true.
  * The user exposing the tables has `READ_METADATA` and `SELECT` permissions on the target views/tables (specifically if table ACLs are enabled).
  * The user exposing the tables is listed in the `immuta.spark.acl.whitelist` configuration on the target cluster.
  * The user exposing the tables is a Databricks workspace administrator.
* Databricks Unity Catalog integration requirements: When registering Databricks Unity Catalog securables in Immuta, use [the service principal from the integration configuration](https://documentation.immuta.com/2024.2/data-and-integrations/databricks-unity-catalog/how-to-guides/configure) and ensure it has the privileges listed below. Immuta uses this service principal continuously to orchestrate Unity Catalog policies and maintain state between Immuta and Databricks.
  * `USE CATALOG` and `MANAGE` on all catalogs containing securables registered as Immuta data sources.
  * `USE SCHEMA` on all schemas containing securables registered as Immuta data sources.
  * `MODIFY` and `SELECT` on all securables you want registered as Immuta data sources.

{% hint style="info" %}
`MANAGE` and `MODIFY` are required so that the service principal can apply row filters and column masks on the securable. To do so, the service principal must also have `SELECT` on the securable, `USE CATALOG` on its parent catalog, and `USE SCHEMA` on its parent schema. Because privileges are inherited, you can grant the service principal the `MODIFY` and `SELECT` privileges on all catalogs or schemas containing Immuta data sources, which automatically grants those privileges on all current and future securables in the catalog or schema. The service principal also inherits `MANAGE` from the parent catalog for the purpose of applying row filters and column masks, but that privilege must be set directly on the parent catalog in order for grants to be fully applied.
{% endhint %}
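The catalog-level grants above can be scripted. As a minimal sketch, the helper below only builds the `GRANT` statements as strings (the catalog and principal names are hypothetical placeholders), which you could then run in Databricks SQL:

```python
# Sketch: generate the Unity Catalog GRANT statements listed above for the
# Immuta service principal. Catalog and principal names are placeholders --
# substitute your own.

def unity_catalog_grants(catalog: str, principal: str) -> list[str]:
    """Build catalog-level grants; MODIFY and SELECT granted on the catalog
    are inherited by all current and future securables within it."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT MANAGE ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT MODIFY, SELECT ON CATALOG {catalog} TO `{principal}`;",
    ]

for stmt in unity_catalog_grants("analytics", "immuta-service-principal"):
    print(stmt)
```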

## Enter connection information

{% hint style="info" %}
**Best Practice: Connections Use SSL**

Although not required, it is recommended that all connections use SSL. Additional connection string arguments may also be provided.

*Note: The connection you provide is used only by Immuta, which injects all policy controls when users query the system. In other words, users always connect through Immuta with policies enforced and have no direct association with this connection.*
{% endhint %}

1. Navigate to the **My Data Sources** page.
2. Click the **New Data Source** button in the top right corner.
3. Select the data platform containing the data you wish to expose by clicking a tile.
4. Input the **connection parameters** to the database you're exposing. Click the tabs below for guidance on select data platforms.

{% tabs %}
{% tab title="Amazon S3" %}
See the [Create an Amazon S3 data source guide](https://documentation.immuta.com/2024.2/data-and-integrations/registering-metadata/register-data-sources/s3-tutorial) for instructions.
{% endtab %}

{% tab title="BigQuery" %}
{% hint style="info" %}
**Required Google BigQuery roles for creating data sources**

Ensure that the user creating the Google BigQuery data source has these roles:

* `roles/bigquery.metadataViewer` on the source table (if managed at that level) or dataset
* `roles/bigquery.dataViewer` (or higher) on the source table (if managed at that level) or dataset
* `roles/bigquery.jobUser` on the project
{% endhint %}

See the [Create a Google BigQuery data source guide](https://documentation.immuta.com/2024.2/data-and-integrations/registering-metadata/register-data-sources/bigquery-tutorial) for instructions.
{% endtab %}

{% tab title="Databricks" %}
{% hint style="warning" %}
**Azure Databricks Unity Catalog limitation**

Set all table-level ownership on your Unity Catalog data sources to an individual user or service principal instead of a Databricks group before proceeding. Otherwise, Immuta cannot apply data policies to the table in Unity Catalog. See the [Azure Databricks Unity Catalog limitation](https://documentation.immuta.com/2024.2/databricks-unity-catalog/unity-catalog-overview#azure-databricks-unity-catalog-limitation) for details.
{% endhint %}

1. Complete the first four fields in the **Connection Information** box:
   * **Server**: hostname or IP address
   * **Port**: port configured for Databricks, typically port 443
   * **SSL**: when enabled, ensures communication between Immuta and the remote database is encrypted
   * **Database**: the remote database
2. Select your authentication method from the dropdown:
   * **Access Token**:
     1. Enter your **Databricks API Token**. Use a non-expiring token so that access to the data source is not lost unexpectedly.
     2. Enter the **HTTP Path** of your Databricks cluster or SQL warehouse.
   * **OAuth machine-to-machine (M2M)**:
     1. Enter the **HTTP Path** of your Databricks cluster or SQL warehouse.
     2. Fill out the **Token Endpoint** with the full URL of the identity provider. This is where the generated token is sent. The default value is `https://<your workspace name>.cloud.databricks.com/oidc/v1/token`.
     3. Fill out the **Client ID**. This public identifier is a string of letters, numbers, or symbols and is the same as the [service principal's application ID](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html#step-3-create-an-oauth-secret-for-a-service-principal).
     4. Enter the **Scope** (string). The scope limits the operations and roles allowed in Databricks by the access token. See the [OAuth 2.0 documentation](https://oauth.net/2/scope/) for details about scopes.
     5. Enter the **Client Secret**. Immuta uses this secret to authenticate with the authorization server when it requests a token.
3. If you are using a proxy server with Databricks, specify it in the **Additional Connection String Options**:

   ```
   UseProxy=1;ProxyHost=my.host.com;ProxyPort=6789
   ```
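The OAuth M2M flow described above is a standard client-credentials token exchange. As a sketch, the helper below only assembles the token endpoint URL and form payload (the workspace name, client ID, secret, and the `all-apis` default scope are placeholder assumptions); the actual POST would be made with an HTTP client such as the requests library:

```python
# Sketch of the OAuth M2M (client-credentials) token request described above.
# All identifiers below are placeholders.

def build_token_request(workspace: str, client_id: str, client_secret: str,
                        scope: str = "all-apis"):
    """Return the default Databricks token endpoint URL and the form payload
    for a client-credentials grant."""
    url = f"https://{workspace}.cloud.databricks.com/oidc/v1/token"
    payload = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }
    return url, payload

url, payload = build_token_request("my-workspace", "abc123", "s3cret")
# e.g. requests.post(url, data=payload).json()["access_token"]
```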

{% endtab %}
{% endtabs %}

5. Click the **Test Connection** button.

{% hint style="info" %}
**Further Considerations**

* Immuta pushes down joins to be processed on the database when possible. To ensure this happens, make sure the connection information matches between data sources, including host, port, SSL, username, and password. **You will see performance degradation on joins against the same database if this information doesn't match**.
* Some data platforms require different connection information than pictured in this section. Please refer to the tool-tips in the Immuta UI for this step if you need additional guidance.
* If you are creating an Impala data source against a Kerberized instance of Impala, the username field locks down to your Immuta username unless you possess the `IMPERSONATE_HDFS_USER` permission.
* If a client certificate is required to connect to the source database, you can add it in the **Upload Certificates** section at the bottom of the form.
{% endhint %}

## Select virtual population

1. Decide how to virtually populate the data source by selecting **Create sources for all tables in this database and monitor for changes** or **Schema/Table**.
2. Complete the workflow for the option you selected; both workflows are outlined in the tabs below:

{% tabs %}
{% tab title="Create sources for all tables in this database and monitor for changes" %}
**Create sources for all tables in this database and monitor for changes**

Selecting this option will create and keep in sync all data sources within this database. New schemas will be automatically detected and the corresponding data sources and schema projects will be created.
{% endtab %}

{% tab title="Schema/Table" %}
**Schema/Table**

Selecting this option will create and keep in sync all tables within the schema(s) selected. No new schemas will be detected.

1. If you choose **Schema/Table**, click **Edit** in the table selection box that appears.
2. By default, all schemas and tables are selected. Select and deselect by clicking the **checkbox** to the left of the name in the Import Schemas/Tables menu. You can create multiple data sources at one time by selecting an entire schema or multiple tables.
3. After making your selection(s), click **Apply**.
{% endtab %}
{% endtabs %}

## Enter basic information

Provide information about your source to make it discoverable to users.

1. Enter the **SQL Schema Name Format** to be the SQL name that the data source exists under in the Immuta Query Engine. **It must include a schema macro**, but you may personalize the format using lowercase letters, numbers, and underscores. It may have up to 255 characters.
2. Enter the **Schema Project Name Format** to be the name of the schema project in the Immuta UI. If you enter a name that already exists, the name will automatically be incremented. For example, if the schema project `Customer table` already exists and you enter that name in this field, the name for this second schema project will automatically become `Customer table 2` when you create it.
   1. When selecting **Create sources for all tables in this database and monitor for changes** you may personalize this field as you wish, but **it must include a schema macro**.
   2. When selecting **Schema/Table**, this field is prepopulated with the recommended project name, and you can edit it freely.
3. Select the **Data Source Name Format**, which will be the format of the name of the data source in the Immuta UI.

{% tabs %}
{% tab title="<`Tablename`>" %}
**<`Tablename`>**

The data source name will be the name of the remote table, and the case of the data source name will match the case of the macro.
{% endtab %}

{% tab title="<`Schema`><`Tablename`>" %}
**<`Schema`><`Tablename`>**

The data source name will be the name of the remote schema followed by the name of the remote table, and the case of the data source name will match the cases of the macros.
{% endtab %}

{% tab title="Custom" %}
**Custom**

Enter a custom template for the Data Source Name. You may personalize this field as you wish, but **it must include a tablename macro**. The case of the macro will apply to the data source name (i.e., <`Tablename`> will result in "Data Source Name," <`tablename`> will result in "data source name," and <`TABLENAME`> will result in "DATA SOURCE NAME").
{% endtab %}
{% endtabs %}
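The casing rule described in the Custom tab can be sketched as a small helper (`apply_macro_case` is illustrative only, not an Immuta function):

```python
# Sketch of the macro-casing rule described above: the case of the macro
# (<Tablename>, <tablename>, <TABLENAME>) determines the case of the
# resulting data source name.

def apply_macro_case(macro: str, table_name: str) -> str:
    inner = macro.strip("<>")
    if inner.isupper():                # <TABLENAME> -> "DATA SOURCE NAME"
        return table_name.upper()
    if inner.islower():                # <tablename> -> "data source name"
        return table_name.lower()
    return table_name.title()          # <Tablename> -> "Data Source Name"

print(apply_macro_case("<Tablename>", "data source name"))   # Data Source Name
print(apply_macro_case("<TABLENAME>", "data source name"))   # DATA SOURCE NAME
```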

4. Enter the **SQL Table Name Format**, which will be the format of the name of the table in Immuta. **It must include a table name macro**, but you may personalize the format using lowercase letters, numbers, and underscores. It may have up to 255 characters.

## Enable or disable schema monitoring

When selecting the **Schema/Table** option you can opt to enable [**Schema Monitoring**](https://documentation.immuta.com/2024.2/data-and-integrations/registering-metadata/schema-monitoring) by selecting the **checkbox** in this section.

*Note: This step will only appear if all tables within a server have been selected for creation.*

### Create a schema detection job in Databricks

In most cases, Immuta’s schema detection job runs automatically from the Immuta web service. For Databricks, that automatic job is disabled because of the [ephemeral nature of Databricks clusters](https://documentation.immuta.com/2024.2/data-and-integrations/databricks-spark/reference-guides/configuration-settings/ephemeral-overrides). In this case, Immuta requires users to download a schema detection job template (a Python script) and import that into their Databricks workspace.

{% hint style="info" %}
**Generate Your Immuta API Key**

Before you can run the script referenced in this tutorial, generate your **Immuta API Key** from your user profile page. The Immuta API key used in the Databricks notebook job for schema detection must either belong to an Immuta Admin or the user who owns the schema detection groups that are being targeted.
{% endhint %}

1. Enable [**Schema Monitoring**](#enable-or-disable-schema-monitoring) or **Detect Column Changes** on the Data Source creation page.
2. Click **Download Schema Job Detection Template**.
3. Click the **Click Here To Download** text.
4. Before you can run the script, follow the [Databricks documentation](https://docs.databricks.com/en/security/secrets/index.html) to create the scope and secret using the **Immuta API Key** generated on your user profile page.
5. Import the Python script you downloaded into a Databricks workspace as a notebook. *Note: The job template has commented-out lines for specifying a particular database or table. With those two lines commented out, the schema detection job will run against ALL databases and tables in Databricks. Additionally, if you need to add proxy configuration to the job template, the template uses the* [*Python requests library*](https://requests.readthedocs.io/en/master/user/advanced/#proxies)*, which has a simple mechanism for configuring proxies for a request.*
6. Schedule the script as part of a notebook job to run as often as required. Each time the job runs, it will make an API call to Immuta to trigger schema detection queries, and these queries will run on the cluster from which the request was made. *Note: Use the `api_immuta` cluster for this job. The job in Databricks must use an Existing All-Purpose Cluster so that Immuta can connect to it over ODBC. Job clusters do not support ODBC connections.*
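For the proxy configuration mentioned in step 5, the requests library accepts a mapping of URL scheme to proxy URL. The sketch below shows only that wiring (the proxy URL is a placeholder, and the actual Immuta API call comes from the downloaded template, so it is represented here only as a comment):

```python
# Sketch of the proxy configuration mentioned in step 5. The proxy URL is a
# placeholder; substitute your own proxy host and port.

def build_proxies(proxy_url: str) -> dict:
    """requests accepts a {scheme: proxy_url} mapping via the proxies kwarg."""
    return {"http": proxy_url, "https": proxy_url}

proxies = build_proxies("http://my.proxy.host:6789")
# e.g. requests.post(immuta_url, headers=auth_headers, proxies=proxies)
```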

## Create the data source

Opt to configure settings in the [Advanced Options](#advanced-options) section (outlined below), and then click **Create** to save the data source(s).

## Advanced options

None of the following options are required. However, completing these steps will help maximize the utility of your data source.

{% tabs %}
{% tab title="Column Detection" %}
**Column Detection**

This setting monitors when remote tables' columns have been changed, updates the corresponding data sources in Immuta, and notifies Data Owners of these changes.

To enable, select the **checkbox** in this section.

See [Schema Projects Overview](https://documentation.immuta.com/2024.2/data-and-integrations/schema-monitoring#column-detection) to learn more about Column Detection.
{% endtab %}

{% tab title="Event Time" %}
**Event Time**

An Event Time column denotes the time associated with records returned from this data source. For example, if your data source contains news articles, the time that the article was published would be an appropriate Event Time column.

1. Click the **Edit** button in the Event Time section.
2. Select the **column(s)**.
3. Click **Apply**.

Selecting an Event Time column will enable

* more statistics to be calculated for this data source including the most recent record time, which is used for determining the freshness of the data source.
* the creation of [time-based restrictions](https://documentation.immuta.com/2024.2/secure-your-data/authoring-policies-in-secure/data-policies/how-to-guides/time-based-tutorial) in the Policy Builder.
{% endtab %}

{% tab title="Latency" %}
**Latency**

1. Click **Edit** in the Latency section.
2. Complete the **Set Time** field, and then select **MINUTES**, **HOURS**, or **DAYS** from the subsequent dropdown menu.
3. Click **Apply**.

This setting impacts the following behaviors:

* How long Immuta waits to refresh data that is in cache by querying the data source. For example, if you only load data once a day in the remote platform, this setting should be greater than 24 hours. If data is constantly loaded in the remote platform, you need to decide how much data latency is tolerable versus how much load you want on your data source; however, this is only relevant to Immuta S3, since SQL will always interactively query the remote database.
* How often Immuta checks for new values in a column that is driving row-level redaction policies. For example, if you are redacting rows based on a country column in the data, and you add a new country, it will not be seen by the Immuta policy until this period expires.
{% endtab %}

{% tab title="Sensitive Data Discovery" %}
**Sensitive Data Discovery**

Data Owners can disable Sensitive Data Discovery for their data sources in this section.

1. Click **Edit** in this section.
2. Select **Enabled** or **Disabled** in the window that appears, and then click **Apply**.
{% endtab %}

{% tab title="Data Source Tags" %}
**Data Source Tags**

Adding tags to your data source allows users to search for the data source using the tags and Governors to apply Global policies to the data source. *Note: If Schema Detection is enabled, any tags added now will also be added to the tables that are detected.*

To add tags:

1. Click the **Edit** button in the Data Source Tags section.
2. Begin typing in the **Search by Tag Name** box to select your tag, and then click **Add**.

Tags can also be added after you create your data source from the [Data Source details](https://documentation.immuta.com/2024.2/secure-your-data/data-consumers/subscribe-to-data-source#viewing-my-data-sources) page on the Overview tab or the Data Dictionary tab.
{% endtab %}
{% endtabs %}

[^1]: Only required when using [sensitive data discovery (SDD)](https://documentation.immuta.com/2024.2/discover-your-data/data-discovery) or specialized masking policies that require fingerprinting.
