Create a Data Source

For a complete list of supported databases, see the Immuta Support Matrix.

This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.

Redshift data sources

Requirements

  • CREATE_DATA_SOURCE Immuta permission

  • Snowflake data source requirements:

    • USAGE Snowflake privilege on the schema and database

    • REFERENCES Snowflake privilege on the tables

  • Databricks Spark integration requirements: Ensure that at least one of the traits below is true.

    • The user exposing the tables has READ_METADATA and SELECT permissions on the target views/tables (specifically if Table ACLs are enabled).

    • The user exposing the tables is listed in the immuta.spark.acl.whitelist configuration on the target cluster.

    • The user exposing the tables is a Databricks workspace administrator.

  • Databricks Unity Catalog integration requirements: When exposing a table from Databricks Unity Catalog, be sure the credentials used to register the data sources have the Databricks privileges listed below.

    • The following privileges on the parent catalogs and schemas of those tables:

      • SELECT

      • USE CATALOG

      • USE SCHEMA

    • USE SCHEMA on system.information_schema
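As a rough illustration of the Unity Catalog privileges above, the sketch below builds GRANT statements that a Databricks administrator might run. It is one possible reading of the list, not an Immuta-provided script: the function name and the catalog, schema, and principal values are hypothetical, and where each privilege is granted (catalog versus schema) is an assumption.

```python
from typing import List


def unity_catalog_grants(catalog: str, schema: str, principal: str) -> List[str]:
    """Build GRANT statements for the privileges listed above.

    Assumption: USE CATALOG is granted on the catalog, and USE SCHEMA and
    SELECT are granted on the schema that contains the tables.
    """
    quoted = f"`{principal}`"  # Databricks SQL quotes principals with backticks
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {quoted}",
        f"GRANT USE SCHEMA, SELECT ON SCHEMA {catalog}.{schema} TO {quoted}",
        f"GRANT USE SCHEMA ON SCHEMA system.information_schema TO {quoted}",
    ]


# Hypothetical names, for illustration only.
for statement in unity_catalog_grants("main", "sales", "immuta_svc"):
    print(statement)
```

Review the generated statements against your own catalog layout before running them.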

Snowflake imported databases

Immuta does not support Snowflake tables from imported databases. Instead, create a view of the table and register that view as a data source.

Enter connection information

Best Practice: Connections Use SSL

Although not required, it is recommended that all connections use SSL. Additional connection string arguments may also be provided.

Note: Only Immuta uses the connection you provide and injects all policy controls when users query the system. In other words, users always connect through Immuta with policies enforced and have no direct association with this connection.

  1. Navigate to the My Data Sources page.

  2. Click the New Data Source button in the top right corner.

  3. Select the data platform containing the data you wish to expose by clicking a tile.

  4. Input the connection parameters to the database you're exposing. Click the tabs below for guidance for select data platforms.

Required Google BigQuery roles for creating data sources

Ensure that the user creating the Google BigQuery data source has these roles:

  • roles/bigquery.metadataViewer on the source table (if managed at that level) or dataset

  • roles/bigquery.dataViewer (or higher) on the source table (if managed at that level) or dataset

  • roles/bigquery.jobUser on the project

See the Create a Google BigQuery data source guide for instructions.

  5. Click the Test Connection button.

Further Considerations

  • Immuta pushes down joins to be processed on the native database when possible. To ensure this happens, make sure the connection information matches between data sources, including host, port, ssl, username, and password. You will see performance degradation on joins against the same database if this information doesn't match.

  • Some data platforms require different connection information than pictured in this section. Please refer to the tool-tips in the Immuta UI for this step if you need additional guidance.

  • If you are creating an Impala data source against a Kerberized instance of Impala, the username field locks down to your Immuta username unless you possess the IMPERSONATE_HDFS_USER permission.

  • If a client certificate is required to connect to the source database, you can add it in the Upload Certificates section at the bottom of the form.
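The join pushdown consideration above depends on connection details matching between data sources. A minimal sketch of that comparison follows; the dictionary layout and field names are assumptions made for illustration and do not reflect how Immuta stores connections.

```python
# Fields that must match for joins to be pushed down, per the list above.
PUSHDOWN_FIELDS = ("host", "port", "ssl", "username", "password")


def mismatched_fields(conn_a: dict, conn_b: dict) -> list:
    """Return the pushdown-relevant fields on which two connections differ."""
    return [field for field in PUSHDOWN_FIELDS if conn_a.get(field) != conn_b.get(field)]


# Hypothetical connection details for two data sources on the same database.
orders = {
    "host": "db.example.com",
    "port": 5439,
    "ssl": True,
    "username": "immuta_svc",
    "password": "s3cret",
}
customers = dict(orders, port=5440)

print(mismatched_fields(orders, customers))  # ['port']
```

An empty list means the two connections agree on every field named in the consideration above.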

Select virtual population

  1. Decide how to virtually populate the data source by selecting Create sources for all tables in this database and monitor for changes or Schema/Table.

  2. Complete the workflow for the option you selected (Create sources for all tables in this database and monitor for changes, or Schema/Table), as outlined on the tabs below:

Create sources for all tables in this database and monitor for changes

Selecting this option will create and keep in sync all data sources within this database. New schemas will be automatically detected and the corresponding data sources and schema projects will be created.

  • Select Create sources for all tables in this database and monitor for changes.

Enter basic information

Provide information about your source to make it discoverable to users.

  1. Enter the SQL Schema Name Format, which will be the SQL name that the data source exists under in the Immuta Query Engine. It must include a schema macro, but you may personalize the format using lowercase letters, numbers, and underscores. It may have up to 255 characters.

  2. Enter the Schema Project Name Format to be the name of the schema project in the Immuta UI. This field is disabled if the schema project already exists within Immuta.

    1. When selecting Create sources for all tables in this database and monitor for changes, you may personalize this field as you wish, but it must include a schema macro.

    2. When selecting Schema/Table, this field is prepopulated with the recommended project name, which you can edit freely.

  3. Select the Data Source Name Format, which will be the format of the name of the data source in the Immuta UI.

<Tablename>

The data source name will be the name of the remote table, and the case of the data source name will match the case of the macro.

  4. Enter the SQL Table Name Format, which will be the format of the name of the table in the Immuta Query Engine. It must include a table name macro, but you may personalize the format using lowercase letters, numbers, and underscores. It may have up to 255 characters.
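The rules for the SQL name formats (a required macro, only lowercase letters, numbers, and underscores around it, and at most 255 characters) can be sketched as a small check. The macro spellings `<schema>` and `<tablename>` are assumptions here, as is applying the length limit to the format string as typed; confirm both against the Immuta UI.

```python
import re

MAX_LENGTH = 255  # limit stated in the steps above


def is_valid_sql_format(fmt: str, macro: str) -> bool:
    """Check a name format against the rules described in the steps above."""
    if len(fmt) > MAX_LENGTH or macro not in fmt:
        return False
    # Everything outside the macro must be lowercase letters, digits, or underscores.
    remainder = fmt.replace(macro, "")
    return re.fullmatch(r"[a-z0-9_]*", remainder) is not None


print(is_valid_sql_format("prod_<tablename>", "<tablename>"))  # True
print(is_valid_sql_format("Prod_<tablename>", "<tablename>"))  # False
print(is_valid_sql_format("prod_table", "<tablename>"))        # False
```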

Data source duplicates

To avoid two data sources referencing the same table, users cannot create duplicate data sources. If you attempt to create a duplicate data source in the UI, you will encounter a warning stating "a data source with the same remote table already exists."

By default, Immuta prevents users from creating duplicate data sources. To change this behavior:

  1. Navigate to the App Settings page, and scroll to the Advanced Configuration section.

  2. Copy and paste this YAML into the text box:

    featureFlags:
      allowDuplicateDataSources: true
  3. Click Save.

Enable or disable schema monitoring

When selecting the Schema/Table option, you can enable Schema Monitoring by selecting the checkbox in this section.

Note: This step will only appear if all tables within a server have been selected for creation.

Create a schema detection job in Databricks

In most cases, Immuta’s schema detection job runs automatically from the Immuta web service. For Databricks, that automatic job is disabled because of the ephemeral nature of Databricks clusters. In this case, Immuta requires users to download a schema detection job template (a Python script) and import that into their Databricks workspace.

Generate Your Immuta API Key

Before you can run the script referenced in this tutorial, generate your Immuta API Key from your user profile page. The Immuta API key used in the Databricks notebook job for schema detection must belong to either an Immuta Admin or the user who owns the schema detection groups that are being targeted.

  1. Enable Schema Monitoring or Detect Column Changes on the Data Source creation page.

  2. Click Download Schema Job Detection Template.

  3. Click the Click Here To Download text.

  4. Before you can run the script, create the correct scope and secret by running these commands in the CLI using the Immuta API Key generated on your user profile page:

        databricks secrets create-scope --scope auth
        databricks secrets put --scope auth --key apikey
  5. Import the Python script you downloaded into a Databricks workspace as a notebook. Note: The job template contains two commented-out lines for specifying a particular database or table. With those lines commented out, the schema detection job runs against ALL databases and tables in Databricks. If you need to add proxy configuration to the job template, note that the template uses the Python requests library, which has a simple mechanism for configuring proxies for a request.

  6. Schedule the script as part of a notebook job to run as often as required. Each time the job runs, it will make an API call to Immuta to trigger schema detection queries, and these queries will run on the cluster from which the request was made. Note: Use the api_immuta cluster for this job. The job in Databricks must use an Existing All-Purpose Cluster so that Immuta can connect to it over ODBC. Job clusters do not support ODBC connections.
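For the proxy configuration mentioned in step 5, the requests library accepts a `proxies` mapping keyed by URL scheme. The sketch below only builds that mapping. The helper name and proxy URL are hypothetical, and the commented `requests.post` call is an assumption about how the template issues its request, since the template itself is not reproduced on this page.

```python
from typing import Dict, Optional


def build_proxies(http_proxy: Optional[str] = None,
                  https_proxy: Optional[str] = None) -> Dict[str, str]:
    """Build the scheme-to-proxy mapping that requests accepts as `proxies`."""
    proxies = {}
    if http_proxy:
        proxies["http"] = http_proxy
    if https_proxy:
        proxies["https"] = https_proxy
    return proxies


# Hypothetical proxy address, for illustration only.
proxies = build_proxies(https_proxy="http://proxy.example.com:3128")

# In the job template, pass the mapping to its existing requests call, e.g.:
#   requests.post(url, headers=headers, proxies=proxies)
print(proxies)
```

Leaving both arguments unset yields an empty mapping, in which case requests falls back to its default proxy handling.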

Create the data source

Optionally, configure settings in the Advanced Options section (outlined below), and then click Create to save the data source(s).

Advanced options

None of the following options are required. However, completing these steps will help maximize the utility of your data source.

Column Detection

This setting monitors when remote tables' columns have been changed, updates the corresponding data sources in Immuta, and notifies Data Owners of these changes.

To enable, select the checkbox in this section.

See Schema Projects Overview to learn more about Column Detection.

Copyright © 2014-2024 Immuta Inc. All rights reserved.