Query-Backed Data Source Tutorial
Audience: Data Owners
Content Summary: Query-backed data sources are accessible to subscribed Data Users through the Immuta Query Engine and appear as though they are Postgres tables. Although these files look like they are local, reading a file executes a query to the remote storage system or the Immuta caching layer for that file. In either access pattern, the Immuta policies are put into action either through querying the remote database or filtering automatically.
This guide outlines the process of creating query-backed data sources, such as Amazon Athena (which requires special consideration), Amazon Redshift, Azure SQL Data Warehouse, BigQuery, Databricks, ElasticSearch, Greenplum, HIVE, IBM Netezza, Impala, MariaDB, MemSQL, MS SQL Server, MySQL, Oracle, PostgreSQL, Presto, SAP HANA, Snowflake, Teradata, and Vertica.
If your storage technology is not listed above, navigate to the Object-backed Data Sources Tutorial.
Step 1: Create a New Data Source
To create a new data source,
- Click the plus button in the bottom left of the Immuta console.
- Select the Data Source icon.
- Navigate to the My Data Sources page.
- Click the New Data Source button in the top left corner.
Step 2: Select Your Storage Technology
Select the storage technology containing the data you wish to expose by clicking a tile. Please note that the list of enabled technologies is configurable and may differ from the image below.
Step 3: Enter Connection Information
- Determine the connection you would like to provide Immuta and input the connection parameters to the database you're exposing.
- Although not required, it is recommended that all connections use SSL. Additional connection string arguments may also be provided.
Click the Test Connection button.
If the connection is successful, a check mark will appear and you will be able to proceed. If an error occurs when attempting to connect, the error will be displayed in the UI. In order to proceed to the next step of data source creation, you MUST be able to connect to this data source using the connection information that you just entered.
Note: Only Immuta uses the connection you provide and injects all policy controls when users query the system. In other words, users always connect through Immuta with policies enforced and have no direct association with this connection. You can use a read-only restricted connection if desired, but Immuta is always read-only and has no concept of write to exposed databases.
- Immuta pushes down joins to be processed on the native database when possible. To ensure this happens, make sure the connection information matches between data sources, including host, port, ssl, username, and password. You will see performance degradation on joins against the same database if this information doesn't match.
- Some storage technologies require different connection information than pictured in this section. Please refer to the tool-tips in the Immuta UI for this step if you need additional guidance.
- If you are creating an Impala data source against a kerberized instance of Impala, the username field will be locked down to your Immuta username unless you possess the IMPERSONATE_HDFS_USER permission.
- If a client certificate is required to connect to the source database, you can add it in the Upload Certificates section at the bottom of the form.
Step 4: Determine Virtual Population
Decide how to virtually populate the data source: Table or SQL Statement.
Note: When creating Hive or Impala data sources, the SQL Statement option is disabled.
- If you choose Table, click Edit in the table selection box that appears.
By default, all tables are selected if the total number of tables is below the threshold setting, which is 2000. Select and deselect tables by clicking the checkbox to the left of the table name in the Import Tables menu. You can create multiple data sources at one time by selecting multiple tables.
After making your selection(s), click Apply.
Backing data sources with SQL statements allows SQL to build the data source, while the complexities of the statements and the specifics of the database remain hidden. Users just see a data source in Immuta and just a PostgreSQL table in the Immuta Query Engine.
- Before entering a SQL statement, test the statement to verify that it works.
- Enter your SQL statement in the text box.
- Click Validate Statement.
Step 5: Enter Basic Information
Here you provide information about your source that makes it discoverable to users.
- Complete the Data Source Name field, which will be the name shown in the Immuta UI.
The SQL Table Name, which auto populates but is editable, will be the name of the table presented to users in the Immuta Query Engine; this name is not the name of the table you are getting the data from.
None of the following steps are required. However, completing these steps will help maximize the utility of your data source.
Option 1: Select File Directory Structure
This step is specific to the Immuta S3 access pattern. If you were to provide your data source as-is to the users in Immuta S3, it would be one big file. However, in this step you can chunk your data into more than one file by building a directory structure.
- Click the Edit button in the File Directory Structure section.
- Click and drag different columns into position at the bottom of the screen to customize the directory structure in Immuta S3 for this data source.
When a column is dragged into position, values from that column are incorporated into the directory structure in Immuta S3, and the data source is partitioned into different records based on the values from the selected column; high cardinality columns will result in many files. For example, if there were a column in your table called "sex" and the unique values were "Male" and "Female," selecting this column would generate the files "Male.json" and "Female.json" (or csv, if you selected that output format) in Immuta S3. If more columns were placed after "sex," then "Male" and "Female" would display as directories and the files would become the values of the newly selected column.
This step is similar to the "Map" step in a MapReduce job. You are "mapping" your data into chunks where they can then be processed by downstream analysts accessing these specific directories in Immuta S3. Those analysts are essentially interactively working as the reduce step against the data. Please note, however, that some column types (like float, double, decimal) will not be available to select in the directory structure.
Option 2: Select Output Format
This step is specific to the Immuta S3 access pattern. You tell Immuta how to convert the result set from the database into a format relevant to a filesystem.
Each file in Immuta S3 is actually a result set from a SQL statement that's fired on demand at read time. Since Immuta knows the structure of a SQL result set, it can convert it and present it to the user. File formats currently supported by Immuta include CSV and JSON.
- Click the Edit button in the Output Format section.
- Select CSV (default) or JSON, and then click Apply.
Option 3: Select Columns
This is an optional step that allows you to decide which columns to include in the data source.
- Click the Edit button in the Columns section.
- By default, all columns are selected. De-select a column by clicking in the checkbox in the top left corner of that column.
- When necessary, convert column types by clicking the Type dropdown menu. Note that incorrectly converting a column type will break your data source at query time.
Further Considerations: Not Null Constraints
These constraints help Business Intelligence (BI) tools send better queries.
For SQL query-backed data sources, select Not NULL in the Nullable dropdown menu for any columns that have this constraint in the source database.
For table-backed data sources, this constraint is automatically detected.
Option 4: Create Event Time Column
An Event Time column denotes the time associated with records returned from this data source. For example, if your data source contains news articles, the time that the article was published would be an appropriate Event Time column.
- Click the Edit button in the Event Time section.
- Select the column(s).
Selecting an Event Time column will enable
- More statistics to be calculated for this data source including the most recent record time, which is used for determining the freshness of the data source.
- The creation of time-based restrictions in the Policy Builder.
Option 5: Determine Latency
To adjust this optional setting,
- Click Edit in the Latency section.
- Complete the Set Time field, and then select MINUTES, HOURS, or DAYS from the subsequent dropdown menu.
- Click Apply.
This setting impacts the following behaviors:
- How long Immuta waits to refresh data that is in cache by querying the native data source. For example, if you only load data once a day in the native source, this setting should be greater than 24 hours. If data is constantly loaded in the native source, you need to decide how much data latency is tolerable vs how much load you want on your data source. This is only relevant to Immuta S3; SQL will always interactively query the native database.
- How often Immuta checks for new values in a column that is driving row-level redaction policies. For example, if you are redacting rows based on a country column in the data, and you add a new country, it will not be seen by the Immuta policy until this period expires.
Option 6: Add Data Source Tags
Adding tags to your data source allows users to search for the data source using the tags and Governors to apply Global policies to the data source.
To add tags,
- Click the Edit button in the Data Source Tags section.
- Begin typing in the Search by Tag Name box to select your tag, and then click Add.
Tags can also be added after you create your data source from the Data Source details page on the Overview tab or the Data Dictionary tab.