Data Sources in Immuta

Audience: Data Owners

Content Summary: This page illustrates the concepts behind creating and managing data sources in Immuta. For a tutorial detailing how to manage data sources, navigate to the Manage Data Source Page.

Introduction

A data source is how Data Owners expose their data across their organization to other users. When Data Owners expose their data sources, they are not copying the data. Instead, the data source is an abstraction of data existing in a remote data storage technology.

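Data sources fall into one of two categories based on the backing technology: those that are backed by SQL technologies (query-backed data sources) and those that are not (object-backed data sources). Data sources of either kind can also be created and maintained through the dbt Cloud integration.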

  • query-backed data sources: These data sources are accessible to subscribed Data Users through the Immuta Query Engine and appear as though they are Postgres tables.

  • object-backed data sources: These data sources are backed by data storage technologies that do not support SQL and can range from blob stores, to filesystems, to APIs.

  • dbt Cloud integrated data sources: These data sources are backed by various storage technologies, but are updated and detailed by jobs run through dbt Cloud.

Query-backed Data Sources

Query-backed data sources are accessible to subscribed data consumers via the Immuta Query Engine and appear as though they are Postgres tables. Every schema will have a schema project automatically created at the time of data source creation.

The query-backed storage technologies are listed below. Hyperlinked technologies require special instructions.

  • Amazon Athena
  • Amazon Redshift
  • Azure SQL Data Warehouse
  • BigQuery
  • Databricks
  • ElasticSearch
  • Greenplum
  • HIVE
  • IBM DB2
  • IBM Netezza
  • Impala
  • MariaDB
  • MemSQL
  • MongoDB
  • MS SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Presto (PrestoDB)
  • SAP HANA
  • Snowflake
  • Sybase ASE
  • Teradata
  • Trino (previously PrestoSQL)
  • Vertica

For a tutorial on creating data sources from one of these technologies, see Create Query-backed Data Sources.

Amazon RDS Data Source Overview

Amazon Web Services provides a managed database service, Amazon Relational Database Service (RDS), that supports a variety of database engines: Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL. Immuta treats each of these as it does any other SQL-based technology, allowing the creation of data sources from any of the technologies using the procedures described in the Query-backed Data Source Tutorial without special handling. As Immuta does not require write access to databases, Immuta can connect to read replicas where applicable to reduce analytic query burden on primary database servers.

Amazon RDS Database Engine Support

The following RDS database engines and versions are supported. Any unsupported features are noted.

Database Engine        Unsupported Features   Supported Versions
Aurora (MySQL)         N/A                    5.6 - 5.7
Aurora (PostgreSQL)    N/A                    9.6 - 11.6
MariaDB                N/A                    10.0 - 10.3
Microsoft SQL Server   N/A                    11 - 14
MySQL                  N/A                    5.5 - 8.0
Oracle                 N/A                    11.2 - 19.0
PostgreSQL             N/A                    9.6 - 11.6
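
As an illustration, the supported ranges above can be encoded as a small lookup for checking an engine version before creating a data source. This is a minimal sketch, not part of Immuta: the names `SUPPORTED_VERSIONS`, `parse_version`, and `is_supported` are hypothetical, and it assumes the ranges are inclusive and that patch releases (for example 5.7.30) count as part of their minor version.

```python
# Supported version ranges copied from the table above as (low, high) pairs.
SUPPORTED_VERSIONS = {
    "Aurora (MySQL)": ("5.6", "5.7"),
    "Aurora (PostgreSQL)": ("9.6", "11.6"),
    "MariaDB": ("10.0", "10.3"),
    "Microsoft SQL Server": ("11", "14"),
    "MySQL": ("5.5", "8.0"),
    "Oracle": ("11.2", "19.0"),
    "PostgreSQL": ("9.6", "11.6"),
}


def parse_version(text):
    """Turn a dotted version string such as '5.7.30' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_supported(engine, version):
    """Return True if the version falls inside the engine's listed range.

    Assumption: bounds are inclusive, and the version is compared only to
    the precision of each bound, so '5.7.30' is treated as '5.7'.
    """
    if engine not in SUPPORTED_VERSIONS:
        return False
    low, high = (parse_version(bound) for bound in SUPPORTED_VERSIONS[engine])
    parsed = parse_version(version)
    return parsed[: len(low)] >= low and parsed[: len(high)] <= high


print(is_supported("PostgreSQL", "10.4"))
print(is_supported("PostgreSQL", "12.1"))
```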

Object-backed Data Sources

Object-backed data sources are data storage technologies that do not support SQL and can range from NoSQL technologies, to blob stores, to filesystems, to APIs.

The object-backed storage technologies are listed below.

  • Amazon S3
  • Apache HDFS
  • Azure Blob Storage
  • Custom
  • FTP
  • Persisted

For a tutorial on these technologies, see Create Object-backed Data Sources.

dbt Cloud Integration

The dbt Cloud integration connects Immuta to your dbt Cloud jobs so that updates made through dbt are reflected in Immuta. Once dbt and Immuta are connected and a job runs to update your database, that update will automatically be applied to your Immuta instance. While this is similar to Schema Monitoring in that data sources will be created, updated, and deleted when prompted by the dbt jobs, it differs in that the dbt Cloud integration can also sync tags, column descriptions, and data source descriptions from your data sources into Immuta.

For a tutorial, see Connect Data Sources Using dbt Cloud Integration.

Limitations

  • You cannot update a dbt Cloud API key or delete the dbt Cloud integration from the UI.

    Solution
    • To update the dbt Cloud API Key:

      PUT /dbt/{accountId}/{projectId}/{environmentId} (with a payload similar to {"apiKey": "<newKey>"})

      Example Immuta CLI Command

      immuta api /dbt/{accountId}/{projectId}/{environmentId} -X PUT -P accountId=1 -P projectId=10 -P environmentId=100 -d apiKey='<newKey>'

    • To delete the dbt integration (this will also delete all data sources created with the integration):

      DELETE /dbt/{accountId}/{projectId}/{environmentId}

  • There are no distinguishing features on dbt data sources within Immuta. The dbt integration functions as a catalog, but Immuta does not link the data source to the catalog, so data source users can remove the synced tags in the UI. Note that those tags will be re-added the next time the dbt job runs.
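
The API key update request from the first limitation can be sketched in Python with only the standard library. The base URL `https://immuta.example.com`, the `Authorization` header format, and the helper name `build_update_key_request` are assumptions made for illustration and do not come from the Immuta documentation; check your instance's API reference for the actual authentication scheme. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical base URL for an Immuta instance (assumption).
BASE_URL = "https://immuta.example.com"


def build_update_key_request(account_id, project_id, environment_id, new_key, token):
    """Build (but do not send) the PUT request that updates the dbt Cloud API key."""
    url = f"{BASE_URL}/dbt/{account_id}/{project_id}/{environment_id}"
    body = json.dumps({"apiKey": new_key}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumed auth header; the real scheme depends on your deployment.
            "Authorization": f"Bearer {token}",
        },
    )


# Same identifiers as the CLI example: account 1, project 10, environment 100.
request = build_update_key_request(1, 10, 100, "<newKey>", "<token>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the sketch can be run without a live instance.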

Custom Blob Handler

Immuta ships with a variety of handlers for queryable and ingested data, as well as tools for fine-tuning policies on your data. However, if your organization requires a more customized solution for fetching data or enforcing policies, Immuta data sources are able to hook into your own custom handler solutions.

See Custom Blob Handler if this would be a useful tool for your organization.

Data Source Health

When an Immuta data source is created, a background job is submitted to compute the row count and high cardinality column for the data source. This job uses the connection information provided at data source creation time. A data source initially has a health status of “Healthy” because the initial health check is a simple SQL query against the source to confirm that it can be queried at all. After the background job for the row count and high cardinality column computation completes, the health status is updated. If one or both of those computations fail, the health status changes to “Unhealthy.”
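
The status transitions described above can be summarized in a short sketch. This is an illustrative model only, not Immuta's implementation: the `StatsJobs` class, the `health_status` function, and the use of `None` for a computation that has not finished are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatsJobs:
    """Outcome of the two background computations; None means not finished yet."""
    row_count_ok: Optional[bool] = None
    high_cardinality_ok: Optional[bool] = None


def health_status(jobs: StatsJobs) -> str:
    """Return the health status implied by the background computations.

    A data source starts out healthy (the initial SQL check passed) and
    becomes unhealthy only once at least one computation has failed.
    """
    results = (jobs.row_count_ok, jobs.high_cardinality_ok)
    if any(result is False for result in results):
        return "Unhealthy"
    return "Healthy"


print(health_status(StatsJobs()))                    # computations still pending
print(health_status(StatsJobs(True, True)))          # both succeeded
print(health_status(StatsJobs(True, False)))         # one failed
```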

These background jobs can be disabled during data source creation by adding a specific tag that prevents automatic table statistics. A System Administrator can set this prevent-statistics tag on the App Settings page. The data source will still show as healthy; however, there are some considerations. Disabling statistics collection prevents the Immuta Query Engine cost-based optimizer from correctly estimating query plan costs, which could have a significant negative performance impact on any queries executed through the Query Engine against that data source. Additionally, with automatic table statistics disabled, the following policies will be unavailable until the Data Source Owner manually generates the fingerprint:

  • Differential privacy
  • Masking with format-preserving masking
  • Masking with K-Anonymization
  • Masking using randomized response

Unhealthy Databricks Data Sources

Databricks data sources may be marked unhealthy when their row count queries fail because they run against a cluster that has the Databricks query watchdog enabled.

Data Source User Roles

Users and groups can play various roles in relation to each data source. These roles are managed through the Members tab of the data source. They include:

  • Owners: Those who create and manage new data sources and their users, documentation, Data Dictionaries, and queries. They are also capable of ingesting data into their data sources as well as adding ingest users (if their data source is object-backed).
  • Subscribers: Those who have access to the data source data. With the appropriate data accesses and attributes, these users/groups can view files, run SQL queries, and generate analytics against the data source data. All users/groups granted access to a data source (except for those with the ingest role) have subscriber status.
  • Experts: Those who are knowledgeable about the data source data and can elaborate on it. They are responsible for managing the data source's documentation and the Data Dictionary.
  • Ingest: Those who are responsible for ingesting data for the data source. This role only applies to object-backed data sources (since query-backed data sources are ingested automatically). Ingest users cannot access any data once it's inside Immuta, but they are able to verify if their data was successfully ingested or not.

See Manage Data Sources for a tutorial on modifying user roles.

Data Attributes

Data attributes are information about the data within the data source. These attributes are then matched against policy logic to determine if a row or object should be visible to a specific user. This matching is usually done between the data attribute and the user attribute.

For example, in the policy

Only show rows where Country='US' for everyone except when user is a member of group Finance

the data attribute (US in the Country column) is matched against the user attribute (Finance group) to determine whether or not rows will be visible to the user accessing the data. In this case, only members of the Finance group will see all rows in the data source; all other users will see only the rows where Country is US.
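
The effect of this example policy can be sketched as a simple in-memory filter. Immuta enforces policies when data is queried, so this is only an illustration of the matching logic; the `visible_rows` function and the sample rows are hypothetical.

```python
def visible_rows(rows, user_groups):
    """Apply the example policy: non-Finance users only see rows where Country is US."""
    if "Finance" in user_groups:
        # Members of the Finance group are exempt from the row restriction.
        return list(rows)
    return [row for row in rows if row["Country"] == "US"]


# Hypothetical sample data for illustration.
rows = [
    {"id": 1, "Country": "US"},
    {"id": 2, "Country": "FR"},
    {"id": 3, "Country": "US"},
]

print(visible_rows(rows, {"Finance"}))
print(visible_rows(rows, {"Marketing"}))
```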

User Attributes

User attributes are values connected to specific Immuta user accounts and are used in policies to restrict access to data. These attributes fall into three categories: permissions, groups, and attributes.

These user attributes give users access to various Immuta features and drive data source policies.

Permissions

Permissions control what actions a user can take in Immuta, both through the API and the UI. Permissions can be added to and removed from user accounts by a System Administrator (an Immuta user with the USER_ADMIN permission); however, the permissions themselves are managed by Immuta, and the actions associated with the permissions cannot be altered.

Groups

Groups allow System Administrators to group sets of users together. Users can belong to any number of groups and can be added to or removed from groups at any time. Like attributes, groups can be used to restrict what data a set of users has access to.

Attributes

Attributes are custom tags that are applied to users to restrict what data users can see. Attributes can be added manually or mapped in from LDAP or Active Directory.

Data Dictionary

The Data Dictionary provides information about the columns within the data source, including column names and value types. Users subscribed to the data source can post and reply to discussion threads by commenting on the Data Dictionary.

Dictionary columns are automatically generated when the data source is created if the remote storage technology supports SQL. Otherwise, Data Owners or Experts can create the entries for the Data Dictionary manually.

What's Next

Now that you understand data sources and their types, you can continue to Create a Query-backed Data Source or Create an Object-backed Data Source, depending on your technology.