Connection and Access to Data Sources

Audience: Data Users

Content Summary: This page discusses the concepts associated with subscribing to data sources in Immuta. For instructions detailing how to access data sources, navigate to the Subscribing to Data Sources Tutorial.

Immuta Data Sources

A data source is how Data Owners expose their data to other users. Throughout the process of sharing data, the data is not copied; a data source is simply a virtual representation of data that exists in a remote data storage technology.

When Data Users access a data source, policies (written by Data Owners and Data Governors) are dynamically enforced on the data, appropriately redacting and masking information depending on the user accessing the data. The data can then be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and collaboration.

Data Source Overview

Clicking a data source opens its Overview tab, which contains the following sections:

  • Documentation: Data Owners can provide additional documentation regarding their data source. If documentation has been created, it will be displayed in the center pane. Otherwise, this section will only display the data source name.

  • Connections: This section provides your SQL connection string and information for connecting the Immuta Query Engine to external analytics tools, including PySpark 1.6, PySpark 2.0, Python+Psycopg2, Python+pyodbc, R, and RStudio.

  • Tags: This section lists tags associated with the data source.

From this page, Data Users can also navigate to the other tabs of the data source: Data Dictionary, Queries, Metrics, Discussions, Contacts, and Lineage.

The Data Dictionary

The Data Dictionary is a table that provides information about the columns within the data source. Dictionary columns are generated automatically when the data source is created if the remote storage technology supports SQL. Otherwise, Data Dictionaries can be created manually by the Data Owner or Expert.

Data Source Queries

The Queries tab allows users to keep track of their personal queries, share their queries with others, and sample public queries. Additionally, users can submit a Debug Query request, which will be sent to the Data Owner(s).

Data Source Metrics

Immuta keeps track of data source usage and general statistics. Object-backed data sources also report the total number of records available, while query-backed data sources report the total number of rows. Both the usage metrics and the statistics are updated regularly.

Data Source Discussions

Data Users can comment on or ask questions about the Data Dictionary columns and definitions, public queries, and the data source in general. Resolved comments and questions remain available for review, preserving a complete history of the knowledge sharing that has occurred on the data source.

Data Source Contact Information

Contact information for Data Owners is provided for each data source so that Data Users can ask questions about accessibility and the authorizations required for viewing the data.

Lineage

The Lineage tab lists all projects associated with the data source and includes, for each project, why the data source was added, who added it, and the date it was added.

Data Access Patterns

Immuta users are able to access data through one of the access patterns described below. Accessing data through Immuta ensures that Data Users are only consuming policy-controlled data with thorough auditing.

SQL Access: Once subscribed to a data source, users don't need to know special APIs to access the data. Instead, users are given the option of using a single SQL connection provided by Immuta to access all data sources as if they were database tables in the same database (even though they actually exist in data silos across the user's organization).
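As a rough illustration of the SQL access pattern, the sketch below assembles a connection string and queries a data source as if it were an ordinary table. It assumes the connection is PostgreSQL-compatible, which the Python+Psycopg2 option in the Connections section suggests. The host, port, database name, credentials, and table name are hypothetical placeholders; the real values come from the Connections section of the data source. The helper names `build_dsn` and `fetch_rows` are illustrative and not part of Immuta.

```python
from urllib.parse import quote


def build_dsn(host, port, dbname, user, password, sslmode="require"):
    """Assemble a PostgreSQL-style connection URI.

    All argument values are placeholders; copy the real ones from the
    Connections section of the data source.
    """
    # Percent-encode the parts that may contain reserved characters.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(dbname, safe='')}?sslmode={sslmode}"
    )


def fetch_rows(dsn, table, limit=5):
    """Return up to `limit` rows from a data source exposed as a table.

    Requires the third-party psycopg2 package; it is imported here so the
    DSN helper above works without it.
    """
    import psycopg2
    from psycopg2 import sql

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Quote the table name safely; pass the limit as a parameter.
            query = sql.SQL("SELECT * FROM {} LIMIT %s").format(sql.Identifier(table))
            cur.execute(query, (limit,))
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    dsn = build_dsn("immuta.example.com", 5432, "immuta", "alice", "secret")
    print(dsn)
```

Because every subscribed data source appears as a table in the same database, one connection of this kind would be enough to query or join several of them.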

Filesystem Access: Alternatively, users can mount the Immuta Virtual Filesystem on a Linux machine.
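Once the filesystem is mounted, data can presumably be read with ordinary file operations. The sketch below assumes that data sources appear as regular files and directories under the mount point; the path `/mnt/immuta` and the helper name `list_data_files` are hypothetical and used only for illustration.

```python
from pathlib import Path


def list_data_files(mount_point, pattern="*"):
    """Return a sorted list of regular files found under mount_point.

    mount_point is wherever the virtual filesystem was mounted; the layout
    beneath it is an assumption of this sketch.
    """
    root = Path(mount_point)
    if not root.is_dir():
        raise FileNotFoundError(f"mount point not found: {root}")
    # Walk the tree recursively and keep only regular files.
    return sorted(p for p in root.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    mount = Path("/mnt/immuta")  # hypothetical mount point
    if mount.is_dir():
        for path in list_data_files(mount):
            print(path)
```

Any tool that reads local files could consume the mounted data in the same way.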

HDFS/Spark: To process data at massive scale, users can access data through the HDFS and Spark access patterns.