Subscribing to Data Sources
Audience: Data Users
Content Summary: This page discusses the concepts associated with subscribing to data sources in Immuta. For a tutorial detailing how to access data sources, navigate to the Data Source User Guide.
Immuta Data Sources
A data source is how Data Owners expose their data to other users. Throughout the process of sharing data, the data is not copied; a data source is simply a virtual representation of data that exists in a remote data storage technology.
When Data Users access a data source, policies (written by Data Owners and Data Governors) are dynamically enforced on the data, appropriately redacting and masking information depending on the user accessing the data. The data can then be accessed in a consistent manner across analytics and visualization tools, allowing reproducibility and collaboration.
Data Source Overview
After selecting a data source, users are automatically brought to the data source Overview tab, organized into these sections:
Documentation: Data Owners can provide additional documentation regarding their data source. If documentation has been created, it is displayed in the center pane. Otherwise, this section only displays the data source name.
Connections: This section provides your SQL connection string and information for connecting the Immuta Query Engine to external analytics tools, including PySpark 1.6, PySpark 2.0, Python+Psycopg2, Python+pyodbc, R, and RStudio.
Tags: This section lists tags associated with the data source.
Data Source Members
The Members tab lists users with access to the data source, including each user's name, email address, number of days until their access expires, and role (owner, subscribed, or expert).
Data Source Policies
The Policies tab lists the Subscriber policy as well as any Local and Global Data policies in effect for the data source. Policies can only be managed on this tab by the Data Owner; however, Data Users can view the policies in effect.
The Data Dictionary
The Data Dictionary is a table that provides information about the columns within the data source. Dictionary columns are generated automatically when the data source is created if the remote storage technology supports SQL. Otherwise, Data Dictionaries can be created manually by the Data Owner or Expert.
The Data Dictionary is also where tags are added, edited, and removed for use in Global Policies. See the Tags section for more information and tutorials on their use.
Data Source Queries
The Queries tab allows users to keep track of their personal queries, share their queries with others, and sample public queries. Additionally, users can submit a Debug Query request to the Data Owner(s).
Data Source Metrics
Immuta keeps track of data source usage and general statistics, and these metrics are updated regularly. Object-backed Data Sources provide the total number of records available, while query-backed Data Sources provide the total number of rows.
Data Source Discussions
Data Users can comment on or ask questions about Data Dictionary columns and definitions, public queries, and the data source in general. Resolved comments and questions remain available for review, preserving a complete history of the knowledge sharing that has occurred on the data source.
Data Source Contact Information
Contact information for Data Owners is provided for each data source, allowing users to ask questions about accessibility and attributes required for viewing the data.
Data Source Lineage
The Lineage tab lists all projects associated with the data source, including why the data source was added to each project, who added it, and the date it was added.
Data Access Patterns
Immuta users are able to access data through one of the access patterns described below. Accessing data through Immuta ensures that Data Users are only consuming policy-controlled data with thorough auditing.
Databricks: Through native integration, Databricks data sources exposed in Immuta are available as tables in a Databricks cluster, which users can query from their Notebooks. As with other integrations, Immuta applies policies to the plan Spark builds for a user's query, and all data access is native.
For more details, see the Databricks Installation Guide.
HDFS: Unlike the other access patterns, the Immuta HDFS access pattern is not virtual. The value of HDFS processing lies in bringing the code to the data, so Immuta policies must be enforced in place on the data in the HDFS data nodes. Because of this, the Immuta HDFS layer can only act on data stored in HDFS. However, you can still build complex subscription and granular access policies on objects stored in HDFS and retain all the rich audit capabilities provided by the other Immuta virtual layers.
Immuta Query Engine: Users are provided a basic Immuta PostgreSQL connection. The tables within this connection represent all the connected data across your organization. Those tables, however, are virtual: they remain empty until a query is run. At query time, the SQL is proxied through the virtual Immuta table to the native database, with the policy enforced automatically. The Immuta SQL connection can be used within any Business Intelligence (BI) tool or integrated directly into code for interactive analysis.
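Because the Query Engine presents a standard PostgreSQL endpoint, any PostgreSQL client can connect to it. The sketch below builds a libpq-style connection string; the host, database name, and credentials are hypothetical placeholders (your real values appear in the Connections section of the data source Overview tab):

```python
def immuta_dsn(host, user, password, port=5432, dbname="immuta"):
    # Build a libpq-style DSN for an Immuta Query Engine connection.
    # All values are placeholders: copy the real host, port, database
    # name, and credentials from the data source Overview tab.
    return (
        f"host={host} port={port} dbname={dbname} "
        f"user={user} password={password}"
    )

# Hypothetical values for illustration only:
dsn = immuta_dsn("immuta.example.com", "jane.doe", "s3cr3t")

# A standard PostgreSQL driver accepts this string, e.g.:
#   conn = psycopg2.connect(dsn)
# Any SQL run on that connection is proxied to the native database
# with policies enforced for the connecting user.
```

The same connection string works in any BI tool that speaks PostgreSQL, which is what makes the access pattern tool-agnostic.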
Native Dynamic Snowflake: Through the Native Dynamic Snowflake access pattern, Immuta applies policies directly in Snowflake, allowing data analysts to use the Snowflake Web UI and their existing BI tools while per-user policies are dynamically applied at query time.
Native Snowflake Workspaces: Native Snowflake workspaces allow users to access protected data directly in Snowflake without going through the Immuta Query Engine. Within these workspaces, users can interact directly with Snowflake secure views, create derived data sources, and collaborate with other project members at a common access level. That data can then be shared outside the project because derived data sources inherit all appropriate policies. Additionally, derived data sources use the credentials of the Immuta system Snowflake account, allowing them to persist after a workspace is disconnected.
For more details about Snowflake workspaces, see the Projects Overview.
S3: Immuta supports an S3-style REST API, allowing users to communicate with Immuta the same way they would with S3. Consequently, Immuta easily integrates with tools users may already be using to work with S3.
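Because the API is S3-style, existing S3 tooling can be pointed at Immuta's endpoint and used unchanged (for example, an S3 client configured with a custom endpoint URL). The sketch below only constructs such an object URL with the Python standard library; the host and path layout are hypothetical placeholders, not Immuta's documented layout:

```python
from urllib.parse import quote, urlunsplit

def immuta_s3_url(host, *path_parts):
    # Build an object URL against an S3-compatible endpoint. The host
    # and path layout are placeholders for illustration; consult your
    # Immuta deployment for the real endpoint.
    path = "/" + "/".join(quote(part) for part in path_parts)
    return urlunsplit(("https", host, path, "", ""))

# Hypothetical data source name and object key:
url = immuta_s3_url("immuta.example.com", "sales_data", "2020/q1.csv")
```

A GET against such a URL would return policy-controlled data exactly as an S3 GET returns an object, which is why S3-aware tools need no modification.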
SparkSQL: Users can access subscribed data sources within their Spark jobs by using Spark SQL with the ImmutaContext class. All tables are virtual and are not populated until a query is materialized. When a query is materialized, data from metastore-backed data sources, such as Hive and Impala, is read with standard Spark libraries from the underlying files stored in HDFS. All other data source types access data through the Query Engine, which proxies the query to the native database technology. Policies for each data source are enforced automatically.
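The ImmutaContext class is named above, but its exact constructor and methods vary by Immuta and Spark version, so the commented call pattern below is an assumption; check your version's documentation. The runnable part simply composes the Spark SQL you would submit, with a hypothetical table and column names:

```python
def table_query(table, columns, limit=100):
    # Compose a Spark SQL string to run against a subscribed data
    # source; the table and column names used below are hypothetical.
    return f"SELECT {', '.join(columns)} FROM {table} LIMIT {limit}"

query = table_query("claims", ["claim_id", "claim_state"])

# Inside a Spark job, the call pattern is roughly (an assumption --
# verify against your Immuta version's documentation):
#   ic = ImmutaContext(sc)   # wraps the existing SparkContext
#   df = ic.sql(query)       # virtual table; policies are enforced
#                            # when the query is materialized
```

Nothing is read at `sql()` time; data flows (and policies apply) only when the resulting DataFrame is materialized, matching the virtual-table behavior described above.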