Data Sources Page
Audience: All Immuta users
Content Summary: The Data Sources page allows Immuta users to view, subscribe to, and create data sources in Immuta. This section highlights the major tabs and features found on the details page for an individual data source.
For details about concepts relating to data sources or instructions for accessing and creating data sources, see the Chapter 4 - Connecting Data section of documentation.
Data Sources Page
Once users navigate to this page, a list of data sources appears in the center window. Users can navigate between the All Data Sources tab and the My Data Sources tab to filter this list.
Additionally, users may use the Search bar in the upper left corner of the Immuta console to filter search results by data source name, tag, project, connection strings, or columns.
To navigate to a specific data source, users simply click on it from this list, and they will be taken to the Data Source Overview page.
Data Source Details Page
In addition to the data source's health, this page provides detailed information about the data source and is organized by tabs across the top of the page: Overview, Users, Policies, Data Dictionary, Queries, Metrics, Discussions, Contacts, and Lineage. The visibility and appearance of the tabs will vary slightly depending on the type of user accessing the data source.
Data Source Health
This section includes detailed information regarding Data Source Health and Data Source Health Checks. The Health status of a data source is visible in the top right corner of the data source details page.
If you click the Health status text, a dropdown menu displays the status of specific data source checks.
- Health Check: When an Immuta data source is created, a background job is submitted to compute the row count and high cardinality column for the data source. This job uses the connection information provided at data source creation time. A data source initially has a health status of “healthy” because the initial health check performed is a simple SQL query against the source to make sure the source can be queried at all. After the background job for the row count/high cardinality column computation is complete, the health status is updated. If one or both of those jobs failed, the health status will change to “Unhealthy.”
- Fingerprint: Captures summary statistics of a data source when a data source is created, when a policy is applied or changed, or when a user manually updates the data source fingerprint.
- Row Count: Calculates the number of rows in the data source.
- High Cardinality: Calculates the high cardinality columns, which contain unique values such as identification numbers, email addresses, or usernames. A high cardinality column is required to generate consistent random subsets of data for use in certain minimization techniques.
- Global Policies Applied: Verifies that relevant Global Policies are successfully applied.
- Schema Detection: Detects when a new table has been added in a remote database and automatically creates a new data source. Correspondingly, if a remote table is removed, that data source will be disabled in the console. Schema detection is set to run every night.
- Column Detection: Detects when a column has been added or removed in a remote database and automatically updates the data source in Immuta. This detection is set to run every night, but users can manually trigger the job here.
This tab includes detailed information about the data source in the left side-panel, including its Description, Technology, Table Name, File Type, the Remote Database and Remote Table, the Parent Server, and the Data Source ID.
In the middle window, the information displayed is divided into three categories:
- Documentation: Data Owners can provide additional information about their data source here. If documentation has not been created, only the data source name will appear.
- Connections: This section provides users' SQL connection string and information for connecting the Immuta Query Engine to external analytics tools, including PySpark 1.6, PySpark 2.0, Python+Psycopg2, Python+pyodbc, R, and RStudio.
- Tags: This section lists tags associated with the data source.
This tab contains information about the users associated with the data source, their username, when their access expires, what their role is, and an Actions button that details the users' subscription history, including the reason users need access to the data and how they plan to use it.
This tab is visible to everyone, but Data Owners and Governors can manage users from this page.
This tab lists the policies associated with the data source and includes three components:
- Subscribers: Lists who may access the data source. If a Subscription Policy has already been set by a Global Policy, a notification and a Disable button appear at the bottom of this section. Data Owners can click the Disable button to make changes to the Subscription Policy.
- Data Policies: Lists policies that enforce privacy controls on the data source. Data Owners can use this section to manage policies.
- Activity Panel: Records all changes made to policies by Data Owners or Governors, including when the data source was created, the name and type of the policy, when the policy was applied or changed, and if the policy is in conflict on the data source. Global policy changes are identified by the Governance icon; all other updates are labeled by the Data Sources icon.
This tab is visible to everyone, but Data Owners and Governors can manage policies from this page.
Data Dictionary Tab
The Data Dictionary is a table that details information about each column in a data source. The information within the Data Dictionary is generated automatically when the data source is created if the remote storage technology supports SQL. Otherwise, Data Owners or Experts can manually create Data Dictionaries. The Data Dictionary tab includes three sections:
- Name: The name of the column in the table.
- Type: The type of value, which may be text, integer, decimal, or timestamp.
- Actions: Users may use the buttons in this column to edit, comment, or tag items in the Data Dictionary.
This tab allows users to keep track of their personal queries, share their queries with others, sample public queries, and debug queries.
If users have issues with a query they're running, they can send a request for a query debug, which sends the query plan to the Data Owner.
This tab details the data source usage and general statistics. Object-backed data sources will also provide the total number of records available, and query-backed data sources will provide the total number of rows.
Users are able to comment on or ask questions about the Data Dictionary columns and definitions, public queries, and the data source in general. Resolved comments and questions are available for review to keep a complete history of all the knowledge sharing that has occurred on a data source.
Contact information for Data Owners is provided for each data source, which allows users to ask questions about accessibility and attributes required for viewing the data.
This tab lists all projects, derived data sources, or parent data sources associated with the data source and includes the reason the data source was added to a project, who added the data source to the project or created it, and when the data source was added to the project or created.
When users submit a Debug Query or Unmask request in the UI, a Tasks tab appears beside the Lineage tab for the requesting user and the user receiving the request. This tab contains information about the request and allows users to view and manage the tasks listed.