Immuta Hadoop Filesystem Access Pattern

Audience: Data Owners and Data Users

Content Summary: Immuta integrates with your Hadoop cluster to provide policy-compliant access to data sources directly through HDFS. This page explains how to access data through the HDFS access pattern, which enforces only file-level controls on data. For more information on installing and configuring the Immuta Hadoop plugin, see the Administration guide. If you need to enforce row-level and column-level controls on data, use the Spark SQL access pattern instead.

The Immuta Hadoop plugin can also be integrated with an existing Kerberos setup so that users access HDFS data with their existing Kerberos principals, while Immuta manages data access and policy enforcement.

Immuta HDFS Principal

When Immuta is installed on the cluster, users can only access data through HDFS using the HDFS principal that has been set for them in Immuta. This principal can only be set by an Immuta Administrator or imported from an external Identity Manager, but Immuta users can view their principal via the profile page.

Associating a Project with your Immuta HDFS Principal

If you wish to access data in HDFS while acting under a Project Purpose, you must associate that project with your Immuta HDFS Principal via the profile page. This is required if the data you wish to access has Purpose-based restrictions.

  1. Navigate to the Details section of your profile.
  2. Under HDFS Principal, click SELECT PROJECT.

  3. Choose your desired project from the drop-down menu in the modal. Then click Save.

  4. Your HDFS Principal is now tied to your selected project.

  5. To remove a project association from your HDFS Principal, click SELECT PROJECT again and select None from the drop-down menu.

Authentication

To access data through Immuta's HDFS access pattern, you must be authenticated as the user or principal that is set as your Immuta HDFS principal.

  • For clusters secured with Kerberos, you must successfully kinit with your Immuta HDFS principal before attempting to access data.
  • For insecure clusters, you must be logged in to the cluster as the system user that is assigned to your HDFS principal.
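
For Kerberos-secured clusters, the kinit step can be scripted. The following is a minimal sketch, assuming MIT Kerberos `kinit` is on the PATH; the principal `jdoe@EXAMPLE.COM`, the keytab path, and both function names are hypothetical and not part of Immuta.

```python
import subprocess
from typing import List, Optional


def build_kinit_command(principal: str, keytab: Optional[str] = None) -> List[str]:
    """Return the kinit argument list for the given principal.

    With a keytab, kinit runs non-interactively (-k -t <keytab>);
    without one, kinit prompts for the password.
    """
    if keytab:
        return ["kinit", "-k", "-t", keytab, principal]
    return ["kinit", principal]


def authenticate(principal: str, keytab: Optional[str] = None) -> None:
    """Run kinit; raises subprocess.CalledProcessError if kinit fails."""
    subprocess.run(build_kinit_command(principal, keytab), check=True)


# Hypothetical usage (the principal must match your Immuta HDFS principal):
# authenticate("jdoe@EXAMPLE.COM", keytab="/path/to/jdoe.keytab")
```

The principal passed to kinit should be the one shown as your HDFS principal on your Immuta profile page.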

Accessing Data

Immuta's HDFS access pattern allows you to access data in two ways:

  • The immuta:/// namespace lets you address a file by its location within the Immuta data source that contains it. For example, to access a file called december_report.csv that is part of an Immuta data source called reports, use the following path:

    immuta:///immuta/reports/december_report.csv

    Note that the path to the file is relative to the Immuta data source that it falls under, not the real path in HDFS. Also, immuta:/// exposes only the paths that a user is allowed to see; files that the user is not authorized for are not visible.

  • The HDFS access pattern also allows users to access data using native HDFS paths. Authorized data source subscribers can access the file december_report.csv through its native path in HDFS:

    hdfs:///actual/path/in/hdfs/december_report.csv

    Note that for a user to access data using hdfs:/// paths, there must be an hdfs:///user/<user>/ directory, where <user> corresponds to the user's Immuta HDFS principal. Also, hdfs:/// paths allow users to see the locations of all files, but users can only read the files that they have access to in Immuta.
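
The two path forms described above can be built programmatically. This is a minimal sketch that follows the example paths on this page; the helper names `immuta_path` and `hdfs_home_dir` are hypothetical. How a full Kerberos principal maps to the `<user>` directory name is not covered here, so the helper takes the user name as given.

```python
import posixpath


def immuta_path(data_source: str, relative_path: str) -> str:
    """Build an immuta:/// path for a file inside an Immuta data source.

    relative_path is relative to the data source, not the real HDFS path.
    """
    return "immuta:///" + posixpath.join(
        "immuta", data_source, relative_path.lstrip("/")
    )


def hdfs_home_dir(user: str) -> str:
    """Return the hdfs:///user/<user>/ directory that must exist
    before the user can read data through hdfs:/// paths."""
    return f"hdfs:///user/{user}/"


print(immuta_path("reports", "december_report.csv"))
# prints: immuta:///immuta/reports/december_report.csv
```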

Both methods of accessing data are audited and comply with data source policies. If users are not subscribed to the data source that a file in HDFS falls under, or are restricted by its policies, they cannot access the file using either namespace.
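
As an illustration, a script that reads a file through either namespace should expect the read to fail when the user is not subscribed or is policy-restricted. The sketch below assumes the standard `hdfs dfs -cat` command is available on the cluster and accepts both path forms; `read_file` is a hypothetical helper. It treats any non-zero exit status as "not readable" and does not distinguish a denied read from a missing file or other error.

```python
import subprocess
from typing import Callable, Optional


def read_file(
    path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Optional[bytes]:
    """Return the file contents, or None if `hdfs dfs -cat` fails.

    `run` is injectable so the helper can be exercised without a cluster.
    """
    result = run(["hdfs", "dfs", "-cat", path], capture_output=True)
    if result.returncode != 0:
        # Could be a denied read, a missing file, or another error.
        return None
    return result.stdout


# Hypothetical usage with either namespace:
# read_file("immuta:///immuta/reports/december_report.csv")
# read_file("hdfs:///actual/path/in/hdfs/december_report.csv")
```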

HDFS User Impersonation

Immuta users with the IMPERSONATE_HDFS_USER permission can create HDFS, Hive, and Impala data sources as any HDFS user (provided that they have the proper credentials). For more information, see the tutorial for HDFS data sources. For Impala and Hive data sources, see the Query-backed Data Source tutorial.