HDFS Data Source Creation Tutorial

Audience: Data Owners

Content Summary: This guide details configuring an HDFS data source in Immuta. To explore data source creation guides for other storage technologies, see the Object-backed and Query-backed guides.

Step 1: Enter Connection Information

  1. Enter the required information in the following fields to connect to the NameNode of your HDFS cluster:

    • NameNode Host: the hostname of the HDFS NameNode.
    • NameNode Port: the NameNode port, typically 8020.
    • Base Directory: the root directory in HDFS from which data should be recursively ingested.
    • Kerberos: enable if connecting to a Kerberized HDFS cluster.
    • Username: the username used to connect to the HDFS cluster. This field defaults to the user's Immuta username. Only users with the IMPERSONATE_HDFS_USER permission can change it to another value; contact your Immuta Admin for more information about this permission.
    • Kerberos Realm: if Kerberos is enabled, this field defaults to the Kerberos realm set in the configuration. Only users with the IMPERSONATE_HDFS_USER permission can change it to another value.

    HDFS Connection Information

  2. Click Test Connection.
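
To sanity-check these connection details outside of Immuta, a minimal sketch using the Hadoop FileSystem API is below; it assumes a non-Kerberized cluster, and the NameNode URI and base directory are placeholders for your own values.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckHdfsConnection {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode host and port; substitute your own values.
        URI nameNode = URI.create("hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(nameNode, new Configuration())) {
            // The base directory Immuta recursively ingests from must exist
            // and be readable by the connecting user.
            Path baseDir = new Path("/data/immuta");
            System.out.println("Base directory exists: " + fs.exists(baseDir));
        }
    }
}
```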

Step 2: Select Data Format

Data Format

While object-backed data sources can hold blobs of any format (images, videos, etc.), many blobs will have common structured formats. If your blobs are comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format allows you to create masking policies for the data source.
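
As a hypothetical illustration of why the format matters: once Immuta knows a blob is comma-separated, each value is individually addressable, which is what makes per-value masking possible.

```java
// A hypothetical comma-separated blob line; declaring the format as CSV means
// individual values (e.g., the SSN in the third column) can be masked.
String blobLine = "jdoe,Jane Doe,123-45-6789";
String[] values = blobLine.split(",");
System.out.println("Maskable values: " + values.length);
```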

Step 3: Populate Event Time

Event time allows you to catalog the blobs in your data source by date. It can also be used for creating data source minimization policies.

HDFS Event Time

By default, Immuta uses the blob's write time as its event time. However, write time is not always an accurate representation of event time. To provide a custom event time, store it in a blob attribute; in HDFS these are stored as xattrs. Specify the key of the metadata/xattr that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.
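
A minimal sketch of writing such an xattr with the Hadoop FileSystem API is below; the key name user.event_time and the blob path are assumptions, use whatever key you specify in this form.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetEventTime {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), new Configuration())) {
            // Hypothetical blob path and xattr key; the value must be ISO 8601.
            fs.setXAttr(
                new Path("/data/immuta/2015/records.csv"),
                "user.event_time",
                "2015-11-15T05:13:32+00:00".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The same attribute can also be set from the command line with hdfs dfs -setfattr -n user.event_time -v "<date>" <path>.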

Step 4: Configure Tags and Features

Immuta extracts existing metadata from HDFS blobs. This metadata can also be used to apply tags or features to blobs.

  1. Determine how you would like to tag your data and extract features from it by selecting one or both of the following options: Folder Name(s) or Use Object Attributes.

    HDFS Tags and Features

  2. If using attributes, enter the attribute names in the following fields and click Add Attributes:

    • How would you like to tag your data?: This step is optional and provides the ability to "auto-tag" blobs based on attributes of the data rather than manual tag entry. You can pull the tags from the folder name or from metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API.

    • Select any attributes you would like to extract as features: Object-backed data sources are not accessible via the Immuta Query Engine. However, you can pass Immuta features about the data (essentially extra metadata), and that metadata can be queried through the Immuta Query Engine. Features can be populated just like tags: via folder name or metadata/xattrs.

    Note: When using xattrs in HDFS, Immuta assumes the user namespace prefix, so you do not need to include it in your keys; see the sketch after this list.

  3. Click Apply.
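
If you populate tags or features from xattrs, a minimal sketch of setting them is below. Per the note above, the keys entered in the Immuta form would be department and record_count (without the user. prefix); the names and values here are hypothetical.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetTagAndFeatureXattrs {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), new Configuration())) {
            Path blob = new Path("/data/immuta/2015/records.csv");
            // Hypothetical tag xattr: entered in the Immuta form as "department".
            fs.setXAttr(blob, "user.department",
                    "claims".getBytes(StandardCharsets.UTF_8));
            // Hypothetical feature xattr: extra metadata that becomes queryable
            // through the Immuta Query Engine.
            fs.setXAttr(blob, "user.record_count",
                    "1042".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```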