HDFS Data Source Creation Tutorial
Audience: Data Owners
Step 1: Enter Connection Information
Enter the required information in the following fields to connect to the NameNode of your HDFS cluster:
- NameNode Host: the hostname of your HDFS NameNode.
- NameNode Port: the NameNode RPC port, typically 8020.
- Base Directory: the root directory in HDFS from which data should be recursively ingested.
- Kerberos: enable if connecting to a Kerberized HDFS cluster.
- Username: the username used to connect to the HDFS cluster. This field defaults to the user's Immuta username. Only users with the IMPERSONATE_HDFS_USER permission can change this field to another value; contact your Immuta Admin for more information about this permission.
- Kerberos Realm: if Kerberos is enabled, this field defaults to the Kerberos realm set in the configuration. Only users with the IMPERSONATE_HDFS_USER permission can change this field to another value; contact your Immuta Admin for more information about this permission.
Click Test Connection.
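The Test Connection button performs the connectivity check for you, but if it fails, you can rule out basic network problems with a quick TCP probe of the NameNode RPC port. The sketch below is illustrative only (the hostname shown is hypothetical; substitute your own NameNode host and port):

```python
import socket

def namenode_reachable(host: str, port: int = 8020, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the NameNode RPC port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical host; replace with your NameNode):
# namenode_reachable("namenode.example.com", 8020)
```

A successful probe only confirms network reachability; authentication and authorization issues (for example, Kerberos misconfiguration) can still cause Test Connection to fail.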
Step 2: Select Data Format
While object-backed data sources can contain blobs of any format (images, videos, etc.), many blobs share common structured formats. If your blobs are comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format allows you to create masking policies for the data source.
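Immuta applies masking policies for you once the format is declared; purely as an illustration of what masking a comma-separated blob means, the sketch below hashes one column of a CSV blob (the column name, sample data, and choice of SHA-256 hashing are assumptions, not Immuta's internal implementation):

```python
import csv
import hashlib
import io

def mask_csv_column(blob_text: str, column: str) -> str:
    """Replace every value in `column` with a SHA-256 hash (illustrative only)."""
    reader = csv.DictReader(io.StringIO(blob_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = hashlib.sha256(row[column].encode()).hexdigest()
        writer.writerow(row)
    return out.getvalue()

# Hypothetical blob with a sensitive "ssn" column:
blob = "name,ssn\nalice,123-45-6789\n"
masked = mask_csv_column(blob, "ssn")
```

Masking is only possible because the declared format tells Immuta where column boundaries are; that is why this step matters even though the blobs remain opaque objects.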
Step 3: Populate Event Time
Event time allows you to catalog the blobs in your data source by date. It can also be used for creating data source minimization policies.
By default, Immuta will use the write time of the blob for event time.
However, write time is not always an accurate way to represent event time.
If you want to provide a customized event time, you can do so via blob attributes, which in HDFS are stored as extended attributes (xattrs). Specify the key of the xattr that contains the date in ISO 8601 format (for example, 2021-07-04T12:00:00Z).
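Assuming you set the attribute with HDFS's `hdfs dfs -setfattr` command (the key name `event_time` here is illustrative), it is worth sanity-checking that the values you store really parse as ISO 8601. A quick Python check:

```python
from datetime import datetime

def parse_event_time(xattr_value: str) -> datetime:
    """Parse an ISO 8601 event-time string stored in a blob's xattr."""
    # datetime.fromisoformat only accepts a trailing "Z" on newer Python
    # versions, so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(xattr_value.replace("Z", "+00:00"))

# Illustrative value, matching the format Immuta expects:
event_time = parse_event_time("2021-07-04T12:00:00Z")
```

If a value fails to parse here, Immuta will likewise be unable to use it, and the blob would fall back to another event time.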
Step 4: Configure Tags and Features
Immuta extracts existing metadata from HDFS blobs. This metadata can also be used to apply tags or features to blobs.
Determine how you would like to tag your data and extract features from your data by selecting one or both of the following options: Folder Name(s) or Use Object Attributes.
If using attributes, enter the attribute names in the following fields and click Add Attributes:
How would you like to tag your data?: This step is optional and provides the ability to "auto-tag" based on attributes of the data rather than the manual entry you do in step 3. You can pull the tags from the folder name or metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API.
Select any attributes you would like to extract as features: Object-backed data sources are not accessible via the Immuta Query Engine. However, you can pass Immuta features about the data (essentially extra metadata) and that metadata can be queried through the Immuta Query Engine. Those features can be populated just like the tags (via folder name or metadata/xattrs).
Note: when using xattrs in HDFS, Immuta assumes the user namespace prefix, so you do not need to include it in your keys.
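In other words, a bare key entered in the Immuta UI maps to a fully qualified xattr name in HDFS. A small sketch of that expansion (the key name is hypothetical):

```python
def full_xattr_name(key: str, namespace: str = "user") -> str:
    """Expand a bare xattr key to its namespaced HDFS form.

    Immuta assumes the `user` namespace, so keys entered in the UI
    omit the prefix; HDFS itself stores them fully qualified.
    """
    if key.startswith(f"{namespace}."):
        return key
    return f"{namespace}.{key}"

# "event_time" in the Immuta UI corresponds to "user.event_time" in HDFS.
expanded = full_xattr_name("event_time")
```

So when inspecting blobs directly in HDFS you will see `user.`-prefixed names, but in Immuta you enter only the part after the prefix.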