HDFS Data Source Creation Tutorial
Audience: Data Owners
Content Summary: This guide details how to configure an HDFS data source in Immuta. To explore data source creation guides for other storage technologies, see the Object-backed and Query-backed guides.
Step 1: Enter Connection Information
- Select the HDFS Configuration from the dropdown menu, if available.
- Enter the required information in the following fields to connect to the NameNode of your HDFS cluster:
  - NameNode Host: the hostname of an HDFS NameNode.
  - NameNode Port: the NameNode port, typically 8020.
  - Base Directory: the root directory in HDFS from which data should be recursively ingested.
  - Kerberos: enable if connecting to a Kerberized HDFS cluster.
  - Username: the username used to connect to the HDFS cluster. This field automatically defaults to the user's Immuta username. Only users with the IMPERSONATE_HDFS_USER permission can edit this field and change it to another value. Please contact your Immuta Admin for more information about this permission.
  - Kerberos Realm: if Kerberos is enabled, this field automatically defaults to the Kerberos realm set by configuration. Only users with the IMPERSONATE_HDFS_USER permission can edit this field and change it to another value. Please contact your Immuta Admin for more information about this permission.
- Click Test Connection.
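If Test Connection fails, it can help to verify basic reachability of the NameNode outside of Immuta. Below is a minimal sketch using the Hadoop Java client; the hostname namenode.example.com, port 8020, and base directory /data/ingest are placeholders for your own values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the same NameNode host and port you will enter in the form.
        // The hostname, port, and path below are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        // On a Kerberized cluster you would also need a keytab login, e.g.
        // UserGroupInformation.loginUserFromKeytab(principal, keytabPath).

        try (FileSystem fs = FileSystem.get(conf)) {
            // List the intended Base Directory to confirm it exists and is readable.
            for (FileStatus status : fs.listStatus(new Path("/data/ingest"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }
}
```

If this listing succeeds but Test Connection still fails, the Kerberos or Username settings above are more likely culprits than basic connectivity.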
Step 2: Select Data Format
While object-backed data sources can be of any format (images, videos, etc.), many blobs will have common formats. If your blobs are comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format allows you to create masking policies for the data source.
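For example, a blob stored as comma-separated text becomes column-addressable once its format is declared, so each field can be targeted by a masking policy. The sketch below writes such a blob into the base directory with the Hadoop Java client; the NameNode address, path, and field names are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteCsvBlob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/ingest/people.csv"))) {
            // A comma-separated blob: once the data source's format is set to
            // CSV, each column can be targeted by a masking policy.
            out.write("id,name,email\n1,Alice,alice@example.com\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
    }
}
```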
Step 3: Populate Event Time
Event time allows you to catalog the blobs in your data source by date. It can also be used for creating data source minimization policies.
By default, Immuta will use the write time of the blob for event time.
However, write time is not always an accurate way to represent event time.
Should you want to provide a customized event time, you can do so via blob attributes, which HDFS stores as xattrs. Specify the key of the metadata/xattr that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.
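As a sketch of how such an attribute gets onto a blob, the following uses the Hadoop Java client to set the example timestamp above as an xattr. The key user.event_time, the NameNode address, and the path are assumptions for illustration; HDFS requires xattr names to carry a namespace prefix such as user., and the key you write must match the key you specify in Immuta.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetEventTime {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path blob = new Path("/data/ingest/records.csv"); // placeholder path
            // HDFS xattr names require a namespace prefix such as "user.".
            // "user.event_time" is an example key; whatever key you use here
            // is the key you specify on this screen in Immuta.
            fs.setXAttr(blob, "user.event_time",
                    "2015-11-15T05:13:32+00:00".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```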
Step 4: Configure Tags
Immuta extracts existing metadata from HDFS blobs. This metadata can also be used to apply tags to blobs.
- Determine how you would like to tag your data by selecting one or both of the following options: Folder Name(s) or Use Object Attributes.
- If using attributes, enter the attribute names in the following fields and click Add Attributes.

How would you like to tag your data?: This step is optional and provides the ability to "auto-tag" blobs based on attributes of the data rather than manual tag entry. You can pull the tags from the folder name or from metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API. Additionally, HDFS xattrs can be used to drive policies.
- Click Apply.
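To illustrate the Use Object Attributes option, this sketch writes a tag-bearing xattr onto a blob and reads it back with the Hadoop Java client. The key user.department, its value, the NameNode address, and the path are all hypothetical; the attribute name is what you would enter before clicking Add Attributes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TagAttributes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path blob = new Path("/data/ingest/records.csv"); // placeholder path

            // Write an attribute whose name you would then enter under
            // "Use Object Attributes" ("user.department" is hypothetical).
            fs.setXAttr(blob, "user.department",
                    "finance".getBytes(StandardCharsets.UTF_8));

            // Read the attributes back to confirm they are set on the blob.
            for (Map.Entry<String, byte[]> e : fs.getXAttrs(blob).entrySet()) {
                System.out.println(e.getKey() + " = "
                        + new String(e.getValue(), StandardCharsets.UTF_8));
            }
        }
    }
}
```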