Create an Object-Backed Data Source

Audience: Data Owners

Content Summary: Object-backed data sources are data storage technologies that do not support SQL, ranging from NoSQL technologies to blob stores, filesystems, and APIs. Object-backed data sources act like key/value stores and are often called ingested sources because Immuta must ingest metadata about the data source to provide access and create policy restrictions. Data Owners provide Immuta metadata about the blobs they are exposing so that Immuta understands how to reach the blobs and apply policies.

This guide outlines the process of creating object-backed data sources, such as Amazon S3, Apache HDFS, Azure Blob Storage, Custom, FTP, and Persisted.

If your storage technology is not listed above, navigate to the Query-backed Data Sources Tutorial.

1 - Create a New Data Source

To create a new data source,

  1. Click the plus button in the top left corner of the Immuta console.
  2. Select the Data Source icon.

Alternatively,

  1. Navigate to the My Data Sources page.
  2. Click the New Data Source button in the top right corner.

2 - Select Your Storage Technology

Select the storage technology containing the data you wish to expose by clicking a tile. Please note that the list of enabled technologies is configurable and may differ from the image below.

Data Source Creation Select Backend

3 - Enter Basic Information

Provide information about your source that makes it discoverable to users.

  1. Complete the Data Source Name field, which will be the name shown in the Immuta UI.
  2. Enter the Immuta S3 Folder, which is the name of the Immuta S3 folder that corresponds to this data source. Note that for object-backed data sources, this table will only store metadata about blobs in this data source.

    Data Source Creation Basic Information

4 - Enter Connection Information

Select the tabs below for specific instructions for your chosen storage technology.

Amazon S3

  1. Select an Authentication Method. You must choose an authentication method before connecting an S3 data source. The following methods are supported:

    • AWS Access Key: Connect to a private S3 bucket using an access key pair.
    • AWS Instance Role: Connect to a private S3 bucket with no credentials, instead leveraging an IAM Role that has been assigned to the Immuta EC2 instance. The Data Owner must possess the CREATE_S3_DATASOURCE_WITH_INSTANCE_ROLE permission to proceed. By default, this option is disabled. For instructions on enabling this option, please contact your Immuta Support Professional.
    • No Authentication: Connect to a public S3 bucket with no credentials.

    S3 Connection Information

  2. Fill out the following fields in the Connection Information window:

    • AWS Access Key Id: Your AWS public access key.
    • AWS Secret Access Key: Your AWS secret access key.
    • Assume AWS IAM Role: Enable only if you want to connect using an IAM Role's access instead of an individual user's access.
    • AWS IAM Role ARN: If Assume AWS IAM Role is enabled, the ARN of the IAM Role that will be used (log in to the AWS Console or contact your AWS Admin for this value).
    • AWS Region: The region that contains the S3 bucket you wish to expose.
    IAM Policy Elements

    To connect to an S3 bucket, the IAM role you use must have several policy elements. These elements are included in a Statement array and must be present for every bucket you wish to connect with that IAM Role.

    • s3:ListBucket on the bucket you want to expose.
    • s3:GetObject on /* of the bucket you want to expose.
    • s3:GetObjectTagging on /* of the bucket you want to expose.

    Optionally, you can give the element s3:ListAllMyBuckets on arn:aws:s3:::*. This allows Immuta to present you with a list of buckets to choose from at data source creation time. Without this, you'll have to type in your bucket name.

    In order to update metadata from S3 as changes occur, additional policy elements are required.

    Below is an example policy for the buckets test-bucket-1 and test-bucket-2. Note that you'd have to create a data source per bucket, but this policy (applied to your IAM role) covers both buckets. This policy does not include the optional s3:ListAllMyBuckets element.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads"
                ],
                "Resource": [
                    "arn:aws:s3:::test-bucket-1",
                    "arn:aws:s3:::test-bucket-2"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                "s3:GetObject",
                "s3:GetObjectTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::test-bucket-1/*",
                    "arn:aws:s3:::test-bucket-2/*"
                ]
            }
        ]
    }
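
    The Statement layout above can also be generated programmatically. The sketch below is illustrative, not part of Immuta: it builds the same two-statement policy for any list of buckets, keeping bucket-level actions on the bucket ARN and object-level actions on the /* object ARN.

    ```python
    import json

    def s3_policy(buckets):
        """Build an IAM policy like the example above for a list of buckets.

        Sketch only: bucket-level actions go on the bucket ARN,
        object-level actions on the /* object ARN.
        """
        bucket_arns = [f"arn:aws:s3:::{b}" for b in buckets]
        object_arns = [f"arn:aws:s3:::{b}/*" for b in buckets]
        return {
            "Version": "2012-10-17",
            "Statement": [
                {"Effect": "Allow",
                 "Action": ["s3:ListBucket", "s3:ListBucketMultipartUploads"],
                 "Resource": bucket_arns},
                {"Effect": "Allow",
                 "Action": ["s3:GetObject", "s3:GetObjectTagging"],
                 "Resource": object_arns},
            ],
        }

    print(json.dumps(s3_policy(["test-bucket-1", "test-bucket-2"]), indent=4))
    ```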
    
  3. Click Verify Credentials.

  4. Complete the following fields in the S3 Configuration window:

    • S3 Bucket: the bucket you want your data source to reference.
    • S3 Prefix: the prefix within the specified bucket. Only blobs under this prefix will be ingested and returned.
    • Additional Options:
      • Ingest Metadata Automatically: when enabled, Immuta automatically monitors S3 for updates to the data by creating SQS queues that receive bucket update events. If not enabled, metadata is pulled from S3 only once; additions, deletions, and attribute changes after that pull will not be reflected in the data source or its policies. Even if you do not enable this option, you can still direct Immuta to re-crawl the data manually whenever you would like.

    S3 Bucket Configuration

  5. If you select Ingest Metadata Automatically, enter the name of the SQS queue in the SQS Queue field.

    These AWS IAM permissions are required:

    • sqs:DeleteMessageBatch: a general permission on SQS.
    • sqs:ReceiveMessage: a general permission on SQS.
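
    When automatic ingest is enabled, the SQS queue receives standard S3 event notifications describing bucket changes. Below is a minimal sketch of decoding such a message body, using an abbreviated sample notification; the handling logic is illustrative, not Immuta's implementation.

    ```python
    import json

    # A minimal S3 event notification body, as delivered to an SQS queue
    # when an object is created (abbreviated to the fields used below).
    body = json.dumps({
        "Records": [{
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "test-bucket-1"},
                "object": {"key": "reports/2021/q1.csv"},
            },
        }]
    })

    def changed_objects(message_body):
        """Yield (event, bucket, key) tuples from an S3 event notification."""
        for record in json.loads(message_body).get("Records", []):
            s3 = record["s3"]
            yield record["eventName"], s3["bucket"]["name"], s3["object"]["key"]

    for event, bucket, key in changed_objects(body):
        print(event, bucket, key)
    ```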

Advanced Options

In this section, you can edit advanced configurations for your data source. None of these configurations are required to create the data source.

Select Data Format

Data Format

While object-backed data sources can be any format (images, videos, etc.), Immuta can still work under the assumption that some will have common formats. Should your blobs be comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.

Select Event Time

By default, Immuta will use the write time of the blob for the event time. However, write time is not always an accurate way to represent event time. Should you want to provide a customized event time, you can do that via blob attributes. In S3 these are stored as metadata. You can specify the key of the metadata attribute that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.

S3 Event Time
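
A sketch of producing a timestamp in the expected ISO 8601 form with Python's standard library. The metadata key you store it under is whatever you configure here; "event-time" in the comment is only an example.

```python
from datetime import datetime, timezone

# Produce an ISO 8601 timestamp with a UTC offset, matching the
# documented example format 2015-11-15T05:13:32+00:00. You would store
# this as the value of your chosen metadata key (e.g. "event-time").
event_time = datetime(2015, 11, 15, 5, 13, 32, tzinfo=timezone.utc).isoformat()
print(event_time)  # 2015-11-15T05:13:32+00:00
```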

Configure Tags

  • How would you like to tag your data?: This step is optional and provides the ability to "auto-tag" based on attributes of the data rather than the manual entry you do in step 3. You can pull the tags from the folder name or metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API.

  • Select any attributes you would like to extract as features: Object-backed data sources are not accessible via the Immuta Query Engine. However, you can pass Immuta features about the data (essentially extra metadata) and that metadata can be queried through the Immuta Query Engine. Those features can be populated just like the tags (via folder name or metadata/xattrs).

  • Note that when using metadata in S3, Immuta assumes the x-amz-meta- namespace prefix, so you do not need to include that in your keys.

    S3 Tags and Features

To learn more about S3 tags and metadata attributes, see the official AWS documentation.
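
As a minimal illustration of the prefix convention (the header names below are hypothetical examples), the key you configure in Immuta corresponds to the part after x-amz-meta-:

```python
# S3 stores user-defined metadata under HTTP headers prefixed with
# x-amz-meta-. Since Immuta assumes that prefix, the attribute key you
# configure is the bare name after it.
PREFIX = "x-amz-meta-"

def attribute_keys(headers):
    """Strip the S3 user-metadata prefix from raw header names."""
    return {k[len(PREFIX):]: v
            for k, v in headers.items() if k.startswith(PREFIX)}

print(attribute_keys({"x-amz-meta-department": "finance",
                      "Content-Type": "text/csv"}))
```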

Azure Blob Storage

  1. To connect a data source to an Azure Blob Storage container, you must first create a Shared Access Signature for your Azure Blob Storage account.

  2. Retrieve your SAS credentials from the Azure Portal by following the steps below; you will enter the Shared Access Signature Token and the corresponding URL for your Azure Storage account in step 10.

    Azure Blob Storage Connection Information

  3. Open the Azure Portal Web UI.

  4. Find and select your desired Azure Storage Account resource.
  5. Under SETTINGS select Shared access signature.

    Azure Blob Storage Portal Sidebar

  6. Configure the SAS Token's allowed services, resource types, and permissions to match the following image.

    Azure Blob Storage SAS Settings

  7. Set a reasonable expiration date for your SAS Token. When your SAS Token expires, your Immuta data source will no longer be able to fetch data from Azure.

  8. Select Generate SAS and save the provided credentials.

  9. Select the container that you wish to base this data source on. The data source will contain all of the blobs in this container, and it will also maintain the container's directory structure.

    Azure Blob Storage Container Configuration

  10. On the data source creation page in the Immuta console, enter the Blob Storage SAS Token and Token URL and click Verify Credentials.

  11. Enter the Azure Blob Storage Container.
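
The SAS token you saved is a query string whose se field carries the expiration you set in step 7. Below is a small sketch for checking when a saved token expires; the token value is hypothetical, with the signature elided.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

def sas_expiry(sas_token):
    """Return the expiry timestamp encoded in a SAS token's `se` field."""
    se = parse_qs(sas_token.lstrip("?"))["se"][0]
    # Normalize the trailing Z so fromisoformat() accepts it.
    return datetime.fromisoformat(se.replace("Z", "+00:00"))

# Hypothetical token with the signature elided.
token = "sv=2020-08-04&ss=b&srt=co&sp=rl&se=2022-01-01T00:00:00Z&sig=..."
print(sas_expiry(token))
```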

Advanced Options

In this section, you can edit advanced configurations for your data source. None of these configurations are required to create the data source.

Azure Blob Storage Advanced

Refresh Interval

If left blank or set to 0, Azure blob data will only be indexed once when the data source is initially created. Otherwise, the Azure blob data will be re-indexed based on the selected time interval.

Refresh Interval

  • Set Time: This is how often Immuta will re-index data located in the remote Azure blob container.
  • Set Period: This is the time period and can be set to minutes, hours or days.

If you do not set a refresh interval, Immuta will never automatically crawl your container. You can always manually crawl from the Data Source Overview page.
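
The two fields combine into a single interval. A sketch of that mapping, assuming Set Period is one of minutes, hours, or days:

```python
from datetime import timedelta

def refresh_interval(set_time, set_period):
    """Translate the Set Time / Set Period fields into a timedelta.

    A Set Time of 0 (or blank) means the data is indexed once and
    never automatically re-crawled.
    """
    if not set_time:
        return None
    # set_period is assumed to be "minutes", "hours", or "days",
    # which happen to be valid timedelta keyword arguments.
    return timedelta(**{set_period: set_time})

print(refresh_interval(6, "hours"))
```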

Data Format

While object-backed data sources can be any format (images, videos, etc.), Immuta can still work under the assumption that some will have common formats. Should your blobs be comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.

Data Format

Event Time

Event time allows you to catalog the blobs in your data source by date. It can also be used for creating data source minimization policies.

By default, Immuta will use each blob's Last Modified date attribute from Azure for Event Time. However, this is not always an accurate way to represent event time. Should you want to provide a customized event time, you can do that via blob attributes. You can specify the key of the metadata attribute that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.

Azure Blob Storage Event Time

Tags

Immuta will extract any existing metadata from Azure blobs. This metadata can also be used to apply tags to blobs. When configuring tags, note that Attribute Name refers to the key of your desired blob metadata attribute in Azure.

Azure Blob Storage Tags

HDFS

  1. Select the HDFS Configuration from the dropdown menu, if available.

  2. Enter the required information in the following fields to connect to the NameNode of your HDFS cluster:

    • NameNode Host: the hostname of your HDFS NameNode.
    • NameNode Port: the NameNode port, typically port 8020.
    • Base Directory: the root directory in HDFS from which data should be recursively ingested.
    • Kerberos: enable if connecting to a Kerberized HDFS cluster.
    • Username: This username will be used to connect to the HDFS cluster. This field automatically defaults to the user's Immuta username. Only users with the IMPERSONATE_HDFS_USER permission can edit this field and change it to another value. Please contact your Immuta Admin for more information about this permission.
    • Kerberos Realm: If Kerberos is enabled, this field automatically defaults to the Kerberos realm set by configuration. Only users with the IMPERSONATE_HDFS_USER permission can edit this field and change it to another value. Please contact your Immuta Admin for more information about this permission.

    HDFS Connection Information

  3. Click Test Connection.
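
The NameNode host, port, and base directory together identify where Immuta crawls. A rough sketch composing them into a conventional hdfs:// URI (the URI form is standard Hadoop convention, not an Immuta-specific format):

```python
def namenode_uri(host, port=8020, base_directory="/"):
    """Compose a conventional hdfs:// URI from the connection fields."""
    if not base_directory.startswith("/"):
        raise ValueError("Base Directory must be an absolute HDFS path")
    return f"hdfs://{host}:{port}{base_directory}"

print(namenode_uri("namenode.example.com", 8020, "/data/ingest"))
```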

Advanced Options

In this section, you can edit advanced configurations for your data source. None of these configurations are required to create the data source.

Data Format

While object-backed data sources can be any format (images, videos, etc.), Immuta can still work under the assumption that some will have common formats. Should your blobs be comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.

Data Format

Event Time

Event time allows you to catalog the blobs in your data source by date. It can also be used for creating data source minimization policies.

HDFS Event Time

By default, Immuta will use the write time of the blob for event time. However, write time is not always an accurate way to represent event time. Should you want to provide a customized event time, you can do that via blob attributes. In HDFS these are stored as xattrs. You can specify the key of the metadata/xattr that contains the date in ISO 8601 format, for example: 2015-11-15T05:13:32+00:00.

Tags

Immuta extracts existing metadata from HDFS blobs. This metadata can also be used to apply tags to blobs.

  1. Determine how you would like to tag your data by selecting one or both of the following options: Folder Name(s) or Use Object Attributes. This step is optional and provides the ability to "auto-tag" based on attributes of the data rather than manual entry. You can pull the tags from the folder name or metadata/xattrs. Adding tags makes your data more discoverable in the Immuta Web UI and REST API. Additionally, HDFS xattrs can be used to drive policies.

  2. If using attributes, enter the attribute names in the following fields and click Add Attributes.

  3. Click Apply.

HDFS Tags

FTP

  1. Fill out the following fields in the Connection Information window:

      • Server: hostname of your FTP server.
      • Port: the port configured for FTP, typically port 21.
      • SFTP: enable if connecting to an FTP server that supports SFTP.
      • Select Authentication Method: the authentication method for connecting to the FTP server (Anonymous, Basic Authentication, or SSH Key).
        • Username: the username to connect to the FTP server with (only applicable if Basic Authentication or SSH Key is selected).
        • Password: the password to connect to the FTP server with (only applicable if Basic Authentication is selected).
        • Private Key: browse to the file containing this user's private SSH key (only applicable if SSH Key is selected).
      • Root Path: the path on the FTP server that you want the data source to reference, typically '/'.

      FTP Connection Information

  2. Click Test Connection.

Advanced Options

In this section, you can edit advanced configurations for your data source. None of these configurations are required to create the data source.

Data Format

While object-backed data sources can be any format (images, videos, etc.), Immuta can still work under the assumption that some will have common formats. Should your blobs be comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.

Data Format

Tags

Immuta extracts existing metadata from FTP blobs. This metadata can also be used to apply tags to blobs. FTP only supports using folder names for tags. To utilize this feature:

  1. Click Edit to open the Tag(s) option window.

  2. Select Folder Name(s).

  3. Click Apply.

FTP Tags
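
As an illustration of what folder-name tagging draws from (not Immuta's exact tag format), the folder components of a blob's path can be extracted like this:

```python
from pathlib import PurePosixPath

def folder_tags(blob_path):
    """Derive candidate tag names from the folders in a blob's path.

    Illustrative only: shows which path components folder-name
    tagging is based on.
    """
    parts = PurePosixPath(blob_path).parts
    # Drop a leading "/" and the file name itself, keeping folder names.
    return [p for p in parts[:-1] if p != "/"]

print(folder_tags("/finance/2021/report.csv"))  # ['finance', '2021']
```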

Refresh Interval

If left blank or set to 0, FTP data will only be indexed once when the data source is initially created. Otherwise, the FTP data will be re-indexed based on the selected time interval.

  1. Fill out the Set Time and Set Period fields:

    • Set Time: This is how often Immuta will re-index data located on the remote FTP server.
      • Set Period: This is the time period and can be set to minutes, hours or days.

    Refresh Interval

  2. Click Apply.

Persisted

The only configuration option for Persisted data sources is the data format. Specifying the data format here allows you to create masking policies for the data source.

Data Format

Upload Files

Data Owners can easily upload files to Persisted data sources through the Immuta UI.

  1. Navigate to the Data Source details page.
  2. Click the drop-down button in the top right of the page and select Upload.

    Persisted Upload Button

  3. In the resulting modal, fill out the fields with your desired directory structure, event time attribute, tags, and metadata for the blob(s) that you upload. You can drag and drop or select multiple files to upload at once. The file count will display above these uploaded files.

    Persisted Upload

  4. Click Save.

Advanced Options

In this section, you can edit advanced configurations for your data source. None of these configurations are required to create the data source.

Data Format

While object-backed data sources can be any format (images, videos, etc.), Immuta can still work under the assumption that some will have common formats. Should your blobs be comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.

Data Format

Tags

Immuta extracts existing metadata from blobs. This metadata can also be used to apply tags to blobs.

  1. Click Edit to open the Tag(s) option window.

  2. Select Folder Name(s).

  3. Click Apply.

Persisted Tags

Custom Data Source

Before you can register a Custom data source, a Custom Blob Handler must already be deployed.

  1. Select http or https.

    Best Practice: Use Two-Way SSL Configuration

    The two-way SSL configuration is highly recommended as it is the most secure configuration for a custom blob store handler endpoint.

  2. If you've selected https, opt to provide a Private Key, a Certificate, or a CA bundle for the blob store handler.

  3. Opt to enter a Record ID.
  4. Click Test Connection.

Advanced Options

In this section, you can edit advanced configurations for your data source. None of these configurations are required to create the data source.

Data Format

While object-backed data sources can be any format (images, videos, etc.), Immuta can still work under the assumption that some will have common formats. Should your blobs be comma-separated, tab-delimited, or JSON, you can mask values through the Immuta interface. Specifying the data format will allow you to create masking policies for the data source.

Data Format

Tags

Immuta extracts existing metadata from blobs. This metadata can also be used to apply tags to blobs.

  1. Click Edit to open the Tag(s) option window.

  2. Select Folder Name(s).

  3. Click Apply.

Custom Data Source Tags

5 - Create the Data Source

Click Create to save the data source(s).

Manually Re-crawling Data Sources

Some object-backed data sources can be manually re-crawled to fetch fresh metadata about the data objects. If your data source is not set up to ingest the metadata automatically, you may need to perform this action from time to time.

  1. Navigate to the Data Source Overview page.
  2. Click on the menu icon in the upper right corner and select Re-crawl.

    Data Source Re-crawl

What's Next

Now that you've created a data source, you can choose to continue to the next page or to one of these tutorials:

  • Manage Data Sources
  • Write a Local Policy