Hadoop and Spark Plugin Configuration

Audience: System Administrators

Content Summary: This page outlines the on-cluster configurations for Immuta's Hadoop and Spark plugins. Most of these values are consistent across Hadoop providers; however, some values are provider-specific. To learn more about provider-specific deployments, see the installation guides for Cloudera and Amazon EMR.

Components

Immuta NameNode Plugin

The NameNode plugin runs on each HDFS NameNode as the hdfs user. It has access to any configuration item available to HDFS clients, and potentially to additional items that are visible to the NameNode only. The configuration for the NameNode plugin can be placed in an alternate configuration file (detailed below) to avoid leaking sensitive configuration items.

The NameNode plugin configurations can be set in core-site.xml and hdfs-site.xml (for NameNode-specific values).

Immuta Vulcan Service

The Vulcan Service is an Immuta service that is mostly relevant to Spark applications. It has its own configuration file (generator.xml) and also reads all system-wide/client configuration for Hadoop (core-site.xml).

Hadoop Clients

Clients of HDFS/Hadoop services are Spark jobs, MapReduce jobs, and other user-driven applications in the Hadoop ecosystem. The configuration items for clients can be provided system-wide in core-site.xml or configured per-job (typically) on the command line or in application/job configuration.

Spark Applications

An additional generator.xml file, created for Spark applications only, contains connection information for the Vulcan Service. Immuta configuration can also be added to spark-defaults.conf for system-wide application to Spark jobs. Unless otherwise stated, items in spark-defaults.conf should be prefixed with spark.hadoop. because they are read from Hadoop configuration.

Public NameNode and Hadoop Client Configuration

Public configuration is not sensitive, and is shared by client libraries such as ImmutaApiKeyAuth and the NameNode plugin (as well as potentially other Immuta and non-Immuta services on the cluster). These configuration items should be in a core-site.xml file distributed across the cluster and readable by all users.

  • immuta.generated.api.key.dir

    • Default: /user

    • Description: The base directory under which the NameNode plugin will look for generated API keys for use with the Immuta Web Service. The default value is /user, with the username and .immuta_generated appended, so that each user has their own generated API key directory. The .immuta_generated directory adds a layer of protection so that other users cannot watch the /user/<username> directory and wait for API keys to be generated. This configuration item should never point at a non-HDFS path because attempting to generate credentials outside of HDFS is invalid. This item should be in sync between the NameNode plugin's configuration and client configuration.

  • immuta.credentials.dir

    • Default: /user

    • Description: A directory which will be used to store each user's Immuta API key and token for use with the Immuta Web Service. The user's API key and token are stored this way to avoid re-authenticating frequently with the web service and introducing additional overhead to processes like MapReduce and Spark. Similar to the generated API key directory, this configuration item defaults to /user with the username of the current user added on. Each user should have a directory under the credentials directory for storing their own credentials. NOTE: It is valid for a user to provide and save their own API key in /user/<username>/immuta_api_key so that their code does not attempt to generate an API key. It is also valid to override this value with a non-HDFS path in case HDFS is not being used (Spark in a non-HDFS environment, for example); e.g., file:///home/ would point to file:///home/<username>/immuta_api_key with the user's API key file.

  • immuta.base.url

    • Description: The URL at which the Immuta API can be reached. This should be the base URL of the Immuta API.

  • fs.immuta.impl

    • Description: This configuration allows users to access the immuta:// scheme in order to have their filesystem built in the same way that the Immuta FUSE filesystem is built. This filesystem is also used in Spark deployments, which read data from external object storage (e.g., S3). This means that users will have consistent filesystem views regardless of where they are accessing Immuta. This is not set by default and must be set to com.immuta.hadoop.ImmutaFileSystem system-wide in core-site.xml.

  • immuta.cluster.name

    • Default: hostname from fs.defaultFS

    • Description: This configuration item identifies a cluster to the Immuta Web Service. This is very important because it determines how file access is controlled in HDFS by the NameNode plugin and which data sources are available to a cluster. The default value is taken from fs.defaultFS. Note that when an organization has multiple HA HDFS clusters, they may all have the same nameservice name, so this value should be set explicitly on each cluster for identification purposes.

  • immuta.api.key

    • Description: (CLIENT ONLY) Users can configure their own API key when running jobs or interacting with an HDFS client, but if an API key is not configured for the user it will be generated on the first attempt to communicate with the Immuta service and stored securely in their credential directory (described above). Immuta uses the Configuration.getPassword() method to retrieve this configuration item, so it may also be set using the Hadoop CredentialProvider API.

  • immuta.permission.fallback.class

    • Default: org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider (HDFS 2.6.x/CDH), org.apache.hadoop.hdfs.server.namenode.DefaultINodeAttributesProvider (HDFS 2.7+)

    • Sentry: org.apache.sentry.hdfs.SentryINodeAttributesProvider (HDFS 2.7+)

    • Description: The configuration key for the fully qualified class name of the fallback permission checking class that will be used after the Immuta authorization or inode attribute provider.

  • immuta.permission.allow.fallback

    • Default: false

    • Description: Denotes the action that the Immuta permission checking classes will take when a user is forbidden access to data in Immuta. If set to true, every time a user is denied access to a file via Immuta, their permissions will be checked against the underlying default permission checker. This means they may still have access to data that they cannot access via Immuta.

  • immuta.permission.users.to.ignore

    • Default: hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,immuta

    • Description: Comma-delimited list of users whose HDFS file accesses will never be checked in Immuta. This should include any system superusers, to avoid the overhead of Immuta permission checks that are not relevant to them.

  • immuta.permission.groups.to.ignore

    • Description: Same as immuta.permission.users.to.ignore but for groups.

  • immuta.permission.users.to.enforce

    • Description: A comma delimited list of users that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, then fallback authorizations will apply to everyone by default, unless they are on this list. If a user is on both the enforce list and the ignore list, then their permissions will be checked with Immuta (i.e., the enforce configuration item takes precedence).

  • immuta.permission.groups.to.enforce

    • Description: Same as immuta.permission.users.to.enforce but for groups.

  • immuta.permission.paths.to.enforce

    • Description: A comma-delimited list of paths to enforce when checking permissions on HDFS files. If this configuration item is set, then these paths and their children will be checked in Immuta. All other paths will use fallback authorizations. WARNING: Setting this property effectively disables Immuta file permission checking for all paths not in this configuration item. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce properties at the same time is unsupported.

  • immuta.permission.paths.to.ignore

    • Description: A comma-delimited list of paths to ignore when checking permissions on HDFS files. If this configuration item is set, then these paths and their children will use fallback authorizations and not go through Immuta. All other paths will be checked with Immuta. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce properties at the same time is unsupported.

  • immuta.system.details.cache.timeout.seconds

    • Default: 1800

    • Description: The number of seconds to cache system detail information from the Immuta Web Service. This should be high since, ideally, the relevant values in Immuta configuration won't change often (or ever).

  • immuta.permission.workspace.ignored.users

    • Default: hive,impala

    • Description: Comma-delimited list of users that should be ignored when accessing workspace directories. This should never have to change since the default Hive and Impala principals are covered, but this can be modified in case of non-standard configuration. This list is separate from the ignored user list above because we do not want to allow access to ignored non-system users who may be operating on a cluster with Immuta installed but who should not be allowed to see workspace data. This should be limited to the principals for Hive and Impala.
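
As a rough illustration of how the public items above might appear in a cluster-wide core-site.xml, the fragment below sets a handful of them. This is only a sketch: the URL and cluster name are hypothetical placeholders, and the remaining values simply repeat the defaults or required values listed above.

```xml
<?xml version="1.0"?>
<!-- core-site.xml (fragment): public Immuta settings, readable by all users -->
<configuration>
  <property>
    <name>immuta.base.url</name>
    <!-- Hypothetical URL; replace with the base URL of your Immuta API -->
    <value>https://immuta.example.com</value>
  </property>
  <property>
    <name>fs.immuta.impl</name>
    <!-- Not set by default; required for the immuta:// scheme -->
    <value>com.immuta.hadoop.ImmutaFileSystem</value>
  </property>
  <property>
    <name>immuta.cluster.name</name>
    <!-- Hypothetical name; set explicitly so clusters that share a nameservice name stay distinct -->
    <value>analytics-cluster-1</value>
  </property>
  <property>
    <name>immuta.permission.allow.fallback</name>
    <!-- Default shown: denied users are not re-checked against the fallback class -->
    <value>false</value>
  </property>
  <property>
    <name>immuta.permission.users.to.ignore</name>
    <!-- Default list of system users whose accesses are not checked in Immuta -->
    <value>hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,immuta</value>
  </property>
</configuration>
```

Sensitive items such as immuta.system.api.key are deliberately absent here, since this file is readable by all users.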

NameNode-only Configuration

The following configuration items are only relevant to the NameNode plugin. These are typically set in a file such as hdfs-site.xml, and for the most part they are not sensitive. Some items are highly sensitive, however, and should be set in such a way that only the NameNode process can read them. Immuta provides one solution for this: an additional NameNode plugin configuration file, readable only by the hdfs user, whose path is configured elsewhere (such as in hdfs-site.xml). This is detailed below.

  • immuta.extra.name.node.plugin.config

    • Description: Path to Hadoop-style XML configuration file containing items that will be used by the Immuta NameNode plugin. This item helps to configure sensitive information in a way that will only be readable by the hdfs user to avoid leaking sensitive configuration to other users. This should be in the form file:///path/to/file.xml.

  • immuta.system.api.key

    • Description: HIGHLY SENSITIVE. This configuration item is used by the NameNode plugin (and the Vulcan Service) to access privileged endpoints of the Immuta API. This is a required configuration item for both the NameNode plugin and Vulcan Service.

  • immuta.no.data.source.cache.timeout.seconds

    • Default: 60

    • Description: The amount of time in seconds that the NameNode plugin will cache the fact that a specific path is not a part of any Immuta data sources.

  • immuta.hive.impala.cache.timeout.seconds

    • Default: 60

    • Description: The amount of time in seconds to cache the fact that a user is subscribed to a Hive or Impala data source containing the target file they are attempting to access.

  • immuta.canisee.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds to cache the access result from Immuta for a user/path pair.

  • immuta.specific.access.cache.timeout

    • Default: 10

    • Description: The amount of time to temporarily unlock a file in HDFS for a user using temporary access tokens with files backing Hive and Impala data sources in Spark.

  • immuta.data.source.cache.timeout.seconds

    • Default: 300

    • Description: The amount of time in seconds that users' subscribed data sources should be cached in memory to avoid reaching out to Immuta for data sources over and over. Relevant to the Immuta Hadoop client FileSystem and Spark jobs.

  • immuta.canisee.metastore.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds that the NameNode plugin will cache the fact that a path belongs to a Metastore (Impala or Hive) data source. Reduces network calls from NameNode to Immuta when Vulcan is accessing paths belonging to Metastore sources.

  • immuta.canisee.non.user.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds that the NameNode plugin will cache the fact that a user principal does not belong to an Immuta user. This is useful if the ignored/enforced users/groups configurations are not being used: when the NameNode receives a 401 response from the canisee endpoint, it will store that information and not retry canisee requests to Immuta during that time.

  • immuta.canisee.num.retries

    • Default: 1

    • Description: The number of times to retry access calls from the NameNode plugin to Immuta to account for network issues.

  • immuta.project.user.cache.timeout.seconds

    • Default: 300

    • Description: The amount of time in seconds that the ImmutaGroupsMapping will cache whether or not a principal is tied to an Immuta user account. This decreases the number of calls from HDFS to Immuta when there are accounts that are not tied to Immuta.

  • immuta.project.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds that the ImmutaGroupsMapping will cache project and workspace information for a given project ID. This is also the amount of time a user's current project will be cached.

  • immuta.project.forbidden.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds that the ImmutaCurrentProjectHelper will cache the fact that a principal tied to an Immuta user is being forbidden from using their current project.

  • immuta.workspace.deduplication.timeout.seconds

    • Default: 60

    • Description: The amount of time in seconds to wait before auditing duplicate workspace filesystem actions from HDFS. A user reading or writing the same path again within this window will not have duplicate audit records written to Immuta.

  • immuta.permission.system.details.retries

    • Default: 5

    • Description: The number of times the system details background worker will attempt to retrieve system details from the Immuta web service if an attempt fails.

  • immuta.permission.source.cache.enabled

    • Default: false

    • Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources.

  • immuta.permission.source.cache.timeout.seconds

    • Default: 300

    • Description: The time in seconds between calls to sync/cache all paths that back Immuta data sources in HDFS.

  • immuta.permission.source.cache.retries

    • Default: 5

    • Description: The number of times the data source cache background worker will attempt to retry calls to Immuta on failure.

  • immuta.permission.request.retries

    • Default: 5

    • Description: The number of retries that the NameNode plugin will attempt for any blocking web request between HDFS and the Immuta API.

  • immuta.permission.request.initial.delay.milliseconds

    • Default: 250

    • Description: The initial delay for the BackoffRetryHelper that the NameNode plugin will employ for any retries of blocking web requests between HDFS and the Immuta API.

  • immuta.permission.request.socket.timeout

    • Default: 1500

    • Description: The time in milliseconds that the NameNode plugin will wait before cancelling a request to the Immuta API if no data has been read from the HTTP connection. This applies to blocking requests only.

  • immuta.permission.workspace.base.path.override

    • Description: This configuration item can be set so that the NameNode does not have to retrieve Immuta HDFS workspace base path periodically from the Immuta API.
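
A minimal sketch of the NameNode-only items above as they might appear in hdfs-site.xml is shown below. The file path is a hypothetical placeholder, and enabling the source cache is shown only as an example of overriding a default.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml (fragment): NameNode-only Immuta settings -->
<configuration>
  <property>
    <name>immuta.extra.name.node.plugin.config</name>
    <!-- Hypothetical path; the file should be readable only by the hdfs user -->
    <value>file:///etc/immuta/namenode-plugin.xml</value>
  </property>
  <property>
    <name>immuta.permission.source.cache.enabled</name>
    <!-- Default is false; true starts the background path-caching thread -->
    <value>true</value>
  </property>
  <property>
    <name>immuta.permission.source.cache.timeout.seconds</name>
    <!-- Default shown -->
    <value>300</value>
  </property>
  <property>
    <name>immuta.canisee.num.retries</name>
    <!-- Default shown -->
    <value>1</value>
  </property>
</configuration>
```

The file referenced by immuta.extra.name.node.plugin.config would use the same Hadoop-style configuration/property layout and would be the natural place for immuta.system.api.key, so that the key never appears in a file other users can read.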

Spark Application Configuration

The following items are relevant to any Immuta Spark applications using the ImmutaSparkSession or ImmutaContext.

  • immuta.spark.data.source.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds that data source information will be cached in the user's Spark job. This reduces the number of times the client will need to refresh data source information.

  • immuta.spark.sql.account.expiration

    • Default: 2880

    • Description: The amount of time in minutes that temporary SQL account credentials remain valid. These credentials are created by the Immuta Spark plugins for accessing queryable data sources via Postgres over JDBC.

  • immuta.postgres.fetch.size

    • Default: 1000

    • Description: The JDBC fetch size used for data sources accessed via Postgres over JDBC.

  • immuta.postgres.configuration

    • Description: The configuration key for any extra JDBC options that should be appended to the Immuta Postgres connection by the Immuta SQL Context. An example would include sslfactory=org.postgresql.ssl.NonValidatingFactory to turn off SSL validation.

  • immuta.enable.jdbc

    • Default: false

    • Description: If true, allows the user's Spark job to query Immuta's Postgres instance automatically when Immuta detects that the data source is not on cluster and the data must be pulled back via Postgres. This can be set per-job, but defaults to false to prevent a user from accidentally (and unknowingly) pulling huge amounts of data over JDBC.

  • immuta.ephemeral.host.override

    • Default: true

    • Description: Set this to false if ephemeral overrides should not be enabled for Spark. When true this will automatically override ephemeral data source host names with an auto-detected host name on cluster that should be running HiveServer2. It is assumed HiveServer2 is running on the NameNode.

  • immuta.ephemeral.host.override.address

    • Description: This configuration item can be used if automatic detection of Hive's hostname should be disabled in favor of a static hostname to use for ephemeral overrides. This is useful for when your cluster is behind a load balancer or proxy.

  • immuta.ephemeral.host.override.name-node

    • Description: In an HA cluster it may be a good idea to specify the NameNode on which Hive is running for ephemeral overrides. This should contain the NameNode from configuration that is hosting HiveServer2.

  • immuta.secure.truststore.enabled

    • Default: false

    • Description: Enables TLS truststore verification. If enabled without a custom truststore it will use the default.

  • immuta.secure.truststore

    • Description: Location of the truststore that contains the Immuta Web Service certificate.

  • immuta.secure.truststore.password

    • Description: Password for the truststore that contains the Immuta Web Service certificate.

  • immuta.spark.visibility.cache.timeout.seconds

    • Default: 30

    • Description: The amount of time in seconds the ImmutaContext or ImmutaSparkSession will cache visibilities from Immuta. Maximum of 30 seconds.

  • immuta.spark.visibility.read.timeout.seconds

    • Default: 300

    • Description: The socket read timeout in seconds for visibility calls to Immuta.

  • immuta.spark.audit.retries

    • Default: 2

    • Description: The number of times to retry audit calls to Immuta from Spark.

  • immuta.masked.jdbc.optimization.enabled

    • Default: true

    • Description: Enables pushing filters down to Postgres. This should only be changed to false if the user is joining to a non-Spark data source (in PostgreSQL) on a masked column.
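
The Spark items above can be sketched as a Hadoop-style configuration fragment, for example in a per-job configuration file. The truststore path is a hypothetical placeholder, and the choice to enable JDBC and truststore verification is illustrative only.

```xml
<?xml version="1.0"?>
<!-- Hadoop-style configuration (fragment) for an Immuta Spark application -->
<configuration>
  <property>
    <name>immuta.enable.jdbc</name>
    <!-- Default is false; true lets the job pull off-cluster data over JDBC -->
    <value>true</value>
  </property>
  <property>
    <name>immuta.postgres.fetch.size</name>
    <!-- Default shown -->
    <value>1000</value>
  </property>
  <property>
    <name>immuta.secure.truststore.enabled</name>
    <!-- Default is false -->
    <value>true</value>
  </property>
  <property>
    <name>immuta.secure.truststore</name>
    <!-- Hypothetical path to a truststore holding the Immuta Web Service certificate -->
    <value>/etc/immuta/truststore.jks</value>
  </property>
</configuration>
```

If the same keys were placed in spark-defaults.conf instead, each would carry the spark.hadoop. prefix (for example, spark.hadoop.immuta.enable.jdbc), since Spark passes such entries through to Hadoop configuration.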

Vulcan Service Configuration

The following configuration items are needed by the Immuta Vulcan Service. Some of these items are also shared with the NameNode plugin as they work in tandem to protect data in HDFS.

  • immuta.meta.store.token.dir

    • Default: /user/<Vulcan Service user>/tokens

    • Description: The directory in which temporary access tokens for HDFS files backing Hive/Impala data sources will be stored. This needs to be configured for the NameNode plugin as well in order to unlock files in HDFS.

  • immuta.meta.store.remote.token.dir

    • Default: /user/<Vulcan Service user>/remotetokens

    • Description: The directory in which temporary access tokens for remote/object storage (S3, GS, etc.) files backing Hive/Impala data sources will be stored.

  • immuta.spark.partition.generator.user

    • Default: immuta

    • Description: The username of the user that will be running the Vulcan Service. This should also be the short username of the Kerberos principal running the Vulcan Service.

  • immuta.secure.partition.generator.hostname

    • Default: localhost

    • Description: The interface/hostname that clients will use to communicate with the Vulcan Service.

  • immuta.secure.partition.generator.listen.address

    • Default: 0.0.0.0

    • Description: The interface/hostname on which the Vulcan Service will listen for connections.

  • immuta.secure.partition.generator.port

    • Default: 9070

    • Description: The port on which the Vulcan Service will listen for connections.

  • immuta.configuration.id.file.config

    • Default: hdfs:///user/<Vulcan Service user>/config_id

    • Description: The file in HDFS where the cluster configuration ID will be stored. This is used to keep track of the unique ID in Immuta tied to the current cluster.

  • immuta.secure.partition.generator.keystore

    • Description: Path to the keystore file to be used for securing the Vulcan Service with TLS.

  • immuta.secure.partition.generator.keystore.password

    • Description: The password for the keystore configured with immuta.secure.partition.generator.keystore.

  • immuta.secure.partition.generator.keymanager.password

    • Description: The configuration key for the key manager password for the keystore configured with immuta.secure.partition.generator.keystore.

  • immuta.secure.partition.generator.url.external

    • Default: <NameNode / master hostname>:<Vulcan Service port>

    • Description: The configuration key for specifying the externally addressable Vulcan Service URL. This URL must be reachable from the Immuta web app. If this is not set, the Vulcan Service will try to determine the URL based on its Hadoop configuration.

  • immuta.yarn.validation.params

    • Default: /user/<Vulcan Service user>/yarnParameters.json

    • Description: The file containing parameters to use when validating YARN applications for secure token generation for file access. When a Spark application requests tokens be generated for file access, the Vulcan Service will validate that the Spark application is configured properly using the parameters from this file.

  • immuta.emrfs.credential.file.path

    • Description: For EMR/EMRFS only. This configuration points to a file containing AWS credentials that the Vulcan Service can use for accessing data in S3. This is also useful for Hive/the hive user so that (if impersonation is turned on) only a few users (hive and the Vulcan Service user) on cluster can access data in S3 while everyone else is forced through the Vulcan Service.

  • immuta.workspace.allow.create.table

    • Default: false

    • Description: True if the user should be allowed to create workspace tables. Users will not be able to drop their created tables if Sentry object ownership is not set to ALL.

  • immuta.partition.tokens.ttl.seconds

    • Default: 3600

    • Description: How long in seconds Immuta temporary file access tokens should live in HDFS before being cleaned up.

  • immuta.partition.tokens.interval.seconds

    • Default: 1800

    • Description: Number of seconds between runs of the token cleanup job which will delete all expired temporary file access tokens from HDFS.

  • immuta.scheduler.heartbeat.enable

    • Default: true

    • Description: True to enable sending configuration to the Immuta Web Service and updating it on an interval. This can be set to false to prevent this cluster from appearing in the HDFS configurations dropdown for HDFS data sources and from being used for workspaces. This makes sense for ephemeral (EMR) clusters.

  • immuta.scheduler.heartbeat.initial.delay.seconds

    • Default: 0

    • Description: When starting the Vulcan Service, how long in seconds to wait before first sending configuration to the Immuta Web Service.

  • immuta.scheduler.heartbeat.interval.seconds

    • Default: 28800

    • Description: How long in seconds to wait between each configuration update submission to the Immuta Web Service.

  • immuta.file.session.store.expiration.seconds

    • Default: 900

    • Description: Number of seconds that idle remote file sessions will be kept active in the Vulcan Service. This is for Spark clients that are reading remote data (S3, GS) via the Vulcan Service.

  • immuta.file.session.status.expiration.seconds

    • Default: 300

    • Description: Number of seconds that the Vulcan Service will cache file statuses from remote object storage.

  • immuta.file.session.status.max.size

    • Default: 250

    • Description: Maximum number of file status objects that the Vulcan Service will cache at one time.

  • immuta.yarn.api.num.retries

    • Default: 5

    • Description: Number of times that the YARN Validator will attempt to contact the YARN resource manager API to validate a Spark application for partition tokens.
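
For orientation, a Vulcan Service configuration might look roughly like the fragment below. It assumes that generator.xml uses the same Hadoop-style configuration/property layout as core-site.xml, which this page does not state explicitly. The external hostname is a hypothetical placeholder; other values repeat the defaults listed above.

```xml
<?xml version="1.0"?>
<!-- generator.xml (fragment): Vulcan Service settings; layout assumed to be Hadoop-style -->
<configuration>
  <property>
    <name>immuta.spark.partition.generator.user</name>
    <!-- Default shown; should match the short name of the Kerberos principal -->
    <value>immuta</value>
  </property>
  <property>
    <name>immuta.secure.partition.generator.listen.address</name>
    <!-- Default shown: listen on all interfaces -->
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>immuta.secure.partition.generator.port</name>
    <!-- Default shown -->
    <value>9070</value>
  </property>
  <property>
    <name>immuta.secure.partition.generator.url.external</name>
    <!-- Hypothetical host; must be reachable from the Immuta web app -->
    <value>vulcan.example.com:9070</value>
  </property>
  <property>
    <name>immuta.scheduler.heartbeat.enable</name>
    <!-- Default is true; false may suit ephemeral (EMR) clusters -->
    <value>false</value>
  </property>
</configuration>
```

Keystore and key manager passwords are omitted from this sketch because they are sensitive and would normally be supplied through a protected mechanism.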

Copyright © 2014-2024 Immuta Inc. All rights reserved.