Hadoop and Spark Plugin Configuration
Audience: System Administrators
Content Summary: This page outlines the on-cluster configurations for Immuta's Hadoop and Spark plugins. Most of these values are consistent across Hadoop providers; however, some values are provider-specific. To learn more about provider-specific deployments, see the installation guides for Cloudera and Amazon EMR.
Components
Immuta NameNode Plugin
The NameNode plugin runs on each HDFS NameNode as the hdfs
user. It will have access to any configuration items available to HDFS clients as well as potentially additional configuration items for the NameNode only. The configuration for the NameNode plugin can be placed in an alternate configuration file (detailed below) to avoid leaking sensitive configuration items.
The NameNode plugin configurations can be set in core-site.xml
and hdfs-site.xml
(for NameNode-specific values).
Immuta Vulcan Service
The Vulcan Service is an Immuta service that is mostly relevant to Spark applications. It has its own configuration file (generator.xml
) and also reads all system-wide/client configuration for Hadoop (core-site.xml
).
Hadoop Clients
Clients of HDFS/Hadoop services are Spark jobs, MapReduce jobs, and other user-driven applications in the Hadoop ecosystem. The configuration items for clients can be provided system-wide in core-site.xml
or configured per-job (typically) on the command line or in application/job configuration.
Spark Applications
There is an additional generator.xml
file, created for Spark applications only, that contains connection information for the Vulcan Service. Immuta configuration can also be added to spark-defaults.conf
for system-wide application to Spark jobs. Unless otherwise stated, items in spark-defaults.conf
should be prefixed with spark.hadoop.
because they are read from Hadoop configuration.
Public NameNode and Hadoop Client Configuration
Public configuration is not sensitive, and is shared by client libraries such as ImmutaApiKeyAuth
and the NameNode plugin (as well as potentially other Immuta and non-Immuta services on the cluster). These configuration items should be in a core-site.xml
file distributed across the cluster and readable by all users.
immuta.generated.api.key.dir
Default:
/user
Description: The base directory under which the NameNode plugin will look for generated API keys for use with the Immuta Web Service. The default value is
/user
with the username and .immuta_generated
appended, so that each user has their own generated API key directory. The .immuta_generated
directory adds an additional layer of protection so other users can't listen on the /user/<username>
directory to wait for API keys to be generated. This configuration item should never point at a non-HDFS path because attempting to generate credentials outside of HDFS is invalid. This item should be in sync between the NameNode plugin's configuration and client configuration.
immuta.credentials.dir
Default:
/user
Description: A directory which will be used to store each user's Immuta API key and token for use with the Immuta Web Service. The user's API key and token are stored this way to avoid re-authenticating frequently with the web service and introducing additional overhead to processes like MapReduce and Spark. Similar to the generated API key directory, this configuration item defaults to
/user
with the username of the current user added on. Each user should have a directory under the credentials directory for storing their own credentials. NOTE: It is valid for a user to provide and save their own API key in /user/<username>/immuta_api_key
so that their code does not attempt to generate an API key. It is also valid to override this value with a non-HDFS path in case HDFS is not being used (Spark in a non-HDFS environment, for example); e.g., file:///home/
would point to file:///home/<username>/immuta_api_key
with the user's API key file.
immuta.base.url
Description: The URL at which the Immuta API can be reached. This should be the base URL of the Immuta API.
fs.immuta.impl
Description: This configuration allows users to access the
immuta://
scheme in order to have their filesystem built in the same way that the Immuta FUSE filesystem is built. This filesystem is also used in Spark deployments, which read data from external object storage (e.g., S3). This means that users will have consistent filesystem views regardless of where they are accessing Immuta. This is not set by default and must be set to com.immuta.hadoop.ImmutaFileSystem
system-wide in core-site.xml
.
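As an illustration, the two items above might appear in a cluster-wide core-site.xml roughly as sketched below. The Immuta URL is a placeholder invented for this example; the filesystem class name is the one given above.

```xml
<!-- core-site.xml (fragment), readable by all users -->
<property>
  <name>immuta.base.url</name>
  <!-- Placeholder URL: replace with the base URL of your Immuta API -->
  <value>https://immuta.example.com</value>
</property>
<property>
  <name>fs.immuta.impl</name>
  <value>com.immuta.hadoop.ImmutaFileSystem</value>
</property>
```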
immuta.cluster.name
Default: hostname from
fs.defaultFS
Description: This configuration item identifies a cluster to the Immuta Web Service. This is very important because it determines how file access is controlled in HDFS by the NameNode plugin and which data sources are available to a cluster. The default value is taken from
fs.defaultFS
. When an organization has multiple HA HDFS clusters, they may all share the same nameservice name, so administrators should set this value explicitly on each cluster for identification purposes.
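For example, two HA clusters that happen to share a nameservice name could be told apart by giving each a distinct value. The name below is hypothetical and only illustrates the shape of the setting.

```xml
<!-- core-site.xml (fragment) on one of the clusters -->
<property>
  <name>immuta.cluster.name</name>
  <!-- Hypothetical cluster name; use a different value on each cluster -->
  <value>analytics-prod</value>
</property>
```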
immuta.api.key
Description: (CLIENT ONLY) Users can configure their own API key when running jobs or interacting with an HDFS client, but if an API key is not configured for the user it will be generated on the first attempt to communicate with the Immuta service and stored securely in their credential directory (described above). Immuta uses the
Configuration.getPassword()
method to retrieve this configuration item, so it may also be set using the Hadoop CredentialProvider
API.
immuta.permission.fallback.class
Default:
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider
(HDFS 2.6.x/CDH), org.apache.hadoop.hdfs.server.namenode.DefaultINodeAttributesProvider
(HDFS 2.7+). Sentry:
org.apache.sentry.hdfs.SentryINodeAttributesProvider (HDFS 2.7+)
Description: The configuration key for the fully qualified class name of the fallback permission checking class that will be used after the Immuta authorization or inode attribute provider.
immuta.permission.allow.fallback
Default:
false
Description: Denotes the action that the Immuta permission checking classes will take when a user is forbidden access to data in Immuta. If set to
true
every time a user is denied access to a file via Immuta their permissions will be checked against the underlying default permission checker, potentially meaning that they will still have access to data that they cannot access via Immuta.
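A minimal sketch of the two fallback items, assuming an HDFS 2.7+ cluster without Sentry. It uses the default class listed above and keeps fallback disabled, so a denial from Immuta is final.

```xml
<!-- core-site.xml (fragment) -->
<property>
  <name>immuta.permission.fallback.class</name>
  <value>org.apache.hadoop.hdfs.server.namenode.DefaultINodeAttributesProvider</value>
</property>
<property>
  <name>immuta.permission.allow.fallback</name>
  <!-- false: users denied by Immuta are not re-checked against the fallback class -->
  <value>false</value>
</property>
```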
immuta.permission.users.to.ignore
Default:
hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,immuta
Description: CSV list of users that will not ever have their HDFS file accesses checked in Immuta. This should include any system superusers to avoid overhead of checking permissions in Immuta that should not be relevant.
immuta.permission.groups.to.ignore
Description: Same as
immuta.permission.users.to.ignore
but for groups.
immuta.permission.users.to.enforce
Description: A comma delimited list of users that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, then fallback authorizations will apply to everyone by default, unless they are on this list. If a user is on both the enforce list and the ignore list, then their permissions will be checked with Immuta (i.e., the enforce configuration item takes precedence).
immuta.permission.groups.to.enforce
Description: Same as
immuta.permission.users.to.enforce
but for groups.
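The precedence rule between the ignore and enforce lists can be sketched as follows. The user name etl_svc is hypothetical; because it appears on both lists, its file accesses would be checked with Immuta.

```xml
<!-- core-site.xml (fragment) -->
<property>
  <name>immuta.permission.users.to.ignore</name>
  <!-- etl_svc is a hypothetical service account added for illustration -->
  <value>hdfs,yarn,hive,impala,etl_svc</value>
</property>
<property>
  <name>immuta.permission.users.to.enforce</name>
  <!-- Enforce takes precedence, so etl_svc is still checked with Immuta.
       Setting this item also means users not listed here use fallback authorizations. -->
  <value>etl_svc</value>
</property>
```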
immuta.permission.paths.to.enforce
Description: A comma delimited list of paths to enforce when checking permissions on HDFS files. If this configuration item is set, then these paths and their children will be checked in Immuta. All other paths will use fallback authorizations. WARNING: Setting this property effectively disables Immuta file permission checking for all paths not in this configuration item. Setting both
immuta.permission.paths.to.ignore
and immuta.permission.paths.to.enforce
properties at the same time is unsupported.
immuta.permission.paths.to.ignore
Description: A comma delimited list of paths to ignore when checking permissions on HDFS files. If this configuration item is set, then these paths and their children will use fallback authorizations and not go through Immuta. All other paths will be checked with Immuta. Setting both
immuta.permission.paths.to.ignore
and immuta.permission.paths.to.enforce
properties at the same time is unsupported.
immuta.system.details.cache.timeout.seconds
Default:
1800
Description: The number of seconds to cache system detail information from the Immuta Web Service. This should be high since, ideally, the relevant values in Immuta configuration won't change often (or ever).
immuta.permission.workspace.ignored.users
Default:
hive,impala
Description: Comma-delimited list of users that should be ignored when accessing workspace directories. This should never have to change since the default Hive and Impala principals are covered, but this can be modified in case of non-standard configuration. This list is separate from the ignored user list above because we do not want to allow access to ignored non-system users who may be operating on a cluster with Immuta installed but who should not be allowed to see workspace data. This should be limited to the principals for Hive and Impala.
NameNode-only Configuration
The following configuration items are only relevant to the NameNode plugin. These are typically set somewhere like hdfs-site.xml
and for the most part they are not sensitive. There are some highly sensitive configuration items, and those should be set in such a way that only the NameNode process has the ability to read them. Immuta provides one solution for this: have an additional NameNode plugin configuration file that must be configured elsewhere (such as hdfs-site.xml
) and is only readable by the hdfs
user. This will be detailed below.
immuta.extra.name.node.plugin.config
Description: Path to Hadoop-style XML configuration file containing items that will be used by the Immuta NameNode plugin. This item helps to configure sensitive information in a way that will only be readable by the
hdfs
user to avoid leaking sensitive configuration to other users. This should be in the form file:///path/to/file.xml
.
immuta.system.api.key
Description: HIGHLY SENSITIVE. This configuration item is used by the NameNode plugin (and the Vulcan Service) to access privileged endpoints of the Immuta API. This is a required configuration item for both the NameNode plugin and Vulcan Service.
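One way the two items above could fit together is sketched below: hdfs-site.xml points at a separate file, and only that file holds the system API key. The file path and the key value are placeholders invented for this example, and the restricted file is assumed to be readable only by the hdfs user.

```xml
<!-- hdfs-site.xml (fragment): points the NameNode plugin at the restricted file.
     The path is a hypothetical example. -->
<property>
  <name>immuta.extra.name.node.plugin.config</name>
  <value>file:///etc/immuta/namenode-plugin.xml</value>
</property>

<!-- /etc/immuta/namenode-plugin.xml (separate file): readable only by the hdfs user -->
<configuration>
  <property>
    <name>immuta.system.api.key</name>
    <!-- Placeholder value; never store a real key in a world-readable file -->
    <value>REPLACE_WITH_SYSTEM_API_KEY</value>
  </property>
</configuration>
```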
immuta.no.data.source.cache.timeout.seconds
Default:
60
Description: The amount of time in seconds that the NameNode plugin will cache the fact that a specific path is not a part of any Immuta data sources.
immuta.hive.impala.cache.timeout.seconds
Default:
60
Description: The amount of time in seconds to cache the fact that a user is subscribed to a Hive or Impala data source containing the target file they are attempting to access.
immuta.canisee.cache.timeout.seconds
Default:
30
Description: The amount of time in seconds to cache the access result from Immuta for a user/path pair.
immuta.specific.access.cache.timeout
Default:
10
Description: The amount of time to temporarily unlock a file in HDFS for a user using temporary access tokens with files backing Hive and Impala data sources in Spark.
immuta.data.source.cache.timeout.seconds
Default:
300
Description: The amount of time in seconds that users' subscribed data sources should be cached in memory to avoid reaching out to Immuta for data sources over and over. Relevant to the Immuta Hadoop client FileSystem and Spark jobs.
immuta.canisee.metastore.cache.timeout.seconds
Default:
30
Description: The amount of time in seconds that the NameNode plugin will cache the fact that a path belongs to a Metastore (Impala or Hive) data source. Reduces network calls from NameNode to Immuta when Vulcan is accessing paths belonging to Metastore sources.
immuta.canisee.non.user.cache.timeout.seconds
Default:
30
Description: The amount of time that the NameNode plugin will cache that a user principal does not belong to an Immuta user. This is useful if the ignored/enforced users/groups configurations are not being used so that when the NameNode receives a 401 response from the canisee endpoint it will store that information and not retry canisee requests to Immuta during that time.
immuta.canisee.num.retries
Default:
1
Description: The number of times to retry access calls from the NameNode plugin to Immuta to account for network issues.
immuta.project.user.cache.timeout.seconds
Default:
300
Description: The amount of time in seconds that the
ImmutaGroupsMapping
will cache whether or not a principal is tied to an Immuta user account. This decreases the number of calls from HDFS to Immuta when there are accounts that are not tied to Immuta.
immuta.project.cache.timeout.seconds
Default:
30
Description: The amount of time in seconds that the
ImmutaGroupsMapping
will cache project and workspace information for a given project ID. This is also the amount of time a user's current project will be cached.
immuta.project.forbidden.cache.timeout.seconds
Default:
30
Description: The amount of time in seconds that the ImmutaCurrentProjectHelper will cache the fact that a principal tied to an Immuta user is being forbidden from using their current project.
immuta.workspace.deduplication.timeout.seconds
Default:
60
Description: The amount of time to wait before auditing duplicate workspace filesystem actions from HDFS. This is the amount of time the NameNode plugin will wait before a user reading or writing the same path will have duplicate audit records written to Immuta.
immuta.permission.system.details.retries
Default:
5
Description: The number of times the system details background worker will attempt to retrieve system details from the Immuta web service if an attempt fails.
immuta.permission.source.cache.enabled
Default:
false
Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources.
immuta.permission.source.cache.timeout.seconds
Default:
300
Description: The time between calls to sync/cache all paths that back Immuta data sources in HDFS.
immuta.permission.source.cache.retries
Default:
5
Description: The number of times the data source cache background worker will attempt to retry calls to Immuta on failure.
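A sketch of enabling the background data source path cache, with the sync interval and retry count left at the defaults listed above. Only the first value differs from its default.

```xml
<!-- hdfs-site.xml (fragment) -->
<property>
  <name>immuta.permission.source.cache.enabled</name>
  <!-- Default is false; true starts the background caching thread -->
  <value>true</value>
</property>
<property>
  <name>immuta.permission.source.cache.timeout.seconds</name>
  <value>300</value>
</property>
<property>
  <name>immuta.permission.source.cache.retries</name>
  <value>5</value>
</property>
```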
immuta.permission.request.retries
Default:
5
Description: The number of retries that the NameNode plugin will attempt for any blocking web request between HDFS and the Immuta API.
immuta.permission.request.initial.delay.milliseconds
Default:
250
Description: The initial delay for the BackoffRetryHelper that the NameNode plugin will employ for any retries of blocking web requests between HDFS and the Immuta API.
immuta.permission.request.socket.timeout
Default:
1500
Description: The time in milliseconds that the NameNode plugin will wait before cancelling a request to the Immuta API if no data has been read from the HTTP connection. This applies to blocking requests only.
immuta.permission.workspace.base.path.override
Description: This configuration item can be set so that the NameNode does not have to retrieve Immuta HDFS workspace base path periodically from the Immuta API.
Spark Application Configuration
The following items are relevant to any Immuta Spark applications using the ImmutaSparkSession
or ImmutaContext
.
immuta.spark.data.source.cache.timeout.seconds
Default:
30
Description: The amount of time in seconds that data source information will be cached in the user's Spark job. This reduces the number of times the client will need to refresh data source information.
immuta.spark.sql.account.expiration
Default:
2880
Description: The amount of time in seconds that temporary SQL account credentials, created by the Immuta Spark plugins for accessing queryable data sources via Postgres over JDBC, will remain valid.
immuta.postgres.fetch.size
Default:
1000
Description: The JDBC fetch size used for data sources accessed via Postgres over JDBC.
immuta.postgres.configuration
Description: The configuration key for any extra JDBC options that should be appended to the Immuta Postgres connection by the Immuta SQL Context. An example would include
sslfactory=org.postgresql.ssl.NonValidatingFactory
to turn off SSL validation.
immuta.enable.jdbc
Default:
false
Description: If
true
, allows the user's Spark job to query Immuta's Postgres instance automatically when the data source is detected to be off cluster and data must be pulled back via Postgres. This can be set per-job, but defaults to false to prevent a user from accidentally (and unknowingly) pulling huge amounts of data over JDBC.
immuta.ephemeral.host.override
Default:
true
Description: Set this to
false
if ephemeral overrides should not be enabled for Spark. When true,
this will automatically override ephemeral data source host names with an auto-detected host name on cluster that should be running HiveServer2. It is assumed HiveServer2 is running on the NameNode.
immuta.ephemeral.host.override.address
Description: This configuration item can be used if automatic detection of Hive's hostname should be disabled in favor of a static hostname to use for ephemeral overrides. This is useful for when your cluster is behind a load balancer or proxy.
immuta.ephemeral.host.override.name-node
Description: In an HA cluster it may be a good idea to specify the NameNode on which Hive is running for ephemeral overrides. This should contain the NameNode from configuration that is hosting HiveServer2.
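For a cluster behind a load balancer, a static override might look like the sketch below. The hostname is hypothetical. The property is shown in Hadoop XML form; in spark-defaults.conf the same key would carry the spark.hadoop. prefix described earlier.

```xml
<property>
  <name>immuta.ephemeral.host.override</name>
  <value>true</value>
</property>
<property>
  <name>immuta.ephemeral.host.override.address</name>
  <!-- Hypothetical load balancer hostname in front of HiveServer2 -->
  <value>hive-lb.example.com</value>
</property>
```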
immuta.secure.truststore.enabled
Default:
false
Description: Enables TLS truststore verification. If enabled without a custom truststore it will use the default.
immuta.secure.truststore
Description: Location of the truststore that contains the Immuta Web Service certificate.
immuta.secure.truststore.password
Description: Password for the truststore that contains the Immuta Web Service certificate.
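A sketch of the three truststore items used together. The truststore path, its JKS format, and the password are placeholders assumed for this example.

```xml
<property>
  <name>immuta.secure.truststore.enabled</name>
  <value>true</value>
</property>
<property>
  <name>immuta.secure.truststore</name>
  <!-- Hypothetical path to a truststore holding the Immuta Web Service certificate -->
  <value>/etc/immuta/truststore.jks</value>
</property>
<property>
  <name>immuta.secure.truststore.password</name>
  <!-- Placeholder value -->
  <value>REPLACE_WITH_TRUSTSTORE_PASSWORD</value>
</property>
```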
immuta.spark.visibility.cache.timeout.seconds
Default:
30
Description: The amount of time in seconds the
ImmutaContext
or ImmutaSparkSession
will cache visibilities from Immuta. Maximum of 30 seconds.
immuta.spark.visibility.read.timeout.seconds
Default:
300
Description: The socket read timeout for visibility calls to Immuta.
immuta.spark.audit.retries
Default:
2
Description: The number of times to retry audit calls to Immuta from Spark.
immuta.masked.jdbc.optimization.enabled
Default:
true
Description: Enables pushdown filters to Postgres. This should only be changed to false if the user is joining to a non-Spark data source (in PostgreSQL) on a masked column.
Vulcan Service Configuration
The following configuration items are needed by the Immuta Vulcan Service. Some of these items are also shared with the NameNode plugin as they work in tandem to protect data in HDFS.
immuta.meta.store.token.dir
Default:
/user/<Vulcan Service user>/tokens
Description: The directory in which temporary access tokens for HDFS files backing Hive/Impala data sources will be stored. This needs to be configured for the NameNode plugin as well in order to unlock files in HDFS.
immuta.meta.store.remote.token.dir
Default:
/user/<Vulcan Service user>/remotetokens
Description: The directory in which temporary access tokens for remote/object storage (S3, GS, etc) files backing Hive/Impala data sources will be stored.
immuta.spark.partition.generator.user
Default:
immuta
Description: The username of the user that will be running the Vulcan Service. This should also be the short username of the Kerberos principal running the Vulcan Service.
immuta.secure.partition.generator.hostname
Default:
localhost
Description: The interface/hostname that clients will use to communicate with the Vulcan Service.
immuta.secure.partition.generator.listen.address
Default:
0.0.0.0
Description: The interface/hostname on which the Vulcan Service will listen for connections.
immuta.secure.partition.generator.port
Default:
9070
Description: The port on which the Vulcan Service will listen for connections.
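The distinction between the client-facing hostname and the listen address can be sketched as below. This assumes generator.xml uses the same Hadoop-style property layout as core-site.xml, and the hostname is a hypothetical example.

```xml
<!-- generator.xml (fragment); layout assumed to be Hadoop-style XML -->
<property>
  <name>immuta.secure.partition.generator.hostname</name>
  <!-- Hypothetical hostname that clients use to reach the Vulcan Service -->
  <value>vulcan.cluster.example.com</value>
</property>
<property>
  <name>immuta.secure.partition.generator.listen.address</name>
  <!-- Default: listen on all interfaces -->
  <value>0.0.0.0</value>
</property>
<property>
  <name>immuta.secure.partition.generator.port</name>
  <value>9070</value>
</property>
```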
immuta.configuration.id.file.config
Default:
hdfs:///user/<Vulcan Service user>/config_id
Description: The file in HDFS where the cluster configuration ID will be stored. This is used to keep track of the unique ID in Immuta tied to the current cluster.
immuta.secure.partition.generator.keystore
Description: Path to the keystore file to be used for securing the Vulcan Service with TLS.
immuta.secure.partition.generator.keystore.password
Description: The password for the keystore configured with
immuta.secure.partition.generator.keystore
.
immuta.secure.partition.generator.keymanager.password
Description: The configuration key for the key manager password for the keystore configured with
immuta.secure.partition.generator.keystore
.
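A sketch of the three keystore items for securing the Vulcan Service with TLS. The keystore path and both passwords are placeholders, and the Hadoop-style property layout is an assumption about the Vulcan Service configuration file.

```xml
<property>
  <name>immuta.secure.partition.generator.keystore</name>
  <!-- Hypothetical keystore path -->
  <value>/etc/immuta/vulcan-keystore.jks</value>
</property>
<property>
  <name>immuta.secure.partition.generator.keystore.password</name>
  <!-- Placeholder value -->
  <value>REPLACE_WITH_KEYSTORE_PASSWORD</value>
</property>
<property>
  <name>immuta.secure.partition.generator.keymanager.password</name>
  <!-- Placeholder value -->
  <value>REPLACE_WITH_KEYMANAGER_PASSWORD</value>
</property>
```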
immuta.secure.partition.generator.url.external
Default:
<NameNode / master hostname>:<Vulcan Service port>
Description: The configuration key for specifying the externally addressable Vulcan Service URL. This URL must be reachable from the Immuta web app. If this is not set, the Vulcan Service will try to determine the URL based on its Hadoop configuration.
immuta.yarn.validation.params
Default:
/user/<Vulcan Service user>/yarnParameters.json
Description: The file containing parameters to use when validating YARN applications for secure token generation for file access. When a Spark application requests tokens be generated for file access, the Vulcan Service will validate that the Spark application is configured properly using the parameters from this file.
immuta.emrfs.credential.file.path
Description: For EMR/EMRFS only. This configuration points to a file containing AWS credentials that the Vulcan Service can use for accessing data in S3. This is also useful for Hive/the
hive
user so that (if impersonation is turned on) only a few users (hive
and the Vulcan Service user) on cluster can access data in S3 while everyone else is forced through the Vulcan Service.
immuta.workspace.allow.create.table
Default:
false
Description: True if the user should be allowed to create workspace tables. Users will not be able to drop their created tables if Sentry object ownership is not set to ALL.
immuta.partition.tokens.ttl.seconds
Default:
3600
Description: How long in seconds Immuta temporary file access tokens should live in HDFS before being cleaned up.
immuta.partition.tokens.interval.seconds
Default:
1800
Description: Number of seconds between runs of the token cleanup job which will delete all expired temporary file access tokens from HDFS.
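The two token cleanup items are shown together below with their default values. Since expired tokens are only deleted when the cleanup job runs, a token file could plausibly remain in HDFS for up to roughly the TTL plus one cleanup interval. That bound is an inference from the descriptions above rather than documented behavior.

```xml
<property>
  <name>immuta.partition.tokens.ttl.seconds</name>
  <!-- Tokens are considered expired after one hour -->
  <value>3600</value>
</property>
<property>
  <name>immuta.partition.tokens.interval.seconds</name>
  <!-- The cleanup job runs every 30 minutes -->
  <value>1800</value>
</property>
```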
immuta.scheduler.heartbeat.enable
Default:
true
Description: True to enable sending configuration to the Immuta Web Service and updating on an interval. This can be set to
false
to prevent this cluster from being available in the HDFS configurations dropdown for HDFS data sources as well as prevent it from being used for workspaces. This makes sense for ephemeral (EMR) clusters.
immuta.scheduler.heartbeat.initial.delay.seconds
Default:
0
Description: When starting the Vulcan Service, how long in seconds to wait before first sending configuration to the Immuta Web Service.
immuta.scheduler.heartbeat.interval.seconds
Default:
28800
Description: How long in seconds to wait between each configuration update submission to the Immuta Web Service.
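For an ephemeral cluster that should not appear in the HDFS configurations dropdown, the heartbeat could be switched off as sketched here. The Hadoop-style property layout is an assumption about the Vulcan Service configuration file.

```xml
<property>
  <name>immuta.scheduler.heartbeat.enable</name>
  <!-- false: do not send this cluster's configuration to the Immuta Web Service -->
  <value>false</value>
</property>
```

With the heartbeat disabled, the initial delay and interval settings above would presumably have no effect.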
immuta.file.session.store.expiration.seconds
Default:
900
Description: Number of seconds that idle remote file sessions will be kept active in the Vulcan Service. This is for Spark clients that are reading remote data (S3, GS) via the Vulcan Service.
immuta.file.session.status.expiration.seconds
Default:
300
Description: Number of seconds that the Vulcan Service will cache file statuses from remote object storage.
immuta.file.session.status.max.size
Default:
250
Description: Maximum number of file status objects that the Vulcan Service will cache at one time.
immuta.yarn.api.num.retries
Default:
5
Description: Number of times that the YARN Validator will attempt to contact the YARN resource manager API to validate a Spark application for partition tokens.