Performance Optimization
Audience: System Administrators
Content Summary: This page describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.
Overview
Immuta runs inside a locked operation in the NameNode when granting or denying permissions based on Immuta policies. This section describes configuration settings and strategies that help prevent RPC queue latency, waiting threads, and other issues that affect cluster-wide file permission checks.
Deployment Architecture
Isolated HDFS Namespace
Best Practice: NameNode Plugin Configuration
Immuta recommends configuring the NameNode Plugin to check permissions only on the NameNode(s) that oversee the data you want to protect.
For example, say that you currently have a federated HDFS NameNode architecture with three nameservices: nameservice1, nameservice2, and nameservice3. The HDFS federation in this example is distributed across these nameservices as described below.
nameservice1: /data, /tmp/, /user
nameservice2: /data2
nameservice3: /data3
Suppose you know that all the sensitive data that you want to protect with Immuta is located under /data3. To achieve optimum performance in this case, add the Immuta NameNode-only configuration (hdfs-site.xml) to the role config group for nameservice3, and leave it out of nameservice1 and nameservice2. The public / client Immuta configuration (core-site.xml) should still be configured cluster-wide. See Immuta CDH Integration Installation for more details about these configuration groupings.
One caveat is that Immuta's Vulcan service requires the Immuta NameNode Plugin to oversee user credentials, which are stored in /user/<username> by default. Vulcan also stores some configuration under /user/immuta by default. This is a problem because /user resides under nameservice1, and the goal is to operate the Immuta NameNode Plugin only on nameservice3.
A simple solution to this problem is to create a new directory for these credentials, /data3/immuta_creds for example, and configure the NameNode Plugin and the Vulcan service to use this directory instead of /user. Changing this requires the configuration modifications listed below.
HDFS - Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
Set immuta.generated.api.key.dir and immuta.credentials.dir to /data3/immuta_creds.
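As an illustration, the two settings above could be entered in the safety valve as standard Hadoop property elements. This is a minimal sketch: the property names and the /data3/immuta_creds path come from the text above, while the snippet layout is an assumption based on the usual Hadoop XML configuration format.

```xml
<!-- core-site.xml safety valve (cluster-wide); sketch only -->
<property>
  <name>immuta.generated.api.key.dir</name>
  <value>/data3/immuta_creds</value>
</property>
<property>
  <name>immuta.credentials.dir</name>
  <value>/data3/immuta_creds</value>
</property>
```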
Immuta - Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml
Set immuta.meta.store.token.dir to /data3/immuta_creds/immuta/tokens.
Set immuta.meta.store.remote.token.dir to /data3/immuta_creds/immuta/remotetokens.
Set immuta.configuration.id.file.config to hdfs://nameservice3/data3/immuta_creds/immuta/config_id.
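The three Vulcan settings above might look like the following when written out. The names and values are taken directly from the text; the assumption here is that the session/generator.xml safety valve accepts the same property element layout as other Hadoop-style XML snippets.

```xml
<!-- session/generator.xml safety valve for the Vulcan server; sketch only -->
<property>
  <name>immuta.meta.store.token.dir</name>
  <value>/data3/immuta_creds/immuta/tokens</value>
</property>
<property>
  <name>immuta.meta.store.remote.token.dir</name>
  <value>/data3/immuta_creds/immuta/remotetokens</value>
</property>
<property>
  <name>immuta.configuration.id.file.config</name>
  <value>hdfs://nameservice3/data3/immuta_creds/immuta/config_id</value>
</property>
```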
Note that you will need to manually create the /data3/immuta_creds/immuta directory and set its permissions so that only the immuta user can read and write in that directory. The /data3/immuta_creds directory should also be world-writable so that each user's directory can be created the first time that user interacts with Immuta on the cluster.
Configuration
Essential Performance Tuning Settings
immuta.permission.paths.to.enforce
Description: A comma-delimited list of paths to enforce when checking permissions on HDFS files. This ensures that API calls to the Immuta web service are only made when permissions are being checked on the paths that you specify in this configuration. This also means that you can only create data sources against data that lives under these paths, and the Immuta Workspace must be under one of these paths as well. Alternatively, immuta.permission.paths.to.ignore can be set to a list of paths that you know do not contain Immuta data; API calls will then never be made against those paths. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce at the same time is unsupported.
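A minimal sketch of the enforce-list approach, reusing the /data3 path from the federation example earlier on this page. It assumes the property belongs in the NameNode-only configuration (hdfs-site.xml); the value shown is illustrative and should be replaced with the paths that actually hold Immuta-protected data.

```xml
<!-- NameNode hdfs-site.xml safety valve (assumed location); sketch only -->
<!-- Only paths under /data3 trigger calls to the Immuta web service. -->
<!-- Do not set immuta.permission.paths.to.ignore alongside this property. -->
<property>
  <name>immuta.permission.paths.to.enforce</name>
  <value>/data3</value>
</property>
```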
immuta.permission.groups.to.enforce
Description: A comma-delimited list of groups that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, then fallback authorizations will apply to everyone by default, unless they are in a group on this list. If a user is on both the enforce list and the ignore list, then their permissions will be checked with Immuta (i.e., the enforce configuration item takes precedence). This may improve NameNode performance by only making permission check API calls for the subset of users who fall under Immuta enforcement.
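For example, the group list might be written as follows. The group names analysts and data_science are hypothetical placeholders, and placing the property in the NameNode hdfs-site.xml is an assumption.

```xml
<!-- NameNode hdfs-site.xml safety valve (assumed location); sketch only -->
<!-- "analysts" and "data_science" are hypothetical group names. -->
<property>
  <name>immuta.permission.groups.to.enforce</name>
  <value>analysts,data_science</value>
</property>
```

With a setting like this, users outside the listed groups would fall through to the fallback authorizations without a permission check API call.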
immuta.permission.source.cache.enabled
Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources. For performance optimization, it is best to enable this cache to act as a "backup" to immuta.permission.paths.to.enforce.
immuta.permission.source.cache.timeout.seconds
Description: The time between calls to sync/cache all paths that back Immuta data sources in HDFS. You can increase this value to further reduce the number of API calls made from the NameNode.
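A sketch of the source path cache settings together. The cache switch is immuta.permission.source.cache.enabled; the sync interval is assumed here to be immuta.permission.source.cache.timeout.seconds, a name taken from the advanced settings list on this page. The value of 300 is purely illustrative and is not a documented default.

```xml
<!-- NameNode hdfs-site.xml safety valve (assumed location); sketch only -->
<property>
  <name>immuta.permission.source.cache.enabled</name>
  <value>true</value>
</property>
<!-- Illustrative interval: a larger value means fewer sync calls. -->
<property>
  <name>immuta.permission.source.cache.timeout.seconds</name>
  <value>300</value>
</property>
```

A longer interval reduces API calls from the NameNode, at the cost of newly created data sources taking longer to appear in the cache.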
immuta.permission.workspace.base.path.override
Description: This configuration item can be set so that the NameNode does not have to retrieve the Immuta HDFS workspace base path periodically from the Immuta API.
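A possible form of this override is shown below. It assumes the value is an HDFS path; /data3/immuta_workspace is a hypothetical location, chosen under /data3 because the workspace must sit under one of the enforced paths.

```xml
<!-- NameNode hdfs-site.xml safety valve (assumed location); sketch only -->
<!-- /data3/immuta_workspace is a hypothetical workspace base path. -->
<property>
  <name>immuta.permission.workspace.base.path.override</name>
  <value>/data3/immuta_workspace</value>
</property>
```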
Advanced Cache and Network Settings
There are also a wide variety of cache and network settings that can be used to fine-tune performance. You can refer to the Configuration Guide for details on each of these items.
immuta.permission.source.cache.timeout.seconds
immuta.permission.source.cache.retries
immuta.permission.request.initial.delay.milliseconds
immuta.permission.request.socket.timeout
immuta.no.data.source.cache.timeout.seconds
immuta.hive.impala.cache.timeout.seconds
immuta.canisee.cache.timeout.seconds
immuta.data.source.cache.timeout.seconds
immuta.canisee.metastore.cache.timeout.seconds
immuta.canisee.non.user.cache.timeout.seconds
immuta.canisee.num.retries
immuta.project.user.cache.timeout.seconds
immuta.project.cache.timeout.seconds
immuta.project.forbidden.cache.timeout.seconds
immuta.permission.system.details.retries
Debugging Suspected Performance Issues
See Immuta Log Analysis Tool for CDH Deployments for instructions on how to identify performance issues in the Immuta NameNode Plugin.