Audience: System Administrators
Content Summary: This page describes ephemeral overrides for Databricks data sources.
Best Practices: Ephemeral Overrides
Disable ephemeral overrides for clusters when using multiple workspaces, and dedicate a single cluster in a single workspace to serve queries from Immuta.
If you use multiple E2 workspaces without disabling ephemeral overrides, avoid applying "where user" row-level policies to data sources.
In Immuta, a Databricks data source is considered ephemeral, meaning that the compute resources associated with that data source will not always be available.
Ephemeral data sources allow the use of ephemeral overrides, user-specific connection parameter overrides that are applied to Immuta metadata operations and queries that the user runs through the Query Editor.
When a user runs a Spark job in Databricks, the Immuta plugins automatically submit ephemeral overrides for that user to Immuta for all applicable data sources, so that the current cluster serves as compute for all of that user's subsequent metadata operations against those data sources.
Example Query and Ephemeral Override Request
- A user runs a query on cluster B.
- The Immuta plugins on the cluster detect that the user is subscribed to data sources 1, 2, and 3 and that data sources 1 and 3 are both present in the Metastore for cluster B.
- Immuta’s plugins submit ephemeral override requests for data sources 1 and 3 to override their connections with the HTTP path from cluster B.
- Since data source 2 is not present in the Metastore, it is marked as a JDBC source.
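The decision the plugins make in the steps above can be sketched as follows. This is an illustrative model only, not Immuta's actual implementation; the function name, data source identifiers, and HTTP path are hypothetical:

```python
# Illustrative sketch: split a user's subscribed data sources into
# ephemeral override requests (present in this cluster's Metastore)
# and sources that fall back to JDBC.
def partition_sources(subscribed, metastore_tables, cluster_http_path):
    overrides = {}
    jdbc_sources = []
    for source in subscribed:
        if source in metastore_tables:
            # Present in the Metastore: override the data source
            # connection with this cluster's HTTP path.
            overrides[source] = cluster_http_path
        else:
            # Not in the Metastore: mark as a JDBC source.
            jdbc_sources.append(source)
    return overrides, jdbc_sources

# Mirrors the example: the user is subscribed to data sources 1, 2, and 3,
# and cluster B's Metastore contains only 1 and 3.
overrides, jdbc = partition_sources(
    subscribed=["ds1", "ds2", "ds3"],
    metastore_tables={"ds1", "ds3"},
    cluster_http_path="sql/protocolv1/o/0/cluster-b",  # hypothetical path
)
print(overrides)  # override requests for ds1 and ds3 with cluster B's path
print(jdbc)       # ['ds2']
```

Only the sources found in the current cluster's Metastore get their connections overridden; everything else is left to JDBC, which leads to the error below when JDBC is not enabled.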
If the user attempts to query data source 2 and they have not enabled JDBC sources, they will be presented with an error message telling them to do so:
com.immuta.spark.exceptions.ImmutaConfigurationException: This query plan will cause data to be pulled over JDBC. This spark context is not configured to allow this. To enable JDBC set immuta.enable.jdbc=true in the spark context hadoop configuration.
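The error message names the Hadoop configuration key to set. One way to set it on a Databricks cluster is through the cluster's Spark config using the standard `spark.hadoop.` prefix, which Spark forwards into the Hadoop configuration. Applying it this way to an Immuta-enabled cluster is an assumption here; any mechanism that places the key in the Spark context's Hadoop configuration should work:

```
spark.hadoop.immuta.enable.jdbc true
```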
Immuta Operations that Use Ephemeral Overrides
Ephemeral overrides are enabled by default because Immuta must be aware of a running cluster that can serve metadata queries. The operations that use ephemeral overrides include:
- Visibility checks on the data source for a particular user. These checks assess how to apply row-level policies for specific users.
- Stats collection triggered by a specific user.
- Validating a custom WHERE clause policy against a data source. When owners or governors create custom WHERE clause policies, Immuta uses compute resources to validate the SQL in the policy. In this case, the ephemeral overrides for the user writing the policy are used to contact a cluster for SQL validation.
- High Cardinality Column detection. Certain advanced policy types (e.g., minimization and randomized response) in Immuta require a High Cardinality Column, and that column is computed on data source creation. It can be recomputed on demand and, if so, will use the ephemeral overrides for the user requesting computation.
However, ephemeral overrides can be problematic in environments that have a dedicated cluster to handle maintenance activities, since ephemeral overrides can cause these operations to execute on a different cluster than the dedicated one.
Configure Overrides in Immuta-Enabled Clusters
To reduce the risk that a user's overrides point to a cluster (or clusters) that is not currently running, either
- direct all clusters' HTTP paths for overrides to a cluster dedicated to metadata queries, or
- disable overrides completely.
Disable Ephemeral Overrides
To disable ephemeral overrides, set the ephemeral override property to false in the cluster's spark-defaults.conf file.
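As a sketch, the spark-defaults.conf entry might look like the following. The property name `immuta.ephemeral.host.override` is an assumption; confirm the exact key against the configuration reference for your Immuta version before applying it:

```
immuta.ephemeral.host.override false
```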