Immuta Best Practices (SQL)
Audience: Data Users
Content Summary: This page outlines some best practices for maximum performance from the Immuta Query Engine, which is an abstraction of your underlying database tables. This virtual layer is powerful as a single point to access all data across your organization and apply data policies consistently.
Immuta has the ability to push down many different queries, operations, and functions depending on the native database. Below are common best practices to consider when using the Immuta Query Engine in order to take advantage of the pushdown optimization and to avoid problematic queries that add latency to the response.
- Limit result sets.
- Push work down to the native database.
- Use unions and window functions sparingly.
- Avoid joining across servers or technologies.
- Keep data types consistent.
Unless there’s a good reason to pull back a large result set, try to limit the response. The simplest way to
accomplish this is through a
WHERE clause that limits the response size or a
LIMIT. It is best to use
when possible, since whether or not a
LIMIT is pushed down depends on the query structure. Many business
intelligence tools append
LIMIT statements by default to avoid this issue. Treat Immuta similarly.
You are limited to PostgreSQL syntax when querying through the Immuta Query Engine. For a function to be pushed down, an equivalent function must exist in the native database. Immuta maintains a "Rosetta Stone" of database functions that map PostgreSQL syntax to their analogs in all of the data sources that Immuta supports.
If there is specific database syntax that is not pushed down or does not exist in PostgreSQL, you can create an Immuta data source using the SQL statement option (rather than the SQL table option) and include the unsupported native syntax as the SQL statement for the data source. When you query the data source that was created with a SQL statement you will still query the Immuta layer with PostgreSQL syntax, but the underlying native syntax will also be executed as a subquery on the native database (similar to a view), effectively pushing the syntax down.
UNION and most window functions are not currently pushed down, so they will run in the Immuta layer
if you are not aggregating or setting a
LIMIT on the results.
To aggregate or
LIMIT results of the
UNION, the data must
flow back to Immuta. One solution to this problem is to create a view that represents the
UNION, expose that
view in Immuta, and query that view.
Immuta will push down joins between data sources that share the same source database as long as the data sources are created with the same connection details, which include hostname, port, database name, username, and schema.
JOIN statement contains data sources that use Kerberos authentication, then the same Immuta user must have been
used to create each data source. If an Immuta user has the
IMPERSONATE_HDFS_USER permission, they may set their
principal to match any data sources that they want to be pushed down.
Simply put, at the time of data source creation,
all data source details must be the same in order to be pushed down.
Keep data types consistent
Casting, especially when done using
CASE statements, can be an expensive operation.
Therefore, it's best to have a schema set in the native database, and then allow Immuta to replicate that schema.
That is, if the source data are
TEXT, make sure that Immuta data type is
Inconsistencies, especially for
DATE data types, can cause performance issues with tools that
automatically generate queries.