SQL Best Practices
Audience: Data Users
Content Summary: This page outlines some best practices for maximum performance from the Immuta Query Engine, which is an abstraction of your underlying database tables. This virtual layer is powerful as a single point to access all data across your organization and apply data policies consistently.
What is Pushdown
The Immuta Query Engine takes advantage of a performance optimization called pushdown. Pushdown means that parts of a query flow through the Immuta Query Engine and are then executed on the native database. The drawback is that some parts of queries cannot be pushed down to the native database; that work must instead happen in the Immuta layer as the data streams back through it. Depending on the operation, this may add latency.
Here’s a simple example of pushdown functionality that was added in version 2.1.2 of Immuta.

The exact query run:

```sql
SELECT vendor_id FROM taxi_trips WHERE substr(vendor_id, 1, 1) = 'V' LIMIT 10;
```

In version 2.1.1, the query below is what would be sent to the native database through the Immuta Query Engine:

```sql
SELECT `vendor_id` FROM `default`.`taxi_trips`
```

Notice that the `substr` function is not passed down at all. This means the `substr` "work" happens in the Immuta PostgreSQL layer as the rows stream back. However, in version 2.1.2, `substr` support was added, so the query below would be pushed down, which makes a significant difference when querying against a large data set:

```sql
SELECT `vendor_id` FROM `default`.`taxi_trips` WHERE (((substr(`vendor_id`,1,1) = 'V'))) LIMIT 10
```
Immuta has the ability to push down many different queries, operations, and functions depending on the native database. Below are common best practices to consider when using the Immuta Query Engine in order to take advantage of the pushdown optimization and to avoid problematic queries that add latency to the response.
Limit Result Sets
Unless there’s a good reason to pull back a large result set, try to limit the response. The simplest way to accomplish this is through a WHERE clause that limits the response size or a LIMIT. It is best to use WHERE clauses when possible, since whether or not a LIMIT is pushed down depends on the query structure. Many business intelligence tools append LIMIT statements by default to avoid this issue. Treat Immuta similarly.
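As an illustration of the difference, consider a hypothetical query against a large `taxi_trips` data source (the table and column names below are illustrative, not from a real deployment):

```sql
-- Prefer a selective WHERE clause, which is reliably pushed down
-- to the native database and limits the rows that stream back.
SELECT vendor_id, trip_distance
FROM taxi_trips
WHERE pickup_date = DATE '2019-01-15';

-- A bare LIMIT also reduces the rows returned, but whether it is
-- pushed down depends on the structure of the query.
SELECT vendor_id, trip_distance
FROM taxi_trips
LIMIT 100;
```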
Use PostgreSQL Syntax
You are limited to PostgreSQL syntax when querying through the Immuta Query Engine. This means in order for a function to be pushed down, an equivalent function must exist in the native database. Immuta maintains a "Rosetta Stone" which maps PostgreSQL syntax to equivalent functions in all of the data sources that Immuta supports.
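To sketch what such a mapping looks like, consider `substr`: you write the PostgreSQL form against the Query Engine, and Immuta translates it to the native equivalent. The SQL Server translation below is illustrative only; the actual mapping is handled internally:

```sql
-- What you write against the Immuta Query Engine (PostgreSQL syntax):
SELECT substr(vendor_id, 1, 1) FROM taxi_trips;

-- An equivalent on a SQL Server-backed data source, which uses
-- SUBSTRING rather than substr (illustrative translation):
SELECT SUBSTRING(vendor_id, 1, 1) FROM taxi_trips;
```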
If there is specific database syntax that is not pushed down or does not exist in PostgreSQL, you can create an Immuta data source using the SQL statement option (rather than the SQL table option) and include the unsupported native syntax as the SQL statement for the data source. When you query the data source that was created with a SQL statement you will still query the Immuta layer with PostgreSQL syntax, but the underlying native syntax will also be executed as a subquery on the native database (similar to a view), effectively pushing the syntax down.
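For instance, a native database might support a QUALIFY clause (as Snowflake and Teradata do), which has no PostgreSQL equivalent. A hypothetical data source could be registered with a SQL statement like the following, so that this syntax always executes natively as a subquery:

```sql
-- Registered as the data source's SQL statement; table and column
-- names are illustrative. This runs on the native database.
SELECT trip_id, vendor_id, fare_amount
FROM taxi_trips
QUALIFY ROW_NUMBER() OVER (PARTITION BY vendor_id
                           ORDER BY fare_amount DESC) = 1
```

Queries against the resulting Immuta data source still use PostgreSQL syntax, but the statement above runs on the native database underneath.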
Unions
Unions are not currently pushed down, so the UNION work runs in Immuta. Likewise, in order to aggregate or apply a LIMIT to the results of a UNION, the results must first flow back to Immuta. One solution to this problem is to create a view that represents the UNION, expose that view in Immuta, and query that view.
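The view-based workaround can be sketched as follows (table and view names are hypothetical); the view is created in the native database and then exposed as an Immuta data source:

```sql
-- Created in the native database, then exposed in Immuta:
CREATE VIEW all_taxi_trips AS
SELECT vendor_id, fare_amount FROM yellow_taxi_trips
UNION ALL
SELECT vendor_id, fare_amount FROM green_taxi_trips;

-- Aggregations and LIMITs against the view can then be pushed down:
SELECT vendor_id, AVG(fare_amount)
FROM all_taxi_trips
GROUP BY vendor_id;
```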
Multi Data Source Joins
Immuta has the ability to push down joins between Immuta data sources that share the same source database as long as the data sources are created using the same connection details, which include hostname, port, database name, username, and schema.
If a join contains data sources that use Kerberos authentication, then the same Immuta user must be used to create each data source. If an Immuta user has the IMPERSONATE_HDFS_USER permission, then they may set their principal to match any data sources that they want to be pushed down. Simply put, at the time of data source creation, all data source details must be the same in order for the join to be pushed down.
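For example, a join between two data sources that were registered with identical connection details (the table names here are hypothetical) can execute entirely on the native database:

```sql
-- Both data sources point at the same host, port, database, schema,
-- and were created with the same credentials, so the join pushes down.
SELECT t.vendor_id, z.zone_name
FROM taxi_trips t
JOIN taxi_zones z ON t.pickup_zone_id = z.zone_id;
```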