Prerequisite: Before using this walkthrough, please ensure that you have completed Part 3 of the POV Data Setup.
Performance tests can be complex: they need realistic queries and scenarios, and the comparisons must be apples to apples. Luckily, a lot of this hard work has already been done by TPC-DS. TPC-DS data has been used extensively by database and big data companies for testing performance, scalability, and SQL compatibility across a range of data warehouse queries, from fast, interactive reports to complex analytics. It reflects a multidimensional data model of a retail enterprise selling through 3 channels (stores, web, and catalogs), with the data sliced across 17 dimensions, including Customer, Store, Time, Item, etc. The bulk of the data is contained in the large fact tables (Store Sales, Catalog Sales, and Web Sales), which represent daily transactions spanning 5 years.
Databricks uses TPC-DS for its own internal testing, and Immuta has taken components of that Databricks test suite and created a Databricks notebook that:
Generates the TPC-DS data (at the scale you desire)
Registers it with Immuta
Applies masking policies
Runs through the test suite, capturing results
Does so on both Immuta-enabled and non-Immuta clusters
Generates a report at completion
This can be run against any of your Databricks clusters enabled by the different Immuta cluster policies. In fact, you can run this on clusters enabled by competitors to see the same comparisons.
In our own internal testing, with over 100 column masking policies in place (SHA-256 salted hashing), we see slightly over 1 second of overhead on average; the exact figure varies by cluster policy. You can read more about our internal results here.
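To make "overhead on average" concrete, here is a minimal sketch of how a per-query average overhead could be computed from two sets of timings. The function name `average_overhead`, the query labels, and the timing values are all hypothetical and exist only for illustration; they are not taken from the notebook or from Immuta's internal results.

```python
from statistics import mean


def average_overhead(baseline_s, protected_s):
    """Return the mean per-query overhead in seconds.

    baseline_s:  query name -> runtime on the cluster without policies
    protected_s: query name -> runtime on the policy-enabled cluster
    Only queries present in both runs are compared.
    """
    common = sorted(set(baseline_s) & set(protected_s))
    if not common:
        raise ValueError("no queries in common between the two runs")
    return mean(protected_s[q] - baseline_s[q] for q in common)


# Made-up timings in seconds, for illustration only (not real results).
baseline = {"q1": 4.0, "q2": 10.0, "q3": 6.5}
protected = {"q1": 5.0, "q2": 11.5, "q3": 7.0}

overhead = average_overhead(baseline, protected)
print(f"average overhead: {overhead:.2f} s")
```

Comparing only queries that ran on both clusters keeps the comparison apples to apples when one run is incomplete.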
During Part 3 of the POV Data Setup you should have downloaded the Benchmarking suite.
Import the Notebook downloaded from Step 1 into Databricks.
Go to your workspace.
Click the down arrow next to your username.
Select Import.
Import the file from Step 1.
Follow the instructions in the notebook.
Avoid validating performance with simple select * from table queries alone. TPC-DS has done a lot of work to create a realistic analytical query suite, and you should use it. That said, feel free to also run tests on your own data.
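If you do run tests on your own data, a small timing harness helps keep the runs comparable across clusters. The following is a rough sketch, not part of the benchmarking notebook: `time_query` and `fake_query` are hypothetical names, and the stand-in workload would be replaced by a callable that runs your real query (in a Databricks notebook, something like `lambda: spark.sql(my_sql).collect()`, where `my_sql` is your own query text).

```python
import time
from statistics import median


def time_query(run_query, repeats=3, warmups=1):
    """Return the median wall-clock time in seconds of run_query().

    Warm-up runs are executed first and discarded so that one-time
    costs (caching, compilation) do not skew the measurement.
    """
    for _ in range(warmups):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Stand-in workload so the sketch runs anywhere; swap in your own query.
def fake_query():
    return sum(range(100_000))


elapsed = time_query(fake_query)
print(f"median runtime: {elapsed:.4f} s")
```

Run the same callable with the same `repeats` and `warmups` on each cluster you want to compare, and use queries that resemble your real analytical workload rather than a bare select *.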
This was the final walkthrough in the POV Guide, but feel free to go back and do others you may have skipped.