
Databricks Performance Test

Prerequisite: Before using this walkthrough, please ensure that you have completed Part 3 of the POV Data Setup.

Overview

Performance tests can be complex: you need realistic queries and scenarios, and you need to ensure apples-to-apples comparisons. Luckily, much of this hard work has already been done by TPC-DS. Database and big data companies have used TPC-DS data extensively to test performance, scalability, and SQL compatibility across a range of data warehouse queries, from fast, interactive reports to complex analytics. It models a retail enterprise selling through 3 channels (stores, web, and catalogs), with the data sliced across 17 dimensions, including Customer, Store, Time, and Item. The bulk of the data sits in the large fact tables (Store Sales, Catalog Sales, and Web Sales), which represent daily transactions spanning 5 years.

Databricks uses TPC-DS for its own internal testing, and Immuta has taken components of that Databricks test suite and created a Databricks notebook that:

  • Generates the TPC-DS data (at the scale you desire)
  • Registers it with Immuta
  • Applies masking policies
  • Runs through the test suite, capturing results
  • Does so on both Immuta-enabled and non-Immuta clusters
  • Generates a report at completion

You can run this notebook against any of your Databricks clusters enabled by the different Immuta cluster policies. You can also run it on clusters enabled by competitors to see the same comparisons.
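
To make the idea of a comparison concrete, here is a minimal sketch of how per-query overhead could be computed from two sets of timings. It is not taken from the benchmarking notebook: the function name overhead_report, the dictionary layout, and the timing values are all hypothetical and exist only for illustration.

```python
from statistics import mean


def overhead_report(baseline, protected):
    """Return (per-query overhead in seconds, average overhead).

    baseline and protected are dicts mapping a query name to its elapsed
    seconds on a non-Immuta and an Immuta-enabled cluster respectively.
    Only queries present in both runs are compared.
    """
    shared = sorted(set(baseline) & set(protected))
    if not shared:
        raise ValueError("no queries in common between the two runs")
    per_query = {q: protected[q] - baseline[q] for q in shared}
    return per_query, mean(per_query.values())


# Hypothetical timings, made up purely for illustration.
baseline = {"q1": 4.0, "q2": 10.0, "q3": 6.5}
protected = {"q1": 5.0, "q2": 11.5, "q3": 7.0}

per_query, avg = overhead_report(baseline, protected)
print(per_query)
print(f"average overhead: {avg:.2f}s")
```

Comparing only the queries that appear in both runs keeps the average honest when one run is interrupted or skips a query.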

In our own internal testing, with over 100 column masking policies in place (SHA-256 salted hashing), we see slightly over 1 second of overhead on average; the exact figure varies by cluster policy. You can read more about our internal results here.
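
For readers unfamiliar with the masking technique mentioned above, the following sketch shows salted SHA-256 hashing in general terms using Python's standard hashlib module. It does not reproduce Immuta's implementation: how the salt is stored and combined with the value is an assumption here, and mask_value and the salt strings are hypothetical names.

```python
import hashlib


def mask_value(value, salt):
    """Return the hex SHA-256 digest of salt + value (None stays None).

    Prepending the salt is an assumption made for this illustration.
    """
    if value is None:
        return None
    data = (salt + str(value)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


salt = "example-salt"  # hypothetical salt value
a = mask_value("alice@example.com", salt)
b = mask_value("alice@example.com", salt)
c = mask_value("alice@example.com", "other-salt")

print(a == b)   # same input and salt produce the same digest
print(a == c)   # a different salt produces a different digest
print(len(a))   # a SHA-256 hex digest is 64 characters long
```

Because the digest is deterministic for a given salt, masked columns can still be joined and grouped, which is part of why hashing is a reasonable policy to benchmark across many columns.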

Running the performance test suite

  1. During Part 3 of the POV Data Setup, you should have downloaded the benchmarking suite.
  2. Import the notebook downloaded in Step 1 into Databricks.
    1. Go to your workspace.
    2. Click the down arrow next to your username.
    3. Select Import.
    4. Import the file from Step 1.
  3. Run all cells in the notebook. This notebook references cells in other notebooks, so you are free to jump around and closely examine everything the notebooks are doing; they are well-documented.

Anti-Patterns

Avoid validating performance with simple select * from table queries. TPC-DS has done a lot of work to create a realistic analytical query suite, so you should use it. That said, feel free to also run tests on your own data.
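
If you do time queries on your own data, a small harness can help keep runs comparable. The sketch below is an assumption-laden illustration rather than part of the benchmarking suite: time_query is a hypothetical helper that runs any zero-argument callable several times and reports the median wall-clock time, which is less sensitive to a single slow run than the mean.

```python
import time
from statistics import median


def time_query(run_query, runs=3):
    """Call run_query `runs` times and return the median elapsed seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Stand-in workload so the sketch runs anywhere. In a Databricks notebook
# the callable might instead be something like
#   lambda: spark.sql(my_query).collect()
# where my_query is your own analytical SQL (an assumption, not shown here).
elapsed = time_query(lambda: sum(range(100_000)), runs=5)
print(f"median elapsed: {elapsed:.4f}s")
```

Run the same callable with the same number of repetitions on each cluster you want to compare, so that the only variable is the cluster configuration.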

Next Steps

This was the final walkthrough in the POV Guide, but feel free to go back and do others you may have skipped.