Benchmarks - SparkSQL
When you create Hive or Impala tables from your data in HDFS, you may need policies governing who can see which rows and columns. This becomes complex on a Hadoop cluster because you need to protect not only the Hive and/or Impala tables, but also the data in HDFS that backs those tables.
For example, when you run SparkSQL, although it references Hive or Impala tables, it does not actually read any data through them. For performance reasons it reads the data directly from HDFS. This means that any protections you set on those Hive or Impala tables through Sentry or Ranger will not be applied to the raw file reads in SparkSQL. In fact, those files need to be completely open to anyone running SparkSQL jobs.
Immuta provides the ability to enforce policy controls not only on the Hive and Impala tables, but also the backing files in HDFS.
The Immuta SparkSQL context is a subclass of SQLContext, exactly like HiveContext. You can do all the normal functions you would with HiveContext, except Immuta will handle dynamically unlocking the files in HDFS and enforcing the row and column level controls within the Spark job. As far as the end users are concerned, SparkSQL works as it always has, yet the policies are being enforced.
Unlike the Immuta SQL integration, the Immuta SparkSQL integration is not virtual at all. This is because the nature of distributed processing requires the data to remain on-node. As such, the ImmutaContext unlocks HDFS files and enforces policies natively in Spark and HDFS. Because of this, very little latency is added to the jobs. For example, the large SparkSQL jobs below resulted in an average overhead of only 8%.
Multiple columns, single table:
SELECT * FROM x WHERE x.date >= <START DATE> AND x.date <= <END DATE>
Data read size: 67,885,792
Single column in a join:
SELECT x.a, x.b from x JOIN y ON y.id = x.id WHERE x.date >= <START DATE> AND x.date <= <END DATE> AND y.date >= <START DATE> AND y.date <= <END DATE>
Data read size: 16,928,624
Windowed aggregation over a single table:
o_df = context.table(o).filter('date >= <START DATE> AND date <= <END DATE>')
win = SQL.Window.partitionBy(o_df['s_code']).orderBy('s_code', 'date', 'o_id')
ap = F.avg(o_df['p']).over(win)
max_p = F.max(o_df['p']).over(win)
min_p = F.min(o_df['p']).over(win)
win_o = o_df.select('o_id', 's_code', 'p', 'date', ap.alias('avg_p'), max_p.alias('max_p'), min_p.alias('min_p'))
win_o.filter("p < 0.5*avg_p").show()
Data read size: 67,885,792
- CDH 5.7.6
- 10 Nodes, 9 DataNodes (8 cores, 32 GB RAM)
- Synthetic Trading Data (~2 Billion rows / 1TB of data)
- Single Node install of Immuta v 2.1
- 1 @ 4 vCPU, 16 GiB RAM, 64 GiB SSD
Benchmarks - SQL
Through the Immuta SQL layer, Immuta provides a scalable, highly available data abstraction layer that not only unifies your data but also allows complex policy enforcement on-the-fly, rather than batch precomputing various anonymized data sets - because nobody has static data or static policies. The benefits are the ability to maintain policies in a single location across all your data silos, change those policies at will, have those changes reflected immediately, and have consistent audit logs across your data silos. This allows the execution of many more complex policies, since each policy doesn't result in a new full copy of the data for every user/policy combination.
This power and flexibility requires some computation at the Immuta layer, between the data being queried in the native storage technology and the analyst wanting to analyze it. While some overhead is to be expected, Immuta provides a solution that minimizes this trade-off significantly - providing you with the flexibility of dynamic policies. When considering the impact of Immuta queries, you should weigh it not only against query response times, as we will demonstrate below, but also against business-process "red tape." We've seen months of manual approval processes and data anonymization translations before data reaches the downstream user due to policies and regulation - which Immuta eliminates.
Life Cycle of a Policy-Enforced Query in Immuta
The following actions occur on Immuta data fetches (non-HDFS) to enforce the dynamic/complex policy logic, with potential caching on steps 1 and 2:
1. Interacts with the enterprise identity management system to understand the groups/attributes the authenticated user possesses.
2. Fetches the policies currently enforced on the data source being queried.
3. Blends the fetched policy with the user attribution to build the literal policy logic.
4. Executes the query on the native data source, including any policy logic that can be enforced as part of the query execution, e.g. pushed down to the native database for processing.
5. The response of the query passes through Immuta, which enforces any remaining policies, and back to the requesting client.
6. Audits the query.
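The steps above can be sketched as a self-contained simulation. All names, the policy shape, and the attribute model here are illustrative assumptions, not Immuta's actual API or internals:

```python
import re

# Hypothetical stand-ins for steps 1 and 2 (both cacheable).
USER_ATTRS = {"alice": {"groups": ["fraud_analysts"]}}             # step 1 source
POLICIES = {"trades": {"row_filter": "region = 'US'",              # step 2 source
                       "mask_columns": ["ssn"]}}
AUDIT_LOG = []

def build_policy_plan(policy, attrs):
    # Step 3: blend the policy with the user's attribution; here the
    # row filter is waived for members of a privileged group.
    predicate = None if "fraud_analysts" in attrs["groups"] else policy["row_filter"]
    return {"predicate": predicate, "mask": policy["mask_columns"]}

def run_query(user, table, base_sql, rows):
    attrs = USER_ATTRS[user]                                       # step 1
    plan = build_policy_plan(POLICIES[table], attrs)               # steps 2-3
    sql = base_sql                                                 # step 4: push the
    if plan["predicate"]:                                          # predicate down
        sql += " WHERE " + plan["predicate"]
    masked = [{k: re.sub(r"\d", "#", str(v)) if k in plan["mask"] else v
               for k, v in row.items()}                            # step 5: mask while
              for row in rows]                                     # streaming back
    AUDIT_LOG.append((user, sql))                                  # step 6: audit
    return sql, masked

sql, out = run_query("alice", "trades", "SELECT * FROM trades",
                     [{"id": 1, "ssn": "123-45-6789", "region": "US"}])
```

Because "alice" belongs to the privileged group, no predicate is pushed down, but the masked column is still redacted as the row streams back and the query is audited.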
For HDFS, only steps 1 through 3 and step 6 occur, and they happen at job startup time; the data reads happen on the HDFS DataNodes, with the policies enforced based on the policy logic built in step 3. Auditing does occur on those reads.
Constant Immuta Overhead
Without any policies at all, the Immuta SQL layer introduces a constant overhead due to steps 1, 2, 4, and 5 described in the Life Cycle of a Policy-Enforced Query in Immuta section. That constant overhead, in a recommended Immuta deployment configuration, is about a quarter of a second, no matter the query.
Additionally, due to step 5 in the Life Cycle of a Policy-Enforced Query in Immuta section, the latency increase scales linearly with the volume of data returned, and that amount varies slightly based on the database type. Image 1 shows our performance test results while querying an Impala data source with no policies. Although the lines look parallel, there is a slight difference in slope showing that Immuta's overhead increases with data scale. At 15 million data points, half a second of overhead is added compared to querying Impala directly, so a 12-second direct query takes 12.5 seconds through Immuta. Extrapolating these curves shows that Immuta adds an additional second to a query that returns 75 million data points.
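Treating the two figures quoted above (0.5 s of added latency at 15 million data points, 1 s at 75 million) as points on a line gives a rough illustrative model; actual overhead varies by database type:

```python
# Rough linear model through the two overhead figures quoted above
# (illustrative only; actual overhead varies by database type).
p1 = (15_000_000, 0.5)   # 15M data points -> ~0.5 s added latency
p2 = (75_000_000, 1.0)   # 75M data points -> ~1.0 s added latency
slope = (p2[1] - p1[1]) / (p2[0] - p1[0])   # seconds per data point
intercept = p1[1] - slope * p1[0]

def immuta_overhead_seconds(n_points):
    """Estimated added latency for a query returning n_points values."""
    return intercept + slope * n_points
```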
- CDH 5.7.6
- 10 Nodes, 9 DataNodes (8 cores, 32 GB RAM)
- ~350 GB NYC Taxi Data (170 million rows, 180 columns)
- 5 Node install of Immuta v. 1.5 (note that the number of nodes does not impact performance, only availability)
- 1 @ 4 vCPU, 16 GiB RAM, 64 GiB SSD
- 4 @ 16 vCPU, 122 GiB RAM, 256 GiB SSD
SQL queries were executed 10 times each, both directly against Impala and via Immuta to Impala, and the average response times were taken. These were "select * from table limit x" queries. It's important to note that Immuta has various levels of pushdown from its abstraction layer down to the native database, so the SQL functions used and the structure of the query can, in some cases, impact performance positively or negatively.
Note that choosing the right database driver can have - and typically does have - a more significant impact on query response times than Immuta. As shown in Image 2, using the impyla library to connect directly to Impala (with no Immuta at all) showed query times increasing dramatically as data scale increased.
Impact of Row Redaction
Row redaction has overhead very comparable to the constant Immuta overhead discussed above because redaction policies are always pushed down to the native database as part of the WHERE clause. Image 3 depicts the impact of row redaction on Immuta queries; the percentage of the data protected (or returned, depending on how you look at it) has an impact simply because of the WHERE clause differences in the pushed-down policy.
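Since redaction is pushed down, a policy effectively rewrites the query's WHERE clause. A minimal sketch, assuming a single uppercase WHERE and no subqueries (the function name and rewrite rule are hypothetical, not Immuta internals):

```python
def push_down_redaction(sql, policy_predicate):
    # Naive rewrite: assumes at most one uppercase "WHERE" and no
    # subqueries; a real rewriter would operate on a parsed query plan.
    if " WHERE " in sql:
        head, _, tail = sql.partition(" WHERE ")
        # Parenthesize both sides so operator precedence is preserved.
        return f"{head} WHERE ({tail}) AND ({policy_predicate})"
    return f"{sql} WHERE {policy_predicate}"
```

A query with an existing filter keeps it, ANDed with the policy predicate; a query without one simply gains a WHERE clause.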
Impact of Column Masking
Column masking is sometimes enforced as the data streams through Immuta, so there can be a performance impact based on how many columns are being masked. The only column masking policies that must be enforced while the response streams through Immuta are hashing and regex; both rounding and replacement with a constant/null are pushed down to the remote data source and thus have no impact on performance beyond what was discussed above.
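A hash mask applied while the response streams through could look like the following sketch (SHA-256 is an assumption chosen for illustration, not a statement about Immuta's actual hash function):

```python
import hashlib

def mask_streaming_row(row, masked_columns):
    # Hash-mask the named columns on the fly; rounding and
    # constant/null replacement would instead be pushed down
    # to the remote data source.
    return {col: hashlib.sha256(str(val).encode()).hexdigest()
                 if col in masked_columns else val
            for col, val in row.items()}
```

Because the hash runs once per masked value per row, the per-row cost grows with the number of masked columns, which matches the degradation described below.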
Below are charts depicting the impact of column masking (using hashing, i.e. not pushed down), focusing on the number of masking policies being enforced. As you can see, performance degrades as you add more masked columns. You should consider carefully which masking policies to enforce and which columns to mask when exposing data in Immuta.
The filesystem has overhead identical to the interactive SQL layer, as both go through the same process depicted in the integration diagram above. However, the filesystem caches responses as files on disk and in the Immuta caching layer; subsequent reads of those files come directly from disk and are substantially faster than querying the remote storage technology.