

Contrasting Ranger, Sentry, and Immuta Spark Controls

Immuta is happy to announce our 2.1 release, which includes SparkSQL support. With this new capability, Immuta can now enforce all the same policies historically available through Immuta while allowing processing at massive scale. This includes row- and column-level controls, time windowing, minimization, and purpose limitations, as well as differential privacy.

Why is this important? There's no limit to how much data you can process, secure, and audit with Immuta policies: not only can the new Immuta SparkSQL layer access data on the cluster, it can also join it with data external to the cluster. With Immuta, all your data is made available from within Spark, even if it doesn't live on your cluster.

But what about using Sentry and/or Ranger to protect data? This question is asked quite a bit, and is probably why you’re now reading this document. Before we break down the differences, it’s first important to understand how SparkSQL interacts with the data on your cluster with regard to Hive and Impala tables.

What makes SparkSQL so fast is that even though you are “reading” from Hive or Impala tables, SparkSQL takes the query plan and runs your query as jobs directly against the files sitting in HDFS; it does not go through Hive or Impala at all for data access, using the table metadata only for planning. This is great for speed, but really bad for security. Why? Because any user running a SparkSQL job needs read access to the raw files in HDFS, no matter what policies exist on the Hive or Impala tables. We’ll refer to this as the “HDFS raw file read” problem.
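
To make this concrete, here is a minimal sketch with a hypothetical sales table (the table name and layout are illustrative only):

    -- A Hive table whose data lives as raw Parquet files in HDFS,
    -- e.g. under /user/hive/warehouse/sales.
    CREATE TABLE sales (id INT, amount DOUBLE) STORED AS PARQUET;

    -- Submitted through SparkSQL, this query uses the metastore only for
    -- planning; the Spark executors then read the Parquet files in HDFS
    -- directly, so the submitting user needs HDFS read access to those
    -- files regardless of any Hive or Impala policy.
    SELECT id, SUM(amount) FROM sales GROUP BY id;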

Spark without Immuta

In the following sections we’ll run through how Sentry, Ranger, and Immuta apply data policies, along with some other key differentiators.

Sentry

Sentry allows you to enforce table and column level controls on your Impala/Hive tables.

Table controls

  • You can GRANT SELECT ON a table TO ROLE roleName.
  • Access can only be granted to roles, not directly to users or groups.

Column controls

  • You can GRANT SELECT(column_name) ON a table TO ROLE roleName.
  • Access can only be granted to roles.
  • GRANT SELECT(column_name) only allows you to select that column; if your query references any other columns, the entire query is blocked.
  • Yes, this means you have to explicitly GRANT every column rather than REVOKE the columns you don’t want your users to see.
  • There is a minimalist HUE user interface for authoring policies; in practice, though, we found it easier to build the policies from the command line in the Hive or Impala shell, as in the sketch below.
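
Here is a minimal sketch of both grant styles, run from the Hive or Impala shell (the customers table and analysts role are hypothetical):

    -- Table-level access: members of the analysts role may read every column.
    GRANT SELECT ON TABLE customers TO ROLE analysts;

    -- Column-level access: analysts may select only name and city; any query
    -- touching another column of customers is blocked entirely.
    GRANT SELECT(name, city) ON TABLE customers TO ROLE analysts;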

Row Level Security

  • The recommended way to enforce row-level security in Sentry is to create views and provide access only to those views instead of the tables they were created from, as in the sketch below.
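
A minimal sketch of this pattern, again with a hypothetical customers table:

    -- Expose only US rows through a view, and grant access to the view only.
    CREATE VIEW customers_us AS
      SELECT * FROM customers WHERE country = 'US';
    GRANT SELECT ON TABLE customers_us TO ROLE us_analysts;
    -- Note: no grant is issued on the underlying customers table.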

Visibility for Compliance

Sentry provides no way other than the command line to view the policies currently being enforced on tables; this looked to be possible in the HUE interface, but again the shell proved easier. Sentry also does not provide any reporting capability based on actions being taken in the system.

So what about the HDFS raw file read problem?

Cloudera provides a handy utility that synchronizes HDFS ACLs with Sentry permissions. What this means is:

  • If you have this running, all the HDFS files are blocked through the ACL until you GRANT SELECT on the table they back.
  • If you have GRANT SELECT(column_name), then the raw HDFS files are completely blocked (from the user and their Spark jobs).
  • Yes, this means you can’t do column-level controls in SparkSQL with Sentry, because there’s no way to enforce the policy when reading from the raw files.
  • If you have two tables built from the same files in HDFS, the least restrictive policy wins: e.g., if one table has column security and the other plain table access, you can still read the files because you have table access.
  • Row-level security through a view breaks down in the SparkSQL situation, as the underlying raw HDFS files are blocked if the user only has access to the view. Even if the user had access to the raw files, there would be no way to enforce the row-level policy (i.e., the view) on SparkSQL’s read of those files. The sketch below illustrates the first two points.
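
To illustrate, using the hypothetical grants from earlier:

    -- With the Sentry HDFS ACL sync running:
    GRANT SELECT ON TABLE customers TO ROLE analysts;
    -- The ACL on the backing files (e.g. /user/hive/warehouse/customers)
    -- is synced, so SparkSQL jobs run by analysts can read them.

    GRANT SELECT(name, city) ON TABLE customers TO ROLE analysts;
    -- No file-level ACL is synced, because the column policy cannot be
    -- enforced on raw files; SparkSQL jobs run by analysts are blocked.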

Ranger

Before diving into the details of Ranger, it’s important to note that Ranger does not support Impala. So if you are running a Cloudera distro and leveraging Impala, you can’t use Ranger to enforce controls.

Table Controls

Ranger is very similar to Sentry in how it handles table access.

  • You can GRANT SELECT on specific tables to groups or users.

Column controls

Similar to Sentry, you can GRANT access to specific columns, but unlike Sentry, you can also specify which columns to deny access to, instead of the other way around. Additionally, Ranger allows column masking rather than simply removing access. These are the types of column masking policies Ranger supports (illustrated in the sketch after this list):

  • Redact – Mask all alphabetic characters with "x" and all numeric characters with "n".
  • Partial mask: show last 4 – Show only the last four characters.
  • Partial mask: show first 4 – Show only the first four characters.
  • Hash – Replace all characters with a hash of entire cell value. Note that this hash is not a salted hash, and could in some cases be easily broken with a rainbow table.
  • Nullify – Replace all characters with a NULL value.
  • Unmasked (retain original value) – No masking is applied.
  • Date: show only year – Show only the year portion of a date string, defaulting the month and day to 01/01. However, we were not able to get this to work on the TIMESTAMP column type.
  • Custom – Specify a custom masked value or expression. Custom masking can use any valid Hive UDF that returns the same data type as the column being masked. This is similar to the Immuta regular expression policy you’ll read about momentarily.
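
As an illustration, here is roughly how the main transforms would render a hypothetical value (outputs are approximate and follow the descriptions above):

    -- Original value:          'Card-4242'
    -- Redact:                  'xxxx-nnnn'  (letters -> x, digits -> n)
    -- Partial: show last 4:    only '4242' remains visible
    -- Hash:                    an unsalted hash of the entire value
    -- Nullify:                 NULL
    -- Date: show only year:    '2018-06-15' -> '2018-01-01'
    -- Custom (a Hive expression; this one, which keeps only the first
    -- character of an email's local part, is hypothetical):
    --   concat(substr(email, 1, 1), '***', substr(email, instr(email, '@')))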

Row Level Security

Ranger supports row level security through a WHERE clause policy.

  • This means you can apply a WHERE clause on demand at query time based on the user/group running the query. Think of this as a view on the fly via the policy, as sketched below.
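
A sketch of the effect, assuming a hypothetical orders table and a row-filter expression of region = 'EU' assigned to group eu_analysts:

    -- A member of eu_analysts submits:
    SELECT * FROM orders;

    -- Ranger injects the policy's filter at query time, so what effectively
    -- runs is equivalent to:
    SELECT * FROM orders WHERE region = 'EU';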

Conditions

It’s very important to note that the conditions for row-level security and column masking/access are inclusive. In other words, you explicitly specify who the policy applies to rather than who should see the data without the policy. This creates a risk of data leaks. We’ll discuss this further in the Immuta conditions section.

Visibility for Compliance

Ranger does have the capability to list all existing policies and search them; however, it does not provide any reporting capability based on actions being taken in the system.

So what about the HDFS raw file read problem?

Unfortunately, there is no utility like Sentry’s that syncs policies created on Hive tables with the underlying HDFS files. This means:

  • If you create a policy on a Hive table, the user still has access to the files in HDFS unless you secure those as well.
  • This also means there is no support for row or column policies being enforced within Spark using Ranger.

Immuta

Immuta now lets you enforce all its advanced policies on SparkSQL raw file access. In other words, SparkSQL still does its raw read from the HDFS files AND the Immuta controls are enforced. This means you have the scale and speed of SparkSQL combined with the policy enforcement, entitlements, and auditing provided by Immuta, thus solving the HDFS raw file read problem discussed above.

As you probably know, Immuta provides the ability for anyone to author these policies in natural language. The enforcement and legibility of the policies allows compliance and legal staff not only to understand exactly how policies are being enforced, but also to author the policies themselves easily. Policy authoring is quite burdensome in both Ranger and Sentry, and there is no concept of entitlement workflows to gain access to tables in either tool. In Immuta, you create the policies on the Hive and/or Impala tables, and those policies will also be enforced on the HDFS files when they are read during SparkSQL jobs.

Immuta Spark Security

Table controls

Rather than an admin simply GRANTing SELECT on tables manually, Immuta allows you to build data entitlement workflows for gaining access to data, which we term “Subscriptions”. Users can search and discover tables in Immuta and request access, which initiates one of the following workflows:

  • Anyone: Users will automatically be granted access when requested.
  • Anyone Who Asks (and is Approved): Users will need to request access through the Immuta Web UI and be granted permission by a data owner (which can be a group). This means you can manually approve access when someone makes the request.
  • Users with Specific Groups/Attributes: Only users with the specified groups/attributes will be able to see the Data Source in the Immuta Web UI and subscribe. This means you can build logic to make access decisions automatically based on the user’s groups/attributes.
  • Individual Users You Select: The Data Source will not appear in the Immuta Web UI (invisible to users) and data owners must manually add/remove users. This is similar to how Sentry and Ranger work with GRANTing access unbeknownst to the user.

For all of the above, expirations can be set for access as well, meaning you can approve access, but “only for 30 days”, for example. All entitlement workflows are fully audited in the Immuta audit logs.

Column controls

Similar to Ranger, Immuta allows column masking rather than simply blocking column access. These are the types of masking policies Immuta supports:

  • Hashing: Hash the values to an irreversible sha256 hash, which is consistent for the same value throughout the data source, so you can count or track specific values without knowing the true raw value. This provides a good deal of utility while still protecting sensitive data. The Immuta hash uses a salt per user to avoid a rainbow table attack against the hash.
  • Replace with Null: Make all the values in the column null, removing any utility from the column.
  • Replace with constant: Replace all the values in the column with the same constant value you choose, such as 'Redacted', removing any utility from the column.
  • Regular Expression (regex): This is similar to replacing with a constant, yet provides more utility because you can retain portions of the true value. For example, you could mask the final digits of an IP address but retain the rest, like so: 164.16.13.XXX. This is similar to the Ranger custom policies, except that Ranger uses UDF functions rather than regexes. You can also replicate Ranger’s show last 4 / show first 4 policies with a regex, providing more flexibility. (See the sketch after this list.)
  • Rounding: This is a technique to hide precision in numeric values while providing more utility than hashing and retaining the column type. For example, you could remove precision from a geospatial coordinate. You can also use this type of policy to remove precision from dates and times by rounding to the nearest hour, day, month, or year. Ranger is only able to do this against dates at year resolution; as noted above, we were never able to get that to work.
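
To give a feel for the regex and rounding policies, here they are expressed as equivalent SQL transforms (for illustration only; Immuta policies are authored in the natural language builder, not SQL, and the web_logs and locations tables are hypothetical):

    -- Regex: mask the final octet of an IP, '164.16.13.42' -> '164.16.13.XXX'
    SELECT regexp_replace(ip_address, '([0-9]+\\.[0-9]+\\.[0-9]+)\\.[0-9]+', '$1.XXX')
    FROM web_logs;

    -- Rounding: reduce the precision of a latitude to one decimal place
    SELECT round(latitude, 1) FROM locations;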

Row Level Security

  • WHERE clause: You can specify a WHERE clause to execute for certain users based on the policy condition, thus redacting rows. You can almost think of this as a view on the fly. This is just like the Ranger policy, except that the condition is exclusionary in Immuta and inclusionary in Ranger (refer to the conditions section below).
  • Matching: Match a user attribute with a row attribute (a column value) to determine whether that row should be visible. This can be more powerful than the WHERE clause approach, as you can dynamically incorporate the user’s attributes into the logic of the policy.
  • Time Window: Restrict access to rows that fall within the last x days/hours/minutes/years. Think of this as a moving window of time that chops off the rows of data falling at the rear (further back in time) of that window.
  • Minimization: Restrict access to only a limited percentage of the data, randomly sampled, but the same sample for all users. For example, you could limit certain users to only 10% of the data. The data a user sees will always be the same, but new rows may be added as new data arrives in the system. (The sketch below expresses some of these as equivalent SQL.)
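
Expressed as equivalent SQL over a hypothetical claims table (again for illustration only; the current_user_country() function is a made-up stand-in for the user-attribute lookup Immuta performs):

    -- WHERE clause: only oncology rows are visible
    SELECT * FROM claims WHERE department = 'oncology';

    -- Matching: a row is visible only when its country column matches the
    -- querying user's 'country' attribute (hypothetical function)
    SELECT * FROM claims WHERE country = current_user_country();

    -- Time window: only the last 30 days of rows are visible
    SELECT * FROM claims WHERE event_date > date_sub(current_date(), 30);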

Purpose Restrictions

Immuta is the only data management platform to allow for purpose-based restrictions on data, which can limit what purposes a data source can be used for. A data owner could, for example, limit the use of a data source to purposes “Market Research,” “Fraud Detection,” or “Loan Eligibility,” and that data source could only ever be used within Immuta for those purposes. In order to use that data source, Immuta users will be forced to agree to automated acknowledgement statements declaring that they understand the terms of the use of that data, which Immuta records. These acknowledgement statements and individual purposes can be customized.

Differential Privacy

Immuta’s cutting-edge technology takes the concept of differential privacy out of academia and puts it in the hands of commercial customers. Immuta's differential privacy engine provides mathematical guarantees that an outsider will be unable to make confident inferences about the contents of individual records from query results.
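
For reference, the standard definition underpinning these guarantees: a randomized query mechanism M is ε-differentially private if, for any two datasets D and D′ differing in a single record and any set of possible outputs S,

    Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D′) ∈ S]

In other words, the presence or absence of any one individual’s record changes the probability of seeing any particular query answer by at most a factor of e^ε.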

SparkSQL Note

Note that differential privacy is the only Immuta policy that does not allow SparkSQL to query the raw files in HDFS directly. This policy does require SparkSQL to reach out to Impala/Hive to execute the query.

Conditions

Similar to Sentry and Ranger, you must establish the conditions under which the policies will be enforced. Immuta allows you to attach multiple conditions to the data. Those conditions are based on user attributes and groups (which can come from your identity management system), or the purposes users are acting under via Immuta projects. Note that the attributes and groups can be retrieved from multiple different identity management systems and applied as conditions to the same policy. Sentry and Ranger do not support anything beyond users (Ranger), groups (Ranger), and roles (Sentry).

Conditions in Immuta are exclusionary or inclusionary, depending on the policy being enforced; Immuta has determined the best direction for each condition to avoid inadvertent data leaks. For example, rather than specifying every user attribute that should see the masked value, you instead specify the attribute that is allowed to see the unmasked value, e.g., mask for everyone except. This is exclusionary. There are inclusionary policies in Immuta, such as row-level security matching, where you require that the user attribute match the data attribute.
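
For example, an exclusionary masking policy reads roughly like this in Immuta’s natural language policy builder (phrasing approximate):

    Mask using hashing the values in column ssn for everyone except members of group HR

Only the HR exception has to be enumerated; every current and future user outside that group receives the masked view by default.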

This is a key distinction from Ranger, where all policies are inclusionary, meaning you need to specify all the groups you want the policy to APPLY TO rather than stating who the policy SHOULD NOT APPLY TO. Immuta’s exclusionary policies are more secure: you are explicit about who the policy does not apply to, and everyone else gets the restricted view by default. A good analogy: you don’t create a list of people that aren’t going to your wedding, you create a list of people who are invited. In other words:

Block entry to wedding for everyone except group wedding guests (Immuta, exclusive)

Block entry to wedding to group x (Ranger, inclusive)

The issue with Ranger’s inclusive conditions, beyond the fact that it’s time consuming to account for every group/user that exists, is that a newly created group will bypass the policy. This results in a severe risk of data leaks.

Note that Sentry avoids this problem because the only policies it supports give access rather than take it away. So although inclusionary, it works. For example:

Allow entry to wedding to wedding guests, e.g. allowing entry rather than blocking entry.

Visibility for Compliance

Unlike Ranger and Sentry, Immuta also has compliance and legal users in mind with its features, rather than just the database administrators and analysts. As already mentioned, it is common for compliance personnel to build Immuta policies themselves through the simple-to-use natural language policy builder and to have complete visibility into how policies are actually being enforced. But there’s more!

Immuta also provides a report generation tool, which allows compliance and legal to generate common reports based on the Immuta audit logs rather than having to rely on IT to comb those logs. This is critical for proving compliance to outside auditors. Immuta also provides notifications, both in the app and over email, about critical activities occurring in the platform, such as entitlement workflow actions, policy changes, and new data being exposed.

So what about the HDFS raw file read problem?

As mentioned above, all the policies are applied directly to the data being read from HDFS via SparkSQL. Immuta also protects those raw files in HDFS from any access outside of SparkSQL.

Conclusion

If you are concerned with row- and column-level controls in Spark, you should use Immuta’s new SparkSQL support. You can expose Hive and Impala tables in Immuta, build complex Immuta policies on those tables, and then have those policies also enforced in SparkSQL. Note that all three products support individual file controls on HDFS files, but that does not help with row and column controls. In summary:

  • Immuta can support row, column, and purpose controls on SparkSQL jobs accessing data directly from HDFS.
  • Immuta is much more fully featured with regard to the types of policies it can enforce.
  • Immuta provides entitlement workflows for table access.
  • Immuta can expose data in SparkSQL from databases outside your Hadoop cluster.
  • The conditions for Immuta policies avoid data leaks (see the Immuta conditions section).
  • Immuta provides report generation from audit logs and natural language policies to empower legal and compliance users.