Immuta Secure is the final piece of the puzzle: Now that you understand where sensitive data lives (via Discover) and can monitor activity against that data (via Detect), you can now mitigate risk using Immuta Secure.
In short, Immuta Secure enables the management and delivery of trusted data at scale.
Managing access control in your data platform typically starts off easy, but over time becomes a house of cards. This concept is termed role explosion and is a result of having to keep up with every permutation of access across your organization. Once this occurs, it becomes difficult to evolve policies for fear of breaking existing access or because of a lack of understanding across your extensive role list.
Secure allows you to apply engineering principles to how you manage data access, giving your team the agility to lower time-to-data across your organization while meeting your stringent and granular compliance requirements. Immuta allows massively scalable, evolvable, and understandable automation around data policies; creates stability and repeatability around how those policies are maintained; allows distributed stewardship across your organization, but provides consistency of enforcement across your data ecosystem no matter your compute or data warehouse; and fosters more availability of data through the use of highly granular data controls.
Each of the guides below explains Secure principles in detail:
Scalability and Evolvability: A scalable and evolvable data management system allows you to make changes that impact thousands of tables at once, accurately. It also allows you to evolve your policies over time with minor changes (or no changes at all) through policy logic.
Understandability: Immuta can present policies in a natural language form that is easily understood and provide an audit history of changes, creating a trust-but-verify environment. This allows you to prove to business leaders concerned with compliance and risk that policy is being implemented correctly, and your business can meet audit obligations to external parties or customers.
Distributed Stewardship: Immuta enables fine-grained data ownership and controls over organizational domains, allowing a data mesh environment for sharing data - embracing the ubiquity of your organization. You can enable different parts of your organization to manage their data policies in a self-serve manner without involving you in every step, and you can make data available across the organization without the need to centralize both the data and authority over the data. This frees your organization to share more data more quickly.
Consistency: With inconsistency comes complexity, both for your team and the downstream analysts trying to read data. That complexity from inconsistency removes all value of separating policy from compute. Immuta provides complete consistency so that you can build a policy once, in a single location, and have it enforced scalably and consistently across all your data platforms.
Availability (of data): Because of these highly granular decisions at the access control level, you can increase data access by over 50% in some cases when using Immuta because friction between compliance and data access is reduced.
If your goal is data mesh, read the content below, but also refer to the Data mesh use case. It will help you understand how distributed stewardship aligns with additional data mesh strategies in Immuta.
Separation of duties is a critical component of policy enforcement. A related component to consider is separation of understanding: some people in your organization are much more knowledgeable about what policies must be enforced, while others understand deeply what data is contained in certain tables - the data experts, so to speak.
Wouldn’t it be nice if you could rely on data experts to ensure that data is being tagged correctly, and rely on the compliance experts to ensure that policy is being authored appropriately based on requirements - separation of understanding? This is possible with Immuta.
You can have a set of users manage the tags on the data - those who know the data best - and a separate set of users to author the policies. When they author those policies, they reference tags, a semantic layer, rather than the physical tables and columns, which they don't understand.
The tags bridge the gap between the physical world and the logical world, allowing the compliance experts to build meaningful policy leveraging the knowledge of the physical world transferred into the tags.
Remember also, it is possible to automatically tag data through Immuta Discover, which further automates this process.
The GOVERNANCE permission in Immuta is quite powerful, as described in our permissions section. It is intended for situations where a select few users control all policies.
It is possible to instead delegate policy control to data owners without giving them governance permission. This allows them to write global policies just like governors, but they are restricted to only the data sources they own.
Note that this capability is further enhanced with the Immuta domains feature, which is currently in private preview.
Do you find yourself spending too much time managing roles and defining permissions in your system? When there are new requests for data, or a policy change, do you spend an inordinate amount of time making those changes? Scalability and evolvability will completely remove this burden. When you have a scalable and evolvable data policy management system, it allows you to make changes that impact hundreds if not thousands of tables at once, accurately. It also allows you to evolve your policies over time with minor changes, or no changes at all, through future-proof policy logic.
Lack of scalability and evolvability is rooted in the fact that you are attempting to apply a coarse role-based access control (RBAC) model to your modern data architecture. Using Apache Ranger, a well-known legacy RBAC system built for Hadoop, as an example, independent research has shown the explosion of management required to do the most basic of tasks with an RBAC system: Apache Ranger Evaluation for Cloud Migration and Adoption Readiness.
In a scalable solution such as Immuta, the number of policy changes required remains extremely low, providing scalability and evolvability. GigaOm researched exactly this, comparing Immuta's ABAC model to what they called Ranger's RBAC with Object Tagging (OT-RBAC) model, and showed a 75x increase in policy-management burden with Ranger.
https://gigaom.com/report/cloud-data-security/
Value to you: You have more time to spend on the complex tasks you should be spending time on and you don’t fear making a policy change.
Value to the business: Policies can be easily enforced and evolved, allowing the business to be more agile and decrease time-to-data across your organization and avoid errors.
When building access control into our database platforms, the concept of role-based access control (RBAC) is familiar. Roles define who is in them, but also determine what those roles grant access to. A good way to think about this is that roles conflate the who and the what: who is in them and what they have access to (but lack the why).
In contrast, attribute-based access control (ABAC) allows you to decouple your roles from what they have access to, essentially separating the what and why from the who, which also allows you to explicitly explain the "why" in the policy. This gives you an incredible amount of scalability and understandability in policy building. Note that this does not necessarily mean you have to throw away your roles; you can make them more powerful and likely scale them back significantly.
As discussed at the start of this introduction, most of the access control scalability issues in Ranger, Snowflake, Databricks, and similar platforms are rooted in the fact that they use an RBAC model rather than an ABAC model.
Consider that you have a table which contains a transaction_country column, and you have data localization needs that require you to limit specific countries to specific users.
With a classic RBAC approach, you would need to create a role for every permutation of country access. Remember that it's not necessarily just a role per country, because some users may need access to more than one country. Every time a new permutation of country combination is required, a new role must be managed to represent that access.
With Immuta's ABAC approach, since Immuta is able to decouple policy logic from users, you can simply assign users countries and Immuta will filter appropriately on the fly. This can be done with a single policy in Immuta which references the user country metadata. If you add a new user with a never-before-seen combination of countries, in the RBAC model you would have to remember to create a new role and policy for them to see data. In the ABAC model it will "just work" since everything is dynamic - future-proofing your policies.
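The contrast above can be sketched in a few lines of Python. This is a hypothetical illustration of the two models, not Immuta's actual implementation; the role names and attribute keys are invented for the example.

```python
# Hypothetical sketch: RBAC role explosion vs. an ABAC-style filter
# driven by user attributes (not Immuta's actual implementation).

# RBAC: one role per permutation of country access that happens to exist.
rbac_roles = {
    "COUNTRY_US": {"US"},
    "COUNTRY_US_JP": {"US", "JP"},
    # A user needing {"US", "DE"} requires creating yet another role...
}

# ABAC: countries are simply facts attached to the user;
# one dynamic policy filters rows against them.
def filter_rows(rows, user_attrs):
    """A user sees only the rows matching their assigned countries."""
    allowed = user_attrs.get("countries", set())
    return [r for r in rows if r["transaction_country"] in allowed]

rows = [
    {"id": 1, "transaction_country": "US"},
    {"id": 2, "transaction_country": "JP"},
    {"id": 3, "transaction_country": "DE"},
]

# A brand-new combination of countries "just works" -- no new role needed.
alice = {"countries": {"US", "DE"}}
print([r["id"] for r in filter_rows(rows, alice)])  # [1, 3]
```

The RBAC dictionary has to grow with every new permutation; the ABAC function never changes.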
For more discussion about this model, see the Role-Based Access Control vs. Attribute-Based Access Control — Explained blog or the NIST article on ABAC, Guide to Attribute Based Access Control (ABAC) Definition and Considerations.
The only way to support AND boolean logic with a role-based model (RBAC) is by creating a new role that conflates the two or more roles you want to AND together.
For example, a governor wants users to only see certain data if they have security awareness training and have consumer privacy training. It would be natural to assume you need both separately as metadata attached to users to drive the policy. However, when you build policies in a role based model, it assumes roles are either OR’ed together in the policy logic or you can only act under one role at a time, and because of this, you will have to create a single role to represent this combination of requirements “users with security awareness training AND consumer privacy training.” This is completely silly and unmanageable - you need to account for every possible combination relevant to a policy, and you have no way of knowing that ahead of time.
With Immuta and its ABAC model, you are able to keep user attributes as meaningful separate facts about the users and then use boolean logic to combine those facts in policy logic. As an example, consider the country filtering policy described in the prior section: you could build the filtering, as described, but additionally add an exception such as "do this filtering for everyone except members of group security awareness training and members of group consumer privacy training" without the need to create a new role that represents those combined.
This next section draws on an analogy: Imagine you are planning your wedding reception. It’s a rather posh affair, so you have a bouncer checking people at the door.
Do you tell your bouncer who’s allowed in? (exception-based) Or, do you tell the bouncer who to keep out? (rejection-based)
The answer to that question should be obvious, but many policy engines allow both exception- and rejection-based policy authoring, which causes a conflict nightmare. Exception-based policy authoring in our wedding analogy means the bouncer has a list of who should be let into the reception. This will always be a shorter list of users/roles if following the principle of least privilege, which is the idea that any user, program, or process should have only the bare minimum privileges necessary to perform its function - you can’t go to the wedding unless invited. This aligns with the concept of privacy by design, the foundation of the CPRA and GDPR, which states “Privacy as the default setting.”
What this means in practice is that you should define what should be hidden from everyone, and then slowly peel back exceptions as needed.
How could your data leak if it wasn’t exception based?
What if you did two policies:

Mask Person Name using hashing for everyone who possesses attribute Department HR.
Mask Person Name using constant REDACTED for everyone who possesses attribute Department Analytics.

Now, some user comes along who is in Department Finance - guess what, they will see the Person Name column in the clear because they were not accounted for, just like the bouncer would let them into your wedding because you didn't think ahead of time to add them to your deny list.
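The exception-based alternative closes this gap by masking for everyone by default and peeling back named exceptions. A hypothetical sketch (not Immuta's policy engine) of the difference:

```python
# Hypothetical sketch: exception-based ("default deny") masking.
# Everyone gets the mask unless an explicit exception names them,
# so an unanticipated department can never see the data in the clear.

def mask_person_name(value, user_attrs):
    if user_attrs.get("department") == "HR":
        return value        # explicit exception: HR sees the value
    return "REDACTED"       # default: masked for everyone else

print(mask_person_name("Jane Doe", {"department": "HR"}))       # Jane Doe
print(mask_person_name("Jane Doe", {"department": "Finance"}))  # REDACTED
```

The never-before-seen Department Finance user falls through to the default branch instead of leaking data.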
There are two main issues with allowing bi-directional policies, which is why Immuta only allows exception-based policies, aligning to the industry standard of least privileged access:
Ripe for data leaks: Rejection-based policies are extremely dangerous, which is why Immuta does not allow them except with a catch-all OTHERWISE statement at the end. Again, this is because if a new role/attribute comes along that you haven't accounted for, that data will be leaked. It is impossible to anticipate every possible user/attribute/group ahead of time, just like it's impossible to anticipate every person off the street who could try to enter your posh wedding and would have to be added to your deny list.
Ripe for conflicts and confusion: Tools that allow both rejection-based and exception-based policy building create a conflict disaster. Let's walk through a simple example - and while this one is simple, imagine if you had hundreds of these policies:
Policy 1: mask name for everyone who is member of group A
Policy 2: mask name for everyone except members of group B
What happens if someone is in both groups A and B? The system will have to fall back on policy ordering to resolve the conflict, which requires users to understand all other policies before building their own; it becomes nearly impossible to understand what a single policy does without looking at all of them.
While many platforms support the concept of object tagging / sensitive data tagging, very few truly support hierarchical tag structures.
First, a quick overview of what hierarchical tag structure means:
This would be a flat tag structure:
SUV
Subaru
Truck
Jeep
Gladiator
Outback
Each tag stands on its own and is not associated with one another in any way; there’s no correlation between Jeep and Gladiator nor Subaru and Outback.
A hierarchical tagging structure establishes these relationships:
SUV.Subaru.Outback
Truck.Jeep.Gladiator
Support for a tagging hierarchy is more than just supporting the tag structure itself. More importantly, policy enforcement should respect the hierarchy as well. Let’s run through a quick contrived example; you want the following policies:
Mask by making null any SUV data
Mask using hashing any Outback data
With a flat structure, if you build those policies they will be in conflict with one another. To avoid that problem you would have to order which policies take precedence, which can get extremely complex when you have many policies.
Instead, if your policy engine truly supports a tagging hierarchy, like Immuta does, it will recognize that Outback is more specific than SUV, and have that policy take precedence.
Mask by making null any SUV data
Mask using hashing any SUV.Subaru.Outback data
Policies are applied correctly without any need for complex ordering of policies.
Yes, this does put some work on the business to correctly build specificity, or depth, into their tagging hierarchy. This is not necessarily easy; however, the logic will have to live somewhere, and having it in the tagging hierarchy rather than policy order again allows you to separate policy definition from data definition. This provides you scalability, evolvability, understandability, and, most importantly, correctness because policy conflicts can be caught at policy-authoring-time.
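The precedence rule described above - the most specific matching tag wins - can be sketched as a short resolver. This is a hypothetical illustration of the idea, not Immuta's implementation:

```python
# Hypothetical sketch: resolving masking policies by tag specificity.
# The deepest (most specific) matching tag wins, so no manual policy
# ordering is needed.

policies = {
    "SUV": "null",                  # Mask by making null any SUV data
    "SUV.Subaru.Outback": "hash",   # Mask using hashing any Outback data
}

def effective_policy(column_tag):
    # Walk from the most specific prefix up to the root and return
    # the first policy found.
    parts = column_tag.split(".")
    for depth in range(len(parts), 0, -1):
        prefix = ".".join(parts[:depth])
        if prefix in policies:
            return policies[prefix]
    return None

print(effective_policy("SUV.Subaru.Outback"))   # hash -- specific wins
print(effective_policy("SUV.Subaru.Forester"))  # null -- falls back to SUV
```

Because specificity is resolved from the tag hierarchy itself, the conflict is settled at policy-authoring time rather than by fragile ordering.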
There are a myriad of techniques and processes companies use to determine what users should have access to which tables. Some customers have had 7 people responding to an email chain for approval before a DBA runs a table GRANT statement, for example. Manual approvals are sometimes necessary, of course, but there’s a lot of power and consistency in establishing objective criteria for gaining access to a table rather than subjective human approvals.
Let’s take the “7 people approve with an email chain” example. Ask the question, “Why do any of you 7 say yes to the user gaining access?” If it’s objective criteria, you can completely automate this process. For example, if the approver says, “I approve them because they are in group x and work in the US,” that is user metadata that could allow the user to automatically gain access to the tables, either ahead of time or when requested. This removes a huge burden from your organization and avoids mistakes.
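The "7 people approve with an email chain" example reduces to a one-line objective check once the approval criteria are stated explicitly. A hypothetical sketch, with the group name and attribute keys invented for the example:

```python
# Hypothetical sketch: encoding the approvers' stated, objective reason
# ("they are in group x and work in the US") as an automatable check.

def auto_grant(user):
    # No email chain, no subjective judgment -- just user metadata.
    return "x" in user["groups"] and user["country"] == "US"

print(auto_grant({"groups": {"x"}, "country": "US"}))  # True
print(auto_grant({"groups": {"x"}, "country": "JP"}))  # False
```

If the criteria can be written down like this, access can be granted automatically - either ahead of time or on request - with a consistent, auditable rationale.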
Being objective is always better than subjective: it increases accuracy, removes bias, eliminates errors, and proves compliance. If you can be objective and prescriptive about who should gain access to what tables - you should.
The anti-pattern is manual approvals. Although there are some regulatory requirements for this, if there’s any possible way to switch to objective approvals, you should do it. With subjective human-driven approvals, there is bias, larger chance for errors, and no consistency - this makes it very difficult to prove compliance and is simply passing the buck (and risk) to the approvers and wasting their valuable time.
One could argue that it's subjective or biased to assign a user the Country.JP attribute. This is not true because, remember, data policy is separated from user metadata. The act of giving a user the Country.JP attribute simply defines that user - it is a fact about that user. No access is implied by this act, and the attribute is objective - e.g., either they are in Japan or they are not.
Conflating an access decision with a role or group is common practice. Not only do you end up with manual approval flows, but you also end up with role explosion from creating a role for every combination of access.
By having highly granular controls coupled with anonymization techniques, more data than ever can be at the fingertips of your analysts and data scientists (in some cases, up to 50% more).
Why is that?
Let’s start with a simple example and get more complex. Obviously, if you can’t do row- and column-level controls and are limited to only GRANTing access to tables, you are either over-sharing or under-sharing. In most cases, it’s under-sharing: there are rows and columns in that table the users can see, just not all of them, but they are blocked completely from the table.
That example was obvious, but it can get a little more complex. If you have column-level controls, now you can give them access to the table, but you can completely hide a column from a user by making all the values in it null, for example. Thus, they’ve lost all data/utility from that column, but at least they can get to the other columns.
That masked column can be more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent - you can track and group by the values, but can’t know exactly what they are.
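The consistency property of hashing described above is easy to demonstrate. A minimal sketch using an unsalted SHA-256 digest (in practice a masking policy would typically salt the hash to resist dictionary attacks):

```python
import hashlib

# Sketch: a deterministic hash preserves consistency, so analysts can
# still group and join on a masked column without seeing raw values.
# (Unsalted for illustration only; real masking should use a salt.)

def mask(value):
    # The same input always yields the same digest.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

emails = ["a@example.com", "b@example.com", "a@example.com"]
masked = [mask(e) for e in emails]

# Grouping still works: rows 0 and 2 collapse into one group.
print(masked[0] == masked[2])  # True
print(masked[0] == masked[1])  # False
```

The analyst can count distinct customers or join across tables on the masked column, while the raw email stays hidden.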
But you can make that masked column even more useful. If you use a more advanced masking technique instead of hashing, users can know many of the values, but not all of them, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you gain utility from the data while preserving privacy. These are termed privacy enhancing technologies (PETs), and Immuta places them at your fingertips.
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
While columns like first_name, last_name, email, and social security number can certainly be directly identifying, columns like gender and race, on the surface, seem like they may not be - but they can be. Imagine there are very few Tongan men in a data set...in fact, for the sake of this example, let's say there's only one. If I know of a Tongan man in that company, I can easily run a query like this and figure out that person's salary without using their name, email, or social security number:
select salary from [table] where race = 'Tongan' and gender = 'Male';
This is the challenge with indirect identifiers. It comes down to how much your adversary - the person trying to break privacy - knows externally, which is unknowable to you. In this case, all they had to know was that the person was Tongan and a man (and there happens to be only one in the data) to figure out their salary - sensitive information. Let's also pretend the result of that query was a salary of 106072. This is called a linkage attack, and it is specifically called out in privacy regulations as something you must contend with. For example, from GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
Almost any useful column with many unique values will be a candidate for indirectly identifying an individual, but also be an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. Take note of two things by querying the data:
If you only search for “Tongan” alone (no Male), there are several Tongan women, so this linkage attack no longer works: select salary, gender from [table] where race = 'Tongan';
There are no null values in the gender or race columns.
Now let's say you apply the k-anonymization masking policy using Immuta.
Then you run this query again to find the Tongan man's salary: select salary from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
You get no results.
Now you run this query ignoring the gender: select salary, gender from immuta_fake_hr_data where race = 'Tongan';
Only the women are returned.
The linkage attack was successfully averted. Remember, from our queries prior to the policy, the salary was 106072, so let’s run a query with that: select race, gender from immuta_fake_hr_data where salary = 106072;
There he is! But race will be suppressed (NULL), so this linkage attack will not work. It was also smart enough not to suppress gender, because that did not contribute to the attack; suppressing race alone averts it. This is the magic of k-anonymization: it provides as much utility as possible while preserving privacy by suppressing values that appear so infrequently (along with the other values in that row) that they could lead to a linkage attack.
Cell-level security is not exactly an advanced privacy enhancing technology (PET) as in the example above, but it does provide impressive granular controls within a column for common use cases.
What is cell-level security?
If you have values in a column that should sometimes be masked, but not always, that is masking at the cell-level, meaning the intersection of a row with a column. What drives whether that cell should be masked or not is some other value (or set of values) in the rest of the row shared with that column (or a joined row from another table).
For example, a user wants to mask the credit card numbers but only when the transaction amount is greater than $500. This allows you to drive masking in a highly granular manner based on other data in your tables.
This technique is also possible using Immuta, and you can leverage tags on columns to drive which column in the row should be looked at to mask the cell in question, providing further scalability.
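The credit card example above maps directly onto a conditional expression in SQL. A minimal sketch using sqlite3 to show the mechanics (the table, column names, and $500 threshold come from the example; this is illustrative, not how Immuta enforces the policy):

```python
import sqlite3

# Sketch: cell-level masking in plain SQL -- mask the credit card number
# only when another value in the same row (the amount) exceeds $500.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (cc TEXT, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [("4111-1111", 120.0), ("4222-2222", 980.0)])

# The masking decision for each cell is driven by the rest of its row.
rows = conn.execute("""
    SELECT CASE WHEN amount > 500 THEN 'XXXX-XXXX' ELSE cc END AS cc,
           amount
    FROM txns
    ORDER BY amount
""").fetchall()
print(rows)  # [('4111-1111', 120.0), ('XXXX-XXXX', 980.0)]
```

Each cell in the column is evaluated independently, which is what distinguishes cell-level security from masking the whole column.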
This is a pretty simple one: if you can’t show your work, you are in a situation of trust with no way to verify. Writing code to enforce policy (Snowflake, Databricks, etc.) or building complex policies in Ranger does show your work to a certain extent - but not enough for outsiders to easily understand the policy goals and verify their accuracy, and certainly not to the non-engineering teams that care that policy enforcement is done correctly.
With Immuta, policy is represented in natural language that is easily understood by all. This allows non-engineering users to verify that policy has been written correctly. Remember that when using global policies they leverage tags rather than physical table/column names, which further enhances understandability.
Lastly, and as covered in the scalability principle, with Immuta you are able to build far fewer policies - upwards of 75x fewer - which brings with it an enormous amount of understandability.
Certainly this does not mean you have to build every policy through our UI - data engineers can build automation through the Immuta API, if desired, and those policies are presented in a human readable form to the non-engineering teams that need to understand how policy is being enforced.
Understandability of policy is critically important. This should be further augmented by change history around policy, and being able to monitor and attribute change.
Immuta provides this capability through extensive audit logs and takes it a step further by providing history views and diffs in the user interface.
This is different from query activity in your data platform, which is discovered and surfaced in Immuta Detect. In addition, actions taken in Immuta that alter policy decisions are audited, allowing the creation of compliance reports around that information.
Without Immuta, if policy changes are made by tasking an engineer in an ad hoc manner, there is no history of the change, nor is it possible to see the difference between the old and new policies. That makes it impossible to take a historical look at changes and understand where an issue may have arisen. If you have a standardized platform for making policy changes, like Immuta, then you are able to understand and inspect those changes over time.
This is one of the largest challenges for organizations. Having multiple data platforms/compute, which is quite common, means that you must configure policies uniquely in each of them. For example, the way you build policies in Databricks is completely different from how you build policies in Snowflake. This becomes highly complex to manage, understand, and evolve (really hard to make changes).
Just like the big data era created the need to separate compute from storage, the privacy era requires you to separate policy from platform. Immuta does just that; it abstracts the policy definition from your many platforms, allowing you to define policy once and apply anywhere - consistently!
You should not build policies uniquely in each data platform/compute you use; this will create chaos, errors, and make the data platform team a bottleneck for getting data in the hands of analysts.