Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
Cell-level security is not exactly an advanced privacy enhancing technology (PET) like the ones shown in the Example of anonymizing a column rather than blocking it walkthrough, but it does provide impressively granular controls within a column for common use cases.
What is cell-level security?
If you have values in a column that should sometimes be masked, but not always, that is masking at the cell level, meaning the intersection of a row with a column. What drives whether a given cell is masked is some other value (or set of values) elsewhere in the same row (or in a joined row from another table).
Let's use a simple example: say we want to mask credit card numbers, but only when the transaction amount is greater than $500. This lets you drive masking in a highly granular manner based on other data in your tables.
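Conceptually, a cell-level masking policy behaves like a CASE expression applied to the column at query time. Here is a minimal sketch in Python with SQLite of what the masked result effectively looks like; the table, column names, and card numbers are hypothetical, not Immuta's actual implementation:

```python
import sqlite3

# Hypothetical data; table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (credit_card TEXT, transaction_amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("4111-1111-1111-1111", 250.0), ("5500-0000-0000-0004", 750.0)],
)

# Mask the credit card number only in rows where the amount exceeds $500,
# leaving it in the clear everywhere else.
rows = conn.execute(
    """
    SELECT CASE WHEN transaction_amount > 500 THEN 'REDACTED'
                ELSE credit_card END AS credit_card,
           transaction_amount
    FROM transactions
    ORDER BY transaction_amount
    """
).fetchall()
print(rows)
```

The $250 row keeps its real number while the $750 row comes back as REDACTED. Immuta applies equivalent logic transparently, which is why no per-case views need to be created or maintained.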
Creating a balance between privacy and utility is critical to stakeholders across the business. Legal and compliance stakeholders can rest assured that policy is in place, yet data analysts can have access to more data than ever before.
Because of this, the business reaps:
Increased revenue: increased data access by providing utility from sensitive data rather than completely blocking it.
Decreased cost: the amount of views you would need to create and manage to do cell-level controls manually would be enormous.
Decreased risk: your organization may end up over-sharing since they don’t have the granular controls at their fingertips, opening up high levels of risk. With Immuta, you can reduce risk through the privacy vs utility balance provided.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy on any table OR
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
Let's create a Global masking policy.
Log in to Immuta with the user that owns the data sources you created in the POV Data Setup.
Click the Policies icon in the left sidebar.
Click + Add New Data Policy.
Name it Mask Credit Cards.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Entity.Credit Card Number.
Change the masking type to using a constant.
Enter the constant REDACTED.
Change for to where.
For the where clause, enter transaction_amount > 500.
Note that you can also reference tags in your where clause, so we could have done something like @columnTagged('amounts') > 500 if the transaction_amount columns are named differently across tables.
Change everyone except to everyone. (This policy will have no exceptions.)
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess properly based on previous steps.)
Click Create Policy and then Activate Policy.
You can also test whether everything was masked correctly in the “Immuta fake credit card transactions” table by following the Query Your Data guide. Note that if the transaction_amount value in a row is greater than $500, the credit card number in that same row is replaced with the word REDACTED.
Note: If you cannot query the “Immuta fake credit card transactions” table, it’s likely because you did not remove the purpose restriction policy from the Purpose based exceptions walkthrough.
The alternative is coarse-grained access control. Over- and under-sharing gets you in hot water with either Legal and Compliance (who want more privacy) or the analysts (who want more data), depending on which direction you go. Highly granular techniques like cell-level security give you the flexibility to make these tradeoffs and keep both stakeholders happy.
Feel free to return to the POV Guide to move on to your next topic.
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
As mentioned earlier, by having highly granular controls coupled with anonymization techniques, more data than ever can be at the fingertips of your analysts and data scientists (we’ve seen examples of up to 50% more).
Why is that?
Let’s start with a simple example and get more complex. Obviously, if you can’t do row- and column-level controls and are limited to only GRANTing access to tables, you are either over-sharing or under-sharing. In most cases it’s under-sharing: there are rows and columns in that table the users should be able to see, just not all of them, so instead they are blocked from the table completely.
Ok, that was obvious; now let’s get a little more complex. With column-level controls, you can give users access to the table while completely hiding a column from them, for example by making all the values in it null. They lose all data/utility from that column, but at least they can get to the other columns.
We can make that masked column more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent: you can track and group by the values, but you can’t know exactly what they are.
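A quick sketch of why consistency matters; the salt, token length, and email values below are hypothetical, and real products use more robust masking schemes than this:

```python
import hashlib

def mask(value: str, salt: str = "demo-salt") -> str:
    """Deterministically mask a value; identical inputs yield identical tokens."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

emails = ["a@example.com", "b@example.com", "a@example.com"]
masked = [mask(e) for e in emails]

# Consistency means equality still works: repeated values collapse to the
# same token, so you can still count, join, and group by the masked column...
assert masked[0] == masked[2]
# ...without ever seeing the underlying emails.
assert len(set(masked)) == 2
```

A constant mask (like REDACTED) would destroy all of this structure; a consistent hash keeps the analytics working while hiding the raw values.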
But you can make that masked column even more useful! If you use something like k-anonymization (which we’ll discuss shortly) instead of hashing, users can know many of the values, but not all of them, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you gain utility from the data while preserving privacy. These techniques are termed Privacy Enhancing Technologies (PETs), and Immuta places them at your fingertips.
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
Creating a balance between privacy and utility is critical to stakeholders across the business. Legal and compliance stakeholders can rest assured that policy is in place, yet data analysts can have access to more data than ever before.
Because of this, the business reaps:
Increased revenue: increased data access by providing utility from sensitive data rather than completely blocking it.
Decreased cost: building these PETs is complex and expensive; Immuta has invested years of research so you can apply them dynamically to your data at the click of a button.
Decreased risk: your organization may end up over-sharing since they don’t have the granular controls at their fingertips, opening up high levels of risk. With Immuta, you can reduce risk through the privacy vs utility balance provided.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy on any table OR
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
While columns like first_name, last_name, email, and social security number can certainly be directly identifying (although we masked them in previous walkthroughs you may have completed), columns like gender and race, on the surface, seem like they may not be. But they can be: imagine there are very few Tongan males in this data set...in fact, there’s only one. So if I know of a Tongan male at that company, I can easily run a query like this and figure out that person’s salary without using their name, email, or social security number:
select * from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
This is the challenge with indirect identifiers. It comes down to how much your adversary, the person trying to break privacy, knows externally, which is unknowable to you. In this case, all they had to know was the person was Tongan and male (and there happens to be only one of them in the data) to figure out their salary (it’s $106,072). This is called a linkage attack and is specifically called out in privacy regulations as something you must contend with, for example, from GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
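The linkage attack described above can be sketched in a few lines. The rows below are toy data modeled on the example in this walkthrough, and the point is that the adversary never touches a direct identifier:

```python
# Toy HR rows (no names, emails, or SSNs exposed to the adversary).
hr_data = [
    {"race": "Tongan", "gender": "Male",   "salary": 106072},
    {"race": "Tongan", "gender": "Female", "salary": 91000},
    {"race": "Tongan", "gender": "Female", "salary": 88000},
    {"race": "Fijian", "gender": "Male",   "salary": 95000},
]

# External knowledge: "the person I'm after is a Tongan male."
matches = [r for r in hr_data if r["race"] == "Tongan" and r["gender"] == "Male"]

# Exactly one row matches, so two innocuous-looking columns are enough
# to re-identify the salary.
assert len(matches) == 1
assert matches[0]["salary"] == 106072
```

Either attribute alone is harmless; it is the combination, joined with knowledge you cannot control, that re-identifies the person.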
So you see where we are going: almost any useful column with many unique values will be a candidate for indirectly identifying an individual, but also be an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. Before we get started with K-Anonymization, take note of two things by querying the data:
If you search for “Tongan” alone (without the Male predicate), there are several Tongan females, so this linkage attack no longer works: select * from immuta_fake_hr_data where race = 'Tongan';
There are no null values in the gender or race columns.
Let's build a k-anonymization policy:
Log in to Immuta with your user with GOVERNANCE permission (and/or is the Data Owner of the table “Immuta Fake HR Data”).
Visit the Immuta Fake HR Data data source and click the Policies tab.
If you’ve done some of the other walkthroughs, you will see those policies listed here because they propagated from a global policy down to what we call local policies.
In this case, we will create a local policy on this specific table (make sure if you have multiple computes/warehouses this is the one you plan to query against).
Click + New Policy in the Data Policies section.
Select the Mask option.
Set the mask type to with K-Anonymization.
Select the gender and race columns.
Leave using Fingerprint (group size = 5).
In this case, our algorithm selected the best group size for you (see the third bullet below for more details). This means any combination of gender and race that shows up fewer than 5 times will be suppressed.
You could override this setting with your own group size, or
You could set a maximum re-identifiability probability as a way to set the group size: with a group size of k, the worst-case chance of re-identifying someone from a quasi-identifier combination is 1/k, so demanding a 1% chance of re-identification forces a higher group size (and more suppression) than tolerating a 20% chance. In other words, you are trading utility for privacy, because more data is suppressed the lower the re-identifiability probability. The default Fingerprint setting (described in the first bullet above) uses a heuristic that attempts to preserve 80% of the information in the columns without exceeding a maximum re-identification probability of 20% (group size of 5 or greater). This assumes you’ve selected all possible indirect identifiers in the k-anonymization policy.
Change for everyone except to for everyone.
Click Create and Save All.
It may take a few seconds for Immuta to run the k-anonymization calculations to apply this policy.
First let’s run this query again to find the male Tongan’s salary: select * from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
Wait...what...no results?
Ok, let’s run this query ignoring the gender: select * from immuta_fake_hr_data where race = 'Tongan';
We only get the Females back!
We successfully averted this linkage attack. Remember, from our queries prior to the policy, the salary was 106072, so let’s run a query with that: select * from immuta_fake_hr_data where salary = 106072;
There he is! But notice race is suppressed (NULL), so this linkage attack will not work. Immuta was also smart enough NOT to suppress gender, because gender did not contribute to the attack; suppressing race alone averts it. This technique provides as much utility as possible while preserving privacy.
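In miniature, the suppression the policy applied can be sketched like this. The data is toy data, and the rule is deliberately simplified: unlike Immuta's fingerprint, which suppressed only the column needed to avert the attack, this sketch nulls every quasi-identifier of any combination rarer than the group size:

```python
from collections import Counter

K = 5  # group size, matching the fingerprint default above

# Toy rows: one unique (race, gender) combination and one common one.
rows = (
    [{"race": "Tongan", "gender": "Male",   "salary": 106072}]
    + [{"race": "Tongan", "gender": "Female", "salary": 90000}] * 6
)

# Count how often each (race, gender) combination appears.
counts = Counter((r["race"], r["gender"]) for r in rows)

def anonymize(row):
    # Null out the quasi-identifiers of any combination appearing fewer
    # than K times; well-represented groups pass through untouched.
    if counts[(row["race"], row["gender"])] < K:
        return {**row, "race": None, "gender": None}
    return row

anon = [anonymize(r) for r in rows]
assert anon[0] == {"race": None, "gender": None, "salary": 106072}  # suppressed
assert anon[1]["race"] == "Tongan"                                  # kept
```

Note that the salary column is untouched in both cases; k-anonymization suppresses only the quasi-identifiers, which is why the data retains its analytic utility.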
The alternative is coarse-grained access control. Over- and under-sharing gets you in hot water with either Legal and Compliance (who want more privacy) or the analysts (who want more data), depending on which direction you go. Advanced anonymization techniques give you the flexibility to make these tradeoffs and keep both stakeholders happy.
Note: run the queries in this walkthrough against the Immuta Fake HR Data table in your compute/warehouse of choice, following the Query Your Data guide. After building the local policy, be sure to query in the same compute/warehouse where you built it.
To learn more about K-Anonymization and our other advanced PETs, please download our ebook.
Feel free to return to the POV Guide to move on to your next topic.