Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
As mentioned in the POV Guide, by having highly granular controls coupled with anonymization techniques, more data than ever can be at the fingertips of your analysts and data scientists (we’ve seen examples of up to 50% more).
Why is that?
Let’s start with a simple example and get more complex. Obviously, if you can’t apply row- and column-level controls and are limited to GRANTing access to whole tables, you are either over-sharing or under-sharing. In most cases it’s under-sharing: there are rows and columns in that table the users are allowed to see, just not all of them, so instead they are blocked from the table completely.
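To make that concrete, here is a sketch of what the table-level-only approach looks like in a warehouse (the role name analyst_role is purely illustrative; exact GRANT syntax varies by warehouse):
grant select on table immuta_fake_hr_data to role analyst_role; -- the role sees every row and column, or (without the grant) nothing at all
There is no way to express “every column except salary” or “every row except certain people” with a statement like this alone.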
Ok, that was obvious, so let’s get a little more complex. If you have column-level controls, you can now give users access to the table but completely hide a column from them, for example by making all of its values null. They’ve lost all data/utility from that column, but at least they can get to the other columns.
We can make that masked column more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent: you can track and group by the values, but you can’t know exactly what they are.
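To see why a consistent hash retains more utility than nulling, you could experiment with a query like the following outside of Immuta (a sketch using the sha2 function available in warehouses such as Snowflake and Databricks; Immuta applies its masking automatically through policy, not through queries you write):
select sha2(last_name, 256) as masked_last_name, count(*) as employees, avg(salary) as avg_salary from immuta_fake_hr_data group by sha2(last_name, 256); -- identical names still group together, but the raw names are never revealed
A nulled column, by contrast, would collapse every row into a single group and tell you nothing.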
But you can make that masked column even more useful! If you use something like k-anonymization (which we’ll discuss shortly) instead of hashing, users can see many of the values, just not all of them, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you gain utility from the data while preserving privacy. These techniques are termed Privacy Enhancing Technologies (PETs), and Immuta places them at your fingertips.
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
Creating a balance between privacy and utility is critical to stakeholders across the business. Legal and compliance stakeholders can rest assured that policy is in place, yet data analysts can have access to more data than ever before.
Because of this, the business reaps:
Increased revenue: increased data access by providing utility from sensitive data rather than completely blocking it.
Decreased cost: building these PETs is complex and expensive. Immuta has invested years of research so you can apply these PETs dynamically to your data at the click of a button.
Decreased risk: without granular controls at its fingertips, your organization may end up over-sharing, opening up high levels of risk. With Immuta, you can reduce risk through the privacy-vs-utility balance it provides.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy on any table OR
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
Before we build this policy, let’s take a quick look at the Immuta Fake HR Data table; please query it in your compute/warehouse of choice following the Query Your Data guide.
While columns like first_name, last_name, email, and social security number can certainly be directly identifying (although we masked them in previous walkthroughs you may have completed), columns like gender and race, on the surface, seem like they may not be identifying. But they can be: imagine there are very few Tongan males in this data set...in fact, there’s only one. So if I know of a Tongan male in that company, I can easily run a query like this and figure out that person’s salary without using their name, email, or social security number:
select * from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
This is the challenge with indirect identifiers. It comes down to how much your adversary, the person trying to break privacy, knows externally, which is unknowable to you. In this case, all they had to know was the person was Tongan and male (and there happens to be only one of them in the data) to figure out their salary (it’s $106,072). This is called a linkage attack and is specifically called out in privacy regulations as something you must contend with, for example, from GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
So you see where we are going: almost any useful column with many unique values will be a candidate for indirectly identifying an individual, but also be an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. Before we get started with K-Anonymization, take note of two things by querying the data:
If you only search for “Tongan” alone (no Male), there are several Tongan females, so this linkage attack no longer works: select * from immuta_fake_hr_data where race = 'Tongan';
There are no null values in the gender or race columns.
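If you want to verify that second point yourself, a simple check works in any of the supported warehouses:
select count(*) from immuta_fake_hr_data where gender is null or race is null; -- returns 0 before any policy is applied
This matters because, once the policy is in place, any nulls you do see in these columns were introduced by suppression, not by the underlying data.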
Let's build a k-anonymization policy:
Log in to Immuta with your user with GOVERNANCE permission (and/or is the Data Owner of the table “Immuta Fake HR Data”).
Visit the Immuta Fake HR Data data source and click the Policies tab.
If you’ve done some of the other walkthroughs, you will see those policies listed here because they propagated from a global policy down to what we call local policies.
In this case, we will create a local policy on this specific table (make sure if you have multiple computes/warehouses this is the one you plan to query against).
Click + New Policy in the Data Policies section.
Select the Mask option.
Set the mask type to with K-Anonymization.
Select the gender and race columns.
Leave using Fingerprint (group size = 5)
In this case, through our algorithm, we selected the best group size for you (see the third bullet below for more details). This means any combination of gender and race that shows up fewer than 5 times will be suppressed.
You could override this setting with your own group size, or
You could set the maximum re-identifiability probability as a way to derive the group size. That probability is roughly 1 divided by the group size, so a 1% chance of re-identification requires a much larger group size (at least 100) than a 20% chance (at least 5). In other words, you are trading utility for privacy, because more data will be suppressed the lower the re-identifiability probability. The default fingerprint setting (described in the first bullet above) uses a heuristic that attempts to preserve 80% of the information in the columns while keeping the maximum re-identification probability at or below 20% (group size of 5 or greater). This assumes you’ve selected all possible indirect identifiers in the k-Anonymization policy.
Change for everyone except to for everyone.
Click Create and Save All.
It may take a few seconds for Immuta to run the k-anonymization calculations to apply this policy.
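If you’re curious which combinations end up suppressed, the counting logic behind k-anonymization is easy to sketch in SQL (this is the standard k-anonymity idea with a group size of 5, not necessarily Immuta’s exact algorithm, and it only gives the expected answer against the unprotected data, e.g., run as the Data Owner before the policy takes effect):
select race, gender, count(*) as group_size from immuta_fake_hr_data group by race, gender having count(*) < 5; -- combinations rarer than the group size are candidates for suppression
The single Tongan male in the data falls into exactly this bucket, which is why you’ll see his race nulled in the queries below.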
Now let’s go back and query the Immuta Fake HR Data table, remembering to query it in your compute/warehouse where you built the local policy in the above steps following the Query Your Data guide.
First let’s run this query again to find the male Tongan’s salary: select * from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
Wait...what...no results?
Ok, let’s run this query ignoring the gender: select * from immuta_fake_hr_data where race = 'Tongan';
We only get the Females back!
We successfully averted this linkage attack. Remember, from our queries prior to the policy, the salary was 106072, so let’s run a query with that: select * from immuta_fake_hr_data where salary = 106072;
There he is! But notice race is suppressed (NULL) so this linkage attack will not work. It was also smart enough to NOT suppress gender because that did not contribute to the attack; suppressing race alone averts the attack. This technique provides as much utility as possible while preserving privacy.
Coarse-grained access control forces over- or under-sharing, which gets you in hot water with either Legal and Compliance (who want more privacy) or the analysts (who want more data), depending on which direction you go. Advanced anonymization techniques give you the flexibility to make these tradeoffs and keep both stakeholders happy.
To learn more about K-Anonymization and our other advanced PETs, please download our ebook: How to Enhance Privacy in Data Science
Feel free to return to the POV Guide to move on to your next topic.