Availability of Data
By coupling highly granular controls with anonymization techniques, more data than ever can be at the fingertips of your analysts and data scientists (in some cases, up to 50% more).
Why is that?
Let’s start with a simple example and build up to more complex ones. Obviously, if you can’t apply row- and column-level controls and are limited to GRANTing access to entire tables, you are either over-sharing or under-sharing. In most cases it’s under-sharing: there are rows and columns in that table the users should be able to see, just not all of them, so they are blocked from the table completely.
That example was obvious, but it can get a little more complex. With column-level controls, you can now give users access to the table but completely hide a column from them, for example by making all of its values null. They’ve lost all data/utility from that column, but at least they can get to the other columns.
That masked column can be more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent: you can track and group by the values, but can’t know exactly what they are.
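As a concrete illustration, here is a minimal sketch in Python with pandas, assuming a hypothetical email column; it is not how Immuta implements its policies, just a contrast between nulling a column and hashing it consistently.

```python
# Minimal sketch (not Immuta's implementation): null masking vs. consistent
# hashing on a hypothetical "email" column.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "amount": [10, 25, 40],
})

# Null masking: the column survives, but carries no information.
df["email_nulled"] = None

# Consistent hashing: identical inputs yield identical digests, so the
# masked column still supports grouping and joining.
df["email_hashed"] = df["email"].map(
    lambda v: hashlib.sha256(v.encode()).hexdigest()[:8]
)

# Both a@x.com rows share a digest, so this aggregation still works.
print(df.groupby("email_hashed")["amount"].sum())
```

A production policy would also salt the hash so that common values can’t be recovered by a dictionary attack.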
But you can make that masked column even more useful! If you use something like k-anonymization instead of hashing, users can know many of the values, just not the rare ones, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you gain utility from the data while preserving privacy. These techniques are termed privacy enhancing technologies (PETs), and Immuta places them at your fingertips.
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
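To make that idea concrete, here is a rough sketch of k-anonymization-style masking in Python; the mask_rare_values helper and the k threshold are illustrative assumptions, not a description of Immuta's actual policy.

```python
# Rough sketch of k-anonymization-style masking: values occurring at least
# k times stay in the clear; rarer values are suppressed. Illustrative only.
import pandas as pd

def mask_rare_values(col, k=3):
    counts = col.value_counts()
    common = counts[counts >= k].index
    # Keep values that are common enough; suppress the rest.
    return col.where(col.isin(common), other=None)

zips = pd.Series(["75002"] * 5 + ["10001"])  # one rare value
print(mask_rare_values(zips, k=3))
# The five 75002 entries survive; the lone 10001 is suppressed.
```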
While columns like first_name, last_name, email, and social security number can certainly be directly identifying, columns like gender and race seem, on the surface, like they may not be identifying, but they can be.
For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers. However, consider these survey participants, a cohort of male welders who share the same zip code:
| id | zip | gender | occupation | substance_abuse |
| -------- | ----- | ------ | ---------- | --------------- |
| ... | ... | ... | ... | ... |
| 880d0096 | 75002 | Male | Welder | Y |
| f267334b | 75002 | Male | Welder | Y |
| bfdb43db | 75002 | Male | Welder | Y |
| 260930ce | 75002 | Male | Welder | Y |
| 046dc7fb | 75002 | Male | Welder | Y |
| ... | ... | ... | ... | ... |
All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences. Even though direct identifiers have been removed, outsiders could infer substance abuse for any individual they knew to be a male welder in this zip code.
This is the challenge with indirect identifiers. It comes down to how much your adversary, the person trying to break privacy, knows externally, and that is unknowable to you. In this case, all they had to know was that the person was a male welder in this zip code to learn their sensitive survey response. This is called a linkage attack, and privacy regulations specifically call it out as something you must contend with. For example, from the GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
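A simple way to see why this cohort is dangerous: group the table by the candidate indirect identifiers and look for cohorts whose sensitive value is uniform. The sketch below does this in Python with pandas; the column names mirror the example above and are illustrative, not a prescribed workflow.

```python
# Hedged sketch: find cohorts, defined by indirect identifiers, in which
# every member shares the same sensitive value. Knowing someone belongs to
# such a cohort reveals their answer outright. Column names are illustrative.
import pandas as pd

survey = pd.DataFrame({
    "zip": ["75002"] * 5 + ["75001", "75001"],
    "gender": ["Male"] * 7,
    "occupation": ["Welder"] * 5 + ["Nurse", "Nurse"],
    "substance_abuse": ["Y"] * 5 + ["Y", "N"],
})

sensitive = survey.groupby(["zip", "gender", "occupation"])["substance_abuse"]
uniform = sensitive.nunique() == 1  # one distinct value => the cohort leaks it
print(uniform[uniform].index.tolist())  # [('75002', 'Male', 'Welder')]
```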
Almost any useful column with many unique values is a candidate for indirectly identifying an individual, yet is also likely an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. For example, a randomized response policy masks data by slightly randomizing the values in a column, preserving the utility of the data while preventing outsiders from inferring content of specific records.
In this scenario, using randomized response would change some of the Y's in substance_abuse to N's and vice versa; consequently, outsiders couldn't trust the substance_abuse value displayed in any individual row, as they wouldn't know which rows had been changed.
This is the magic of randomized response: it provides as much utility as possible while preserving privacy by replacing values that appear so infrequently (along with other values in that row) that they could lead to a linkage attack. Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.
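Here is a minimal randomized-response sketch in Python, assuming a binary Y/N column and a hypothetical randomization rate p; it isn't Immuta's implementation, but it shows both the per-row uncertainty and how the true aggregate rate stays recoverable.

```python
# Minimal randomized-response sketch for a binary Y/N column. With
# probability p a stored value is replaced by a fair coin flip, so no
# single row can be trusted, yet the true rate is recoverable in aggregate.
import random

def randomize(values, p=0.3):
    out = []
    for v in values:
        if random.random() < p:
            out.append(random.choice(["Y", "N"]))  # replaced by a coin flip
        else:
            out.append(v)  # true value kept
    return out

# Analysts know p, so the observed rate can be debiased:
#   observed = (1 - p) * true + p * 0.5  =>  true = (observed - p/2) / (1 - p)
truth = ["Y"] * 700 + ["N"] * 300
noisy = randomize(truth, p=0.3)
observed = noisy.count("Y") / len(noisy)
estimated = (observed - 0.3 / 2) / (1 - 0.3)
print(round(observed, 3), round(estimated, 3))  # roughly 0.64 and 0.70
```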
Cell-level security is not exactly an advanced privacy enhancing technology (PET) as in the example above, but it does provide impressive granular controls within a column for common use cases.
What is cell-level security?
If you have values in a column that should sometimes be masked, but not always, that is masking at the cell level, meaning the intersection of a row with a column. Whether a given cell should be masked is driven by some other value (or set of values) in the rest of the row it shares with that column (or in a joined row from another table).
For example, a user wants to mask credit card numbers, but only when the transaction amount is greater than $500. This allows you to drive masking in a highly granular manner based on other data in your tables.
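A hedged sketch of that example in Python with pandas appears below; the credit_card and amount column names and the $500 threshold come from the example, the masking string is an arbitrary choice, and none of this reflects Immuta's engine.

```python
# Hedged sketch of cell-level masking: a credit card number is masked only
# in rows where a *different* column, the amount, crosses a threshold.
import pandas as pd

txns = pd.DataFrame({
    "credit_card": ["4111-1111", "4222-2222", "4333-3333"],
    "amount": [120, 750, 510],
})

# Series.mask replaces values where the condition holds, leaving others intact.
txns["credit_card"] = txns["credit_card"].mask(txns["amount"] > 500, "XXXX-XXXX")
print(txns)
```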
This technique is also possible using Immuta, and you can leverage tags on columns to drive which column in the row should be inspected when deciding whether to mask the cell in question, providing further scalability.