Availability of Data
By coupling highly granular controls with anonymization techniques, more data than ever can be at the fingertips of your analysts and data scientists (in some cases, up to 50% more).
Why is that?
Let’s start with a simple example and build up to more complex ones. Obviously, if you can’t apply row- and column-level controls and are limited to GRANTing access to entire tables, you are either over-sharing or under-sharing. In most cases it’s under-sharing: there are rows and columns in that table the users should be able to see, just not all of them, so they are blocked from the table completely.
That example was obvious, but it can get a little more complex. With column-level controls, you can now give users access to the table but completely hide a column from them, for example by making all of its values null. They’ve lost all data/utility from that column, but at least they can get to the other columns.
That masked column can be more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent: you can track and group by the values, but can’t know exactly what they are.
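As a concrete illustration, here is a minimal sketch in Python with pandas, assuming a hypothetical email column; it is not how Immuta implements its policies, just a contrast between nulling a column and hashing it consistently.

```python
# Minimal sketch (not Immuta's implementation): null masking vs. consistent
# hashing on a hypothetical "email" column.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "amount": [10, 25, 40],
})

# Null masking: the column survives, but carries no information.
df["email_nulled"] = None

# Consistent hashing: identical inputs yield identical digests, so the
# masked column still supports grouping and joining.
df["email_hashed"] = df["email"].map(
    lambda v: hashlib.sha256(v.encode()).hexdigest()[:8]
)

# Both a@x.com rows share a digest, so this aggregation still works.
print(df.groupby("email_hashed")["amount"].sum())
```

A production policy would also salt the hash so that common values can’t be recovered by a dictionary attack.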
But you can make that masked column even more useful! If you use something like k-anonymization instead of hashing, users can know many of the values, just not the rare ones, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you gain utility from the data while preserving privacy. These techniques are termed privacy enhancing technologies (PETs), and Immuta places them at your fingertips.
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
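To make that idea concrete, here is a rough sketch of k-anonymization-style masking in Python; the mask_rare_values helper and the k threshold are illustrative assumptions, not a description of Immuta's actual policy.

```python
# Rough sketch of k-anonymization-style masking: values occurring at least
# k times stay in the clear; rarer values are suppressed. Illustrative only.
import pandas as pd

def mask_rare_values(col, k=3):
    counts = col.value_counts()
    common = counts[counts >= k].index
    # Keep values that are common enough; suppress the rest.
    return col.where(col.isin(common), other=None)

zips = pd.Series(["75002"] * 5 + ["10001"])  # one rare value
print(mask_rare_values(zips, k=3))
# The five 75002 entries survive; the lone 10001 is suppressed.
```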
While columns like first_name, last_name, email, and social security number can certainly be directly identifying, columns like gender and race seem, on the surface, like they may not be identifying, but they can be.
For example, if an analyst wanted to publish data from a health survey she conducted, she could remove direct identifiers. However, consider these survey participants, a cohort of male welders who share the same zip code:
| id | zip | gender | occupation | substance_abuse |
| -------- | ----- | ------ | ---------- | --------------- |
| ... | ... | ... | ... | ... |
| 880d0096 | 75002 | Male | Welder | Y |
| f267334b | 75002 | Male | Welder | Y |
| bfdb43db | 75002 | Male | Welder | Y |
| 260930ce | 75002 | Male | Welder | Y |
| 046dc7fb | 75002 | Male | Welder | Y |
| ... | ... | ... | ... | ... |
All members of this cohort have indicated substance abuse, sensitive personal information that could have damaging consequences. Even though direct identifiers have been removed, outsiders could infer substance abuse for any individual they knew to be a male welder in this zip code.
This is the challenge with indirect identifiers. It comes down to how much your adversary, the person trying to break privacy, knows externally, and that is unknowable to you. In this case, all they had to know was that the person was a male welder in this zip code to learn their sensitive survey response. This is called a linkage attack, and privacy regulations specifically call it out as something you must contend with. For example, from the GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
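A simple way to see why this cohort is dangerous: group the table by the candidate indirect identifiers and look for cohorts whose sensitive value is uniform. The sketch below does this in Python with pandas; the column names mirror the example above and are illustrative, not a prescribed workflow.

```python
# Hedged sketch: find cohorts, defined by indirect identifiers, in which
# every member shares the same sensitive value. Knowing someone belongs to
# such a cohort reveals their answer outright. Column names are illustrative.
import pandas as pd

survey = pd.DataFrame({
    "zip": ["75002"] * 5 + ["75001", "75001"],
    "gender": ["Male"] * 7,
    "occupation": ["Welder"] * 5 + ["Nurse", "Nurse"],
    "substance_abuse": ["Y"] * 5 + ["Y", "N"],
})

sensitive = survey.groupby(["zip", "gender", "occupation"])["substance_abuse"]
uniform = sensitive.nunique() == 1  # one distinct value => the cohort leaks it
print(uniform[uniform].index.tolist())  # [('75002', 'Male', 'Welder')]
```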
Almost any useful column with many unique values is a candidate for indirectly identifying an individual, yet is also likely an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. For example, a randomized response policy masks data by slightly randomizing the values in a column, preserving the utility of the data while preventing outsiders from inferring content of specific records.
In this scenario, using randomized response would change some of the Y's in substance_abuse to N's and vice versa; consequently, outsiders couldn't trust the substance_abuse value displayed in any individual row, as they wouldn't know which rows had been changed.
This is the magic of randomized response: it provides as much utility as possible while preserving privacy by replacing values that appear so infrequently (along with other values in that row) that they could lead to a linkage attack. Using randomized response doesn't destroy the data because data is only randomized slightly; aggregate utility can be preserved because analysts know how and what proportion of the values will change. Through this technique, values can be interpreted as hints, signals, or suggestions of the truth, but it is much harder to reason about individual rows.
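Here is a minimal randomized-response sketch in Python, assuming a binary Y/N column and a hypothetical randomization rate p; it isn't Immuta's implementation, but it shows both the per-row uncertainty and how the true aggregate rate stays recoverable.

```python
# Minimal randomized-response sketch for a binary Y/N column. With
# probability p a stored value is replaced by a fair coin flip, so no
# single row can be trusted, yet the true rate is recoverable in aggregate.
import random

def randomize(values, p=0.3):
    out = []
    for v in values:
        if random.random() < p:
            out.append(random.choice(["Y", "N"]))  # replaced by a coin flip
        else:
            out.append(v)  # true value kept
    return out

# Analysts know p, so the observed rate can be debiased:
#   observed = (1 - p) * true + p * 0.5  =>  true = (observed - p/2) / (1 - p)
truth = ["Y"] * 700 + ["N"] * 300
noisy = randomize(truth, p=0.3)
observed = noisy.count("Y") / len(noisy)
estimated = (observed - 0.3 / 2) / (1 - 0.3)
print(round(observed, 3), round(estimated, 3))  # roughly 0.64 and 0.70
```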
Cell-level security is not exactly an advanced privacy enhancing technology (PET) as in the example above, but it does provide impressive granular controls within a column for common use cases.
What is cell-level security?
If you have values in a column that should sometimes be masked, but not always, that is masking at the cell level, meaning the intersection of a row with a column. Whether a given cell should be masked is driven by some other value (or set of values) in the rest of the row it shares with that column (or in a joined row from another table).
For example, a user wants to mask credit card numbers, but only when the transaction amount is greater than $500. This allows you to drive masking in a highly granular manner based on other data in your tables.
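A hedged sketch of that example in Python with pandas appears below; the credit_card and amount column names and the $500 threshold come from the example, the masking string is an arbitrary choice, and none of this reflects Immuta's engine.

```python
# Hedged sketch of cell-level masking: a credit card number is masked only
# in rows where a *different* column, the amount, crosses a threshold.
import pandas as pd

txns = pd.DataFrame({
    "credit_card": ["4111-1111", "4222-2222", "4333-3333"],
    "amount": [120, 750, 510],
})

# Series.mask replaces values where the condition holds, leaving others intact.
txns["credit_card"] = txns["credit_card"].mask(txns["amount"] > 500, "XXXX-XXXX")
print(txns)
```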
This technique is also possible using Immuta, and you can leverage tags on columns to drive which column in the row should be inspected when deciding whether to mask the cell in question, providing further scalability.