Prerequisite: Before using this walkthrough, please ensure that you’ve completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
While many platforms support the concept of object tagging / sensitive data tagging, very few truly support hierarchical tag structures.
First, a quick overview of what we mean by hierarchical tag structure:
This would be a flat tag structure:
SUV
Subaru
Truck
Jeep
Gladiator
Outback
Each tag stands on its own and is not associated with any other tag; there’s no relationship between Jeep and Gladiator, or between Subaru and Outback.
A hierarchical tagging structure establishes these relationships, and we’ll explain why this is important momentarily.
SUV.Subaru.Outback
Truck.Jeep.Gladiator
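To make the difference concrete, here is a minimal Python sketch of how a dotted tag encodes its own ancestry. The helper names `ancestors` and `is_under` are hypothetical and used only for illustration; they are not part of any Immuta API.

```python
from typing import List


def ancestors(tag: str) -> List[str]:
    """Return the tag itself followed by each ancestor, most specific first."""
    parts = tag.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def is_under(tag: str, ancestor: str) -> bool:
    """True if `tag` equals `ancestor` or sits below it in the hierarchy."""
    return tag == ancestor or tag.startswith(ancestor + ".")


print(ancestors("SUV.Subaru.Outback"))
# ['SUV.Subaru.Outback', 'SUV.Subaru', 'SUV']

print(is_under("SUV.Subaru.Outback", "SUV"))  # True: the hierarchy records the link
print(is_under("Outback", "SUV"))             # False: flat tags carry no relationship
```

With flat tags there is nothing to split, so no relationship between Outback and SUV can be derived from the tags themselves.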
“Support” for a tagging hierarchy is more than just supporting the tag structure itself. More importantly, policy enforcement should respect the hierarchy as well. Let’s run through a quick contrived example. Let's say that you wanted the following policies:
Mask by making null any SUV data
Mask using hashing any Outback data
With a flat structure, if you build those policies they will be in conflict with one another. To avoid that problem you would have to order which policies take precedence, which can get extremely complex when you have many policies. This is in fact how many policy engines handle this problem. (We’ll discuss more in the Anti-Patterns section.)
Instead, if your policy engine truly supports a tagging hierarchy like Immuta does, it will recognize that Outback is more specific than SUV and give the Outback policy precedence.
Mask by making null any SUV data
Mask using hashing any SUV.Subaru.Outback data
Policies are applied correctly without any need for complex ordering of policies.
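The resolution rule described above can be sketched in a few lines of Python. This is an illustrative model only, assuming that specificity is simply tag depth and that a policy on a tag also covers everything beneath it; `MaskPolicy`, `matches`, and `resolve` are hypothetical names, not Immuta's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MaskPolicy:
    name: str
    tag: str      # tag the policy targets
    masking: str  # e.g. "null" or "hash"


def depth(tag: str) -> int:
    """Number of levels in a dotted tag; deeper means more specific."""
    return len(tag.split("."))


def matches(column_tag: str, policy_tag: str) -> bool:
    """A policy covers its own tag and every tag beneath it."""
    return column_tag == policy_tag or column_tag.startswith(policy_tag + ".")


def resolve(column_tags: List[str], policies: List[MaskPolicy]) -> Optional[MaskPolicy]:
    """Pick the applicable policy whose tag is deepest; no ordering list needed."""
    candidates = [p for p in policies if any(matches(t, p.tag) for t in column_tags)]
    if not candidates:
        return None
    return max(candidates, key=lambda p: depth(p.tag))


policies = [
    MaskPolicy("Mask SUV", "SUV", "null"),
    MaskPolicy("Mask Outback", "SUV.Subaru.Outback", "hash"),
]

print(resolve(["SUV.Subaru.Outback"], policies).masking)  # hash
print(resolve(["SUV.Subaru"], policies).masking)          # null
```

Note that the order of the `policies` list plays no role in the outcome when depths differ; the tag depth alone decides which policy wins. Ties at the same depth are a separate case that this sketch does not handle.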
This allows the business to think about policy, and the application of policy, in terms of a logical model of its data. As a result, you get:
Understandability: Policies are easily read and understood on their own without having to also comprehend precedence of policy (e.g., inspect each policy in combination with all other policies).
Evolvability: What if you need to change all Subaru data to hashing now? With Immuta, that’s an easy change: just update the policy. With solutions that don’t support a tagging hierarchy, you must understand both the policy and its precedence. With a tagging hierarchy, the precedence was taken care of when building the logical tagging model.
Correctness: If two policies hit each other at the same level of the hierarchy, the user is warned of this conflict when building the second policy. This is important because in this case there is likely a true disagreement about what the policy should be doing, and the business can make a decision. With policy ordering, this conflict is not apparent.
Because of this, the business reaps the following benefits:
Increased revenue: accelerate data access / time-to-data.
Decreased cost: operate efficiently and stay agile at scale, because creating or editing a policy does not require comprehending all other policies at once.
Decreased risk: avoid policy errors caused by missed conflicts or misunderstood policy precedence.
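The authoring-time warning described under Correctness can be sketched as a simple check run when a new policy is created. This is a hedged illustration: it assumes "the same level of the hierarchy" means two policies targeting the same tag with different masking, and the names `Policy` and `conflicts_with` are hypothetical rather than anything Immuta exposes.

```python
from typing import List, NamedTuple


class Policy(NamedTuple):
    name: str
    tag: str
    masking: str


def conflicts_with(existing: List[Policy], candidate: Policy) -> List[Policy]:
    """Existing policies on the same tag that ask for a different masking type."""
    return [
        p for p in existing
        if p.tag == candidate.tag and p.masking != candidate.masking
    ]


existing = [
    Policy("Mask SUV", "SUV", "null"),
    Policy("Hash Outback", "SUV.Subaru.Outback", "hash"),
]

# A policy at a different depth is resolved by specificity, so no warning.
print(conflicts_with(existing, Policy("Hash Subaru", "SUV.Subaru", "hash")))  # []

# A second policy on the same tag with different masking is a true conflict.
clash = conflicts_with(existing, Policy("Hash SUV", "SUV", "hash"))
print([p.name for p in clash])  # ['Mask SUV']
```

Under a precedence-ordering model, the second case would be resolved silently by whichever policy happens to be listed first, which is exactly the conflict the business would want surfaced.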
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy against any table in Immuta OR
“Data Owner” of the registered tables. (You likely are the Data Owner and have GOVERNANCE permission.)
To build a policy using tags:
In Immuta, visit the Fake HR Data data source (from any warehouse/compute).
Go to the Data Dictionary tab and view where you have the Discovered.Identifier Direct and the Discovered.Entity.Social Security Number tags. Let’s build two separate policies using those.
Policy 1:
Click the Policies icon in the left sidebar of the Immuta console.
Click + Add New Data Policy.
Name it Mask Direct Identifiers.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Identifier Direct.
Change masking type to by making null.
Change everyone except to everyone. (This policy will have no exceptions.)
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess correctly based on previous steps.)
Click Create Policy and then Activate Policy.
Policy 2:
Click + Add New Data Policy.
Name it Mask SSN.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Entity.Social Security Number.
Change masking type to using hashing.
Change everyone except to everyone. (This policy will have no exceptions.)
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess correctly based on previous steps.)
You can further refine where this policy is applied by adding another circumstance:
Click + Add Another Circumstance.
Change the or to an and.
Select tagged for the circumstance. (Make sure you pick “tagged” and not “with columns tagged.”)
Type in Immuta POV for the tag name. (Remember, this was the tag you created in Schema Monitoring and Automatic Sensitive Data Discovery.) Note that if you are a Data Owner of the tables without GOVERNANCE permission, the policy will be automatically limited to the tables you own.
Click Create Policy and then Activate Policy.
Now visit the Fake HR Data data source again (from any warehouse/compute).
Click the Policies tab.
You will see both of those policies listed; however, on the SSN column the “Mask Direct Identifiers” policy was not applied, because it is less specific than the “Mask SSN” policy.
You can also test that everything was masked correctly by following the Query Your Data guide.
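As a rough mental model of what to expect when you query, the following Python sketch applies the two policies from this walkthrough to a single made-up row. Everything here is an assumption for illustration: the column names `ssn` and `email`, the sample values, and the use of SHA-256 as the hash are hypothetical and do not describe Immuta's actual masking functions or the real contents of Fake HR Data.

```python
import hashlib

# Hypothetical column-to-tag assignments; the ssn column carries both tags.
column_tags = {
    "ssn": [
        "Discovered.Identifier Direct",
        "Discovered.Entity.Social Security Number",
    ],
    "email": ["Discovered.Identifier Direct"],
}

# Tag -> (policy name, masking function). SHA-256 is an illustrative stand-in.
policies = {
    "Discovered.Identifier Direct": (
        "Mask Direct Identifiers",
        lambda value: None,
    ),
    "Discovered.Entity.Social Security Number": (
        "Mask SSN",
        lambda value: hashlib.sha256(value.encode("utf-8")).hexdigest(),
    ),
}


def winning_tag(tags):
    """Among tags that have a policy, the deepest (most specific) one wins."""
    applicable = [t for t in tags if t in policies]
    return max(applicable, key=lambda t: len(t.split(".")))


row = {"ssn": "123-45-6789", "email": "jane@example.com"}  # made-up sample values
masked = {
    col: policies[winning_tag(column_tags[col])][1](value)
    for col, value in row.items()
}

print(policies[winning_tag(column_tags["ssn"])][0])  # Mask SSN
print(masked["email"])                               # None
print(len(masked["ssn"]))                            # 64 (hex digest, not the raw SSN)
```

The `ssn` column is hashed rather than nulled because its three-level tag is deeper than the two-level Discovered.Identifier Direct tag, while columns carrying only the direct-identifier tag are nulled.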
This has already been covered fairly well in the business value section, but policy precedence ordering is the anti-pattern, and it is unfortunately common in tools such as Sentry and Ranger. The problem is that you put the onus on the policy builder to understand the precedence rather than baking that into your data metadata. The policy builder must understand all other policies and cannot build their policy in a vacuum. Similarly, anyone reading a policy must consider it in tandem with every other policy and its precedence to understand how policy is actually going to be enforced. Other tools, like Snowflake and Databricks, have no concept of policy precedence, which leaves you with no solution at all to this problem.
Yes, this does put some work on the business to correctly build “specificity” into their tagging hierarchy (depth == specificity). This is not necessarily easy; however, this logic will have to live somewhere, and having it in the tagging hierarchy rather than policy order again allows you to separate policy definition from data definition. This provides you scalability, evolvability, understandability, and, we believe most importantly, correctness because policy conflicts can be caught at policy-authoring-time as described in the business value section.
Feel free to return to the POV Guide to move on to your next topic.