Build Anonymization Against Compute/Warehouses 1 Through n Consistently
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Detection walkthrough.
This is one of the largest challenges organizations face. Having multiple warehouses/compute platforms, which is quite common, means you must configure policies uniquely in each of them. For example, the way you build policies in Databricks is completely different from how you build them in Snowflake. Worse, the platforms support different levels of control: you might be able to enforce row-level security in Snowflake but not in Databricks. This becomes highly complex to manage, understand, and evolve, making even small changes hard.
Just like the big data era created the need to separate compute from storage, the privacy era requires you to separate policy from platform. Immuta does just that: it abstracts the policy definition from your many platforms, allowing you to define policy once and apply it anywhere, consistently.
Evolvability and consistency are the key outcomes of separating policy from platform: you make changes in a single place and they take effect everywhere, consistently.
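To make the define-once idea concrete, here is a minimal Python sketch (not Immuta's actual API; the `MaskPolicy` class and the generated SQL shapes are hypothetical) showing how a single platform-agnostic masking rule could be compiled into each engine's native enforcement syntax:

```python
from dataclasses import dataclass

@dataclass
class MaskPolicy:
    """A platform-agnostic masking rule: reveal a column only to one role."""
    column: str
    allowed_role: str

def to_snowflake(p: MaskPolicy) -> str:
    # Compile the abstract rule to Snowflake-style masking-policy DDL
    # (shape is illustrative only).
    return (
        f"CREATE MASKING POLICY {p.column}_mask AS (val STRING) RETURNS STRING ->\n"
        f"  CASE WHEN CURRENT_ROLE() = '{p.allowed_role.upper()}' "
        f"THEN val ELSE '***' END;"
    )

def to_databricks(p: MaskPolicy) -> str:
    # Compile the same rule to a Databricks-style column-mask function
    # (shape is illustrative only). Note the entirely different role model.
    return (
        f"CREATE FUNCTION {p.column}_mask({p.column} STRING) RETURN\n"
        f"  CASE WHEN is_account_group_member('{p.allowed_role}') "
        f"THEN {p.column} ELSE '***' END;"
    )

# One definition, two very different enforcement targets.
policy = MaskPolicy(column="email", allowed_role="analyst")
print(to_snowflake(policy))
print(to_databricks(policy))
```

The point of the sketch is that the rule is stated exactly once, and each engine's differing syntax and role model becomes a compilation target. That is the separation of policy from platform that Immuta provides at production scale.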
Because of this, the business reaps:
- Increased revenue: data access and time-to-data accelerate because building and evolving policy is streamlined.
- Decreased cost: you operate efficiently at scale, making changes in a single, well-understood platform.
- Decreased risk: you avoid the data leaks caused by managing and editing policy separately in each platform, each of which works differently.
There is no walkthrough for this topic, because you’ve already been doing it (or could do it).
- If you have multiple compute/warehouses, make sure you configure all of them using the POV Data Setup guide.
- Once configured, use any of the policy walkthroughs in this guide to demonstrate building a policy once in Immuta and applying it everywhere.
The anti-pattern is obvious: do not build policies uniquely in each warehouse/compute platform you use. Doing so creates chaos and errors, and makes the data platform team a bottleneck for getting data into the hands of analysts.
Legacy solutions such as Apache Ranger can only deliver this abstraction of policy from compute within the Hadoop ecosystem, because Ranger enforcement has been implemented inconsistently across the other downstream compute/warehouse engines. The inconsistency arises not only from trying to make row-level, column-level, and anonymization techniques behave the same in, say, Databricks as they do in Snowflake, but also from the additional roles that must be created and managed in each system separately, disconnected from the policy definitions. With Immuta, you get complete consistency without forcing new roles into each individual warehouse's paradigm.
Feel free to return to the POV Guide to move on to your next topic.