Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
When building access control into our database platforms, we are all used to a concept called Role-Based Access Control (RBAC). Roles both define who is in them and determine what those users get access to. In other words, roles conflate the who and the what: who is in them and what they have access to (but lack the why).
In contrast, Attribute-Based Access Control (ABAC) allows you to decouple your roles from what they have access to, essentially separating the what and why from the who. It also lets you explicitly capture the “why” in the policy itself, which gives you an incredible amount of scalability and understandability in policy building. Note that this does not necessarily mean you have to throw away your roles; you can make them more powerful and likely scale them back significantly.
If you remember the picture and article from the start of this POV, most of the access control scalability issues in Ranger, Snowflake, Databricks, etc. are rooted in the fact that they follow an RBAC model rather than an ABAC model: Apache Ranger Evaluation for Cloud Migration and Adoption Readiness
This walkthrough will run you through a very simple scenario that shows why separating the who from the what is so critical to scalability and future proofing policies.
If you only have to manage 7 understandable policies vs 700 - wouldn’t you want to? That’s the real value here.
Scalability: Far fewer policies and roles to manage.
Understandability: Policies (and roles) are clearly understood. No one super user is required to explain what is going on.
Evolvability: No fear of making changes, changes are made easily, again, without the need for super user tribal knowledge.
Durability: Changes in data and users will not result in data leaks.
Because of this, the business reaps:
Increased revenue: accelerate data access / time to data.
Decreased cost: operating efficiently at scale, agility at scale, data engineers aren’t spending time managing roles and complex policies.
Decreased risk: prove policy easily, avoid policy errors, understand what policy is doing.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
USER_ADMIN: in order to change attributes on users
GOVERNANCE: in order to build policy against any table in Immuta OR
“Data Owner” of the registered tables from Part 4 without the GOVERNANCE permission. (You likely are the Data Owner and have GOVERNANCE permission.)
This is a simple row-level policy that will restrict what countries you see in the “Immuta Fake Credit Card Transactions” table.
In order to do ABAC, we need to have attributes or groups assigned to you to drive policy. With Immuta these can come from anywhere (we mean literally anywhere), and Immuta will aggregate them to use in policy. Most commonly they come from your identity manager, such as LDAP, Active Directory, Okta, etc., but for simplicity’s sake, we are going to assign attributes to you in Immuta.
Click the People icon and select Users in the left sidebar.
Select your name and click + Add Attributes.
In the Add Attributes menu, type Country in the Attribute field and click Create.
In the Attribute value field, type US and click Create. Repeat the same process to add JP as an attribute value.
Repeat these steps for the non-admin user you created in Part 3 of the POV Data Setup. However, leave off JP and ONLY give US to that non-admin user.
Follow our Query Your Data Guide to run a query against the Immuta Fake Credit Card Transactions data in your compute/warehouse of choice to see the data before we create the policy. You can query with both your admin and non-admin user (if you were able to create a non-admin user).
In the Immuta UI, look at the Data Dictionary for the Immuta Fake Credit Card Transactions table (you can do this by visiting the data source in Immuta and clicking the Data Dictionary tab); notice that the column transaction_country is tagged with Discovered.Entity.Location. This will be important when building the policy.
Click the Policies icon in the left sidebar of the Immuta console. (Note: This is not the Policy tab in the “Immuta Fake Credit Card Transactions” data source; that tab is for local policies).
On the Data Policies tab, click + Add Data Policy.
Name the policy: RLS walkthrough.
Select the action Only show rows.
Leave the sub-action as where user.
Set the qualification as possesses attribute.
Set the attribute key as Country. (Remember, we added those US and JP attributes to you under Country.)
Set the field as Discovered.Entity.Location. (Remember, the transaction_country column was tagged with this.)
Change for everyone except to for everyone. This means there are no exceptions to the policy.
Click Add.
Leave the default circumstance Where should this policy be applied? with On data sources with columns tagged Discovered.Entity.Location. This was chosen because it was the tag you used when building the policy.
You can further refine where this policy is applied by adding another circumstance:
Click + Add Another Circumstance.
Change the or to an and.
Select tagged for the circumstance. (Make sure you pick “tagged” and not “with columns tagged.”)
Type in Immuta POV for the tag name. (Remember, this was the tag you created in Schema Monitoring and Automatic Sensitive Data Discovery.) Note that if you are a Data Owner of the tables without GOVERNANCE permission, the policy will be automatically limited to the tables you own.
Click Create Policy and Activate Policy.
Now the policy is active and easily understandable. We are saying that the user must have a Country attribute matching the value in the transaction_country column in order to see that row, and there are no exceptions to that policy. However, there’s a lot of hidden value in how you built this policy so easily:
Because you separated who the user is (their Country) from the policy definition above, the user’s country is injected dynamically at runtime; this is the heart of ABAC. In an RBAC model this is not possible because the who and the what are conflated: you would have to create a role PER COUNTRY. Not only that, you would also have to create a role per combination of countries (remember, you had US and JP). RBAC is very similar to writing code without being able to use variables; see the sketch that follows these points. Some vendors will claim you can fix this limitation by creating a lookup table that mimics ABAC; however, when that is done, you remove all your policy logic from your policy engine and instead place it in that lookup table.
We also didn’t care how the transaction_country column was named or spelled because we based the policy on the logical tag, not the physical table(s). If you had another table with that same tag but the transaction_country column spelled differently, the policy would still have worked. This allows you to write the policy once and have it apply to all relevant tables based on your tags, remembering that Immuta can auto-discover many relevant tags.
If you add a new user with a never-before-seen combination of countries, in the RBAC model you would have to remember to create a new policy for them to see data. In the ABAC model it will “just work” since everything is dynamic - future-proofing your policies.
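To make the contrast concrete, below is a minimal SQL sketch of the difference. This is not the SQL Immuta actually generates; the table name comes from this walkthrough (qualify it for your warehouse), and the hard-coded country list simply stands in for the Country attributes Immuta injects for the querying user at runtime.

-- ABAC: one policy; the querying user's Country attributes are injected
-- dynamically at query time (shown here as a literal list purely for illustration).
SELECT *
FROM immuta_pov.immuta_fake_credit_card_transactions
WHERE transaction_country IN ('US', 'JP');  -- values come from the user's attributes

-- RBAC: a static view (and role) per country combination, all created up front.
CREATE VIEW transactions_us AS
  SELECT * FROM immuta_pov.immuta_fake_credit_card_transactions
  WHERE transaction_country = 'US';
CREATE VIEW transactions_us_jp AS
  SELECT * FROM immuta_pov.immuta_fake_credit_card_transactions
  WHERE transaction_country IN ('US', 'JP');
-- ...and so on for every combination you ever need to support.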
Look at the data again to prove the policy is working:
Follow our Query Your Data Guide to run a query against the Immuta Fake Credit Card Transactions data in your compute/warehouse of choice to see the data after we created the policy. You can query with both your admin and non-admin user (if you were able to create a non-admin user).
Notice that the admin user will only see US and JP, and the non-admin user only sees US.
RBAC is an anti-pattern because you conflate the who with the what. Again, it’s like writing code without being able to use variables. If you were to write this policy with Ranger you would end up with hundreds if not thousands of policies because you need to account for every unique country combination. Doing this with groups or roles in Databricks and Snowflake would be the same situation.
In these systems, you also cannot specify row-level policies based on tags (only column masking policies), so not only do you need all those roles described above, but you also need to recreate those policies over and over again for every relevant table.
We have seen this in real customer use cases. In one great example, 1 Immuta policy delivered the equivalent controls of 96 rules in Ranger. There’s also, of course, the independent study referenced at the start of this walkthrough.
For more reading on the RBAC anti-pattern: Data Governance Anti-Patterns: Stop Conflating Who, Why, and What
More reading on RBAC vs ABAC in general: Role-Based Access Control vs. Attribute-Based Access Control — Explained
And if you really want to go deep, NIST on ABAC: Guide to Attribute Based Access Control (ABAC) Definition and Considerations
Feel free to return to the POV Guide to move on to your next topic.
Prerequisite: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
There are a myriad of techniques and processes companies use to determine which users should have access to which tables. We’ve seen 7 people having to respond to an email chain for approval before a DBA runs a table GRANT statement, for example. Manual approvals are sometimes necessary, of course, but there’s a lot of power and consistency in establishing objective criteria for gaining access to a table rather than relying on subjective human approvals.
Let’s take the “7 people approve with an email chain” example. We like to ask the question: “Why do any of you 7 say yes to the user gaining access?” If the answer is objective criteria, you can completely automate this process. For example, if the approver says, “I approve them because they are in group x and work in the US,” that is user metadata that could allow the user to automatically gain access to the tables, either ahead of time or when requested. This removes a huge burden from your organization and avoids mistakes. Note that many times it is the purpose of what the user is working on that drives whether they should have access or not; we cover that in our purpose exceptions walkthrough next.
Being objective is always better than subjective: it increases accuracy, removes bias, eliminates errors, and proves compliance. If you can be objective and prescriptive about who should gain access to what tables - you should.
Because of this, the business reaps:
Increased revenue: accelerate data access / time-to-data, no waiting for humans to make decisions.
Decreased cost: operating efficiently at scale, agility at scale because humans are removed from the daily approval flows.
Decreased risk: avoid policy errors through subjective bias and mistakes.
Assumptions: Your user has one of the following permissions in Immuta. (Note that you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy against any table in Immuta
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
Up until now we’ve only talked about data policies, which limit what users see WITHIN a table. Subscription policies manage who can see what tables, similar to table GRANTs you may already be familiar with.
Immuta supports multiple modes of subscription policies:
Allow anyone: This is where anyone can access the table. (We did this in Part 5 of the POV Data Setup.)
Allow anyone who asks (and is approved): These are manual subjective approvals.
Allow users with specific groups/attributes: These are objective policies for access that we will walk through.
Allow individually selected users: This is just like database GRANTs, the table is hidden until a user or group is manually granted access.
As you can see, Immuta does support subjective approvals through “Allow anyone who asks (and is approved)” because there are some regulatory requirements for them, although we’d argue this is an anti-pattern. To show you the power of objective subscription policies, we’ll walk through building an “Allow users with specific groups/attributes” policy.
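As a rough analogy in plain SQL (the table name comes from this POV, the user name is hypothetical, the GRANT syntax varies by warehouse, and Immuta manages this for you rather than issuing these statements verbatim): the manual approach is a per-user grant after an approval chain, while an attribute-based subscription policy is a standing rule applied automatically as users’ attributes change.

-- Manual, subjective approach: a DBA runs a grant per user after approvals.
GRANT SELECT ON immuta_pov.immuta_fake_hr_data TO analyst_jane;

-- Objective, attribute-based approach (conceptual): no per-user statement is
-- ever written; access follows from user metadata, e.g.
-- "Allow users to subscribe when user possesses attribute Country = JP".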
Following the Query Your Data guide, confirm that both your user and the non-admin user you created in Part 3 of the POV Data Setup can query any POV table (“Fake HR Data” or “Fake Credit Card Transactions”). Both can query them due to the “Allow anyone” subscription policy you created in Part 5 of the POV Data Setup; otherwise, the non-admin user would not have been able to query them. We are going to edit that policy to change it to an “Allow users with specific groups/attributes” policy.
Click the Policies icon in the left sidebar of the Immuta console.
Click the Subscription Policies tab at the top. You should see the “Open Up POV Data” subscription policy you created in Part 5 of the POV Data Setup.
Edit the Open Up POV Data subscription policy by clicking the menu button on its right and selecting Edit.
Change How should this policy grant access? from Allow anyone to Allow users with specific groups/attributes.
For Allow users to subscribe when user, select the condition
possesses attribute
Key: Country
Value: JP (Remember, you created this attribute in the Separating Policy Definition from Role Definition: Dynamic Attributes walkthrough.)
At this point there are several other settings for this policy. For the purposes of this walkthrough, leave all of the below options unchecked.
Allow Discovery: If checked, users who don’t meet the policy (don’t have Country JP in this case) will still be able to see/discover this table in the Immuta UI; otherwise it will be hidden from them.
Require users to take action to subscribe: If left unchecked (the default), users will automatically be subscribed to the table when the policy is activated or when they reach a state (get Country JP) that allows them to be subscribed. If checked, the user would have to visit the data source page in the catalog and request a subscription (and be immediately added because they meet the policy).
On merge, allow shared policy responsibility: This will allow other data owners or governors to build policies against these tables as alternatives to your policy (essentially additional policies are OR'ed rather than AND'ed).
Leave the Where should this policy be applied? as is.
Click Save Policy.
Following the Query Your Data guide, confirm that your admin user with Country JP can still see any POV table (“Fake HR Data” or “Fake Credit Card Transactions”), but the non-admin user you created in Part 3 of the POV Data Setup, who only has Country US, cannot query either table. (In some cases, depending on enforcement, the user can still run a query against the table, but they just won’t get any data back.) You can also look at the Members tab of any of the data sources in the Immuta UI and see that your non-admin user has been removed. Additionally, if you log in to Immuta as your non-admin user, you will notice all those tables are gone from the catalog.
Now let’s go back and make it so the non-admin user you created in Part 3 of the POV Data Setup can see the tables again:
Click the Policies icon in the left sidebar.
Click the Subscription Policies tab at the top.
Edit the Open Up POV Data subscription policy again by clicking the menu button on its right and selecting Edit.
Click + Add Another Condition.
Change and to or” (IMPORTANT).
Select the condition
possesses attribute
Key: Country
Value: US
Leave everything else as is and click Save Policy.
Following the Query Your Data guide, confirm both users can see the tables again.
The anti-pattern is manual approvals. We understand that there are some regulatory requirements for this, but if there’s any possible way to switch to objective approvals, you should do it. With subjective, human-driven approvals there is bias, a larger chance for errors, and no consistency - this makes it very difficult to prove compliance and simply passes the buck (and risk) to the approvers while wasting their valuable time.
One could argue that it’s subjective or biased to assign a user the Country JP attribute. This is not true because, remember, we separated the data policy from the user metadata. The act of giving a user the Country JP attribute simply describes that user; no access is implied by that act, and the attribute itself is objective - e.g., you know whether they are in JP or not.
As we’ve seen in our other anti-patterns, the approach where an access decision is conflated with a role or group is common practice (Ranger, Snowflake, Databricks, etc.). So not only do you end up with manual approval flows, but you also end up with role explosion from needing a role for every combination of access, as we described in the Policy boolean logic walkthrough. If you want a real-life example of this in action, you can watch a financial institution’s YouTube video about a software product they built whose sole purpose was to help users understand which roles they needed to request (through manual approval) in order to get to the data they needed - they had so much role explosion they had to build an app to handle it.
Feel free to return to the POV Guide to move on to your next topic.
Prerequisite: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
It’s recommended, but not required, that you also complete the Separating Policy Definition from Role Definition: Dynamic Attributes walkthrough.
The use case for this walkthrough is fairly simple, but unachievable in most role-based access control models. Long story short, the only way to support AND boolean logic with a role-based model is by creating a new role that conflates the two or more roles you want to AND together.
Let’s take an example: we want users to only see certain data if they have completed security awareness training AND consumer privacy training. It would be natural to assume you need both, separately, as metadata attached to users to drive the policy. But a role-based model assumes roles are either OR’ed together in the policy logic or that you can only act under one role at a time, so you have to create a single role to represent the combination of requirements: “users with security awareness and consumer privacy training.” This is completely silly and unmanageable - you need to account for every possible combination relevant to a policy, and you have no way of knowing those combinations ahead of time.
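Here is a minimal sketch of the workaround a role-based system forces on you; the role names are hypothetical, and nothing here is something this walkthrough asks you to run.

-- Because roles cannot be AND'ed in policy logic, you must mint a role for the
-- combination itself, and for every other combination you might ever need:
CREATE ROLE training_security_awareness;
CREATE ROLE training_consumer_privacy;
CREATE ROLE training_security_awareness_and_consumer_privacy;  -- exists only to express the AND
-- Add a third requirement and the number of combination roles keeps multiplying.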
We go over the benefits of an attribute-based access control model over a role-based model ad nauseam in the Separating Policy Definition from Role Definition: Dynamic Attributes walkthrough. If you have not done that walkthrough yet and want more details, we recommend you do, especially if this walkthrough is of interest to you.
If you only have to manage 7 understandable attributes/roles vs 700 - wouldn’t you want to? That’s the real value here.
Scalability: Far fewer roles to manage.
Understandability: Policies (and roles) are clearly understood. No one super user is required to explain what is going on.
Evolvability: No fear of making changes, changes are made easily, again, without the need for super user tribal knowledge.
Durability: Changes in data and users will not result in data leaks.
Because of this, the business reaps:
Increased revenue: accelerate data access / time-to-data.
Decreased cost: operating efficiently at scale, agility at scale.
Decreased risk: prove policy easily, avoid policy errors.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta install):
GOVERNANCE: in order to build policy against any table in Immuta OR are a “Data Owner” of the registered tables. (You likely are the Data Owner and have GOVERNANCE permission.)
USER_ADMIN: in order to manage groups/attributes on users.
We need to have attributes or groups assigned to you to drive policy. With Immuta these can come from anywhere (we mean literally anywhere), and Immuta will aggregate them to use in policy. Most commonly they come from your identity manager, such as LDAP, Active Directory, Okta, etc., but for simplicity’s sake, we are going to assign attributes to you in Immuta.
Click the People icon and select Users in the left sidebar.
Select your name and click + Add Attributes.
In the Add Attributes modal, type Training Accomplished in the Attribute field and click Create.
In the Attribute value field, create these two values: Security Awareness and Consumer Privacy.
Repeat these steps for the non-admin user you created in Part 3 of the POV Data Setup. However, give that user
Attribute: Training Accomplished
Attribute value: Security Awareness
Notice that the second user does not have Consumer Privacy training.
Click the Policies icon in the left sidebar of the Immuta console.
On the Data Policies tab, click + Add Data Policy.
Name the policy: RLS and condition.
Select the action: Only show rows.
Select the sub-action: where.
Set the where clause as: salary < 200000.
Leave for everyone except, and change the exception to
possesses attribute
Training Accomplished
Security Awareness
Click + Add Another Condition.
Make sure the logic is and
Possesses attribute
Training Accomplished
Consumer Privacy
Click Add.
Under Where should this policy be applied?, select
On Data Sources
with column names spelled like
salary
Don’t select any modifiers. We did it this way because the salary column has no column tags we can use. (We could have added a tag to it, though, if we wanted.)
Click Create Policy and then Activate Policy. (Ignore any warnings about policy overlap.)
Following the Query Your Data guide, test that your user sees rows with salary above 200000 (because you have both trainings), while your non-admin user only sees rows with salary under 200000 (because they only have one of the two required trainings).
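Under the hood, the effect of the policy you just built is equivalent to a row filter like the sketch below. This is not Immuta’s generated SQL; has_attribute is a hypothetical placeholder for the attribute check Immuta performs against the querying user at runtime, and the table name should be qualified for your warehouse.

-- Everyone sees rows with salary < 200000; users who possess BOTH training
-- attributes are excepted from the filter and see every row.
SELECT *
FROM immuta_pov.immuta_fake_hr_data
WHERE salary < 200000
   OR ( has_attribute('Training Accomplished', 'Security Awareness')
        AND has_attribute('Training Accomplished', 'Consumer Privacy') );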
This is just another example of why role-based policy management can get you in trouble. This problem specifically leads to an industry phenomenon termed “role explosion”: roles must account for every possible combination of requirements for a policy, since that logic cannot be expressed as an AND condition.
Almost every database follows a role-based model, including legacy policy engines such as Apache Ranger. For example, with Snowflake you can only act under one role at a time, so all policy logic must take that into account; in Databricks you may have multiple groups that are all assumed, but the methods for defining policy do not allow AND logic against those groups. The same problem holds true for Ranger policy logic.
The answer is an attribute-based model where you can separate defining policy from defining user and data metadata, providing scalability and avoiding role explosion. We have seen this in real customer use cases. In one great example, 1 Immuta policy delivered the equivalent controls of 96 rules in Ranger.
For more reading on the RBAC anti-pattern: Data Governance Anti-Patterns: Stop Conflating Who, Why, and What
More reading on RBAC vs ABAC in general: Role-Based Access Control vs. Attribute-Based Access Control — Explained
And if you really want to go deep, NIST on ABAC: Guide to Attribute Based Access Control (ABAC) Definition and Considerations
Feel free to return to the POV Guide to move on to your next topic.
Prerequisite: Before using this walkthrough, please ensure that you’ve done Parts 1-3 in the POV Data Setup walkthrough.
Immuta considers itself a “live” metadata aggregator - aggregating not only metadata about your data but also about your users. Considering data specifically, being “live” means Immuta will monitor for schema changes in your database and reflect those changes in your Immuta instance. This allows you to register your databases with Immuta without having to worry about registering individual tables, today or in the future.
Additionally, when tables are discovered through the registration process, Immuta will inspect the table data for sensitive information and tag it as such. These tags are critical for scaling tag-based policies, which you will learn about in subsequent walkthroughs. This sensitive data discovery is done by inspecting samples of your data and using algorithms to decide what we believe the data contains. Those tags are editable, and you can also curate and add new custom tags.
It is also possible to add tags curated or discovered in other systems or catalogs. While this is not specifically covered in this walkthrough, it’s important to understand.
Both the monitoring for new data and the discovery and tagging of sensitive data align with the Scalability and Evolvability theme, removing redundant and arduous work. As users create new tables or columns in your database, those tables/columns will be automatically registered in Immuta and automatically tagged. Why does this matter? Because once they are registered and tagged, policies can immediately be applied - this means humans can be completely removed from the process by creating tag-based policies that dynamically attach themselves to new tables. (We’ll walk through tag-based policies shortly.)
Because of this, the business reaps:
Increased revenue: accelerate data access / time to data because where sensitive data lives is well understood.
Decreased cost: operating efficiently at scale, agility at scale.
Decreased risk: sensitive data discovered immediately.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
CREATE_DATA_SOURCE: in order to register the data with Immuta
GOVERNANCE: to create a custom tag in Immuta
We are going to create a custom tag to tag the data with. This will:
Help differentiate your real data from this fake POV data.
Help build global policies across these tables from multiple compute/warehouses, if you have more than one.
To create a custom tag,
Click the Governance icon in the left sidebar of the Immuta console.
Click on the Tags tab.
Click + Add Tags.
Name the tag Immuta POV. You can delete the nested tag placeholder in order to save.
Click Save.
Let’s walk through registration of a schema to monitor (You do not need GOVERNANCE permission to do this step, only CREATE_DATA_SOURCE):
From the Data Source page, click the + New Data Source button.
Data Platform: Choose the data platform of interest. This should align to where you loaded the data in the POV Data Setup walkthrough, but of course could be your own data as well. Note that if you are using the same Databricks workspace for Databricks and SQL Analytics, you only need to load it once.
Connection Information: This is the account Immuta will use to monitor your database and query the data metadata. This account should have read access to the data that you need to register. For simplicity, you may want to use the same account you used to load the data in the POV Data Setup walkthrough, but it’s best if you can use an Admin account for registering the data and a separate user account for querying it (which we’ll do later). It should also point to the data you loaded in the POV Data Setup walkthrough, which should be the immuta_pov database unless you named the database something else or placed the data somewhere else.
Virtual Population: There are several options here for how you want Immuta to monitor your database to automatically populate metadata. In our case we want to choose the first option: Create sources for all tables in this database and monitor for changes.
Basic Information: This section allows you to apply a convention to how the tables are named. If you have multiple data warehouses/compute and you’ve already registered these tables once and are registering them now from a 2nd (or more) warehouse/compute, you will have to change the naming convention for the Immuta data source name and schema project so you can tell them apart in the Immuta UI. This will NOT impact what they are named in the native database.
Advanced (Optional):
Note that Sensitive Data Discovery is enabled.
We are going to add that Immuta POV tag we created above by going to the last section “Data Source 1. Tags”
Click Edit.
Enter Immuta POV and click Add.
This will add that tag to any table that is discovered now or in the future.
You can leave the defaults for the rest.
Click Create to kick off the job.
Repeat these steps for each warehouse/compute you have (not to be confused with a Snowflake warehouse; we mean other data warehouses, like Redshift, Databricks SQL analytics, etc.).
You will be dumped into a screen that depicts the progress of your monitoring job. You’ll also see a gear spinning in the upper right corner of the screen which depicts the jobs that are running, one of those being the “fingerprint,” which is what is used to gather statistics about your tables and run the Sensitive Data Discovery.
Once the tables are registered and the gear stops spinning, click into the Immuta POV Immuta Fake Hr Data table. Once there, click on the Data Dictionary tab. In there you will see your columns as well as the Sensitive Data that was discovered. Also note that because we found a specific entity (such as Discovered.Entity.Person Name), we also tag that column with other derivative tags (such as Discovered.Identifier Indirect). This hierarchy will become important in the Hierarchical Tag-Based Policy Definitions walkthrough.
Also visit the Data Dictionary in the Immuta POV Immuta Fake Credit Card Transactions table. If you scroll to the bottom column, transaction_country, you’ll notice we incorrectly tagged it as Discovered.Entity.State - you can go ahead and remove that tag. Notice it is simply disabled, so that when monitoring runs again it will not be re-tagged with the incorrect Discovered.Entity.State tag.
One thing worth mentioning is that the table is completely protected after being discovered based on our default policy. We’ll learn more about this in subsequent sections.
This anti-pattern is pretty obvious - instead of automatically detecting schema/database changes you would have to manually manage that, and instead of automatically detecting sensitive data, you would also have to manually manage that.
It’s not just the manual time suck; it also complicates the process, because not only must you understand when a new table is present, you must then also remember to tag it and potentially protect it appropriately. This leaves you ripe for data leaks as new data is created across your organization, almost daily.
If you came to this walkthrough from the POV Data Setup, please make sure to complete the final Part 5 there!
Otherwise, feel free to return to the POV Guide to move on to your next topic.
We’ve provided some data that will allow you to complete the walkthroughs provided. Of course, you would use your own data with Immuta in Production, but since we are going to walk through very specific use cases, it’s easier to work off the same sheet of music, data-wise.
While this page is long, you will only need to worry about your specific data warehouses/compute.
Databricks Workspaces
A Databricks workspace (your Databricks URL) can be configured to use traditional notebooks or SQL Analytics. Select one of these options from the menu in the top left corner of the Databricks console.
Select one of the tabs below to download the script to generate fake data for your specific warehouse.
Databricks Notebooks (Data Science and Engineering or Machine Learning Notebooks)
Databricks SQL (SQL Workspace)
Snowflake
Redshift
Synapse
Download these resources:
Starburst (Trino)
Download these datasets:
This will get the data downloaded in the first step into your data warehouse.
Assumptions: Part 2 assumes you have a user with permission to create databases/schemas/tables in your warehouse/compute (and potentially write files to cloud storage).
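Whichever platform you use, the downloaded script or notebook is the source of truth, but they all do roughly the same thing. As a hedged sketch of the general shape only (exact names, column types, and loading mechanism vary by platform and by the script you downloaded):

-- Create a home for the POV data...
CREATE DATABASE immuta_pov;
CREATE SCHEMA pov_data;  -- where the platform uses a separate schema level
-- ...then create and populate the two fake tables used throughout the walkthroughs:
--   immuta_fake_hr_data
--   immuta_fake_credit_card_transactions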
Databricks and SQL Analytics Imports
If you’ve already done the import using SQL Analytics and SQL Analytics shares the same workspace with Databricks Notebooks, you will not have to do it again in Databricks because they share a metastore.
Before importing and running the notebook, ensure you are either logged in as Databricks admin or you are running it on a cluster that is NOT Immuta-enabled.
Import the Notebook downloaded from step 1 into Databricks.
Go to your workspace and click the down arrow next to your username.
Select import.
Import the file from step 1.
Run all cells in the Notebook, which will create both tables.
For simplicity, the data is being stored in DBFS; however, we do not recommend this in real deployments, and you should instead store your real data in your cloud-provided object store (S3, ADLS, Google Storage).
Databricks and SQL Analytics Imports
Note, if you’ve already done the import using Databricks and Databricks shares the same workspace with SQL Analytics, you will not have to do it again in SQL Analytics because they share a metastore.
Before importing and running the script, ensure you are logged in as a user who can create databases.
Select SQL from the upper left menu in Databricks.
Click Create → Query.
Copy the contents of the SQL script you downloaded from step 1 and paste that script into the SQL area.
Run the script.
For simplicity, the data is being stored in DBFS; however, we do not recommend this in real deployments, and you should instead store your real data in your cloud-provided object store (S3, ADLS, Google Storage).
Open up a worksheet in Snowflake using a user that has CREATE DATABASE and CREATE SCHEMA permission. Alternatively, you can save the data in a pre-existing database or schema by editing the provided SQL script.
To the right of your schema selection in the worksheet, click the ... menu to find the Load Script option.
Load the script downloaded from step 1.
Optional: Edit the database and schema if desired at the top of the script.
Check the All Queries button next to the Run button.
Ensure you have a warehouse selected, and then click the Run button to execute the script. (There should be 11 commands it plans to run.)
Both tables should be created and populated.
Redshift RA3 Instance Type
You must use a Redshift RA3 instance type because Immuta requires cross-database views, which are only supported in Redshift RA3 instance types.
Unfortunately there is not a standard query editor for Redshift, so creating the POV tables in Redshift is going to be a bit less automated.
Connect to your Redshift instance using your query editor of choice.
Create a new database called immuta_pov using the command CREATE DATABASE immuta_pov; optionally, you can connect to a pre-existing database and just load these tables in there.
After creating the database, you will need to disconnect from Redshift and reconnect to the new database you just created (if you created one).
Upload the script you downloaded from step 1 above. If your query editor does not support uploading a SQL script, you can simply open that file in a text editor to copy the commands and paste them in the editor.
Run the script.
Both tables should be created and populated.
Synapse Analytics Dedicated SQL Pools
Immuta supports Synapse Analytics dedicated SQL pools only.
Creating the data in Synapse is potentially a four-step process. First you will need to upload the data to a storage account, then create a Synapse Workspace (if you don’t have one in mind to use), then create a dedicated SQL pool, and then point Synapse to that stored data.
Log in to the Azure Portal.
Select or create a storage account.
If selecting an existing storage account and you already have a Synapse Workspace you plan to use, make sure the storage container(s) are attached to that Synapse workspace.
The selected or created storage account MUST have Data Lake Storage Gen2 Hierarchical namespace enabled. Note this has to be enabled at creation time and cannot be changed after creation.
The setting Enable hierarchical namespace is found under advanced settings when creating the storage account.
Click Containers.
Select or create a new container to store the data.
Upload both files from step 1 to the container by clicking the upload button.
Go to Azure Synapse Analytics (still logged in to Azure Portal).
Create a Synapse workspace.
Select a resource group.
Provide a workspace name.
Select a region.
For account name use the storage account used from the above steps, remembering it MUST have Data Lake Storage Gen2 Hierarchical namespace enabled.
Select the container you created in the above steps.
Make sure Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account to interactively query it in the workspace is checked.
Go to the security section.
Enter your administrator username/password (save these credentials).
Review and create.
Once the Synapse Workspace is created, there should be a Workspace web URL (this is the link to Synapse Studio) available on the overview page; go there.
With Synapse, a dedicated pool is essentially a database. So we want to create a database for this POV data.
On the Azure portal home page click Azure Synapse Analytics.
Click the Synapse workspace above and then click + New dedicated SQL pool (Immuta only works on Synapse dedicated pools).
Next, enter immuta_pov as the name of your dedicated SQL pool.
Choose an appropriate performance level. For testing purposes (unless you are specifically testing performance), it is recommended to use the lowest performance level to avoid high costs if the instance is left running.
Once that information is chosen, click Review + Create and then Create.
From Synapse Studio (this is the Workspace web URL you were given when the Synapse Workspace was completed) click the Data menu on the left.
Click on the Workspace tab.
Expand databases and you should see the dedicated pool you created above. Sometimes, even if the dedicated pool has been deployed, it takes time to see it in Synapse Studio. Wait some time and refresh the browser.
Once the dedicated pool is there, hover over it, and click the Actions button.
Select New SQL script.
Select Empty script.
Paste the contents of the script you downloaded in Part 1 into the script window.
Run it.
From that same Synapse Studio window, click the Integrate menu on the left.
Click the + button to Add a new resource.
Select Copy Data tool.
Leave it as Built-in copy task with Run once now, and then click Next >.
For Source type select Azure Data Lake Storage Gen 2.
For connection, you should see your workspace; select it.
For File or folder, click the browse button and select the container where you placed the data.
Dive into that container (double click the folder if in a folder) and select the immuta_fake_hr_data_tsv.txt file.
Uncheck recursive and click Next >.
For File format, select Text format.
For Column delimiter, leave the default, Tab (\t).
For Row delimiter, leave the default, Line feed (\n).
Leave First row is header checked.
Click Next >
For Target type, select Azure Synapse dedicated SQL pool.
For Connection, select the dedicated pool you created: immuta_pov.
Under Azure Data Lake Storage Gen2 file, click Use existing table.
Select the pov_data.immuta_fake_hr_data table.
Click Next >.
Leave all the defaults on the column mapping page, and then click Next >.
On the settings page, name the task hr_data_import.
Open the Advanced section.
Uncheck Allow PolyBase.
Uncheck the Edit button under Degree of copy parallelism.
Click Next >.
Review the Summary page and click Next >.
This should run the task and load the data; you can click Finish when it completes.
If you’d like, you can test that it worked by opening a new SQL Script from the Data menu and running: SELECT * FROM pov_data.immuta_fake_hr_data.
Repeat these steps for the immuta_fake_credit_card_transactions_tsv.txt file, loading it into the pov_data.immuta_fake_credit_card_transactions table.
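If you want to verify both loads, a couple of quick queries in a new SQL script will do it (assuming the pov_data schema and table names used above):

SELECT COUNT(*) FROM pov_data.immuta_fake_hr_data;
SELECT COUNT(*) FROM pov_data.immuta_fake_credit_card_transactions;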
Since Starburst (Trino) can connect to and query many different systems, it would be impossible for us to list instructions for every single one. To load these tables into Starburst (Trino), you should
Upload the data from Part 1 to whatever backs your Starburst (Trino) instance. If that is cloud object storage, this would mean loading the files downloaded from Part 1. If it’s a database, you may want to leverage some of the SQL scripts listed for the other databases in Part 1.
Follow the appropriate guide from here.
Create a database: immuta_pov for the schema/tables.
Create a schema: pov_data for the tables.
Load both tables into that schema:
immuta_fake_hr_data
immuta_fake_credit_card_transactions
Assumptions: Part 3 assumes your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
APPLICATION_ADMIN: in order to configure the integration
USER_ADMIN: in order to create a non-admin user
When the integration is configured, since it allows you to query the data directly from the data warehouse/compute, there needs to be a mapping between Immuta users and data warehouse/compute users. Typically in production this is accomplished through a shared identity manager like LDAP or Okta.
However, for the purposes of this POV, you may prefer to do a simple mapping instead, which we will describe. Taking this a step further, you need two different “levels” of users to really see the power of Immuta. For example, you want an “admin” user that has more permissions to create policies (and avoid policies) and a regular “non-admin” user to see the impacts of policies on queries - think of this as your regular downstream analyst.
It’s best to follow these rules of thumb:
When running through the below native configurations, it’s best to use a system account from your data warehouse/compute to configure them (when we ask for credentials in the configuration steps, we are not talking about the actual Immuta login here), although you can use your admin account.
You need some kind of Immuta admin account for registering data, building policies, etc.; in many cases this should be the user that initially stood up the Immuta instance, but it could be a different user as long as you give them all required permissions. This user should map to your data warehouse/compute admin user. We get more into segmentation of duties later in the walkthroughs.
You need a non-admin user. This may be more difficult if you have an external identity/SSO system where you can’t log in as a different user to your data warehouse/compute. But if possible, you should create a second user for yourself with no special permissions in Immuta and map that user to a user you can log in as on your data warehouse/compute of choice.
Understanding the rules of thumb above, you will need an admin and a non-admin user in Immuta, and those users need to map to users in your data warehouse/computes of choice. Typically, if the users in both places are identified with email addresses, this all “just works” - they are already mapped. However, if they do not match, you can manually configure the mapping. For example, if you want to map steve@immuta.com to the plain old “steve” username in Synapse, you can do that by following the steps below. Again, this is not necessary if your users in Immuta match (in spelling) your users in your data warehouse/compute (typically by email address).
Click the People icon and select Users in the left sidebar.
Select the user.
Next to the username on the left, click the more actions menu.
Here you will see the following options: Change Databricks username, change Snowflake username, etc.
Select which data warehouse/compute username you want to map.
Enter the data warehouse/compute username that maps to that Immuta user.
Click Save.
For Immuta to enforce controls, you must enable what we call integrations. This is done slightly differently for each database/warehouse/compute, and how it works is explained in more detail in our Query Your Data Guide. For now, let’s just get the integrations of interest configured.
Log in to Immuta.
Click the App Settings icon in the left sidebar (the wrench).
Under the Configuration menu on the left, click System API Key under HDFS.
Click Generate Key.
Click Save in the bottom left of the Configuration screen, and Confirm when prompted.
Under the Configuration menu on the left, click Integration Settings.
Click the + Add Native Integration button.
Select Databricks Integration.
Enter the Databricks hostname.
For Immuta IAM, there should only be one option, bim. Select it. This is the built-in Immuta identity manager. It’s likely you would hook up a different identity manager in production, like Okta or LDAP, but this is sufficient for POV testing.
Access Model: This one is a pretty big decision; read the descriptions for each, as eventually you will need to decide which mode to use. For the purposes of this POV guide, we assume the default: Protected until made available by policy.
Select the appropriate Storage Access Type.
Enter the required authentication information based on which Storage Access Type you select.
No Additional Hadoop Configuration is required.
Click Add Native Integration.
This will pop up a message stating that Your Databricks integration will not work properly until your cluster policies are configured. Clicking this button will allow you to select the cluster policies that are deployed to your Databricks instance. We encourage you to read that table closely including the detailed notes linked in it to decide which cluster policies to use.
Once you select the cluster policies you want deployed, click Download Policies. This will allow you to either:
Automatically Push Cluster Policies to the Databricks cluster if you provide your Databricks admin token (Immuta will not store it), or
Manually Push Cluster Policies yourself without providing your Databricks admin token; you decide.
Please also Download the Benchmarking Suite; you will use that later in the Databricks Performance Test walkthrough.
If you choose to Manually Push Cluster Policies you will have to also Download Init Script.
Click Download Policies or Apply Policies depending on which option you selected.
Once adding the integration is successful, Click Save in the bottom left of the Configuration screen (also Confirm when warned). This may take a little while to run.
If you took the manual approach, you must deploy those cluster policies manually in Databricks. You should configure Immuta-enabled cluster(s) using the deployed cluster policies.
Congratulations, you have successfully configured the Immuta integration with Databricks. To leverage it, you will need to use a cluster configured with one of the cluster policies created through the above steps.
Log in to Immuta.
Click the App Settings icon in the left sidebar (the wrench).
Click the Integrations tab.
Click the + Add Native Integration button.
Select Databricks SQL analytics.
Enter the Databricks SQL analytics host.
Enter the Databricks SQL analytics port.
Enter the HTTP Path of the SQL Endpoint that will be used to execute DDL to create views. This is not to be confused with an HTTP Path to a regular Databricks cluster!
Enter the Immuta Database: immuta_pov_secure. This is the database name where all the secure views Immuta creates will be stored. (You’ll learn more about this in the Query Your Data Guide.) That is why we named it immuta_pov_secure (since the original data is in the immuta_pov database), but, remember, it could contain data from multiple different databases if desired, so in production you likely want to name this database something more generic.
Enter any additional required Connection String Options.
Enter your Personal Access Token. Immuta needs to connect to Databricks to create the integration database, configure the necessary procedures and functions, and maintain state between Databricks and Immuta. The Personal Access Token provided here should not have a set expiration and should be tied to an account with the privileges necessary to perform the operations listed above (e.g., an Admin User).
Make sure the SQL Endpoint is running (if you don't, you may get a timeout waiting for it to start when you test the connection), and then click Test Databricks SQL Connection.
Once the connection is successful, Click Save in the bottom left of the Configuration screen. This may take a little while to run.
Congratulations, you have successfully configured the Immuta integration with Databricks SQL. Be aware, you can configure multiple Databricks SQL workspaces to Immuta.
Log in to Immuta.
Click the App Settings icon in the left sidebar (the wrench).
Click the Integrations tab.
Click the + Add Native Integration button.
Select Snowflake.
Enter the Snowflake host.
Enter the Snowflake port.
Enter the default warehouse. This is the warehouse Immuta uses to compute views, so it does not need to be very big; XS is fine.
Enter the Immuta Database: IMMUTA_POV_SECURE. This is the database name where all the secure schemas and views Immuta creates will be stored (you’ll learn more about this in the Query Your Data Guide). That is why we named it IMMUTA_POV_SECURE (since the original data is in the IMMUTA_POV database), but, remember, it could contain data from multiple different databases if desired, so in production you likely want to name this database something more generic.
For Additional Connection String Options, you may need to specify something here related to proxies depending on how your Snowflake is set up.
You now need to decide if you want to do an automated installation or not; Immuta can automatically install the necessary procedures, functions, and system accounts into your Snowflake account if you provide privileged credentials (described in next step). These credentials will not be stored or saved by Immuta. However, if you do not feel comfortable with providing these credentials, you can manually run the provided bootstrap script.
Select Automatic or Manual depending on your decision above.
Automatic:
Enter the username (when performing an automated installation, the credentials provided must have the ability to both CREATE databases and CREATE, GRANT, REVOKE, and DELETE roles.)
Enter the password.
You can use a key pair if required.
For role, considering this user must be able to both CREATE databases and CREATE, GRANT, REVOKE, and DELETE roles, make sure you enter the appropriate role.
Click Test Snowflake Connection.
Manual:
Download the bootstrap script.
Enter a NEW user; this is the account that will be created, and the bootstrap script will populate it with the appropriate permissions.
Please feel free to inspect the bootstrap script for more details.
Enter a password for that NEW user.
You can use a key pair if required.
Click Test Snowflake Connection.
Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
Run the bootstrap script in Snowflake.
Congratulations, you have successfully configured the Immuta integration with Snowflake. Be aware, you can configure multiple Snowflake instances to Immuta.
Redshift RA3 Instance Type
You must use a Redshift RA3 instance type because Immuta requires cross-database views, which are only supported in Redshift RA3 instance types.
Log in to Immuta.
Click the App Settings icon in the left sidebar (the wrench).
Click the Integrations tab.
Click the + Add Native Integration button and select Redshift.
Enter the Redshift host.
Enter the Redshift port.
Enter the Immuta Database: immuta_pov_secure. This is the database name where all the secure schemas and views Immuta creates will be stored (you’ll learn more about this in the Query Your Data Guide). That is why we named it immuta_pov_secure (since the original data is in the immuta_pov database), but, remember, it could contain data from multiple different databases if desired, so in production you likely want to name this database something more generic.
You now need to decide if you want to do an automated install or not. Immuta can automatically install the necessary procedures, functions, and system accounts into your Redshift account if you provide privileged credentials. These credentials will not be stored or saved by Immuta. However, if you do not feel comfortable with providing these credentials, you can manually run the provided bootstrap script; in that case, ensure you enter the username and password that were set in the bootstrap script.
Select Automatic or Manual depending on your decision above.
Automatic:
Enter the initial database. This should be a database that already exists; it doesn’t really matter which. Immuta simply needs this because you must include a database when connecting to Redshift.
Enter the username (this must be a user that can create databases, users, and modify grants).
Enter the password.
Manual:
Download the bootstrap script.
Enter a NEW user, this is the account that will be created, and then the bootstrap script will populate it with the appropriate permissions.
Please feel free to inspect the bootstrap script for more details.
Enter a password for that NEW user.
Click Test Redshift Connection.
Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
Run the bootstrap scripts in Redshift.
Congratulations, you have successfully configured the Immuta integration with Redshift. Be aware, you can configure multiple Redshift instances to Immuta.
Log in to Immuta.
Click the App Settings icon in the left sidebar (the wrench).
Click the Integrations tab.
Click the + Add Native Integration button and select Azure Synapse Analytics.
Enter the Synapse Analytics host (this should come from the SQL dedicated pool).
Enter the Synapse Analytics port.
Enter the Immuta Database. This should be a database that already exists; this is where Immuta will create the schemas that contain the secure views that will be generated. In our case, that should be immuta_pov.
Enter the Immuta Schema: pov_data_secure. This is the schema name where all the secure views Immuta creates will be stored (you’ll learn more about this in the Query Your Data Guide). That is why we named it pov_data_secure (since the original data is in the pov_data schema), but, remember, it could contain data from multiple different schemas if desired, so in production you likely want to name this schema something more generic.
Add any additional connection string options.
Since Synapse does not support array/JSON primitives, Immuta must store user attribute information using delimiters. If you expect any of these delimiter characters to appear in user profiles, please update them accordingly. (It’s likely you don’t.)
You now need to decide if you want to do an automated installation or not. Immuta can automatically install the necessary procedures, functions, and system accounts into your Azure Synapse Analytics account if you provide privileged credentials. These credentials will not be stored or saved by Immuta. However, if you do not feel comfortable providing these credentials, you can manually run the provided bootstrap script (initial database) and bootstrap script.
Select Automatic or Manual depending on your decision above.
Automatic:
Enter the username. (We recommend using the system account you created associated to the workspace.)
Enter the password.
Manual:
Download both bootstrap scripts.
Enter a NEW user; this is the account that will be created, and then the bootstrap script will populate it with the appropriate permissions.
Please feel free to inspect the bootstrap scripts for more details.
Enter a password for that NEW user.
Click Test Azure Synapse Analytics Connection
Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
Run the bootstrap scripts in Synapse Analytics.
Congratulations, you have successfully configured the Immuta integration with Azure Synapse Analytics. Be aware, you can configure multiple Azure Synapse Analytics instances to Immuta.
Log in to Immuta.
Click the App Settings icon in the left sidebar (the wrench).
Click the Integrations tab.
Click the + Add Native Integration button and select Trino.
To connect a Trino cluster to Immuta, you must install the Immuta Plugin and register an Immuta Catalog. For Starburst clusters, the plugin comes installed, so you just need to register an Immuta Catalog. The catalog configuration needed to connect to this instance of Immuta is displayed in this section of the App Settings page. Copy that information to configure the plugin.
If you want Starburst (Trino) queries to be audited, you must configure an Immuta Audit Event Listener. The configuration needed for the event listener is displayed below. The catalog name should match the name of the catalog associated with the Immuta connector configuration displayed above. Copy that information to configure the Audit Event Listener.
Click Save in the bottom left of the Configuration screen. This may take a little while to run.
Go to Starburst (Trino) and configure the plugin and audit listener.
Congratulations, you have successfully configured the Immuta integration with Starburst (Trino). Be aware that you can connect multiple Starburst (Trino) instances to Immuta; however, configuring a combination of Trino and Starburst clusters in a single Immuta tenant is not supported.
Assumptions: Part 4 assumes your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
CREATE_DATA_SOURCE: in order to register the data with Immuta
GOVERNANCE: in order to create a custom tag to tag the data tables with
These steps are captured in our first walkthrough under the Scalability & Evolvability theme: Schema monitoring and automatic sensitive data discovery. Please do that walkthrough to register the data to complete your data setup. Make sure you come back here to complete Part 5 below after doing this!
Assumptions: Part 5 assumes your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta install):
GOVERNANCE: in order to build policy against any table in Immuta OR
“Data Owner” of the registered tables from Part 4 without GOVERNANCE permission. (You likely are the Data Owner and have GOVERNANCE permission.)
Only do this part if you created the non-admin user in Part 3 (it is highly recommended you do that). If you did, you must give them access to the data as well. Immuta has what are called subscription policies; these control access to tables, and you can think of them as table GRANTs.
To get things going, let’s simply open those tables you created to anyone:
Click the Policy icon in the left sidebar of the Immuta console.
Click the Subscription Policies tab at the top.
Click + Add Subscription Policy.
Name the policy Open Up POV Data.
For How should this policy grant access? select Allow Anyone.
For Where should this policy be applied?, select On Data Sources.
Select tagged for the circumstance (make sure you pick “tagged” and not “with columns tagged”).
Type in Immuta POV for the tag. (Remember, this was the tag you created in the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough under Part 4 above). Note that if you are a Data Owner of the tables without GOVERNANCE permission the policy will be automatically limited to the tables you own.
Click Create Policy → Activate Policy
That will allow anyone access to those tables you created. We’ll come back to subscription policies later to learn more.
Return to the POV Guide to move on to your next topic.
Prerequisite: Immuta is installed.
Welcome to your Immuta Proof of Value (POV)!
This POV guide has been built to help you get the most out of your Immuta experience in the shortest amount of time possible. That’s not to say you should limit your testing to these themes; we encourage you to augment this guide with your own specific use cases, which is where our power really shines. But we also recommend you tackle the topics in this guide that interest you to set the foundation for your own use cases.
If the vision of why you need Immuta doesn’t align with our vision of Immuta - maybe we shouldn’t waste your time, right? So let us tell you our vision for Immuta.
Our vision: To enable the management and delivery of trusted data products at scale.
What does that mean? We allow you to apply engineering principles to how you manage data. This will give your team the agility to lower time-to-data across your organization while meeting your stringent and granular compliance requirements. Immuta allows massively scalable, evolvable (yet understandable) automation around data policies; creates stability and repeatability around how those policies are maintained, in a way that allows distributed stewardship across your organization, but provides consistency of enforcement across your data ecosystem no matter your compute/data warehouse; and fosters more availability of data through the use of highly granular data controls (it’s easier to slice a pizza with a knife than a snow shovel) with little performance overhead.
This guide is broken into seven themes of Immuta:
Each of these is a foundational concept for applying engineering principles to data management. Each section will provide a quick overview of the theme. After reading that overview, if it is of interest to you, there are walkthroughs of specific Immuta features aligned to those themes, tied to data tables we will help you populate as part of this POV. You are of course welcome to try the concepts against your own data, and you can skip sections that are less important or relevant to you.
Each walkthrough also contains an “anti-pattern” section. Anti-patterns are design decisions that seem smart on the surface but actually cause unforeseen issues and paint you into a corner in the long run. These anti-patterns have been experienced by our Immuta team over the years working in highly complex environments in the US Intelligence Community, where anti-patterns, and the problems associated with them, could mean loss of lives. They are why we built Immuta in the first place. We find these anti-patterns across customers and open source projects and call them out specifically so you can effectively compare and contrast Immuta with other solutions and designs you may have or are evaluating.
By the end of this guide, you should have a strong understanding of the value proposition of Immuta as well as the features and functions to more quickly apply it to your real internal use cases, which we highly recommend you do as part of the POV. Note that this POV guide assumes you already have Immuta installed and running.
Do you find yourself spending too much time managing roles and defining permissions in your system? When there are new requests for data, or a policy change, does this cause you to spend an inordinate amount of time to make those changes? Scalability and evolvability will completely remove this burden. When you have a scalable and evolvable data management system, it allows you to make changes that impact hundreds if not thousands of tables at once, accurately. It also allows you to evolve your policies over time with minor changes or no changes at all, through future-proof policy logic.
In a scalable solution such as Immuta, the number of policy changes required remains extremely low, providing both scalability and evolvability. GigaOm researched exactly this, comparing Immuta’s ABAC model to what they called Ranger’s RBAC with Object Tagging (OT-RBAC) model, and showed a 75x increase in policy management effort with Ranger.
Value to you: You have more time to spend on the complex tasks you should be spending time on and you don’t fear making a policy change.
Value to the business: Policies can be easily enforced and evolved, allowing the business to be more agile and decrease time-to-data across your organization and avoid errors.
If you can’t prove you are applying controls correctly, does it even count? In other words, how can you prove to your legal team, compliance team, Chief Data Officer, and so on that you’ve translated regulation, typically in written form, into code-based policy logic correctly?
Obviously the prior section, scalability and evolvability, helps solve this problem, because it reduces the number of policies that must be defined (and reviewed). However, if you define policy buried in SQL and complex interfaces that only you can understand and prove, including the history of change, you have a trust-but-cannot-verify environment that is ripe for error. In addition to scalability and evolvability removing complexity, Immuta’s platform can also present policy in a natural language form, easily understood, along with an audit history of change, to create a trust-and-verify environment.
Value to you: You can easily prove policy is being implemented correctly to business leaders concerned with compliance and risk.
Value to the business: Ability to meet any audit obligations to external parties and/or to your customers.
Up until now we’ve shown you how to build scalable, evolvable, and understandable policy through the Immuta user interface. However, to get stability and repeatability, you as engineers want and need to apply software engineering principles to how you manage policy. This is the automation around Immuta just like you need automation around your infrastructure and data pipelines.
Immuta was built with the “as-code” movement in mind, allowing you to, if desired, treat Immuta as ephemeral and represent state in source control.
Value to you: You can merge data policy management into your existing engineering paradigms and toolchains, allowing full automation of every component of Immuta.
Value to the business: Reduce time-to-data across the organization because policy management is stable and your time is being spent on more complex initiatives.
Heard of data mesh? As first defined by Zhamak Dehghani, “a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design.” You may have a data mesh and not even know it, for example, jurisdictions with strong data protections generally exert extraterritorial controls that prevent consumer or citizen data from being accessed or processed in other jurisdictions that do not afford comparable data protections and controls. What it means in practice for data management is that you need distributed stewardship for your data domains (physical and/or logical) across your organization. Put more concretely, you can’t have single “god” administrators that control everything from an access management perspective.
Immuta enables fine-grained data ownership and controls over organizational domains, allowing a data mesh environment for sharing data - embracing the ubiquity of your organization.
Value to you: You can enable different parts of your organization to manage their data policies in a self-serve manner without involving you in every step.
Value to the business: Make data available across the organization without the need to centralize both the data and authority over the data. This will free your organization to share more data more quickly than ever before.
This section is only relevant if you are using more than one data warehouse / compute, for example, Databricks and Snowflake. Just like the big data era required the separation of compute from storage to scale, the “data compliance era” requires the separation of policy from compute to scale. This was made evident in the scalability and evolvability section, but a few more critical details are worth covering.
Legacy solutions, such as Apache Ranger, can only substantiate the abstraction of policy from compute in the Hadoop ecosystem. This is due to inconsistencies in how Ranger enforcement has been implemented in the other downstream compute/warehouse engines. That inconsistency arises not only from ensuring row, column, and anonymization techniques work the same in Databricks as they do in Snowflake, for example, but also from the need for additional roles to be created and managed in each system separately and inconsistently from the policy definitions. With Immuta, you have complete consistency without forcing new roles to be created into each individual warehouse’s paradigm.
With inconsistency comes complexity, both for your team and for the downstream analysts trying to read data (for example, having to know which role to assume). That complexity from inconsistency removes all value of separating policy from compute. With Immuta, you are provided complete consistency.
Value to you: You can build policy once, in a single location, and have it enforced scalably and consistently across all your data warehouses. This is the foundational piece to all sections above.
Value to the business: None of the other sections’ business values are possible without this foundational piece.
As an engineer, this is probably not what you think of when you hear availability; we are not talking about your data warehouse’s availability. In this case we mean availability of data - as much of it as possible.
When most people think of data access control, they see it as a blocker, a brake on a car, if you will. In fact, with fine-grained access controls and advanced anonymization techniques, it’s the opposite. For example, if the only trick up your sleeve is to grant or deny access to a table, then that’s it: you need to decide whether the user can see the whole table or not. This leads to over-hiding or over-sharing. But it goes deeper: even if you can grant/deny at the column level, that is still a binary decision on whether the user should see that column or not. Instead of a binary decision, anonymization techniques can be applied to columns to “fuzz” data just enough to provide the utility required while at the same time meeting rigorous privacy and control requirements.
Availability of these highly granular decisions at the access control level is the car accelerator, not the brake: we find organizations can increase data access by over 50% in some cases when using Immuta.
Value to you: You are no longer caught in the middle between compliance and analysts. You can allow analysts access to more data than ever before while keeping compliance happy.
Value to the business: More data than ever at the fingertips of your analysts and data scientists (we’ve seen examples of up to 50% more). Your business can thrive on being data driven.
Last but not least, performance. What good are all the benefits provided by Immuta if it completely slows down all query workloads? To be clear, with increased security there is some decrease in performance. Immuta gives you the flexibility to decide how much security is appropriate for your use case given the overhead associated with that security.
Performance is tied to how Immuta implements policy enforcement. Rather than requiring a copy of data to be created, Immuta enforces policy live and this is done differently based on the warehouse/compute in question. Understanding the Immuta enforcement mechanisms will allow you to more effectively understand and evaluate Immuta performance.
Plugin: This enforcement is done by Immuta slightly altering the query natively in the database (with SQL). The overhead associated with this lies in checking the policy decision (which is cached) and any logic core to the policy that is injected into the plan. Compute/Warehouses: Databricks, Starburst (Trino).
Policy Push: This enforcement is done by Immuta creating a single view on top of the original table and baking all policy logic into that view. In this case, the view only changes when there is a policy change, so policy decision check overhead is eliminated and any overhead is associated with the logic of the policy in the view. Compute/Warehouses: Snowflake, Synapse, Redshift, Databricks SQL.
Prerequisite: Before using this walkthrough, please ensure that you’ve first done Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
Let’s draw an analogy. Imagine you are planning your wedding reception. It’s a rather posh affair, so you have a bouncer checking people at the door.
Do you tell your bouncer who’s allowed in? (exception-based) Or, do you tell the bouncer who to keep out? (rejection-based)
The answer to that question should be obvious, but many policy engines allow both exception- and rejection-based policy authoring, which causes a conflict nightmare. Ignoring that anti-pattern for a moment (which we’ll cover in the Anti-Pattern section), exception-based policy authoring in our wedding analogy means the bouncer has a list of who should be let into the reception. This will always be a shorter list of users/roles if following the principle of least privilege, which is the idea that any user, program, or process should have only the bare minimum privileges necessary to perform its function - you can’t go to the wedding unless invited. This aligns with the concept of Privacy by Design, the foundation of the CPRA and GDPR, which states “Privacy as the default setting.”
What this means in practice is that you should define what should be hidden from everyone, and then slowly peel back exceptions as needed.
Using an exception-based approach is a security standard across the board; this is because it’s a scalable approach that avoids costly data leaks and allows the business to move quickly. The “how” will be discussed in more detail in the section.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data.
Decreased cost: operating efficiently at scale, agility at scale by building exceptions to agreed-upon foundational policies.
Decreased risk: avoid data leaks by not conflating conflicting exception- and rejection-based policies.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy against any table in Immuta OR are a “Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
USER_ADMIN: in order to manage groups/attributes on users.
We need to have attributes or groups assigned to you to drive policy. With Immuta these can come from anywhere (we mean literally anywhere), and Immuta will aggregate them to use in policy. Most commonly these come from your identity manager, such as LDAP, Active Directory, Okta, etc., but for simplicity sake, we are going to assign attributes to you in Immuta.
Click the People icon and select Users in the left sidebar.
Select your name and click + Add Attributes.
In the Add Attributes modal, enter Department in the Attribute field.
Enter HR for Attribute value field.
Repeat these steps for the non-admin user you created in Part 3 of the POV Data Setup. However, give that user the Attribute Department with the Attribute Value Analytics (instead of HR).
In Immuta, visit the Fake HR Data data source (from any warehouse/compute).
Go to the Data Dictionary tab and find where you have the Discovered.Entity.Person Name tags. Let’s build a policy against that tag that includes an exception.
Click the Policies icon in the left sidebar.
Click + Add New Data Policy.
Name it Mask Person Name.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Entity.Person Name.
Change masking type to using a constant.
Type in the constant REDACTED.
Leave for everyone except and change the exception to possesses attribute Department HR.
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess correctly based on the previous steps.)
You can further refine where this policy is applied by adding another circumstance:
Click + Add Another Circumstance.
Change the or to an and.
Select tagged for the circumstance. (Make sure you pick “tagged” and not “with columns tagged.")
Click Create Policy and then Activate Policy.
Let’s make this a little more complex. Let’s say that we want people in Department HR to see hashed names, but everyone else to see REDACTED. To do this, let’s update the policy:
From the Policies page, click the menu button on the Mask Person Name policy we just created and click Edit.
Click the three dot button on the actual policy definition and then select Edit. (Note you edit that separately because you can have multiple policy definitions in the single policy.)
Change everyone except to everyone who.
Change using a constant to using hashing in the first policy.
Click Update.
Click Save Policy.
A key point to realize here is that when you did “everyone who” you were actually building a rejection-based policy, but to ensure there was no data leak, Immuta forced you to also have that catch-all OTHERWISE statement at the end, similar to an else branch in code. This retains the exception-based concept to avoid a data leak.
How could your data leak if it wasn’t exception based?
What if you did two policies:
Mask Person Name using hashing for everyone who possesses attribute Department HR.
Mask Person Name using constant REDACTED for everyone who possesses attribute Department Analytics.
Now, some user comes along who is in Department Finance - guess what, they will see the Person Name columns in the clear because they were not accounted for, just like the bouncer would let them into your wedding because you didn’t think ahead of time to add them to your deny list.
Again, fairly obvious: rejection-based policies are the Anti-Pattern and are completely contradictory to the industry standard of least privilege access; yet, for some reason, tools like Ranger rely on them and send users tumbling into this trap.
There are two main issues:
Ripe for data leaks: Rejection-based policies are extremely dangerous and why Immuta does not allow them except with a catch-all OTHERWISE statement at the end, which you walked through. Again this is because if a new role/attribute comes along that you haven’t accounted for, that data will be leaked. It is impossible for you to anticipate every possible user/attribute/group that could possibly exist ahead of time just like it’s impossible for you to anticipate any person off the street that could try to enter your posh wedding that you would have to account for on your deny list.
Ripe for conflicts and confusion: Tools that allow both rejection-based and exception-based policy building create a conflict disaster. Let’s walk through a simple example, noting this is very simple - imagine if you had hundreds of these policies:
Policy 1: mask name for everyone who is member of group A
Policy 2: mask name for everyone except members of group B
What happens if someone is in both groups A and B? We have to fall back on policy ordering to avoid this conflict, which requires users to understand all other policies before building their own, and it becomes nearly impossible to understand what a single policy does without looking at all policies.
Prerequisites: Before using this walkthrough, please ensure that you’ve first done Parts 1-5 of the POV Data Setup, the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough, and at least one of the following:
Understandability of policy, as discussed in the previous walkthrough, is critically important to create a prove-and-verify environment. This should be further augmented by change history around policy and the ability to monitor and attribute change.
Immuta provides this capability through our extensive audit logs and takes it a step further by providing history views and diffs in the user interface.
Once you have created a trust-and-verify environment WITH full auditability, all stakeholders can rest easy and change can be monitored.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data because the legal and compliance teams trust that data is being protected correctly because they can verify that is the case.
Decreased risk: Changes are obvious to all and can be reacted to quickly.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to view policy audit OR
“Data Owner” of the registered tables. (You likely are the Data Owner and have GOVERNANCE permission.)
First, let's examine a Global Policy.
Log in to Immuta.
Click the Audit icon in the left sidebar.
In the facets section on the left, expand the time bar to full history.
Under Record Type, click the Global Policy Applied checkbox.
This will list all Global Policies that have been applied; click on one to inspect it.
Now let’s leave the audit history and go to an actual table in the UI to see its specific history.
Click the Data Sources icon in the left sidebar.
Click into any of your data sources (where you’ve applied policy).
Click the Policies tab.
On the right, there is an Activity menu; if it is not expanded, expand it.
Examine it. Depending on how many policies you’ve applied, it will show the running history.
Lastly, let’s take a look at all activity in Immuta and examine a policy “diff."
Click the Governance icon in the left sidebar.
Click the Notifications tab at the top of the page.
Scroll through the notifications until you see one that starts with something like The following global policy has been applied/updated on… This is a global policy applied event.
Click on the green Governance icon on the left of that row to View Details.
This will provide a GitHub-like diff pop-up that shows the updated policy compared to the prior policy. (The prior policy is likely empty because we created policies from scratch in these walkthroughs.)
Note that all notifications can be grabbed as webhooks, so you can take Immuta notifications and plug them into something like Slack, if desired.
The anti-pattern is to build policy based on tasking an engineer in an ad-hoc manner. When this occurs, there is no history of the change, nor is it possible to see the difference between the old and new policies. That makes it impossible to take a historical look at change and understand where an issue may have arisen. If you have a standardized platform for making policy changes, then you are able to understand and inspect those changes over time.
Prerequisites: Before using this walkthrough, please ensure that you’ve first done Parts 1-5 of the POV Data Setup, the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough, and at least one of the following:
This is a pretty simple one: if you can’t show your work, you are in a situation of trust with no way to verify. Writing code to enforce policy (Snowflake, Databricks, etc.) or building complex policies in Ranger does show your work to a certain extent - but not enough for outsiders to easily understand the policy goals and verify their accuracy, and certainly not enough for the non-engineering teams that care that policy enforcement is done correctly.
With Immuta, policy is represented in natural language that is easily understood by all. This allows non-engineering users to verify that policy has been written correctly. Remember also that when using global policies they leverage tags rather than physical table/column names, which further enhances understandability.
Lastly, and as covered in the Scalability theme, with Immuta you are able to build far fewer policies (upwards of 75x fewer), which in itself provides an enormous amount of understandability.
Certainly this does not mean you have to build every policy through our UI - Data Engineers can build automation through our API (covered in the next theme), if desired, and those policies are presented in a human readable form to the non-engineering teams that need to understand how policy is being enforced.
Once you have created a trusted and verified environment, through centralized policy management, all stakeholders can rest easy and mistakes can be caught quickly.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data because the legal and compliance teams trust that data is being protected correctly because they can verify that is the case.
Decreased risk: Mistakes will not linger hidden beneath complex code, the spirit of how your organization interprets law and policy can be easily verified.
Assumptions: Your user does not need any special permissions.
Log in to Immuta with any user.
Click the Policies icon in the left sidebar.
Choose a Data policy to expand and read. You understand them; anyone can!
This is a picture one of our customers created that depicts the logic:
The anti-pattern is that the way you build policy is so technical and/or complex, you have no way to allow non-technical leadership to validate your work. This leaves the Data Engineering team struggling to prove they’ve done their job and creates distrust that policy enforcement is happening correctly, which creates a domino effect of involving more humans to manually approve access, completely halting time-to-data.
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
While the Immuta user interface is a powerful tool to demonstrate to legal, compliance, and other leadership what and how policy is being enforced in a human-consumable manner, data engineering teams do not want to spend time clicking buttons in a user interface.
Instead, data engineering teams want policy enforcement to fit cleanly into their existing automation workflows. Immuta enables this through what we term policy-as-code. This is very similar to the infrastructure-as-code concepts you may be familiar with. With Immuta policy-as-code, customers can construct declarative files that represent Immuta state, store those in a Git repo where changes can be managed through pull requests, and push those files to Immuta through the Immuta CLI.
This feature allows data engineering teams to build automation around all they do with policy while at the same time proving compliance through the Immuta user interface (without having to actually touch the user interface to manage policy).
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data, all processes around policy can be automated.
Decreased cost: operating efficiently at scale, automation, automation, and more automation!
Decreased risk: reduces humans-in-the-loop through code automation which decreases errors.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta install):
GOVERNANCE: in order to build policy on any table.
First let’s install the CLI:
Select the folder of the current release and download the binary zip file corresponding to your system architecture.
Unzip the file and add the binary to a directory in your system's $PATH.
Now let’s configure the CLI:
First create an API key in Immuta.
Log in to Immuta with your user with GOVERNANCE permission.
Click on your profile icon in the upper right corner. (It should be the first letter of your first name.)
Click Profile.
Click the API keys tab.
Click Generate Key.
Select Current Project (none) for your project.
Name the key CLI.
Copy the API key and save it for the next step.
Open a terminal and run immuta configure.
Enter the URL of your Immuta instance in the interactive prompt.
Enter your Immuta API Key from the previous step in the interactive prompt.
You can see your configuration by examining the file saved at ~/.immutacfg.yaml.
For full details on how to use the Immuta CLI/API please refer to the Immuta documentation, but here’s a simple walkthrough:
Since we already have an Immuta instance set up, we can use the immuta clone command to save all your data sources, projects, policies, and purposes as payloads. Options you can specify to get more information about this command include -h or --help:
From the terminal, run immuta clone <outputDirPath>
You can look in that directory and see there are four folders: data, policy, project, purpose.
Go look at all the files in the policy folder.
Note: If you named that policy something different in the walkthrough, the file name will be that instead of RLS-Walkthrough.
Open the RLS-Walkthrough.yaml file in your editor of choice.
Edit the file to change staged: false to staged: true.
Save the file.
Now let’s push that change to your Immuta instance:
From the terminal, run immuta policy save <outputDirPath/policy/RLS-Walkthrough.yaml>.
Now go back to the Immuta UI.
Click the Policies icon in the left sidebar.
Look at the RLS Walkthrough policy; it should now be staged.
Now, rather than changing that policy back to active in the UI, do it through the yaml file to ensure that those files remain your single source of truth:
Open the RLS-Walkthrough.yaml file in your editor of choice.
Edit the file to change staged: true to staged: false.
Save the file.
From the terminal, run immuta policy save <outputDirPath/policy/RLS-Walkthrough.yaml>.
It’s recommended that you source control these files and push when necessary through the Immuta CLI using automated CI/CD workflows.
Anti-pattern: “Existing cybersecurity architectures and operating models break down as companies adopt public-cloud platforms. Why? Almost all breaches in the cloud stem from misconfiguration, rather than from attacks that compromise the underlying cloud infrastructure.”
Solution: “Security as Code (SaC) requires highly automated services that developers can consume via APIs. This, in turn, requires behavioral changes in security, infrastructure, and application-development teams. The security organization must move from a reactive, request-based model to one in which they engineer highly automated security products.”
Benefits:
"The first benefit of SaC is speed. To fully realize the business benefits of the cloud, security teams must move at a pace they are unaccustomed to in on-premises environments. Manual intervention introduces friction that slows down development and erodes the cloud’s overall value proposition to the business."
"The second benefit is risk reduction. On-premises security controls simply don’t account for the nuances of the cloud. Cloud security requires controls to move with a workload throughout its entire life cycle. The only way to achieve this level of embedded security is through SaC."
"Finally, SaC is a business enabler. Security and compliance requirements are becoming increasingly central to businesses’ core products and services. In this respect, SaC not only expedites time to market but expands opportunities for innovation and product creativity without compromising security.”
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
Separation of duties is a critical component of policy enforcement. An additional component to consider is separation of understanding. What we mean by that is there may be some people in your organization who are much more knowledgeable about what policies must be enforced, and other people who understand deeply what data is contained in certain tables - “experts” on the data, so to speak.
If you’ve created a few global policies during this POV guide, you’ve noticed that they are driven by tags on data. Wouldn’t it be nice if you could rely on data experts to ensure that data is being tagged correctly, and rely on the data engineers to ensure that policy is being authored appropriately based on requirements - separation of understanding? This is possible with Immuta.
As with most features we’ve discussed in this guide, this is an optional approach that can be taken to optimize scalability and avoid costly policy mistakes by delegating controls to those who know best.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data, tagging and policy building is optimized by those who know it best.
Decreased cost: operating efficiently at scale, everything happens faster because experts of their domain are in charge.
Decreased risk: putting experts in charge of their domain reduces risks and mistakes by avoiding go-betweens.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy or assign an expert against any table in Immuta OR
Are a “Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
We are going to add an Expert to your data table; once they are an expert, they can manage and validate tags while you focus on policy building.
Log in to Immuta with your user with GOVERNANCE permission (and/or is the Data Owner of the table “Immuta Fake HR Data”).
Visit the Immuta Fake HR Data data source.
Click the Members tab.
Find the non-admin user you created earlier and change their Role from subscribed to expert.
Now that user is no longer simply subscribed to that table, as an expert they can manage all metadata surrounding the data source, such as documentation and tags.
Log in to Immuta with the user you just made an expert in the above step.
Visit the same Immuta Fake HR Data data source.
Visit the Data Dictionary tab.
Click the Add tags button to the right of the race column.
Start typing Ethnic and you should see the tag Discovered.Entity.Ethnic Group - select that tag.
Note if you have multiple computes/warehouses with this table, you should do this in all of them.
That tag was not automatically discovered by Immuta’s sensitive data discovery but now was added by an expert.
If you had policies that were built referencing Discovered.Entity.Ethnic Group, they would now attach to that column since an expert tagged it as such.
Forcing a single “god” user (or very small set of god users) to manage everything isn’t necessarily an anti-pattern, but it certainly can make things more difficult for those charged with solving your access control problems. You may force some users out of their comfort zone where there are other users more comfortable with making those calls - you should empower them if they exist.
Prerequisite: Before using this walkthrough, please ensure that you’ve done the POV Data Setup.
In the prior walkthroughs in this theme, we’ve spent a lot of time talking about attribute-based access controls and their benefits. However, in today’s world of modern privacy regulations, deciding what a single user can see is not just about who they are, but about what they are doing. For example, the same user may not be able to see credit card information normally, but if they are doing fraud detection work, they are allowed to.
This may sound silly - it’s the same person doing the analysis, so why should we make this distinction? This gets into a larger discussion about controls. When most think about controls, we think about data controls: how do we hide enough information (hide rows, mask columns) to lower our risk. There’s a second class of controls called contextual controls; what this amounts to is having a user agree they will only use data for a certain purpose, and not step beyond that purpose. Combining contextual controls with data controls is the most effective way to reduce your overall risk.
In addition to the data controls you’ve seen in Immuta, Immuta is also able to enforce contextual controls through what we term “purposes.” You are able to assign exceptions to policies, and those exceptions can be the purpose of the analysis in addition to who the user is (and what attributes they have). This is done through Immuta projects; projects contain data sources, have members, and are also protected by policy, but most importantly, projects can also have a purpose which can act as an exception to data policy. Projects can also be a self-service mechanism for users to access data for predetermined purposes without having to involve humans for ad hoc approvals.
Purpose-based exceptions reduce risk and align with many privacy regulations, such as GDPR and CCPA. They also allow policy to be created a priori for exceptions to rules based on anticipated use cases (purposes) in your business, thus removing time-consuming and ad hoc manual approvals.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data, no waiting for humans to make decisions.
Decreased cost: operating efficiently at scale, agility at scale because humans are removed from the daily approval flows.
Decreased risk: align to privacy regulations and significantly reduce risk with the addition of contextual controls.
Assumptions: Your user has the following permission in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to add a new purpose, build policy against any table in Immuta, and approve a purpose’s use in a user created project.
In this example we are going to hide credit card data unless acting under a certain purpose. Let’s start by creating the purpose as a user with GOVERNANCE permission.
Click the Governance icon in the left sidebar of the Immuta console.
On the Purposes tab, click + Add Purpose.
For Purpose Name put Fraud Analysis.
Leave the default Statement. However, this statement is important: it is what the user is agreeing to when they use this Purpose; you can edit it to whatever internal legal language you want.
Leave the Description empty.
Click Create.
Click Edit to the right of your Fraud Analysis purpose to edit it.
Click Add Sub Purpose.
For the nested purpose, enter Charges.
Click Save.
Now let’s build a policy:
Click the Policies icon in the left sidebar.
Click + Add New Data Policy.
Name it Mask Credit Card Numbers.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Entity.Credit Card Number.
Change the masking type to by making null.
Leave everyone except.
Set when user is to acting under purpose.
Set Fraud Analysis.Charges as the purpose.
Click Add.
Now let’s add a second action to this policy:
Click + Add Another Action.
Select Limit usage to purpose(s).
Select Fraud Analysis as the purpose. (Notice that we left off Charges, unlike above.)
Change for everyone except to for everyone.
Click Add.
Leave Where should this policy be applied? as is.
Click Create Policy and then Activate Policy.
Click the Data icon and select Projects in the left sidebar of the Immuta console.
Click + New Project.
Name the Project My Fraud Project.
Set the description as Immuta POV.
Leave the documentation as the default. (You could add markdown to describe your project here.)
Set your purpose to Fraud Analysis.
Ignore Native Workspace.
For Data Sources, select your Fake Credit Card Transactions table(s).
Click Affirm and Create -- note that you are affirming the acknowledgement statement that popped up in the previous section.
Click the Project Overview tab.
You will see the Fraud Analysis purpose there, but it is staged.
At this point you have the project, but until another user with GOVERNANCE or PROJECT_MANAGEMENT permission approves that purpose on that project, you cannot act under it. This is because a human must confirm that the data sources you added to the project align with the purpose you are acting under and the project you are attempting to accomplish. Yes, this is a manual approval step; however, it is fully documented in the project and audited, allowing the approver to make a decision with all the information required. This is not a policy decision - it is a decision on whether the project legitimately aligns to the purpose. Let’s go ahead and do that with your other user.
Go to your Immuta window that has the admin user with GOVERNANCE permission logged in.
You should see a little red dot above the Requests icon in the upper right corner of the Immuta console.
If you click on the Requests icon, you will see the message There are 1 pending Purpose Approval request(s).
Click the Review button. This will drop you into your Requests window under your profile. You can review the request by visiting the project through the hyperlink, but since you already know about the project, just click the checkbox on the right to approve it.
Go back to the other non-admin user window and refresh the project screen.
You will be asked to acknowledge the purpose per the statement that was attached when the purpose was created. Click I Agree. (That will be audited in Immuta.)
Now that the Fraud Analysis purpose is active, click in the upper right corner of the console where it says No Current Project - that menu is how you switch your project contexts to act under a purpose.
Set your current project to the one you created: My Fraud Project.
You are now acting under the purpose: Fraud Analysis.
Ok, that was cool, but look - the credit card numbers are null. This is because we used a more specific purpose as the exception to the credit card masking policy; remember, it was Fraud Analysis.Charges rather than just Fraud Analysis. So let’s make our purpose more specific in the Project, re-approve it, and then show that the credit card numbers are in the clear.
Using the non-admin user that created the project, click Manage above the purposes on the My Fraud Project Overview tab.
In the Purposes drop down, uncheck Fraud Analysis and then select Fraud Analysis.Charges.
This will require you to affirm the new purpose.
Go back to your admin user and go through the flow of approving the purpose again; you will have another Requests notification. (You can just refresh the Requests screen if you are already there.)
Once approved, go back to the non-admin user and refresh their My Fraud Project window.
Click I Agree to the acknowledgement.
But wait, did you notice something? Why are you able to see the table at all? You aren’t acting under the Fraud Analysis purpose anymore. This is because Fraud Analysis.Charges is a more specific subset of Fraud Analysis, so by acting under it you are also acting under any purposes further up the tree - the power of hierarchical purposes!
DO THIS: Ok, now we need to do some cleanup because we want to use that credit card data later in these walkthroughs and not have to act under a purpose to do so (this will let the other walkthroughs stand on their own without having to do this walkthrough).
With your admin user, click the Policies icon in the left sidebar.
Find the Mask Credit Card Numbers policy you created in this walkthrough.
Click the menu button to the right of it and select Delete.
Click Confirm.
With your non-admin user, switch your project toggle: Switch Current Project: None. Note: If you do not do this step, you will only be able to see the tables in the My Fraud Project and no tables outside the My Fraud Project when querying data.
Some may claim they can do purpose exceptions using - you guessed it - Roles! Sigh, as we’ve seen, this continues to exacerbate our role explosion problem.
Also, there are two kinds of RBAC models: flat and hierarchical. Flat means you can only work under one role at a time (Snowflake uses this model), which would align well if you wanted to do the anti-pattern and use roles as purposes. However, most databases (everything other than Snowflake) have hierarchical roles, meaning you act under all your roles at once. For hierarchical roles, using a role as a purpose doesn’t work because at runtime you have no idea which purpose the user is actually acting under. Why does that matter? Remember, the user acknowledged they would only use the data for that certain purpose; if the user has no way to explicitly state which purpose they are acting under, how can we hold them accountable?
Lastly, there is no workflow for the user to acknowledge they will only use the data for the purpose if you are simply assigning them roles, nor is there a workflow to validate that the purpose is aligned to the project the user is working on.
For these reasons, Purpose needs to be its own object/concept in your access control model.
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
As mentioned earlier, by having highly granular controls coupled with anonymization techniques, more data than ever can be at the fingertips of your analysts and data scientists (we’ve seen examples of up to 50% more).
Why is that?
Let’s start with a simple example and get more complex. Obviously, if you can’t do row- and column-level controls, and you are limited to only GRANTing access to tables, you are either over-sharing or under-sharing. In most cases, it’s under-sharing: there are rows and columns in that table the users could see, just not all of them; instead, they are blocked from the table completely.
Ok, that was obvious, now let’s get a little more complex. If you have column-level controls, now you can give them access to the table, but you can completely hide a column from a user by making all the values in it null, for example, and, thus, they’ve lost all data/utility from that column, but at least they can get to the other columns.
We can make that masked column more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent - you can track and group by the values, but can’t know exactly what they are.
But you can make that masked column even more useful! If you use something like k-anonymization (which we’ll talk about shortly) instead of hashing, they can know many of the values, but not all of them, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you gain utility from the data while preserving privacy. These are termed Privacy Enhancing Technologies (PETs), and Immuta places them at your fingertips.
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
Creating a balance between privacy and utility is critical to stakeholders across the business. Legal and compliance stakeholders can rest assured that policy is in place, yet data analysts can have access to more data than ever before.
Because of this, the business reaps
Increased revenue: increased data access by providing utility from sensitive data rather than completely blocking it.
Decreased cost: building these PETs is complex and expensive, Immuta has invested years of research to apply these PETs dynamically to your data at a click of a button.
Decreased risk: your organization may end up over-sharing since they don’t have the granular controls at their fingertips, opening up high levels of risk. With Immuta, you can reduce risk through the privacy vs utility balance provided.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy on any table OR
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
While columns like first_name, last_name, email, and social security number can certainly be directly identifying (although we masked them in previous walkthroughs you may have completed), columns like gender and race, on the surface, seem like they may not be identifying. But they can be: imagine if there are very few Tongan males in this data set...in fact, there’s only one. So if I know of a Tongan male in that company, I can easily run a query like this and figure out that person’s salary without using their name, email, or social security number:
select * from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
This is the challenge with indirect identifiers. It comes down to how much your adversary, the person trying to break privacy, knows externally, which is unknowable to you. In this case, all they had to know was the person was Tongan and male (and there happens to be only one of them in the data) to figure out their salary (it’s $106,072). This is called a linkage attack and is specifically called out in privacy regulations as something you must contend with, for example, from GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
So you see where we are going: almost any useful column with many unique values will be a candidate for indirectly identifying an individual, but also be an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. Before we get started with K-Anonymization, take note of two things by querying the data:
If you only search for “Tongan” alone (no Male), there are several Tongan females, so this linkage attack no longer works: select * from immuta_fake_hr_data where race = 'Tongan';
There are no null values in the gender or race columns.
Let's build a k-anonymization policy:
Log in to Immuta with your user with GOVERNANCE permission (and/or is the Data Owner of the table “Immuta Fake HR Data”).
Visit the Immuta Fake HR Data data source and click the Policies tab.
If you’ve done some of the other walkthroughs, you will see those policies listed here because they propagated from a global policy down to what we call local policies.
In this case, we will create a local policy on this specific table (make sure if you have multiple computes/warehouses this is the one you plan to query against).
Click + New Policy in the Data Policies section.
Select the Mask option.
Set the mask type to with K-Anonymization.
Select the gender and race columns.
Leave using Fingerprint (group size = 5)
In this case, through our algorithm, we selected the best group size for you (see the third bullet below for more details). This means any combination of gender and race that shows up 5 or fewer times will be suppressed.
You could override this setting with your own group size, or
You could set the maximum re-identifiability probability as a way to set the group size, meaning if you want a 1% chance of re-identification you will have a higher group size than if you accept a 20% re-identification probability. In other words, you are trading utility for privacy because more data will be suppressed the lower the re-identifiability probability. The default for the fingerprint setting (described in the first bullet above) uses a heuristic that attempts to preserve 80% of the information in the columns without going below a maximum re-identification probability of 20% (group size of 5 or greater). This assumes you’ve selected all possible indirect identifiers in the k-anonymization policy.
Change for everyone except to for everyone.
Click Create and Save All.
It may take a few seconds for Immuta to run the k-anonymization calculations to apply this policy.
First let’s run this query again to find the male Tongan’s salary: select * from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
Wait...what...no results?
Ok, let’s run this query ignoring the gender: select * from immuta_fake_hr_data where race = 'Tongan';
We only get the Females back!
We successfully averted this linkage attack. Remember, from our queries prior to the policy, the salary was 106072, so let’s run a query with that: select * from immuta_fake_hr_data where salary = 106072;
There he is! But notice race is suppressed (NULL) so this linkage attack will not work. It was also smart enough to NOT suppress gender because that did not contribute to the attack; suppressing race alone averts the attack. This technique provides as much utility as possible while preserving privacy.
The anti-pattern here is coarse-grained access control. Over- and under-sharing gets you in hot water with either Legal and Compliance (who want more privacy) or the analysts (who want more data), depending on which direction you go. Advanced anonymization techniques give you the flexibility to make these tradeoffs and keep both stakeholders happy.
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
This is one of the largest challenges for organizations. Having multiple warehouses/compute, which is quite common, means that you must configure policies uniquely in each of them. For example, the way you build policies in Databricks is completely different from how you build policies in Snowflake. Not only that, they support different levels of control, so while you might be able to do row-level security in Snowflake, you can’t in Databricks. This becomes highly complex to manage, understand, and evolve (really hard to make changes).
Just like the big data era created the need to separate compute from storage, the privacy era requires you to separate policy from platform. Immuta does just that; it abstracts the policy definition from your many platforms, allowing you to define policy once and apply anywhere - consistently!
Evolvability and consistency are the key outcomes of separating policy from platform. It’s easy to make changes in a single place and apply everywhere consistently.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data, building and evolving policy is optimized.
Decreased cost: operating efficiently at scale, you only have to make changes in a single well understood platform.
Decreased risk: avoid data leaks caused by uniquely managing and editing policy in each platform which works differently.
There is no walkthrough for this topic, because you’ve already been doing it (or could do it).
If you have multiple compute/warehouses, make sure you configure all of them as described in Part 3 of the POV Data Setup.
Once configured, use any of the policy walkthroughs in this guide to show how a policy built once in Immuta is applied consistently in all of them.
The anti-pattern is obvious: do not build policies uniquely in each warehouse/compute you use; this will create chaos, errors, and make the data platform team a bottleneck for getting data in the hands of analysts.
Legacy solutions, such as Apache Ranger, can only deliver the abstraction of policy from compute within the Hadoop ecosystem. This is due to inconsistencies in how Ranger enforcement has been implemented in the other downstream compute/warehouse engines. That inconsistency arises not only from ensuring row-level, column-level, and anonymization techniques work the same in Databricks as they do in Snowflake, for example, but also from the additional roles that must be created and managed in each system separately and inconsistently from the policy definitions. With Immuta, you have complete consistency without forcing new roles into each individual warehouse’s paradigm.
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
If you’ve already completed the earlier walkthroughs, you’ve seen the power of separating duties (and knowledge). In short, when managing policy there are many components, and it’s best to put the experts in each domain in control of that domain.
This walkthrough takes the concept a step further by providing an overview of how you can avoid “god” users who control everything and instead use a domain-focused approach where data owners control their own data sources with complete autonomy.
To do this with Immuta you simply remove GOVERNANCE permission from everyone; in doing so, Data Owners have complete autonomy over how their data is controlled because they can write their own policies, even global policies, that are restricted to only their data sources. (You may have noticed this referenced multiple times in the Assumptions section of the walkthroughs.)
As with most features we’ve discussed in this guide, this is an optional approach that can be taken to optimize scalability and avoid costly policy mistakes by putting controls in the hands of those who know the data best.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data, policy building is optimized by those who know it best.
Decreased cost: operating efficiently at scale, everything happens faster because experts of their domain are in charge.
Decreased risk: putting experts in charge of their domain reduces risks and mistakes by avoiding go-betweens and also aligns with many extraterritorial regulations.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
In this walkthrough, we are going to make a user a data owner and have them build a global policy that will only interact with the data sources they own.
Log in to Immuta with the user that owns the data sources you created in the POV Data Setup. Since this user likely also has the GOVERNANCE permission (the default for the first user on the system), we are going to give the non-admin user you created in Part 3 of the POV Data Setup ownership of one of the data sources:
Visit the Immuta Fake Credit Card Transactions data source.
Click the Members tab.
Find the non-admin user you created in Part 3 of the POV Data Setup. (Ensure they do NOT have GOVERNANCE permission.)
Change their Role from subscribed to owner.
That user is now considered a data owner of that data source. Note, you do not need to have GOVERNANCE permission to set ownership on a data source; any data owner of that specific data source can do that. Now, using that user you just made an owner of the “Immuta Fake Credit Card Transactions” data source, let’s build a policy.
Log in to Immuta with the user you just made an owner in the above step.
Click the Policies icon in the left sidebar.
Click + Add New Data Policy.
Name it My Data Owner Policy.
For action, select Minimize data source.
Set the percentage to 90%.
Change for everyone except to for everyone.
Click Add.
For Where should this policy be applied?
Select On data sources.
For circumstance, select tagged.
For the tag, select Immuta POV.
You’ll notice there’s an additional filter to where the policy will be applied: Restricted to data sources owned by users…
If you do not see this, it’s because that user has GOVERNANCE permission.
You’ll notice the user’s name is there.
You cannot edit this since you do not have GOVERNANCE permission, so this policy will automatically be limited to only the data sources this user owns.
Click Create Policy and then Activate Policy.
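As an optional sanity check, you can compare the row count you see through Immuta before and after activating the policy. A minimal sketch, run in your compute/warehouse of choice against the Immuta-protected “Immuta Fake Credit Card Transactions” table:

-- With the minimization policy active, the row count returned through Immuta
-- should shrink according to the 90% setting configured above.
select count(*) from immuta_fake_credit_card_transactions;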
Note that users with GOVERNANCE permission are the only users who can create tags and purposes in Immuta, so the workflow would be to curate those up front (if necessary) and then trim back GOVERNANCE permissions as needed.
You may want a central group of users to control all policy across your data ecosystem, you may want complete autonomy for your data owners, or you may want something in between. So there is no concrete anti-pattern here; just note that Immuta can support any of these approaches.
Prerequisite: Before using this walkthrough, please ensure that you’ve first done Part 3 of the POV Data Setup.
Performance tests can be complex; you must use realistic queries and scenarios and ensure apples-to-apples comparisons. Luckily, much of this hard work has already been done by TPC-DS. TPC-DS data has been used extensively by database and big data companies for testing performance, scalability, and SQL compatibility across a range of data warehouse queries, from fast, interactive reports to complex analytics. It reflects a multidimensional data model of a retail enterprise selling through 3 channels (stores, web, and catalogs), with the data sliced across 17 dimensions, including Customer, Store, Time, and Item. The bulk of the data is contained in the large fact tables (Store Sales, Catalog Sales, Web Sales), representing daily transactions spanning 5 years.
Databricks uses TPC-DS for their own internal testing, and Immuta has taken components of that Databricks test suite and created a Databricks notebook that
Generates the TPC-DS data (at the scale you desire)
Registers it with Immuta
Applies masking policies
Runs through the test suite, capturing results
Does so on both Immuta-enabled and non-Immuta clusters
Generates a report at completion
This can be run against any of your Databricks clusters enabled by the different Immuta cluster policies. In fact, you can run this on clusters enabled by competitors to see the same comparisons.
In our own internal testing, with over 100 column masking policies in place (SHA-256 salted hashing), we see slightly over 1 second of overhead on average, which varies by different cluster policies. You can read more about our internal results here.
During Part 3 of the POV Data Setup you should have downloaded the Benchmarking suite.
Import the Notebook downloaded from Step 1 into Databricks.
Go to your workspace.
Click the down arrow next to your username.
Select import.
Import the file from Step 1.
Follow the instructions in the notebook.
Doing simple select * from table queries to validate performance. TPC-DS has done a lot of work to create a realistic analytical query suite - you should use it. That being said, feel free to also run tests on your own data.
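For a sense of the difference, here is a simplified query written in the spirit of TPC-DS (it is not one of the official benchmark queries) that exercises joins and aggregation over the standard TPC-DS tables rather than a bare table scan:

-- Illustrative analytical query over TPC-DS tables (not an official query).
select d.d_year, i.i_category, sum(ss.ss_net_paid) as total_net_paid
from store_sales ss
join date_dim d on ss.ss_sold_date_sk = d.d_date_sk
join item i on ss.ss_item_sk = i.i_item_sk
group by d.d_year, i.i_category
order by total_net_paid desc;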
This was the final walkthrough in the POV Guide, but feel free to go back and do others you may have skipped.
Prerequisites: Before using this walkthrough, please ensure that you’ve first completed Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
Cell-level security is not exactly an advanced privacy enhancing technology (PET) like we showed in the Example of anonymizing a column rather than blocking it walkthrough, but it does provide impressive granular controls within a column for common use cases.
What is cell level security?
If you have values in a column that should sometimes be masked, but not always, that is masking at the cell level, meaning the intersection of a row with a column. What drives whether a cell should be masked is some other value (or set of values) in the same row, or in a joined row from another table.
Let’s use a silly example. Let’s say we want to mask the credit card numbers, but only when the transaction amount is greater than $500. This allows you to drive masking in a highly granular manner based on other data in your tables.
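Conceptually, this is equivalent to wrapping the column in a conditional expression. Immuta generates and manages the actual policy-enforcing logic for you, but a hand-written sketch of the effect, assuming the credit card column is named credit_card_number, might look like this:

-- Sketch only: mask the credit card number when the amount exceeds $500.
select
  case
    when transaction_amount > 500 then 'REDACTED'
    else credit_card_number
  end as credit_card_number,
  transaction_amount
  -- ...remaining columns unchanged
from immuta_fake_credit_card_transactions;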
Creating a balance between privacy and utility is critical to stakeholders across the business. Legal and compliance stakeholders can rest assured that policy is in place, yet data analysts can have access to more data than ever before.
Because of this, the business reaps
Increased revenue: increased data access by providing utility from sensitive data rather than completely blocking it.
Decreased cost: the number of views you would need to create and manage to do cell-level controls manually would be enormous.
Decreased risk: without granular controls at its fingertips, your organization may end up over-sharing, opening up high levels of risk. With Immuta, you can reduce risk through the privacy vs utility balance provided.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy on any table OR
“Data Owner” of the registered tables (you likely are the Data Owner and have GOVERNANCE permission).
Let's create a Global masking policy.
Log in to Immuta with the user that owns the data sources you created in the POV Data Setup.
Click the Policies icon in the left sidebar.
Click + Add New Data Policy.
Name it Mask Credit Cards.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Entity.Credit Card Number.
Change the masking type to using a constant.
Enter the constant REDACTED.
Change for to where.
For the where clause, enter transaction_amount > 500.
Note that you can also reference tags in your where clause, so we could have done something like @columnTagged('amounts') > 500 if the transaction_amount columns are named differently across tables.
Change everyone except to everyone. (This policy will have no exceptions.)
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess properly based on previous steps.)
Click Create Policy and then Activate Policy.
You can also test whether everything was masked correctly in the “Immuta Fake Credit Card Transactions” table by following the Query Your Data guide. Note that when the transaction_amount value in a row is greater than $500, the credit card number in that same row is replaced with the word REDACTED.
Note: If you cannot query the “Immuta Fake Credit Card Transactions” table, it’s likely because you did not remove the purpose restriction policy from the Purpose based exceptions walkthrough.
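For a quick spot check in SQL (again assuming the credit card column is named credit_card_number), a query like this should show REDACTED only on rows over $500:

-- Rows with transaction_amount > 500 should show REDACTED for the card number.
select credit_card_number, transaction_amount
from immuta_fake_credit_card_transactions
order by transaction_amount desc
limit 20;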
Coarse-grained access control. Over- and under-sharing gets you in hot water with either Legal and Compliance (want more privacy) or the analysts (want more data), depending on which direction you go. Highly granular techniques like cell-level security give you the flexibility to make these tradeoffs and keep both stakeholders happy.
Feel free to return to the POV Guide to move on to your next topic.
Prerequisite: Before using this walkthrough, please ensure that you’ve first done the Parts 1-5 of the POV Data Setup and the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
While many platforms support the concept of object tagging / sensitive data tagging, very few truly support hierarchical tag structures.
First, a quick overview of what we mean by hierarchical tag structure:
This would be a flat tag structure:
SUV
Subaru
Truck
Jeep
Gladiator
Outback
Each tag stands on its own and is not associated with one another in any way; there’s no correlation between Jeep and Gladiator nor Subaru and Outback.
A hierarchical tagging structure establishes these relationships, and we’ll explain why this is important momentarily.
SUV.Subaru.Outback
Truck.Jeep.Gladiator
“Support” for a tagging hierarchy is more than just supporting the tag structure itself. More importantly, policy enforcement should respect the hierarchy as well. Let’s run through a quick contrived example. Let's say that you wanted the following policies:
Mask by making null any SUV data
Mask using hashing any Outback data
With a flat structure, if you build those policies they will be in conflict with one another. To avoid that problem you would have to order which policies take precedence, which can get extremely complex when you have many policies. This is in fact how many policy engines handle this problem. (We’ll discuss more in the Anti-Patterns section.)
Instead, if your policy engine truly supports a tagging hierarchy like Immuta does, it will recognize that Outback is more specific than SUV, and have that policy take precedence.
Mask by making null any SUV data
Mask using hashing any SUV.Subaru.Outback data
Policies are applied correctly without any need for complex ordering of policies.
This allows the business to think about policy, and the application of policy, based on a logical model of their data. Because of this, you are provided:
Understandability: Policies are easily read and understood on their own without having to also comprehend precedence of policy (e.g., inspect each policy in combination with all other policies).
Evolvability: What if you need to change all Subaru data to hashing now? With Immuta, that’s an easy change, just update the policy. With solutions that don’t support tagging hierarchy, you must understand both the policy and its precedence. With a tagging hierarchy the precedence was taken care of when building the logical tagging model.
Correctness: If two policies hit each other at the same level of the hierarchy, the user is warned of this conflict when building the second policy. This is important because in this case there likely is a genuine conflict of opinion about what the policy should do, and the business can make a decision. With policy ordering, this conflict is not apparent.
Because of this, the business reaps
Increased revenue: accelerate data access / time-to-data.
Decreased cost: operating efficiently at scale, agility at scale by avoiding comprehension of all policies at once in order to create/edit more of them.
Decreased risk: avoid policy errors through missed conflicts and not understanding policy precedence.
Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
GOVERNANCE: in order to build policy against any table in Immuta OR
“Data Owner” of the registered tables. (You likely are the Data Owner and have GOVERNANCE permission.)
To build a policy using tags,
In Immuta, visit the Fake HR Data data source (from any warehouse/compute).
Go to the Data Dictionary tab and view where you have the Discovered.Identifier Direct and the Discovered.Entity.Social Security Number tags. Let’s build two separate policies using those.
Policy 1:
Click the Policies icon in the left sidebar of the Immuta console.
Click + Add New Data Policy.
Name it Mask Direct Identifiers.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Identifier Direct.
Change masking type to by making null.
Change everyone except to everyone. (This policy will have no exceptions.)
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess correctly based on previous steps.)
Click Create Policy and then Activate Policy.
Policy 2:
Click + Add New Data Policy.
Name it Mask SSN.
For action, select Mask.
Leave columns tagged.
Type in the tag Discovered.Entity.Social Security Number.
Change masking type to using hashing.
Change everyone except to everyone. (This policy will have no exceptions.)
Click Add.
Leave Where should this policy be applied? as is. (Immuta will guess correctly based on previous steps.)
You can further refine where this policy is applied by adding another circumstance:
Click + Add Another Circumstance.
Change the or to an and.
Select tagged for the circumstance. (Make sure you pick “tagged” and not “with columns tagged.”)
Type in Immuta POV for the tag name. (Remember, this was the tag you created in Schema Monitoring and Automatic Sensitive Data Discovery.) Note that if you are a Data Owner of the tables without GOVERNANCE permission, the policy will be automatically limited to the tables you own.
Click Create Policy and then Activate Policy.
Now visit the Fake HR Data data source again (from any warehouse/compute).
Click the Policies tab.
You will see both of those policies applied; however, the “Mask Direct Identifiers” policy was not applied to the Social Security Number column because it is not as specific as the “Mask SSN” policy.
You can also test that everything was masked correctly by following the Query Your Data guide.
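For example, a quick query should show the direct-identifier columns coming back NULL while the social security number column comes back hashed. The column names below are assumptions about the Fake HR Data table; substitute the names you see in the Data Dictionary:

-- Direct identifiers should be NULL; the SSN column should be hashed.
select name, social_security_number, salary
from immuta_fake_hr_data
limit 10;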
This has already been covered fairly well in the business value section, but policy precedence ordering is the anti-pattern, and it is unfortunately common in tools such as Sentry and Ranger. The problem is that you put the onus on the policy builder to understand the precedence rather than baking it into your data's metadata (the tag hierarchy). The policy builder must understand all other policies and cannot build their policy in a vacuum. Similarly, anyone reading a policy must consider it in tandem with every other policy and its precedence to understand how policy will actually be enforced. Other tools, like Snowflake and Databricks, have no concept of policy precedence at all, which leaves you with no solution to this problem.
Yes, this does put some work on the business to correctly build “specificity” into their tagging hierarchy (depth == specificity). This is not necessarily easy; however, this logic will have to live somewhere, and having it in the tagging hierarchy rather than policy order again allows you to separate policy definition from data definition. This provides you scalability, evolvability, understandability, and, we believe most importantly, correctness because policy conflicts can be caught at policy-authoring-time as described in the business value section.
Feel free to return to the POV Guide to move on to your next topic.
Prerequisite: Before using this walkthrough, please ensure that you’ve first done the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough.
Prerequisites: Ensure you are using an Immuta-enabled cluster, as described in Part 3 of the POV Data Setup.
You must consider the Databricks user you are using. If you are a Databricks Admin, certain scenarios may allow you to bypass Immuta security controls. In Part 3 of the POV Data Setup you may have created a non-admin user, and it’s best to use that user in lieu of (or in combination with) your admin user, especially when querying your data in Databricks. There are scenarios to consider when using the admin user, and the workflow that sent you to this page will describe whether or not you should use the admin user in your queries.
Immuta enforcement occurs by rewriting the SQL query under the covers prior to the Databricks catalyst planning. This means all data interactions in Databricks must occur through a metastore SQL query - no direct file access. If you attempt direct file access to your cloud storage (S3, ADLS, Google Storage) it will be blocked. The below notebook will demonstrate querying data into a DataFrame in each language using this pattern. Note it is possible to configure direct file access using Immuta, but that is limited to file-level controls only and is not covered in this POV Guide. Please work with your Immuta expert for more information if you want direct file reads.
We highly recommend abstracting structured data with a metastore if you are not already doing so.
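To make the distinction concrete, here is a minimal Spark SQL sketch. The table name comes from this guide, while the storage path is a made-up placeholder; on an Immuta-enabled cluster the first query works and the second is blocked:

-- Allowed: reading through the metastore.
select * from immuta_fake_hr_data limit 10;

-- Blocked: direct file access to cloud storage (placeholder path).
select * from parquet.`s3://your-bucket/pov_data/immuta_fake_hr_data/`;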
Run a query:
Download the following Notebook* to run queries.
Consider using an incognito/private window to use two different users (your admin user and the non-admin user you potentially created) to run queries, importing that notebook for both.
*That notebook contains generic queries; some of the walkthroughs may instruct you to run different queries.
Prerequisite: Ensure you have configured the Databricks SQL integration with Immuta as described in Part 3 of the POV Data Setup.
Immuta enforces policy by creating Databricks SQL views that contain all the policy logic.
When moving to production, you should REVOKE access to the raw tables from everyone except the people who need to manage them, and instead only let users query these Immuta-managed views.
Those views are created in the database configured in Part 3 of the POV Data Setup.
The table name of the raw table is maintained (as the view name), prefixed with the source database name to avoid collisions.
Immuta handles GRANTing access to those views.
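A production-hardening sketch of the REVOKE recommendation above; the raw database name and the principal are placeholders, not values from this guide:

-- Remove direct access to a raw table so analysts only use the Immuta views.
REVOKE SELECT ON TABLE raw_db.immuta_fake_hr_data FROM `analyst@example.com`;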
Run a query:
Make sure you select SQL from the upper left menu in Databricks.
Click Create → Query.
Select the database name used when configuring the integration in Part 3 of the POV Data Setup.
Run the following queries in the worksheet* (you must include the database):
select * from [database name from above].immuta_pov_immuta_fake_credit_card_transactions limit 100;
select * from [database name from above].immuta_pov_immuta_fake_hr_data limit 100;
Consider using an incognito/private window to use two different users (your admin user and the non-admin user you potentially created) to run queries.
*These are generic queries, some of the walkthroughs may instruct you to run different queries.
Prerequisite: Ensure you have configured the Snowflake integration with Immuta as described in Part 3 of the POV Data Setup.
Immuta enforces policy by creating Snowflake views that contain all the policy logic. Those views are available to the PUBLIC role in Snowflake, which is the foundation of every role. The outcome of this means:
When moving to production, you should REVOKE access to the raw tables from everyone except the people who need to manage them, and instead only let users query these Immuta-managed views.
Since PUBLIC is the foundation of every role, you no longer need to manage Snowflake roles for your end users; roles can be limited to defining warehouse access.
You do not need to explicitly GRANT access to those views; that is handled by Immuta by GRANTing to the PUBLIC role (and further protected using the subscription policies you will configure in a future walkthrough).
Those views are created in the database configured in Part 3 of the POV Data Setup.
Schema name of the raw table is maintained.
Table name of the raw table is maintained (as the view name).
Run a query:
Create a new worksheet.
You must select the appropriate role/database/schema to query the views:
Role: any or just PUBLIC.
Warehouse: any.
Database: the database name used when configuring the integration in Part 3 of the POV Data Setup (default is IMMUTA).
Schema: POV_DATA (or whatever you named the schema in Part 2 of the POV Data Setup).
Run the following queries in the worksheet*:
select * from immuta_fake_credit_card_transactions limit 100;
select * from immuta_fake_hr_data limit 100;
Consider using an incognito/private window to use two different users (your admin user and the non-admin user you potentially created) to run queries.
*These are generic queries, some of the walkthroughs may instruct you to run different queries.
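Equivalently, you can set the query context with SQL statements instead of the worksheet drop-downs. A sketch assuming the default database name from Part 3 and an example warehouse name:

use role PUBLIC;
use warehouse COMPUTE_WH;  -- any warehouse you have access to; the name here is just an example
use database IMMUTA;       -- default database name from Part 3 of the POV Data Setup
use schema POV_DATA;       -- or whatever you named the schema in Part 2
select * from immuta_fake_credit_card_transactions limit 100;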
Prerequisite: Ensure you have configured the Synapse integration with Immuta as described in Part 3 of the POV Data Setup.
Immuta enforces policy by creating Synapse views that contain all the policy logic within a schema (contained in your dedicated SQL pool) that is created in Part 3 of the POV Data Setup.
When moving to production, you should REVOKE access to the raw tables from everyone except the people who need to manage them, and instead only let users query these Immuta-managed views. In other words, only give them access to the schema created in Part 3 of the POV Data Setup.
Those views are created in the schema configured in Part 3 of the POV Data Setup.
Table name of the raw table is maintained, prefixed with the database name to avoid conflicts.
Immuta handles GRANTing access to those views.
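A minimal T-SQL sketch of the recommendation above to grant access only to the Immuta-managed schema; both the schema name and the principal are placeholders:

-- Grant analysts access to the Immuta-managed schema rather than the raw tables.
GRANT SELECT ON SCHEMA::immuta_schema TO analyst_user;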
Run a query:
From Synapse Studio click on the Data menu on the left.
Click on the Workspace tab.
Expand databases and you should see the dedicated pool you created: immuta_pov.
Expand immuta_pov and you should see the schema you created in Part 3 of the POV Data Setup.
Select that schema.
Select New SQL script and then Empty script.
Run the following queries* (note that Synapse does not support LIMIT and that the SQL is case sensitive):
SELECT TOP 100 * FROM [schema].immuta_pov_immuta_fake_credit_card_transactions;
SELECT TOP 100 * FROM [schema].immuta_pov_immuta_fake_hr_data;
*These are generic queries, some of the walkthroughs may instruct you to run different queries.
Prerequisite: Ensure you have configured the Starburst (Trino) integration with Immuta as described in Part 3 of the POV Data Setup.
Immuta enforces policy by creating views that contain all the policy logic. Those views are created within a catalog that is created in Part 3 of the POV Data Setup.
The outcome of this means:
When moving to production, you should REVOKE access to the raw tables from everyone except the people who need to manage them, and instead only let users query these Immuta-managed views.
You do not need to explicitly GRANT access to those views; that is handled by Immuta.
Those views are created in the catalog configured in Part 3 of the POV Data Setup.
Schema name of the raw table is maintained.
Table name of the raw table is maintained (as the view name).
Run a query:
Since Starburst (Trino) does not have a user interface, please use your tool of choice to connect.
Once connected, you should see the catalog that was created in Part 3 of the POV Data Setup.
Expand that and you should see the pov_data schema.
Expand that and you should see the secure views Immuta created; you can run these queries*:
select * from [catalog].pov_data.immuta_fake_credit_card_transactions limit 100;
select * from [catalog].pov_data.immuta_fake_hr_data limit 100;
*These are generic queries, some of the walkthroughs may instruct you to run different queries.
Prerequisites:
Ensure you have configured the Redshift integration with Immuta as described in Part 3 of the POV Data Setup.
You must use an RA3 instance type; this is because Immuta requires cross-database views and those are only supported in Redshift RA3 instance types.
If you get an error like this when querying the Immuta-generated views, you are NOT on an RA3 instance type:
Immuta enforces policy by creating views that contain all the policy logic. Those views are created within a database that is created in Part 3 of the POV Data Setup.
The outcome of this means:
When moving to production, you should REVOKE access to the raw tables from everyone except the people who need to manage them, and instead only let users query these Immuta-managed views.
You do not need to explicitly GRANT access to those views; that is handled by Immuta.
Those views are created in the database configured in Part 3 of the POV Data Setup.
Schema name of the raw table is maintained.
Table name of the raw table is maintained (as the view name).
Run a query:
Since Redshift does not have a user interface, please use your tool of choice to connect.
Once connected, you should see the database that was created in Part 3 of the POV Data Setup.
Expand that and you should see the pov_data schema.
Expand that and you should see the secure views Immuta created; you can run these queries*:
select * from [database].pov_data.immuta_fake_credit_card_transactions limit 100;
select * from [database].pov_data.immuta_fake_hr_data limit 100;
*These are generic queries, some of the walkthroughs may instruct you to run different queries.