Data Engineering with Limited Policy Downtime

When executing transforms in your data platform, new tables and views are constantly being created, columns added, data changed - transform DDL. This constant transformation can cause latency between the DDL and Immuta policies discovering, adapting, and attaching to those changes, which can result in data leaks. This policy latency is referred to as policy downtime.

The goal is to have as little policy downtime as possible. However, because Immuta is separate from data platforms and those data platforms do not currently have webhooks or eventing service, Immuta does not receive alerts of DDL events. This causes policy downtime.

This page describes the appropriate steps to minimize policy downtime as you execute transforms using dbt or any other transform tool and links to tutorials that will help you complete these steps.

Prerequisites

Required:

  • Schema monitoring: This feature detects destructively recreated tables (from CREATE OR REPLACE statements) even if the table schema wasn’t changed. Enable schema monitoring when you register your data sources.

Recommended (if you are using Snowflake):

  • Snowflake table grants enabled: This feature implements Immuta subscription policies as table GRANTS in Snowflake rather than Snowflake row access policies. Note this feature may not be automatically enabled if you were an Immuta customer before January 2023; see Enable Snowflake table grants to enable.

  • Low row access mode enabled: This feature removes unnecessary Snowflake row access policies when Immuta project workspaces or impersonation are disabled, which improves the query performance for data consumers.

Step 1: Create global policies and prepare tags for data sources

To benefit from the scalability and manageability provided by Immuta, you should author all Immuta policies as global policies. Global policies are built at the semantic layer using tags rather than having to reference individual tables with policy. When using global policies, as soon as a new tag is discovered by Immuta, any applicable policy will automatically be applied. This is the most efficient approach for reducing policy downtime.

There are three different approaches for tagging in Immuta:

  1. Auto-tagging (recommended): This approach uses sensitive data discovery (SDD) to automatically tag data.

  2. Manually tagging with an external catalog: This approach pulls in the tags from an external catalog. Immuta supports Snowflake, Databricks Unity Catalog, Alation, and Collibra to pull in external tags.

  3. Manually tagging in Immuta: This approach requires a user to create and manually apply tags to all data sources using the Immuta API or UI.

Note that there is added complexity with manually tagging new columns with Alation, Collibra, or Immuta. These listed catalogs can only tag columns that are registered in Immuta. If you have a new column, you must wait until after schema detection runs and detects that new column. Then that column must be manually tagged. This issue is not present when manually tagging with Snowflake or Databricks Unity catalog because they are already aware of the column or using SDD because it runs after schema monitoring.

Using this approach, Immuta can automatically tag your data after it is registered by schema monitoring using sensitive data discovery (SDD). SDD is made of algorithms you can customize to discover and tag the data most important to you and your organization's policies. Once customized and deployed, any time Immuta discovers a new table or column through schema monitoring, SDD will run and automatically tag the new columns without the need for any manual intervention. This is the recommended option because once SDD is customized for your organization, it will eliminate the human error associated with manually tagging and is more proactive than manual tagging, further reducing policy downtime.

Native SDD should be enabled and customized before registering any data with Immuta. Contact your Immuta support professional to enable this feature flag for you before enabling SDD on the app settings page.

Manually tagging with an external catalog

Using this approach, you will rely on humans to tag. Those tags will be stored in the data platform (Snowflake, Databricks Unity Catalog) or catalog (Alation, Collibra). Then they will be synchronized back to Immuta. If using this option, Immuta recommends storing the tags in the data platfom because the write of the tags to Snowflake or Databricks Unity Catalog can be managed and reused with SQL from tools like dbt, removing burden from manually tagging on every run. API calls to Alation and Collibra are also possible, but are not accessible over dbt. Manually tagging through the Alation or Collibra UI will negatively impact data downtime.

Manually tagging in Immuta

Using this approach, you will rely on humans to tag, but the tags will be stored directly in Immuta. This can be done using the Immuta API or through the Immuta UI However, manually tagging through the Immuta UI will negatively impact data downtime.

Step 2: Register your data in Immuta

When registering tables with Immuta, you must register each database or catalog with schema monitoring enabled. Schema monitoring means that you do not need to individually register tables but rather make Immuta aware of databases, and then Immuta will periodically scan that database for changes and register any new changes for you. You can also manually run schema monitoring using the Immuta API.

Step 3: Consider the result and user making transformations

Views vs tables

Access to and registration of views created from Immuta-protected tables only need to be taken into consideration if you are using both data and subscription policies.

Views will have existing data policies (row-level security, masking) enforced on them that exist on the backing tables by nature of how views work (the query is passed down to the backing tables). So when you tag and register a view with Immuta, you are re-applying the same data policies on the view that already exist on the backing tables, assuming the tags that drive the data policies are the same on the view’s columns.

If you do not want this behavior or its possible negative performance consequences, then Immuta recommends the following based on how you are tagging data:

  • For auto-tagging, place your incremental views in a separate database that is not being monitored by Immuta. Do not register them with Immuta, and schema monitoring will not detect them from the separate database.

  • For either manually tagging option, do not tag view columns.

Using either option, the views will only be accessible to the person who created them. The views will not have any subscription policies applied to give other users access because the views are either not registered in Immuta or there are no tags. To give other users access to the data in the view, they should subscribe to the table at the end of the transform pipeline.

However, if you do want to share the views using subscription policies, you should ensure that the tags that drive the subscription policies exist on the view and that those tags are not shared with tags that drive your data policies. It is possible to target subscription policies on all tables or tables from a specific database rather than using tags.

Access level of job executioner

Policy is enforced on READ. Therefore, if you run a transform that creates a new table, the data in that new table will represent the policy-enforced data.

For example, if the credit_card_number column is masked for Steve, on read, the real credit card numbers will be dynamically masked. If Steve then copies them into a new table via the transform, he is physically loading masked credit card numbers into that table. Now if another user, Jane, is allowed to see credit card numbers and queries the table, her query will not show the credit card numbers. This is because credit card numbers are already masked in that table. This problem only exists for tables, not views, since tables have the data physically copied into them.

To address this situation, you can do one of the following:

  • Use views for all transforms.

  • Ensure the users who are executing the transforms always have a higher level of access than the users who will consume the results of the transforms. Or, if this is not possible,

  • Set up a dev environment for creating the transformation code; then, when ready for production, have a promotion process to execute those production transformations using a system account free of all policies. Once the jobs execute as that system account, Immuta will discover, tag, and apply the appropriate policy.

Step 4: Force data downtime

Data downtime refers to the techniques you can use to hide data after transformations until Immuta policies have had a chance to synchronize. It makes data inaccessible; however, it is preferred to the data leaks that could occur while waiting for policies to sync.

Whenever DDL occurs, it can result in policy downtime, such as in the following examples:

  • An existing table or view is recreated with the CREATE OR REPLACE statement. This will drop all policies resulting in users losing all access. There is one exception: with Databricks Unity, the grants on the table are not dropped, which means masking and row filtering policies are dropped but the table access is not, making policy downtime even more critical to manage.

  • A new column is added to a table that needs to be masked from users that have access to that table (potential data leak).

  • A new table is created in a space where other users have read access on future tables (potential data leak).

  • A tag that drives a policy is updated, deleted, or newly added with no other changes to the schema or table. This is a limitation of how Immuta can discover changes - it is too inefficient to search for tag changes, so schema changes are what drives Immuta to take action.

Best practices for Snowflake

Immuta recommends all of the following best practices to ensure data downtime occurs during policy downtime:

Without these best practices, you are making unintentional policy decision in Snowflake that may be in conflict with your organization's actual policies enforced by Immuta.

For example, if the CREATE OR REPLACE added a new column that contains sensitive data, and the user COPY GRANTS, it opens that column to users, causing a data leak. Instead, access must be blocked using the above data downtime techniques until Immuta has synchronized.

Best practices for Databricks Unity Catalog

Immuta recommends all of the following best practices to ensure data downtime occurs during policy downtime:

  • Do not grant to catalogs or schemas, because those apply to current and future objects (future is the problematic piece).

  • There is no way to stop Databricks Unity Catalog from carrying over table grants on CREATE OR REPLACE statements, so ensure you initiate policy uptime as quickly as possible if you have row filters or masking policies on that table.

  • Use CREATE OR REPLACE for all DDL, including altering tags, so that access is always revoked.

Without these best practices, you are making unintentional policy decision in Unity Catalog that may be in conflict with your organization's actual policies enforced by Immuta.

For example, if you GRANT SELECT on a schema, and then someone writes a new table with sensitive data into that schema, it could cause a data leak. Instead, access must be blocked using the above data downtime techniques until Immuta has synchronized.

Step 5: Initiate policy uptime

As discussed above, data platforms do not currently have webhooks or eventing service, so Immuta does not receive alerts of DDL events. Schema monitoring will run every 24 hours by default to detect changes, but schema monitoring should also be run across your databases after you make changes to them. You can manually run schema monitoring using the Immuta API and the payload can be scoped down run schema monitoring on a specific database or schema or column detection on a specific table.

When schema monitoring is run globally, it will detect the following:

  • Any new table

  • Any new view

  • Changes to the object type backing an Immuta data source (for example, when a TABLE changes to a VIEW); when an object type changes, Immuta reapplies existing policies to the data source.

  • Any existing table destructively recreated through CREATE OR REPLACE (even if there are no schema changes)

  • Any existing view destructively recreated through CREATE OR REPLACE (even if there are no schema changes)

  • Any dropped table

  • Any new column

  • Any dropped column

  • Any column type change (which can impact policy application)

  • Any tag created, updated, or deleted (but only if the schema changed; otherwise tag changes alone are detected with Immuta’s health check)

Then, if any of the above is detected, for those tables or views, Immuta will complete the following:

  1. Synchronize the existing policy back to the table or view to reduce data downtime

  2. If SDD is enabled, execute SDD on any new columns or tables

  3. If an external catalog is configured, execute a tag synchronization

  4. Synchronize the final updated policy based on the SDD results and tag synchronization

  5. Apply "New" tags to all tables and columns not previously registered in Immuta and lock them down with the "New Column Added" templated global policy

See the image below for an illustration of this process with Snowflake.

The two options for running schema monitoring are described in the sections below. You can implement them together or separately.

Alert Immuta through the API or a custom function

If the data platform supports custom UDFs and external functions, you can wrap the /dataSource/detectRemoteChanges endpoint with one. Then, as your transform jobs complete, you can use SQL to call this UDF or external function to tell Immuta to execute schema monitoring. The reason for wrapping in a UDF or external function is because dbt and transform jobs always compile to SQL, and the best way to make this happen immediately after the table is created (after the transform job completes) is to execute more SQL in the same job.

Consult your Immuta professional for a custom UDF compatible with Snowflake or Databricks Unity Catalog.

Periodic schema monitoring

The default schedule for Immuta to run schema monitoring is every night at 12:30 a.m. UTC. However, this schedule can be updated by changing some advanced configuration. The processing time for schema monitoring is dependent on the number of tables and columns changed in your data environment. If you want to change the schedule to run more frequently than daily, Immuta recommends you test the runtime (with a representative set of DDL changes) before making the configuration change.

Consult your Immuta professional to update the schema monitoring schedule, if desired.

There are some use cases where you want all users to have access to all tables, but want to mask sensitive data within those tables. While you could do this using just data policies, Immuta recommends you still utilize subscription policies to ensure users are granted access.

Subscription policies allow for Immuta to have a state to move table access into post-data-downtime to realize policy uptime. Without subscription policies, when Immuta synchronizes policy, users will continue to not have access to tables because there is no subscription policy granting them access. If you want all users to have access to all tables, use a global "Anyone" subscription policy in Immuta for all your tables. This will ensure users are granted access back to the tables after data downtime.

Last updated

Copyright © 2014-2024 Immuta Inc. All rights reserved.