You are viewing documentation for Immuta version 2023.1.


dbt and Transform Workflow for Limited Policy Downtime

When executing transforms in your data platform, new tables and views are constantly being created, columns added, and data changed: in short, transform DDL. This constant transformation can cause latency between the DDL and Immuta policies discovering, adapting, and attaching to those changes, which can result in data leaks. This policy latency is referred to as policy downtime.

The goal is to have as little policy downtime as possible. However, because Immuta is separate from data platforms, and those platforms do not currently offer webhooks or an eventing service, Immuta does not receive alerts of DDL events. This causes policy downtime.

This page describes the appropriate steps to minimize policy downtime as you execute transforms using dbt or any other transform tool and links to tutorials that will help you complete these steps. This page is specific to Snowflake integrations, but the best practices outlined below can be used with other integrations.

Prerequisites

Required:

  • Native schema monitoring enabled (private preview): This feature improves performance of legacy schema monitoring and enhances it by detecting destructively recreated tables (from CREATE OR REPLACE statements) even if the table schema wasn’t changed.

See the configuration page for instructions to enable.

Recommended:

  • Snowflake table grants enabled: This feature implements Immuta subscription policies as table GRANTS in Snowflake rather than Snowflake row access policies. Note this feature may not be automatically enabled if you were an Immuta customer before January 2023; see Enable Snowflake table grants to enable.
  • Low row access mode enabled (public preview): This feature removes unnecessary Snowflake row access policies when Immuta Project workspaces or impersonation are disabled, which improves the query performance for data consumers.
  • Native sensitive data discovery (SDD) enabled (private preview): This feature processes all sensitive data discovery natively in-database rather than flowing a sample of data back to Immuta for processing. It is only necessary if you are using auto-tagging.

Consult your Immuta professional for instructions to enable private preview features.

Step 1: Create global policies and prepare tags for data sources

To benefit from the scalability and manageability provided by Immuta, you should author all Immuta policies as global policies. Global policies are built at the semantic layer using tags, rather than referencing individual tables in each policy. When using global policies, as soon as a new tag is discovered by Immuta, any applicable policy is automatically applied. This is the most efficient approach for reducing policy downtime.

There are three different approaches for tagging in Immuta:

  1. Auto-tagging (recommended): This approach uses SDD to automatically tag data.
  2. Manually tagging with an external catalog: This approach pulls in the tags from an external catalog. Immuta supports Snowflake, Alation, and Collibra to pull in external tags.
  3. Manually tagging in Immuta: This approach requires a user to create and manually apply tags to all data sources using the Immuta API or UI.

Note that there is added complexity when manually tagging new columns with Alation, Collibra, or Immuta. These catalogs can only tag columns that are already registered in Immuta. If you have a new column in Snowflake, you must wait until schema detection runs and detects that new column; only then can the column be manually tagged. This issue does not occur when manually tagging with Snowflake (because Snowflake is already aware of the column) or when using SDD (because it runs after schema monitoring).

Auto-tagging

Using this approach, Immuta automatically tags your data after it is registered by schema monitoring, using sensitive data discovery (SDD). SDD is made up of algorithms you can customize to discover and tag the data most important to you and your organization's policies. Once customized and deployed, any time Immuta discovers a new table or column through schema monitoring, SDD runs and automatically tags the new columns without any manual intervention. This is the recommended option because, once SDD is customized for your organization, it eliminates the human error associated with manual tagging and is more proactive than manual tagging, further reducing policy downtime.

SDD should be enabled and customized before registering any data with Immuta.

Manually tagging with an external catalog

Using this approach, you will rely on humans to tag. Those tags are stored in the data platform (Snowflake) or catalog (Alation, Collibra) and then synchronized back to Immuta. If using this option, Immuta recommends storing the tags in Snowflake, because the write of the tags to Snowflake can be managed and reused with SQL from tools like dbt, removing the burden of manual tagging on every run. API calls to Alation and Collibra are also possible, but they are not accessible over SQL the way dbt-to-Snowflake is. Manually tagging through the Alation or Collibra UI will negatively impact data downtime.
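For example, when storing tags in Snowflake, tagging can be expressed as SQL and versioned alongside your transform code. A minimal sketch (the database, tag, and column names below are hypothetical):

```sql
-- Create the tag once, e.g., in a setup script.
CREATE TAG IF NOT EXISTS governance.tags.pii;

-- Apply the tag to a column as part of the transform run, e.g., from a
-- dbt post-hook, so it is re-applied every time the model is rebuilt.
ALTER TABLE analytics.public.customers
  MODIFY COLUMN email
  SET TAG governance.tags.pii = 'email';
```

Because dbt models compile to SQL, attaching statements like the ALTER above as post-hooks ensures the tags survive each CREATE OR REPLACE rebuild without a separate manual step.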

Your catalog configuration should be enabled before registering any data with Immuta.

Preventing automatic table statistics when registering tables with the Immuta API

If using this approach, you must use a tag from your external catalog to prevent automatic table statistics. This means the tag that prevents statistics must be configured on the App Settings page and must be identified in the API payload when registering the data source. If you instead use an Immuta tag, your external catalog tags will be overridden and not applied to your data source. This behavior ensures a single source of truth for tags. If for any reason you must mix Immuta tags (with the exception of SDD tags) with external catalog tags, then you must use the Immuta UI to register data sources.

Manually tagging in Immuta

Using this approach, you will rely on humans to tag, but the tags will be stored directly in Immuta. This can be done using the Immuta API or through the Immuta UI. However, manually tagging through the Immuta UI will negatively impact data downtime.

Step 2: Register your data in Immuta

When registering tables with Immuta, you must register each database or catalog with schema monitoring enabled. Schema monitoring means that you do not need to register tables individually; instead, you make Immuta aware of databases, and Immuta periodically scans those databases for changes and registers any new changes for you. You can also manually run schema monitoring using the Immuta API.

If you are not going to use any of the advanced masking techniques provided by Immuta (format preserving masking, K-Anonymization, or randomized response) you should configure Immuta to prevent automatic table statistics, which will improve data registration performance.

Step 3: Consider the result and user making transformations

Views vs tables

Access to and registration of views created from Immuta-protected tables only need to be taken into consideration if you are using both data and subscription policies.

By nature of how views work (the query is passed down to the backing tables), views inherit the data policies (row-level security, masking) already enforced on their backing tables. So when you tag and register a view with Immuta, you are re-applying the same data policies on the view that already exist on the backing tables, assuming the tags that drive the data policies are the same on the view's columns.

If you do not want this behavior or its possible negative performance consequences, then Immuta recommends the following based on how you are tagging data:

  • For auto-tagging, place your incremental views in a separate database that is not being monitored by Immuta. Do not register them with Immuta, and schema monitoring will not detect them from the separate database.
  • For either manually tagging option, do not tag view columns.

Using either option, the views will only be accessible to the person who created them. The views will not have any subscription policies applied to give other users access because the views are either not registered in Immuta or there are no tags. To give other users access to the data in the view, they should subscribe to the table at the end of the transform pipeline.

However, if you do want to share the views using subscription policies, you should ensure that the tags that drive the subscription policies exist on the view and that those tags are not shared with tags that drive your data policies. It is possible to target subscription policies on all tables or tables from a specific database rather than using tags.
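As an illustration of the first option above, incremental views can live in a database that Immuta does not monitor. A sketch, assuming a hypothetical unmonitored database named analytics_dev:

```sql
-- analytics_dev is not registered in Immuta and is not monitored, so
-- schema monitoring never detects this view. Policies on the backing
-- tables in analytics.public still apply when the view is queried.
CREATE OR REPLACE VIEW analytics_dev.public.orders_enriched AS
SELECT o.*, c.segment
FROM analytics.public.orders o
JOIN analytics.public.customers c
  ON o.customer_id = c.customer_id;
```

Because the view is unregistered, it remains accessible only to its creator; other users should subscribe to the registered table at the end of the transform pipeline.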

Access level of job executioner

Policy is enforced on READ. Therefore, if you run a transform that creates a new table, the data in that new table will represent the policy-enforced data.

For example, if the credit_card_number column is masked for Steve, on read, the real credit card numbers will be dynamically masked. If Steve then copies them into a new table via the transform, he is physically loading masked credit card numbers into that table. Now if another user, Jane, is allowed to see credit card numbers and queries the table, her query will not show the credit card numbers. This is because credit card numbers are already masked in that table. This problem only exists for tables, not views, since tables have the data physically copied into them.
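The scenario above can be sketched in SQL (table and column names are hypothetical):

```sql
-- Steve reads through Immuta's policy enforcement: credit_card_number
-- comes back masked. The CTAS physically copies those masked values
-- into the new table.
CREATE TABLE analytics.public.cc_copy AS
SELECT customer_id, credit_card_number
FROM analytics.public.payments;

-- Jane is allowed to see real credit card numbers, but this query can
-- only return the masked values Steve loaded into cc_copy.
SELECT credit_card_number FROM analytics.public.cc_copy;
```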

To address this situation, you can do one of the following:

  • Use views for all transforms.
  • Ensure the users who are executing the transforms always have a higher level of access than the users who will consume the results of the transforms. Or, if this is not possible,
  • Set up a dev environment for creating the transformation code; then, when ready for production, have a promotion process to execute those production transformations using a system account free of all policies. Once the jobs execute as that system account, Immuta will discover, tag, and apply the appropriate policy.

Step 4: Force data downtime

Data downtime refers to techniques you can use to hide data after transformations until Immuta policies have had a chance to synchronize. It makes data temporarily inaccessible; however, this is preferable to the data leaks that could occur while waiting for policies to sync.

Whenever DDL occurs, it can result in policy downtime, such as in the following examples:

  • An existing table or view is recreated in Snowflake with the CREATE OR REPLACE statement. This will drop all policy.
  • A new column is added to a table that needs to be masked from users that have access to that table.
  • A new table is created in a space where other users have read access.
  • A tag that drives a policy is updated, deleted, or added in Snowflake with no other changes to the schema or table.

Best practices

Immuta recommends all of the following best practices to ensure data downtime occurs during policy downtime:

  • Do not COPY GRANTS when executing a CREATE OR REPLACE statement.
  • Do not use GRANT SELECT ON FUTURE TABLES.
  • Use CREATE OR REPLACE for all DDL, including altering tags, so that access is always revoked.
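For example (hypothetical table names), prefer the first statement below; the second carries the old Snowflake grants onto the replaced table before Immuta has evaluated it:

```sql
-- Recommended: access is revoked on recreate and restored by Immuta
-- once schema monitoring and policy synchronization complete.
CREATE OR REPLACE TABLE analytics.public.orders AS
SELECT * FROM staging.public.orders_raw;

-- Avoid: COPY GRANTS re-applies the old grants to the new table, which
-- can expose a newly added sensitive column before Immuta syncs.
CREATE OR REPLACE TABLE analytics.public.orders COPY GRANTS AS
SELECT * FROM staging.public.orders_raw;

-- Also avoid blanket future grants, which bypass Immuta entirely:
-- GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.public TO ROLE analyst;
```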

Without these best practices, you are making unintentional policy decisions in Snowflake that may conflict with your organization's actual policies enforced by Immuta.

For example, if a CREATE OR REPLACE statement adds a new column that contains sensitive data and the user copies grants (COPY GRANTS), that column is opened to existing users, causing a data leak. Instead, access must be blocked using the data downtime techniques above until Immuta has synchronized.

Step 5: Initiate policy uptime

As discussed above, data platforms do not currently have webhooks or an eventing service, so Immuta does not receive alerts of DDL events. Schema monitoring runs every 24 hours by default to detect changes, but it should also be run across your databases whenever you make changes to them. You can manually run schema monitoring using the Immuta API, and the payload can be scoped down to run schema monitoring on a specific database or schema, or column detection on a specific table.

When schema monitoring is run globally, it will detect the following:

  • Any new table
  • Any new view
  • Any existing table destructively recreated through CREATE OR REPLACE (even if there are no schema changes)
  • Any existing view destructively recreated through CREATE OR REPLACE (even if there are no schema changes)
  • Any dropped table
  • Any new column
  • Any dropped column
  • Any column type change (which can impact policy application)
  • Any tag created, updated, or deleted (but only if the schema changed; otherwise tag changes alone are detected with Immuta’s health check)

Then, if any of the above is detected, for those tables or views, Immuta will complete the following:

  1. Synchronize the existing policy back to the table or view to reduce data downtime
  2. If SDD is enabled, execute SDD on any new columns or tables
  3. If an external catalog is configured, execute a tag synchronization
  4. Synchronize the final updated policy based on the SDD results and tag synchronization
  5. Apply "New" tags to all tables and columns not previously registered in Immuta and lock them down with the "New Column Added" templated global policy

The two options for running schema monitoring are described in the sections below. You can implement them together or separately.

Alert Immuta through the API or a custom function

If the data platform supports custom UDFs and external functions, you can wrap the /dataSource/detectRemoteChanges endpoint with one. Then, as your transform jobs complete, you can use SQL to call this UDF or external function to tell Immuta to execute schema monitoring. The reason for wrapping the endpoint in a UDF or external function is that dbt and transform jobs always compile to SQL, and the best way to trigger schema monitoring immediately after the table is created (after the transform job completes) is to execute more SQL in the same job.
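For example, if your Immuta professional provides such a function (the name immuta_detect_remote_changes below is hypothetical), a dbt model can invoke it in a post-hook so schema monitoring runs as soon as the transform finishes:

```sql
-- In a dbt model file. immuta_detect_remote_changes() stands in for a
-- custom UDF or external function wrapping /dataSource/detectRemoteChanges.
{{ config(
    post_hook = "SELECT immuta_detect_remote_changes()"
) }}

SELECT *
FROM {{ ref('orders_raw') }}
```

Because the post-hook is just more SQL in the same job, Immuta is notified immediately after the table is created rather than waiting for the next scheduled schema monitoring run.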

Consult your Immuta professional for a custom UDF compatible with Snowflake.

Periodic schema monitoring

The default schedule for Immuta to run schema monitoring is every night at 12:30 a.m. However, this schedule can be updated through advanced configuration. The processing time for schema monitoring is dependent on the number of tables and columns changed in your data environment. If you want to change the schedule to run more frequently than daily, Immuta recommends you test the runtime (with a representative set of DDL changes) before making the configuration change.

There are some use cases where you want all users to have access to all tables, but want to mask sensitive data within those tables. While you could do this using just data policies, Immuta recommends you still utilize subscription policies to ensure users are granted access in Snowflake.

Subscription policies give Immuta a state to move table access into after data downtime, realizing policy uptime. Without subscription policies, when Immuta synchronizes policy, users will continue to lack access to tables because no subscription policy grants them access. If you want all users to have access to all tables, use a global "Anyone" subscription policy in Immuta for all your tables. This will ensure users are granted access back to the tables after data downtime.