
Writing to Projects

With project equalization enabled, project users can create Snowflake or Databricks Spark project workspaces where users can view and write data.

How-to guides

Create and manage project workspaces: Create a project workspace to allow Snowflake or Databricks users subscribed to the project to write data to the project.
Writing to projects: Write data to a project when working in the context of a Snowflake or Databricks project workspace.

Reference guides

Immuta project workspaces: This reference guide describes the components and design of project workspaces for Snowflake and Databricks and defines derived data sources.
Project UDFs: This reference guide lists the available functions for switching your project context in Databricks.

How-to Guides

Create and Manage Project Workspaces

After workspaces are configured, project owners can enable workspaces within their projects. This feature allows project members to write data to the project and share this data with other users as derived data sources.

Requirement

You must own the project.

Prerequisites

Snowflake

Deprecation notice

Support for project workspaces for Snowflake has been deprecated. See the Deprecations page for EOL dates.

Snowflake integration is configured with workspaces enabled.
Snowflake tables are registered in Immuta.
External IDs have been connected through an IAM or manually mapped for Snowflake.
Data sources registered by excepted roles: Snowflake workspaces generate static views with the credentials used to register the table as an Immuta data source. Those tables must be registered in Immuta by an excepted role so that policies applied to the backing tables are not applied to the project workspace views.

Databricks Spark

Create a workspace

Navigate to the Policies tab and enable project equalization by clicking the Project Equalization slider to on.
Scroll to the Workspace section and click Create.
Select Snowflake from the Workspace Configuration dropdown menu.
Name the Workspace Schema. By default, the schema name is based on the project name, but you can change it here. Your project workspace will exist within this schema in Snowflake, under the database configured by the application admin.
Use the dropdown menu to select the Hostname. Projects can only be configured to use one Snowflake host.
Select one or more Warehouses to be available to project members when they are working in the Snowflake workspace.
Click Create to enable the workspace.

Databricks cluster configuration

Before creating a workspace, the cluster must send its configuration to Immuta; to do this, run a simple query on the cluster (e.g., show tables). Otherwise, an error will occur when you attempt to create a workspace.
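
The priming query described above can be run from a notebook attached to the cluster. This is a minimal sketch, assuming a Databricks notebook where the spark session object is already provided; the helper name send_cluster_config is hypothetical and is not part of Immuta or Databricks.

```python
def send_cluster_config(spark):
    """Run a trivial query so the cluster reports its configuration to Immuta.

    `spark` is the SparkSession that a Databricks notebook provides.
    The returned rows are not important; running the query is the point.
    """
    # Any simple query works; `show tables` is the one the guide suggests.
    return spark.sql("show tables").collect()
```

In a notebook, calling send_cluster_config(spark) once before opening the Workspace section should be enough.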

Navigate to the Policies tab and enable Project Equalization by clicking the Project Equalization slider to on.
Scroll to the Workspace section and click Create.
Select Databricks from the Workspace Configuration dropdown menu.
Opt to edit the sub-directory in the Workspace Directory field; this sub-directory auto-populates as the project name.
Enter the Workspace Database Name.
Click Create to enable the workspace.

Delete a workspace

Scroll to the Workspace section on the Policies tab and click the toggle to disable the workspace.
Click Delete in the Workspace section.
Choose one of the following options in the modal:
- Purge Generic Workspace Data: Permanently delete data, while the data used by derived data sources is preserved. Note: If you created a derived data source that references a view on top of a table in Snowflake that isn't a derived data source, that table will be deleted and break the derived data source.
- Purge Everything & Delete Derived Data Sources: Permanently delete data and purge all derived data sources.
Click Delete.

Write Data to the Workspace

Once the workspace is created, project members will see relevant data sources when working under the project context.

Switch your project context.
Write data to the project workspace in Snowflake or Databricks:
- Snowflake: Select the role created by the project workspace. The role created will be a combination of the database name (configured by the application admin) and the schema name. Then, write data to this location.
- Databricks: Write data to the directory and database created in Databricks for the project workspace.

Now that data has been written to the workspace, users can share this data with others by making it a derived data source in Immuta.

Create a derived data source

Deprecation notice: Support for this feature has been deprecated.

Select a project.
Select the data source from which the new data was created.
Select Table for the virtual population option.
Click Edit and select the tables you created, and then click Apply.
Opt to edit the Basic Information fields, and then click Create.

Reference Guides

Writing to Projects

Project workspaces

With equalization enabled, project users can create project workspaces for Snowflake or Databricks where users can view and write data. Then, those users can create derived data sources to share this data with other users.

Snowflake project workspaces

Deprecation notice

Support for this feature has been deprecated. See the Deprecations page for EOL dates.

Snowflake project workspaces allow users to access and write data directly in Snowflake.

With Snowflake project workspaces, Immuta enforces policy logic on registered tables and represents them as secure views in Snowflake. Since secure views are static, creating a secure view for every unique user in your organization for every table in your organization would result in secure view bloat; however, Immuta addresses this problem by virtually grouping users and tables and equalizing users to the same level of access, ensuring that all members of the project see the same view of the data. Consequently, all members share one secure view.

While interacting directly with Snowflake secure views in these workspaces, users can write within Snowflake and create derived data sources, all the while collaborating with other project members at a common access level. Because these derived data sources will inherit all of the appropriate policies, that data can then be shared outside the project. Additionally, derived data sources use the credentials of the Immuta system Snowflake account, which will allow them to persist after a workspace is disconnected.

Policy enforcement

Immuta enforces policy logic on data and represents it as secure views in Snowflake. Because projects group users and tables and equalize members to the same level of access, all members will see the same view of the data and, consequently, will only need one secure view. Changes to policies immediately propagate to relevant secure views.

Snowflake project workspace workflow

An Immuta user with the CREATE_PROJECT permission creates a new project with Snowflake data sources.
The Immuta project owner enables project equalization, which gives every project member the same level of access to the data.
The Immuta project owner creates a Snowflake project workspace, which automatically generates a subfolder in the root path specified by the application admin and a remote database associated with the project.
Project members can access data sources within the project and use WRITE to create derived tables. To ensure equalization, users will only see data sources within their project as long as they are working in the Snowflake Context.
The CREATE_DATA_SOURCE_IN_PROJECT permission is given to specific users so they can expose their derived tables in the Immuta project; the derived tables will inherit the policies, and then the data can be shared outside the project.
If a project member leaves a project or a project is deleted, that Snowflake Context will be removed from the user's Snowflake account.

Root directory details

Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.

Mapping projects to secure views

Immuta projects are represented as Session Contexts within Snowflake. Because they are linked to Snowflake, projects automatically create corresponding:

roles in Snowflake: IMMUTA_[project name]
schemas in the Snowflake IMMUTA database: [project name]
secure views in the project schema for any table in the project
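
The naming pattern in the list above can be sketched as a small helper. This is an illustration only: it assumes the database is named IMMUTA, as in the list, and that the project name is used unchanged. How Immuta normalizes spaces or letter case in project names is not stated here, so the function name workspace_object_names and its output format are hypothetical.

```python
def workspace_object_names(project_name: str, database: str = "IMMUTA") -> dict:
    """Return the Snowflake names implied by the pattern for one project.

    Assumption: the project name is inserted verbatim (no normalization).
    """
    return {
        # Role pattern from the list: IMMUTA_[project name]
        "role": f"{database}_{project_name}",
        # Schema pattern: [project name] inside the IMMUTA database
        "schema": f"{database}.{project_name}",
    }
```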

To switch projects, users have to change their Snowflake Session Context to the appropriate Immuta project. If users are not entitled to a data source contained by the project, they will not be able to access the Context in Snowflake until they have access to all tables in the project. If changes are made to a user's attributes and access level, the changes will immediately propagate to the Snowflake Context.

Because users access data only through secure views in Snowflake, administrators have significantly less role management to do in Snowflake. Organizations should also consider having one user in Snowflake who is able to create databases and make GRANTs on those databases, and separate users who are able to read from and write to the tables in those databases.

Benefits

Few roles to manage in Snowflake; that complexity is pushed to Immuta, which is designed to simplify it.
A small set of users has direct access to raw tables; most users go through secure views only, but raw database access can be segmented across departments.
Policies are built by the individual database administrators within Immuta and are managed in a single location, and changes to policies are automatically propagated across thousands of tables’ secure views.
Self-service access to data based on data policies.
Users work in various contexts in Snowflake natively, based on their collaborators and their purpose, without fear of leaking data.
All policies are enforced natively in Snowflake without performance impact.
- Security is maintained through Snowflake primitives (roles and secure views).
- Performance and scalability are maintained (no proxy).
Policies can be driven by metadata, allowing massive-scale policy enforcement with only a small set of actual policies.
Derived tables can be shared back out through Immuta, improving collaboration.
User access and removal are immediately reflected in secure views.

Databricks Spark project workspaces

Using Immuta projects and project equalization, Databricks Spark project workspaces are a space where every project member has the same level of access to data. This equalized access allows collaboration without worries about data leaks. Not only can project members collaborate on data, but they can also write protected data to the project.

Users will only be able to access the directory and database created for the workspace when acting under the project. The Immuta Spark SQL session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data to the project, they should use the Spark SQL session to copy data into the workspace.

Databricks project workspace workflow

An Immuta user with the CREATE_PROJECT permission creates a new project with Databricks data sources.
The Immuta project owner enables project equalization, which gives every project member the same level of access to the data.
The Immuta project owner creates a Databricks project workspace, which automatically generates a subfolder in the root path specified by the application admin and a remote database associated with the project.
The Immuta project members query equalized data within the context of the project, collaborate, and write data, all within Databricks.
The Immuta project members use their newly written derived data and register the derived tables in Immuta as derived data sources. These derived data sources inherit the necessary Immuta policies to be securely shared outside of the project.

Root directory details

Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Administrators can place a configuration value in the cluster configuration (core-site.xml) to mark that cluster as unavailable for use as a workspace.

Read and write data

When acting in the workspace project, users can read data using calls like spark.read.parquet("immuta:///some/path/to/a/workspace").
To write Delta Lake data to a workspace and then expose that Delta table as a data source in Immuta, you must specify a table in the workspace (rather than a directory) when creating the derived data source.
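
The read and write calls described above can be sketched with the standard PySpark reader and writer APIs. This is a hedged example, not Immuta's prescribed method: the helper names and the database and table names in the usage note are hypothetical, and the code assumes it runs in a Databricks session that is already acting under the project.

```python
def read_workspace_parquet(spark, path):
    """Read Parquet data from a workspace path such as immuta:///some/path/to/a/workspace."""
    return spark.read.parquet(path)


def write_workspace_table(df, database, table, provider="delta"):
    """Write a DataFrame to a table in the workspace database.

    Writing to a table (not just a directory) matters for Delta data,
    because the derived data source must point at a table.
    """
    full_name = f"{database}.{table}"
    df.write.format(provider).saveAsTable(full_name)
    return full_name
```

For example, write_workspace_table(df, "my_workspace_db", "claims_summary") would save df as a Delta table, where my_workspace_db stands in for the Workspace Database Name entered when the workspace was created.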

Supported cloud providers

Microsoft Azure

Immuta currently supports the abfss scheme for Azure General Purpose V2 Storage Accounts. This includes support for Azure Data Lake Gen 2. When configuring Immuta workspaces for Databricks on Azure, the Azure Databricks workspace ID must be provided. More information about how to determine the workspace ID for your workspace can be found in the Databricks documentation. Additionally, any cluster that uses Immuta workspaces must include the additional configuration file with credentials for the Azure Storage container that contains Immuta workspaces.

Google Cloud Platform

Immuta currently supports the gs scheme for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine. Databricks automatically provisions and autoscales drivers and executors to pods on Google Kubernetes Engine, so Google Cloud Platform admin users can view and monitor the Google Kubernetes resources in the Google Cloud Platform.

Caveats and limitations

Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable, and the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED property cannot be set to true to expose the DBFS FUSE mount.
Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported.
Stage third-party libraries in DBFS: Installing libraries from Google Storage is not supported.
Install third-party libraries as cluster-scoped: Notebook-scoped libraries have limited support. See the Databricks trusted libraries section for more details.
Maven library installation is only supported in Databricks Runtime 8.1+.
/databricks/spark/conf/spark-env.sh is mounted as read-only:
- Set sensitive Immuta configuration values directly in immuta_conf.xml: Do not use environment variables to set sensitive Immuta properties. Immuta is unable to edit the spark-env.sh file because it is read-only; therefore, remove environment variables and keep them from being visible to end users.
- Use /immuta-scratch directly: The IMMUTA_LOCAL_SCRATCH_DIR property is unavailable.
Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.
The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.

Supported metastore providers for Databricks

To write data to a table in Databricks through an Immuta workspace, use one of the following supported provider types for your table format:

avro
csv
delta
orc
parquet
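
A write job can check the provider before it runs. The sketch below simply encodes the list above; the function name check_provider is hypothetical, and treating provider names as case-insensitive is an assumption of this example rather than documented Immuta behavior.

```python
SUPPORTED_PROVIDERS = frozenset({"avro", "csv", "delta", "orc", "parquet"})


def check_provider(provider: str) -> str:
    """Return the normalized provider name, or raise if it is not in the supported list."""
    normalized = provider.strip().lower()
    if normalized not in SUPPORTED_PROVIDERS:
        supported = ", ".join(sorted(SUPPORTED_PROVIDERS))
        raise ValueError(
            f"Unsupported provider {provider!r}; expected one of: {supported}"
        )
    return normalized
```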

Derived data sources

Deprecation notice: Support for this feature has been deprecated.

A derived data source is a data source that is created within an equalized project and contains data from its parent sources. Consequently, when the derived data source is created, it will inherit the data policies from its parent data sources to keep the data secure.

Policy inheritance for derived data sources is a feature unique to the environment that an equalized project creates. Within the equalized project, every user sees the same data and work can be shared and collaborated on without any risk of a user viewing more than they should. When a derived data source is created, it inherits the data policies from its parent sources and a subscription policy is created from the equalized entitlements on the project, allowing project members to safely share secure data.

Example

Consider these data sources, within an equalized Project 1, that each contain subscription and data policies:

Data source A
- Subscription policy: Allow users to subscribe to the data source when user is a member of group Medical Claims
- Data policies:
  - Mask by making null the value in the column(s) city except for members of group Legal
  - Mask by making null the value in the column(s) gender for everyone
Data source B
- Subscription policy: Allow users to subscribe to the data source when user is approved by anyone with permission owner and anyone with permission governance
- Data policy: Limit usage to purpose(s) Research for everyone

If a user creates a derived data source, Data Source C, from these two data sources, Data Source C will inherit these policies, which will be unchangeable:

Data source C
- Subscription policy: Allow user to subscribe when they satisfy all of the following:
  - is a member of group Legal and is a member of group Medical Claims
  - is approved by anyone with permission owner (of data source B) and anyone with permission governance
- Data policy: Limit usage to purpose(s) Research for everyone

Derived data sources inherit policies from parent sources

Identification can apply tags to derived data sources; however, because they inherit policies from their parent sources, the global policies that contain these tags will not apply to derived data sources.

Behavior

Notice that one of the data policies in Data Source A, mask by making null the value in the column(s) gender for everyone, is not included in data source C. This is because the creator could not have seen the values in the parent sources; therefore, there are no values in the derived data source to be masked.
Most local data policies will not need to be present in the derived data source, with the exception of limit usage to purpose(s) policies. No global policies will be added to a derived data source.
Data source C's policies depend on which groups are in the project; as the groups change, so do the policies.
For example, if there were a data user in the project who was not in the Legal group, then that trait would not be needed in the subscription policy because, with equalization, those values would not be visible to the project members in the parent data source.
The subscription and data policies in the derived data source will always be the minimum required permissions and traits because of project equalization.
Derived data source policies will not adapt with the parent data sources. Any changes in the parent data source policies will be logged in the Relationships tab of the derived data source page, but will not be changed in the derived data source policies.
The data owner may choose to add new local data policies to the derived data source to keep up with any changes, but the inherited policies are not adjustable.
Any changes within the parent data source's data will not trickle down into the derived data source. After the creation of the derived data source, they stay connected for auditing and relationships, not for updating content.
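
The inheritance behavior above can be illustrated with a toy model of the worked example. This is not Immuta's implementation. The class, field, and function names are hypothetical, policies are reduced to plain strings, and the rules encoded are only the ones stated above: subscription conditions are combined, a masking exception group becomes a subscription condition only when every project member belongs to it, masks that apply to everyone contribute nothing, and purpose restrictions carry over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentSource:
    name: str
    subscription_conditions: tuple   # conditions as plain strings
    mask_exception_groups: tuple = ()  # groups exempt from a masking policy
    purposes: tuple = ()             # from "limit usage to purpose(s)" policies


def derive_policies(parents, groups_shared_by_all_members):
    """Return the policies a derived data source would inherit in this toy model."""
    subscription, purposes = [], []
    for parent in parents:
        for condition in parent.subscription_conditions:
            if condition not in subscription:
                subscription.append(condition)
        for group in parent.mask_exception_groups:
            condition = f"is a member of group {group}"
            # Only required if equalization let members see the unmasked values.
            if group in groups_shared_by_all_members and condition not in subscription:
                subscription.append(condition)
        for purpose in parent.purposes:
            if purpose not in purposes:
                purposes.append(purpose)
    return {"subscription": subscription, "purposes": purposes}


source_a = ParentSource(
    name="A",
    subscription_conditions=("is a member of group Medical Claims",),
    mask_exception_groups=("Legal",),  # city mask; the gender mask has no exceptions
)
source_b = ParentSource(
    name="B",
    subscription_conditions=("is approved by owner (of data source B) and governance",),
    purposes=("Research",),
)
source_c = derive_policies([source_a, source_b], {"Legal", "Medical Claims"})
```

If a project member were not in the Legal group, Legal would be left out of groups_shared_by_all_members and the Legal condition would not appear in the result, which mirrors the equalization example above.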

Using data outside the project

If members use data from outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.

For detailed instructions on creating a derived data source, navigate to Create a derived data source.

Project UDFs (Databricks)

You can switch project contexts and view a list of your current project or available projects through UDFs in Spark.

Available functions

UDF

Description

Virtual tables

To view a list of your current project or available projects in a Spark job, you can query these virtual tables.

Virtual Table

Query

Return
