After workspaces are configured, project owners can enable workspaces within their projects. This feature allows project members to write data to projects and share this data with other users as derived data sources.
Requirement: You must own the project
Prerequisites:
Databricks cluster configuration
Before creating a workspace, the cluster must send its configuration to Immuta. To do this, run a simple query on the cluster (for example, show tables). Otherwise, an error will occur when you attempt to create a workspace.
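In a Databricks notebook, the prerequisite query can be as small as the following sketch. It assumes `spark` is the SparkSession that Databricks notebooks provide; the helper name `send_cluster_configuration` is hypothetical and only wraps the query suggested above.

```python
def send_cluster_configuration(spark):
    """Run a trivial query on the cluster before creating a workspace.

    `spark` is assumed to be the SparkSession predefined in a Databricks
    notebook. Any simple query should do; `show tables` is the one the
    prerequisite mentions.
    """
    return spark.sql("show tables").collect()
```

Calling `send_cluster_configuration(spark)` once in a notebook cell attached to the cluster should be enough before you return to Immuta to create the workspace.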
Navigate to the Policies tab and enable Project Equalization by clicking the Project Equalization slider to on.
Scroll to the Native Workspace section and click Create.
Select Databricks from the Workspace Configuration dropdown menu.
Opt to edit the sub-directory in the Workspace Directory field; this sub-directory auto-populates as the project name.
Enter the Workspace Database Name.
Click Create to enable the workspace.
Scroll to the Native Workspace section on the Policies tab and click the toggle to disable the workspace.
Click Delete in the Native Workspace section.
Choose one of the following options in the modal:
Purge Generic Workspace Data: Permanently delete data, while the data used by derived data sources is preserved. Note: If you created a derived data source that references a view on top of a table in Snowflake that isn't a derived data source, that table will be deleted and break the derived data source.
Purge Everything & Delete Derived Data Sources: Permanently delete data and purge all derived data sources.
Click Delete.
With project equalization enabled, project users can create Snowflake or Databricks Spark project workspaces where users can view and write data.
Create a project workspace to allow Snowflake users subscribed to the project to write data to the project.
Create a project workspace to allow Databricks users subscribed to the project to write data to the project.
Write data to a project when working in the context of a Snowflake or Databricks project workspace.
Create a derived data source to share the data you've written with other Immuta users.
This reference guide describes the components and design of project workspaces for Snowflake and Databricks and defines derived data sources.
This reference guide lists the available functions for switching your project context in Databricks.
Project workspaces
With project equalization enabled, project users can create Snowflake or Databricks Spark project workspaces where users can view and write data. Then, those users can create derived data sources to share this data with other users.
Combining Immuta projects and Snowflake workspaces allows users to access and write data directly in Snowflake.
With Snowflake workspaces, Immuta enforces policy logic on registered tables and represents them as secure views in Snowflake. Since secure views are static, creating a secure view for every unique user in your organization for every table in your organization would result in secure view bloat; however, Immuta addresses this problem by virtually grouping users and tables and equalizing users to the same level of access, ensuring that all members of the project see the same view of the data. Consequently, all members share one secure view.
While interacting directly with Snowflake secure views in these workspaces, users can write within Snowflake and create derived data sources, all the while collaborating with other project members at a common access level. Because these derived data sources inherit all of the appropriate policies, that data can then be shared outside the project. Additionally, derived data sources use the credentials of the Immuta system Snowflake account, which allows them to persist after a workspace is disconnected.
Snowflake workspaces can be used on their own or with the Snowflake integration.
Immuta enforces policy logic on data and represents it as secure views in Snowflake. Because projects group users and tables and equalize users to the same level of access, all members see the same view of the data and, consequently, need only one secure view. Changes to policies immediately propagate to relevant secure views.
An Immuta user with the CREATE_PROJECT permission creates a new project with Snowflake data sources.
The Immuta project owner enables project equalization, which balances every project member's access to the data to be the same.
The Immuta project owner creates a Snowflake project workspace, which automatically generates a subfolder in the root path specified by the application admin and a remote database associated with the project.
Project members can access data sources within the project and use WRITE to create derived tables. To ensure equalization, users will only see data sources within their project as long as they are working in the Snowflake Context.
The CREATE_DATA_SOURCE_IN_PROJECT permission is given to specific users so they can create derived data sources; the derived tables will inherit the policies, and then the data can be shared outside the project.
If a project member leaves a project or a project is deleted, that Snowflake Context will be removed from the user's Snowflake account.
Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
roles in Snowflake: IMMUTA_[project name]
schemas in the Snowflake IMMUTA database: [project name]
secure views in the project schema for any table in the project
To switch projects, users have to change their Snowflake Session Context to the appropriate Immuta project. If users are not entitled to a data source contained by the project, they will not be able to access the Context in Snowflake until they have access to all tables in the project. If changes are made to a user's attributes and access level, the changes will immediately propagate to the Snowflake Context.
Because users access data only through secure views in Snowflake, it significantly decreases the amount of role management for administrators in Snowflake. Organizations should also consider having a user in Snowflake who is able to create databases and make GRANTs on those databases and having separate users who are able to read and write from those tables.
Few roles to manage in Snowflake; that complexity is pushed to Immuta, which is designed to simplify it.
A small set of users has direct access to raw tables; most users go through secure views only, but raw database access can be segmented across departments.
Policies are built by the individual database administrators within Immuta and are managed in a single location, and changes to policies are automatically propagated across thousands of tables’ secure views.
Self-service access to data based on data policies.
Users work in various contexts in Snowflake natively, based on their collaborators and their purpose, without fear of leaking data.
All policies are enforced natively in Snowflake without performance impact.
Security is maintained through Snowflake primitives (roles and secure views).
Performance and scalability are maintained (no proxy).
Policies can be driven by metadata, allowing massive scale policy enforcement with only a small set of actual policies.
Derived tables can be shared back out through Immuta, improving collaboration.
User access and removal are immediately reflected in secure views.
Users will only be able to access the directory and database created for the workspace when acting under the project. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data to the project, they should use the SparkSQL session to copy data into the workspace.
The Immuta project members query equalized data within the context of the project, collaborate, and write data, all within Databricks.
Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Administrators can place a configuration value in the cluster configuration (core-site.xml) to mark that cluster as unavailable for use as a workspace.
When acting in the workspace project, users can read data using calls like spark.read.parquet("immuta:///some/path/to/a/workspace").
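The read call above can be wrapped in a small helper, sketched below. The function names `workspace_uri` and `read_workspace_parquet` are hypothetical, and the sketch assumes `spark` is the notebook's SparkSession and that the path is written after the `immuta:///` prefix exactly as in the example above.

```python
def workspace_uri(path):
    """Build an immuta:/// URI from a path, tolerating a leading slash."""
    return "immuta:///" + path.lstrip("/")


def read_workspace_parquet(spark, path):
    """Read Parquet data from the workspace while acting in the workspace project."""
    return spark.read.parquet(workspace_uri(path))
```

For example, `read_workspace_parquet(spark, "some/path/to/a/workspace")` issues the same call shown in the text.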
To write delta lake data to a workspace and then expose that delta table as a data source in Immuta, you must specify a table when creating the derived data source (rather than a directory) in the workspace for the data source.
Immuta currently supports the gs schema for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine. Databricks automatically provisions and autoscales drivers and executors to pods on Google Kubernetes Engine, so Google Cloud Platform admin users can view and monitor the Google Kubernetes resources in the Google Cloud Platform.
Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable, and the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED property cannot be set to true to expose the DBFS FUSE mount.
Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported.
Stage third party libraries in DBFS: Installing libraries from Google Storage is not supported.
Maven library installation is only supported in Databricks Runtime 8.1+.
/databricks/spark/conf/spark-env.sh is mounted as read-only:
Set sensitive Immuta configuration values directly in immuta_conf.xml: Do not use environment variables to set sensitive Immuta properties. Immuta is unable to edit the spark-env.sh file because it is read-only; therefore, remove environment variables and keep them from being visible to end users.
Use /immuta-scratch directly: The IMMUTA_LOCAL_SCRATCH_DIR property is unavailable.
Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.
The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.
To write data to a table in Databricks through an Immuta workspace, use one of the following supported provider types for your table format:
avro
csv
delta
orc
parquet
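As a rough illustration, the sketch below builds a Spark SQL `CREATE TABLE ... USING ... AS SELECT` statement and rejects any provider that is not in the list above. The helper name and the database, table, and source names in the usage note are hypothetical; in practice the database would be the workspace database created for the project.

```python
# Provider types listed above as supported for workspace writes.
SUPPORTED_PROVIDERS = ("avro", "csv", "delta", "orc", "parquet")


def create_table_statement(database, table, provider, select_sql):
    """Return a CREATE TABLE ... USING <provider> AS <select> statement.

    Raises ValueError when the provider is not one of the supported types.
    """
    provider = provider.lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported provider {provider!r}; use one of {SUPPORTED_PROVIDERS}"
        )
    return f"CREATE TABLE {database}.{table} USING {provider} AS {select_sql}"
```

In a notebook, the result could be passed to `spark.sql(...)`, for example `spark.sql(create_table_statement("my_workspace_db", "claims_summary", "delta", "SELECT * FROM some_source"))`, where every name is a placeholder.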
Deprecation notice: Support for this feature has been deprecated.
A derived data source is a data source that is created within an equalized project and contains data from its parent sources. Consequently, when the derived data source is created, it will inherit the data policies from its parent data sources to keep the data secure.
Policy inheritance for derived data sources is a feature unique to the environment that an equalized project creates. Within the equalized project, every user sees the same data and work can be shared and collaborated on without any risk of a user viewing more than they should. When a derived data source is created, it inherits the data policies from its parent sources and a subscription policy is created from the equalized entitlements on the project, allowing project members to safely share secure data.
Consider these data sources, within an equalized Project 1, that each contain subscription and data policies:
Data source A
Subscription policy: Allow users to subscribe to the data source when user is a member of group Medical Claims
Data policies:
Mask by making null the value in the column(s) city except for members of group Legal
Mask by making null the value in the column(s) gender for everyone
Data source B
Subscription policy: Allow users to subscribe to the data source when user is approved by anyone with permission owner and anyone with permission governance
Data policy: Limit usage to purpose(s) Research for everyone
If a user creates a derived data source, Data Source C, from these two data sources, Data Source C will inherit these policies, which will be unchangeable:
Data source C
Subscription policy: Allow user to subscribe when they satisfy all of the following:
is a member of group Legal and is a member of group Medical Claims
is approved by anyone with permission owner (of data source B) and anyone with permission governance
Data policy: Limit usage to purpose(s) Research for everyone
Derived data sources inherit policies from parent sources
Sensitive data discovery applies Discovered tags to derived data sources; however, because they inherit policies from their parent sources, the global policies that contain these tags will not apply to derived data sources.
Notice that one of the data policies in Data Source A, mask by making null the value in the column(s) gender for everyone, is not included in data source C. This is because the creator could not have seen the values in the parent sources; therefore, there are no values in the derived data source to be masked.
Most local data policies will not need to be present in the derived data source with the exception of limit usage to purpose(s) policies. And no global policies will be added to a derived data source.
Data source C's policies are reliant on which groups are in the project, and as the groups change so do the policies.
For example, if there were a data user in the project who was not in the Legal group, then that trait would not be needed in the subscription policy because, with equalization, those values would not be visible to the project members in the parent data source.
The subscription and data policies in the derived data source will always be the minimum required permissions and traits because of project equalization.
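The inheritance reasoning in this example can be modeled with a short sketch. This is an illustration of the rules as described here, not Immuta's implementation: the classes, the function name, and the plain-string policy conditions are all hypothetical. It assumes that a masking exception becomes a subscription requirement only when every project member belongs to the excepted group, that masks applied to everyone are dropped, and that purpose restrictions always carry over.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Mask:
    column: str
    except_group: Optional[str] = None  # None means masked for everyone


@dataclass
class ParentSource:
    name: str
    subscription: List[str]
    masks: List[Mask] = field(default_factory=list)
    purposes: List[str] = field(default_factory=list)


def derive_policies(
    parents: List[ParentSource], member_groups: List[Set[str]]
) -> Tuple[List[str], List[str]]:
    """Return (subscription conditions, purposes) for a derived data source."""
    # Groups held by every project member: the equalized level of access.
    shared = set.intersection(*member_groups) if member_groups else set()
    subscription: List[str] = []
    purposes: List[str] = []
    for parent in parents:
        subscription.extend(parent.subscription)
        for mask in parent.masks:
            # An exception matters only if all members could see the unmasked
            # values. Masks for everyone are dropped: nothing was visible.
            if mask.except_group is not None and mask.except_group in shared:
                subscription.append(f"is a member of group {mask.except_group}")
        # Purpose restrictions are carried over unchanged.
        purposes.extend(parent.purposes)
    return subscription, purposes


source_a = ParentSource(
    name="A",
    subscription=["is a member of group Medical Claims"],
    masks=[Mask("city", except_group="Legal"), Mask("gender")],
)
source_b = ParentSource(
    name="B",
    subscription=["is approved by owner (of data source B) and governance"],
    purposes=["Research"],
)
members = [{"Legal", "Medical Claims"}, {"Legal", "Medical Claims"}]
subscription, purposes = derive_policies([source_a, source_b], members)
```

With both members in the Legal group, `subscription` contains the Medical Claims, Legal, and approval conditions, and `purposes` contains Research, matching Data source C. If one member were not in Legal, the Legal condition would be omitted.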
Derived data source policies will not adapt with the parent data sources. Any changes in the parent data source policies will be logged in the Relationships tab of the derived data source page, but will not be changed in the derived data source policies.
The data owner may choose to add new local data policies to the derived data source to keep up with any changes, but the inherited policies are not adjustable.
Any changes within the parent data source's data will not trickle down into the derived data source. After the creation of the derived data source, they stay connected for auditing and relationships, not for updating content.
If members use data outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.
Once the workspace is created, project members will see relevant data sources when working under the project context.
.
Write data to the project workspace in Snowflake or Databricks:
Snowflake: Select the role created by the project workspace. The role created will be a combination of the database name (configured by the application admin) and the schema name. Then, write data to this location.
Databricks: Write data to the directory and database created in Databricks for the project workspace.
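For the Snowflake case, the statements a user runs to enter the workspace context might look like the sketch below. It assumes the role name is the database name and schema name joined by an underscore, which is only one reading of "a combination of the database name and the schema name"; the helper name and the values in the usage note are hypothetical, so check the actual role name in your Snowflake account.

```python
def snowflake_workspace_statements(database, schema, warehouse):
    """Return the USE statements for working in a project workspace.

    The role name format (<database>_<schema>) is an assumption.
    """
    role = f"{database}_{schema}"
    return [
        f"USE ROLE {role}",
        f"USE WAREHOUSE {warehouse}",
        f"USE SCHEMA {database}.{schema}",
    ]
```

For example, `snowflake_workspace_statements("IMMUTA", "MY_PROJECT", "MY_WAREHOUSE")` yields three statements to run in a Snowflake session before writing tables to the workspace schema.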
Now that data has been written to the workspace, users can share this data with others by making it a derived data source in Immuta.
Deprecation notice: Support for this feature has been deprecated.
Select a project.
Select the data source from which the new data was created.
Select Table for the virtual population option.
Click Edit and select the tables you created, and then click Apply.
Opt to edit the Basic Information fields, and then click Create.
After workspaces are configured, project owners can enable workspaces within their projects. This feature allows project members to write data to the project and share this data with other users as derived data sources.
Requirement: You must own the project
Prerequisites:
.
.
External IDs have been connected with an IAM or in for Snowflake.
Data sources registered by an excepted role: Snowflake workspaces generate static views with the credentials used to register the table as an Immuta data source. Those tables must be registered in Immuta by an excepted role so that policies applied to the backing tables are not applied to the project workspace views.
Navigate to the Policies tab and enable project equalization by clicking the Project Equalization slider to on.
Scroll to the Native Workspace section and click Create.
Select Snowflake from the Workspace Configuration dropdown menu.
Name the Workspace Schema. By default, the schema name is based on the project name, but you can change it here. Your project workspace will exist in Snowflake within this schema, under the database configured by the Application Admin.
Use the dropdown menu to select the Hostname. Projects can only be configured to use one Snowflake host.
Select one or more Warehouses to be available to project members when they are working in the Snowflake workspace.
Click Create to enable the workspace.
Scroll to the Native Workspace section on the Policies tab and click the toggle to disable the workspace.
Click Delete in the Native Workspace section.
Choose one of the following options in the modal:
Purge Generic Workspace Data: Permanently delete data, while the data used by derived data sources is preserved. Note: If you created a derived data source that references a view on top of a table in Snowflake that isn't a derived data source, that table will be deleted and break the derived data source.
Purge Everything & Delete Derived Data Sources: Permanently delete data and purge all derived data sources.
Click Delete.
Immuta projects are represented as Session Contexts within Snowflake. As they are linked to Snowflake, projects automatically create corresponding roles, schemas, and secure views in Snowflake.
Using and , Databricks Spark project workspaces are a space where every project member has the same level of access to data. This equalized access allows collaboration without worries about data leaks. Not only can project members collaborate on data, but they can also write protected data to the project.
An Immuta user with the CREATE_PROJECT permission creates a new project with Databricks data sources.
The Immuta project owner enables project equalization, which balances every project member's access to the data to be the same.
The Immuta project owner creates a Databricks project workspace, which automatically generates a subfolder in the root path specified by the application admin and a remote database associated with the project.
The Immuta project members use their newly written derived data and create derived data sources. These derived data sources inherit the necessary Immuta policies to be securely shared outside of the project.
Immuta currently supports the abfss schema for Azure General Purpose V2 Storage Accounts. This includes support for Azure Data Lake Gen 2. When configuring Immuta workspaces for Databricks on Azure, the Azure Databricks workspace ID must be provided. More information about how to determine the workspace ID for your workspace can be found in the . It is also important that the additional configuration file is included on any clusters that wish to use Immuta workspaces with credentials for the container in Azure Storage that contains Immuta workspaces.
Install third party libraries as cluster-scoped: Notebook-scoped libraries have limited support. See the page for more details.
For detailed instructions on creating a derived data source, navigate to .
immuta.set_current_project(id) | Sets the user's current project to the project ID denoted by the id parameter. This UDF must be called in its own notebook cell to ensure the changes take effect. |
immuta.set_current_project() (no parameters) | Sets the user's current project to None. |
immuta.clear_caches() | Clears all client caches for the current user's ImmutaClient instance. This can be used when a user would like to invalidate cached items, like data source subscription information or if the state of Immuta has changed and the cache is outdated. For backward compatibility, this UDF is also available at default.immuta_clear_caches() |
default.immuta_clear_metastore_cache() | Clears the cluster-wide Metastore cache. This UDF can only be run by a privileged user. |
immuta.get_current_project | select * from immuta.get_current_project | This virtual table returns a single row with "name" and "id" columns that show your currently selected project. |
immuta.list_projects | select * from immuta.list_projects | This virtual table returns rows with "name," "id," and "current_project" columns. Each row is a different project to which you are subscribed (and can use as your current project). The "current_project" column will be true for the row defining the project that you have set as your current project. |
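The functions and virtual tables above could be called from a Python notebook roughly as follows. This is a sketch that assumes the UDFs are invoked through a SQL `select` and that `spark` is the notebook's SparkSession; the wrapper names are hypothetical.

```python
def set_current_project(spark, project_id=None):
    """Switch the project context; omit project_id to set it to None.

    Per the table above, run this in its own notebook cell so the
    change takes effect.
    """
    arg = "" if project_id is None else str(int(project_id))
    spark.sql(f"select immuta.set_current_project({arg})").collect()


def get_current_project(spark):
    """Return the single row ("name", "id") for the current project."""
    return spark.sql("select * from immuta.get_current_project").collect()


def list_projects(spark):
    """Return one row per subscribed project ("name", "id", "current_project")."""
    return spark.sql("select * from immuta.list_projects").collect()
```

A typical sequence would be `list_projects(spark)` to find a project ID, `set_current_project(spark, <id>)` in a separate cell, and then `get_current_project(spark)` to confirm the switch.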