You can switch project contexts and view a list of your current project or available projects through UDFs in Spark.
UDF | Description |
---|---|
To view a list of your current project or available projects in a Spark job, you can query these virtual tables.
Virtual Table | Query | Return |
---|---|---|
View your available projects by running the following query in Spark: select * from immuta.list_projects. In the resulting table, note the values listed in the id column; this value will be used at the parameter in the following step.
Run select immuta.set_current_project(<id>
). This UDF must be called in its own notebook cell to ensure the changes take effect.
Your project context will be switched, and that project's data sources and workspaces will now be visible. To set your project context to None, run select immuta.set_current_project()
with no parameters.
Note: Since the UDFs are not actually registered with the FunctionRegistry, if you call DESCRIBE FUNCTION immuta.set_current_project
, you won't get back the documentation for the UDF.
immuta.set_current_project(id)
Sets the user's current project to the project ID denoted by the id parameter. This UDF must be called in its own notebook cell to ensure the changes take effect.
immuta.set_current_project() (no parameters)
Sets the user's current project to None.
immuta.clear_caches()
Clears all client caches for the current user's ImmutaClient instance. This can be used when a user would like to invalidate cached items, like data source subscription information or if the state of Immuta has changed and the cache is outdated. For backward compatibility, this UDF is also available at default.immuta_clear_caches()
default.immuta_clear_metastore_cache()
Clears the cluster-wide Metastore cache. This UDF can only be run by a privileged user.
immuta.get_current_project
select * from immuta.get_current_project
This virtual table returns a single row with "name" and "id" columns that show your currently selected project.
immuta.list_projects
select * from immuta.list_projects
This virtual table returns rows with "name," "id," and "current_project" columns. Each row is a different project to which you are subscribed (and can use as your current project). The "current_project" row will be true for the row defining the project that you have set as your current project.
With equalization enabled, project users can create Snowflake or Databricks workspaces where users can view and write data. Then, those users can create derived data sources to share this data with other users.
Combining Immuta projects and Snowflake workspaces allows users to access and write data directly in Snowflake.
With Snowflake workspaces, Immuta enforces policy logic on registered tables and represents them as secure views in Snowflake. Since secure views are static, creating a secure view for every unique user in your organization for every table in your organization would result in secure view bloat; however, Immuta addresses this problem by virtually grouping users and tables and equalizing users to the same level of access, ensuring that all members of the project see the same view of the data. Consequently, all members share one secure view.
While interacting directly with Snowflake secure views in these workspaces, users can write within Snowflake and create derived data sources, all the while collaborating with other project members at a common access level. Because these derived data sources will inherit all of the appropriate policies, that data can then be shared outside the project. Additionally, derived data sources use the credentials of the Immuta system Snowflake account, which will allow them to persist after a workspace is disconnected.
Snowflake workspaces can be used on their own or with the Snowflake integration.
Immuta enforces policy logic on data and represents it as secure views in Snowflake. Because projects group users and tables and equalize members to the same level of access, all members will see the same view of the data and, consequently, will only need one secure view. Changes to policies immediately propagate to relevant secure views.
An Immuta user with the CREATE_PROJECT
permission creates a new project with Snowflake data sources.
The Immuta project owner enables project equalization which balances every project members’ access to the data to be the same.
The Immuta project owner creates a Snowflake project workspace which automatically generates a subfolder in the root path specified by the application admin and remote database associated with the project.
Project members can access data sources within the project and use WRITE to create derived tables. To ensure equalization, users will only see data sources within their project as long as they are working in the Snowflake Context.
The CREATE_DATA_SOURCE_IN_PROJECT
permission is given to specific users so they can expose their derived tables in the Immuta project; the derived tables will inherit the policies, and then the data can be shared outside the project.
If a project member leaves a project or a project is deleted, that Snowflake Context will be removed from the user's Snowflake account.
Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Immuta projects are represented as Session Contexts within Snowflake. As they are linked to Snowflake, projects automatically create corresponding
roles in Snowflake: IMMUTA_[project name]
schemas in the Snowflake IMMUTA database: [project name]
secure views in the project schema for any table in the project
To switch projects, users have to change their Snowflake Session Context to the appropriate Immuta project. If users are not entitled to a data source contained by the project, they will not be able to access the Context in Snowflake until they have access to all tables in the project. If changes are made to a user's attributes and access level, the changes will immediately propagate to the Snowflake Context.
Because users access data only through secure views in Snowflake, it significantly decreases the amount of role management for administrators in Snowflake. Organizations should also consider having a user in Snowflake who is able to create databases and make GRANTs on those databases and having separate users who are able to read and write from those tables.
Few roles to manage in Snowflake; that complexity is pushed to Immuta, which is designed to simplify it.
A small set of users has direct access to raw tables; most users go through secure views only, but raw database access can be segmented across departments.
Policies are built by the individual database administrators within Immuta and are managed in a single location, and changes to policies are automatically propagated across thousands of tables’ secure views.
Self-service access to data based on data policies.
Users work in various contexts in Snowflake natively, based on their collaborators and their purpose, without fear of leaking data.
All policies are enforced natively in Snowflake without performance impact.
Security is maintained through Snowflake primitives (roles and secure views).
Performance and scalability is maintained (no proxy).
Policies can be driven by metadata, allowing massive scale policy enforcement with only a small set of actual policies.
Derived tables can be shared back out through Immuta, improving collaboration.
User access and removal are immediately reflected in secure views.
Using Immuta projects and project equalization, Databricks project workspaces are a space where every project member has the same level of access to data. This equalized access allows collaboration without worries about data leaks. Not only can project members collaborate on data, but they can also write protected data to the project.
Users will only be able to access the directory and database created for the workspace when acting under the project. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data to the project, they should use the SparkSQL session to copy data into the workspace.
An Immuta user with the CREATE_PROJECT
permission creates a new project with Databricks data sources.
The Immuta project owner enables project equalization which balances every project members’ access to the data to be the same.
The Immuta project owner creates a Databricks project workspace which automatically generates a subfolder in the root path specified by the application admin and remote database associated with the project.
The Immuta project members query equalized data within the context of the project, collaborate, and write data, all within Databricks.
The Immuta project members use their newly written derived data and register the derived tables in Immuta as derived data sources. These derived data sources inherit the necessary Immuta policies to be securely shared outside of the project.
Immuta only supports a single root location, so all projects will write to a subdirectory under this single root location.
If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
Administrators can place a configuration value in the cluster configuration (core-site.xml
) to mark that cluster as unavailable for use as a workspace.
When acting in the workspace project, users can read data using calls like spark.read.parquet("immuta:///some/path/to/a/workspace")
.
To write delta lake data to a workspace and then expose that delta table as a data source in Immuta, you must specify a table when creating the derived data source (rather than a directory) in the workspace for the data source.
Immuta currently supports the abfss
schema for Azure General Purpose V2 Storage Accounts. this includes support for Azure Data Lake Gen 2. When configuring Immuta workspaces for Databricks on Azure, the Azure Databricks workspace ID must be provided. More information about how to determine the workspace ID for your workspace can be found in the Databricks documentation. It is also important that the additional configuration file is included on any clusters that wish to use Immuta workspaces with credentials for the container in Azure Storage that contains Immuta workspaces.
Immuta currently supports the gs
schema for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine. Databricks handles automatically provisioning and auto scaling drivers and executors to pods on Google Kubernetes Engine, so Google Cloud Platform admin users can view and monitor the Google Kubernetes resources in the Google Cloud Platform.
Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable, and the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED
property cannot be set to true
to expose the DBFS FUSE mount.
Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported.
Stage third party libraries in DBFS: Installing libraries from Google Storage is not supported.
Install third party libraries as cluster-scoped: Notebook-scoped libraries have limited support. See the Databricks Libraries page for more details.
Maven library installation is only supported in Databricks Runtime 8.1+.
/databricks/spark/conf/spark-env.sh
is mounted as read-only:
Set sensitive Immuta configuration values directly in immuta_conf.xml
: Do not use environment variables to set sensitive Immuta properties. Immuta is unable to edit the spark-env.sh
file because it is read-only; therefore, remove environment variables and keep them from being visible to end users.
Use /immuta-scratch
directly: The IMMUTA_LOCAL_SCRATCH_DIR
property is unavailable.
Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.
The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.
To write data to a table in Databricks through an Immuta workspace, use one of the following supported provider types for your table format:
avro
csv
delta
orc
parquet
Deprecation notice: Support for this feature has been deprecated.
A derived data source is a data source that is created within an equalized project and contains data from its parent sources. Consequently, when the derived data source is created, it will inherit the data policies from its parent data sources to keep the data secure.
Policy inheritance for derived data sources is a feature unique to the environment that an equalized project creates. Within the equalized project, every user sees the same data and work can be shared and collaborated on without any risk of a user viewing more than they should. When a derived data source is created, it inherits the data policies from its parent sources and a subscription policy is created from the equalized entitlements on the project, allowing project members to safely share secure data.
Consider these data sources, within an equalized Project 1, that each contain subscription and data policies:
Data source A
Subscription policy: Allow users to subscribe to the data source when user is a member of group Medical Claims
Data policies:
Mask by making null the value in the column(s) city except for members of group Legal
Mask by making null the value in the column(s) gender for everyone
Data source B
Subscription policy: Allow users to subscribe to the data source when user is approved by anyone with permission owner and anyone with permission governance
Data policy: Limit usage to purpose(s) Research for everyone
If a user creates a derived data source, Data Source C, from these two data sources, Data Source C will inherit these policies, which will be unchangeable:
Data source C
Subscription policy: Allow user to subscribe when they satisfy all of the following:
is a member of group Legal and is a member of group Medical Claims
is approved by anyone with permission owner (of data source B) and anyone with permission governance
Data policy: Limit usage to purpose(s) Research for everyone
Derived data sources inherit policies from parent sources
Sensitive data discovery applies Discovered tags to derived data sources; however, because they inherit policies from their parent sources, the global policies that contain these tags will not apply to derived data sources.
Notice that one of the data policies in Data Source A, mask by making null the value in the column(s) gender for everyone, is not included in data source C. This is because the creator could not have seen the values in the parent sources; therefore, there are no values in the derived data source to be masked.
Most local data policies will not need to be present in the derived data source with the exception of limit usage to purpose(s) policies. And no global policies will be added to a derived data source.
Data source C's policies are reliant on which groups are in the project, and as the groups change so do the policies.
For example, if there were a data user in the project who was not in the Legal group, then that trait would not be needed in the subscription policy because, with equalization, those values would not be visible to the project members in the parent data source.
The subscription and data policies in the derived data source will always be the minimum required permissions and traits because of project equalization.
Derived data source policies will not adapt with the parent data sources. Any changes in the parent data source policies will be logged in the Relationships tab of the derived data source page, but will not be changed in the derived data source policies.
The data owner may choose to add new local data policies to the derived data source to keep up with any changes, but the inherited policies are not adjustable.
Any changes within the parent data source's data will not trickle down into the derived data source. After the creation of the derived data source, they stay connected for auditing and relationships, not for updating content.
If members use data outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.
For detailed instructions on creating a derived data source, navigate to Create a derived data source.