Intermingling your pre-existing roles in Databricks with Immuta can be confusing at first. The sections below outline best practices for thinking about roles in each platform.
In Databricks Unity Catalog, access to data, platform permissions, and the ability to use clusters and data warehouses are controlled by granting privileges to individual users or groups. Immuta can manage those grants for you, giving users read access to data based on subscription policies.
This section discusses best practices for Databricks Unity Catalog permissions for end-users.
Users who consume data (directly in your Databricks workspace or through other applications) need permission to access objects. But privileges also control write access, the use of Databricks clusters and warehouses, and other object types that can be registered in Databricks Unity Catalog.
To manage this at scale, Immuta recommends taking a three-layer approach, separating the permissions into distinct sets of privileges:
Privileges for read access (Immuta managed)
Privileges for write access (optional, soon supported by Immuta)
Privileges for warehouses and clusters (internal billing)
Read access is managed by Immuta. Using subscription policies, data access can be controlled down to the table level. Attribute-based table GRANTs help you scale compared to RBAC, where access control is typically done at the schema or catalog level.
Since Immuta leverages native Databricks Unity Catalog GRANTs, you can combine Immuta’s grants with grants done manually in Databricks Unity Catalog. This means you can gradually migrate to an Immuta-protected Databricks workspace.
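To illustrate how the two coexist, here is a minimal sketch with hypothetical names: the table-level grant is the kind of grant Immuta issues when a subscription policy subscribes a user, while the catalog-level grant is a manual one for a team you have not migrated yet.

```sql
-- Illustrative only: the kind of table-level GRANT Immuta issues when a
-- subscription policy subscribes a user (user and table names hypothetical).
GRANT SELECT ON TABLE sales.transactions.orders TO `alice@example.com`;

-- A manual Unity Catalog grant for a team not yet migrated to Immuta;
-- both grants coexist without conflict.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG finance TO `finance-analysts`;
```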
Write access is typically granted at the schema, catalog, or volume level, which makes it easy to manage in Databricks Unity Catalog through manual grants. We recommend creating groups that give INSERT, UPDATE, or DELETE permissions on a specific schema or catalog and attaching each group to users, either manually or through your identity manager groups. (See the Databricks documentation for details.) Note that Immuta is working toward supporting write policies, so this will not need to be managed separately for long.
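A minimal sketch of such a manual write grant, assuming a hypothetical catalog sales, schema sales.transactions, and identity-manager-backed group sales-writers. Note that Unity Catalog expresses insert, update, and delete rights through its MODIFY privilege:

```sql
-- USE CATALOG and USE SCHEMA are prerequisites for acting on objects inside.
GRANT USE CATALOG ON CATALOG sales TO `sales-writers`;
GRANT USE SCHEMA ON SCHEMA sales.transactions TO `sales-writers`;

-- MODIFY covers INSERT, UPDATE, and DELETE on every current and future
-- table in the schema.
GRANT MODIFY ON SCHEMA sales.transactions TO `sales-writers`;
```

Because the grant sits at the schema level, it automatically applies to tables created later, keeping manual write management coarse-grained while Immuta handles fine-grained read access.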
Warehouses and clusters are granted to users to give them access to computing resources. Since this is directly tied to Databricks' consumption model, warehouses and clusters are typically linked to cost centers for (internal) billing purposes. Immuta recommends creating a group per team, domain, or cost center, granting that group the cluster and warehouse privileges it needs, and assigning users to it through your identity manager groups.
Immuta has two types of service accounts to connect to Databricks:
Policy role: Immuta needs a service principal to push policies to Databricks Unity Catalog and, optionally, to pull audit records into Immuta. This principal needs USE CATALOG and USE SCHEMA on all catalogs and schemas, and SELECT and MODIFY on all tables in the metastore managed by Immuta.
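A sketch of the grants for this principal, assuming a hypothetical service principal named immuta-policy (in practice you reference a service principal by its application ID), repeated for each catalog that Immuta manages. Privileges granted on a catalog are inherited by all schemas and tables inside it:

```sql
-- `immuta-policy` stands in for the service principal's application ID.
-- Grants on a catalog are inherited by its schemas and tables.
GRANT USE CATALOG ON CATALOG sales TO `immuta-policy`;
GRANT USE SCHEMA ON CATALOG sales TO `immuta-policy`;
GRANT SELECT ON CATALOG sales TO `immuta-policy`;
GRANT MODIFY ON CATALOG sales TO `immuta-policy`;
```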
Data ownership role: You will also need a user/principal for data source registration. A service account/principal is recommended so that if the registering user moves teams or leaves the organization, Immuta still has the proper credentials to connect to Databricks Unity Catalog. You can follow one of two best practices; a sketch of the corresponding grants follows the options:
A central role for registration (recommended): It is recommended that you create a service role/user with SELECT permissions on all objects in your metastore. Immuta can then register all the tables and views from Databricks, populate the Immuta catalog, and scan the objects for sensitive data using Immuta Discover. Immuta will not apply policy directly by default, so no existing access will be impacted.
A service principal per domain (alternative): Alternatively, if you cannot create a service principal with SELECT permissions on all objects, you can allow the different domains or teams in the organization to use a service user/principal scoped to their own data. This delegates metadata registration, aligns well with data mesh use cases, and makes every team responsible for registering its data sets in Immuta.
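Assuming hypothetical principal and catalog names (again, service principals are referenced by their application IDs in practice), the grants behind each option might look like:

```sql
-- Central registration role (recommended): read-only access to every catalog
-- in the metastore, repeated per catalog.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG sales TO `immuta-registration`;
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG marketing TO `immuta-registration`;

-- Per-domain alternative: each team scopes a principal to its own catalog.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG marketing TO `marketing-registration`;
```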
This guide is for anyone ready to begin their journey with Immuta who isn't using Snowflake. If you are using Snowflake with Immuta, see the Snowflake use case instead.
This use case provides the basics for setting up Immuta and helps you understand the logic and decisions behind the tasks.
Follow these steps to configure Immuta:
Configure your users in Immuta, following the identity manager best practices below.
Read the data onboarding best practices below and connect Immuta to your databases. Consider the Databricks roles best practices above if using Databricks.
Register your data metadata in order to start using Immuta features on the data sources.
Immuta is not just a location to define your policy logic; Immuta also enforces that logic in your data platform. How that occurs varies by data platform, but the overall architecture remains consistent and follows the NIST Zero Trust framework. The diagram below describes the recommended architecture from NIST:
Immuta lives in the middle, in the control plane. In that role, Immuta knows details about the subjects and enterprise resources, acts as the policy decision point through policies administered by policy administrators, and makes real-time policy decisions using the internal Immuta policy engine.
Lastly, and importantly for how Immuta Secure functions, Immuta also enables the policy enforcement point by administering policies natively in your data platform in a way that reacts to policy changes and live queries.
To use Immuta, you must configure the Immuta native integration, which requires some level of privileged access to administer policies in your data platform; how much depends on your data platform and how the Immuta integration works. If using Databricks, refer to the Databricks roles best practices above before configuring the native integration.
This guide outlines best practices for managing user identities in Immuta with your identity manager, such as Active Directory and Okta.
Reusing information you have already captured is good practice: a lot of information about your users already lives in your identity manager platform and can be used in Immuta for user onboarding and policies.
All users protected by Immuta must be registered in Immuta, even if they never log in to Immuta.
SAML is commonly used as a single sign-on mechanism for users to log in to Immuta. This means you can use your organization's SSO, which complies with your security standards.
Every user that will be protected by Immuta needs a user on the platform for policy to be enforced, regardless of whether they log in to Immuta. SCIM should be used to provision users from your identity manager platform to Immuta automatically. The advantages are that end-users do not need to log in to Immuta to create their accounts, and that updates in your identity manager are automatically reflected in Immuta, which in turn updates access in your data platforms.
Details on how to configure your individual identity manager's protocols can be found in the Immuta documentation. There are several different combinations of supported protocol configurations, so consider those as you plan your user synchronization.
In Immuta, permissions control what actions a user is allowed to take through the API and UI. The different permissions can be found in the Immuta permissions guide.
We recommend using identity manager groups to manage permissions. When you configure the identity manager integration, you can enable group permissions. This allows you to control the permissions via identity manager groups and use the group assignment and approval process currently in place.
In order to take advantage of all the capabilities of Immuta, you must make Immuta aware of your data metadata. This is done by registering your data with Immuta as data sources. It’s important to remember that Immuta is not reading your actual data at all; it is simply discovering your information schemas and pulling that information back as the foundation for everything else.
This section offers the best practices when onboarding data sources into Immuta.
If you have an external data catalog, configure the catalog integration first; then register your data in Immuta. This process will automatically tag your data with the external catalog tags as you register it.
Use Immuta's no default subscription policy setting to onboard metadata without affecting your users' access. This means you can onboard all metadata into Immuta without any impact on current access, which gives you time to fully convert your operations to Immuta without causing unnecessary data downtime. Immuta only takes control when the first policies are applied. Because of this, register all tables.
While it can be tempting to start small and register only the data you intend to protect, remember that Immuta is not just about access control. It’s important to register your data metadata so that Immuta can also track activity and understand where sensitive data lies (with Immuta Detect). In other words, Immuta can’t tell you where you have problems unless you first tell it to look at your metadata.
Without the no default subscription policy setting, Immuta sets each data source's subscription policy to the most restrictive option, which automatically locks data down during onboarding. To unlock the data and give your users access again, new subscription policies must be set.
If you are delegating the registration and control of data, then read our Data mesh use case for more information.
Use the /api/v2/data endpoint to register a schema; then use schema monitoring to find new data sources and automatically register them.
One of the greatest benefits of a modern data platform is that you can manage all your data transformations at the data tier. This means that data is constantly changing in the data platform, which may require access control changes as well. This is why it is critical that you enable schema monitoring and column detection when registering metadata with Immuta, allowing Immuta to continually monitor for these changes and update accordingly.
It’s also important to understand that many data engineering tools make changes by destructively recreating tables and views, which drops all policies in the data platform. This is actually a good thing: it gives Immuta a chance to update access as the changes are found (policy uptime), while the only user who can see the data being recreated is the creator of that change (data downtime for all other users). This is why schema monitoring and column detection are so critical; the sketch below illustrates the pattern.
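As a sketch of that pattern, a destructive rebuild behaves roughly like the following (all table names hypothetical): the DROP discards every grant on the table, so only the creator can read the replacement until schema monitoring re-detects it and Immuta re-applies policy.

```sql
-- Dropping the table discards all grants that Immuta (or anyone) applied.
DROP TABLE IF EXISTS sales.transactions.orders;

-- Only the creator can read the rebuilt table until schema monitoring
-- re-registers it and Immuta re-applies the subscription policies.
CREATE TABLE sales.transactions.orders AS
SELECT * FROM sales.staging.orders_raw;
```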