Managing Data Metadata
Your schema metadata is registered using either of the Detect use cases:
Monitor and secure sensitive data platform query activity use case (Snowflake only)
General Immuta configuration use case (if not using Snowflake)
You should have already done some data tagging while configuring Immuta in the Detect getting started. That guide focuses on understanding where compliance issues may exist in your data platform and may not have fully covered all tags required for policy. Read on to see if there's more work to be done with data tagging.
Now that we’ve enriched facts about our users, let’s focus on the second point on the policy triangle diagram: the data tags. Just as you need user metadata, you need metadata on your data (tags) in order to decouple policy logic from references to physical tables or columns. You must choose between the orchestrated RBAC and ABAC methods of data access:
Orchestrated RBAC method: tag data sources at the table level
ABAC method: tag data at the table and column level
While it is possible to target policies using both table- and column-level tags, for ABAC it’s more common to target column tags because they represent more granularly what is in the table. Just like user metadata needs to be facts about your users, the data metadata must be facts about the data. The tags on your tables should not contain any policy logic.
Fact-based column tags are descriptive (recommended):
Column ssn has column tag social security number
Column f_name has column tag name
Column dob has column tags date and date of birth
Logic-based column tags require subjective decisions (not recommended):
Column ssn has column tag PII
Column f_name has column tag sensitive
Column dob has column tag indirect identifier
But can't I get policy authoring scalability by tagging things with higher level classifications, like PII, so I can build broader policies? This is what Immuta’s classification frameworks are for.
Entity tags are facts about the contents of individual columns in isolation. Entity tags are what we listed above: social security number, name, date, and date of birth. Entity tags do not attempt to contextualize column contents with neighboring columns' contents. In contrast, categorization and classification tags describe the sensitive contents of a table with the context of all its columns. These are the logic-based tags listed above, such as PII, sensitive, and indirect identifier.
For example, under the HIPAA framework a list of procedures a doctor performed is only considered protected health information (PHI) if it can be associated with the identity of patients. Since entity tagging operates on a single column-by-column basis, it can’t reason whether or not a column containing procedure codes merits classification as PHI. Therefore, entity tagging will not tag procedure codes as PHI. But categorization tagging will tag it PHI if it detects patient identity information in the other columns of the table.
Additionally, entity tagging does not indicate how sensitive the data is, whereas categorization tags carry a sensitivity level: the classification tag. For example, an entity tag may identify a column that contains telephone numbers, but the entity tag alone cannot say that the column is sensitive. A phone number associated with a person may be classified as sensitive, while the publicly listed phone number of a company might not be.
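The difference between column-by-column entity tags and table-aware classification can be sketched in a few lines of Python. This is only an illustration of the idea, not how Immuta's classification frameworks are implemented: the function name `classify_columns`, the `IDENTITY_TAGS` set, the `procedure code` tag, and the single rule are all assumptions made up for the example.

```python
# Hypothetical entity tags that identify a person (illustrative only).
IDENTITY_TAGS = {"person name", "social security number", "date of birth"}


def classify_columns(entity_tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Derive classification tags per column from the whole table's entity tags."""
    # Table-level context: does any column carry an identity entity tag?
    table_has_identity = any(tags & IDENTITY_TAGS for tags in entity_tags.values())

    derived: dict[str, set[str]] = {}
    for column, tags in entity_tags.items():
        derived[column] = set()
        # Assumed rule: procedure codes are PHI only alongside patient identity data.
        if "procedure code" in tags and table_has_identity:
            derived[column].add("PHI")
    return derived


# Same entity tag on proc_code, different table context.
patient_visits = {"f_name": {"person name"}, "proc_code": {"procedure code"}}
procedure_catalog = {"proc_code": {"procedure code"}, "description": set()}

print(classify_columns(patient_visits)["proc_code"])     # {'PHI'}
print(classify_columns(procedure_catalog)["proc_code"])  # set()
```

The entity tag on `proc_code` is identical in both tables. Only the presence of an identity column elsewhere in the table changes the derived classification.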
Contextual tags (the categorization and classification tags) are really what you should target with policy where possible. They provide a way to create higher level objects for more scalable and generic policy. Rather than building a policy like “allow access to tables with columns tagged person name and phone number,” it would be much easier to build it like “allow access to tables with columns tagged PII.”
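To make the scalability argument concrete, here is a minimal sketch that compares a policy condition written against entity tags with one written against a single classification tag. The helper names and the sample tables are hypothetical, and the entity-level condition is assumed to require both tags. Real Immuta policies are authored in the product, not as Python functions.

```python
def table_has_tag(column_tags: dict[str, set[str]], tag: str) -> bool:
    """True if at least one column of the table carries the given tag."""
    return any(tag in tags for tags in column_tags.values())


def entity_level_condition(column_tags: dict[str, set[str]]) -> bool:
    # Must enumerate every entity tag the policy cares about.
    return (table_has_tag(column_tags, "person name")
            and table_has_tag(column_tags, "phone number"))


def classification_condition(column_tags: dict[str, set[str]]) -> bool:
    # One higher level tag covers every entity the framework maps to it.
    return table_has_tag(column_tags, "PII")


# Hypothetical tables: columns carry entity tags plus a derived PII tag.
contacts = {"f_name": {"person name", "PII"}, "phone": {"phone number", "PII"}}
employees = {"ssn": {"social security number", "PII"}}

print(entity_level_condition(contacts), classification_condition(contacts))    # True True
print(entity_level_condition(employees), classification_condition(employees))  # False True
```

The `employees` table is missed by the entity-level condition until someone extends it with another tag, while the classification condition already matches it.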
In short, you must tag your entities, and then rely on a classification framework (provided by Immuta or customized by you) to provide the higher level context, also as tags. Remember, the owners of the tables (those who created them) can tag the data with facts about what is in the columns without having to understand the higher level implications of those tags (categorization and classification). This allows better separation of duty.
For orchestrated RBAC, the data tags are no longer facts about your data. They are instead a single variable that determines access. As such, they should be table-level tags (which also reduces the amount of processing Immuta must do).
There are several options for tagging your data, and if you are following along with the use cases for Detect getting started, you may have already completed the first and recommended option.
Immuta Discover's sensitive data discovery (SDD): This is the most powerful option. Immuta is able to discover your sensitive data, and you are able to extend what types of entities are discovered to those specific to your business. SDD can run completely within your data platform, with no data leaving at all for Immuta to analyze. SDD is more relevant for the ABAC approach because the tags are facts about the data.
Tags from an external source: You may have already done all the work tagging your data in some external catalog, such as Collibra, Alation, or your own homegrown tool. If so, Immuta can pull those tags in and use them. Out of the box Immuta supports Alation, Collibra, and Snowflake tags, and for anything else you can build a Custom REST Catalog Interface. But remember, just like user metadata, these should represent facts about your data and not policy decisions.
Manually tag: Just like with user metadata, you can manually tag tables and columns in Immuta through the UI, through the Immuta API, or when registering the data, whether during initial registration or later, when schema monitoring discovers new tables.
Just as hierarchy matters for user metadata, it also matters for data tags. We discussed the matching of user metadata to data metadata in the Managing user metadata guide. However, there are even simpler approaches that can leverage data tag hierarchy beyond matching. These are covered in more detail in the Author policy guide, but they are important to understand as you think through data tagging.
As a quick example, it is possible to tag your data with Cars and then also tag that same data with more specific tags (in the hierarchy) such as Cars.Nissan.Xterra. Then, when you build policies, you could allow access to tables tagged Cars to administrators, but only those tagged Cars.Nissan.Xterra to suv_inspectors. This will result in two separate policies landing on the same table, and the beauty of Immuta is that it will handle the conflict of those two separate policies. This provides a large amount of scalability because you have to manage far fewer policies.
Imagine if you didn’t have this capability. You would have to include administrators access in every policy you created for the different vehicle makes, and if that policy needed to evolve, such as granting more than administrators access to all cars, it would be an enormous effort to make that change. With Immuta, it’s one policy change.
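The Cars example can be sketched as two policies keyed by tag, evaluated against one table. This is a rough model under stated assumptions, not Immuta's actual resolution logic: it assumes a policy on a parent tag also applies to data tagged with any of its children, and that overlapping policies are combined so that either one grants access. The names `POLICIES`, `tag_matches`, and `groups_with_access` are hypothetical.

```python
# Hypothetical policies: tag targeted -> groups granted access.
POLICIES = {
    "Cars": {"administrators"},
    "Cars.Nissan.Xterra": {"suv_inspectors"},
}


def tag_matches(policy_tag: str, data_tag: str) -> bool:
    """Assumed rule: a policy tag matches itself and any tag below it in the hierarchy."""
    return data_tag == policy_tag or data_tag.startswith(policy_tag + ".")


def groups_with_access(table_tags: set[str]) -> set[str]:
    """Union the groups of every policy whose tag matches one of the table's tags."""
    allowed: set[str] = set()
    for policy_tag, groups in POLICIES.items():
        if any(tag_matches(policy_tag, tag) for tag in table_tags):
            allowed |= groups
    return allowed


xterra_table = {"Cars", "Cars.Nissan.Xterra"}
camry_table = {"Cars", "Cars.Toyota.Camry"}

print(sorted(groups_with_access(xterra_table)))  # ['administrators', 'suv_inspectors']
print(sorted(groups_with_access(camry_table)))   # ['administrators']
```

Granting another group access to all cars means changing only the `Cars` entry. None of the make-specific policies need to be touched.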