You are viewing documentation for Immuta version 2020.3.


Audience: Data Owners, Governors, and Users

Content Summary: This page outlines the theory of differential privacy and provides conceptual information helpful in configuring Differential Privacy Policies in Immuta. It further explains the effects of Immuta's Differential Privacy Policies on query results.

For instructions on how to programmatically create and apply Differential Privacy Policies, see the Differential Privacy Rule Type section of the Policy Handler HTTP API documentation.

## Background

Differential privacy is a mathematical framework enabling approximate evaluation of a function over a private database, while limiting the ability of an outside party to make inferences about individual input records. This page will outline the motivation and mathematical underpinnings of differential privacy, as well as provide insights on how to get the most out of Differential Privacy Policies.

## Theory

Consider a game in which an attacker is provided a pair of databases $D_1$ and $D_2$ that differ only by insertion (or deletion) of a single record, and the evaluation of some aggregate, $\mathcal{A}$, over either $D_1$ or $D_2$. This evaluation is denoted $\mathcal{A}(D)$, where $D$ stands for the unknown input database — either $D_1$ or $D_2$, as appropriate. The attacker is allowed to examine $D_1$, $D_2$, and is given a full specification of $\mathcal{A}$, so that he or she can independently evaluate $\mathcal{A}$ over $D_1$, $D_2$, or any input of their choosing. The goal of the attacker is to guess whether $D$ was $D_1$ or $D_2$.

This task is considered difficult if the attacker is forced to guess and, regardless of attack methodology, is unlikely to be right significantly more than $50\%$ of the time. Intuitively, if the attacker is unable to determine which input database was used to produce the observed result after being given the complete contents of $D_1$, $D_2$, and being permitted examination of and experimentation with $\mathcal{A}$, it must be that the result carries little information about the row in which the databases differ. Further, $\mathcal{A}$ is considered private if its construction ensures that this task is difficult over any pair of adjacent databases.

### Defining Differential Privacy

This difficulty is ensured by requiring that the privacy mechanism respect a certain distributional condition on its output.

Formally, a privacy mechanism, $\mathcal{A}$, is called $(\varepsilon, \delta)$-differentially-private, if for any pair of databases $D_1$, $D_2$ which differ from each other by insertion (or deletion) of a single record, and any $S \subseteq \textrm{Range}({\mathcal{A}})$,

$$\Pr[\mathcal{A}(D_1)\in S] \leq e^\varepsilon \cdot \Pr[\mathcal{A}(D_2) \in S] + \delta.$$

Roughly, when $\varepsilon$ is small and $\delta = 0$, this condition ensures that there exists no set of outputs of $\mathcal{A}$ that provides a significant advantage in determining whether the privacy mechanism was evaluated over $D_1$ or $D_2$. We can quantify the advantage conferred to an adversary hoping to distinguish $D_1$ from $D_2$. Let $S \subseteq \textrm{Range}(\mathcal{A})$. We think of observing the event $S$ whenever $\mathcal{A}$ returns a value contained in $S$.

More formally, the significance (privacy-loss) of observing the event $S$ under the privacy mechanism $\mathcal{A}$ is

$$\mathcal{L}^{\mathcal{A}}_{D_1, D_2}(S) :=\ln\left(\frac{\Pr[\mathcal{A}(D_1) \in S]}{\Pr[\mathcal{A}(D_2) \in S]}\right).$$

Thus when $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially-private, it holds that for any pair of databases $D_1$, $D_2$ differing by a single record, the privacy loss, $\mathcal{L}^{\mathcal{A}}_{D_1, D_2}(S)$, is no larger than $\varepsilon$ with probability at least $1-\delta$.
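To make the bound concrete, the following sketch evaluates the privacy loss pointwise for a hypothetical sensitivity-1 count released with Laplace noise (the mechanism is introduced later on this page). The helper name and the grid of outputs are illustrative; the absolute privacy loss never exceeds $\varepsilon$:

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution centered at mu with scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

# Adjacent databases whose (sensitivity-1) counts differ by one record.
eps = 0.5
b = 1.0 / eps               # Laplace scale for a sensitivity-1 query
count_d1, count_d2 = 100, 101

# Privacy loss at an observed output y, evaluated over a grid of outputs:
# its absolute value never exceeds eps.
grid = [90 + 0.25 * i for i in range(121)]
worst = max(
    abs(math.log(laplace_pdf(y, count_d1, b) / laplace_pdf(y, count_d2, b)))
    for y in grid
)
print(worst)   # equals eps = 0.5 up to floating-point rounding
```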

### Sensitivity

The goal of differential privacy is to limit how much an adversary can learn about the underlying data from analysis results. This leads to the following concern: if an analysis can be heavily influenced by the presence of certain records, it may be possible to draw conclusions about the participation of those records by observing the output.

As an example, consider the query `SELECT MAX(salary) FROM employees`. This query is considered sensitive, as its result is entirely determined by a single row of the input.

On the other hand, the query `SELECT 3 FROM employees GROUP BY true` is almost entirely insensitive to the data in the employees table. Assuming the employees table is always non-empty, the result ($3$) is entirely independent of the data, and revealing it leaks no information about the contents of the database.

Note

It is important to note that the above queries behave differently when the table is empty, returning an empty result-set. This is an important distinction that cannot be ignored in a more generic setting where the employees table is permitted to be empty. In such a setting, the empty result-set would allow an outsider to infer that the true count of the table is $0$. Immuta's Differential Privacy Engine ensures that results are returned even when the implied result-set would otherwise be empty.

Sensitivity is a numerical measure which bounds how much an analysis can be influenced by a single row.

We now introduce some notation to be mathematically precise. Let $\mathcal{D}$ denote the set of all databases. For ease of reading, conceptualize $\mathcal{D}$ as the set of all possible instantiations of some table. With this in mind, we can consider each (numerical) query or, more generally, any subsequent quantitative analysis, to be a mathematical function, $f$, which assigns to each possible table $k$ numerical values. To this end, let $f:\mathcal{D}\rightarrow{\mathbb{R}}^k$ stand for a query.

The $\ell_1$-sensitivity of $f$, denoted $\Delta(f)$, is a numerical measure of how much $f$ can change when comparing its value over any pair of databases that differ by the presence of a single record. To this end, we use the notation $D_1 \sim D_2$ to indicate that databases $D_1$ and $D_2$ differ by a single record.

\begin{align}\Delta(f) := & \max_{\substack{D_1, D_2 \in \mathcal{D} \\ D_1 \sim D_2}} \|f(D_1) - f(D_2) \|_1 \\ = & \max_{\substack{D_1, D_2 \in \mathcal{D} \\ D_1 \sim D_2}} \sum_{i=1}^k|f_i(D_1)-f_i(D_2)| \end{align}
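The definition can be sketched in Python. The helper names are hypothetical, and since the true $\Delta(f)$ maximizes over *all* adjacent pairs, evaluating it on a finite sample of pairs only yields a lower bound:

```python
def l1_distance(y1, y2):
    """l1 distance between two k-dimensional query outputs."""
    return sum(abs(a - b) for a, b in zip(y1, y2))

def empirical_sensitivity(f, adjacent_pairs):
    """Max l1 change of f over the SUPPLIED adjacent pairs only;
    a lower bound on the true l1-sensitivity Delta(f)."""
    return max(l1_distance(f(d1), f(d2)) for d1, d2 in adjacent_pairs)

# COUNT viewed as a 1-dimensional query: f(D) = (len(D),).
count = lambda rows: (len(rows),)

base = [42, 17, 99]
pairs = [(base, base + [7])]  # databases differing by one inserted record
print(empirical_sensitivity(count, pairs))  # 1
```

For COUNT, one inserted record changes the output by exactly one in every case, so the lower bound here coincides with the true sensitivity.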

### Mechanisms

There are numerous algorithmic mechanisms which attain differential privacy. Internally, Immuta's Differential Privacy Engine employs one of two mechanisms, depending upon the choice of SQL aggregate. This section outlines those mechanisms.

#### Laplace Mechanism

The Laplace Mechanism protects function input by adding noise directly to the analysis results. At first glance it may not be clear under what circumstances adding noise to the output is sufficient to protect the input, if at all. It turns out that doing so is sufficient precisely when the $\ell_1$-sensitivity of the analysis is bounded.

Let $\textrm{Lap}(b)$ denote the $0$-centered Laplace distribution with variance $2b^2$. The probability density function for this distribution is $f(x;b)=\frac{1}{2b}e^{-\frac{|x|}{b}}.$

Theorem [Dwork 2006]

Let $\varepsilon > 0$, and let $f:\mathcal{D} \rightarrow \mathbb{R}^k$ have finite sensitivity. Then $f(x) + (\eta_1, \eta_2, \ldots, \eta_k)$ is $\varepsilon$-differentially-private, provided that $\eta_1, \eta_2, \ldots, \eta_k$ are independently sampled from $\textrm{Lap}(\Delta(f)/\varepsilon)$.
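The theorem can be sketched in a few lines of Python; `laplace_mechanism` is a hypothetical name, and the noise is drawn by the standard inverse-CDF method:

```python
import math
import random

def laplace_mechanism(value, sensitivity, epsilon, rng=random):
    """Release value + noise drawn from Lap(sensitivity / epsilon)."""
    b = sensitivity / epsilon
    u = rng.random() - 0.5                     # uniform on [-0.5, 0.5)
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return value + noise

# A sensitivity-1 count released with epsilon = 0.5 (noise scale b = 2).
rng = random.Random(0)
samples = [laplace_mechanism(100, 1.0, 0.5, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
# The noise is zero-centered, so the empirical mean sits near the true count.
```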

#### Sample and Aggregate

Sample and Aggregate provides a method for differentially-private evaluation of sensitive functions. The idea is to randomly partition the data and then evaluate $f$ over each partition. The evaluations are aggregated together into the final analysis via a differentially-private aggregation. Provided that $f$ is stable under subsampling, this strategy provides a differentially-private estimate for the evaluation $f$ over the database, even when $f$ is sensitive.

Within Immuta, Sample and Aggregate is carried out using the method of Nissim et al. [NSS07].
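A minimal sketch of the partition-evaluate-aggregate flow is shown below. A plain median stands in for the smoothed differentially-private median used internally, so this illustrates only the structure of the method, not its privacy guarantee:

```python
import random
import statistics

def sample_and_aggregate(rows, f, num_partitions, rng):
    """Randomly partition rows, evaluate f per partition, and combine
    the per-partition answers with a robust aggregate (median here)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    partitions = [shuffled[i::num_partitions] for i in range(num_partitions)]
    return statistics.median(f(p) for p in partitions if p)

rng = random.Random(7)
data = [rng.gauss(50.0, 5.0) for _ in range(1000)]
estimate = sample_and_aggregate(data, lambda p: sum(p) / len(p), 10, rng)
# estimate approximates the overall mean (about 50) via per-partition means
```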

### Composition

It is natural to wonder what level of privacy protection is afforded to individual rows in the database when multiple independent evaluations of differentially-private privacy mechanisms are available to an outsider. Effectively, the net privacy loss is bounded by the sum of the individual privacy losses across all releases, and likewise the combined tolerance of failure is bounded by the sum of the individual tolerances. This statement is made precise below:

Theorem [DR14]

Let $\mathcal{A}_1$, $\ldots$, $\mathcal{A}_k$ denote a sequence of $k$ privacy mechanisms where the $i$-th privacy mechanism $\mathcal{A}_i$ satisfies $\left(\varepsilon_i, \delta_i\right)$-differential privacy, respectively. Then, given any database, $D$, the release of $\left\{(i, \mathcal{A}_i(D)) : 1 \leq i \leq k \right\}$, satisfies $\left(\sum_{i=1}^k \varepsilon_i, \sum_{i=1}^k \delta_i\right)$-differential privacy.

As a special case, when the database is partitioned into $k$ disjoint subsets, sequential evaluation over the separate partitions composes in a parallel manner. In other words, neither privacy-loss nor failure tolerance accumulates. The precise statement follows:

Theorem

Let $D_1$, $D_2$, $\ldots$, $D_k$ be a partition of a database $D$. Let $\mathcal{A}_1$, $\ldots$, $\mathcal{A}_k$ denote a sequence of $k$ privacy mechanisms where the $i$-th privacy mechanism $\mathcal{A}_i$ satisfies $\left(\varepsilon_i, \delta_i\right)$-differential privacy, respectively. Then the release of $\left\{(i, \mathcal{A}_i(D_i)) : 1 \leq i \leq k \right\}$, satisfies $\left(\max_{i=1}^k \varepsilon_i, \max_{i=1}^k \delta_i\right)$-differential privacy.
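The two composition rules differ only in how the per-mechanism parameters combine; a few lines of Python make the contrast explicit (the parameter values are purely illustrative):

```python
# Illustrative per-mechanism privacy parameters (eps_i, delta_i).
params = [(0.1, 1e-6), (0.25, 0.0), (0.15, 1e-6)]

# Sequential composition (same database): the parameters add up.
seq_eps = sum(e for e, _ in params)      # ~0.5
seq_delta = sum(d for _, d in params)    # ~2e-06

# Parallel composition (disjoint partitions): only the worst case matters.
par_eps = max(e for e, _ in params)      # 0.25
par_delta = max(d for _, d in params)    # 1e-06
```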

## Differential Privacy in Immuta

Internally, Immuta applies differential privacy at the aggregate level where the choice of privacy mechanism is a function of the aggregate.

In this setting, a query result can be thought of as a sequential composition of differentially-private outputs. Since the $\varepsilon$ value on the policy is treated as a privacy-loss tolerance for the entire query result, each column receives an effective budget of $\varepsilon/k$, where $k$ is the number of columns in the result-set. Note that the $\delta$ parameter is not scaled, so the overall release is $(\varepsilon, k\cdot\delta)$-differentially-private.
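For example, with an illustrative policy budget of $\varepsilon = 1$ and $\delta = 10^{-6}$ over a four-column result-set:

```python
policy_eps, policy_delta = 1.0, 1e-6
k = 4  # columns in the result-set

per_column_eps = policy_eps / k           # each column gets eps / k
overall = (policy_eps, k * policy_delta)  # the release is (eps, k*delta)-DP
print(per_column_eps, overall)
```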

Queries against a data source protected by a Differential Privacy policy are restricted as follows:

- Only aggregates can be queried.
- Only WHERE clauses can be used to filter (no GROUP BY).
- No JOINs.

### Aggregates

#### Average

When a Differential Privacy policy is in effect, the AVG SQL aggregate returns a $(k^{-1}\varepsilon, \delta)$-differentially-private average.

Evaluation occurs via the Sample and Aggregate method, as follows:

1. First, the result-set is randomly partitioned into sample subsets.
2. Next, an average (the arithmetic mean) is computed for each random partition element.
3. Finally, a differentially-private estimate of the average is obtained via a $(k^{-1}\varepsilon, \delta)$-smoothed differentially-private median of the partition-wise averages.

#### Count

When a Differential Privacy policy is in effect, the COUNT SQL aggregate returns a $(k^{-1}\varepsilon, 0)$-differentially-private count.

Evaluation occurs via the Laplace Mechanism method.

Note that the $\ell_1$-sensitivity of the COUNT aggregate is $1$ since, in the worst case, adding or removing a single record changes the count of the filtered result-set by at most one. Thus, employing the Laplace Mechanism with a sensitivity of $1$ provides $(\varepsilon, 0)$-differential-privacy.

Since this method involves adding noise sampled from $\textrm{Lap}(k/\varepsilon)$ to the true count, the expected error is roughly the standard deviation of the noise distribution, $\pm \sqrt{2}k/\varepsilon$.
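For instance, with an illustrative budget of $\varepsilon = 1$ split across $k = 4$ columns, the typical error of a noisy count works out to:

```python
import math

eps, k = 1.0, 4
noise_scale = k / eps                    # Laplace scale b = k / eps
expected_error = math.sqrt(2) * noise_scale
print(round(expected_error, 2))          # about 5.66 rows of error on average
```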

#### Max

When a Differential Privacy policy is in effect, the MAX SQL aggregate returns a $(k^{-1}\varepsilon, \delta)$-differentially-private maximum.

Evaluation occurs via the Sample and Aggregate method, as follows:

1. First, the result-set is randomly partitioned into sample subsets.
2. Next, the maximum is computed for each random partition element.
3. Finally, a differentially-private estimate of the maximum is obtained via a $(k^{-1}\varepsilon, \delta)$-smoothed differentially-private median of the partition-wise maximums.

#### Min

When a Differential Privacy policy is in effect, the MIN SQL aggregate returns a $(k^{-1}\varepsilon, \delta)$-differentially-private minimum.

Evaluation occurs via the Sample and Aggregate method, as follows:

1. First, the result-set is randomly partitioned into sample subsets.
2. Next, the minimum is computed for each random partition element.
3. Finally, a differentially-private estimate of the minimum is obtained via a $(k^{-1}\varepsilon, \delta)$-smoothed differentially-private median of the partition-wise minimums.

#### Sum

When a Differential Privacy policy is in effect, the SUM SQL aggregate returns a $(k^{-1}\varepsilon, \delta)$-differentially-private sum.

This aggregate assumes that the column takes values in a bounded interval: $\left[ R_\min, R_\max \right].$

Evaluation occurs via the Laplace Mechanism method with a sensitivity of $\max(|R_\min|, |R_\max|).$

Since this method involves adding noise sampled from $\textrm{Lap}(k\max(|R_\min|, |R_\max|)/\varepsilon)$ to the true sum, the expected error is roughly the standard deviation, $\pm \sqrt{2}k\max(|R_\min|, |R_\max|)/\varepsilon.$
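Putting the pieces together, here is a hedged sketch of a differentially-private SUM under the stated assumptions. The function name and salary data are hypothetical; values are clamped to the assumed interval, and the Laplace noise is drawn by the inverse-CDF method:

```python
import math
import random

def dp_sum(values, r_min, r_max, eps, k, rng):
    """Noisy SUM sketch: clamp values into [r_min, r_max], then add
    Lap(k * max(|r_min|, |r_max|) / eps) noise to the true sum."""
    sensitivity = max(abs(r_min), abs(r_max))
    b = k * sensitivity / eps
    clamped = [min(max(v, r_min), r_max) for v in values]  # enforce the bound
    u = rng.random() - 0.5
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return sum(clamped) + noise

rng = random.Random(1)
salaries = [rng.uniform(40_000, 120_000) for _ in range(500)]
noisy_total = dp_sum(salaries, 0.0, 150_000.0, eps=1.0, k=1, rng=rng)
# typical error here is about sqrt(2) * 150_000 relative to the true sum
```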

## References

[DKM06]: C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, M. Naor. "Our Data, Ourselves: Privacy Via Distributed Noise Generation." Advances in Cryptology - EUROCRYPT 2006

[NSS07]: K. Nissim, S. Raskhodnikova, A. D. Smith. “Smooth sensitivity and sampling in private data analysis.” Proceedings of the 39th Annual ACM Symposium on Theory of Computing.

[DR14]: C. Dwork, A. Roth. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Theoretical Computer Science. Vol 9.