| | Scale of Dataset | Mean Absolute Error | Mean Scaled Absolute Error |
|---|---|---|---|
| Small dataset | 1,000 | 100 | 0.100 |
| Large dataset | 100,000 | 100 | 0.001 |
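As a quick check on the table, scaled absolute error is simply absolute error divided by the scale (number of records) of the dataset. A minimal Python sketch reproducing the two rows:

```python
def scaled_error(absolute_error, scale):
    """Scaled absolute error = absolute error / number of records."""
    return absolute_error / scale

print(scaled_error(100, 1_000))    # 0.1   -> small dataset row
print(scaled_error(100, 100_000))  # 0.001 -> large dataset row
```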
The queries we consider always ask for the count of individuals in the dataset who satisfy some condition. For example, on our ADULT dataset, we might ask:
How many individuals have capital loss < 1000?
Suppose that, on the small dataset, the true answer to this query is 216, or .216 as a fraction of the population. The scaled absolute error reported above (0.100) is significant: an analyst receiving .216 +/- .1 (in expectation) may draw very different conclusions about the query.
However, if the large dataset represents a larger sample from a similar population, we expect the query answer to be close to 21,600, still a .216 fraction of the population. The reported scaled absolute error (0.001) is now negligible, since most analysts would be happy with .216 +/- .001 (in expectation) as the answer to their query.
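To make this concrete, here is a hypothetical sketch of how such a counting query might be answered with the Laplace mechanism. The synthetic data, the epsilon value (0.01, chosen so that a sensitivity-1 count has expected absolute error of about 100, matching the table), and the helper names are illustrative assumptions, not the algorithm behind the numbers above.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_fraction(capital_loss, epsilon):
    """Answer 'how many individuals have capital loss < 1000?' with the
    Laplace mechanism (sensitivity 1), reported as a fraction of the data."""
    true_count = int((capital_loss < 1000).sum())
    noisy_count = true_count + rng.laplace(scale=1.0 / epsilon)
    return noisy_count / len(capital_loss)

# Illustrative stand-ins: 21.6% of records have capital loss below 1000.
small = np.concatenate([np.zeros(216), np.full(784, 2000.0)])        # 1,000 records
large = np.concatenate([np.zeros(21_600), np.full(78_400, 2000.0)])  # 100,000 records

print(noisy_fraction(small, epsilon=0.01))  # roughly .216 +/- .1
print(noisy_fraction(large, epsilon=0.01))  # roughly .216 +/- .001
```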
Scaled error of .1 is high: an algorithm offering scaled error of .1 will be useful only for estimating queries that test very common conditions (i.e., those satisfied by nearly all individuals).
Scaled error of .0001 is low: an algorithm offering scaled error of .0001 will be useful for many more conditions, even rare conditions satisfied by small fractions of the population.
Note that scaled error decreases as scale increases, and it also decreases as epsilon increases. In fact, the scale of the data and the epsilon privacy budget both contribute to the strength of the "signal" available to the analyst: when error is viewed in scaled terms, they have precisely the same effect, a property we call scale-epsilon exchangeability.
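A small simulation sketch can illustrate exchangeability, under the assumption that counts are released with the Laplace mechanism (sensitivity 1, expected absolute error 1/epsilon): doubling the scale and doubling epsilon reduce the expected scaled error by the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_scaled_error(n, epsilon, trials=200_000):
    """Estimate the expected scaled absolute error of a Laplace-noised count."""
    noise = rng.laplace(scale=1.0 / epsilon, size=trials)
    return np.abs(noise).mean() / n  # approximately 1 / (epsilon * n)

print(mean_scaled_error(n=1_000, epsilon=0.01))  # ~0.1
print(mean_scaled_error(n=2_000, epsilon=0.01))  # ~0.05  (double the scale)
print(mean_scaled_error(n=1_000, epsilon=0.02))  # ~0.05  (double the epsilon)
```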