Scale, Shape and Domain Size
The key characteristics of the input data that affect the error introduced by a differentially private algorithm.
Understanding the error of algorithms on specific input datasets (rather than in the worst case) requires knowledge of the dataset properties that can impact error. The following three properties significantly impact algorithm error in our experiments.
Scale
Scale refers to the number of records in the input dataset. Differential privacy attempts to hide properties of individual records while releasing aggregates over the entire dataset. It is therefore intuitive that as the number of records in the dataset increases, one should be able to draw more utility from the data despite the error incurred by a differentially private algorithm. As demonstrated in the discussion on scaled error, an algorithm that always incurs the same absolute error has lower scaled error on larger datasets. For some algorithms, both the absolute error and the scaled error decrease as scale increases.
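A minimal sketch of this effect, assuming scaled error is defined as total absolute error divided by the number of records (the helper names below are illustrative, not the benchmark's exact metrics): the Laplace mechanism's expected per-bin error depends only on epsilon, so its absolute error is roughly constant across scales while its scaled error shrinks as scale grows.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.1

def laplace_histogram(hist, epsilon):
    """Add Laplace noise (sensitivity 1 for histogram counts) to every bin."""
    return hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)

def absolute_error(true_hist, estimate):
    """Total L1 error between the true and estimated histograms."""
    return np.abs(true_hist - estimate).sum()

def scaled_error(true_hist, estimate):
    """Absolute error divided by the number of records (the scale)."""
    return absolute_error(true_hist, estimate) / true_hist.sum()

shape = np.array([0.5, 0.3, 0.15, 0.05])   # empirical distribution over 4 bins
for scale in [1_000, 10_000, 100_000]:
    hist = shape * scale                    # same shape, different scale
    noisy = laplace_histogram(hist, epsilon)
    print(scale,
          round(absolute_error(hist, noisy), 1),
          round(scaled_error(hist, noisy), 5))
```

The absolute error printed in the second column stays roughly constant (it depends only on epsilon and the domain size), so the scaled error in the third column drops by roughly the same factor as the increase in scale.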
Domain Size
Throughout our experiments, we assume that our input is represented as a histogram over a fixed number of bins. That is, we assume that the domain of each record in the input database is discrete (or is quantized into a finite number of bins if continuous). Domain size refers to the number of bins in the input histogram. While increasing or decreasing the domain size (by refining or coarsening the quantized domain) does not change the data itself, it changes the ability to query the data at finer or coarser granularities, respectively. Queries on datasets with a larger domain are at finer granularities and may suffer larger error.
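As an illustration, coarsening a one-dimensional histogram amounts to summing groups of adjacent bins. The `coarsen` helper below is a hypothetical sketch, assuming the new domain size divides the old one evenly; note that the total count (the scale) is unchanged, only the query granularity changes.

```python
import numpy as np

def coarsen(hist, new_domain_size):
    """Merge adjacent bins so the histogram has `new_domain_size` bins.

    Assumes len(hist) is a multiple of new_domain_size. The total count
    (scale) is preserved; only the granularity of the domain changes.
    """
    hist = np.asarray(hist)
    assert len(hist) % new_domain_size == 0
    return hist.reshape(new_domain_size, -1).sum(axis=1)

fine = np.array([4, 1, 0, 3, 7, 2, 5, 2])   # domain size 8
print(coarsen(fine, 4))                      # [5 3 9 7] -- same data, coarser bins
print(coarsen(fine, 2))                      # [ 8 16]
```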
Shape
Shape refers to the empirical distribution of the dataset. There are data-independent algorithms whose scaled error is not affected by the shape of the distribution. For data-dependent algorithms, however, shape crucially impacts error.
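As a rough illustration (a sketch with made-up helpers, not a description of the benchmarked algorithms), compare the data-independent Laplace mechanism, whose expected scaled error does not depend on shape, with a toy data-dependent baseline that estimates a noisy total and spreads it uniformly over the bins: the baseline is nearly exact on uniform data but degrades badly as the shape becomes skewed.

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon, scale, bins = 0.1, 100_000, 64

def scaled_l1(true_hist, estimate):
    return np.abs(true_hist - estimate).sum() / true_hist.sum()

def laplace(hist):
    """Data-independent: error depends only on epsilon and the domain size."""
    return hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)

def uniform_baseline(hist):
    """Toy data-dependent baseline: a noisy total spread evenly over all bins."""
    noisy_total = hist.sum() + rng.laplace(scale=1.0 / epsilon)
    return np.full_like(hist, noisy_total / len(hist), dtype=float)

uniform_shape = np.full(bins, 1.0 / bins)
skewed_shape = np.sort(rng.dirichlet(np.full(bins, 0.1)))[::-1]   # heavy-tailed shape

for name, shape in [("uniform", uniform_shape), ("skewed", skewed_shape)]:
    hist = shape * scale
    print(name,
          round(scaled_l1(hist, laplace(hist)), 4),
          round(scaled_l1(hist, uniform_baseline(hist)), 4))
```

The Laplace column is roughly the same for both shapes, while the baseline's error jumps by orders of magnitude when the shape is skewed.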
Data Generation
Prior empirical evaluations compared the error of algorithms across multiple datasets that varied in all three characteristics: scale, shape, and domain size. This resulted in gaps, and even inconsistencies, in our knowledge about algorithm performance (see the DPBench paper for more details).
Hence, in our experiments, we explicitly control for these three characteristics and systematically evaluate algorithms. Given an input dataset, we extract the shape of the dataset and generate multiple synthetic datasets at different scales (by sampling from the shape distribution) and different domain sizes (by coarsening the domain). This allows us to, for instance, hold shape and domain size fixed while exploring the effect of scale on error, or hold scale and domain size fixed while characterizing the effect of shape on algorithm performance.
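A minimal sketch of this generation pipeline, assuming the input is a one-dimensional histogram; the function names are illustrative rather than the benchmark's actual code.

```python
import numpy as np

rng = np.random.default_rng(2)

def extract_shape(hist):
    """Empirical distribution (shape) of the dataset: the normalized histogram."""
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()

def sample_at_scale(shape, scale):
    """Synthetic histogram with the same shape but a chosen number of records."""
    return rng.multinomial(scale, shape)

def coarsen(hist, new_domain_size):
    """Reduce the domain size by merging adjacent bins (same helper as sketched above)."""
    return np.asarray(hist).reshape(new_domain_size, -1).sum(axis=1)

original = np.array([120, 80, 40, 30, 20, 5, 3, 2])   # some input histogram
shape = extract_shape(original)

# Vary scale while shape and domain size stay fixed.
synthetic_small = sample_at_scale(shape, 1_000)
synthetic_large = sample_at_scale(shape, 1_000_000)

# Vary domain size while scale (and, approximately, shape) stays fixed.
coarser = coarsen(synthetic_large, 4)
```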