Identifying Statistical Bias in Dataset Replication


We discuss our paper on diagnosing bias in dataset replication studies. Zooming in on the ImageNet-v2 reproduction effort, we explain the majority of the accuracy drop between ImageNet and ImageNet-v2: after accounting for bias in the data collection process, the estimated drop shrinks from 11.7% to 3.6%.

Measuring Progress in Supervised Learning

In the last few years, researchers have made extraordinary progress on increasing accuracy on vision tasks like those in the ImageNet, CIFAR-10, and COCO datasets. Progress on these tasks is promising, but comes with an important caveat: the test sets used to measure performance are finite, fixed, and have been used and re-used by countless researchers over several years.

There are (at least) two possible ways in which evaluating solely with test-set accuracy could hinder our progress on the tasks researchers design benchmarks to proxy (e.g., general image classification, in the case of ImageNet). The first of these issues is adaptive overfitting: since each dataset has only one test set to measure performance on, algorithmic progress on that (finite and fixed) test set could be mistaken for algorithmic progress on the distribution from which the test set was drawn.

The second issue that could arise is oversensitivity to irrelevant properties of the test distribution arising from the dataset collection process; for example, the image encoding algorithm used to save images.

How can we assess whether models are truly making progress on the tasks that our benchmarks proxy?

Dataset replication

A promising approach to diagnosing the two above issues is dataset replication, in which one mimics the original test set creation process as closely as possible to make a new dataset. Then, existing models’ performance on this newly created test set should identify any models that have adaptively overfit to the original test set. Moreover, since every intricacy of a dataset collection process cannot be mimicked exactly, natural variability in the replication should help us unearth cases of algorithms’ oversensitivity to the original dataset creation process.

The problem of replicating the original test set creation process is harder than it may initially appear. Particularly challenging is controlling for relevant covariates, i.e., variables that we expect to have an impact on the outcome (here, model accuracy) we measure. (In experimental design, a covariate is a variable that is not the independent variable (in our case, the choice of dataset), but that affects measurements of the dependent variable (in our case, the accuracy). For example, suppose we wanted to replicate a study about the effect of a certain drug on an age-linked disease. After gathering subjects, we would have to reweight or filter them so that the age distribution matches that of the original study, as otherwise the results of the two studies are incomparable. This filtering/reweighting step is analogous to dataset replication, with participant age being the relevant covariate.)

To match covariate distributions between the new reproduction and the original dataset, we frame dataset replication as a two-step process. In step one, the replicator collects candidate data using a data pipeline as similar as possible to that used by the original dataset creators. Then, after approximating the original pipeline, the dataset replicator should identify the relevant covariate statistic(s), and choose candidates (via filtering or reweighting the collected candidate data) so that the distributions of the statistic under the replicated and original datasets are equal. We call this process statistic matching. (In the causal inference literature, statistic matching is often referred to as enforcing covariate balance.)


In their recent work, Recht et al. use dataset replication to take a closer look at model performance on ImageNet, one of the most widely used computer vision datasets. They first use an apparatus similar to that of the original ImageNet paper to collect a large set of candidate images and labels from the photo-sharing site Flickr.

Then, in the second step, they identify and control for a covariate called the “selection frequency.” Selection frequency is a measure of how frequently a (time-limited) human decides that an image (with its candidate label) is “correctly labeled.” We can estimate selection frequencies by asking crowd workers whether a given image truly corresponds to its candidate label.

Selection frequency is a very reasonable (and in some sense, the “right”) covariate to control for, since the original ImageNet dataset was filtered using a similar measure. (To construct ImageNet, the authors scraped many candidate image-label pairs and, just as above, asked crowd workers whether each image corresponded to its label; filtering was done via a convincing-majority vote procedure, making the selection frequency a relevant metric.)

Recht et al. obtain empirical selection frequency estimates for the ImageNet validation set and their collected candidate pool by presenting each (image, label) pair to 10 crowd workers. (In reality, datapoints are presented to crowd workers in groups of 48 images, all corresponding to the same candidate label, and annotators are asked to select the images in each group that truly correspond to the label.)

By filtering on the selection frequency in various ways, the authors come up with three new ImageNet-v2 datasets. One of these, called MatchedFrequency, is a “true” dataset replication in the sense that it tries to control for the selection frequency by filtering the candidate images and labels so that the resulting selection frequency distribution matches that of the original ImageNet validation set. Since our focus is on dataset replication, from now on we’ll ignore the other datasets and use “MatchedFrequency” and “ImageNet-v2” interchangeably. For convenience, we’ll also use “ImageNet” and “ImageNet-v1” interchangeably to refer to the original ImageNet validation set. (Since the ImageNet test set is not released, the validation set usually acts as a de facto test set, and the two terms, validation and test, are used interchangeably to refer to the validation set.)

The key observation made by the ImageNet-v2 authors—and the one that we’ll focus on in this post—is a consistent drop in accuracy that models suffer when evaluated on ImageNet-v2 instead of ImageNet:

Each dot in the scatter plot above represents a model—the $x$ coordinate is the model’s ImageNet accuracy, and the $y$ coordinate is the model’s accuracy when evaluated on ImageNet-v2. (You can mouse over the dots to see model names!) Ideally, since the generating pipelines for the two datasets are similar, and the relevant covariates were matched, one would expect all of the models to fall on the dotted line $y = x$. Yet, across the examined models, their accuracies dropped by an average of 11.7% between ImageNet and ImageNet-v2.

Our Findings

The significant accuracy drop above presents an empirical mystery given the similarity in data pipeline between the two datasets. Why do classifiers perform so poorly on ImageNet-v2?

In our work, we identify an aspect of the dataset reproduction process that might lead to a significant accuracy drop. The general phenomenon we identify is that in dataset replication, even mean-zero noise in measurements of the relevant covariate/control variable can result in significant bias in the resulting matched dataset if not accounted for.

In the case of ImageNet-v2, we show how noisy readings of the selection frequency statistic can result in a bias in the ImageNet-v2 dataset towards lower selection frequency and, consequently, lower accuracy. (Recht et al. previously observed that changes in selection frequency affect model performance.)

After accounting for this bias and thus controlling for selection frequency, we estimate that the adjusted ImageNet to ImageNet-v2 accuracy gap is less than or equal to 3.6% (instead of the initially observed 11.7%).

In the next few sections, we’ll focus on ImageNet-v2, first trying to get a clearer picture of the source of statistical bias in this dataset replication effort, and then discussing ways to correct for it.

Identifying the bias

Earlier, we decomposed the data replication process into two steps: (1) replicating the pipeline, and (2) controlling for covariates via statistic matching. To perform the latter of these two steps, one first needs to find the distribution of the relevant statistic (e.g., selection frequency, in the case of ImageNet) for both the original test set as well as the newly collected data. Just as with any empirical statistic, however, we can’t read the true selection frequency $s(x)$ for any given image $x$. We can only sample from a binomial random variable,

\[ \widehat{s}(x) = \frac{1}{n}\text{Binomial}(n, s(x)), \]

to obtain a selection frequency estimate, where $n$ is the number of crowd workers used.
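To make this concrete, here is a minimal simulation of how annotator votes turn into empirical selection frequencies. The true selection frequency $s(x) = 0.645$ and the random seed are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10          # annotators per image, as in the ImageNet-v2 pipeline
s_true = 0.645  # hypothetical true selection frequency for one image

# Each reading is Binomial(n, s_true) / n: ask n annotators, count "yes" votes.
readings = rng.binomial(n, s_true, size=100_000) / n

print(round(readings.mean(), 3))   # close to 0.645: the estimator is unbiased
print(np.unique(readings))         # but every single reading is a multiple of 1/10
```

The average of many readings recovers $s(x)$, but any individual reading is quantized to a multiple of $1/n$, and it is these individual noisy readings that the matching step consumes.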

Now, even though the expected value of the measurement we get is indeed $s(x)$ (i.e., $\widehat{s}(x)$ is an unbiased estimator of $s(x)$), the reading itself will be $k/n$ for some integer $k$ (with $n=10$, in the case of the ImageNet-v2 replication process). (Note that a reading is not the same as the underlying selection frequency: an image-label pair $x$ might have a true selection frequency of $s(x) = 0.645$, meaning that on average a crowd annotator is 64.5% likely to say that the image corresponds to the label; if we ask 10 crowd annotators to label the pair, however, we never observe this 64.5% number.)

To see why this seemingly innocuous fact impacts the dataset replication process, suppose we knew exactly what the distribution of true selection frequency looked like for both the original ImageNet test set and the candidate images collected through the ImageNet-v2 replication pipeline:

Suppose now that we estimate a selection frequency of $\widehat{s}(x) = 0.7$ for a given image $x$ (i.e., seven of the ten workers who were shown the image and its candidate label marked it as correctly labeled). The key observation here is that, given this empirical selection frequency, the most likely value for $s(x)$ is not $\widehat{s}(x)$ (even though $\widehat{s}(x)$ is an unbiased estimator of $s(x)$). Instead, the maximum likelihood estimate actually depends on which dataset $x$ came from!

If $x$ was sourced from ImageNet, it’s more likely that $s(x) > 0.7$ and therefore that $\widehat{s}(x)$ is an underestimate, since most of the mass of the ImageNet selection frequency distribution—see our (hypothetical) plot above—is on $s(x) > 0.7$. Conversely, if $x$ was a newly collected candidate image, then most likely $s(x) < 0.7$.

Remark: Mathematically, the phenomenon is that $E[\widehat{s}(x)|s(x)] = s(x)$, but $E[s(x)|\widehat{s}(x)]$ is not equal to $\widehat{s}(x)$, and instead depends on the distribution of $s(x)$ via Bayes' rule. Specifically, for a given dataset we have $$p(s|\widehat{s}) = \frac{p_{data}(s) \cdot p_{binom}(\widehat{s}|s)} {\int p_{data}(s') p_{binom}(\widehat{s}|s')\ ds'},$$ where $p_{data}$ is the density function for selection frequencies under this dataset, and $p_{binom}$ is the binomial probability mass function.
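This Bayes' rule computation is easy to sketch numerically. The two Beta-shaped priors below are purely hypothetical stand-ins for the ImageNet and candidate-pool selection frequency distributions; the shape parameters are our assumptions:

```python
import numpy as np
from math import comb

# Discretize possible true selection frequencies s on a fine grid.
grid = np.linspace(0.001, 0.999, 999)

def beta_prior(a, b):
    """Unnormalized Beta(a, b) density over the grid (a hypothetical p_data)."""
    p = grid ** (a - 1) * (1 - grid) ** (b - 1)
    return p / p.sum()

# Assumed shapes: ImageNet skews "easy" (it was pre-filtered for quality),
# while the freshly scraped candidate pool is much flatter.
prior_imagenet = beta_prior(8, 2)
prior_candidates = beta_prior(2, 2)

def posterior_mean(prior, k, n):
    """E[s | s_hat = k/n], computed via Bayes' rule with a binomial likelihood."""
    likelihood = comb(n, k) * grid ** k * (1 - grid) ** (n - k)
    post = prior * likelihood
    post /= post.sum()
    return float((grid * post).sum())

pm_in = posterior_mean(prior_imagenet, k=7, n=10)
pm_cand = posterior_mean(prior_candidates, k=7, n=10)
print(pm_in, pm_cand)   # the same reading 0.7 is "read back" differently
```

Under these assumed priors, the identical observation $\widehat{s}(x) = 0.7$ maps to a posterior mean above 0.7 for the ImageNet-style prior and below 0.7 for the candidate-style prior.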

The interactive graph below visualizes this intuition: by moving the slider, you can adjust the “observed” selection frequency $\widehat{s}(x)$. The shaded curves then show what our belief about the corresponding true selection frequency $s(x)$ looks like, depending on which pool the image came from.


Thus, if the selection frequency distribution of ImageNet is skewed upwards compared to that of the collected candidate images (which is likely, since ImageNet was already filtered for quality when it was constructed: ImageNet was originally built by collecting candidate images in a way similar to the one described here, and then filtering for quality using a "majority vote" based on selection frequency), matching the two data sources via observed selection frequency will result in ImageNet images having systematically higher true selection frequencies than their new ImageNet-v2 counterparts. Moreover, the noisier our readings of observed selection frequency are (i.e., the fewer annotators we use), the more important the source distribution becomes, and so the greater the effect of the bias. Indeed, if we could use infinitely many annotators for each image, the bias would disappear: we would have perfect readings of the corresponding selection frequency, the only possible value for $s(x)$ would be $s(x) = \widehat{s}(x)$, and the shaded distributions above would collapse into a point mass at the green line.

The bias we are describing is summarized visually in the interactive graph below: by adjusting the sliders, you can manipulate the distributions of “true” selection frequency for ImageNet (red) and for the candidate data (black), as well as the number of annotators used to estimate selection frequencies. The distribution of ImageNet-v2 selection frequencies resulting from statistic matching is shown in blue. Notice that as long as the candidate selection frequencies do not come from the same distribution as the ImageNet selection frequencies, the resulting ImageNet-v2 test set never matches the original test set statistics. For the reasons we discussed earlier, having fewer annotators, or a bigger gap between the ImageNet and candidate distributions, also exacerbates the effect.

An interactive graph depicting the source of bias in the ImageNet-v2 generation process. Interact with the sliders below to change the "easiness" (selection frequency) of the ImageNet-v1 and candidate image datasources, and the number of annotators used to measure selection frequency (click the legend to show/hide lines).
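A rough numerical version of the same story: draw "true" selection frequencies from two assumed Beta distributions (all shape parameters below are made up for illustration), match the candidate pool to ImageNet on the observed selection frequencies, and then inspect the true ones:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Hypothetical "true" selection-frequency distributions (Beta shapes are
# assumptions): ImageNet was pre-filtered for quality, so it skews easy.
s_imagenet = rng.beta(8, 2, size=N)
s_candidates = rng.beta(2, 2, size=N)

def matched_true_mean(n_annotators):
    """Match candidates to ImageNet on *observed* selection frequency, then
    report the mean *true* selection frequency of the matched pool."""
    obs_in = rng.binomial(n_annotators, s_imagenet) / n_annotators
    obs_cand = rng.binomial(n_annotators, s_candidates) / n_annotators
    matched = []
    for k in range(n_annotators + 1):
        pool = s_candidates[obs_cand == k / n_annotators]
        need = int((obs_in == k / n_annotators).sum())
        if len(pool) and need:   # skip the (rare) empty bins in this sketch
            matched.append(rng.choice(pool, size=need, replace=True))
    return float(np.concatenate(matched).mean())

print(round(s_imagenet.mean(), 3))        # mean true s for "ImageNet"
print(round(matched_true_mean(10), 3))    # lower, despite matching on obs
print(round(matched_true_mean(100), 3))   # more annotators, less bias
```

Even though the matched pool agrees with "ImageNet" on every observed value $k/n$, its mean true selection frequency comes out lower, and the gap narrows as the number of annotators grows, mirroring the behavior of the interactive graph.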

Quantifying the Effects of Bias

The model above, paired with the fact that ImageNet (having already been filtered for quality) is likely of much higher quality than candidate images sourced from Flickr, predicts that ImageNet-v2 images’ selection frequencies are consistently lower than those of ImageNet images. To test this theory, we set up another crowdsourced task that is extremely similar (but not quite identical: we implemented a few changes for quality control while keeping the task instructions and interface constant; the exact differences are outlined in Appendix B.2 of our paper) to the one used by the ImageNet-v2 creators, this time using 40 annotators per image (instead of 10) to estimate its selection frequency. A histogram of the selection frequencies we observed in our experiment is shown below:

Even though the ImageNet-v2 creators report average selection frequencies for ImageNet and ImageNet-v2 of 0.71 and 0.73 respectively, our new experiments yield average selection frequencies of 0.85 and 0.81; note the change in relative ordering. (Why are the selection frequencies higher overall? We discuss this in depth in Appendix B.2 of our paper. The task and instructions are the same, so we hypothesize that the discrepancy boils down to either (a) data quality: we used worker qualifications to ensure annotator quality while the original experiment did not, and qualifications have been shown to reduce the share of low-quality or inattentive crowd workers from 34% to 2% in other studies (e.g., the study we reference found that 16% of workers without qualifications, versus 0.4% with, reported having had a fatal heart attack while watching television); or (b) data makeup: workers are presented with grids of 48 images at a time in both experiments, with grids containing a mix of ImageNet, ImageNet-v2, and candidate images, but the exact proportions of this mix differ between the two experiments.)

Aside: Why did we need to run a new crowdsourced study to observe this gap, instead of using the data already collected for the ImageNet-v2 study? The answer is finite-sample reuse: the selection frequencies collected in the original study are precisely the ones used to filter the ImageNet-v2 dataset. So, by construction, these selection frequencies will match those of the ImageNet test set, regardless of whether there is bias in the selection process. (To draw a crude analogy, suppose that instead of matching image datasets we are matching piles of coins: Pile A is rigged, $P(\text{heads})=1$, but Pile B is fair, $P(\text{heads}) = 0.5$. We flip all the coins in both piles 10 times each; inevitably (if there are enough total coins in Pile B), some of the Pile B coins will land heads all 10 times, and will thus appear identical to the rigged Pile A coins. Are they in fact identical? After all, these coins match the Pile A coins according to the "number of heads" statistic! The answer is obviously no, but the key is that even though all the coins in Pile B are fair (and flipping them another 10 times would reveal this), it is impossible to conclude anything other than $P(\text{heads}) = 1$ solely from the already-collected data on the selected coins.) If we're careful about avoiding this finite-sample reuse (for example, by re-performing the filtering process using half of the annotators and then measuring selection frequencies with the other half), we can actually identify bias in the original data; the process for doing so is shown in Appendix C of our paper.
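The coin-pile analogy is easy to simulate; the pile size and random seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pile B: 100,000 fair coins, each flipped 10 times.
heads = rng.binomial(10, 0.5, size=100_000)

# Keep only the coins that landed heads all 10 times: on the already-collected
# flips, these look identical to rigged Pile A coins with P(heads) = 1.
n_selected = int((heads == 10).sum())
print(n_selected)   # roughly 100_000 / 2**10, i.e. on the order of 100 coins

# A fresh batch of 10 flips per selected coin reveals that they are fair:
fresh = rng.binomial(10, 0.5, size=n_selected)
print(fresh.mean() / 10)   # close to 0.5, not 1.0
```

The already-collected flips on the selected coins are useless for detecting the bias; only fresh flips (i.e., fresh annotations, as in our new study) can reveal it.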

How does this bias affect measured accuracy?

Our model and experiments suggest that matching empirical statistics from different sources introduces bias into the dataset replication pipeline, and that in the case of ImageNet-v2 this means that selection frequencies are actually lower for the new ImageNet-v2 test set compared to the old one. Since selection frequency is meant to roughly reflect data quality, and is known (as found already by the ImageNet-v2 authors) to affect model accuracy, we expect the downwards selection frequency bias in ImageNet-v2 to directly translate into a downwards bias in model accuracy.

To test if this is really the case, we use a progressively increasing number of annotators $n$ out of the 40 that we collected. For each $n$, we match the ImageNet-v2 observed selection frequencies (calculated using $n$ annotators) to the ImageNet ones, and measure the resulting model accuracies. (We perform this matching via reweighting rather than filtering; more details are given in the next section. For context, the ImageNet-v2 creators matched their candidate pool to ImageNet using $n = 10$.) Our statistical model predicts that more annotators means less noise in the observed selection frequencies, which in turn means less bias, and so we should see the resulting model accuracies increase. The data confirms this prediction: below we plot model accuracies on ImageNet versus their adjusted accuracies on ImageNet-v2. Using the slider below the graph, you can vary the number of annotators used to make the adjustment from zero (i.e., no matching, just raw ImageNet-v2 accuracies) to 40 (accuracies after statistic matching using all 40 annotations).


After using 40 workers to control for selection frequency between ImageNet and ImageNet-v2, we reduce the originally observed 11.7% gap to 5.7%. This is already a significant reduction, but the trend of the graph suggests that 5.7% is still an overestimate: the gap continues to shrink consistently with each increase in the number of annotators. In the final part of this post, we’ll use a technique from classical statistics to get an even better estimate of the real, bias-adjusted gap between ImageNet and ImageNet-v2 model accuracies.

Adjusting for Bias with the Statistical Jackknife

The statistic matching that led to the previous graph was based on what we’ll call a selection frequency-adjusted accuracy estimator, defined for a given classifier $f$ as:

\[ \text{Acc}(n) = \sum_{k=1}^n E_{x\sim \text{ImageNet-v2}}\left(1[\text{$f$ is correct on }x] | \widehat{s}(x) = \frac{k}{n}\right)\cdot P_{x\sim \text{ImageNet}}\left(\widehat{s}(x) = \frac{k}{n}\right) \]

This estimator has a simple interpretation: it is equivalent to (a) sampling an ImageNet-v1 image and observing its (empirical) selection frequency; then (b) finding a random ImageNet-v2 image with the same (empirical) selection frequency, and recording the classifier’s correctness on that input. So what we are estimating here is what a model’s accuracy on ImageNet-v2 would be if the selection frequencies of ImageNet-v2 were distributed as in the ImageNet test set. (Notice that if the ImageNet and ImageNet-v2 selection frequency distributions already matched, then this estimator would be independent of $n$ and would evaluate to exactly the model's accuracy on ImageNet-v2.)
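A sketch of this estimator on synthetic data (the helper name and the Beta/binomial toy data are our own illustrative assumptions, not the paper's actual measurements):

```python
import numpy as np

def adjusted_accuracy(correct_v2, obs_v2, obs_v1, n):
    """Selection-frequency-adjusted accuracy Acc(n): per-bin ImageNet-v2
    accuracy, reweighted by ImageNet-v1's observed selection frequencies."""
    acc = 0.0
    for k in range(n + 1):
        mask = np.isclose(obs_v2, k / n)
        if mask.any():
            # E[correct | obs = k/n] under v2, times P(obs = k/n) under v1
            acc += correct_v2[mask].mean() * np.isclose(obs_v1, k / n).mean()
    return acc

# Toy synthetic data (Beta shapes and the toy classifier are assumptions):
rng = np.random.default_rng(0)
s_v1 = rng.beta(8, 2, 50_000)            # "true" v1 selection frequencies
s_v2 = rng.beta(5, 3, 50_000)            # the v2-style pool skews harder
n = 10
obs_v1 = rng.binomial(n, s_v1) / n
obs_v2 = rng.binomial(n, s_v2) / n
correct_v2 = rng.random(50_000) < s_v2   # correctness rises with true s

raw = correct_v2.mean()
adj = adjusted_accuracy(correct_v2, obs_v2, obs_v1, n)
print(raw, adj)
```

In this toy setup, the adjusted accuracy comes out above the raw v2 accuracy, since reweighting shifts mass toward the easier (higher selection frequency) bins, the same direction of adjustment seen in the post's graph.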

Now, the $\text{Acc}(n)$ estimator is subject to the same bias as dataset replication itself, as it too ignores the discrepancy between the empirical selection frequency $\widehat{s}(x)$ and the true selection frequency $s(x)$. Since we’ve been talking about bias pretty abstractly in this post, it’s worth noting that the $\text{Acc}(n)$ estimator ties everything back to the formal, statistical definition of bias. Specifically, our main finding can be restated (though maybe less intuitively) as “$\text{Acc}(n)$ is a downwards-biased estimator of the true reweighted accuracy,” that is

\[ E[\text{Acc}(n)] < \lim_{n\rightarrow\infty} \text{Acc}(n). \]

So, from this perspective, the graph in the previous section can be viewed as just a plot of the value of $\text{Acc}(n)$ for every classifier at various values of $n$. The estimator also behaves exactly as predicted by our model of the bias: as $n$ increases, $\widehat{s}(x)$ becomes a less noisy estimator of $s(x)$, so the bias in the matching process decreases and $\text{Acc}(n)$ increases.

Now, what we really want to know is what $\text{Acc}(n)$ looks like as $n \rightarrow \infty$, especially given that even when we use 40 annotators for statistic matching the adjusted accuracy still improves.

In our paper we present further techniques for tackling this problem, including tools from empirical Bayes estimation, beta-binomial regression, and kernel density estimation. To keep things short here, we’ll only discuss the simplest estimation method we use: one based on a technique known as the statistical jackknife.

The jackknife dates back to the work of Maurice Quenouille and John Tukey in the 1950s, and provides a way to estimate the bias of any statistical estimator (technically, certain mild assumptions are needed, such as the estimator being statistically consistent and having bias that is analytic in $1/n$). In short, the jackknife bias estimate for an $n$-sample estimator $\widehat{\theta}_n(X_1,\ldots,X_n)$ is given by

\[ b_{jack}(\widehat{\theta}_n) = (n-1)\cdot \left(\frac{1}{n}\sum_{i=1}^{n} \widehat{\theta}_{n-1}^{(i)} - \widehat{\theta}_n \right), \]

\[ \text{where}\qquad \widehat{\theta}_{n-1}^{(i)} = \widehat{\theta}_{n-1}(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n) \text{ is the $i$th leave-one-out estimate.} \]

A brief summary: consider our $n$-sample estimator $\Theta_n$, and define $\Theta$ to be the true value being estimated, i.e. $\lim_{n\rightarrow\infty} \Theta_n$. In our case, $\Theta$ would be the adjusted accuracy on ImageNet-v2 if we had infinitely many workers. Now, suppose that the bias in $\Theta_n$ is on the order of $1/n$, i.e.,

\[ E[\Theta_n] = \Theta + \frac{b(\Theta)}{n}. \]

For a given $n$-sample estimate $\widehat{\Theta}_n$, we define the leave-one-out estimator $\widehat{\Theta}_n^{(-i)}$ to be $\widehat{\Theta}_n$ computed with all but the $i$th datapoint; there are $n$ such leave-one-out estimators for a given $\widehat{\Theta}_n$. The main idea behind the jackknife is to make the following two approximations:

\[ \widehat{\Theta}_n \approx E[\Theta_n] = \Theta + \frac{b(\Theta)}{n} \]

\[ \frac{1}{n} \sum_{i=1}^{n} \widehat{\Theta}_n^{(-i)} \approx E[\Theta_{n-1}] = \Theta + \frac{b(\Theta)}{n-1} \]

Using these two approximations, one can solve for $b(\Theta)$ and arrive at the bias estimate given above.
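The jackknife formula is only a few lines of code. As a sanity check (this demo is ours, not from the paper), we apply it to an estimator whose $O(1/n)$ bias is known exactly, the plug-in variance; jackknife-correcting it recovers the unbiased sample variance:

```python
import numpy as np

def jackknife_bias(theta, x):
    """Jackknife bias estimate: (n-1) * (mean of the n leave-one-out
    estimates minus the full n-sample estimate)."""
    n = len(x)
    loo = np.array([theta(np.delete(x, i)) for i in range(n)])
    return (n - 1) * (loo.mean() - theta(x))

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=200)

plug_in = np.var(x)                       # biased: E[plug_in] = sigma^2 * (n-1)/n
corrected = plug_in - jackknife_bias(np.var, x)
print(plug_in, corrected)                 # corrected equals np.var(x, ddof=1)
```

For the plug-in variance the correction is exact: the jackknife-corrected value coincides with the $1/(n-1)$-normalized sample variance, a classical property of the jackknife.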

Averaging across all of the classifiers studied, the jackknife estimates the bias in our estimator as about 1.0%, meaning that the bias-corrected gap between ImageNet and ImageNet-v2 shrinks from 11.7% (without any correction) to 5.7% (using the 40 annotators we have to correct), to 4.7% (using this additional jackknife bias correction). Moreover, as our paper discusses, this is almost certainly still an overestimate—using more refined methods for bias estimation reduces the gap to somewhere between 3.4% and 3.8% (with variation being across different methods).

Summary and Conclusions

We find that noise—even if it is mean-zero—can result in bias in dataset reproductions if not accounted for. Zooming in on the ImageNet-v2 replication effort, we find that a majority of the observed accuracy drop can be explained by a measurable covariate: selection frequency. Looking forward, knowing this source of the accuracy drop will allow us to focus on making models robust to changes in a quantifiable axis (rather than a seemingly ambiguous distribution shift). More broadly, our results suggest that modeling aspects of the data collection process is a useful tool in data replication generally. For more details check out our paper, where we discuss precise modeling procedures, further experiments, and future directions.
