*We discuss our paper
on diagnosing bias in dataset replication
studies. Zooming in on the ImageNet-v2
reproduction effort, we explain the majority of the accuracy drop between
ImageNet and ImageNet-v2 (reducing the gap from 11.7% to an estimated 3.6%) after accounting
for bias in the data collection process.*

## Measuring Progress in Supervised Learning

In the last few years, researchers have made extraordinary progress on increasing accuracy on vision tasks like those in the ImageNet, CIFAR-10, and COCO datasets. Progress on these tasks is promising, but comes with an important caveat: the test sets used to measure performance are finite, fixed, and have been used and re-used by countless researchers over several years.

There are (at least) two ways in which evaluating solely with test-set accuracy could hinder progress on the tasks that benchmarks are designed to proxy (e.g. general image classification for ImageNet). The first of these issues is adaptive overfitting: since each dataset has only one test set to measure performance on, algorithmic progress on that (finite and fixed) test set could be mistaken for algorithmic progress on the distribution from which the test set was drawn.

The second issue that could arise is oversensitivity to irrelevant properties of the test distribution arising from the dataset collection process; for example, the image encoding algorithm used to save images.

*How can we assess whether models are truly making progress on the tasks that
our benchmarks proxy*?

### Dataset replication

A promising approach to diagnosing the two above issues is *dataset
replication*, in which one mimics the original test set creation process as
closely as possible to make a new dataset. Then, existing models’ performance on
this newly created test set should identify any models that have adaptively
overfit to the original test set. Moreover, since every intricacy of a dataset
collection process cannot be mimicked exactly, natural variability in the
replication should help us unearth cases of algorithms’ oversensitivity to the
original dataset creation process.

The problem of replicating the original test set creation process is harder than it may initially appear. Particularly challenging is controlling for relevant

To match covariate distributions between the new reproduction and the original
dataset, we frame dataset replication as a two-step process. In step one, the
replicator collects candidate data using a data pipeline as similar as possible
to that used by the original dataset creators. Then, after approximating the
original pipeline, the dataset replicator should identify the relevant covariate
statistic(s), and choose candidates (via filtering or reweighting the collected
candidate data) so that the distributions of the statistic under the replicated
and original datasets are equal. We call this process

### ImageNet-v2

In their recent work, Recht et al. use dataset replication to take a closer look at model performance on ImageNet, one of the most widely used computer vision datasets. They first use an apparatus similar to that of the original ImageNet paper to collect a large set of candidate images and labels from the photo-sharing site Flickr.

Then, in the second step, they identify and control for a covariate called the “selection frequency.” Selection frequency is a measure of how frequently a (time-limited) human decides that an image (with its candidate label) is “correctly labeled.” We can estimate selection frequencies by asking crowd workers questions like:

Selection frequency is a very reasonable (and in some sense, the
“right”) covariate to control for, since the original ImageNet dataset
was

Recht et al. obtain empirical selection
frequency estimates for the ImageNet validation set and their collected
candidate pool by

By filtering on the selection frequency in various ways, the authors come up
with three new `ImageNet-v2` datasets. One of these, called `MatchedFrequency`,
is a “true” dataset replication in the sense that it tries
to control for the selection frequency by filtering the candidate images and
labels so that the resulting selection frequency distribution matches that of
the original ImageNet validation set.
Since our focus is on dataset
replication, from now on we’ll ignore the other datasets, and use
“MatchedFrequency” and “ImageNet-v2” interchangeably. For convenience, we’ll
also use “ImageNet” and “ImageNet-v1” interchangeably to refer to the original
ImageNet

The key observation made by the ImageNet-v2 authors—and the one that we’ll
focus on in this post—is a *consistent* drop in accuracy that models suffer
when evaluated on ImageNet-v2 instead of ImageNet:

Each dot in the scatter plot above represents a model—the $x$ coordinate is the model’s ImageNet accuracy, and the $y$ coordinate is the model’s accuracy when evaluated on ImageNet-v2. (You can mouse over the dots to see model names!) Ideally, since the generating pipelines for the two datasets are similar, and the relevant covariates were matched, one would expect all of the models to fall on the dotted line $y = x$. Yet, across the examined models, their accuracies dropped by an average of 11.7% between ImageNet and ImageNet-v2.

## Our Findings

The significant accuracy drop above presents an empirical mystery given the similarity in data pipeline between the two datasets. Why do classifiers perform so poorly on ImageNet-v2?

In our work, we identify an aspect of the dataset reproduction process that might lead to a significant accuracy drop. The general phenomenon we identify is that in dataset replication, even mean-zero noise in measurements of the relevant covariate/control variable can result in significant bias in the resulting matched dataset if not accounted for.

In the case of ImageNet-v2, we show how noisy readings of the selection
frequency statistic can result in bias in the ImageNet-v2 dataset towards lower
selection frequency and consequently, lower

After accounting for this bias and thus controlling for selection frequency, we estimate that the adjusted ImageNet to ImageNet-v2 accuracy gap is less than or equal to 3.6% (instead of the initially observed 11.7%).

In the next few sections, we’ll focus on ImageNet-v2, first trying to get a clearer picture of the source of statistical bias in this dataset replication effort, and then discussing ways to correct for it.

### Identifying the bias

Earlier, we decomposed the data replication process into two steps: (1) replicating the pipeline, and (2) controlling for covariates via statistic matching. To perform the latter of these two steps, one first needs to find the distribution of the relevant statistic (e.g., selection frequency, in the case of ImageNet) for both the original test set and the newly collected data. As with any empirical statistic, however, we can’t read the true selection frequency $s(x)$ for any given image $x$. We can only sample from a binomial random variable,

\[ \widehat{s}(x) = \frac{1}{n}\text{Binomial}(n, s(x)), \]

to obtain a selection frequency estimate, where $n$ is the number of crowd workers used.
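To make the measurement model above concrete, here is a minimal simulation sketch. The true selection frequency (0.7), the number of workers (10), and the function name are illustrative assumptions rather than values or code from the study. It repeatedly draws $\widehat{s}(x)$ for a single image and reports the mean and spread of the estimates.

```python
import random
import statistics


def estimate_selection_frequency(true_s, n_workers, rng):
    """Draw one estimate s_hat = Binomial(n_workers, true_s) / n_workers."""
    # Each simulated worker independently says "correctly labeled" with probability true_s.
    votes = sum(rng.random() < true_s for _ in range(n_workers))
    return votes / n_workers


rng = random.Random(0)
true_s = 0.7     # hypothetical true selection frequency of one image
n_workers = 10   # hypothetical number of crowd workers per image

estimates = [estimate_selection_frequency(true_s, n_workers, rng) for _ in range(20000)]
mean_estimate = statistics.mean(estimates)
spread = statistics.pstdev(estimates)

print(f"mean of estimates: {mean_estimate:.3f}")  # close to true_s: the estimator is unbiased
print(f"std of estimates:  {spread:.3f}")         # but any single reading is noisy
```

Under these assumptions the average reading sits near 0.7, while the standard deviation of a single reading is about $\sqrt{0.7 \cdot 0.3 / 10} \approx 0.145$, so individual estimates are far from exact.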

Now, even though the expected value of the measurement we get is indeed $s(x)$
(i.e., $\widehat{s}(x)$ is an unbiased estimator of $s(x)$),
the

To see why this seemingly innocuous fact impacts the dataset replication process, suppose we knew exactly what the distribution of true selection frequency looked like for both the original ImageNet test set and the candidate images collected through the ImageNet-v2 replication pipeline:

Suppose now that we estimate a selection frequency of $\widehat{s}(x) = 0.7$ for a given image $x$ (i.e., seven of the ten workers who were shown the image and its candidate label marked it as correctly labeled). The key observation here is that given this empirical selection frequency, the most likely value for $s(x)$ is not $\widehat{s}(x)$ (even though $\widehat{s}(x)$ is an unbiased estimator of $s(x)$). Instead, our best guess for $s(x)$ after seeing $\widehat{s}(x)$ (a posterior estimate, which combines the observation with what we know about the pool $x$ came from) depends on which dataset $x$ is from!

If $x$ was sourced from ImageNet, it’s more likely that $s(x) > 0.7$ and therefore that $\widehat{s}(x)$ is an underestimate, since most of the mass of the ImageNet selection frequency distribution—see our (hypothetical) plot above—is on $s(x) > 0.7$. Conversely, if $x$ was a newly collected candidate image, then most likely $s(x) < 0.7$.
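The dependence on the source pool can be illustrated with a small worked example. The sketch below assumes, purely for illustration, that true selection frequencies in each pool follow a Beta distribution. The two sets of Beta parameters are made up, not fitted to ImageNet or to the candidate data. Because the Beta distribution is conjugate to the binomial, the posterior mean of $s(x)$ after observing $k$ of $n$ positive votes has a closed form.

```python
def posterior_mean_true_frequency(k, n, prior_a, prior_b):
    """Posterior mean of s given k of n positive votes and a Beta(prior_a, prior_b) prior.

    By Beta-binomial conjugacy the posterior is Beta(prior_a + k, prior_b + n - k),
    whose mean is (prior_a + k) / (prior_a + prior_b + n).
    """
    return (prior_a + k) / (prior_a + prior_b + n)


k, n = 7, 10  # seven of ten workers selected the image, so s_hat = 0.7

# Hypothetical pools: one skewed towards high selection frequency, one not.
imagenet_like = posterior_mean_true_frequency(k, n, prior_a=8.0, prior_b=2.0)   # prior mean 0.8
candidate_like = posterior_mean_true_frequency(k, n, prior_a=4.0, prior_b=6.0)  # prior mean 0.4

print(f"{imagenet_like:.2f}")   # 0.75, above the observed 0.7
print(f"{candidate_like:.2f}")  # 0.55, below the observed 0.7
```

The same observation of 0.7 is pulled upwards for the pool that is skewed high and downwards for the other, which is the asymmetry described in the text.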

**Remark:** Mathematically, the phenomenon is that $E[\widehat{s}(x)|s(x)] = s(x)$, but $E[s(x)|\widehat{s}(x)]$ is not equal to $\widehat{s}(x)$, and instead depends on the distribution of $s(x)$ via

The interactive graph below visualizes this intuition: by moving the slider, you can adjust the “observed” selection frequency $\widehat{s}(x)$. The shaded curves then show what our belief about the corresponding true selection frequency $s(x)$ looks like, depending on which pool the image came from.

Thus, if the selection frequency distribution of ImageNet is skewed upwards
compared to that of the collected candidate images (which is likely, since
ImageNet was

The bias we are describing is summarized visually in the interactive graph below: by adjusting the sliders, you can manipulate the distributions of “true” selection frequency for ImageNet (red) and for the candidate data (black), as well as the number of annotators used to estimate selection frequencies. The distribution of ImageNet-v2 selection frequencies resulting from performing statistic matching is shown in blue. Notice that as long as the candidate selection frequencies do not come from the same distribution as the ImageNet selection frequencies, the resulting ImageNet-v2 test set never matches the original test set statistics. For the reasons we discussed earlier, having fewer annotators, or having a bigger gap between ImageNet and the candidate distribution, also exacerbates the effect.
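The matching bias described here can also be reproduced in a few lines of simulation. The following is a sketch under stated assumptions, not the procedure used to build ImageNet-v2: true selection frequencies are drawn from two hypothetical Beta distributions, and matching is done by pairing each "original" image with a random candidate that received the same number of votes. All function names and parameter values are illustrative.

```python
import random
from collections import defaultdict
from statistics import mean


def draw_pool(size, alpha, beta, n_workers, rng):
    """Return (true_s, votes) pairs: true_s ~ Beta(alpha, beta), votes ~ Binomial(n_workers, true_s)."""
    pool = []
    for _ in range(size):
        true_s = rng.betavariate(alpha, beta)
        votes = sum(rng.random() < true_s for _ in range(n_workers))
        pool.append((true_s, votes))
    return pool


def match_on_observed_votes(original, candidates, rng):
    """For each original image, pick a random candidate with the same observed vote count."""
    by_votes = defaultdict(list)
    for true_s, votes in candidates:
        by_votes[votes].append(true_s)
    matched = []
    for _, votes in original:
        bucket = by_votes.get(votes)
        if bucket:  # skip the rare vote counts that no candidate received
            matched.append(rng.choice(bucket))
    return matched


def true_frequency_gap(n_workers, size=3000, seed=0):
    """Mean true selection frequency of the original pool minus that of the matched pool."""
    rng = random.Random(seed)
    original = draw_pool(size, 8.0, 2.0, n_workers, rng)        # hypothetical: skewed high, mean 0.8
    candidates = draw_pool(4 * size, 4.0, 4.0, n_workers, rng)  # hypothetical: centered at 0.5
    matched = match_on_observed_votes(original, candidates, rng)
    return mean(s for s, _ in original) - mean(matched)


gaps = {n: true_frequency_gap(n) for n in (5, 10, 40)}
for n, gap in gaps.items():
    print(f"n={n:2d} annotators: true selection frequency gap = {gap:.3f}")
```

Even though the observed vote counts are matched by construction, the matched pool should have a lower average true selection frequency than the original pool, and the gap should shrink as the number of simulated annotators grows.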

## Quantifying the Effects of Bias

The model above, paired with the fact that ImageNet (having already been
filtered for quality before) is likely much higher-quality than candidate images sourced from
Flickr, predicts that ImageNet-v2 images’ selection frequencies are
consistently lower than those of ImageNet images. To test this
theory, we set up another crowdsourced task that is extremely similar (

Even though the ImageNet-v2 creators report average selection
frequencies for ImageNet and ImageNet-v2 of 0.71 and 0.73 respectively, our
new experiments yield average selection frequencies of 0.85 and 0.81; note the change
in relative ordering (*fatal* heart attack while watching television); or (b) data makeup:
workers are presented with grids of 48 images at a time in both experiments,
with grids containing a mix of ImageNet, ImageNet-v2, and candidate
images—but the exact proportions of this mix differ between the two
experiments.

**Aside**: Why did we need to run a new crowdsourced study to observe this gap, instead of using the data already collected for the ImageNet-v2 study? The answer is finite-sample reuse: the selection frequencies collected in the original study are precisely the ones used to filter the ImageNet-v2 dataset. So, by construction, these selection frequencies will match those of the ImageNet test set,

### How does this bias affect measured accuracy?

Our model and experiments suggest that matching empirical statistics from different sources introduces bias into the dataset replication pipeline, and that in the case of ImageNet-v2 this means that selection frequencies are actually lower for the new ImageNet-v2 test set compared to the old one. Since selection frequency is meant to roughly reflect data quality, and is known (as found already by the ImageNet-v2 authors) to affect model accuracy, we expect the downwards selection frequency bias in ImageNet-v2 to directly translate into a downwards bias in model accuracy.

To test if this is really the case, we use a progressively increasing number of
annotators $n$ out of the 40 that we collected. For each $n$, we

After using 40 workers to control for selection frequency between ImageNet and ImageNet-v2, we reduce the 11.7% gap that was originally observed to a gap of 5.7%. This is already a significant reduction, but the trend of the graph suggests that 5.7% is still an overestimate: the gap continues to shrink consistently with each increase in the number of annotators. In the final part of this post, we’ll use a technique from classical statistics to get an even better estimate of the real, bias-adjusted gap between ImageNet and ImageNet-v2 model accuracies.

## Adjusting for Bias with the Statistical Jackknife

The statistic matching that led to the previous graph was based on what we’ll call a
*selection frequency-adjusted accuracy* estimator, defined for a given
classifier $f$ as:

\[ \text{Acc}(n) = \sum_{k=1}^n E_{x\sim \text{ImageNet-v2}}\left(1[\text{$f$ is correct on }x] | \widehat{s}(x) = \frac{k}{n}\right)\cdot P_{x\sim \text{ImageNet}}\left(\widehat{s}(x) = \frac{k}{n}\right) \]

This estimator has a simple interpretation: it is equivalent to (a) sampling an ImageNet-v1 image and observing its (empirical) selection frequency; then (b) finding a random ImageNet-v2 image with the same (empirical) selection frequency, and recording the classifier’s correctness on that input. So what we are estimating here is what a model’s accuracy on ImageNet-v2 would be, if the selection frequencies of ImageNet-v2 were distributed as in the ImageNet test set. (Notice that if the ImageNet and ImageNet-v2 selection frequency distributions already matched, then this estimator would be independent of $n$ and would evaluate to exactly the model’s accuracy on ImageNet-v2.)
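A minimal sketch of how $\text{Acc}(n)$ could be computed from per-image vote counts is given below. The function name, argument layout, and toy data are assumptions made for illustration. Vote counts with no ImageNet-v2 images are simply skipped, which is a simplification rather than something specified in the text.

```python
from collections import Counter, defaultdict


def adjusted_accuracy(v1_votes, v2_votes, v2_correct, n_workers):
    """Selection frequency-adjusted accuracy Acc(n).

    v1_votes:   vote counts k (out of n_workers) for ImageNet-v1 images
    v2_votes:   vote counts k for ImageNet-v2 images
    v2_correct: 1 if the classifier is correct on the matching v2 image, else 0
    """
    v1_counts = Counter(v1_votes)
    outcomes_by_votes = defaultdict(list)
    for votes, correct in zip(v2_votes, v2_correct):
        outcomes_by_votes[votes].append(correct)

    total = 0.0
    for k in range(1, n_workers + 1):  # the sum in the formula runs over k = 1..n
        outcomes = outcomes_by_votes.get(k)
        if not outcomes:
            continue  # assumption: skip vote counts with no v2 images
        p_v1 = v1_counts[k] / len(v1_votes)             # P_{x ~ v1}(s_hat = k/n)
        acc_v2_given_k = sum(outcomes) / len(outcomes)  # E_{x ~ v2}[correct | s_hat = k/n]
        total += acc_v2_given_k * p_v1
    return total


# Toy data with n = 2 workers (hypothetical values).
v1_votes = [2, 2, 2, 1]
v2_votes = [2, 1, 1, 1]
v2_correct = [1, 1, 0, 0]

acc = adjusted_accuracy(v1_votes, v2_votes, v2_correct, n_workers=2)
plain_v2_accuracy = sum(v2_correct) / len(v2_correct)
same_distribution = adjusted_accuracy(v2_votes, v2_votes, v2_correct, n_workers=2)
print(f"adjusted: {acc:.4f}, unadjusted: {plain_v2_accuracy:.4f}")
```

In the toy data the v1 images have higher observed selection frequencies than the v2 images, so reweighting raises the accuracy estimate. Passing the v2 vote counts as both pools recovers the plain v2 accuracy, in line with the parenthetical remark in the text.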

Now, the $\text{Acc}(n)$ estimator is subject to the same bias as dataset replication itself, as it too ignores the discrepancy between the empirical selection frequency $\widehat{s}(x)$ and the true selection frequency $s(x)$. Since we’ve been talking about bias pretty abstractly in this post, it’s worth noting that the $\text{Acc}(n)$ estimator ties everything back to the formal, statistical definition of bias. Specifically, our main finding can be restated (though maybe less intuitively) as “$\text{Acc}(n)$ is a downwards-biased estimator of the true reweighted accuracy,” that is

\[ E[\text{Acc}(n)] < \lim_{n\rightarrow\infty} \text{Acc}(n). \]

So, from this perspective, our graph in the previous section can be viewed as just a plot of the value of $\text{Acc}(n)$ for every classifier for various values of $n$. Also, the estimator behaves exactly as predicted by our model of the bias: as $n$ increases, $\widehat{s}(x)$ becomes a less noisy estimator of $s(x)$, so the bias in the matching process decreases and $\text{Acc}(n)$ increases.

Now, what we really want to know is what $\text{Acc}(n)$ looks like as $n \rightarrow \infty$, especially given that even when we use 40 annotators for statistic matching the adjusted accuracy still improves.

In our paper we present further techniques for tackling this problem. These include tools from empirical Bayes estimation, beta-binomial regression, and kernel density estimation. To keep things short here, we’ll only discuss the simplest estimation method we use: one based on a technique known as the statistical jackknife.

The jackknife dates back to the work of Maurice Quenouille and John
Tukey in the 1950s, and provides a way to estimate the bias of

\[ b_{jack}(\widehat{\theta}_n) = (n-1)\cdot \left(\frac{1}{n}\sum_{i=1}^{n} \widehat{\theta}_{n-1}^{(i)} - \widehat{\theta}_n \right), \]

\[ \text{where}\qquad \widehat{\theta}_{n-1}^{(i)} = \widehat{\theta}(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n) \text{ is the $i$th leave-one-out estimate.} \]

*leave-one-out* estimator $\widehat{\Theta}_n^{(-i)}$ to be $\widehat{\Theta}_n$ computed with all but the $i$th datapoint. There are $n$ possible such leave-one-out estimators for a given $\widehat{\Theta}_n$. The main idea behind the jackknife is to make the following two approximations:

\[ \widehat{\Theta}_n \approx E[\widehat{\Theta}_n] = \Theta + \frac{b(\Theta)}{n} \]

\[ \frac{1}{n} \sum_{i=1}^{n} \widehat{\Theta}_n^{(-i)} \approx E[\widehat{\Theta}_{n-1}] = \Theta + \frac{b(\Theta)}{n-1} \]

Using these two approximations, one can solve for $b(\Theta)$ and end up with the bias estimate given in the post.
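The jackknife bias formula is short enough to sketch directly. The example below is a generic illustration, not the computation applied to $\text{Acc}(n)$ in the paper. It uses the plug-in variance (which divides by $n$) as a stand-in estimator because its bias is well understood, and the sample values are made up.

```python
from statistics import pvariance, variance


def jackknife_bias(estimator, samples):
    """Jackknife bias estimate: (n - 1) * (mean of leave-one-out estimates - full estimate)."""
    n = len(samples)
    full_estimate = estimator(samples)
    # Recompute the estimator n times, each time leaving out one datapoint.
    leave_one_out = [estimator(samples[:i] + samples[i + 1:]) for i in range(n)]
    return (n - 1) * (sum(leave_one_out) / n - full_estimate)


samples = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical data
bias = jackknife_bias(pvariance, samples)
corrected = pvariance(samples) - bias      # bias-corrected estimate

print(f"plug-in variance:   {pvariance(samples):.4f}")
print(f"jackknife bias:     {bias:.4f}")
print(f"corrected estimate: {corrected:.4f}")
print(f"unbiased variance:  {variance(samples):.4f}")
```

For this particular estimator the jackknife correction recovers the usual unbiased sample variance, which makes it a convenient sanity check. For an estimator like $\text{Acc}(n)$ the correction is only approximate, since it relies on the bias scaling roughly as $1/n$.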

Averaging across all of the classifiers studied, the jackknife estimates the bias in our estimator as about 1.0%, meaning that the bias-corrected gap between ImageNet and ImageNet-v2 shrinks from 11.7% (without any correction) to 5.7% (using the 40 annotators we have to correct), to 4.7% (using this additional jackknife bias correction). Moreover, as our paper discusses, this is almost certainly still an overestimate—using more refined methods for bias estimation reduces the gap to somewhere between 3.4% and 3.8% (with variation being across different methods).

## Summary and Conclusions

We find that noise, even if it is mean-zero, can result in bias in dataset reproductions if not
accounted for. Zooming in on the `ImageNet-v2` replication effort, we find
that a majority of the observed accuracy drop can be explained by a measurable
covariate: selection frequency. Looking forward, knowing this source of the
accuracy drop will allow us to focus on making models robust to changes in a
quantifiable axis (rather than a seemingly ambiguous distribution shift).
More broadly, our results suggest that modeling aspects of the data collection
process is a useful tool in data replication generally.
For more details check out our paper, where we discuss precise modeling procedures, further
experiments, and future directions.