Training Robust Classifiers (Part 2)

In part 1 of this post, we discussed how training a robust classifier can be naturally formulated as minimizing the robust population loss:

\[\mathbb{E}_{x, y \sim D} \left[\max_{x' \in P(x)} L(f (x'), y ) \right]\;.\]

This directly leads to solving the following min-max problem:

\[\min_\theta \sum_{i=1}^n \max_{x' \in P(x_i)} L(f_\theta (x'), y_i ) \; .\]

We then concluded that computing gradients for the outer minimization problem requires us to repeatedly solve the inner maximization problem. This latter problem corresponds to maximizing the loss of the classifier over a subset of the input space (e.g., small perturbations of image pixels). In this blog post, we will outline how to find good solutions for the min-max problem. For details, please refer to our paper.

Solving the Inner Maximization Problem

The key difficulty we immediately run into is that the inner maximization is provably hard (in the worst case), even for simple ReLU networks (link). So there is little hope for solving this problem efficiently and with good general guarantees on the quality of the computed solution. So instead, we resort to the leading methodology for optimizing non-convex objectives involving neural networks: applying a first-order method and hoping for the best!

Since our problem is constrained, we will use Projected Gradient Descent (PGD), a canonical method for solving constrained optimization problems. PGD proceeds by repeatedly taking a step in the direction of the gradient of the loss function – to greedily increase its value – and then projecting the resulting point back to the constraint set:

\[\Pi_C(x + \eta\nabla L(x, y))\; .\]

Here, \(\Pi_C\) refers to projecting a point onto the set \(C\). For a given point \(x’\), computing \(\Pi_C(x')\) corresponds to finding the point in \(C\) that is closest to \(x'\). As usual, \(\eta\) denotes the step size. (Note that tuning the step size is crucial for PGD to work well.)

Since we are interested in maximizing over an \(\ell_\infty\)-ball, we will use the \(\ell_\infty\)-variant of PGD (see here or here for details):

\[\Pi_C(x + \eta\cdot\text{sign}(\nabla L(x, y)))\; .\]

(Following the signs of the gradient instead of the gradient itself lets us make more progress in the \(\ell_\infty\)-geometry. It is worth keeping in mind though that following the signs is specific to the \(\ell_\infty\)-setting and would not be a suitable choice, e.g., for \(\ell_2\)-bounded perturbations.)

The resulting method turns out to be very successful in finding adversarial examples for a broad range of models. In the context of deep learning, it is sometimes referred to as Basic Iterative Method (link, link).

Reliable Inner Maximization is Critical

Before we examine the performance of robust training with PGD, we want to emphasize that it is crucial to solve the inner maximization problem sufficiently well. First, recall from last time that Danskin’s theorem is valid only if we are able to solve the inner maximization problem optimally. So, even though we generally can’t be guaranteed to find such optimal solutions efficiently, all current evidence shows that we need to find solutions that are sufficiently close to optimal. Such solutions are more likely to give us a good update to the outer minimization problem (cf. the algorithm for solving the min-max problem we discussed last time).

In the past, researchers used fast, yet fairly weak optimization procedures (attacks) for solving the inner problem. While these methods successfully produce adversarial examples for “standard” (non-robust) classifiers, they don’t necessarily give sufficiently good (i.e. near-optimal) solutions for the inner maximization problem. Models trained using them turned out to be resistant to these weak attacks, but still vulnerable to stronger attacks.

A prominent illustration of this phenomenon is adversarial training with FGSM (fast gradient sign method). FGSM is an attack method based on performing a single gradient step. It is very efficient and successful at creating adversarial examples for standard (non-robust) models. However, it is far less effective at finding worst-case adversarial perturbations, especially when compared to perturbations computed with a few steps of PGD. This can be clearly seen when we examine the loss function along the direction of FGSM and the direction found by a few steps of PGD, respectively. Here we plot the value of the loss on the line segment between the unmodified data point and the adversarially perturbed points found by FGSM and PGD:

While the FGSM direction leads to a high loss value close to the starting point for CIFAR10, the final iterate of PGD achieves a significantly higher loss. This is not surprising. After all, FGSM chooses the steepest direction locally (i.e., in a greedy way), and thus is not sensitive to the changes in loss value further along this direction.

As a result, models trained using FGSM perturbations are robust to FGSM attacks but are highly vulnerable to PGD attacks. (These models are also often memorizing the attacks they are exposed to, a phenomenon that was called label leaking.) Here are the same plots on models trained using FGSM:

As one can see, the points found by FGSM have very low loss, despite the fact that points with significantly higher loss exist (and can be found by PGD). The small bump at the start of the CIFAR plot is a sign of “gradient masking”, see here for further analysis.

Examining the Performance of PGD

In the previous section, we have seen that suboptimal solutions to the inner maximization problem can lead to vulnerable models (even if the attack method can easily fool a “vanilla” model). In particular, models trained using FGSM still turn out to be vulnerable to perturbations found via PGD.

Since PGD found solutions with higher loss than FGSM, a natural starting point for training more robust models is using PGD for the inner maximization. However, it is still unclear whether we can depend on PGD to always find optimal (or sufficiently close to optimal) solutions. After all, PGD is only guaranteed to find an actual maximum for concave functions over convex sets. But here, PGD is facing a non-concave function.

We conducted multiple experiments to understand the performance of PGD in our non-concave setting. In the first set of experiments, our goal was to examine whether the loss surface contains local maxima with very different values. To this end, we evaluated the robust loss of standard MNIST and CIFAR10 networks with multiple runs of PGD initialized at different random starting points. Below are the histograms of the final loss values found by PGD starting from \(10^5\) different random points for two random examples from each test set:

Interestingly, the behavior of PGD is quite consistent. The values of the local maxima found by PGD are fairly concentrated and there are no major outliers (we did not crop the x axis). This is encouraging! The plots suggest that just a few steps of PGD already find a point with a reasonably good value. It is important to keep in mind though that, strictly speaking, we do not really know if the loss we obtain is close to the maximum possible value. But even if it was not the case, we at least get some degree of confidence that a first-order adversary – i.e., an adversary that only utilizes first-order information about the model – is unlikely to find examples with significantly higher loss.

In order to understand the loss landscape in more detail, we now take a look at two further visualizations for standard MNIST and CIFAR10 classifiers. In particular, we will plot the value of the loss (as a function of input image, not model parameters) in the two dimensional space spanned by an unperturbed image and the two local maxima found by PGD for different starting points. To be precise: if \(nat\) is the original point and \(adv_1\), \(adv_2\) are the perturbed points computed by PGD, we plot \(L(nat + \lambda_1 \cdot (adv_1-nat) + \lambda_2\cdot (adv_2-nat))\) as a function of \(\lambda_1\) and \(\lambda_2\).

Interestingly, the maxima found across restarts are distinct (they form a (roughly) 90 degree angle with the starting point), and the value of the loss is significantly lower in the line segments between them. So this is concrete evidence that the loss (as a function of input pixels) is indeed non-concave.

Applying Robust Optimization to MNIST and CIFAR10

After these experiments with PGD, our conjecture was that solving the inner maximization problem with PGD would yield good enough gradients for the outer minimization. To test this conjecture, we trained classifiers with PGD from a random starting point (i.e., a random perturbation of the input image) for the inner problem, and standard SGD for the outer problem. To compare with prior work, we began with standard architectures on the commonly used MNIST and CIFAR-10 datasets. Here, we plot the value of the robust loss (as approximated by PGD) during training:

The robust loss indeed decreases consistently and the model becomes increasingly more robust during training. This confirms that our intuition motivated by Danskin’s theorem indeed holds up experimentally: by repeatedly training on (approximate) maxima of the inner problem, we are able to reduce the overall value of the saddle point problem. The resulting models achieve 100% robust accuracy on the training set. That is, they classify all training points correctly even when these are perturbed using PGD.

The training performance also transfers to the test set reasonably well, leading to a robust accuracy of 89% on MNIST for \(\epsilon\)=0.3/1 and 47% on CIFAR10 for \(\epsilon\)=8/255, both under \(\ell_\infty\)-perturbations. To the best of our knowledge, these were the first published MNIST models that were robust to large adversarial \(\ell_\infty\)-perturbations. Admittedly, the CIFAR10 model is less impressive, but it still constituted a large improvement over standard models and the previous state-of-the-art among robust models (which was essentially 0% accuracy against PGD).

Finally, note that the standard accuracy of these models dropped as a result of adversarial training (as compared to a model trained in the standard way). This is undesirable but, somewhat surprisingly, this might actually be inevitable. We will discuss this in a later blog post.

Beyond PGD Evaluation

So far, we have seen that robust optimization produces models that are fairly robust to perturbations generated with PGD. But how can we be sure that there is no other attack that significantly reduces the accuracy of our models? After all, robustness to a specific attack is not meaningful if a slightly different attack defeats the classifier. Indeed, this is a common occurrence in the adversarial ML literature. While new defenses usually have good performance against existing attack methods, they are often entirely bypassed by more sophisticated versions of these attacks adapted to the particular defense (link1, link2, link3, link4, link5, link6).

The true robustness of a classifier is thus determined by the most powerful attack (within the threat model of interest) that is specifically targeting that classifier. In particular, such an attack should be able to incorporate full knowledge of the classifier and the defense used.

As a first step in understanding the robustness of our PGD-trained models, we plot histograms of the robust loss (final loss value found by PGD for \(10^5\) random starting points) for the robust models.

As before, the robust loss is nicely concentrated, but now around a significantly smaller range of values. This loss concentration suggests robustness (at least to first-order methods with a random start), but it does not constitute a rigorous proof. To get more confidence, we tested our models against a broader range of attacks, using multiple different step sizes and multiple restarts. We included transfer attacks from independently trained models and first-order attacks using the Carlini-Wagner loss function. Perhaps most importantly, we publicly released our models and training procedures (MNIST, CIFAR10), inviting the research community to mount more involved attacks. None of the submissions managed to reduce the accuracy of our models significantly below that found by PGD.

Follow-up work as well as our own ongoing research has found that this is not due to the insufficiency of the attack methods. One can indeed formally verify that no successful \(\ell_\infty\)-attacks exist for a large fraction of the MNIST test set points (and a non-trivial fraction of CIFAR10 test set points). These findings demonstrate that for this task and method, training against PGD is indeed provably effective.

Moreover, subsequent work (link1, link2, link3, link4) has modified and extended the robust optimization paradigm for neural networks to achieve provable guarantees for simple settings. While these provable guarantees are significantly smaller than the empirical robustness of our models (and the resulting models are not significantly more robust empirically), they constitute an important step towards the long-term goal of building provably reliable models.

What went wrong with CIFAR10?

While we made progress towards a robust CIFAR10 model, the resulting robust accuracy is still less than ideal. So what went wrong?

From an optimization perspective, things actually look fine: by using large enough models, we are able to achieve 100% adversarial robustness on the training set. This means that we are able to successfully minimize the empirical version of the robust loss. So what we are facing here is not an issue of optimization but one of (adversarially robust) generalization! The perfect robust accuracy of our model on the training set does not generalize to unseen adversarial examples (the robust accuracy on the test set is only 47%). It turns out that there is a broader intriguing phenomenon at play here. We will investigate this further in a subsequent blog post.