In the previous post, we have seen that classifiers are often vulnerable to small perturbations of their inputs. Although these perturbations (also known as adversarial examples) usually do not confuse humans, they can entirely change the prediction of an otherwise accurate classifier. In this blog post and the next one, we will see how to train classifiers that are significantly more robust to such perturbations.

## The Goal: (Adversarially) Robust Generalization

When studying adversarial examples, it is important to remember that we implicitly change our measure of performance. Normally, we judge the performance of a classifier $f$ (mapping from input $x$ to label $y$) by computing the average loss under a data distribution $D$:

$\mathbb{E}_{x, y \sim D} \big[ L(f (x), y )\big]\;.\qquad (1)$

If this average loss is small, we say that we have learned a good classifier.

In contrast, robust classification calls for a more stringent measure of performance. We now allow an adversary to perturb the sample $x$ before feeding it to our classifier $f$. Since we are primarily interested in perturbations that do not change the true label, we limit the adversary by restricting it to a pre-defined perturbation set $P(x)$. This leads to the following quantity:

$\mathbb{E}_{x, y \sim D} \left[\max_{x' \in P(x)} L(f (x'), y ) \right]\;.\qquad(2)$

It is important to note that a classifier $f$ with small “standard” loss (1) is not necessarily a good classifier under the robust loss (2). After all, the guarantee (1) makes no explicit statement about the perturbation set $P(x)$. A classifier $f$ could have small “standard” loss (1) and correctly classify most of the inputs in $P(x)$, while still having a large robust loss when measured as in (2). This is because we count $x$ as misclassified in (2) as long as there is at least one $x'$ in $P(x)$ that is misclassified. So we should not be surprised when “standard” classifiers - which aim to minimize (1) - turn out to be non-robust: we never trained them to be robust in the first place! The widespread vulnerability to small perturbations is thus hardly an unexpected “failure” of standard classifiers.

Another key aspect of the robust loss (2) is that it does not involve a specific method for constructing perturbations $x'$. So if we can find a classifier with small robust loss, we can be sure that the classifier is robust to any perturbation in the set $P$. This is an important difference to prior work, where the focus was often on robustness to specific attack methods (e.g., FGSM or JSMA). While specialized defenses do give strong robustness to these specific attacks, this has also led to “arms races” in earlier work (link1, link2, link3, link4) where such specialized defenses were later circumvented by applying a different attack to them. In contrast, objective (2) gives a universal robustness guarantee that applies to all methods for crafting perturbations in $P(x)$.

Now that we have set our sights on objective (2), there are two key questions:

1. What is a good perturbation set $P$?
2. How can we get a robust classifier for the chosen perturbation set $P$?

Both issues are the subject of active research. We will present our take on perturbation sets at a later time and focus on the second question (optimization) here. To be concrete, we will work with the $\ell_\infty$-ball as constraint set P since it has received a lot of attention in image classification lately. While $\ell_\infty$-perturbations certainly do not capture all aspects of image similarity, they are commonly used as a natural baseline. After all, a fully robust classifier needs to be invariant to at least small $\ell_\infty$-bounded perturbations. Hence $\ell_\infty$-robustness is a necessary (but not sufficient) perturbation set here. In a later blog post, we will see that the same optimization methodology largely applies to other perturbation sets as well.

## Robust Optimization (a.k.a. Adversarial Training)

In the previous section we have seen that a classifier with small robust loss as in (2) would be useful. How can we learn such a classifier? For the standard loss (1), the prevalent learning methods are based on empirical risk minimization (ERM). So a natural approach in the adversarial setting is to use an analogous “robustified” variant of ERM. In particular, our goal is to find model parameters $\theta$ so that the corresponding classifier $f_\theta$ has a small empirical loss even when an adversarial perturbation is applied:

$\min_\theta \frac{1}{n} \sum_{i = 1}^n \max_{x' \in P(x_i)} L(f_\theta(x'), y_i)\;.\qquad(3)$

Min-max problems of this form have a long history going back to at least the work of Abraham Wald from the 1940s (link). Moreover, the field of robust optimization has studied min-max problems over the past few decades (link). In the following, we will discuss how to solve such problems in the context of neural networks.

At a first glance, it might be unclear how to solve Problem (3). The optimization problem consists of both an outer minimization and an inner maximization. In addition, the inner maximization is non-concave, and the outer minimization is non-convex. To tackle this problem, it turns out to be useful to split the overall problem into an outer minimization and inner maximization, as we will describe in the next two sections.

## Outer Minimization

Without the inner maximization, Problem (3) is exactly the same problem as training a regular classifier. While this is a non-convex problem for neural networks, SGD still tends to perform well in that setting. Hence it is also a natural starting point for the outer minimization part of the robust ERM problem (3). In order to run SGD, we need to compute gradients of the robust loss

$\phi_{x, y}(\theta) = \max_{x' \in P(x)} L(f_\theta(x'), y)\;.\qquad(4)$

Note that $\phi$ is itself an optimization problem, so we can’t simply run backpropagation with the standard building blocks of a neural network. Fortunately, as we mentioned, the optimization community has long figured out how to compute gradients for such min-max problems. The theoretical motivation here comes from Danskin’s Theorem (link). Informally speaking, Danskin’s Theorem tells us that we can obtain a gradient of $\phi$ w.r.t. $\theta$ by finding a constrained maximizer $x^*$ of the inner maximization problem (4). We can then use this maximizer as our actual data point and compute the gradient at $x^*$, i.e.,

$\nabla_\theta \phi_{x, y}(\theta) = \nabla_\theta L(f_\theta(x^*), y)\;.$

There are technical conditions for Danskin’s Theorem, but convexity is not one of them. So if we can find an optimal solution of the inner maximization problem, we can also get a gradient for the outer minimization problem and run SGD.

## Inner Maximization

We now consider the inner maximization problem. Note that keeping $\theta$ fixed and finding a maximizer of the robust loss (4) corresponds exactly to the the problem of optimally attacking a given classifier. That is, the classifier $f$ can be fooled with a point in the perturbation set $P$ if and only if the the maximizer of $L(f_\theta(x), y)$ is an adversarial example. This connection shows an important duality between attacking a classifier and training a robust classifier: if we have a good attack, we also have a method for finding good gradients of the robust loss $\phi$. So in order to train a classifier minimizing (2), we should train against the strongest adversary possible (i.e., within a feasible computational budget).

Overall, the robust optimization perspective suggests the following procedure for training a robust classifier:

1. Sample a data point $x$, $y$.
2. Compute the maximizer $x^*$ of the robust loss $\phi_{x, y}(\theta)$.
3. Compute the gradient $g = \nabla_\theta L(f_\theta(x^*), y)$.
4. Update $\theta$ with the gradient $g$.
5. Repeat Steps 1 - 4 until convergence.

In deep learning, this procedure amounts to a variant of standard training in which, instead of training on unmodified data points (as SGD normally would), we first compute adversarial perturbations for all the training points in the current batch (Step 2). Then we apply the gradient to our model parameters as usual, but with respect to these perturbed points instead of the original ones (Steps 3-4). In the deep learning community, this approach has become known as “adversarial training”, but it is worth to remember that methods such as this one have a long history in robust optimization.

Of course, the important question now is whether this theoretical perspective is predictive of the actual performance in practice. One reason to be concerned is that robust optimization (i.e., adversarial training) had already been employed in prior work, but did not lead to truly robust models there (even on simple datasets such as MNIST). It turns out that the robust optimization framework can indeed produce models that are robust in practice. But the key to success lies in the details of how to exactly execute the robust optimization paradigm. We will discuss this aspect in the next post.