Certified Patch Robustness Via Smoothed Vision Transformers (Part 2)

Paper Code

In our previous post, we gave an overview of smoothing-based approaches to certified defenses, and described a certified patch defense known as de-randomized smoothing. In our latest work, we show how to leverage vision transformers (ViTs) to significantly improve such certified patch defenses along all possible axes: standard accuracy, certified accuracy, and speed. In fact, with some straightforward modifications, our certified ViTs not only enable significantly better certified robustness but they also maintain standard accuracy and inference times that are comparable to that of regular (non-robust) models. This progress marks a substantial step forward for certified defenses as a practical alternative to regular models.

An overview of the Smoothed ViT.

The main appeal of certified defense is their ironclad guarantees: a certified result leaves no room for doubt and is not prone to blind-spots of empirical evaluation. However, practitioners care not only about certified performance: the accuracy and the inference speed of the model are equally important factors too. Unfortunately, nearly all certified defenses come at a severe price: (1) the standard accuracy plummets in comparison to a standard model and/or (2) inference time is orders of magnitude slower than that of a standard model.

A standard ResNet-50 can achieve about 76% accuracy on the ImageNet benchmark, and takes less than 1 second to make a prediction on a typical GPU. In contrast, top-performing certified defenses such as Levine & Feizi (2020) reports 44% standard accuracy and takes 150 seconds to make a prediction on similar hardware.

This trade-off between robustness and performance can be untenable even for safety-critical applications. For example, object detection systems deployed in autonomous vehicles need to recognize obstacles and signs quickly in real time. Consequently, in addition to being robust, these systems also need to be fast and accurate.

To approach this tradeoff between robustness and accuracy, we take a closer look at certified defenses from an architectural perspective. Specifically, smoothing defenses use a backbone architecture that is typically implemented using classic convolutional networks, the default choice across much of computer vision. But are these architectures really the best tool for the job?

In our work, we show that when moving from standard models to certified defenses, it may be worth it to rethink the core architecture behind them. Specifically, we find that smoothing-based patch defenses can drastically improve when using vision transformers (ViTs) instead of convolutional networks as their backbone. With some additional adjustments to the transformer architecture, we not only achieve superior certified accuracy to prior work but also reach standard accuracies and inference times comparable to a standard, non-robust ResNet!

Rethinking smoothing backbones

Recall from the previous post that smoothing defenses use a base classifier to make predictions on many variations of the original input. For patch defenses, this amounts to using the base classifier to classify column ablations (i.e., masked images that only reveal a small column of the original image). Consequently, the accuracy of the base classifier at classifying columns of an image directly translates into the performance of the certified defense.

In the light of this observation, it is crucial to realize that convolutional networks typically process complete (i.e., not masked) images. So, this type of architecture might not actually be best suited for classifying the highly masked images used in smoothing defenses.

Why might we expect convolutions to be suboptimal for masked images? CNNs slide filters across the image and grow the receptive field across layers of the network. This large receptive field, while usually being a strength of CNNs for standard settings, forces the network to process a large amount of masked pixels when processing ablated images. Since these pixels carry no information, the network must learn to detect and ignore these masked regions (in addition to performing unnecessary computation).

Vision transformers for ablated images

Unlike CNNs, vision transformers (ViTs) use attention modules instead of convolutions as their main building block. Crucially, attention modules can ignore masked regions instead of processing them.

To test out this intuition, we measured the accuracy of various convolutional and ViT architectures at classifying column ablations. It turns out that ViTs are significantly better at this task—for example, on ImageNet, a ViT-S has 12% higher accuracy at classifying column ablations than a similarly sized ResNet-50.

Small

Medium

Large

Ablation accuracy of ViTs vs. ResNets: A comparison of the accuracy of ViTs with similarly sized ResNets at classifying image ablations.

These improvements in ablation accuracy directly boosts both the certified and standard accuracies for the smoothed model. For example, the increase in ablation accuracy improves the certified accuracy of a smoothed ViT-S by 13% (as compared to a similarly sized ResNet-50) and the standard accuracy by 12%.

With some additional adjustments to the certification procedure described in our paper, our largest certifiably robust vision transformer can reach 73% standard accuracy—only 3% off from the standard accuracy of a regular, non-robust ResNet-50, and almost 30% higher than previous versions of de-randomized smoothing!

Small

Medium

Large

Certified accuracy of ViTs vs. ResNets: A comparison of the certified accuracy of smoothed ViTs/ResNets when using different ablation sizes.

Small

Medium

Large

Standard accuracy of ViTs vs. ResNets: A comparison of the standard accuracy of smoothed ViTs/ResNets when using different ablation sizes.

Modifying ViTs to Speed up Inference Time

Also, recall that a smoothed classifier tends to be significantly slower than a standard model because it needs to classify a large number of input variations for each input image. Indeed, traditional randomized smoothing approaches have used upwards of 100,000 variations, resulting in a forward pass that is 4-5 orders of magnitude slower than a standard model. These costs are infeasible in scenarios where decisions need to be made in a fraction of a second, such as autonomous driving. As Levine & Feizi have shown, for patch defenses, a move towards deterministic smoothing (as opposed to randomized) can result in a great speed up (by reducing the number of variations needed to be classified). However, the resulting solution is still two orders of magnitude slower than standard models.

In order to speed up inference (much) further, we leverage the token-based nature of ViTs to gracefully handle image ablations. Specifically, note that ViTs process images as tokens, where each token represents a contiguous patch of the input image. Since the runtime of a ViT is proportional to the number of tokens, a natural approach to speed up inference is to simply drop the masked out tokens, as shown in the following figure:

Vision transformers split images into patches, which are processed by the transformer architecture. When processing column ablated images, we can simply pass only the tokens in the ablation to significantly reduce the computational complexity of smoothing.

We modify the ViT architecture to drop fully masked tokens in such a manner. As expected, this significantly speeds up the inference time of the smoothed model! Indeed, with this optimization, our smoothed ViT-S is faster than a smoothed ResNet-50 by an order of magnitude. In our paper, we discuss how additional improvements to the certification procedure can make our ViT architectures another order of magnitude faster. In total, our fastest smoothed ViT is actually comparable in terms of inference speed to a standard, non-robust ResNet.

Conclusion

Smoothed ViTs present a simple but effective way to achieve state-of-the art certified accuracy. These models are also realistically deployable, as they maintain similar accuracy and inference speed as non-robust models. The key here is the vision transformer’s inherent ability to quickly and accurately process image ablations. These advancements are thus a step towards certified defenses that can be practically deployed in real-world scenarios.