Ensemble everything everywhere:
Multi-scale aggregation for adversarial robustness

Stanislav Fort* and Balaji Lakshminarayanan
Google DeepMind

*Main contributor and project lead
The headline figure showing the key aspects of the multi-resolution self-ensemble

We use a multi-resolution decomposition (a) of an input image and a partial decorrelation of predictions of intermediate layers (b) to build a classifier (c) that has, by default, adversarial robustness comparable to or exceeding the state of the art (f), even without any adversarial training. Optimizing inputs against it leads to interpretable changes (d) and images generated from scratch (e).

Abstract

Adversarial examples pose a significant challenge to the robustness, reliability and alignment of deep neural networks. We propose a novel, easy-to-use approach to achieving high-quality representations that lead to adversarial robustness through the use of multi-resolution input representations and dynamic self-ensembling of intermediate layer predictions. We demonstrate that intermediate layer predictions exhibit inherent robustness to adversarial attacks crafted to fool the full classifier, and propose a robust aggregation mechanism based on a Vickrey auction that we call CrossMax to dynamically ensemble them. By combining multi-resolution inputs and robust ensembling, we achieve significant adversarial robustness on the CIFAR-10 and CIFAR-100 datasets without any adversarial training or extra data, reaching an adversarial accuracy of ≈72% (CIFAR-10) and ≈48% (CIFAR-100) on the RobustBench AutoAttack suite (L∞=8/255) with a finetuned ImageNet-pretrained ResNet152. This is comparable to the top three models on CIFAR-10 and a +5% gain compared to the best current dedicated approach on CIFAR-100. Adding simple adversarial training on top, we get ≈78% on CIFAR-10 and ≈51% on CIFAR-100, improving SOTA by 5% and 9%, respectively, and seeing greater gains on the harder dataset. We validate our approach through extensive experiments and provide insights into the interplay between adversarial robustness and the hierarchical nature of deep representations. We show that simple gradient-based attacks against our model lead to human-interpretable images of the target classes as well as interpretable image changes. As a byproduct, using our multi-resolution prior, we turn pre-trained classifiers and CLIP models into controllable image generators and develop successful transferable attacks on large vision language models.

Multi-resolution input to mimic the human eye

Humans do not take a single static picture with their eyes and classify it. Instead, many noisy, jittered frames are effectively captured at different resolutions, and classification is performed on all of them at once. To mimic this, we train a neural network to accept a channel-wise multi-resolution stack of images at once. This automatically leads to significant adversarial robustness of the learned model, provided that a very low learning rate is used.

A multi-resolution architecture taking a channel-wise, multi-resolution stack of images as an input
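As a concrete sketch of this input representation, each image can be downsampled to several resolutions, upsampled back to the original size, lightly perturbed with noise, and concatenated along the channel dimension. The resolutions, noise level, and function name below are illustrative assumptions, not the exact training configuration:

import torch
import torch.nn.functional as F

def multi_resolution_stack(image, resolutions=(32, 16, 8, 4), noise_std=0.01):
    """Turn a [B, 3, H, W] batch into a [B, 3 * len(resolutions), H, W] stack.

    Each copy is downsampled to a coarser resolution, upsampled back to the
    original size, and lightly perturbed with noise -- a rough stand-in for
    the noisy, multi-scale frames described above.
    """
    B, C, H, W = image.shape
    copies = []
    for r in resolutions:
        low = F.interpolate(image, size=(r, r), mode='bilinear', align_corners=False)
        back = F.interpolate(low, size=(H, W), mode='bilinear', align_corners=False)
        copies.append(back + noise_std * torch.randn_like(back))
    return torch.cat(copies, dim=1)  # channel-wise stack

# A CIFAR-sized batch becomes a 12-channel input for the backbone's first conv layer.
x = torch.rand(8, 3, 32, 32)
print(multi_resolution_stack(x).shape)  # torch.Size([8, 12, 32, 32])

The only architectural change this requires is widening the first convolution of the backbone from 3 input channels to 3 times the number of resolutions.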

Standard adversarial attacks only fool the final layer

We experimentally demonstrate that standard adversarial attacks on a classifier only fool the very final layers of the network. In other words, a dog attacked to look like a car still has predominantly dog-like early and intermediate layer representations, such as oriented edges, textures, and even higher-level features.

An adversarial attack on a classifier only confuses the representations at the very last layers of the neural network
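One way to observe this is to tap intermediate activations with forward hooks and attach lightweight linear probes to them. The sketch below uses a torchvision ResNet152 and randomly initialized probes as placeholders; in practice the probes are trained on top of the backbone, and the choice of tapped layers and the pooling are assumptions for illustration:

import torch
import torch.nn as nn
from torchvision.models import resnet152

# Backbone whose intermediate activations we probe (weights omitted for brevity;
# the paper fine-tunes an ImageNet-pretrained ResNet152).
backbone = resnet152(weights=None).eval()

# Capture the outputs of a few intermediate blocks with forward hooks.
features = {}
def save_to(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

for name in ['layer1', 'layer2', 'layer3', 'layer4']:
    getattr(backbone, name).register_forward_hook(save_to(name))

# One linear probe per tapped layer, mapping pooled features to class logits.
# These are random placeholders here; in practice they are trained separately.
num_classes = 100
probes = nn.ModuleDict({
    'layer1': nn.Linear(256, num_classes),
    'layer2': nn.Linear(512, num_classes),
    'layer3': nn.Linear(1024, num_classes),
    'layer4': nn.Linear(2048, num_classes),
})

def per_layer_predictions(x):
    backbone(x)  # fills `features` via the hooks
    return {name: probe(features[name].mean(dim=(2, 3)))  # global average pooling
            for name, probe in probes.items()}

x = torch.rand(1, 3, 224, 224)
print({name: logits.argmax(dim=1).item()
       for name, logits in per_layer_predictions(x).items()})

Comparing these per-layer predictions on clean versus adversarially perturbed inputs is how the partial robustness of early and intermediate layers can be checked.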

Attacks that perturb an image to confuse a particular layer are mostly effective only for that layer and the layers immediately surrounding it. Representations at layers before, and even after, the targeted layer partially recover and see the original ground-truth class instead of the attack's target class.

An adversarial attack confusing a particular layer does not confuse the layers before and even after it
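A layer-targeted attack of this kind is essentially standard projected gradient descent (PGD) with the loss taken from a single layer's probe instead of the final classifier. The toy sketch below uses a small stand-in network and an untrained probe; the L∞ budget of 8/255 and the step schedule are common defaults, not necessarily the settings used in our experiments:

import torch
import torch.nn as nn
import torch.nn.functional as F

# A toy two-stage network with a linear probe on its intermediate stage,
# standing in for one tapped layer of the real backbone.
stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(8))
probe1 = nn.Linear(16, 10)  # probe on globally pooled stage-1 features

def probe_logits(x):
    return probe1(stage1(x).mean(dim=(2, 3)))

def pgd_against_layer(x, target_class, eps=8/255, step=2/255, steps=40):
    """L_inf PGD that pushes a single layer's probe towards `target_class`."""
    delta = torch.zeros_like(x, requires_grad=True)
    target = torch.full((x.shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        loss = F.cross_entropy(probe_logits(x + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()         # targeted: minimize the loss
            delta.clamp_(-eps, eps)                   # stay inside the L_inf ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep pixels valid
        delta.grad.zero_()
    return (x + delta).detach()

x_adv = pgd_against_layer(torch.rand(4, 3, 32, 32), target_class=3)

Running such an attack against each layer in turn and reading out all probes produces the kind of layer-by-layer picture shown in the figure above.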

Ensembling intermediate predictions via CrossMax => multi-resolution self-ensemble

We use this partial decorrelation of layer susceptibilities to construct a self-ensemble by ensembling the predictions of the intermediate layers (extracted with trained linear probes). To avoid a non-robust aggregation that could be dominated by a single layer or class, we propose a new ensembling algorithm called CrossMax.

A new ensembling algorithm called CrossMax
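The description above only pins CrossMax down as a Vickrey-auction-style aggregation that prevents any single layer or class from dominating. A minimal sketch consistent with that description is to normalize each predictor's logits by its own maximum, normalize each class by its maximum across predictors, and then score each class by its k-th highest remaining value; the exact selection rule and the value of k here are assumptions, and the paper gives the definitive algorithm:

import torch

def crossmax(logits, k=3):
    """Vickrey-style robust aggregation of ensemble logits.

    logits: [N, C] tensor -- N predictors (e.g. intermediate-layer probes),
    C classes. The two max-subtractions and the k-th-highest selection are a
    sketch of the mechanism described above, not the paper's exact algorithm.
    """
    # 1) Per-predictor normalization: a predictor cannot boost a class simply
    #    by emitting uniformly large logits.
    z = logits - logits.max(dim=1, keepdim=True).values
    # 2) Per-class normalization across predictors: a class cannot win just
    #    because one predictor pushes it very high.
    z = z - z.max(dim=0, keepdim=True).values
    # 3) Vickrey-style selection: score each class by its k-th highest value,
    #    so the aggregate relies on agreement among several predictors.
    return z.topk(k, dim=0).values[-1]  # shape [C]

ensemble_logits = torch.randn(10, 100)  # e.g. 10 layer probes, 100 classes
print(crossmax(ensemble_logits).argmax().item())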

Combining the multi-resolution input with intermediate-layer self-ensembling via CrossMax leads to white-box adversarial robustness at or above the current state of the art.

A multi-resolution self-ensemble

Adversarial robustness on RobustBench

Our multi-resolution self-ensemble, without any adversarial training at all, reaches results comparable to SOTA on the RobustBench white-box AutoAttack suite on CIFAR-10 and improves upon SOTA by +5% on CIFAR-100. With very light adversarial training (a 2x compute overhead), we surpass SOTA on CIFAR-10 by +5% and on CIFAR-100 by +9%. This is despite competing methods using 100x to 1000x more compute for adversarial training.

Results for adversarial robustness on RobustBench
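For reference, the RobustBench L∞ track evaluates models with the AutoAttack suite at eps = 8/255. A minimal evaluation sketch with the autoattack package is shown below; the placeholder ResNet18 and the small CIFAR-10 subset are for illustration only and stand in for the multi-resolution self-ensemble and the full test set:

import torch
import torchvision
import torchvision.transforms as T
from autoattack import AutoAttack  # pip install autoattack

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Any classifier mapping [B, 3, 32, 32] images in [0, 1] to logits works here;
# a freshly initialized ResNet18 stands in for the multi-resolution self-ensemble.
model = torchvision.models.resnet18(weights=None, num_classes=10).to(device).eval()

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True,
                                       transform=T.ToTensor())
x_test = torch.stack([testset[i][0] for i in range(256)])
y_test = torch.tensor([testset[i][1] for i in range(256)])

# RobustBench's CIFAR-10 L_inf track: AutoAttack, standard version, eps = 8/255.
adversary = AutoAttack(model, norm='Linf', eps=8/255, version='standard', device=device)
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)

with torch.no_grad():
    robust_acc = (model(x_adv.to(device)).argmax(dim=1).cpu() == y_test).float().mean()
print(f'AutoAttack robust accuracy on this subset: {robust_acc.item():.3f}')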

Gradient ascent on pixels towards class => interpretable image

Our model is so robust and aligned with human-like visual features that directly optimizing the input image to increase the probability of a target class generates interpretable images of the semantic content of that class. For a standard model, this would produce uninterpretable noise. We therefore effectively unify classification and generation. Examples of four such generated "attacks", starting from grey pixels and maximizing the probability of a target class with our CIFAR-100 model:

Images generated from grey pixels by maximizing the probability of target classes with our CIFAR-100 model
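A bare-bones version of this procedure is plain gradient ascent on the pixels. The optimizer, learning rate, step count, and image shape below are illustrative assumptions rather than the exact settings behind the figures, and `model` stands for any classifier (including a wrapper around the multi-resolution self-ensemble):

import torch
import torch.nn.functional as F

def generate_by_gradient_ascent(model, target_class, steps=200, lr=0.05,
                                image_shape=(1, 3, 32, 32)):
    """Start from grey pixels and directly optimize the image to raise the
    classifier's probability of `target_class`. For a sufficiently robust
    model, the result is an interpretable image of that class."""
    model.eval()
    x = torch.full(image_shape, 0.5, requires_grad=True)  # grey starting point
    optimizer = torch.optim.Adam([x], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x.clamp(0, 1)), target)  # push p(target) up
        loss.backward()
        optimizer.step()
    return x.detach().clamp(0, 1)

# Usage (with any CIFAR-100 classifier in place of the self-ensemble):
# img = generate_by_gradient_ascent(cifar100_model, target_class=23)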

If an attack succeeds, we see why

Due to the robustness of our model, successful attacks end up looking interpretable, and we can see why the class decision was changed.

When an adversarial attack succeeds, we see why the image has been misclassified

Bonus 1: Pretrained CLIP is now a generator with no additional training

Flipping the multi-resolution prior around, we show that if we express an adversarial perturbation as a sum of perturbations at different resolutions and optimize them jointly, the resulting images are very interpretable. Doing this with a CLIP model, we can steer the "attack" towards a particular text embedding, effectively creating an image generator for free, with no diffusion, GANs, or training involved at any stage.

Using the multi-resolution prior to turn a pretrained CLIP into an image generator
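A sketch of this generator using the openai CLIP package is below. The perturbation is parameterized as a sum of components at different resolutions, all optimized jointly to pull the CLIP image embedding towards a target text embedding. The specific resolutions, prompt, learning rate, and step count are illustrative assumptions, and the figures in the paper may involve extra augmentations or regularizers not shown here:

import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, _ = clip.load('ViT-B/32', device=device)
model = model.float().eval()  # full precision so float32 pixel components flow through

# CLIP's standard input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# The image is a sum of components living at different resolutions, each
# upsampled to the full 224x224 canvas -- the multi-resolution prior.
resolutions = (4, 8, 16, 32, 64, 128, 224)
components = [torch.zeros(1, 3, r, r, device=device, requires_grad=True)
              for r in resolutions]

def compose():
    canvas = sum(F.interpolate(c, size=(224, 224), mode='bilinear', align_corners=False)
                 for c in components)
    return torch.sigmoid(canvas)  # keep pixels in [0, 1]; starts out grey

text = clip.tokenize(['a photograph of a sailing boat at sunset']).to(device)
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(text), dim=-1)

optimizer = torch.optim.Adam(components, lr=0.03)
for _ in range(300):
    optimizer.zero_grad()
    image = (compose() - mean) / std
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    loss = -(img_emb * text_emb).sum()  # maximize cosine similarity to the prompt
    loss.backward()
    optimizer.step()

result = compose().detach()  # [1, 3, 224, 224] image steered towards the prompt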

When we start from an existing image and "attack" it towards a label, the changes we get are extremely interpretable and effectively boil down to photo manipulation.

A multi-resolution attack on CLIP turning Isaac Newton to Albert Einstein in an interpretable way

Bonus 2: The first transferable image attacks on GPT-4, Claude 3, and Bing AI

Using the same multi-resolution prior, we constructed the first transferable image attacks on state-of-the-art closed-source vision-LLMs such as GPT-4, Claude 3, and Bing AI.

Transferable image attacks on GPT-4o

BibTeX

@misc{fort2024ensembleeverywheremultiscaleaggregation,
      title={Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness}, 
      author={Stanislav Fort and Balaji Lakshminarayanan},
      year={2024},
      eprint={2408.05446},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.05446}, 
}