abstract: | We study the transfer of adversarial robustness of deep neural networks between different perturbation types. While most work on adversarial examples has focused on $L_\infty$- and $L_2$-bounded perturbations, these do not capture all types of perturbations available to an adversary. The present work evaluates 32 attacks of 5 different types against models adversarially trained on a 100-class subset of ImageNet. Our empirical results suggest that evaluating on a wide range of perturbation sizes is necessary to understand whether adversarial robustness transfers between perturbation types. We further demonstrate that robustness against one perturbation type may not always imply, and may sometimes hurt, robustness against other perturbation types. In light of these results, we recommend that evaluation of adversarial defenses take place on a diverse range of perturbation types and sizes.
bibliography:
- adv-icml2019.bib
Introduction {#sec:intro}
Deep networks have shown remarkable accuracy on benchmark tasks [@he2016identity], but can also be fooled by imperceptible changes to inputs, known as adversarial examples [@goodfellow2014explaining]. In response, researchers have studied the robustness of models, or how well models generalize in the presence of (potentially adversarial) bounded perturbations to inputs.
How can we tell if a model is robust? Evaluating model robustness is challenging because, while evaluating accuracy only requires a fixed distribution, evaluating robustness requires good performance in the presence of many perturbations, which may be hard to anticipate or model. In the context of image classification, considerable work has focused on robustness to "$L_{\infty}$-bounded" perturbations (perturbations with bounded per-pixel magnitude) [@goodfellow2014explaining; @madry2017towards; @xie2018feature]. However, models hardened against $L_{\infty}$-bounded perturbations are still vulnerable to even small, perceptually minor departures from this family, such as small rotations and translations [@engstrom2017rotation]. Meanwhile, researchers continue to develop creative attacks that are difficult even to specify mathematically, such as fake eyeglasses, adversarial stickers, and 3D-printed objects [@sharif2018adversarial; @brown2017adversarial; @athalye2017synthesizing].
The perspective of this paper is that any single, simple-to-define type of perturbation is likely insufficient to capture what a deployed model will be subject to in the real world. To address this, we investigate robustness of models with respect to a broad range of perturbation types. We start with the following question:
When and how much does robustness to one type of perturbation transfer to other perturbations?
We study this question using adversarial training, a strong technique for adversarial defense applicable to any fixed attack [@goodfellow2014explaining; @madry2017towards]. We evaluate $32$ attacks of $5$ different types--$L_\infty$ [@goodfellow2014explaining], $L_2$ [@carlini2017towards], $L_1$ [@chen2018ead], elastic deformations [@xiao2018spatially], and JPEG [@shin2017jpeg]--against adversarially trained ResNet-50 models on a 100-class subset of full-resolution ImageNet.
Our results provide empirical evidence that models robust under one perturbation type are not necessarily robust under other natural perturbation types. We show that:
-   Evaluating on a carefully chosen range of perturbation sizes is important for measuring robustness transfer.
-   Adversarial training against the elastic deformation attack demonstrates that adversarial robustness against one perturbation type can transfer poorly to, and at times hurt, robustness against other perturbation types.
-   Adversarial training against the $L_2$ attack may be better than training against the widely used $L_\infty$ attack.
While any given set of perturbation types may not encompass all potential perturbations that can occur in practice, our results demonstrate that robustness can fail to transfer even across a small but diverse set of perturbation types. Prior work in this area [@sharma2017attacking; @jordan2019quantifying; @tramer2019adversarial] has studied transfer using single values of $\varepsilon$ for each attack on lower resolution datasets; we believe our larger-scale study provides a more comprehensive and interpretable view on transfer between these attacks. We therefore suggest considering performance against several different perturbation types and sizes as a first step for rigorous evaluation of adversarial defenses.
Adversarial attacks {#sec:attacks}
We consider five types of adversarial attacks under the following framework. Let $f: \mathbb{R}^{3 \times 224 \times 224} \to \mathbb{R}^{100}$ be a model mapping images to logits[^1], and let $\ell(f(x), y)$ denote the cross-entropy loss. For an input $x$ with true label $y$ and a target class $y' \neq y$, the attacks attempt to find $x'$ such that
-   the attacked image $x'$ is a perturbation of $x$, constrained in a sense which differs for each attack, and
-   the loss $\ell(f(x'), y')$ is minimized (targeted attack).
We consider the targeted setting and the following attacks, described in more detail below:
-   $L_\infty$ [@goodfellow2014explaining]
-   $L_2$ [@szegedy2013intriguing; @carlini2017towards]
-   $L_1$ [@chen2018ead]
-   JPEG
-   Elastic deformation [@xiao2018spatially]
The $L_\infty$ and $L_2$ attacks are standard in the adversarial examples literature [@athalye2018obfuscated; @papernot2016distillation; @madry2017towards; @carlini2017towards] and we chose the remaining attacks for diversity in perturbation type. We now describe each attack, with sample images in Figure 6{reference-type="ref" reference="fig:sample-images"} and Appendix 5{reference-type="ref" reference="sec:attack-samples"}. We clamp output pixel values to $[0, 255]$.
For $L_p$ attacks with $p \in \{1, 2, \infty\}$, the constraint allows an image $x \in \mathbb{R}^{3 \times 224 \times 224}$, viewed as a vector of RGB pixel values, to be modified to an attacked image $x' = x + \delta$ with $$\|x' - x\|_p \leq \varepsilon,$$ where $\|\cdot\|_p$ denotes the $L_p$-norm on $\mathbb{R}^{3 \times 224 \times 224}$. For the $L_\infty$ and $L_2$ attacks, we optimize using randomly-initialized projected gradient descent (PGD), which optimizes the perturbation $\delta$ by gradient descent and projection to the $L_\infty$ and $L_2$ balls [@madry2017towards]. For the $L_1$ attack, we use the randomly-initialized Frank-Wolfe algorithm [@frank1956algorithm], detailed in Appendix 7{reference-type="ref" reference="sec:fw-pseudo"}. We believe that our Frank-Wolfe algorithm is more principled than the optimization used in existing $L_1$ attacks such as EAD.
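For concreteness, the following is a minimal single-image sketch of randomly-initialized targeted PGD with the $\varepsilon/\sqrt{\text{steps}}$ step size used in our experiments; the function and argument names are illustrative and not taken from our training code.

```python
import torch

def targeted_pgd(model, loss_fn, x, y_target, eps, steps, norm="inf"):
    """Targeted PGD sketch for a single image x of shape (1, 3, 224, 224) with
    pixel values in [0, 255]; minimizes loss_fn(model(x'), y_target) subject to
    ||x' - x||_p <= eps for p in {inf, 2}."""
    step_size = eps / steps ** 0.5                 # epsilon / sqrt(steps)
    delta = torch.zeros_like(x)
    if norm == "inf":                              # random start inside the ball
        delta.uniform_(-eps, eps)
    else:
        delta.normal_()
        delta *= eps * torch.rand(1).item() / (delta.norm() + 1e-12)
    delta.requires_grad_(True)

    for _ in range(steps):
        loss = loss_fn(model(x + delta), y_target)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            if norm == "inf":
                delta -= step_size * grad.sign()   # descend on the targeted loss
                delta.clamp_(-eps, eps)            # project onto the L_inf ball
            else:
                delta -= step_size * grad / (grad.norm() + 1e-12)
                scale = (eps / (delta.norm() + 1e-12)).clamp(max=1.0)
                delta *= scale                     # project onto the L_2 ball
            delta.copy_(torch.clamp(x + delta, 0, 255) - x)  # keep pixels in [0, 255]
    return (x + delta).detach()
```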
[Sample attacked images, from left to right: clean, $L_\infty$, $L_2$, $L_1$, JPEG, elastic.]{#fig:sample-images}
JPEG compression, discussed as a defense in @shin2017jpeg, applies a lossy linear transformation based on the discrete cosine transform (denoted by $\mathsf{JPEG}$) to image space, followed by quantization. The JPEG attack, which we believe is new to this work, imposes on the attacked image $x'$ an $L_\infty$-constraint in this transformed space: $$\|\mathsf{JPEG}(x) - \mathsf{JPEG}(x')\|_\infty \leq \varepsilon.$$ We optimize $z = \mathsf{JPEG}(x')$ with randomly-initialized PGD and apply a right inverse of $\mathsf{JPEG}$ to obtain the attacked image.
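A sketch of the structure of this attack is below. Here `jpeg_encode` and `jpeg_decode` are placeholders for a differentiable implementation of the DCT-based transform and a right inverse of it; they are assumptions of the sketch, not a real API, and the final clamp is only an approximation of the constraint.

```python
import torch

def jpeg_space_attack(model, loss_fn, x, y_target, eps, steps,
                      jpeg_encode, jpeg_decode):
    """Sketch of the JPEG attack: PGD on a perturbation of z = JPEG(x) under an
    L_inf constraint of radius eps in JPEG coefficient space."""
    z0 = jpeg_encode(x)                               # clean image in JPEG space
    step_size = eps / steps ** 0.5
    delta = torch.empty_like(z0).uniform_(-eps, eps)  # random initialization
    delta.requires_grad_(True)
    for _ in range(steps):
        x_adv = jpeg_decode(z0 + delta).clamp(0, 255) # map back to pixel space
        loss = loss_fn(model(x_adv), y_target)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= step_size * grad.sign()          # minimize the targeted loss
            delta.clamp_(-eps, eps)                   # ||JPEG(x') - JPEG(x)||_inf <= eps
    return jpeg_decode(z0 + delta).detach().clamp(0, 255)
```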
The elastic deformation attack allows perturbations $$x' = \mathsf{Flow}(x, V),$$ where $V: \{1, \ldots, 224\}^2 \to \mathbb{R}^2$ is a vector field on pixel space, and $\mathsf{Flow}$ sets the value of pixel $(i, j)$ to the (bilinearly interpolated) value at $(i, j) + V(i, j)$. We constrain $V$ to be the convolution of a vector field $W$ with a $25 \times 25$ Gaussian kernel with standard deviation $3$, and enforce that $$\|W(i, j)\|_\infty \leq \varepsilon\qquad \text{ for } i, j \in \{1, \ldots, 224\}.$$ We optimize the value of $W$ with randomly-initialized PGD. Note that our attack differs in details from @xiao2018spatially, but is similar in spirit.
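The core of the elastic attack can be sketched with standard PyTorch operations as below. The helper names and the flow-channel convention are ours, and the snippet shows only the forward pass; the outer PGD loop on $W$ (with $W$ clamped to $[-\varepsilon, \varepsilon]$ after each gradient step) is omitted.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=25, sigma=3.0):
    """25x25 Gaussian kernel with standard deviation 3, one copy per flow channel."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).repeat(2, 1, 1, 1)  # shape (2, 1, 25, 25) for grouped conv

def elastic_flow(x, W, kernel):
    """Apply Flow(x, V), where V is W smoothed by the Gaussian kernel.
    x: (1, 3, 224, 224) image; W: (1, 2, 224, 224) raw vector field in pixel units."""
    V = F.conv2d(W, kernel, padding=kernel.shape[-1] // 2, groups=2)
    # identity sampling grid in grid_sample's normalized [-1, 1] coordinates
    theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]],
                         device=x.device, dtype=x.dtype)
    base_grid = F.affine_grid(theta, size=x.shape, align_corners=True)
    # convert pixel offsets to normalized offsets (channel order is our convention)
    flow = 2.0 * V.permute(0, 2, 3, 1) / (x.shape[-1] - 1)
    return F.grid_sample(x, base_grid + flow, mode="bilinear", align_corners=True)
```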
[Figure: accuracy grid of each attack (columns) evaluated against each adversarially trained model (rows).]{#fig:grid width="0.99\linewidth"}
Experiments {#sec:experiments}
We measure transfer of adversarial robustness by evaluating our attacks against adversarially trained models. For each attack, we adversarially train models against the attack for a range of perturbation sizes $\varepsilon$. We then evaluate each adversarially trained model against each attack, giving the $2$-dimensional accuracy grid of attacks evaluated against adversarially trained models shown in Figure [fig:grid]{reference-type="ref" reference="fig:grid"} (analyzed in detail in Section 3.2{reference-type="ref" reference="sec:results"}).
Experimental setup
Dataset and model. We use the $100$-class subset of ImageNet-1K [@deng2009imagenet] containing classes whose WordNet ID is a multiple of $10$. We use the ResNet-50 [@he2016identity] architecture with standard 224$\times$224 resolution as implemented in torchvision. We believe this full resolution is necessary for the elastic and JPEG attacks.
Training hyperparameters. We trained on machines with 8 Nvidia V100 GPUs using standard data augmentation practices [@he2016identity]. Following best practices for multi-GPU training [@goyal2017accurate], we used synchronized SGD for $90$ epochs with a batch size of 32$\times$8 and a learning rate schedule in which the learning rate is "warmed up" for 5 epochs and decayed at epochs 30, 60, and 80 by a factor of 10. Our initial learning rate after warm-up was 0.1, momentum was $0.9$, and weight decay was $5\times10^{-6}$.
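A minimal sketch of this optimizer and learning-rate schedule follows, assuming a linear warm-up as in @goyal2017accurate; the helper name is ours, and the scheduler is stepped once per epoch.

```python
import torch

def make_optimizer_and_schedule(model, base_lr=0.1, warmup_epochs=5):
    """SGD with momentum 0.9 and weight decay 5e-6; linear warm-up to lr=0.1
    over 5 epochs, then 10x decay at epochs 30, 60, and 80."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-6)
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs            # warm-up multiplier
        return 0.1 ** sum(epoch >= m for m in (30, 60, 80))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```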
| attack | optimization algorithm | $\varepsilon$ or $\varepsilon_{\text{max}}$ values | step size | steps (adversarial training) | steps (eval) |
|--------|------------------------|----------------------------------------------------|-----------|------------------------------|--------------|
| $L_\infty$ | PGD | $\{2^i \mid 0 \leq i \leq 5\}$ | $\frac{\varepsilon}{\sqrt{\text{steps}}}$ | 10 | 50 |
| $L_2$ | PGD | $\{150 \cdot 2^i \mid 0 \leq i \leq 5\}$ | $\frac{\varepsilon}{\sqrt{\text{steps}}}$ | 10 | 50 |
| $L_1$ | Frank-Wolfe | $\{9562.5 \cdot 2^i \mid 0 \leq i \leq 6\}$ | N/A | 10 | 50 |
| JPEG | PGD | $\{0.03125 \cdot 2^i \mid 0 \leq i \leq 5\}$ | $\frac{\varepsilon}{\sqrt{\text{steps}}}$ | 10 | 50 |
| Elastic | PGD | $\{0.25 \cdot 2^i \mid 0 \leq i \leq 6\}$ | $\frac{\varepsilon}{\sqrt{\text{steps}}}$ | 30 | 100 |

: Attack parameters used for adversarial training and evaluation. {#tab:adv-settings}
Adversarial training. We harden models against attacks using adversarial training [@madry2017towards]. To train against attack $A$, for each mini-batch of training images, we select target classes for each image uniformly at random from the $99$ incorrect classes. We generate adversarial images by applying the targeted attack $A$ to the current model with $\varepsilon$ chosen uniformly at random between $0$ and $\varepsilon_{\text{max}}$. Finally, we update the model with a step of synchronized SGD using these adversarial images alone.
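A sketch of one epoch of this procedure is given below; `attack` stands for any of the targeted attacks in Table [tab:adv-settings]{reference-type="ref" reference="tab:adv-settings"}, and its signature here is illustrative rather than a fixed API.

```python
import torch

def adversarial_training_epoch(model, loader, optimizer, attack, eps_max,
                               num_classes=100, device="cuda"):
    """One epoch of adversarial training: sample random incorrect targets and a
    random eps in [0, eps_max], attack the current model, and take an SGD step
    on the attacked images only."""
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # sample a target class uniformly from the 99 incorrect classes
        offset = torch.randint(1, num_classes, y.shape, device=device)
        y_target = (y + offset) % num_classes
        eps = torch.rand(1).item() * eps_max       # eps ~ Uniform(0, eps_max)
        x_adv = attack(model, x, y_target, eps)    # generate adversarial images
        optimizer.zero_grad()
        loss = criterion(model(x_adv), y)          # update on adversarial images alone
        loss.backward()
        optimizer.step()
```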
We list attack parameters used for training in Table [tab:adv-settings]{reference-type="ref" reference="tab:adv-settings"}. For the PGD attack, we chose step size $\frac{\varepsilon}{\sqrt{\text{steps}}}$, motivated by the fact that taking step size proportional to $1/\sqrt{\text{steps}}$ is optimal for non-smooth convex functions [@nemirovski1978cezari; @nemirovski1983complexity]. Note that the greater number of PGD steps for elastic deformation is due to the greater difficulty of its optimization problem, which we are not confident is fully solved even with this greater number of steps.
Attack hyperparameters. We evaluate our adversarially trained models on the (subsetted) ImageNet-1K validation set against targeted attacks with target chosen uniformly at random from among the $99$ incorrect classes. We list attack parameters for evaluation in Table [tab:adv-settings]{reference-type="ref" reference="tab:adv-settings"}. As suggested in [@carlini2019evaluating], we use more steps for evaluation than for adversarial training to ensure PGD converges.
Results and analysis {#sec:results}
Using the results of our adversarial training and evaluation experiments in Figure [fig:grid]{reference-type="ref" reference="fig:grid"}, we draw the following conclusions.
Choosing $\varepsilon$ well is important. Because attack strength increases with the allowed perturbation magnitude $\varepsilon$, comparing robustness between different perturbation types requires a careful choice of $\varepsilon$ for both attacks. First, we observe that a range of $\varepsilon$ yielding comparable attack strengths should be used for all attacks to avoid drawing misleading conclusions. We suggest the following principles for choosing this range, which we followed for the parameters in Table [tab:adv-settings]{reference-type="ref" reference="tab:adv-settings"}:
-   Models adversarially trained against the minimum value of $\varepsilon$ should have validation accuracy comparable to that of a model trained on unattacked data.
-   Attacks with the maximum value of $\varepsilon$ should substantially reduce validation accuracy in adversarial training or perturb the images enough to confuse humans.
To illustrate this point, we provide in Appendix 6{reference-type="ref" reference="sec:trunc-range"} a subset of Figure [fig:grid]{reference-type="ref" reference="fig:grid"} with $\varepsilon$ ranges that differ in strength between attacks; the (deliberately) biased ranges of $\varepsilon$ chosen in this subset cause the $L_1$ and elastic attacks to be perceived as stronger than our full results reveal.
Second, even if two attacks are evaluated on ranges of $\varepsilon$ of comparable strength, the specific values of $\varepsilon$ chosen within those ranges may be important. We scaled $\varepsilon$ geometrically for all attacks in our experiments, but attack strength may not scale with $\varepsilon$ in the same way for different attacks; when interpreting our results, we therefore only draw conclusions which are invariant to the precise scaling of attack strength with $\varepsilon$. We illustrate this type of analysis with the following two examples.
Robustness against elastic transfers poorly to the other attacks. In Figure [fig:grid]{reference-type="ref" reference="fig:grid"}, models adversarially trained against elastic achieve higher accuracy against elastic than against the other attacks, meaning that for these values of $\varepsilon$, robustness against elastic does not imply robustness against other attacks. On the other hand, training against elastic with $\varepsilon\geq 4$ generally increases accuracy against elastic with $\varepsilon\geq 4$, but decreases accuracy against all other attacks.
Together, these imply that the lack of transfer we observe in Figure [fig:grid]{reference-type="ref" reference="fig:grid"} is not an artifact of the specific values of $\varepsilon$ we chose, but rather a broader effect at the level of perturbation types. In addition, this example shows that increasing robustness to larger perturbation sizes of a given type can hurt robustness to other perturbation types. This effect is only visible by considering an appropriate range of $\varepsilon$ and cannot be detected from a single value of $\varepsilon$ alone.
$L_2$ adversarial training is weakly better than $L_\infty$. Comparing rows of Figure [fig:grid]{reference-type="ref" reference="fig:grid"} corresponding to training against $L_2$ with $\varepsilon\in \{300, 600, 1200, 2400, 4800\}$ with rows corresponding to training against $L_\infty$ with $\varepsilon\in \{1, 2, 4, 8, 16\}$, we see that training against $L_2$ yields slightly lower accuracies against $L_\infty$ attacks and higher accuracies against all other attacks. Because this effect extends to all $\varepsilon$ for which training against $L_\infty$ is helpful, it does not depend on the relation between $L_\infty$ attack strength and $\varepsilon$. In fact, against the stronger half of our attacks, training against $L_2$ with $\varepsilon= 4800$ gives accuracy comparable to or better than training against $L_\infty$ with an adaptive choice of $\varepsilon$. This provides some evidence that $L_2$ is more effective to train against than $L_\infty$.
Conclusion {#sec:conclusion}
This work presents an empirical study of when and how much robustness transfers between different adversarial perturbation types. Our results on adversarial training and evaluation of 32 different attacks on a 100-class subset of ImageNet-1K highlight the importance of considering a diverse range of perturbation sizes and types for assessing transfer between types, and we recommend this as a guideline for evaluating adversarial robustness.
Acknowledgements {#acknowledgements .unnumbered}
D. K. was supported by NSF Grant DGE-1656518. Y. S. was supported by a Junior Fellow award from the Simons Foundation and NSF Grant DMS-1701654. D. K., Y. S., and J. S. were supported by a grant from the Open Philanthropy Project.
Sample attacked images {#sec:attack-samples}
In this appendix, we give more comprehensive sample outputs for our adversarial attacks. Figures [fig:strong-attack]{reference-type="ref" reference="fig:strong-attack"} and [fig:weak-attack]{reference-type="ref" reference="fig:weak-attack"} show sample attacked images for attacks with relatively large and small $\varepsilon$ in our range, respectively. Figure [fig:attack-transfer]{reference-type="ref" reference="fig:attack-transfer"} shows examples of how attacked images can be influenced by different types of adversarial training for defense models. In all cases, the images were generated by running the specified attack against an adversarially trained model with parameters specified in Table [tab:adv-settings]{reference-type="ref" reference="tab:adv-settings"} for both evaluation and adversarial training.
[Sample attacked images with relatively large $\varepsilon$. Columns: clean, $L_\infty$ ($\varepsilon=32$), $L_2$ ($\varepsilon=4800$), $L_1$ ($\varepsilon=306000$), JPEG ($\varepsilon=1$), elastic ($\varepsilon=8$). Rows: black swan, chain mail, espresso maker, manhole cover, water tower, orange, volcano.]{#fig:strong-attack}
[Sample attacked images with relatively small $\varepsilon$. Columns: clean, $L_\infty$ ($\varepsilon=4$), $L_2$ ($\varepsilon=600$), $L_1$ ($\varepsilon=38250$), JPEG ($\varepsilon=0.125$), elastic ($\varepsilon=1$). Rows: black swan, chain mail, espresso maker, manhole cover, water tower, orange, volcano.]{#fig:weak-attack}
[Attacked images generated against models adversarially trained on a different perturbation type. Columns (attack / adversarial training): clean; $L_2$ $\varepsilon=2400$ / $L_1$ $\varepsilon=153000$; $L_2$ $\varepsilon=2400$ / elastic $\varepsilon=4$; $L_1$ $\varepsilon=153000$ / $L_2$ $\varepsilon=2400$; $L_1$ $\varepsilon=153000$ / elastic $\varepsilon=4$; elastic $\varepsilon=4$ / $L_2$ $\varepsilon=2400$; elastic $\varepsilon=4$ / $L_1$ $\varepsilon=153000$. Rows: black swan, chain mail, espresso maker, manhole cover, water tower, orange, volcano.]{#fig:attack-transfer}
Evaluation on a truncated $\varepsilon$ range {#sec:trunc-range}
In this appendix, we show in Figure [fig:grid-small]{reference-type="ref" reference="fig:grid-small"} a subset of Figure [fig:grid]{reference-type="ref" reference="fig:grid"} with a truncated range of $\varepsilon$. In particular, we omitted small values of $\varepsilon$ for $L_1$, elastic, and JPEG and large values of $\varepsilon$ for $L_\infty$ and $L_2$. The resulting accuracy grid gives several misleading impressions, including:
-   The $L_1$ attack is stronger than $L_\infty$, $L_2$, and JPEG.
-   Training against the other attacks gives almost no robustness against the elastic attack.
The full range of results in Figure [fig:grid]{reference-type="ref" reference="fig:grid"} shows that these two purported effects are artifacts of the incorrectly truncated range of $\varepsilon$ used in Figure [fig:grid-small]{reference-type="ref" reference="fig:grid-small"}. In particular:
-   The additional smaller $\varepsilon$ columns for the $L_1$ attack in Figure [fig:grid]{reference-type="ref" reference="fig:grid"} demonstrate that its perceived strength in Figure [fig:grid-small]{reference-type="ref" reference="fig:grid-small"} is an artifact of incorrectly omitting these values.
-   The additional smaller $\varepsilon$ columns for the elastic attack in Figure [fig:grid]{reference-type="ref" reference="fig:grid"} reveal that training against the other attacks is effective in defending against weak versions of the elastic attack, contrary to the impression given by Figure [fig:grid-small]{reference-type="ref" reference="fig:grid-small"}.
[Figure: the accuracy grid of Figure [fig:grid]{reference-type="ref" reference="fig:grid"} restricted to a truncated range of $\varepsilon$.]{#fig:grid-small width="0.99\linewidth"}
Algorithm: randomly-initialized Frank-Wolfe over the truncated $L_1$ ball. {#alg:fw-alg}

Input: function $f$, initial input $x \in [0,1]^d$, $L_1$ radius $\rho$, number of steps $T$.
Output: approximate maximizer $\bar{x}$ of $f$ over the truncated $L_1$ ball $B_1(\rho; x) \cap [0,1]^d$ centered at $x$.

1.  $x^{(0)} \gets \mathrm{RandomInit}(x)$
2.  For $t = 1, \ldots, T$:
    1.  $g \gets \nabla f(x^{(t-1)})$
    2.  For each $k$, let $s_k$ be the index of the coordinate of $g$ with the $k^\text{th}$ largest magnitude, and set $S_k \gets \{s_1, \ldots, s_k\}$.
    3.  For each coordinate $i$, set $b_i \gets 1-x_i$ if $g_i > 0$ and $b_i \gets -x_i$ otherwise.
    4.  $M_k \gets \sum_{i \in S_k} |b_i|$ and $k^* \gets \max\{k \mid M_k \leq \rho\}$.
    5.  Set $\hat{x}_i \gets x_i + b_i$ for $i \in S_{k^*}$, $\hat{x}_i \gets x_i + (\rho - M_{k^*}) \operatorname{sign}(g_i)$ for $i = s_{k^*+1}$, and $\hat{x}_i \gets x_i$ for all other $i$.
    6.  $x^{(t)} \gets (1-\frac{1}{t})x^{(t-1)} + \frac{1}{t}\hat{x}$
3.  Return $\bar{x} \gets x^{(T)}$.
$L_1$ Attack {#sec:fw-pseudo}
We chose to use the Frank-Wolfe algorithm for optimizing the $L_1$ attack, as Projected Gradient Descent would require projecting onto a truncated $L_1$ ball, which is a complicated operation. In contrast, Frank-Wolfe only requires optimizing linear functions $g^{\top}x$ over a truncated $L_1$ ball; this can be done by sorting coordinates by the magnitude of $g$ and moving the top $k$ coordinates to the boundary of their range (with $k$ chosen by binary search). This is detailed in Algorithm [alg:fw-alg]{reference-type="ref" reference="alg:fw-alg"}.
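A compact sketch of this linear maximization step and the surrounding Frank-Wolfe loop is given below, for flattened tensors with pixel values in $[0,1]$ as in Algorithm [alg:fw-alg]{reference-type="ref" reference="alg:fw-alg"}. The function names are ours, `grad_f` returns the gradient of the objective being maximized (e.g., the negative targeted loss), and the random initialization is omitted for brevity.

```python
import torch

def truncated_l1_lmo(x, g, rho):
    """Maximize <g, v - x> over v in B_1(rho; x) ∩ [0, 1]^d for flattened x, g:
    sort coordinates by |g| and push them to their box boundary until the L_1
    budget rho is used up, mirroring the inner step of the algorithm above."""
    b = torch.where(g > 0, 1.0 - x, -x)           # move toward 1 if g_i > 0, else toward 0
    order = torch.argsort(g.abs(), descending=True)
    costs = b.abs()[order]                        # L_1 cost of saturating each coordinate
    cum = torch.cumsum(costs, dim=0)
    k_star = int((cum <= rho).sum())              # largest k with M_k <= rho
    v = x.clone()
    top = order[:k_star]
    v[top] = x[top] + b[top]                      # saturate the k* highest-|g| coordinates
    if k_star < x.numel():                        # spend leftover budget on the next one
        i = order[k_star]
        spent = cum[k_star - 1] if k_star > 0 else torch.tensor(0.0)
        v[i] = (x[i] + (rho - spent) * torch.sign(g[i])).clamp(0.0, 1.0)
    return v

def frank_wolfe_l1(grad_f, x, rho, steps):
    """Frank-Wolfe loop: x_t = (1 - 1/t) x_{t-1} + (1/t) * argmax_v <grad, v>."""
    xt = x.clone()
    for t in range(1, steps + 1):
        v = truncated_l1_lmo(x, grad_f(xt), rho)
        xt = (1 - 1.0 / t) * xt + (1.0 / t) * v
    return xt
```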
[^1]: For all experiments, the input is a $224 \times 224$ image, and the output is one of $100$ classes.