Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines

http://arxiv.org/abs/2112.00659v1

Abstract: A certified robustness guarantee gauges a model's robustness to test-time attacks and can assess the model's readiness for deployment in the real world. In this work, we critically examine how the adversarial robustness guarantees from randomized smoothing-based certification methods change when state-of-the-art certifiably robust models encounter out-of-distribution (OOD) data. Our analysis demonstrates a previously unknown vulnerability of these models to low-frequency OOD data such as weather-related corruptions, rendering these models unfit for deployment in the wild. To alleviate this issue, we propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data. Furthermore, we propose a new regularizer that encourages consistent predictions on noise perturbations of the augmented data to improve the quality of the smoothed models. We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models, enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks. Our evaluation also uncovers the inability of current OOD benchmarks to highlight the spectral biases of the models. To this end, we propose a comprehensive benchmarking suite that contains corruptions from different regions of the spectral domain. Evaluation of models trained with popular augmentation methods on the proposed suite highlights their spectral biases and establishes the superiority of FourierMix-trained models at achieving better certified robustness guarantees under OOD shifts over the entire frequency spectrum.


Introduction {#sec:intro}

Developing machine learning (ML) systems that are robust to adversarial variations in the test data is critical for applied domains that require ML safety [@mlsafety], such as autonomous driving and cyber-security. Unfortunately, a large body of work in this direction has fallen into the cycle where new empirical defense techniques are proposed, followed by new adaptive attacks breaking these defenses [@athalye2018obfuscated; @tramer2020adaptive]. Therefore, significant efforts have been dedicated to developing methods that are certifiably robust [@wong2018provable; @gowal2018effectiveness; @raghunathan2018certified] which provide provable robustness guarantees. Most promising among these certified defenses are randomized smoothing (RS) based certified defenses [@li2018second; @lecuyer2019certified; @cohen2019certified] which are scalable to deep neural networks (DNNs) and high-dimensional datasets. Specifically, the RS-based certification procedure relies on a smoothed version of the original classifier, which outputs the class most likely returned by the original classifier under random noise perturbations of the input. Prediction from the RS procedure at the test time is accompanied by a radius in which the predictions of the smoothed classifier are guaranteed to remain constant, thereby making them resilient to adversarial attacks within the neighborhood. Training methods such as [@cohen2019certified; @Zhai2020MACER; @salman2019provably] have been proposed to maximize the average certified radius (ACR), and models trained using these procedures achieve state-of-the-art (SOTA) adversarial robustness guarantees, all while assuming that the test data is identically distributed to the training data. In this work, we take a critical look at the current status of certifiably robust ML and consider whether these certifiably robust models are ready for deployment in the real world.

Our work takes the first steps towards answering this question by evaluating RS-based provably robust ML models on out-of-distribution (OOD) shifts, as mismatches between the training and deployment distributions are ubiquitous in the real world. Our analysis shows that OOD data pose a serious threat to certifiably robust models. We thereby highlight a previously unrecognized threat to certifiably robust models and show that these models are not yet ready for deployment in the real world. Specifically, we find state-of-the-art certifiably robust models to be surprisingly brittle to low-frequency perturbations, such as weather-related corruptions (fog and frost). Vulnerability to such corruptions could lead to detrimental performance of ML models in safety-critical applications. For example, a 35%--75% drop in performance is observed on low-frequency corruptions, rendering RS-based robustness guarantees useless (Figure 1{reference-type="ref" reference="fig:teaser"}).

Motivated by our analysis, which shows RS-based smoothed classifiers to be brittle to low-frequency corruptions, we propose a novel data augmentation method that uses spectrally diverse yet semantically consistent augmentations of the training data. Specifically, our proposed method, FourierMix, generates augmented samples by applying Fourier-based transformations to the input data to increase the spectral coverage of the training data. FourierMix randomly perturbs the amplitude and phase of the training images and then combines them with affine transformations of the data, producing spectrally diverse augmentations. To encourage the model to produce consistent predictions on different augmentations of the data, we propose a hierarchical consistency regularizer (HCR). HCR enforces semantic consistency of representations across random noise perturbations as well as FourierMix-generated augmentations of the same input image. The proposed scheme consistently achieves significantly better certified robustness guarantees, across a range of OOD benchmarks, than existing state-of-the-art data augmentation schemes extended to build a smoothed classifier. We further analyze these smoothed models using Fourier sensitivity analysis in the spectral domain. In comparison to other methods, models trained on FourierMix augmentations coupled with hierarchical consistency regularization are significantly more resilient to perturbations across the entire frequency spectrum.

Robustness guarantees of certified models [@cohen2019certified] degrade significantly on out-of-distribution data, suggesting that these models are not yet ready for deployment in the real world.{#fig:teaser width="\linewidth"}

Our empirical evaluation of certifiably robust models on various OOD benchmark datasets uncovers another peculiar phenomenon: popular benchmark datasets may be biased towards certain frequency regions. Due to the complexity of real-world data, it is extremely challenging and tedious to uncover the spectral biases of the models and to identify their failure modes. Because of this, improvements in the performance of the models on these benchmark datasets may not generalize to other OOD scenarios. Thus, we should be cautious and avoid over-reliance on a specific leaderboard, especially when judging the robustness of models on OOD data. To enable designers to understand the spectral biases of their models and obtain a more comprehensive view of model robustness to OOD data, we propose a complementary new benchmark that includes a collection of OOD test sets, each focusing on a specific frequency range while collectively covering the entire frequency spectrum. Evaluating the certified robustness of different models on the proposed dataset shows that the smoothed models obtained after training with existing data augmentation schemes are indeed biased towards certain frequency regions, which explains the observed performance (and ranking) variations across different benchmarks. In contrast, models trained with FourierMix-based data augmentations perform significantly better than the competitors across the entire frequency spectrum. This further demonstrates that our data augmentation produces spectrally diverse data, which helps alleviate the frequency biases of the models. Our main contributions are as follows:

- **Brittleness of certified defenses under OOD shifts.** We investigate recent RS-based certified defenses and observe a significant degradation of their certified robustness guarantees in OOD settings, especially when the test-time corruptions are spectrally dissimilar from the train-time augmentations (e.g., low-frequency weather corruptions). A Fourier sensitivity analysis shows that smoothed models behave differently from their empirically robust counterparts. Although OOD robustness and adversarial robustness are usually studied as separate research thrusts, and the goal of RS is not to guarantee robustness to OOD data, real-world deployment requires the certified guarantees to hold under such shifts. Note that we still focus on certified defense against small additive perturbations, but under complex and unknown OOD corruptions; certified semantic defenses, which require knowing the corruption type at training time, are complementary and left as future work.

- **A simple yet strong baseline.** We propose FourierMix, a spectral data augmentation method, coupled with a hierarchical (multi-view) consistency regularizer, to train base classifiers whose smoothed counterparts achieve significantly better certified robustness guarantees on OOD data. We also report surprising findings: improvements in empirical accuracy do not always correlate with gains in certified radius on OOD data, and Gaussian augmentation with a Jensen--Shannon consistency loss is a simple yet surprisingly strong baseline.

- **Spectral benchmarking suite.** We show that existing data augmentation techniques generally improve OOD robustness over the no-augmentation baseline, but how much they help (and their relative ranking) is hard to determine from existing benchmarks: performance scores and rankings vary across datasets because current benchmark datasets are biased towards certain frequency regions, giving an incomplete, if not misleading, picture. To complement them, we propose a benchmarking suite of spectrally diverse OOD test sets and use it to expose the spectral biases of models trained with popular augmentation methods.

Our results suggest that certified robustness under OOD shifts can be improved by training on spectrally diverse augmentations, and that SOTA data augmentations may not generalize well beyond existing benchmarks. Different defenses and augmentation schemes exhibit different degrees of generalization; as an implication, caution is needed for fair robustness evaluations whenever additional data augmentation is introduced. We have not solved the problem, but we have taken first steps: we hope our results and tools will enable more robust progress towards improving certified robustness to image corruptions, and we expect that fully solving the problem will require substantially more work from the community.

Related Work {#sec:related}

Deep neural networks (DNNs) trained using standard gradient descent optimizers [@ruder2016overview] have been shown to be vulnerable to adversarial examples [@szegedy2013intriguing]. A number of white- and black-box attacks have been proposed [@7958570; @chen2017ead; @xiao2018generating; @CPY17zoo; @ilyas2018black] to construct adversarial examples with small $\ell_{p}$ distances to the original data that mislead these DNN models. Besides adversarial attacks, recent studies have devoted efforts to characterizing model performance under out-of-distribution (OOD) shifts [@hendrycks2019benchmarking; @bulusu2020anomalous], where natural corruptions significantly degrade the accuracy of SOTA ML models. Thus, it has become imperative to study how ML models can be made robust to test data coming from different distributions when the models are deployed in the real world.

**Certified Robustness and Defenses.** After the discovery of adversarial examples in DNN models [@szegedy2013intriguing], many defenses have been proposed to mitigate this vulnerability [@athalye2018obfuscated]. However, many of the proposed countermeasures have been shown to rely on gradient obfuscation, which prevents attackers from accessing accurate gradients. Such defenses are vulnerable to adaptive attacks and give a false sense of security [@athalye2018obfuscated]. Certified defenses are thus highly desirable. Along with a prediction on the test point, these defenses output a certified radius $r$ such that for any $||{\bm{\delta}}||_2<r$, the model continues to have the same prediction. Such techniques include convex polytope [@wong2018provable], recursive propagation [@gowal2018effectiveness], and linear relaxation [@raghunathan2018certified; @zhang2018efficient] based methods. These methods provide a lower bound on the perturbation required to change the model's prediction on a target point. However, they can only be applied to relatively shallow models, which limits their practicality. Recently, randomized smoothing (RS)-based certified defenses [@li2018second; @lecuyer2019certified; @cohen2019certified; @mohapatra2020higher] have been proposed that produce better lower bounds and are scalable to large networks. In this paper, we study the OOD robustness of such certified defenses. Unlike a recent work [@mehra2021robust], which uses data poisoning attacks to hurt the robustness guarantees of RS-based models, our work demonstrates the failure of these models on test-time corruptions, which a model deployed in the real world may well encounter.

**Robustness against Common Corruptions: Benchmarks and Defenses.** Pioneering studies have identified vulnerabilities of deep learning models to common corruptions. The authors of [@dodge2016understanding] find that standard-trained DNNs are vulnerable to blur and Gaussian noise. The CIFAR-10/100-C and ImageNet-C datasets [@hendrycks2019benchmarking], consisting of fifteen different common corruptions with five severity levels, were introduced to facilitate robustness evaluations of CIFAR [@krizhevsky2009learning] and ImageNet [@deng2009imagenet] models. Recently, CIFAR-10/100-$\bar{\text{C}}$ and ImageNet-$\bar{\text{C}}$ [@mintun2021interaction] were proposed to provide new corruptions. There are two popular lines of work on improving robustness against common corruptions: test-time adaptation [@schneider2020improving] and data augmentation [@cubuk2019autoaugment; @hendrycks2019augmix]. The authors in [@saenko2010adapting] propose a method to update the batch normalization (BN) statistics to improve domain adaptation. Another recent method, TENT [@wang2021tent], updates both the affine transformation parameters and the statistics of BN using self-entropy minimization. On the other hand, methods such as AutoAugment [@cubuk2019autoaugment] leverage reinforcement learning to learn an augmentation policy that produces a diverse set of augmentations to help make models robust to OOD data. Another popular method, AugMix [@hendrycks2019augmix], achieves impressive performance improvements on corrupted data by mixing images obtained from chains of randomly sampled operations, along with a Jensen--Shannon based consistency loss during training. Unlike existing data augmentation schemes, which aim to improve the empirical robust accuracy of the models, the data augmentation schemes of interest in this paper aim to improve adversarial robustness guarantees on OOD data.

**Certified Semantic Robustness.** Recent works [@mohapatra2020towards; @fischer2020certified; @li2021tss] have also focused on developing techniques that provide performance guarantees against seen (or known) common corruption types (such as rotation or brightness changes). However, in this work, we are interested in more realistic scenarios with unseen (or unknown) test-time corruptions. It is worth noting that the susceptibility analysis and defense techniques developed in this work can be extended to SOTA semantic robustness techniques.

Are Certifiably Robust Models Ready for Deployment in the Wild? {#sec:rs_on_c}

image{width="\linewidth"}

Predictions of certifiably robust ML models are guaranteed to stay constant in a neighborhood of a test point, making them provably resilient to adversaries at the test time. This feature of certified defenses makes them an attractive candidate for real world safety-critical applications. However, progress in this area has been assessed by evaluating these models in idealistic scenarios (the in-distribution setup), which is not representative of real world data distributions. To better understand the performance of certified defenses in the real world, in this section, we evaluate SOTA certified defenses against OOD shifts.

Background on SOTA Certified Defenses {#sec:metrics}

We focus our attention on the SOTA certification technique based on randomized smoothing (RS) which is efficient and scalable. Let us consider a base classifier $\mathcal{M}$ trained on samples ${\bm{x}}\in \mathcal{X} \subset {\mathbb{R}}^{d\times d\times 3}$ and their corresponding labels $y \in \mathcal{Y} \subset {\mathbb{R}}^{+}$, obtained from an underlying data distribution $\mathcal{D}$.

**Certification.** The RS-based certification uses a base classifier $\mathcal{M}$ and provides certified robustness guarantees for its smoothed version, defined as $\hat{\mathcal{M}}({\bm{x}}) = \mathop{\mathrm{arg\,max}}_{c \in \mathcal{Y}}{\mathbb{P}}(\mathcal{M}({\bm{x}}+{\bm{\delta}})=c)$ where ${\bm{\delta}}\sim \mathcal{N}(0,\sigma^{2}\mathbf{I})$. Intuitively, $\hat{\mathcal{M}}$ returns the class most likely to be predicted by $\mathcal{M}$ on Gaussian perturbations of the input ${\bm{x}}$. The certification guarantees that the predictions of the smoothed classifier $\hat{\mathcal{M}}$ remain constant within the $\ell_2$ radius [@cohen2019certified] $\text{CR}(\hat{\mathcal{M}},\sigma,{\bm{x}};y) = \frac{\sigma}{2}(\Phi^{-1}(p_A)-\Phi^{-1}(p_B))$, where $\Phi^{-1}$ is the inverse CDF of the standard Gaussian distribution, $p_A={\mathbb{P}}(\mathcal{M}({\bm{x}}+{\bm{\delta}})=c_{A})$ is the probability of the top class ($c_A$), and $p_B=\max_{c \neq c_{A}} {\mathbb{P}}(\mathcal{M}({\bm{x}}+{\bm{\delta}})=c)$ is the probability of the runner-up class. Monte Carlo sampling [@hammersley2013monte] is used to estimate a lower bound $\underline{p_A} \leq p_A$ and an upper bound $\overline{p_B} = 1-\underline{p_A} \geq p_B$. The certified radius can still be computed using the same formula by replacing $p_A$ and $p_B$ with $\underline{p_A}$ and $\overline{p_B}$.
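To make this procedure concrete, the following is a minimal Python sketch of the Monte Carlo certification described above, in the spirit of the CERTIFY routine of [@cohen2019certified]; the `base_classifier` callable, the sample sizes, and the helper `_noise_counts` are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np
from scipy.stats import norm, binomtest

def certify(base_classifier, x, sigma, n0=100, n=10_000, alpha=0.001, num_classes=10):
    """Monte Carlo certification of the smoothed classifier at input x
    (sketch in the spirit of CERTIFY; base_classifier returns a class index)."""
    # Stage 1: guess the top class c_A from a small number of noisy samples.
    counts0 = _noise_counts(base_classifier, x, sigma, n0, num_classes)
    c_A = int(counts0.argmax())
    # Stage 2: lower-bound p_A with a one-sided Clopper-Pearson interval.
    counts = _noise_counts(base_classifier, x, sigma, n, num_classes)
    p_A_lower = binomtest(int(counts[c_A]), n).proportion_ci(
        confidence_level=1 - 2 * alpha, method="exact").low
    if p_A_lower < 0.5:
        return None, 0.0                       # abstain: cannot certify
    # With p_B upper-bounded by 1 - p_A_lower, the radius in the text reduces to:
    radius = sigma * norm.ppf(p_A_lower)
    return c_A, radius

def _noise_counts(base_classifier, x, sigma, num, num_classes):
    """Count the classes predicted by the base classifier under Gaussian noise."""
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(num):
        noisy = x + sigma * np.random.randn(*x.shape)
        counts[int(base_classifier(noisy))] += 1
    return counts
```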

**Improved Training.** It has been observed empirically [@cohen2019certified] that models trained using the standard training procedure do not provide reasonable certified robustness. Therefore, there is increasing interest in developing improved training techniques to maximize certified robustness. Several works [@li2020sok] have made significant advances in training techniques and reported impressive gains in certified radius on in-distribution test data. Specifically, new training methods such as Gaussian augmentation [@cohen2019certified], SmoothAdv [@salman2019provably], and MACER [@Zhai2020MACER] have been proposed. Intuitively, Cohen et al. [@cohen2019certified] propose to leverage Gaussian augmentation with variance $\sigma^2$ to train the base classifier. SmoothAdv [@salman2019provably] and MACER [@Zhai2020MACER] both use Gaussian augmentation and further improve upon this baseline via adversarial training and via an auxiliary objective that maximizes the certified radius, respectively. However, the effect of OOD data on the robustness guarantees of these models has been unexplored in the literature.

**Evaluation Metrics.** Similar to previous works [@Zhai2020MACER; @mehra2021robust; @salman2019provably], we use the average certified radius (ACR) as our metric to evaluate the robustness of the models on in-distribution (or clean) test data. Specifically, $\text{ACR}:=\frac{1}{|\mathcal{D}_{test}|}\sum_{({\bm{x}},y)\in \mathcal{D}_{test}} \text{CR}(\hat{\mathcal{M}},\sigma,{\bm{x}};y)\times\mathbf{1}_{\{\hat{\mathcal{M}}({\bm{x}},\sigma)=y\}}$, which is also equivalent to the area under the certified radius--accuracy curve. We assign $\text{CR}(\cdot)=0$ to incorrect predictions of $\hat{\mathcal{M}}$. For OOD performance, we measure the mean ACR (mACR) as an overall metric, $\text{mACR} := \frac{1}{c}\sum_{i=1}^{c}{\text{ACR}_{i}}$, where $c$ is the number of corruptions leveraged in a specific test set. For example, $c=15$ and $10$ for the CIFAR-10/100-C and -$\overline{\text{C}}$ datasets, respectively. We also report the ACR for each corruption type. Unlike previous studies on empirical defenses [@cohen2019certified; @salman2019provably; @Zhai2020MACER], we do not use empirical clean and robust accuracy as metrics in this work since we focus on certified robustness.
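Given per-sample certification outputs, the ACR and mACR defined above could be computed as in the short sketch below; `certify_fn` stands for a routine like the hypothetical `certify` sketched earlier.

```python
import numpy as np

def average_certified_radius(certify_fn, test_set):
    """ACR over a test set: the certified radius counts as 0 whenever the
    smoothed classifier abstains or predicts the wrong class."""
    radii = []
    for x, y in test_set:
        pred, radius = certify_fn(x)
        radii.append(radius if pred == y else 0.0)
    return float(np.mean(radii))

def mean_acr(certify_fn, corruption_test_sets):
    """mACR: the ACR averaged over a collection of corruption-specific test sets."""
    return float(np.mean([average_certified_radius(certify_fn, ts)
                          for ts in corruption_test_sets]))
```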

Analyzing Certified Defenses in OOD Settings

Although tangible improvements have been made in certifying robustness on in-distribution data, real-world test data oftentimes do not follow the training data distribution $\mathcal{D}$. Therefore, evaluating the performance of $\mathcal{M}$ on out-of-distribution (OOD) data $\{(\hat{{\bm{x}}},y)_1,...,(\hat{{\bm{x}}},y)_n\} \sim \hat{\mathcal{D}}$ becomes a major concern. We consider the impact of OOD data on models trained using SOTA robust training methods [@cohen2019certified; @salman2019provably; @Zhai2020MACER] and certified using RS-based defenses.

Degradation of Certified Robustness Guarantees on Common Corruptions

To measure the performance of certified defenses on OOD data, we use the popular common corruptions dataset CIFAR-10-C [@hendrycks2019benchmarking]. CIFAR-10-C contains 15 different corruptions from four categories (with 5 severity levels): noise, blur, weather, and digital corruptions. We re-arrange the corruption dataset into three groups and evaluate the ACR by increasing the severity level of the corruptions. Grouping is performed based on the visual similarity of the amplitude spectrum of corrupted images (see Appendix 8{reference-type="ref" reference="psd_cifar-10-c"}). Group-H corruptions (roughly categorized as high-frequency corruption type) consist of {Gaussian noise, impulse noise, shot noise, pixelate, JPEG}; Group-M corruptions (roughly categorized as mid-frequency corruption type) consist of {defocus blur, frosted glass blur, motion blur, zoom blur, elastic}; and Group-L corruptions (roughly categorized as low-frequency corruption type) consist of {brightness, fog, frost, snow, contrast}.

The performance of SOTA certified defenses on these groups of corruptions is presented in Figure [fig:motivation]{reference-type="ref" reference="fig:motivation"}. SmoothAdv and MACER both achieve tangible improvements in ACR on in-distribution CIFAR-10 data compared to the Gaussian augmentation baseline. However, all methods show a sharp drop in ACR as we move from Group-H (high-frequency) to Group-L (low-frequency). These methods are surprisingly brittle in the low-frequency corruption regime, with up to a 54% drop in ACR when moving from severity 0 (in-distribution) to severity 5. We emphasize that this performance drop points to a methodological shortcoming and is not due to the corruptions in Group-L being too difficult, since the empirical robust accuracy (Figure [fig:motivation2]{reference-type="ref" reference="fig:motivation2"} in Appendix 9{reference-type="ref" reference="app:robustbench"}) remains consistently high across all groups and severity levels for empirically robust models [@hendrycks2019augmix; @kireev2021effectiveness; @rebuffi2021fixing]. Even though the performance of any ML model is expected to suffer on test data that lies far away from the data used during training, the drastic performance degradation of RS-based certifiably robust models on low-frequency corruptions is particularly concerning.

Validating the Brittleness of Smoothed Models Through a Spectral Lens {#sec:2.2.2}

To highlight that the vulnerability to low-frequency corruptions is a limitation of provably robust ML models, in this section, we perform a more systematic analysis that corroborates that our finding is not limited to a specific benchmark and holds more broadly. To achieve this, we perform a spectral domain analysis of SOTA smoothed models by utilizing the Fourier sensitivity analysis [@yin2019fourier], which we briefly summarize next.

A Fourier basis image in the pixel space is a real-valued matrix ${\bm{U}}_{i,j} \in {\mathbb{R}}^{d\times d}$ with $\|{\bm{U}}_{i,j}\|_2 = 1$, such that $\text{FFT}({\bm{U}}_{i,j})$ has only two non-zero elements, at $(i,j)$ and $(-i,-j)$, in the coordinate system that views the image center as the origin. Given a test set and a smoothed model, we evaluate CR($\cdot$) of $\widetilde{{\bm{x}}}_{i,j} = {\bm{x}}+ r\epsilon {\bm{U}}_{i,j}$ for each ${\bm{x}}$ in the test set and compute the ACR, where $r$ is randomly sampled from $\{-1,1\}$, $\epsilon$ is the $\ell_2$ norm of the perturbation, and we treat the RGB channels independently. Each evaluated ACR corresponds to a data point in the heat map located at $(i,j)$. Figure [fig:fourier_basis1]{reference-type="ref" reference="fig:fourier_basis1"} shows the heatmaps of models trained with Gaussian augmentation [@cohen2019certified], SmoothAdv [@salman2019provably], and MACER [@Zhai2020MACER] using $\epsilon=4$ [@yin2019fourier]. The center and edges of the heatmap contain evaluations on the lowest and highest frequency perturbations, respectively. The results in Figure [fig:fourier_basis1]{reference-type="ref" reference="fig:fourier_basis1"} show that the certifiably robust classifiers achieve a small ACR on OOD data belonging to the low-frequency region (around the center of the heatmap) whereas they achieve a high ACR in the high-frequency region (near the edges). In particular, the ACRs are always less than 0.3 for all three methods in the mid-to-low frequency range, while the models perform well in the high-frequency regime. We emphasize that the Fourier sensitivity analysis in Figure [fig:fourier_basis1]{reference-type="ref" reference="fig:fourier_basis1"} is general and is not specific to the corruptions appearing in CIFAR-10-C. Based on our analysis, we find that certifiably robust models are biased towards high-frequency noise and perform surprisingly poorly on low-frequency OOD data.
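For completeness, a small sketch of how a single point of such a heatmap could be produced is shown below; it follows the Fourier basis construction from [@yin2019fourier], assumes images stored as arrays of shape (d, d, 3) with even d, and all function names are illustrative.

```python
import numpy as np

def fourier_basis_image(d, i, j):
    """Real d x d matrix U_{i,j} with unit l2 norm whose FFT is non-zero only
    at (i, j) and (-i, -j), with the image center taken as the origin."""
    freq = np.zeros((d, d), dtype=complex)
    ci, cj = d // 2 + i, d // 2 + j           # (i, j) relative to the center
    freq[ci % d, cj % d] = 1.0
    freq[(-ci) % d, (-cj) % d] = 1.0          # Hermitian pair -> real image
    basis = np.real(np.fft.ifft2(np.fft.ifftshift(freq)))
    return basis / np.linalg.norm(basis)

def fourier_perturbed_input(x, i, j, eps=4.0):
    """x + r * eps * U_{i,j} with a random sign r per RGB channel."""
    d = x.shape[0]
    u = fourier_basis_image(d, i, j)
    out = x.astype(float).copy()
    for c in range(x.shape[2]):               # RGB channels treated independently
        r = np.random.choice([-1.0, 1.0])
        out[:, :, c] += r * eps * u
    return out

# The (i, j) entry of the heatmap is the ACR of the smoothed model evaluated
# on {fourier_perturbed_input(x, i, j) for x in the test set}.
```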

image{width="\linewidth"}

Following this insight, we develop a data augmentation method capable of producing spectrally diverse augmentations to make certifiably robust models perform well on OOD data across the entire frequency spectrum in § 4{reference-type="ref" reference="sec:fouriermix"}.

Algorithm [alg:fmix] (FourierMix, mixing step): initialize ${\bm{x}}_\text{aug} = 0$; sample mixing weights $(w_1, ..., w_k) \sim \text{Dirichlet}(\alpha,...,\alpha)$; sample a skip-connection weight $m \sim \text{Beta}(\alpha,\alpha)$; and return ${\bm{x}}_\text{aug} = m\,{\bm{x}}_\text{orig} + (1-m)\,{\bm{x}}_\text{aug}$.

FourierMix: Data Augmentation with Broad Spectral Coverage {#sec:fouriermix}

To improve the certified robustness of RS-based methods on OOD data, it is intuitively desirable to make the base classifier $\mathcal{M}$ robust against different types of corruptions and their Gaussian perturbations. Motivated by our Fourier sensitivity analysis (§ 3{reference-type="ref" reference="sec:rs_on_c"}), we propose a novel data augmentation method, FourierMix, to boost the certified robustness performance on OOD data[^3]. To improve the spectral coverage, we introduce Fourier-based operations that manipulate the image in the frequency domain. We also leverage randomly sampled affine transformations to enrich the augmentations in FourierMix. We adopt the high-level framework of AugMix [@hendrycks2019augmix] for chaining and mixing different augmented images. Figure [fig:pipeline]{reference-type="ref" reference="fig:pipeline"} shows the overall pipeline and Algorithm [alg:fmix]{reference-type="ref" reference="alg:fmix"} presents the pseudocode of FourierMix.

**Fourier Operations.** Two-dimensional images can be converted into the frequency domain by applying the Fourier transform and vice versa. The Fourier transform has the duality property, which provides a unique but equivalent perspective for image analysis. We use the fast Fourier transform (FFT) and inverse FFT (IFFT) for the transformation between the pixel and frequency domains. $\text{FFT}({\bm{x}})$ is complex in general, $\text{FFT}({\bm{x}})= \text{FFT}_{\text{real}}({\bm{x}})+i\,\text{FFT}_{\text{imag}}({\bm{x}})$, with ${\bm{A}}= |\text{FFT}({\bm{x}})|$ as its amplitude and ${\bm{P}}= \arctan(\text{FFT}_{\text{imag}}({\bm{x}})/\text{FFT}_{\text{real}}({\bm{x}}))$ as its phase. The amplitude spectrum of natural images generally follows a power-law distribution, $\frac{1}{f^{\alpha}}$, where $f$ is the azimuthal frequency and $\alpha \approx 2$ [@burton1987color; @tolhurst1992amplitude], resulting in extremely small power in the high-frequency areas. However, the amplitude spectrum of i.i.d. Gaussian noise is uniform, so Gaussian augmentation biases the models towards the high-frequency regime relative to the original images. In order to have broad and unbiased spectral coverage, the core of FourierMix is to allocate similar proportions of augmentations across all frequencies. We use two spectral perturbation methods in FourierMix to achieve this goal: $$\small \mathbf{A}(u,v) = {\bm{A}}_{u,v}^\text{orig} \cdot \text{U}(1-s_{\mathbf{A}},1+s_{\mathbf{A}}) \label{eq:fouriermix1}$$ $$\small \mathbf{P}(u,v) = {\bm{P}}_{u,v}^\text{orig}+ \mathcal{N}_\text{truncated}^{s_{\mathbf{P}}}(0,\sigma^2\mathbf{I}) \label{eq:fouriermix2}$$ where $(u,v)$ is the coordinate of the 2D frequency in the spectrum, and $s_{\mathbf{A}}$ and $s_{\mathbf{P}}$ control the severity levels of the two perturbations. Formally, the PDF of $\mathcal{N}_\text{truncated}^{s_{\mathbf{P}}}$ is $\frac{\phi(x/\sigma)}{\sigma \cdot (2\Phi(s_{\mathbf{P}}/\sigma)-1)}$, where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the PDF and CDF of the standard normal distribution, respectively. On one hand, we apply multiplicative factors sampled from a uniform distribution $\text{U}(\cdot)$ to all frequencies in the amplitude spectrum. Therefore, $\mathbf{A}(u,v)$ ensures that the proportions of augmentation are similar across all frequencies relative to the original spectrum. On the other hand, since the magnitude of a phase spectrum is not correlated with the 2D frequency [@lim1990two], additive noise is able to assign similar proportions of augmentation across 2D frequencies. As it is widely acknowledged that the phase component retains most of the high-level semantics [@xu2021fourier; @yang2020phase; @kermisch1970image], we leverage additive truncated Gaussian noise to constrain $\mathbf{P}(u,v)$ so that it does not destroy the semantics of the training images. Some sample images generated using FourierMix are provided in Appendix 11{reference-type="ref" reference="app:fmix-images"}.
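A minimal sketch of the two spectral operations above and the AugMix-style mixing step of Algorithm [alg:fmix] is given below, assuming images with values in [0, 1]; the severity parameters, chain depth, and `affine_ops` placeholder are illustrative choices, not the exact configuration used in our experiments.

```python
import random
import numpy as np
from scipy.stats import truncnorm

def fourier_perturb(x, s_A=0.3, s_P=0.3, sigma_P=0.2):
    """Multiplicative uniform perturbation of the amplitude spectrum and
    additive truncated-Gaussian perturbation of the phase, per channel."""
    out = np.empty_like(x, dtype=float)
    for c in range(x.shape[2]):
        spec = np.fft.fft2(x[:, :, c])
        amp, phase = np.abs(spec), np.angle(spec)
        amp = amp * np.random.uniform(1 - s_A, 1 + s_A, size=amp.shape)
        phase = phase + truncnorm.rvs(-s_P / sigma_P, s_P / sigma_P,
                                      scale=sigma_P, size=phase.shape)
        out[:, :, c] = np.real(np.fft.ifft2(amp * np.exp(1j * phase)))
    return np.clip(out, 0.0, 1.0)

def fouriermix(x, k=3, depth=2, alpha=1.0, affine_ops=()):
    """AugMix-style chaining and mixing with Fourier (and optional affine) ops."""
    w = np.random.dirichlet([alpha] * k)         # mixing weights for the k chains
    m = np.random.beta(alpha, alpha)             # weight of the skip connection
    ops = [fourier_perturb, *affine_ops]
    x_aug = np.zeros_like(x, dtype=float)
    for i in range(k):
        xi = x.astype(float)
        for _ in range(depth):                   # chain a few randomly sampled ops
            xi = random.choice(ops)(xi)
        x_aug += w[i] * xi
    return m * x + (1 - m) * x_aug
```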

**Hierarchical Consistency Regularization (HCR).** Motivated by [@jeong2020consistency], which enforces consistency on in-distribution data, we propose hierarchical consistency regularization (HCR) to further boost the performance of FourierMix in terms of the ACR on OOD test sets: $$\small \mathcal{L}_{G} = \frac{1}{s}\sum_{i=1}^s \text{KL}\big(\mathcal{M}({\bm{x}}_j+{\bm{\delta}}_i)\,\|\, \overline{\mathcal{M}}({\bm{x}}_j,{\bm{\delta}})\big)$$ $$\small \mathcal{L}_{HCR} = \frac{1}{k+1}\sum_{j=0}^k \Big[\lambda \cdot \text{KL}\big(\overline{\mathcal{M}}({\bm{x}}_j,{\bm{\delta}})\,\|\, \overline{\mathcal{M}}({\bm{x}},{\bm{\delta}})\big) + \eta \cdot \mathcal{L}_{G} \Big] \label{eq:hcr}$$ where $\overline{\mathcal{M}}({\bm{x}},{\bm{\delta}})=\mathbb{E}_{j\in\{0,1,...,k\}}[\overline{\mathcal{M}}({\bm{x}}_j,{\bm{\delta}})]$, $\overline{\mathcal{M}}({\bm{x}}_j, {\bm{\delta}}) = \mathbb{E}_{i\in\{1,2,...,s\}}[\mathcal{M}({\bm{x}}_j+{\bm{\delta}}_i)]$, ${\bm{x}}_0$ is the original training image, and $\text{KL}(\cdot\|\cdot)$ denotes the Kullback--Leibler divergence (KLD) [@Joyce2011]. We use $k=2$ FourierMix augmentations and $s=2$ Gaussian perturbations ${\bm{\delta}}_i \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$. Since the Jensen--Shannon divergence (JSD) [@fuglede2004jensen] uses the KLD to calculate a normalized, symmetric score, HCR essentially stacks two levels of JSD while training the base classifier to enforce consistent representations over both kinds of augmentations. The first level of consistency, $\mathcal{L}_G$, is applied to the Gaussian augmentations, encouraging the Gaussian-perturbed neighbors of ${\bm{x}}_{0,1,2}$ to have similar outputs, and the second level of consistency is applied to the whole set of $(k+1)s$ samples to enforce consistent outputs on FourierMix-augmented images. We use $\lambda$ and $\eta$ as hyper-parameters to tune the weights of the two levels of consistency. The overall training loss is: $\mathcal{L} = \frac{1}{s}\sum_{i=1}^s\mathcal{L}({\bm{x}}_0+{\bm{\delta}}_{i},y) + \mathcal{L}_{HCR}$.
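The sketch below shows one way the HCR objective of Equation [eq:hcr] could be implemented in PyTorch; it assumes `model` returns logits and `x_views` holds the original image plus its $k$ FourierMix augmentations, and the default `lam`/`eta` values are only placeholders (on CIFAR-10 we use $\lambda=40$ and $\eta=10$; see the appendix).

```python
import torch
import torch.nn.functional as F

def hcr_loss(model, x_views, sigma, s=2, lam=40.0, eta=10.0, eps=1e-12):
    """Hierarchical consistency regularizer (sketch of Eq. (hcr)).
    x_views: list of k+1 tensors [x_0 (original), x_1, ..., x_k (FourierMix)],
    each of shape (B, C, H, W)."""
    view_means, gaussian_terms = [], []
    for x_j in x_views:
        # s Gaussian-perturbed copies of this view.
        probs = [F.softmax(model(x_j + sigma * torch.randn_like(x_j)), dim=1)
                 for _ in range(s)]
        mean_j = torch.stack(probs).mean(dim=0)            # \bar{M}(x_j, delta)
        view_means.append(mean_j)
        # First level: KL(noisy copy || per-view mean), averaged over the s copies.
        gaussian_terms.append(torch.stack([
            F.kl_div(mean_j.clamp_min(eps).log(), p, reduction="batchmean")
            for p in probs]).mean())
    overall_mean = torch.stack(view_means).mean(dim=0)      # \bar{M}(x, delta)
    loss = x_views[0].new_zeros(())
    for mean_j, l_g in zip(view_means, gaussian_terms):
        # Second level: KL(per-view mean || overall mean), plus the Gaussian term.
        loss = loss + lam * F.kl_div(overall_mean.clamp_min(eps).log(), mean_j,
                                     reduction="batchmean") + eta * l_g
    return loss / len(x_views)
```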

**Comparison with AugMix.** The key difference between FourierMix and AugMix is the set of base augmentation operations used in the pipeline. AugMix leverages operations from AutoAugment [@cubuk2019autoaugment] that do not overlap with ImageNet-C. In contrast, FourierMix utilizes a simpler set of generic spectral and affine augmentations. We compare the performance (ACR) of FourierMix with AugMix on multiple OOD datasets in our evaluation (§ 5{reference-type="ref" reference="sec:experiment"} and § 6{reference-type="ref" reference="sec:cifarf"}).

Experiments on Popular OOD Benchmarks {#sec:experiment}

**Experimental Setup.** As introduced in § 3.1{reference-type="ref" reference="sec:metrics"}, we use ACR and mACR as the main evaluation metrics. We use the official implementation from [@cohen2019certified] to compute the certified radius $\text{CR}(\cdot)$. We use the same base architectures as prior works [@cohen2019certified; @Zhai2020MACER; @salman2019provably; @jeong2020consistency]: ResNet-110 and ResNet-50 [@he2016deep] as the backbones for experiments on CIFAR-10/100 and ImageNet, respectively. We use Gaussian augmentation with $\sigma=0.25$ and $0.5$ for both training and certifying the CIFAR-10/100 and ImageNet models, respectively. Further details on training are provided in Appendix 12{reference-type="ref" reference="app:training-details"}.

**Baselines.** We evaluate the certified robustness of models trained with the following augmentation schemes on OOD data: Gaussian [@cohen2019certified], AutoAugment [@cubuk2019autoaugment], and AugMix [@hendrycks2019augmix]. We also compare HCR with the baseline JSD regularization [@jeong2020consistency]. We follow [@cohen2019certified] and [@jeong2020consistency] to train the Gaussian and Gaussian+JSD baseline models, respectively. For the other augmentation methods, we apply Gaussian noise $\mathcal{N}(0,\sigma^2\mathbf{I})$ to half of the training samples in each mini-batch to ensure good certification performance with RS, and we follow [@hendrycks2019augmix] in applying JSD to these augmentation methods.

**Datasets.** For the in-distribution evaluation, we use the CIFAR-10/100 [@krizhevsky2009learning] and ImageNet [@deng2009imagenet] datasets. CIFAR-10/100 consist of small $32 \times 32$ images belonging to 10/100 classes, and ImageNet consists of 1.2 million images from 1,000 classes. We crop ImageNet images to $224\times224\times3$ pixels. For OOD data, we use the common corruptions datasets [@hendrycks2019benchmarking] (CIFAR-10/100-C, ImageNet-C) and the recently proposed datasets [@mintun2021interaction] (CIFAR-10/100-$\overline{\text{C}}$, ImageNet-$\overline{\text{C}}$), which contain human-interpretable and perceptually different corruptions compared to those in CIFAR-C/ImageNet-C.

Results on CIFAR-Based OOD Benchmarks {#sec:cifar-10/100}

The results in Table [tb:total]{reference-type="ref" reference="tb:total"} show the overall mACR of the models trained on CIFAR-10 using different augmentation and regularization methods when evaluated on CIFAR-10-C and CIFAR-10-$\bar{\text{C}}$. The results show that FourierMix consistently achieves the highest mACR across different corruption types. FourierMix+HCR significantly improves upon the baseline of Gaussian augmented training, by 26.7% and 33.4% in terms of overall mACR on CIFAR-10-C and CIFAR-10-$\bar{\text{C}}$, respectively, and also improves upon the stronger baseline, AugMix+HCR, by 5.3% and 6.6% on the two datasets. We find consistency regularization to be helpful for certified robustness on OOD data; in particular, on mid- and high-frequency corruptions, adding JSD to Gaussian augmentation significantly improves OOD robustness. Combining HCR with FourierMix achieves SOTA ACRs on all corruption types, providing significant gains even on low-frequency corruptions. This success is attributed to the spectrally diverse corruptions produced by FourierMix. Interestingly, we find that AutoAugment overfits to the corruptions in CIFAR-10-C, since it suffers a major performance degradation on the corruptions in CIFAR-10-$\bar{\text{C}}$. We believe the large overlap between the leveraged augmentations and the corruptions in CIFAR-10-C, together with limited spectral diversity, is the primary reason for this degradation of AutoAugment. Detailed results for each corruption type in CIFAR-10-C/$\bar{\text{C}}$ are shown in Tables [tb:cifar10c]{reference-type="ref" reference="tb:cifar10c"} and [tb:cifar10c-bar]{reference-type="ref" reference="tb:cifar10c-bar"} in Appendix 10.1{reference-type="ref" reference="app:detailed_cifar10"}.

Next, we present the mACR (Table [tb:total]{reference-type="ref" reference="tb:total"}) of the models trained on CIFAR-100 when evaluated on OOD data (CIFAR-100-C and CIFAR-100-$\bar{\text{C}}$). Similar to the models trained on CIFAR-10, FourierMix achieves the highest overall mACR among all augmentation methods on both OOD datasets. Specifically, FourierMix+HCR outperforms the Gaussian baseline by 54.4% and 74.6% on the two datasets, respectively. Compared to AugMix+HCR, FourierMix+HCR improves performance by 4.8% and 7.6% on the two datasets, respectively. Detailed results for each corruption type in CIFAR-100-C/$\bar{\text{C}}$ are shown in Tables [tb:cifar100c]{reference-type="ref" reference="tb:cifar100c"} and [tb:cifar100c-bar]{reference-type="ref" reference="tb:cifar100c-bar"} in Appendix 10.2{reference-type="ref" reference="app:detailed_cifar100"}.

To further corroborate our findings on OOD benchmarks, we carry out the Fourier sensitivity analysis of models trained on CIFAR-10/100 in Figure [fig:fourier_basis_cifar10]{reference-type="ref" reference="fig:fourier_basis_cifar10"}. Adding a consistency loss to Gaussian augmentation (Gaussian+JSD) improves the ACR in the high-frequency region, but in the low-to-mid frequency regions it remains worse than the ACR achieved by combining a consistency loss (JSD or HCR) with FourierMix augmentations. Consistent with our quantitative results, AutoAugment does not improve much over the Gaussian augmentation baseline, which suggests that models trained with AutoAugment may be biased towards high-frequency regions.

Results on ImageNet-Based OOD Benchmarks {#sec:exp-imagenet}

Table [tb:total]{reference-type="ref" reference="tb:total"} presents the mACR of the models trained on ImageNet when evaluated on ImageNet-C and ImageNet-$\bar{\text{C}}$. Due to the poor performance of some of the methods on CIFAR-10/100, we chose not to pursue them in the ImageNet-scale experiments (denoted by "-" in Table [tb:total]{reference-type="ref" reference="tb:total"}). We observe that OOD shifts lead to a drastic decline in certified robustness on ImageNet: the drop between the ACR on clean data and the mACR on OOD data is $\sim$57%, compared to $\sim$30% on CIFAR-10/100. Encouragingly, FourierMix continues to achieve the highest mACR among the baselines, outperforming Gaussian augmented training and AugMix+JSD by 55.9% and 2.1% in terms of overall mACR, respectively. FourierMix also achieves consistently good performance across the spectrum, whereas Gaussian+JSD and AugMix+JSD are biased towards high-frequency and low-frequency corruptions, respectively. Although HCR does not make a significant difference over JSD regularization on ImageNet, substantial improvements can still be gained by FourierMix (over the other baseline augmentations) due to its broad spectral coverage. Detailed results for ImageNet-C/$\bar{\text{C}}$ can be found in Tables [tb:imagenet-c]{reference-type="ref" reference="tb:imagenet-c"} and [tb:imagenet-c-bar]{reference-type="ref" reference="tb:imagenet-c-bar"} in Appendix 10.3{reference-type="ref" reference="app:detailed_imagenet"}.

**Overall Insights.** Our results in this section not only highlight the vulnerability of SOTA certified defenses to OOD data but also uncover spectral biases in the benchmark datasets used to measure OOD robustness. In particular, methods that perform well on one corrupted dataset may not work well on another due to differences in the spectral signatures of the corruptions. For example, AutoAugment performs well on CIFAR-10-C corruptions but is significantly worse on CIFAR-10-$\bar{\text{C}}$ and CIFAR-100-C/$\bar{\text{C}}$. Moreover, we find that such ranking discrepancies exist not only when measuring ACR but also when measuring empirical robust accuracy. Figure 3{reference-type="ref" reference="fig:robustbench"} in Appendix 9{reference-type="ref" reference="app:robustbench"} shows that no single SOTA model from the RobustBench leaderboard [@croce2020robustbench] performs best on all benchmark datasets or corruption types. This makes it incredibly important to obtain a comprehensive view of model robustness to avoid issues such as leaderboard bias [@mishra2021robust] and overfitting to a specific benchmark [@mintun2021interaction]. To enable researchers to achieve this objective, we next propose a new benchmark comprising a collection of spectrally diverse OOD datasets.

A Spectral OOD Benchmarking Suite {#sec:cifarf}

Here we discuss the creation and evaluation of models on the proposed OOD benchmarking suite. The goal of this new OOD suite is to complement (and not replace) the existing benchmark datasets and enable researchers to uncover and resolve spectral biases of their models.

Protocol for Dataset Generation

The proposed benchmark is a collection of datasets (CIFAR-10/100-F), each focusing on a specific frequency range while collectively covering the entire frequency spectrum. Different from the Fourier sensitivity analysis, which perturbs only a single frequency using a Fourier basis image, CIFAR-10/100-F leverages power law-based noise [@powerlaw] to generate complex perturbations in the spectral domain [@johnson1925schottky]. Note that the power spectra of several natural data distributions (e.g., natural images) follow a power-law distribution [@powerlaw]. Inspired by this, we model the amplitude perturbation as ${\bm{\delta}}_{\text{Fourier}}^{\mathbf{A}}(f) = \frac{P(f)}{(|f-f_c| + 1)^{\alpha}} \cdot \text{U}(1-b,1+b)$, where $P(f)$ approximates the tolerance to corruptions at azimuthal frequency $f=\sqrt{u^2+v^2}$, $f_c$ is the central frequency that the perturbation focuses on, and $\alpha$ denotes the exponent of the power-law distribution. We also use a uniform distribution $\text{U}(1-b,1+b)$ with $b$ as a hyper-parameter ($b=0.2$ in our study) to diversify the perturbations. We define $P(f) = \text{clip}({\bm{A}}_{{\bm{x}}}^{\text{clean}}(f), a_{\text{lower}}, a_{\text{upper}})$, which allocates the amount of perturbation based on the power associated with different frequencies in the clean image [@joubert2009rapid]; frequencies with higher power receive larger perturbations. We use the $\text{clip}(\cdot)$ function to bound the amount of corruption at each spatial frequency; the maximum and minimum values are chosen to ensure that the perturbations do not affect the semantic content of the images. The phase perturbation is formulated as ${\bm{\delta}}_{\text{Fourier}}^{\mathbf{P}}(f)=\text{U}(0,2\pi)$ to simulate real-world noise. Given each pair $({\bm{x}}^i,y^i)$ in the original validation set, we synthesize CIFAR-10/100-F images as $$\small {\bm{x}}_F^i = {\bm{x}}^i+ \gamma \cdot \text{IFFT}({\bm{\delta}}_{\text{Fourier}}),$$ where $\gamma = \frac{\epsilon}{||\text{IFFT}({\bm{\delta}}_{\text{Fourier}})||_2}$ normalizes the spreading effect of the power-law distribution and thus controls the severity level of CIFAR-10/100-F. We create CIFAR-10/100-F with 3 severity levels, $\epsilon \in \{8,10,12\}$. As the images in CIFAR-10/100 are of size $32\times32$, their FFT spectra have discrete azimuthal frequencies from 0 to 16. Since zero-frequency noise is a constant in the pixel space, we set the center frequency $f_c \in \{1,2,...,16\}$. We use $\alpha \in \{0.5,1,2,3\}$ because power-law noise with $0<\alpha\leq 3$ arises in both natural signals and man-made processes [@powerlaw]. In total, our CIFAR-10/100-F suite consists of $3\times4\times16=192$ test sets covering different regions of the frequency spectrum, thereby increasing the spectral coverage of the benchmark.
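To make the generation protocol concrete, the following sketch produces one CIFAR-10/100-F style corruption for a single image; the clip bounds `a_lower`/`a_upper` are illustrative placeholders (the exact bounds for the released datasets are chosen as described above), and the perturbation is applied per channel.

```python
import numpy as np

def cifar_f_perturb(x, f_c=4, alpha=1.0, eps=10.0, b=0.2,
                    a_lower=1.0, a_upper=50.0):
    """Power-law spectral corruption centered at azimuthal frequency f_c
    (sketch; x is an HxWxC image with values in [0, 1])."""
    d = x.shape[0]
    # Azimuthal frequency of every (u, v) bin, origin at the spectrum center.
    u = np.fft.fftshift(np.fft.fftfreq(d)) * d
    f = np.sqrt(u[:, None] ** 2 + u[None, :] ** 2)
    out = np.empty_like(x, dtype=float)
    for c in range(x.shape[2]):
        amp_clean = np.abs(np.fft.fftshift(np.fft.fft2(x[:, :, c])))
        P = np.clip(amp_clean, a_lower, a_upper)          # per-frequency tolerance
        delta_amp = (P / (np.abs(f - f_c) + 1.0) ** alpha
                     * np.random.uniform(1 - b, 1 + b, size=f.shape))
        delta_phase = np.random.uniform(0.0, 2.0 * np.pi, size=f.shape)
        delta = np.real(np.fft.ifft2(np.fft.ifftshift(
            delta_amp * np.exp(1j * delta_phase))))
        gamma = eps / (np.linalg.norm(delta) + 1e-12)     # severity normalization
        out[:, :, c] = x[:, :, c] + gamma * delta
    return np.clip(out, 0.0, 1.0)
```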

Sample images from CIFAR-10/100-F when $\epsilon=12$.{#fig:cifarf-illu width="48.5%"}

**Visual Effect of Varying $\alpha$ and $f_c$.** To better understand our proposed benchmark, we illustrate the effect of $\alpha$ and $f_c$ with some sample images. As shown in Figure 2{reference-type="ref" reference="fig:cifarf-illu"}, $\alpha$ controls the frequency dispersion of the corruption around $f_c$. With a smaller $\alpha$ (e.g., $\alpha=0.5$), the spreading effect of the power-law distribution is more significant, so the corrupted images contain noise across all azimuthal frequencies. In contrast, for a larger $\alpha$ (e.g., $\alpha=3$), the corruption is focused more narrowly on a single frequency. As evident from Figure 2{reference-type="ref" reference="fig:cifarf-illu"}, a higher $f_c$ leads to a higher corruption frequency. More images from CIFAR-10/100-F from different classes and severity levels are shown in Appendix 13{reference-type="ref" reference="app:sample_cifarf"}.

Results on CIFAR-10/100-F

Figure [fig:cifar-10-f]{reference-type="ref" reference="fig:cifar-10-f"} reports the performance of the models used in § 5.1{reference-type="ref" reference="sec:cifar-10/100"} on the CIFAR-10-F benchmark. Our results show that smoothed models based on both AutoAugment [@cubuk2019autoaugment] and AugMix [@hendrycks2019augmix] are relatively biased towards low-frequency corruptions. The effect of high-frequency corruptions is more pronounced on models trained with AutoAugment, which behave similarly to the simple Gaussian augmentation baseline (Figures [fig:cifar-10-f-2]{reference-type="ref" reference="fig:cifar-10-f-2"} and [fig:cifar-10-f-3]{reference-type="ref" reference="fig:cifar-10-f-3"}). The intersection of the AugMix+JSD and Gaussian+JSD curves in the mid-frequency region of CIFAR-10-F (Figure [fig:cifar-10-f]{reference-type="ref" reference="fig:cifar-10-f"}) illustrates the different spectral biases introduced by different augmentation methods. Unlike on CIFAR-10-F, we find that Gaussian and Gaussian+JSD perform relatively worse on CIFAR-100-F compared to the other augmentation methods. In comparison, models trained with FourierMix and HCR do not show significant spectral biases and serve as a strong baseline. Specifically, models trained with FourierMix+HCR outperform AugMix+HCR by 11.8% and 16.0% on average on CIFAR-10/100-F, respectively. We emphasize that models trained with FourierMix do not overfit to the CIFAR-10/100-F datasets, since the two have different formulations and even different visual patterns (see Appendices 11{reference-type="ref" reference="app:fmix-images"} and 13{reference-type="ref" reference="app:sample_cifarf"}). Moreover, FourierMix models provide consistently better performance on other OOD benchmarks as well, demonstrating their generality.

Our Recommendations

The performance differences observed above are enough to change the ordering of augmentation methods by corruption error. This inconsistent generalization suggests that it is important not to rely on a single benchmark to study robustness to unknown corruptions; we recommend evaluating models on a spectrally diverse collection of OOD datasets, such as the suite proposed here, in addition to existing benchmarks.

Discussion and Conclusion {#sec:discussion}

Our work showed that certified defenses are surprisingly brittle to OOD shifts such as low-frequency corruptions. To alleviate this issue, we proposed the FourierMix augmentation to increase the spectral coverage of the training data. We also presented a benchmarking suite to gain a comprehensive understanding of a model's OOD robustness. Some of our findings are consistent with past results: model evaluation in OOD settings is a challenging problem, and one should not rely on a single benchmark [@hendrycks2021many; @mintun2021interaction]. However, as opposed to existing works that focus on empirical robustness, we show that these issues persist and may even be more prominent for the problem of certified adversarial defense. Even though evaluation against all possible types of OOD data is infeasible, our results highlight that eliminating the spectral biases of the models improves certified robustness on OOD data.

Although we have taken some first steps to address this challenging problem, there are still many questions that remain to be answered. First, bridging the gap between robustness guarantees in high-frequency and low-frequency corruption regimes is still an open problem. A deeper theoretical understanding of this phenomenon will likely motivate systematic approaches to overcome this issue. Next, we encourage future research to pursue test-time adaptation ideas [@mueller2020certify; @diffenderfer2021winning] in the context of robustness certification. We expect that designing data-efficient and unsupervised adaptation methods that improve robustness guarantees under OOD shifts can be a worthwhile direction. Finally, the analysis done in this work can be explored in the context of certifying other $\ell_p$ norms [@yang2020randomized] and semantic transformations [@li2021tss]. It is also worth noting that we have only focused on incomplete verification techniques in this work, and we expect the OOD brittleness issue to be a general phenomenon affecting complete verification techniques as well [@li2020sok].

We hope that our work will motivate researchers to study these crucial issues at the intersection of certified adversarial defense and OOD robustness and help the community design ML methods that work reliably in the wild.

There may be a fundamental trade-off between robustness to high-frequency perturbations (needed for $\ell_2$ certification with randomized smoothing) and robustness to low-frequency corruptions (needed for natural OOD data): Gaussian augmentation or adversarial training tailored to $\ell_2$ certification may not be the best strategy, and new training and augmentation methods may be needed to counter this trade-off. Some related observations have been reported in the empirical robustness literature; we show that they hold for the average certified radius as well, and that the certified setting poses some unique challenges. Finally, we propose that leaderboards can benefit from including as diverse a collection of benchmark OOD datasets as possible.

Acknowledgements {#acknowledgements .unnumbered}

This work was performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344 and LLNL LDRD Program Project No. 20-ER-014.


| Augmentation | CIFAR-10 | mACR | -Low | -Mid | -High | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.461 | 0.363 | 0.301 | 0.353 | 0.435 | 0.448 | 0.448 | 0.421 | 0.380 | 0.346 | 0.338 | 0.357 | 0.394 | 0.347 | 0.187 | 0.439 | 0.137 | 0.342 | 0.420 | 0.440 |
| Gaussian +JSD | 0.535 | 0.439 | 0.346 | 0.451 | [0.520]{.ul} | 0.529 | [0.514]{.ul} | 0.528 | 0.471 | 0.445 | 0.443 | 0.453 | 0.449 | 0.378 | 0.235 | 0.485 | 0.185 | 0.444 | [0.506]{.ul} | [0.521]{.ul} |
| +AutoAugment | 0.411 | 0.372 | 0.312 | 0.364 | 0.431 | 0.451 | 0.452 | 0.419 | 0.411 | 0.356 | 0.342 | 0.360 | 0.403 | 0.354 | 0.201 | 0.446 | 0.158 | 0.352 | 0.429 | 0.445 |
| +AutoAugment +JSD | 0.432 | 0.400 | 0.343 | 0.395 | 0.464 | 0.473 | 0.476 | 0.443 | 0.423 | 0.385 | 0.394 | 0.390 | 0.427 | 0.403 | 0.212 | 0.483 | 0.189 | 0.382 | 0.453 | 0.473 |
| +AugMix | 0.452 | 0.385 | 0.324 | 0.383 | 0.449 | 0.459 | 0.460 | 0.436 | 0.412 | 0.369 | 0.372 | 0.391 | 0.413 | 0.374 | 0.216 | 0.457 | 0.159 | 0.371 | 0.439 | 0.453 |
| +AugMix +JSD | 0.518 | 0.430 | 0.357 | 0.436 | 0.496 | 0.504 | 0.507 | 0.481 | 0.461 | 0.426 | 0.429 | 0.441 | 0.452 | 0.408 | 0.240 | 0.501 | 0.185 | 0.425 | 0.485 | 0.502 |
| +AugMix +HCR | 0.520 | 0.437 | 0.369 | 0.444 | 0.497 | 0.505 | 0.506 | 0.484 | 0.464 | 0.438 | 0.435 | 0.447 | 0.460 | 0.426 | 0.252 | 0.505 | 0.200 | 0.437 | 0.487 | 0.501 |
| +FourierMix | 0.455 | 0.388 | 0.326 | 0.386 | 0.453 | 0.461 | 0.462 | 0.446 | 0.417 | 0.369 | 0.378 | 0.393 | 0.415 | 0.376 | 0.220 | 0.457 | 0.160 | 0.373 | 0.439 | 0.456 |
| +FourierMix +JSD | [0.522]{.ul} | [0.444]{.ul} | [0.375]{.ul} | [0.454]{.ul} | 0.504 | 0.512 | 0.513 | 0.491 | [0.474]{.ul} | [0.448]{.ul} | [0.446]{.ul} | [0.456]{.ul} | [0.464]{.ul} | [0.432]{.ul} | [0.257]{.ul} | 0.519 | [0.201]{.ul} | [0.445]{.ul} | 0.495 | 0.508 |
| +FourierMix +HCR | 0.535 | 0.460 | 0.384 | 0.473 | 0.521 | [0.528]{.ul} | 0.530 | [0.513]{.ul} | 0.492 | 0.470 | 0.464 | 0.477 | 0.477 | 0.432 | 0.275 | [0.517]{.ul} | 0.220 | 0.462 | 0.511 | 0.524 |


image{width="0.85\linewidth"}

[[fig:motivation2]]{#fig:motivation2 label="fig:motivation2"}

The ranking of SOTA models [@hendrycks2019augmix; @kireev2021effectiveness; @rebuffi2021fixing] (based on empirical robust accuracy) changes across datasets and corruption types, suggesting there is no single model which performs the best on different OOD benchmarks.{#fig:robustbench width="0.95\linewidth"}


| Augmentation | mACR | Blue | Brown | Checkerboard | Circular | Inv. Sparkle | Lines | Pinch | Ripple | Sparkles | Trans. Chromatic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.314 | 0.351 | 0.255 | 0.310 | 0.386 | 0.222 | 0.336 | 0.398 | 0.365 | 0.251 | 0.269 |
| Gaussian +JSD | 0.393 | [0.458]{.ul} | 0.303 | [0.395]{.ul} | 0.452 | 0.252 | [0.430]{.ul} | [0.492]{.ul} | 0.463 | 0.306 | 0.376 |
| +AutoAugment | 0.304 | 0.351 | 0.263 | 0.312 | 0.395 | 0.223 | 0.348 | 0.406 | 0.248 | 0.235 | 0.256 |
| +AutoAugment +JSD | 0.346 | 0.354 | 0.297 | 0.335 | 0.445 | 0.238 | 0.374 | 0.436 | 0.402 | 0.269 | 0.308 |
| +AugMix | 0.341 | 0.389 | 0.269 | 0.334 | 0.439 | 0.233 | 0.358 | 0.416 | 0.397 | 0.272 | 0.307 |
| +AugMix +JSD | 0.382 | 0.429 | 0.303 | 0.372 | 0.483 | 0.255 | 0.404 | 0.467 | 0.450 | 0.306 | 0.350 |
| +AugMix +HCR | 0.393 | 0.442 | [0.309]{.ul} | 0.384 | [0.486]{.ul} | [0.268]{.ul} | 0.419 | 0.471 | [0.464]{.ul} | [0.320]{.ul} | 0.368 |
| +FourierMix | 0.348 | 0.391 | 0.269 | 0.331 | 0.441 | 0.237 | 0.368 | 0.432 | 0.401 | 0.280 | 0.325 |
| +FourierMix +JSD | [0.397]{.ul} | 0.445 | 0.307 | [0.395]{.ul} | 0.482 | 0.265 | [0.430]{.ul} | 0.490 | 0.463 | [0.320]{.ul} | [0.377]{.ul} |
| +FourierMix +HCR | 0.419 | 0.474 | 0.317 | 0.418 | 0.504 | 0.289 | 0.459 | 0.501 | 0.486 | 0.339 | 0.406 |


Appendix


| Augmentation | CIFAR-100 | mACR | -Low | -Mid | -High | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.238 | 0.169 | 0.131 | 0.182 | 0.208 | 0.214 | 0.218 | 0.193 | 0.181 | 0.170 | 0.157 | 0.169 | 0.177 | 0.153 | 0.069 | 0.207 | 0.051 | 0.159 | 0.206 | 0.209 |
| Gaussian +JSD | 0.291 | 0.232 | 0.167 | 0.248 | 0.280 | 0.283 | 0.285 | 0.273 | 0.261 | 0.252 | 0.240 | 0.250 | 0.226 | 0.188 | 0.104 | 0.242 | 0.079 | 0.235 | 0.278 | 0.281 |
| +AutoAugment + JSD | 0.265 | 0.225 | 0.175 | 0.234 | 0.265 | 0.275 | 0.273 | 0.252 | 0.248 | 0.230 | 0.230 | 0.238 | 0.232 | 0.202 | 0.104 | 0.257 | 0.082 | 0.225 | 0.261 | 0.266 |
| +AugMix + JSD | 0.286 | 0.231 | 0.184 | 0.240 | 0.269 | 0.274 | 0.278 | 0.256 | 0.255 | 0.236 | 0.233 | 0.243 | 0.239 | 0.211 | 0.111 | 0.267 | 0.092 | 0.232 | 0.267 | 0.270 |
| +AugMix + HCR | [0.296]{.ul} | [0.249]{.ul} | [0.191]{.ul} | [0.263]{.ul} | [0.292]{.ul} | [0.296]{.ul} | [0.301]{.ul} | 0.282 | [0.278]{.ul} | [0.264]{.ul} | [0.255]{.ul} | [0.263]{.ul} | 0.249 | 0.215 | [0.118]{.ul} | 0.274 | [0.097]{.ul} | [0.253]{.ul} | [0.291]{.ul} | [0.292]{.ul} |
| +FourierMix + JSD | 0.295 | 0.247 | 0.190 | 0.258 | [0.292]{.ul} | 0.295 | 0.300 | [0.283]{.ul} | 0.273 | 0.257 | 0.249 | 0.260 | [0.251]{.ul} | [0.217]{.ul} | 0.115 | [0.275]{.ul} | 0.092 | 0.250 | 0.288 | [0.292]{.ul} |
| +FourierMix + HCR | 0.309 | 0.261 | 0.199 | 0.278 | 0.307 | 0.310 | 0.313 | 0.302 | 0.291 | 0.283 | 0.270 | 0.277 | 0.260 | 0.221 | 0.128 | 0.284 | 0.102 | 0.267 | 0.303 | 0.307 |



| Augmentation | mACR | Blue | Brown | Checkerboard | Circular | Inv. Sparkle | Lines | Pinch | Ripple | Sparkles | Trans. Chromatic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.130 | 0.151 | 0.070 | 0.114 | 0.159 | 0.097 | 0.137 | 0.199 | 0.160 | 0.097 | 0.116 |
| Gaussian +JSD | 0.196 | 0.228 | 0.106 | 0.186 | 0.233 | 0.124 | 0.221 | 0.274 | 0.242 | 0.151 | 0.193 |
| +AutoAugment + JSD | 0.176 | 0.211 | 0.087 | 0.152 | 0.229 | 0.119 | 0.184 | 0.236 | 0.217 | 0.140 | 0.185 |
| +AugMix + JSD | 0.193 | 0.227 | 0.107 | 0.176 | 0.259 | 0.131 | 0.206 | 0.260 | 0.244 | 0.153 | 0.191 |
| +AugMix + HCR | [0.211]{.ul} | [0.253]{.ul} | [0.120]{.ul} | [0.199]{.ul} | [0.276]{.ul} | [0.136]{.ul} | 0.224 | [0.283]{.ul} | [0.263]{.ul} | [0.156]{.ul} | 0.203 |
| +FourierMix + JSD | 0.207 | 0.243 | 0.106 | 0.194 | 0.262 | [0.136]{.ul} | [0.226]{.ul} | 0.281 | 0.258 | 0.154 | [0.205]{.ul} |
| +FourierMix + HCR | 0.227 | 0.260 | 0.129 | 0.219 | 0.281 | 0.151 | 0.247 | 0.300 | 0.278 | 0.172 | 0.228 |


Amplitude Spectrum of CIFAR-10-C/$\overline{\text{C}}$ {#psd_cifar-10-c}

As introduced in § 3{reference-type="ref" reference="sec:rs_on_c"}, we arrange the amplitude spectra of the corruptions from CIFAR-10-C into three groups, roughly categorized as high/mid/low-frequency corruptions. Specifically, we compute $\mathbb{E}[\text{FFT}({\bm{x}})]$ and $\mathbb{E}[\text{FFT}(C({\bm{x}})-{\bm{x}})]$ by averaging over all the validation images [@yin2019fourier] for CIFAR-10 and for each corruption in CIFAR-10-C, respectively, where $C(\cdot)$ denotes the corruption function. As Figure [fig:psd_cifar10c]{reference-type="ref" reference="fig:psd_cifar10c"} shows, clean CIFAR-10 images follow a $\frac{1}{f^\alpha}$ distribution, where $f=\sqrt{u^2+v^2}$ is the azimuthal frequency and $\alpha \approx 2$. Clean images therefore carry very little power in the high-frequency regions (the edges and corners of the spectrum). Relative to this clean-image distribution, the noise corruptions together with JPEG and pixelate can be considered high-frequency corruptions, whereas the weather-related and contrast corruptions are concentrated in the low-frequency region. We categorize the remaining corruptions as mid-frequency.
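For concreteness, a minimal NumPy sketch of this averaging is given below; the function name and the assumption that images are float arrays of shape (H, W, C) are ours and are not part of the released code.

```python
import numpy as np

def mean_amplitude_spectrum(images, corruption=None):
    """Average the 2-D amplitude spectrum over a set of images of shape (H, W, C).

    If `corruption` is provided, the spectrum of the corruption delta
    C(x) - x is averaged instead, following the analysis of Yin et al.
    """
    spectra = []
    for x in images:
        delta = corruption(x) - x if corruption is not None else x
        # Per-channel 2-D FFT, shift the DC component to the center,
        # then average the channel-wise amplitudes.
        fft = np.fft.fftshift(np.fft.fft2(delta, axes=(0, 1)), axes=(0, 1))
        spectra.append(np.abs(fft).mean(axis=-1))
    return np.stack(spectra).mean(axis=0)
```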

We also visualize the amplitude spectra of the corruptions from CIFAR-10-$\bar{\text{C}}$ in Figure [fig:psd_cifar10cbar]{reference-type="ref" reference="fig:psd_cifar10cbar"}. Most of the corruptions in CIFAR-10-$\bar{\text{C}}$ are concentrated in the low/mid-frequency ranges, which explains why FourierMix achieves larger improvements over spectrally-biased baselines on CIFAR-10-$\bar{\text{C}}$ than on CIFAR-10-C.

Empirical Robust Accuracy of SOTA Models on Corrupted Data {#app:robustbench}

The results in Figure 3{reference-type="ref" reference="fig:robustbench"} show the empirical robust accuracy of state-of-the-art models on existing OOD benchmarks. We use the recently proposed RobustBench benchmark [@croce2020robustbench] and select the top-performing models on CIFAR-10-C for this experiment [@kireev2021effectiveness; @hendrycks2019augmix; @rebuffi2021fixing]. As evident from the figure, the performance of the models varies across datasets and corruption types, showing that no single model achieves the best performance on all types of OOD data. Evaluating a model on a single benchmark is therefore not enough to obtain a true picture of its OOD robustness; to eliminate the biases of individual OOD benchmarks, one should gauge OOD robustness on a variety of datasets. Our proposed CIFAR-10/100-F benchmark can be used by designers to probe the spectral biases of their models.
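As a rough illustration of this evaluation protocol, a model can be pulled from RobustBench and scored on one corruption as sketched below; the model name and loader signatures follow the RobustBench documentation and should be treated as assumptions rather than part of our evaluation code.

```python
import torch
from robustbench.data import load_cifar10c
from robustbench.utils import load_model

# One of the top CIFAR-10-C entries on the leaderboard (illustrative choice).
model = load_model(model_name="Hendrycks2020AugMix_ResNeXt",
                   dataset="cifar10", threat_model="corruptions").eval()

# 1,000 fog images at the highest severity level.
x, y = load_cifar10c(n_examples=1000, corruptions=["fog"], severity=5)
with torch.no_grad():
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"Empirical robust accuracy on fog (severity 5): {acc:.3f}")
```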

Detailed Evaluation Results

CIFAR-10-Based OOD Benchmarks {#app:detailed_cifar10}

In this section, we present detailed results for our evaluation on CIFAR-10-C/$\bar{\text{C}}$. We fix $\eta=10$ and use $\lambda=40$ for HCR (Equation [eq:hcr]{reference-type="ref" reference="eq:hcr"}) in our experiments on CIFAR-10. Tables [tb:cifar10c]{reference-type="ref" reference="tb:cifar10c"} and [tb:cifar10c-bar]{reference-type="ref" reference="tb:cifar10c-bar"} present the ACR on individual corruption types from CIFAR-10-C/$\bar{\text{C}}$, respectively. FourierMix achieves the highest ACR on most of the corruption types in both OOD datasets. In particular, FourierMix yields larger improvements on weather-related corruptions, which have direct real-world implications (e.g., the safety of autonomous driving).


| Augmentation | ImageNet | mACR | -Low | -Mid | -High | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.600 | 0.256 | 0.155 | 0.228 | 0.385 | 0.342 | 0.324 | 0.310 | 0.174 | 0.227 | 0.212 | 0.201 | 0.148 | 0.170 | 0.013 | 0.419 | 0.027 | 0.325 | 0.440 | 0.507 |
| +JSD | 0.736 | 0.395 | 0.220 | 0.382 | 0.581 | 0.537 | 0.519 | 0.508 | 0.289 | 0.378 | 0.351 | 0.374 | [0.254]{.ul} | 0.245 | 0.013 | 0.551 | 0.039 | 0.518 | 0.640 | 0.702 |
| +AugMix + JSD | 0.717 | 0.391 | 0.238 | [0.387]{.ul} | 0.550 | 0.496 | 0.489 | 0.473 | 0.329 | 0.395 | 0.376 | 0.352 | 0.255 | 0.286 | 0.041 | 0.542 | 0.064 | 0.481 | 0.622 | 0.668 |
| +AugMix + HCR | 0.727 | 0.390 | 0.234 | 0.383 | 0.552 | 0.500 | 0.494 | 0.480 | [0.320]{.ul} | [0.391]{.ul} | 0.374 | 0.349 | 0.249 | 0.283 | [0.040]{.ul} | 0.539 | 0.061 | 0.481 | 0.624 | 0.662 |
| +FourierMix + JSD | 0.751 | 0.399 | 0.242 | 0.389 | 0.564 | 0.515 | 0.493 | 0.483 | 0.315 | 0.384 | 0.380 | [0.370]{.ul} | [0.254]{.ul} | 0.300 | 0.041 | [0.544]{.ul} | 0.073 | [0.497]{.ul} | [0.637]{.ul} | [0.694]{.ul} |
| +FourierMix + HCR | [0.750]{.ul} | [0.397]{.ul} | [0.239]{.ul} | [0.387]{.ul} | [0.567]{.ul} | [0.518]{.ul} | [0.499]{.ul} | [0.492]{.ul} | 0.312 | 0.382 | [0.377]{.ul} | [0.370]{.ul} | 0.249 | [0.295]{.ul} | 0.039 | [0.544]{.ul} | [0.069]{.ul} | 0.494 | [0.637]{.ul} | 0.689 |



| Augmentation | mACR | Blue | Brown | Caustic | Checkerboard | Cocentric | Inv. Sparkle | Perlin | Plasma | Single Freq. | Sparkle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.266 | 0.394 | 0.284 | 0.325 | 0.250 | 0.235 | 0.152 | 0.274 | 0.065 | 0.284 | 0.400 |
| +JSD | 0.395 | 0.579 | 0.395 | 0.512 | 0.370 | 0.374 | 0.224 | 0.404 | 0.113 | 0.408 | 0.567 |
| +AugMix + JSD | 0.379 | 0.560 | 0.381 | 0.461 | [0.365]{.ul} | 0.342 | 0.212 | 0.413 | 0.121 | 0.397 | 0.538 |
| +AugMix + HCR | 0.378 | 0.563 | 0.377 | 0.464 | 0.361 | 0.342 | 0.210 | [0.410]{.ul} | 0.115 | 0.396 | 0.539 |
| +FourierMix + JSD | 0.413 | 0.562 | 0.544 | 0.479 | 0.370 | 0.366 | [0.215]{.ul} | 0.413 | 0.227 | 0.417 | 0.547 |
| +FourierMix + HCR | [0.411]{.ul} | [0.565]{.ul} | [0.535]{.ul} | [0.481]{.ul} | [0.365]{.ul} | [0.367]{.ul} | 0.210 | 0.408 | [0.215]{.ul} | [0.415]{.ul} | [0.550]{.ul} |


| Method | Clean | mACR | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.461 | 0.363 | 0.448 | 0.448 | 0.421 | 0.380 | 0.346 | 0.338 | 0.357 | 0.394 | 0.347 | 0.187 | 0.439 | 0.137 | 0.342 | 0.420 | 0.440 |
| MACER | 0.539 | 0.426 | 0.509 | 0.509 | 0.492 | 0.460 | 0.436 | 0.422 | 0.433 | 0.443 | 0.381 | 0.232 | 0.477 | 0.185 | 0.428 | 0.490 | 0.503 |
| SmoothAdv | 0.519 | 0.411 | 0.483 | 0.485 | 0.471 | 0.448 | 0.426 | 0.423 | 0.425 | 0.418 | 0.361 | 0.222 | 0.451 | 0.175 | 0.415 | 0.472 | 0.483 |

| Adaptation | mACR | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 0.363 | 0.448 | 0.448 | 0.421 | 0.380 | 0.346 | 0.338 | 0.357 | 0.394 | 0.347 | 0.187 | 0.439 | 0.137 | 0.342 | 0.420 | 0.440 |
| +BN | 0.356 | 0.441 | 0.442 | 0.417 | 0.369 | 0.338 | 0.326 | 0.345 | 0.392 | 0.347 | 0.181 | 0.436 | 0.133 | 0.332 | 0.411 | 0.432 |
| +TENT | 0.357 | 0.442 | 0.442 | 0.419 | 0.369 | 0.337 | 0.328 | 0.346 | 0.394 | 0.345 | 0.182 | 0.436 | 0.132 | 0.330 | 0.412 | 0.434 |

CIFAR-100-Based OOD Benchmarks {#app:detailed_cifar100}

In this section, we present detailed results for our evaluation on CIFAR-100-C/$\bar{\text{C}}$ and CIFAR-100-F. We fix $\eta=10$ and use $\lambda=20$ for HCR in our experiments on CIFAR-100. Tables [tb:cifar100c]{reference-type="ref" reference="tb:cifar100c"} and [tb:cifar100c-bar]{reference-type="ref" reference="tb:cifar100c-bar"} present the ACR on individual corruption types from CIFAR-100-C/$\bar{\text{C}}$, respectively. CIFAR-100 is more difficult for RS-based certification than CIFAR-10. We find that FourierMix+HCR achieves the highest ACR on all corruption types in both datasets, with significant gains over existing augmentation methods.

ImageNet-Based OOD Benchmarks {#app:detailed_imagenet}

ImageNet appears to be the most challenging dataset for certified defenses, to which only RS-based techniques can be applied [@cohen2019certified]. We select representative combinations of augmentation and regularization schemes that perform well on CIFAR-10/100 for our ImageNet experiments. We exclude the input normalization layer, which trades off the ACR on clean data for the ACR on OOD data. We use $\eta=5$ and $\lambda=5$ for our experiments with HCR. Tables [tb:imagenet-c]{reference-type="ref" reference="tb:imagenet-c"} and [tb:imagenet-c-bar]{reference-type="ref" reference="tb:imagenet-c-bar"} present the detailed results of our evaluation on ImageNet-C/$\bar{\text{C}}$. Note that the corruption types in ImageNet-$\bar{\text{C}}$ differ from the ones in CIFAR-10/100-$\bar{\text{C}}$. We find that the spectral biases of the other baselines become much more noticeable on ImageNet-based OOD benchmarks: Gaussian+JSD achieves the highest ACR on high-frequency corruptions, while AugMix+JSD performs best on several low-frequency corruptions in ImageNet-C. As RS-based models generally suffer performance degradation on low-frequency corruptions, Gaussian+JSD beats AugMix+JSD in terms of overall mACR. FourierMix, however, performs well across the spectrum, reaching the highest mACR on both datasets. Although FourierMix realizes tangible improvements on ImageNet-based OOD benchmarks, we want to highlight that there is still considerable room for future research to improve over our baselines. We hope this work will motivate more studies on certified defenses for ImageNet under OOD shifts, as discussed in § 7{reference-type="ref" reference="sec:discussion"}.

FourierMix Details {#app:fmix-images}

**Hyper-parameter Settings.** We detail the hyper-parameters used in the experiments with FourierMix. As illustrated in Algorithm [alg:fmix]{reference-type="ref" reference="alg:fmix"} and Equations [eq:fouriermix1]{reference-type="ref" reference="eq:fouriermix1"} and [eq:fouriermix2]{reference-type="ref" reference="eq:fouriermix2"}, we leverage 5 different severity levels and a truncated Gaussian distribution. We use a large $\sigma=5$ for the truncated Gaussian so that FourierMix produces more diverse augmentations. For CIFAR-10/100, we set $s_\mathbf{A} \in [0.2,0.3,0.4,0.5,0.6]$ and $s_\mathbf{P} \in [\frac{\pi}{12},\frac{\pi}{10},\frac{\pi}{8},\frac{\pi}{6},\frac{\pi}{4}]$ as the 5 severity levels in Equations [eq:fouriermix1]{reference-type="ref" reference="eq:fouriermix1"} and [eq:fouriermix2]{reference-type="ref" reference="eq:fouriermix2"}, respectively. For ImageNet, we use the same set of $s_\mathbf{A}$ and set $s_\mathbf{P} \in [\frac{\pi}{4},\frac{3\pi}{10},\frac{3\pi}{8},\frac{\pi}{2},\frac{3\pi}{4}]$, since high-resolution images can tolerate larger perturbations in the phase spectrum.
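To make the sampling concrete, the snippet below sketches one plausible way of drawing a perturbation strength from a chosen severity level with a truncated Gaussian; the function and its exact parameterization are illustrative assumptions, not the released FourierMix implementation.

```python
import numpy as np
from scipy.stats import truncnorm

# Severity levels for CIFAR-10/100, as listed above.
S_A = [0.2, 0.3, 0.4, 0.5, 0.6]                           # amplitude severities
S_P = [np.pi/12, np.pi/10, np.pi/8, np.pi/6, np.pi/4]     # phase severities

def sample_strength(levels, sigma=5.0):
    """Pick a severity level s, then draw a strength from a zero-mean
    Gaussian with std `sigma` truncated to [0, s]. With sigma >> s the
    draw is nearly uniform on [0, s], spreading the augmentations over
    the whole range (the "more diverse" behavior noted above)."""
    s = np.random.choice(levels)
    # truncnorm expresses its bounds in units of the standard deviation.
    return truncnorm.rvs(a=0.0, b=s / sigma, loc=0.0, scale=sigma)
```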

**Sample Images from FourierMix.** We visualize randomly sampled FourierMix-augmented images from CIFAR-10/100 and ImageNet in Figures [fig:fmixcifar10]{reference-type="ref" reference="fig:fmixcifar10"}, [fig:fmixcifar100]{reference-type="ref" reference="fig:fmixcifar100"}, and [fig:fmiximagenet]{reference-type="ref" reference="fig:fmiximagenet"}, respectively.

Training and Evaluation Details {#app:training-details}

**Training.** We train CIFAR-10/100 models for 200 epochs and ImageNet models for 90 epochs for all methods, using an SGD optimizer [@ruder2016overview]. We exclude the input normalization layer, as it degrades the certification performance on OOD data. We use different $\sigma$ values to train the CIFAR-10/100 and ImageNet models, as specified in § 5{reference-type="ref" reference="sec:experiment"}.
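For reference, a single Gaussian-augmented training step in this setting looks roughly as follows; the model, optimizer, batch, and $\sigma$ value are placeholders, and the sketch omits the consistency regularizers discussed above.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, y, sigma=0.25):
    """One SGD step with Gaussian noise augmentation; `sigma` matches the
    noise level later used for randomized-smoothing certification."""
    model.train()
    noisy = x + torch.randn_like(x) * sigma   # no input normalization layer
    loss = F.cross_entropy(model(noisy), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```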

**Evaluation.** Recall from § 3{reference-type="ref" reference="sec:rs_on_c"} that, by the theorem of Cohen et al., $\text{CR}(\cdot)$ approaches $\infty$ as $\underline{p_A}$ approaches $1$ [@cohen2019certified]. However, this would also require the number of Gaussian-perturbed samples $n$ to approach infinity. If the base classifier $\mathcal{M}({\bm{x}}+{\bm{\delta}})$ returns $c_A$ on all $n$ noise samples, then $p_A \geq \alpha^{(1/n)}$ holds with probability at least $1-\alpha$ [@cohen2019certified]. To constrain the computational cost while still obtaining a tight bound, we use $n=100,000$, $n_0=100$, and $\alpha=0.001$, so that the computed radius holds with high confidence, following prior work [@cohen2019certified; @Zhai2020MACER; @salman2019provably; @jeong2020consistency]. Since the OOD datasets are $125 \times$ larger than the original test sets, we certify 500 and 350 examples from each corruption and severity level of the CIFAR-10/100 and ImageNet OOD datasets (-C/$\bar{\text{C}}$), respectively. For the Fourier sensitivity analysis on CIFAR-10/100, each data point in the heat map is the ACR over 200 examples.
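As a quick sanity check on these numbers, the bound above together with Cohen et al.'s radius $R=\sigma\,\Phi^{-1}(\underline{p_A})$ caps the largest radius that can ever be certified with $n$ samples; the $\sigma$ below is only an example value.

```python
from scipy.stats import norm

n, alpha = 100_000, 0.001
sigma = 0.25                               # example noise level, not a fixed choice
p_lower = alpha ** (1.0 / n)               # lower confidence bound when all n samples agree
radius_cap = sigma * norm.ppf(p_lower)     # about 3.81 * sigma; no larger radius is certifiable
print(p_lower, radius_cap)
```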

image{width="\linewidth"}

image{width="0.9\linewidth"}

Sample Images from CIFAR-10/100-F {#app:sample_cifarf}

We visualize more sample images from our created datasets in Figure [fig:fsamples]{reference-type="ref" reference="fig:fsamples"}, covering different classes. It is also worth noting that the FourierMix-augmented images (Figures [fig:fmixcifar10]{reference-type="ref" reference="fig:fmixcifar10"} and [fig:fmixcifar100]{reference-type="ref" reference="fig:fmixcifar100"}) exhibit patterns different from those in CIFAR-10/100-F.

Discussion on Test-Time Adaptation

As discussed in Appendix 2{reference-type="ref" reference="sec:related"}, another widely acknowledged approach for countering OOD shifts is test-time adaptation. We therefore perform a preliminary study of how test-time adaptation affects certified robustness. Specifically, we use BN adaptation [@saenko2010adapting] and TENT [@wang2021tent] as representative methods. Since the theorem derived by Cohen et al. [@cohen2019certified] requires the base classifier $\mathcal{M}$ to be deterministic, we cannot apply BN and TENT in an online manner. Instead, when evaluating the ACR on OOD data from a specific corruption type, we randomly sample 500 (out of 10,000) images from the OOD test set for adaptation. We follow the other settings specified in [@saenko2010adapting; @wang2021tent]. Table [tb:testtime]{reference-type="ref" reference="tb:testtime"} presents the detailed results on CIFAR-10-C. We find that test-time adaptation fails to improve the ACR in the OOD setting: one-shot adaptation relies on a small amount of data, which is not sufficient to correct the OOD shift and may instead bias the base classifier $\mathcal{M}$ towards the small subset of test data used for adaptation. We note that certifying adaptive models is a potential direction for improving OOD certified robustness; it requires further theoretical support, and we leave it as promising future work.
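A minimal sketch of the offline BN adaptation used here is given below: the BatchNorm running statistics are re-estimated on the sampled OOD subset and then frozen, so the adapted base classifier remains deterministic for certification. The helper is our own illustration, not the reference implementation of [@saenko2010adapting] or [@wang2021tent].

```python
import torch

@torch.no_grad()
def adapt_bn_statistics(model, adaptation_loader):
    """Re-estimate BatchNorm running statistics on a small OOD subset,
    then return the model in eval mode so it stays deterministic."""
    # Reset running stats so they are estimated purely from the OOD data.
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None            # use a cumulative moving average
    model.train()                         # BN layers update stats in train mode
    for x, _ in adaptation_loader:
        model(x)
    return model.eval()
```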

image{width="0.9\linewidth"}

image{width="0.9\linewidth"}

image{width="0.9\linewidth"}

[^1]: Pre-print under review. Code and benchmark datasets will be open-sourced upon paper acceptance.

[^2]: Work partially done during internship at Lawrence Livermore National Laboratory (LLNL).

[^3]: As Gaussian augmentation is fundamental to RS-based certified defenses, we focus on improving the performance of Gaussian augmentation based defenses under OOD corruptions. Further ACR gains can be achieved by leveraging the techniques proposed in this paper along with more advanced SOTA certified defenses such as SmoothAdv and MACER.