abstract: | Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves competitive performance on both object detection and image classification models, reducing the FPR95 by up to 9.36% compared to the previous best method on object detectors. Code is available at https://github.com/deeplearning-wisc/vos.
author:
- |
Xuefeng Du, Zhaoning Wang, Mu Cai, Yixuan Li
Department of Computer Sciences
University of Wisconsin - Madison
{xfdu,mucai,sharonli}@cs.wisc.edu
bibliography:
- citations.bib
title: |
VOS: Learning What You Don't Know by
Virtual Outlier Synthesis
Introduction
Modern deep neural networks have achieved unprecedented success in known contexts for which they are trained, yet they often struggle to handle the unknowns. In particular, neural networks have been shown to produce high posterior probability for out-of-distribution (OOD) test inputs [@nguyen2015deep], which arise from unknown categories and should not be predicted by the model. Taking a self-driving car as an example, an object detection model trained to recognize in-distribution objects (e.g., cars, stop signs) can produce a high-confidence prediction for an unseen object such as a moose; see Figure 1{reference-type="ref" reference="fig::GMM_toy"}(a). Such a failure case raises concerns about model reliability, and worse, may lead to catastrophe when deployed in safety-critical applications.
The vulnerability to OOD inputs arises from the lack of explicit knowledge of unknowns during training. In particular, neural networks are typically optimized only on the in-distribution (ID) data. The resulting decision boundary, despite being useful for ID tasks such as classification, can be ill-suited for OOD detection. We illustrate this in Figure 1{reference-type="ref" reference="fig::GMM_toy"}. The ID data (gray) consists of three class-conditional Gaussians, on which a three-way softmax classifier is trained. The resulting classifier is overconfident in regions far away from the ID data (see the red shade in Figure 1{reference-type="ref" reference="fig::GMM_toy"}(b)), causing trouble for OOD detection. Ideally, a model should learn a more compact decision boundary that produces low uncertainty for the ID data, and high OOD uncertainty elsewhere (e.g., Figure 1{reference-type="ref" reference="fig::GMM_toy"}(c)). However, achieving this goal is non-trivial due to the lack of supervision signals from unknowns. This motivates the question: Can we synthesize virtual outliers for effective model regularization?
In this paper, we propose a novel unknown-aware learning framework dubbed VOS (Virtual Outlier Synthesis), which optimizes the dual objectives of both ID task and OOD detection performance. In a nutshell, VOS consists of three components tackling challenges of outlier synthesis and effective model regularization with synthesized outliers. To synthesize the outliers, we estimate the class-conditional distribution in the feature space, and sample outliers from the low-likelihood region of ID classes (Section 3.1{reference-type="ref" reference="sec:gda"}). Key to our method, we show that sampling in the feature space is more tractable than synthesizing images in the high-dimensional pixel space [@lee2018training]. Alongside, we propose a novel unknown-aware training objective, which contrastively shapes the uncertainty surface between the ID data and synthesized outliers (Section 3.2{reference-type="ref" reference="sec:training"}). During training, VOS simultaneously performs the ID task (e.g., classification or object detection) as well as the OOD uncertainty regularization. During inference time, the uncertainty estimation branch produces a larger probabilistic score for ID data and vice versa, which enables effective OOD detection (Section 3.3{reference-type="ref" reference="sec:inference"}).
{#fig::GMM_toy width="0.95\linewidth" height="0.35\linewidth"}[[fig:teaser]]{#fig:teaser label="fig:teaser"}
VOS offers several compelling advantages compared to existing solutions. (1) VOS is a general learning framework that is effective for both object detection and image classification tasks, whereas previous methods were primarily driven by image classification. Image-level detection can be limiting as an image could be OOD in certain regions while being in-distribution elsewhere. Our work bridges a critical research gap since OOD detection for object detection is timely yet underexplored in literature. (2) VOS enables adaptive outlier synthesis, which can be flexibly and conveniently used for any ID data without manual data collection or cleaning. In contrast, previous methods using outlier exposure [@hendrycks2018deep] require an auxiliary image dataset that is sufficiently diverse, which can be arguably prohibitive to obtain. Moreover, one needs to perform careful data cleaning to ensure the auxiliary outlier dataset does not overlap with ID data. (3) VOS synthesizes outliers that can estimate a compact decision boundary between ID and OOD data. In contrast, existing solutions use outliers that are either too trivial to regularize the OOD estimator, or too hard to be separated from ID data, resulting in sub-optimal performance.
Our key contributions and results are summarized as follows:
- We propose a new framework VOS addressing a pressing issue: unknown-aware deep learning that optimizes for both ID and OOD performance. VOS establishes state-of-the-art results on a challenging object detection task. Compared to the best method, VOS reduces the FPR95 by up to 9.36% while preserving the accuracy on the ID task.

- We conduct extensive ablations and reveal important insights by contrasting different outlier synthesis approaches. We show that VOS is more advantageous than generating outliers directly in the high-dimensional pixel space (e.g., using GAN [@lee2018training]) or using noise as outliers.

- We comprehensively evaluate our method on common OOD detection benchmarks, along with a more challenging yet underexplored task in the context of object detection. Our effort facilitates future research to evaluate OOD detection in a real-world setting.
Problem Setup
We start by formulating the problem of OOD detection in the setting of object detection. Our framework can be easily generalized to image classification when the bounding box is the entire image (see Section [sec:cls]{reference-type="ref" reference="sec:cls"}). Most previous formulations of OOD detection treat entire images as anomalies, which can lead to the ambiguity shown in Figure 1{reference-type="ref" reference="fig::GMM_toy"}. In particular, natural images are composed of numerous objects and components. Knowing which regions of an image are anomalous allows for safer handling of unfamiliar objects. This setting is more realistic in practice, yet also more challenging, as it requires reasoning about OOD uncertainty at the fine-grained object level.
Specifically, we denote the input and label space by $\mathcal{X}=\mathbb{R}^d$ and $\mathcal{Y}=\{1,2,\ldots,K\}$, respectively. Let $\mathbf{x} \in \mathcal{X}$ be the input image, $\mathbf{b} \in \mathbb{R}^4$ be the bounding box coordinates associated with object instances in the image, and $y \in \mathcal{Y}$ be the semantic label for $K$-way classification. An object detection model is trained on in-distribution data $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{b}_{i}, y_{i})\}_{i=1}^{N}$ drawn from an unknown joint distribution $\mathcal{P}$. We use neural networks with parameters $\theta$ to model the bounding box regression $p_\theta(\mathbf{b} \vert \mathbf{x})$ and the classification $p_\theta(y \vert \mathbf{x}, \mathbf{b})$.
OOD detection can be formulated as a binary classification problem that distinguishes between in- vs. out-of-distribution objects. Let $P_{\mathcal{X}}$ denote the marginal probability distribution on $\mathcal{X}$. Given a test input $\mathbf{x}^* \sim P_{\mathcal{X}}$, as well as an object instance $\mathbf{b}^*$ predicted by the object detector, the goal is to predict $p_\theta(g \vert \mathbf{x}^*, \mathbf{b}^*)$. We use $g=1$ to indicate a detected object being in-distribution, and $g=0$ being out-of-distribution, with semantics outside the support of $\mathcal{Y}$.
{#fig:overview width="100%"}
Method {#sec:method}
Our novel unknown-aware learning framework is illustrated in Figure 2{reference-type="ref" reference="fig:overview"}. The framework encompasses three novel components, addressing the following questions: (1) how to synthesize virtual outliers (Section 3.1{reference-type="ref" reference="sec:gda"}), (2) how to leverage the synthesized outliers for effective model regularization (Section 3.2{reference-type="ref" reference="sec:training"}), and (3) how to perform OOD detection during inference (Section 3.3{reference-type="ref" reference="sec:inference"}).
VOS: Virtual Outlier Synthesis {#sec:gda}
Our framework VOS generates virtual outliers for model regularization, without relying on external data. While a straightforward idea is to train generative models such as GANs [@goodfellow2014generative; @lee2018training], synthesizing images in the high-dimensional pixel space can be difficult to optimize. Instead, our key idea is to synthesize virtual outliers in the feature space, which is more tractable given lower dimensionality. Moreover, our method is based on a discriminatively trained classifier in the object detector, which circumvents the difficult optimization process in training generative models.
Specifically, we assume the feature representation of object instances forms a class-conditional multivariate Gaussian distribution (see Figure [fig:umap]{reference-type="ref" reference="fig:umap"}): $$p_\theta(h(\mathbf{x},\mathbf{b}) \vert y=k)=\mathcal{N}(\boldsymbol\mu_{k}, \mathbf{\Sigma}),$$ where $\boldsymbol\mu_k$ is the Gaussian mean of class $k \in \{1,2,\ldots,K\}$, $\mathbf{\Sigma}$ is the tied covariance matrix, and $h(\mathbf{x},\mathbf{b}) \in \mathbb{R}^m$ is the latent representation of an object instance $(\mathbf{x},\mathbf{b})$. To extract the latent representation, we use the penultimate layer of the neural network. The dimensionality $m$ is significantly smaller than the input dimension $d$.
{width="\linewidth"}
[[fig:umap]]{#fig:umap label="fig:umap"}
To estimate the parameters of the class-conditional Gaussians, we compute the empirical class means $\widehat{\bm\mu}_{k}$ and the tied covariance $\widehat{\mathbf{\Sigma}}$ of the training samples $\{(\mathbf{x}_{i},\mathbf{b}_{i}, y_{i})\}_{i=1}^{N}$:
$$\begin{aligned} \widehat{\bm\mu}_{k}&=\frac{1}{N_{k}} \sum_{i: y_{i}=k} h(\mathbf{x}_i, \mathbf{b}_i), \\ \widehat{\mathbf{\Sigma}}&=\frac{1}{N} \sum_{k} \sum_{i: y_{i}=k}\left(h(\mathbf{x}_i, \mathbf{b}_i)-\widehat{\bm\mu}_{k}\right)\left(h(\mathbf{x}_i,\mathbf{b}_i)-\widehat{\bm\mu}_{k}\right)^{\top}, \label{eq:mean_cov}\end{aligned}$$
where $N_k$ is the number of objects in class $k$, and $N$ is the total number of objects. We use online estimation for efficient training, where we maintain a class-conditional queue with $|Q_k|$ object instances from each class. In each iteration, we enqueue the embeddings of objects to their corresponding class-conditional queues, and dequeue the same number of object embeddings.
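To make this concrete, below is a minimal PyTorch-style sketch of the queue-based estimation of Equation [eq:mean_cov]{reference-type="ref" reference="eq:mean_cov"}. This is our own illustration rather than the released implementation; the class name, method names, and the small ridge term added for numerical stability are assumptions.

```python
import torch

class GaussianEstimator:
    """Online class-conditional Gaussian estimation from per-class FIFO queues."""

    def __init__(self, num_classes: int, feat_dim: int, queue_size: int = 1000):
        self.queues = [torch.empty(0, feat_dim) for _ in range(num_classes)]
        self.queue_size = queue_size

    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        """Enqueue new object embeddings h(x, b); drop the oldest beyond |Q_k|."""
        for k in range(len(self.queues)):
            q = torch.cat([self.queues[k], feats[labels == k]], dim=0)
            self.queues[k] = q[-self.queue_size:]

    def estimate(self):
        """Empirical class means and the tied covariance (assumes non-empty queues)."""
        means = [q.mean(dim=0) for q in self.queues]
        centered = torch.cat([q - mu for q, mu in zip(self.queues, means)], dim=0)
        cov = centered.t() @ centered / centered.shape[0]
        cov = cov + 1e-4 * torch.eye(cov.shape[0])  # ridge for invertibility
        return torch.stack(means), cov
```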
Sampling from the feature representation space. We propose sampling the virtual outliers from the feature representation space, using the multivariate distributions estimated above. Ideally, these virtual outliers should help estimate a more compact decision boundary between ID and OOD data.
To achieve this, we propose sampling the virtual outliers $\mathcal{V}_k$ from the $\epsilon$-likelihood region of the estimated class-conditional distribution: $$\begin{aligned} \mathcal{V}_k= \left\{ \mathbf{v}_k \;\Big\vert\; \frac{1}{(2 \pi)^{m / 2}|\widehat{\mathbf{\Sigma}}|^{1 / 2}} \exp \left(-\frac{1}{2}(\mathbf{v}_k-\widehat{\bm\mu}_k)^{\top} \widehat{\mathbf{\Sigma}}^{-1}(\mathbf{v}_k-\widehat{\bm\mu}_k)\right) < \epsilon \right\}, \label{eq:virtual}\end{aligned}$$ where $\mathbf{v}_k \sim \mathcal{N}(\widehat{\bm\mu}_k,\widehat{\mathbf{\Sigma}})$ denotes the sampled virtual outliers for class $k$, which lie in the sublevel set based on the likelihood. $\epsilon$ is sufficiently small so that the sampled outliers are near the class boundary.
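A sketch of this sampling step, continuing the assumptions above: we draw a per-class candidate pool from $\mathcal{N}(\widehat{\bm\mu}_k,\widehat{\mathbf{\Sigma}})$ and keep only the least likely samples, which implements the $\epsilon$-sublevel set by choosing $\epsilon$ as the $t$-th smallest likelihood in the pool (the selection rule used in our experiments; see Section [sec:experiment]{reference-type="ref" reference="sec:experiment"}):

```python
import torch
from torch.distributions import MultivariateNormal

def sample_virtual_outliers(means: torch.Tensor, cov: torch.Tensor,
                            pool_size: int = 10000, t: int = 1) -> torch.Tensor:
    """Return the t lowest-likelihood samples per class, drawn from N(mu_k, Sigma)."""
    outliers = []
    for mu in means:
        gauss = MultivariateNormal(mu, covariance_matrix=cov)
        pool = gauss.sample((pool_size,))   # candidate virtual outliers
        log_prob = gauss.log_prob(pool)     # log-likelihood of each candidate
        outliers.append(pool[log_prob.argsort()[:t]])
    return torch.cat(outliers, dim=0)       # shape: (K * t, m)
```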
Classification outputs for virtual outliers. For a given sampled virtual outlier $\mathbf{v}\in \mathbb{R}^m$, the output of the classification branch can be derived through a linear transformation: $$f(\mathbf{v}; \theta)= W_\text{cls}^\top\mathbf{v},$$ where $W_\text{cls}\in \mathbb{R}^{m\times K}$ is the weight matrix of the last fully connected layer. We proceed to describe how to regularize the output of virtual outliers for improved OOD detection.
Unknown-aware Training Objective {#sec:training}
We now introduce a new training objective for unknown-aware learning, leveraging the virtual outliers from Section 3.1{reference-type="ref" reference="sec:gda"}. The key idea is to perform the visual recognition task while regularizing the model to produce a low OOD score for ID data and a high OOD score for the synthesized outliers.
Uncertainty regularization for classification.
For simplicity, we first describe the regularization in the multi-class classification setting. The regularization loss should ideally optimize for the separability between the ID vs. OOD data under some function that captures the data density. However, directly estimating $\log p(\mathbf{x})$ can be computationally intractable as it requires sampling from the entire space $\mathcal{X}$. We note that the free energy $E(\mathbf{x};\theta) := - \log \sum_{k=1}^K e^{f_k(\mathbf{x};\theta)}$ equals $-\log p(\mathbf{x})$ up to an unknown additive constant, which can be seen from the following: $$p(y | \mathbf{x}) = \frac{p(\mathbf{x},y)}{p(\mathbf{x})} = \frac{e^{f_y(\mathbf{x};\theta)}}{\sum_{k=1}^K e^{f_k(\mathbf{x};\theta)}},$$ where $f_y(\mathbf{x}; {\theta})$ denotes the $y$-th element of the logit output corresponding to the label $y$. The free energy, i.e., the negative log partition function, was shown to be an effective uncertainty measurement for OOD detection [@liu2020energy].
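To spell out the connection: if the joint density is parameterized by the logits as $p(\mathbf{x}, y) = e^{f_y(\mathbf{x};\theta)}/Z(\theta)$ with an unknown normalizer $Z(\theta)$, then marginalizing over the label gives $$p(\mathbf{x}) = \sum_{k=1}^K p(\mathbf{x}, k) = \frac{\sum_{k=1}^K e^{f_k(\mathbf{x};\theta)}}{Z(\theta)}, \quad \text{hence} \quad \log p(\mathbf{x}) = -E(\mathbf{x};\theta) - \log Z(\theta),$$ so a lower free energy corresponds to a higher data density, up to the unknown constant $\log Z(\theta)$.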
Our idea is to explicitly perform a level-set estimation based on the energy function (threshold at 0), where the ID data has negative energy values and the synthesized outliers have positive energy: $$\begin{aligned}
\mathcal{L}_\text{uncertainty} = \mathbb{E}_{\mathbf{v}\sim \mathcal{V}}~\mathds{1}\{E(\mathbf{v};\theta) > 0\} + \mathbb{E}_{\mathbf{x}\sim \mathcal{D}}~\mathds{1}\{E(\mathbf{x};\theta) \le 0\}.\end{aligned}$$ This is a simpler objective than estimating the density. Since the $0/1$ loss is intractable, we replace it with the binary sigmoid loss, a smooth approximation of the $0/1$ loss, yielding the following: $$\mathcal{L}_\text{uncertainty}=\mathbb{E}_{\mathbf{v}\sim \mathcal{V}} \left[-\log \frac{1}{1+\exp^{- \phi(E(\mathbf{v};\theta))}} \right]+\mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \left[-\log \frac{\exp^{- \phi(E(\mathbf{x};\theta))}}{1+\exp^{- \phi(E(\mathbf{x};\theta))}} \right].
\label{eq:reg_loss}$$ Here $\phi(\cdot)$ is a nonlinear MLP function, which allows learning a flexible energy surface. The learning process shapes the uncertainty surface, which predicts high probability for ID data and low probability for the virtual outliers $\mathbf{v}$. @liu2020energy also employed energy for model uncertainty regularization; however, their loss is based on the squared hinge loss and requires tuning two margin hyperparameters. In contrast, our uncertainty regularization loss is completely hyperparameter-free and much easier to use in practice. Moreover, VOS produces a probabilistic score for OOD detection, whereas [@liu2020energy] relies on the non-probabilistic energy score.
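For illustration, here is a minimal PyTorch sketch of Equation [eq:reg_loss]{reference-type="ref" reference="eq:reg_loss"}, shown for the classification setting (the object-level version in Equation [eq:energy]{reference-type="ref" reference="eq:energy"} additionally weights each class term by the learnable $w_k$). The helper names are assumptions; `phi` stands for the nonlinear MLP $\phi(\cdot)$, e.g., `nn.Sequential(nn.Linear(1, 512), nn.ReLU(), nn.Linear(512, 1))`, applied to the scalar energy:

```python
import torch
import torch.nn.functional as F

def uncertainty_loss(logits_id: torch.Tensor, logits_vos: torch.Tensor, phi) -> torch.Tensor:
    """Binary sigmoid loss: ID objects as positives, virtual outliers as negatives."""
    energy_id = -torch.logsumexp(logits_id, dim=1)    # free energy E(x; theta)
    energy_vos = -torch.logsumexp(logits_vos, dim=1)  # free energy E(v; theta)
    scores = torch.cat([phi(energy_id.unsqueeze(1)),
                        phi(energy_vos.unsqueeze(1))], dim=0).squeeze(1)
    labels = torch.cat([torch.ones_like(energy_id),    # ID: -log sigmoid(-phi(E))
                        torch.zeros_like(energy_vos)]) # outlier: -log sigmoid(phi(E))
    return F.binary_cross_entropy_with_logits(-scores, labels)
```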
Object-level energy score.
In the case of object detection, we replace the image-level energy with the object-level energy score. For an ID object $(\mathbf{x}, \mathbf{b})$, the energy is defined as: $$E(\mathbf{x},\mathbf{b};\theta) = -\log \sum_{k=1}^K w_k\cdot \exp^{f_k((\mathbf{x},\mathbf{b});\theta)}, \label{eq:energy}$$ where $f_k((\mathbf{x},\mathbf{b});\theta)=W^\top_\text{cls}h(\mathbf{x},\mathbf{b})$ is the logit output for class $k$ in the classification branch. The energy score for a virtual outlier is defined analogously. In particular, we will show in Section [sec:experiment]{reference-type="ref" reference="sec:experiment"} that a learnable $\mathbf{w}$ is more flexible than a constant $\mathbf{w}$, given the inherent class imbalance in object detection datasets. Additional analysis on $w_k$ is in Appendix 13{reference-type="ref" reference="sec:app_visual_weight"}.
Overall training objective.
In the case of object detection, the overall training objective combines the standard object detection loss with the uncertainty regularization loss: $$\min_{\theta}~ \mathbb{E}_{(\mathbf{x}, \mathbf{b}, y) \sim \mathcal{D}}\left[\mathcal{L}_\text{cls}+\mathcal{L}_\text{loc}\right]+\beta \cdot \mathcal{L}_\text{uncertainty}, \label{eq:all_loss}$$ where $\beta$ is the weight of the uncertainty regularization, and $\mathcal{L}_\text{cls}$ and $\mathcal{L}_\text{loc}$ are the losses for classification and bounding box regression, respectively. For classification tasks, the objective simplifies by dropping $\mathcal{L}_\text{loc}$. We provide ablation studies in Section [sec:exp_baseline]{reference-type="ref" reference="sec:exp_baseline"} demonstrating the superiority of our loss function.
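Schematically, one training step then combines the pieces above (an illustrative sketch; `loss_cls`, `loss_loc`, `logits_id`, `W_cls`, `beta`, and `optimizer` are assumed to come from the detector's standard training loop):

```python
# One unknown-aware training step, after the regularization kicks in.
means, cov = estimator.estimate()                           # Section 3.1
virtual_outliers = sample_virtual_outliers(means, cov, t=1)
logits_vos = virtual_outliers @ W_cls                       # f(v; theta)
loss = loss_cls + loss_loc + beta * uncertainty_loss(logits_id, logits_vos, phi)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```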
Inference-time OOD Detection {#sec:inference}
During inference, we use the output of the logistic regression uncertainty branch for OOD detection. In particular, given a test input $\mathbf{x}^*$, the object detector produces a bounding box prediction $\mathbf{b}^*$. The OOD uncertainty score for the predicted object $(\mathbf{x}^*, \mathbf{b}^*)$ is given by: $$\begin{aligned}
p_\theta(g \mid \mathbf{x}^*, \mathbf{b}^*) = \frac{\exp^{- \phi(E(\mathbf{x}^*,\mathbf{b}^*))}}{1+\exp^{- \phi(E(\mathbf{x}^*,\mathbf{b}^*))}}.
\label{eq:ood_uncertainty}\end{aligned}$$ For OOD detection, one can exercise the thresholding mechanism to distinguish between ID and OOD objects: $$G(\mathbf{x}^*,\mathbf{b}^*)=\left\{\begin{array}{ll}
1 & \text{if } p_\theta(g \mid \mathbf{x}^*, \mathbf{b}^*)\geq \gamma, \\
0 & \text{if } p_\theta(g \mid \mathbf{x}^*, \mathbf{b}^*) <\gamma.
\end{array}\right.
\label{eq:ood_detection}$$ The threshold $\gamma$ is typically chosen so that a high fraction of ID data (e.g., 95%) is correctly classified. Our framework VOS is summarized in Algorithm [alg:algo]{reference-type="ref" reference="alg:algo"}.
Input: ID data $\mathcal{D}=\{(\mathbf{x}_{i}, \mathbf{b}_{i}, y_{i})\}_{i=1}^{N}$, randomly initialized detector with parameter $\theta$, queue size $|Q_k|$ for Gaussian density estimation, weight for uncertainty regularization $\beta$, and threshold parameter $\epsilon$.
Output: Object detector with parameter $\theta^{*}$, and OOD detector $G$.
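At test time, Equations [eq:ood_uncertainty]{reference-type="ref" reference="eq:ood_uncertainty"} and [eq:ood_detection]{reference-type="ref" reference="eq:ood_detection"} reduce to a sigmoid over the negated learned energy followed by a threshold; a sketch under the same assumed names as above:

```python
import torch

def ood_score(logits: torch.Tensor, phi) -> torch.Tensor:
    """p_theta(g | x, b) = sigmoid(-phi(E)): near 1 for ID objects, near 0 for OOD."""
    energy = -torch.logsumexp(logits, dim=1)
    return torch.sigmoid(-phi(energy.unsqueeze(1))).squeeze(1)

def is_in_distribution(logits: torch.Tensor, phi, gamma: float = 0.5) -> torch.Tensor:
    """G(x, b): True (g = 1) for ID; gamma is chosen so ~95% of ID data is retained."""
    return ood_score(logits, phi) >= gamma
```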
Experimental Results
[[sec:experiment]]{#sec:experiment label="sec:experiment"} In this section, we present empirical evidence to validate the effectiveness of VOS on several real-world tasks, including both object detection (Section 4.1{reference-type="ref" reference="subsec:obj"}) and image classification (Section [subsec:img]{reference-type="ref" reference="subsec:img"}).
Evaluation on Object Detection {#subsec:obj}
Experimental details. We use PASCAL VOC [@DBLP:journals/ijcv/EveringhamGWWZ10] and Berkeley DeepDrive (BDD-100k) [@DBLP:conf/cvpr/YuCWXCLMD20] datasets as the ID training data. For both tasks, we evaluate on two OOD datasets that contain subsets of images from: MS-COCO [@lin2014microsoft] and OpenImages (validation set) [@kuznetsova2020open]. We manually examine the OOD images to ensure they do not contain ID categories. We have open-sourced our benchmark data that allows the community to easily evaluate future methods on object-level OOD detection.
We use the Detectron2 library [@Detectron2018] and train on two backbone architectures: ResNet-50 [@he2016identity] and RegNetX-4.0GF [@DBLP:conf/cvpr/RadosavovicKGHD20]. We employ a two-layer MLP with a ReLU nonlinearity for $\phi$ in Equation [eq:reg_loss]{reference-type="ref" reference="eq:reg_loss"}, with a hidden layer dimension of 512. For each in-distribution class, we use 1,000 samples to estimate the class-conditional Gaussians. Since the threshold $\epsilon$ can be infinitesimally small, we instead choose $\epsilon$ based on the $t$-th smallest likelihood in a pool of 10,000 samples (per-class), generated from the class-conditional Gaussian distribution. A larger $t$ corresponds to a larger threshold $\epsilon$. As shown in Table 2{reference-type="ref" reference="tab:ablation_appendix2_detection"}, a smaller $t$ yields good performance. We set $t=1$ for all our experiments. Extensive details on the datasets are described in Appendix 7{reference-type="ref" reference="sec:dataset"}, along with a comprehensive sensitivity analysis of each hyperparameter (including the queue size $|Q_k|$, coefficient $\beta$, and threshold $\epsilon$) in Appendix 9{reference-type="ref" reference="sec:ablation"}.
Metrics. For evaluating the OOD detection performance, we report: (1) the false positive rate (FPR95) of OOD samples when the true positive rate of ID samples is at 95%; (2) the area under the receiver operating characteristic curve (AUROC). For evaluating the object detection performance on the ID task, we report the common metric of mAP.
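Both metrics can be computed directly from per-object scores (e.g., the score in Equation [eq:ood_uncertainty]{reference-type="ref" reference="eq:ood_uncertainty"}) with standard ROC utilities; a small helper, assuming higher scores indicate ID:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def fpr_at_95_tpr(scores_id: np.ndarray, scores_ood: np.ndarray) -> float:
    """FPR95: OOD false positive rate at the first threshold reaching 95% ID TPR."""
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(fpr[np.argmax(tpr >= 0.95)])

def auroc(scores_id: np.ndarray, scores_ood: np.ndarray) -> float:
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    return float(roc_auc_score(labels, np.concatenate([scores_id, scores_ood])))
```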
[[sec:exp_baseline]]{#sec:exp_baseline label="sec:exp_baseline"}
VOS outperforms existing approaches. In Table [tab:baseline]{reference-type="ref" reference="tab:baseline"}, we compare VOS with competitive OOD detection methods in the literature. For a fair comparison, all methods use only ID data, without any auxiliary outlier dataset. Our proposed method, VOS, outperforms competitive baselines, including Maximum Softmax Probability [@hendrycks2016baseline], ODIN [@liang2018enhancing], energy score [@liu2020energy], Mahalanobis distance [@lee2018simple], Generalized ODIN [@hsu2020generalized], CSI [@tack2020csi], and Gram matrices [@DBLP:conf/icml/SastryO20]. These approaches rely on a classification model trained primarily for the ID classification task, and can be naturally extended to the object detection model due to the existence of a classification head. The comparison precisely highlights the benefits of incorporating synthesized outliers for model regularization.
Closest to our work is the GAN-based approach for synthesizing outliers [@lee2018training]. Compared to GAN-based synthesis, VOS improves the OOD detection performance (FPR95) by 12.76% on BDD-100k and by 13.40% on PASCAL-VOC (COCO as OOD). Moreover, we show in Table [tab:baseline]{reference-type="ref" reference="tab:baseline"} that VOS achieves stronger OOD detection performance while preserving high accuracy on the original in-distribution task (measured by mAP). This is in contrast with CSI, which degrades the ID task, with mAP decreased by 0.7% on BDD-100k. Details of reproducing the baselines are in Appendix 11{reference-type="ref" reference="sec:reproduce_baseline"}.
Ablation on outlier synthesis approaches. We compare VOS with different synthesis approaches in Table [tab:synthesis]{reference-type="ref" reference="tab:synthesis"}. Specifically, we consider three types of synthesis approaches: (i$^\diamond$) synthesizing outliers in the pixel space, (ii$^\natural$) using noise as outliers, and (iii$^\clubsuit$) using negative proposals from the RPN as outliers. For Type I, we consider GAN-based [@lee2018training] and mixup [@DBLP:conf/iclr/ZhangCDL18] methods. The outputs of the classification branch for outliers are forced to be close to a uniform distribution. For mixup, we consider two different beta distributions, $\operatorname{Beta}(0.4)$ and $\operatorname{Beta}(1)$, and interpolate ID objects in the pixel space. For Type II, we use noise perturbation to create virtual outliers. We consider adding fixed Gaussian noise to the ID features, adding trainable noise to the ID features where the noise is trained to push the outliers away from the ID features, and using fixed Gaussian noise as outliers. Lastly, for Type III, we directly use the negative proposals in the ROI head as the outliers for Equation [eq:reg_loss]{reference-type="ref" reference="eq:reg_loss"}, similar to [@DBLP:journals/corr/abs-2103-02603]. We consider three variants: randomly sampling $n$ negative proposals ($n$ is the number of positive proposals), sampling $n$ negative proposals with larger probability, and using all the negative proposals. All methods are trained under the same setup, with PASCAL-VOC as the in-distribution data and ResNet-50 as the backbone. The loss function is the same as Equation [eq:all_loss]{reference-type="ref" reference="eq:all_loss"} for all variants, with the only difference being the synthesis method.
The results are summarized in Table [tab:synthesis]{reference-type="ref" reference="tab:synthesis"}, where VOS outperforms alternative synthesis approaches both in the feature space ($\clubsuit$, $\natural$) and in the pixel space ($\diamond$). Generating outliers in the pixel space ($\diamond$) is either unstable (GAN) or harmful to the object detection performance (mixup). Introducing noise ($\natural$), especially using fixed Gaussian noise as outliers, is promising. However, Gaussian noise outliers are relatively simple, and may not regularize the decision boundary between ID and OOD as effectively as VOS does. Exploiting the negative proposals ($\clubsuit$) is not effective, because they are distributionally close to the ID data.
Ablation on the uncertainty loss. We perform an ablation on several variants of VOS, trained with different uncertainty losses $\mathcal{L}_\text{uncertainty}$. In particular, we consider: (1) using the squared hinge loss for regularization as in @liu2020energy, (2) using a constant weight $\mathbf{w}=[1,1,...,1]^\top$ for the energy score in Equation [eq:energy]{reference-type="ref" reference="eq:energy"}, and (3) classifying the virtual outliers as an additional $(K+1)$-th class in the classification branch. The performance comparison is summarized in Table [tab:ablation_loss]{reference-type="ref" reference="tab:ablation_loss"}. Compared to the hinge loss, our proposed logistic loss reduces the FPR95 by 10.02% on BDD-100k. While the squared hinge loss in @liu2020energy requires tuning margin hyperparameters, our uncertainty loss is completely hyperparameter-free. In addition, we find that a learnable $\mathbf{w}$ for the energy score is more desirable than a constant $\mathbf{w}$, given the inherent class imbalance in object detection datasets. Finally, classifying the virtual outliers as an additional class increases the difficulty of object classification and does not outperform the alternatives. This ablation demonstrates the superiority of the uncertainty loss employed by VOS.
VOS is effective on alternative architectures. Lastly, we demonstrate that VOS is effective on alternative neural network architectures. In particular, using RegNet [@DBLP:conf/cvpr/RadosavovicKGHD20] as the backbone yields both better ID accuracy and better OOD detection performance. We also explore using intermediate layers for outlier synthesis, and show that applying VOS on the penultimate layer is the most effective. This is expected since the feature representations are the most discriminative at deeper layers. We provide details in Appendix 12{reference-type="ref" reference="sec:intermediate"}.
Comparison with training on real outlier data. We also compare with Outlier Exposure [@hendrycks2018deep] (OE). OE serves as a strong baseline since it relies on real outlier data. We train the object detector on PASCAL-VOC using the same ResNet-50 architecture, and use the OE objective for the classification branch. The real outliers for OE training are sampled from the OpenImages dataset [@kuznetsova2020open]. We perform careful deduplication to ensure there is no overlap between the outlier training data and PASCAL-VOC. Our method achieves OOD detection performance on COCO (AUROC: 88.70%) that favorably matches OE (AUROC: 90.18%), and does not require external data.
Evaluation on Image Classification
[[subsec:img]]{#subsec:img label="subsec:img"} [[sec:cls]]{#sec:cls label="sec:cls"}
[[tab:baseline_cls]]{#tab:baseline_cls label="tab:baseline_cls"}
Going beyond object detection, we show that VOS is also suitable and effective on common image classification benchmarks. We use CIFAR-10 [@cifar] as the ID training data, with standard train/val splits. We train on WideResNet-40 [@zagoruyko2016wide] and DenseNet-101 [@huang2017densely], where we substitute the object detection loss in Equation [eq:all_loss]{reference-type="ref" reference="eq:all_loss"} with the cross-entropy loss. We evaluate on six OOD datasets: Textures [@DBLP:conf/cvpr/CimpoiMKMV14], SVHN [@netzer2011reading], Places365 [@DBLP:journals/pami/ZhouLKO018], LSUN-C [@DBLP:journals/corr/YuZSSX15], LSUN-Resize [@DBLP:journals/corr/YuZSSX15], and iSUN [@DBLP:journals/corr/XuEZFKX15]. The comparisons are shown in Table [tab:baseline_cls]{reference-type="ref" reference="tab:baseline_cls"}, with results averaged over the six test datasets. VOS demonstrates competitive OOD detection results on both architectures without sacrificing ID test classification accuracy (94.84% on the pre-trained WideResNet vs. 94.68% using VOS).
Qualitative Analysis
In Figure 3{reference-type="ref" reference="fig:visual"}, we visualize the predictions on several OOD images, using object detection models trained without virtual outliers (top) and with VOS (bottom), respectively. The in-distribution data is BDD-100k. VOS performs better in identifying OOD objects (in green) than the vanilla object detector, and reduces false positives among the detected objects. Moreover, the confidence score of the false-positive objects of VOS is lower than that of the vanilla model (see the truck in the 3rd column). Additional visualizations are in Appendices [sec:app_visual]{reference-type="ref" reference="sec:app_visual"} and 14{reference-type="ref" reference="sec:app_outlier_visual"}.
{#fig:visual width="100%"}
[[fig:visual]]{#fig:visual label="fig:visual"}
Related work
OOD detection for classification can be broadly categorized into post hoc and regularization-based approaches. In @bendale2016towards, the OpenMax score is developed for OOD detection based on the extreme value theory (EVT). Subsequent work [@hendrycks2016baseline] proposed a simple baseline using maximum softmax probability. Improved algorithms have been proposed, such as ensembling [@DBLP:conf/nips/Lakshminarayanan17], ODIN [@liang2018enhancing], energy score [@liu2020energy], Mahalanobis distance [@lee2018simple], Gram matrices based score [@DBLP:conf/icml/SastryO20], and GradNorm score [@huang2021importance]. Very recently, @sun2021react showed that a simple activation rectification strategy termed ReAct can significantly improve test-time OOD detection. Theoretical understandings on different post-hoc detection methods are provided in [@morteza2022provable]. Different from [@lee2018simple], VOS performs dynamic estimation of class-conditional Gaussian during training, which shapes the uncertainty surface over time using our proposed loss.
Another line of approaches explores model regularization using natural outlier images [@hendrycks2018deep; @mohseni2020self; @DBLP:journals/corr/abs-2106-03917] or images synthesized by GANs [@lee2018training]. However, real outlier data is often infeasible to obtain. Instead, VOS automatically synthesizes virtual outliers, which allows greater flexibility and generality. @tack2020csi applied self-supervised learning for OOD detection, which we compare against in Section [sec:experiment]{reference-type="ref" reference="sec:experiment"}. [@DBLP:journals/ijcv/BlumSNSC21; @DBLP:journals/corr/abs-2107-11264; @Besnier_2021_ICCV] proposed to detect outliers for the semantic segmentation task. @DBLP:conf/visapp/GrcicBS21 trained a generative model and synthesized outliers in the pixel space, which cannot be applied to object detection, where a scene consists of both known and unknown objects. Their regularization is based on entropy maximization, which is different from VOS.
OOD detection for object detection is currently underexplored. [@DBLP:journals/corr/abs-2103-02603] used the energy score [@liu2020energy] to identify OOD data and then labeled it for incremental object detection. In contrast, VOS focuses on OOD detection and adopts a new unknown-aware training objective with a new test-time detection score. Our learning framework is generally applicable to both object detectors and classification models. Moreover, [@DBLP:journals/corr/abs-2103-02603] used the negative proposals as unknown samples for model regularization, which is suboptimal as we show in Table [tab:synthesis]{reference-type="ref" reference="tab:synthesis"}. [@DBLP:journals/corr/abs-2101-05036; @DBLP:journals/corr/abs-2107-04517] focused on uncertainty estimation for localization regression, rather than OOD detection for classification problems. Several works [@DBLP:conf/wacv/DhamijaGVB20; @DBLP:conf/icra/MillerDMS19; @DBLP:conf/icra/MillerNDS18; @DBLP:conf/wacv/0003DSZMCCAS20; @DBLP:journals/corr/abs-2108-03614] used approximate Bayesian methods, such as MC-Dropout [@gal2016dropout], for OOD detection. They require multiple inference passes to generate the uncertainty score, which is computationally expensive on larger datasets and models.
Open-world object detection includes out-of-domain generalization [@DBLP:journals/corr/abs-2108-06753; @DBLP:journals/corr/abs-2104-08381], zero-shot object detection [@DBLP:journals/corr/abs-2104-13921; @DBLP:journals/ijcv/RahmanKP20], and incremental object detection [@DBLP:journals/corr/abs-2002-05347; @DBLP:conf/cvpr/Perez-RuaZHX20]. Most of these works either develop measures to mitigate catastrophic forgetting [@DBLP:journals/corr/abs-2003-08798] or use auxiliary information [@DBLP:journals/ijcv/RahmanKP20], such as class attributes, to perform object detection on unseen data, which is different from our focus on OOD detection.
Conclusion
In this paper, we propose VOS, a novel unknown-aware training framework for OOD detection. Different from methods that require real outlier data, VOS adaptively synthesizes outliers during training by sampling virtual outliers from the low-likelihood region of the class-conditional distributions. The synthesized outliers meaningfully improve the decision boundary between the ID data and OOD data, resulting in superior OOD detection performance while preserving the performance of the ID task. VOS is effective and suitable for both object detection and classification tasks. We hope our work will inspire future research on unknown-aware deep learning in real-world settings.
Reproducibility Statement {#reproducibility-statement .unnumbered}
The authors of the paper recognize the importance and value of reproducible research. We summarize our efforts below to facilitate reproducible results:
- Datasets. We use publicly available datasets, which are described in detail in Section [sec:exp_baseline]{reference-type="ref" reference="sec:exp_baseline"}, Section [sec:cls]{reference-type="ref" reference="sec:cls"}, and Appendix 7{reference-type="ref" reference="sec:dataset"}.

- Baselines. The description and hyperparameters of the OOD detection baselines are explained in Appendix 11{reference-type="ref" reference="sec:reproduce_baseline"}.

- Model training. Our model training on object detection is based on the publicly available Detectron2 codebase: https://github.com/facebookresearch/detectron2. Hyperparameters are specified in Section [sec:exp_baseline]{reference-type="ref" reference="sec:exp_baseline"}, with a thorough ablation study provided in Appendix 9{reference-type="ref" reference="sec:ablation"}.

- Methodology. Our method is fully documented in Section 3{reference-type="ref" reference="sec:method"}, with the pseudo algorithm detailed in Algorithm [alg:algo]{reference-type="ref" reference="alg:algo"}.

- Open Source. The codebase and the dataset will be released for reproducible research. Code is available at https://github.com/deeplearning-wisc/vos.
Ethics statement {#ethics-statement .unnumbered}
Our project aims to improve the reliability and safety of modern machine learning models. Our study can lead to direct benefits and societal impacts, particularly for safety-critical applications such as autonomous driving. Our study does not involve any human subjects or violation of legal compliance. We do not anticipate any potentially harmful consequences to our work. Through our study and releasing our code, we hope to raise stronger research and societal awareness towards the problem of out-of-distribution detection in real-world settings.
Acknowledgement {#acknowledgement .unnumbered}
Research is supported by the Wisconsin Alumni Research Foundation (WARF). We sincerely thank Ziyang (Jack) Cai for helping inspect the OOD datasets, and members of Li's lab for valuable discussions.
Supplementary Material
Experimental details {#sec:dataset}
We summarize the OOD detection evaluation tasks in Table 1{reference-type="ref" reference="tab:task"}. The OOD test data is selected from the MS-COCO and OpenImages datasets, with labels disjoint from the respective ID dataset. The PASCAL-VOC model is trained for a total of 18,000 iterations, and the BDD-100k model for 90,000 iterations. We add the uncertainty regularizer (Equation [eq:reg_loss]{reference-type="ref" reference="eq:reg_loss"}) starting from two-thirds of the way through training. The weight $\beta$ is set to $0.1$. See detailed ablations on the hyperparameters in Appendix 9{reference-type="ref" reference="sec:ablation"}.
::: {#tab:task}
|                             | Task 1                  | Task 2                  |
|-----------------------------|-------------------------|-------------------------|
| ID train dataset            | VOC train               | BDD train               |
| ID val dataset              | VOC val                 | BDD val                 |
| OOD dataset                 | COCO and OpenImages val | COCO and OpenImages val |
| # ID train images           | 16,551                  | 69,853                  |
| # ID val images             | 4,952                   | 10,000                  |
| # OOD images for COCO       | 930                     | 1,880                   |
| # OOD images for OpenImages | 1,761                   | 1,761                   |

: OOD detection evaluation tasks.
:::
Software and hardware {#sec:hardware}
We run all experiments with Python 3.8.5 and PyTorch 1.7.0, using NVIDIA GeForce RTX 2080Ti GPUs.
Effect of hyperparameters {#sec:ablation}
Below we perform a sensitivity analysis for each important hyperparameter[^1]. We use ResNet-50 as the backbone, trained on the in-distribution dataset PASCAL-VOC.
Effect of $\epsilon$. Since the threshold $\epsilon$ can be infinitesimally small, we instead choose $\epsilon$ based on the $t$-th smallest likelihood in a pool of 10,000 samples (per-class), generated from the class-conditional Gaussian distribution. A larger $t$ corresponds to a larger threshold $\epsilon$. As shown in Table 2{reference-type="ref" reference="tab:ablation_appendix2_detection"}, a smaller $t$ yields good performance. We set $t=1$ for all our experiments.
::: {#tab:ablation_appendix2_detection} $t$ mAP$\uparrow$ FPR95 $\downarrow$ AUROC$\uparrow$ AUPR$\uparrow$
1 48.7 **54.69** **83.41** **92.56**
2 48.2 57.96 82.31 88.52
3 48.3 62.39 82.20 88.05
4 48.8 69.72 80.86 89.54
5 48.7 57.57 78.66 88.20
6 48.7 74.03 78.06 91.17
8 48.8 60.12 79.53 92.53
10 47.2 76.25 74.33 90.42
: Ablation study on the number of selected outliers $t$ (per class). :::
Effect of queue size $|Q_k|$. We investigate the effect of the ID queue size $|Q_k|$ in Table 3{reference-type="ref" reference="tab:ablation_appendix1_detection"}, where we vary $|Q_k| \in \{50, 100, 200, 400, 600, 800, 1000\}$. Overall, a larger $|Q_k|$ is more beneficial since the estimation of the Gaussian distribution parameters can be more precise. In our experiments, we set the queue size $|Q_k|$ to $1,000$ for PASCAL-VOC and $300$ for BDD-100k. The queue size is smaller for BDD-100k because some classes have a limited number of object boxes.
::: {#tab:ablation_appendix1_detection} $|Q_k|$ mAP$\uparrow$ FPR95 $\downarrow$ AUROC$\uparrow$ AUPR$\uparrow$
50 48.6 68.42 77.04 92.30
100 48.9 59.77 79.96 89.18
200 48.8 57.80 80.20 89.92
400 48.9 66.85 77.68 89.83
600 48.5 57.32 81.99 91.07
800 48.7 **51.43** 82.26 91.80
1000 48.7 54.69 **83.41** **92.56**
: Ablation study on the ID queue size $|Q_k|$. :::
Effect of $\beta$. As shown in Table 4{reference-type="ref" reference="tab:ablation_appendix3_detection"}, a mild value of $\beta$ generally works well. As expected, a large value (e.g., $\beta=0.5$) will over-regularize the model and harm the performance.
::: {#tab:ablation_appendix3_detection} $\beta$ mAP$\uparrow$ FPR95 $\downarrow$ AUROC$\uparrow$ AUPR$\uparrow$
0.01 48.8 59.20 82.64 90.08
0.05 48.9 57.21 83.27 91.00
0.1 48.7 **54.69** **83.41** **92.56**
0.15 48.5 59.32 77.47 89.06
0.5 36.4 99.33 57.46 85.25
: Ablation study on regularization weight $\beta$. :::
Effect of starting iteration for the regularizer. Importantly, we show that uncertainty regularization should be added in the middle of the training. If it is added too early, the feature space is not sufficiently discriminative for Gaussian distribution estimation. See Table 5{reference-type="ref" reference="tab:ablation_appendix4_detection"} for the effect of starting iteration $Z$. We use $Z=12,000$ for the PASCAL-VOC model, which is trained for a total of 18,000 iterations.
::: {#tab:ablation_appendix4_detection} $Z$ mAP$\uparrow$ FPR95 $\downarrow$ AUROC$\uparrow$ AUPR$\uparrow$
2000 48.5 60.01 78.55 87.62
4000 48.4 61.47 79.85 89.41
6000 48.5 59.62 79.97 89.74
8000 48.7 56.85 80.64 90.71
10000 48.6 49.55 83.22 92.49
12000 48.7 54.69 83.41 92.56
14000 49.0 55.39 81.37 93.00
16000 48.9 59.36 82.70 92.62
: Ablation study on the starting iteration $Z$. Model is trained for a total of 18,000 iterations. :::
Additional visualization results
We provide additional visualization of the detected objects on different OOD datasets with models trained on different in-distribution datasets. The results are shown in Figures 4{reference-type="ref" reference="fig:vi1"}-7{reference-type="ref" reference="fig:vi4"}.
[[sec:app_visual]]{#sec:app_visual label="sec:app_visual"}
{#fig:vi1 width="100%"}
{#fig:vi2 width="100%"}
{#fig:vi3 width="100%"}
{#fig:vi4 width="100%"}
Baselines {#sec:reproduce_baseline}
To evaluate the baselines, we follow the original methods in MSP [@hendrycks2016baseline], ODIN [@liang2018enhancing], Generalized ODIN [@hsu2020generalized], Mahalanobis distance [@lee2018simple], CSI [@tack2020csi], energy score [@liu2020energy], and Gram matrices [@DBLP:conf/icml/SastryO20], and apply them accordingly on the classification branch of the object detectors. For ODIN, the temperature is set to $T=1000$ following the original work. For both ODIN and Mahalanobis distance [@lee2018simple], the noise magnitude is set to $0$ because the region-based object detector is not end-to-end differentiable given the existence of region cropping and ROIAlign. For GAN [@lee2018training], we follow the original paper and use a GAN to generate OOD images. The prediction of the OOD images/objects is regularized to be close to a uniform distribution, through a KL divergence loss with a weight of 0.1. We set the shape of the generated images to be 100$\times$100 and resize them to have the same shape as the real images. We optimize the generator and discriminator using Adam [@DBLP:journals/corr/KingmaB14], with a learning rate of 0.001. For CSI [@tack2020csi], we use the rotations (0$^\circ$, 90$^\circ$, 180$^\circ$, 270$^\circ$) as the self-supervision task. We set the temperature in the contrastive loss to 0.5. We use the features right before the classification branch (with dimension 1024) to perform contrastive learning. The weights of the losses used for classifying shifted instances and for instance discrimination are both set to 0.1 to prevent training collapse. For Generalized ODIN [@hsu2020generalized], we replace and train the classification head of the object detector with the Deconf-C head, which was shown to be the most effective in the original paper.
Virtual outlier synthesis using earlier layer {#sec:intermediate}
In this section, we investigate the effect of using VOS on an earlier layer within the network. Our main results in Table [tab:baseline]{reference-type="ref" reference="tab:baseline"} are based on the penultimate layer of the network. Here, we additionally evaluate the performance using the layer before the penultimate layer, with a feature dimension of $1,024$. The results are summarized in Table 6{reference-type="ref" reference="tab:diff_layers"}. As observed, synthesizing virtual outliers in the penultimate layer achieves better OOD detection performance than the earlier layer, since the feature representations are more discriminative at deeper layers.
::: {#tab:diff_layers} Models FPR95$\downarrow$ AUROC$\uparrow$ mAP$\uparrow$
PASCAL VOC
VOS-final 47.53 88.70 48.9
VOS-earlier 50.24 88.24 48.6
BDD-100k
VOS-final 44.27 86.87 31.3
VOS-earlier 49.66 86.08 30.6
: Performance comparison of employing VOS on different layers. COCO is the OOD data. :::
Visualization of the learnable weight coefficient $w$ in generalized energy score {#sec:app_visual_weight}
To examine whether the learnable weight coefficient $w_k$ in Equation [eq:energy]{reference-type="ref" reference="eq:energy"} captures dataset-specific statistics during uncertainty regularization, we visualize $w_k$ for each in-distribution class, together with the number of training objects of that class, in Figure 8{reference-type="ref" reference="fig:visual_energy_weight"}. We use the BDD-100k dataset [@DBLP:conf/cvpr/YuCWXCLMD20] as the in-distribution dataset and RegNetX-4.0GF [@DBLP:conf/cvpr/RadosavovicKGHD20] as the backbone network. As can be observed, the learned weight coefficients display a trend consistent with the number of training objects per class, which indicates the advantage of using learnable weights rather than a constant weight vector of all 1s.
{#fig:visual_energy_weight width="100%"}
Visualization of the virtual outliers {#sec:app_outlier_visual}
In this section, we visualize the virtual outliers synthesized by VOS using UMAP in Figure 9{reference-type="ref" reference="fig:synthesized_outliers"}. The in-distribution dataset is PASCAL-VOC with a ResNet-50 backbone. Note that we cannot visualize the virtual outliers in the pixel space since they are synthesized in the low-dimensional feature space.
{#fig:synthesized_outliers width="90%"}
As shown in Figure 9{reference-type="ref" reference="fig:synthesized_outliers"}, the virtual outliers reside in the near-boundary region of the in-distribution feature clusters, which helps the model learn a compact decision boundary between ID and OOD objects.
Discussion on the detected, rejected and ignored OOD objects during inference {#sec:app_number_objects}
The focus of VOS is to mitigate the undesirable cases where an OOD object is detected and classified as in-distribution with high confidence. In other words, our goal is to ensure that "if a box is detected, it should faithfully be an in-distribution object rather than OOD". Although generating bounding boxes for OOD data is not the focus of this paper, we do notice that VOS increases the number of boxes detected for OOD data (+25% on the BDD-100k-trained model compared to the vanilla Faster R-CNN).
The number of OOD objects ignored by the RPN largely depends on the confidence score threshold and the NMS threshold. Hence, we found it more meaningful to compare against the vanilla Faster R-CNN under the same default thresholds. Using BDD-100k as the in-distribution dataset and ResNet as the backbone, VOS increases the number of detected OOD boxes by 25% compared to the vanilla object detector. VOS also increases the number of rejected OOD samples by 63%.
[^1]: Note that our sensitivity analysis uses the speckle-noised PASCAL-VOC validation dataset as OOD data, which is different from the actual OOD test datasets in use.