Leveraging Sparse Linear Layers for Debuggable Deep Networks

http://arxiv.org/abs/2105.04857v1

Abstract: We show how fitting sparse linear models over learned deep feature representations can lead to more debuggable neural networks. These networks remain highly accurate while also being more amenable to human interpretation, as we demonstrate quantitatively via numerical and human experiments. We further illustrate how the resulting sparse explanations can help to identify spurious correlations, explain misclassifications, and diagnose model biases in vision and language tasks.[^1]


Introduction

As machine learning (ML) models find widespread application, there is a growing demand for interpretability: access to tools that help people understand why a model made a given decision. There are still many obstacles to achieving this goal, however, particularly in the context of deep learning. These obstacles stem from the scale of modern deep networks, as well as the complexity of even defining and assessing the (often context-dependent) desiderata of interpretability.

Existing work on deep network interpretability has largely approached this problem from two perspectives. The first one seeks to uncover the concepts associated with specific neurons in the network, for example through visualization [@yosinski2015understanding] or semantic labeling [@bau2017network]. The second aims to explain model decisions on a per-example basis, using techniques such as local surrogates [@ribeiro2016should] and saliency maps [@simonyan2013deep]. While both families of approaches can improve model understanding at a local level---i.e., for a given example or neuron---recent work has argued that such localized explanations can lead to misleading conclusions about the model's overall decision process [@adebayo2018sanity; @adebayo2020debugging; @leavitt2020towards]. As a result, it is often challenging to flag a model's failure modes or evaluate corrective interventions without in-depth problem-specific studies.

To make progress on this front, we focus on a more actionable intermediate goal of interpretability: model debugging. Specifically, instead of directly aiming for a complete characterization of the model's decision process, our objective is to develop tools that help model designers uncover unexpected model behaviors (semi-)automatically.

Our contributions.

Our approach to model debugging is based on a natural view of a deep network as the composition of a "deep feature extractor" and a linear "decision layer". Embracing this perspective allows us to focus our attention on probing how deep features are (linearly) combined by the decision layer to make predictions. Even with this simplification, probing current deep networks can be intractable given the large number of parameters in their decision layers. To overcome this challenge, we replace the standard (typically dense) decision layer of a deep network with a sparse but comparably accurate counterpart. We find that this simple approach ends up being surprisingly effective for building deep networks that are intrinsically more debuggable. Specifically, for a variety of modern ML settings:

We plan to release the code for our toolkit with the paper.

Debuggability via Sparse Linearity {#sec:methodology}

Recent studies have raised concerns about how deep networks make decisions [@beery2018recognition; @xiao2020noise; @tsipras2020from; @bissoto2020debiasing]. For instance, it was noted that skin-lesion detectors rely on spurious visual artifacts [@bissoto2020debiasing] and comment flagging systems use identity group information to detect toxicity [@borkan2019nuanced]. So far, most of these discoveries were made via in-depth studies by experts. However, as deep learning makes inroads into new fields, there is a strong case to be made for general-purpose model debugging tools.

While simple models (e.g., small decision trees or linear classifiers) can be directly examined, a similar analysis for typical deep networks is infeasible. To tackle this problem, we choose to decompose a deep network into: (1) a deep feature representation and (2) a linear decision layer. Then, we can attempt to gain insight into the model's reasoning process by directly examining the deep features, and the linear coefficients used to aggregate them. At a high level, our hope is that this decomposition will allow us to get the best of both worlds: the predictive power of learned deep features, and the ease of understanding linear models.

That being said, this simplified problem is still intractable for current deep networks, since their decision layers can easily have millions of parameters operating on thousands of deep features. To mitigate this issue, we instead combine the feature representation of a pre-trained network with a sparse linear decision layer (cf. Figure 2{reference-type="ref" reference="fig:decomposition"}). Debugging the resulting sparse decision layer then entails inspecting only the few linear coefficients and deep features that dictate its predictions.
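To make the decomposition concrete, below is a minimal sketch (assuming a torchvision ResNet-50, and not part of our released toolkit): everything up to the global average pool acts as the "deep feature extractor", and the final fully-connected layer is the (replaceable) decision layer.

```python
# Minimal sketch of the feature-extractor / decision-layer decomposition,
# assuming a torchvision ResNet-50 (2048 deep features, 1000 ImageNet classes).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True).eval()  # newer torchvision prefers weights=...

# All layers up to (and including) global average pooling form the feature extractor.
feature_extractor = nn.Sequential(*list(model.children())[:-1])

def deep_features(x: torch.Tensor) -> torch.Tensor:
    """Penultimate-layer representation for a batch of images, shape (batch, 2048)."""
    with torch.no_grad():
        return feature_extractor(x).flatten(1)

# The standard (dense) decision layer is model.fc; a sparse decision layer is just
# another nn.Linear whose weight matrix has few non-zero entries per class.
sparse_fc = nn.Linear(2048, 1000)

def predict(x: torch.Tensor) -> torch.Tensor:
    return sparse_fc(deep_features(x))
```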

Constructing sparse decision layers {#sec:glm_explain}

One possible approach for constructing sparse decision layers is to apply pruning methods from deep learning [@lecunoptimal1990; @han2015learning; @hassibisecond1993; @li2016pruning; @han2016deep; @blalock2020state]---commonly-used to compress deep networks and speed up inference---solely to the dense decision layer. It turns out however that for linear classifiers we can actually do better. In particular, the problem of fitting sparse linear models has been extensively studied in statistics, leading to a suite of methods with theoretical optimality guarantees. These include [LASSO]{.smallcaps} regression [@tibshirani1994regression], least angle regression [@efron2004least], and forward stagewise regression [@hastie2007forward]. In this paper, we leverage the classic elastic net formulation [@zou2005regularization]---a generalization of [LASSO]{.smallcaps} and ridge regression that addresses their corresponding drawbacks (further discussed in Appendix 7{reference-type="ref" reference="app:solver"}).

Illustration of our pipeline: For a given task, we construct a sparse decision layer by training a regularized generalized linear model (via elastic net) on the deep feature representations of a pre-trained deep network. We then aim to debug model behavior by simply inspecting the few relevant deep features (with existing feature interpretation tools), and the linear coefficients used to aggregate them. {#fig:decomposition width="0.9\columnwidth"}


For simplicity, we present an overview of the elastic net for linear regression, and refer the reader to @friedman2010regularization for a more complete presentation of generalized linear models (GLMs) in the classification setting. Let $(X,y)$ be the standardized data matrix (mean zero and variance one) and the output, respectively. In our setting, $X$ corresponds to the (normalized) deep feature representations of input data points, while $y$ is the target. Our goal is to fit a sparse linear model of the form $\mathbb{E}(Y|X=x) = x^T\beta + \beta_0$. The elastic net is then the following convex optimization problem: $$\min_\beta \frac{1}{2N}\|X^T\beta + \beta_0 - y\|^2_2 + \lambda R_\alpha(\beta) \label{eq:elasticnet}$$ where $$R_\alpha(\beta) = (1-\alpha)\frac{1}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1$$ is referred to as the elastic net penalty [@zou2005regularization] for given hyperparameters $\lambda$ and $\alpha$. Typical elastic net solvers optimize ([eq:elasticnet]{reference-type="ref" reference="eq:elasticnet"}) for a range of regularization strengths $\lambda_1 > \dots > \lambda_k$, resulting in a series of linear classifiers with weights $\beta_1, \dots, \beta_k$ known as the regularization path, where $$\beta_i = \mathop{\mathrm{arg\,min}}_\beta \frac{1}{2N}\|X^T\beta - y\|^2_2 + \lambda_i R_\alpha(\beta) \label{eq:path}$$ In particular, a path algorithm for the elastic net computes the regularization path over the entire spectrum of sparsity, from the trivial zero model ($\beta=0$) to a completely dense one. This regularization path can then be used to select a single linear model that satisfies application-specific sparsity or accuracy thresholds (as measured on a validation set). In addition, these paths can be used to visualize the evolution of the weights assigned to specific features as a function of the sparsity constraints on the model, thereby providing further insight into the relative importance of features (cf. Appendix 7.3{reference-type="ref" reference="app:order"}).
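As a small-scale illustration of this formulation (and not the GPU solver we describe next), the snippet below fits a regularization path with scikit-learn's `enet_path` on a synthetic stand-in for the deep feature matrix; note that scikit-learn's `alpha` plays the role of $\lambda$ above and its `l1_ratio` plays the role of $\alpha$.

```python
# Small-scale sketch of the elastic net regularization path (synthetic features).
import numpy as np
from sklearn.linear_model import enet_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))                      # stand-in for deep features
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
X = StandardScaler().fit_transform(X)                 # mean zero, variance one

# lambda_1 > ... > lambda_k, with coefficients beta_1, ..., beta_k along the path.
lambdas, coefs, _ = enet_path(X, y, l1_ratio=0.99, n_alphas=100, eps=1e-3)

# Sparsity (number of non-zero features) as regularization decreases.
nnz = (np.abs(coefs) > 1e-8).sum(axis=0)
for lam, k in zip(lambdas[::25], nnz[::25]):
    print(f"lambda = {lam:.4f}: {k} non-zero features")
```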

Scalable solver for large-scale elastic net.

Although the elastic net is widely used for small-scale GLM problems, existing solvers cannot handle the scale (number of samples and input dimensions) that typically arises in deep learning. In fact, at such scales, state-of-the-art solvers struggle to solve the elastic net even for a single regularization value, and cannot be directly parallelized due to their reliance on coordinate descent [@friedman2010regularization]. We remedy this by creating an optimized GLM solver that combines the path algorithm of @friedman2010regularization with recent advances in variance reduced gradient methods [@gazagnadou2019optimal]. The speedup in our approach comes from the improved convergence rates of these methods over stochastic gradient descent in strongly convex settings such as the elastic net. Using our approach, we can fit ImageNet-scale regularization paths to numerical precision in a matter of hours on a single GPU (cf. Appendix 7.1{reference-type="ref" reference="app:timing"} for details).

LIME-based word cloud visualizations for the highest weighted features in the (dense/sparse) decision layers of BERT models for positive sentiment detection in the SST dataset. As highlighted in red, some of the key features used by the dense decision layer are actually activated for words with negative semantic meaning.{#fig:suite_nlp width="0.95\columnwidth"}



Interpreting deep features {#sec:rep_explain}

A sparse linear model allows us to reason about the network's decisions in terms of a significantly smaller set of deep features. When used in tandem with off-the-shelf feature interpretation methods, the end result is a simplified description of how the network makes predictions. For our study, we utilize the following two widely-used techniques:

  1. LIME [@ribeiro2016should]: Although traditionally used to interpret model outputs, we use it to understand deep features. We fit a local surrogate model around the most activating examples of a deep feature to identify key "superpixels" for images or words for sentences.

  2. Feature visualization [@yosinski2015understanding]: Synthesizes inputs that maximally activate a given neuron.[^4]

We detail the visualization procedure in Appendix 8{reference-type="ref" reference="app:feature_interpretation"}, and present sample visualizations in Figure [fig:suite_full]{reference-type="ref" reference="fig:suite_full"}.
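For reference, a bare-bones version of the feature visualization primitive is sketched below: gradient ascent on the input to maximize one deep feature of a torchvision ResNet-50. Our actual procedure (Appendix 8) adds regularization, so this is illustrative only.

```python
# Bare-bones feature visualization by gradient ascent on the input.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True).eval()
features = nn.Sequential(*list(model.children())[:-1])  # 2048-d deep features

def visualize_feature(unit: int, steps: int = 256, lr: float = 0.05) -> torch.Tensor:
    """Synthesize an input that (approximately) maximizes one deep feature."""
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        activation = features(x).flatten(1)[0, unit]
        (-activation).backward()      # ascend the feature activation
        opt.step()
    return x.detach()

canvas = visualize_feature(unit=123)  # unit index chosen arbitrarily here
```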

Visualization of five randomly-chosen deep features used by dense and sparse decision layers of a robust($\varepsilon=3$) ResNet-50 classifier to detect the ImageNet class "stopwatch", along with their linear coefficients (W), feature visualizations (FV) and LIME superpixels.{#fig:suite width="0.95\columnwidth"}

Are Sparse Decision Layers Better? {#sec:verification}

We now apply our methodology to widely-used deep networks and assess the quality of the resulting sparse decision layers along a number of axes. We demonstrate that:

  1. The standard (henceforth referred to as "dense") linear decision layer can be made highly sparse at only a small cost to performance (Section 3.1{reference-type="ref" reference="sec:sparsity_vs_performance"}).

  2. The deep features selected by sparse decision layers are qualitatively and quantitatively better at summarizing the model's decision process (Section 3.2{reference-type="ref" reference="sec:easier"}). Note that the dense and sparse decision layers operate on the same deep features---they only differ in the weight (if any) they assign to each one.

  3. These aforementioned improvements (induced by the sparse decision layer) translate into better human understanding of the model (Section 3.3{reference-type="ref" reference="sec:human"}).

We perform our analysis on: (a) ResNet-50 classifiers [@he2016deep] trained on ImageNet-1k [@deng2009imagenet; @russakovsky2015imagenet] and Places-10 (a 10-class subset of Places365 [@zhou2017places]); and (b) BERT [@devlin2018bert] for sentiment classification on Stanford Sentiment Treebank (SST) [@socher2013recursive] and toxicity classification of Wikipedia comments [@wulczyn2017ex]. Details about the setup can be found in Appendix 9{reference-type="ref" reference="app:datasets"}.

Sparsity vs. performance {#sec:sparsity_vs_performance}

While a substantial reduction in the weights (and features) of a model's decision layer might make it easier to understand, it also limits the model's overall predictive power (and thus its performance). Still, we find that across datasets and architectures, the decision layer can be made substantially sparser---by up to two orders of magnitude---with a small impact on accuracy (cf. Figure 7{reference-type="ref" reference="fig:sparsity"}). For instance, it is possible to find an accurate decision layer that relies on only about 20 deep features/class for ImageNet (as opposed to 2048 in the dense case). Toxic comment classifiers can be sparsified even further (<10 features/class), with improved generalization over the dense decision layer.

For the rest of our study, we select a single sparse decision layer to balance performance and sparsity---specifically the sparsest model whose accuracy is within $5%$ of top validation set performance (details in Appendix 10.1.1{reference-type="ref" reference="app:single"}). However, as discussed previously, these thresholds can be varied based on the needs of specific applications.
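A hypothetical helper implementing this selection rule is sketched below; `val_accuracies` and `num_nonzero` are assumed to have been computed for each model along the regularization path.

```python
# Sketch: pick the sparsest decision layer within `tolerance` (here in accuracy
# points) of the best validation accuracy on the regularization path.
import numpy as np

def select_decision_layer(val_accuracies, num_nonzero, tolerance=5.0):
    val_accuracies = np.asarray(val_accuracies)
    num_nonzero = np.asarray(num_nonzero)
    admissible = np.where(val_accuracies >= val_accuracies.max() - tolerance)[0]
    return admissible[np.argmin(num_nonzero[admissible])]   # index into the path

# Example: five models on the path, as (accuracy %, non-zero weights per class).
idx = select_decision_layer([60.1, 71.5, 73.8, 74.2, 74.3], [5, 18, 60, 400, 2048])
print(idx)  # -> 1: the 18-feature model is within 5 points of the best accuracy
```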

{#fig:sparsity width="1\columnwidth"}


::: {#tab:ablation}
| Dataset/Model | $k$ | Dense: All | Dense: Top-$k$ | Dense: Rest | Sparse: All | Sparse: Top-$k$ | Sparse: Rest |
|---------------|-----|------------|----------------|-------------|-------------|-----------------|--------------|
| ImageNet (std) | 10 | 74.03 | 58.46 | 55.22 | 72.24 | 69.78 | 10.84 |
| ImageNet (robust) | 10 | 61.23 | 28.99 | 34.65 | 59.99 | 45.82 | 19.83 |
| Places-10 (std) | 10 | 83.30 | 83.60 | 81.20 | 77.40 | 77.40 | 10.00 |
| Places-10 (robust) | 10 | 80.20 | 76.10 | 76.40 | 77.80 | 76.60 | 40.20 |
| SST | 5 | 91.51 | 53.10 | 91.28 | 90.37 | 90.37 | 50.92 |
| Toxic comments | 5 | 83.33 | 55.35 | 57.87 | 82.47 | 82.33 | 50.00 |
| Obscene comments | 5 | 80.41 | 50.03 | 50.00 | 77.32 | 72.39 | 50.00 |
| Insult comments | 5 | 72.72 | 50.00 | 50.00 | 77.14 | 75.80 | 50.00 |

: Accuracy (%) of the dense and sparse decision layers when using all deep features ("All"), only the top-$k$ features per class by weight magnitude ("Top-$k$"), or all features except the top-$k$ ("Rest").
:::


Sparsity and feature highlighting {#sec:easier}

Instead of sparsifying a network's decision layer, one could consider simply focusing on its most prominent deep features for debugging purposes. In fact, this is the basis of feature highlighting or principal reason explanations in the credit industry [@barocas2020hidden]. How effective are such feature highlighting explanations at mirroring the underlying model?

In Table [tab:ablation]{reference-type="ref" reference="tab:ablation"}, we measure the accuracy of the dense/sparse decision layers when they are constrained to utilize only the top-$k$ (5-10) features by weight magnitude. For dense decision layers, we consistently find that the top-$k$ features do not fully capture the model's performance. This is in stark contrast to the sparse case, where the top-$k$ features are both necessary and, to a large extent, sufficient to capture the model's predictive behavior. Note that in the language setting, the top-$k$ features of the dense decision layers perform at near random-chance levels ($\sim$50%). This indicates that there do exist cases where focusing on the most important features (by weight) of a dense decision layer provides a misleading picture of global model behavior.
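The ablation itself is simple to express; the sketch below (with a stand-in weight matrix) keeps either only the top-$k$ weights per class by magnitude, or everything except them, before re-evaluating the decision layer.

```python
# Sketch of the Top-k / Rest ablation on a decision-layer weight matrix.
import torch

def topk_ablation(weight: torch.Tensor, k: int, keep_top: bool = True) -> torch.Tensor:
    """weight: (num_classes, num_features). Zero out all but the k largest-magnitude
    weights per class (keep_top=True), or zero out only those k weights (False)."""
    idx = weight.abs().topk(k, dim=1).indices
    mask = torch.zeros_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
    return weight * mask if keep_top else weight * ~mask

W = torch.randn(1000, 2048)                     # stand-in decision layer
W_top = topk_ablation(W, k=10, keep_top=True)   # "Top-k" columns of the table
W_rest = topk_ablation(W, k=10, keep_top=False) # "Rest" columns
```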

Sparsity and human understanding {#sec:human}

We now visualize the deep features utilized by the dense and sparse decision layers to evaluate how amenable they are to human understanding. We show representative examples from sentiment classification (SST) and ImageNet, and provide additional visualizations in Appendix 10.3{reference-type="ref" reference="app:visualizations"}.

Specifically, in Figure 4{reference-type="ref" reference="fig:suite_nlp"}, we present word cloud interpretations of the top three deep features used by both of these decision layers for detecting positive sentiment on the SST dataset [@socher2013recursive]. It is apparent that the sparse decision layer selects features which activate for words with positive semantic meaning. In contrast, the second most prominent deep feature for the dense decision layer is actually activated by words with negative semantic meaning. This example highlights how the dense decision layer can lead to unexpected features being used for predictions.

In Figure 6{reference-type="ref" reference="fig:suite"}, we present feature interpretations corresponding to the ImageNet class "quill" for both the dense and sparse decision layers of a ResNet-50 classifier. These feature visualizations seem to suggest that the sparse decision layer focuses more on deep features which detect salient class characteristics, such as "feather-like texture" and the "glass bottle" in the background.

Model simulation study

To validate the perceived differences in the vision setting---and ensure they are not due to confirmation biases---we conduct a human study on Amazon Mechanical Turk (MTurk). Our goal is to assess how well annotators are able to intuit (simulate[^5]) overall model behavior when they are exposed to its decision layer. To this end, we show annotators five randomly-chosen features used by the (dense/sparse) decision layer to recognize objects of a target class, along with the corresponding linear coefficients. We then present them with three samples from the validation set and ask them to choose the one that best matches the target class (cf. Appendix Figure 33{reference-type="ref" reference="fig:app_task_sim"} for a sample task). Crucially, annotators are not provided with any information regarding the target class, and must make their prediction based solely on the visualized features.

For both the dense and sparse decision layers, we evaluate how accurate annotators are on average (over 1000 tasks)---based on whether they can correctly identify the image with the highest target class probability according to the corresponding model. For the model with a sparse decision layer, annotators succeed in guessing the predictions in $63.02 \pm 3.02%$ of the cases. In contrast, they are only able to attain $35.61 \pm 3.09%$ accuracy---which is near-chance ($33.33%$)---for the model with a dense decision layer. Crucially, these results hold regardless of whether the correct image is actually from the target class or not (see Appendix Table 5{reference-type="ref" reference="tab:app_mturk_sim"} for a discussion).

Note that our task setup precludes annotators from succeeding based on any prior knowledge or cognitive biases as we do not provide them with any semantic information about the target label, aside from the feature visualizations. Thus, annotators' success on this task in the sparse setting indicates that the sparse decision layer is actually effective at reflecting the model's internal reasoning process.

[[tab:sst_counterfactual_examples]]{#tab:sst_counterfactual_examples label="tab:sst_counterfactual_examples"}

| Toxic sentence | Change in score |
|----------------|-----------------|
| DJ Robinsin is ! he so much! [+christianity] | $0.52 \rightarrow 0.49$ |
| Jeez Ed, you seem like a [+christianity] | $0.52 \rightarrow 0.48$ |
| Hey , quit removing FACTS from the article !! [+christianity] | $0.51 \rightarrow 0.45$ |


Debugging deep networks {#sec:diagnosis}

We now demonstrate how deep networks with sparse decision layers can be substantially easier to debug than their dense counterparts. We focus on three problems: detecting biases, creating counterfactuals, and identifying input patterns responsible for misclassifications.

Biases and (spurious) correlations {#sec:biases}

Our first debugging task is to automatically identify unintended biases or correlations that deep networks extract from their training data.

Toxic comments.

We start by examining two BERT models trained to classify comments according to toxicity: (1) Toxic-BERT, a high-performing model that was later found to use identity groups as evidence for toxicity, and (2) Debiased-BERT, which was trained to mitigate this bias [@borkan2019nuanced].

We find that Toxic-BERT models with sparse decision layers also rely on identity groups to predict comment toxicity (visualizations in Appendix 11.1{reference-type="ref" reference="app:toxic"} are censored). Words related to nationalities, religions, and sexual identities that are not inherently toxic occur frequently and prominently, and comprise 27% of the word clouds shown for features that detect toxicity. Note that although the standard Toxic-BERT model is known to be biased, this bias is not as apparent in the deep features used by its (dense) decision layer (cf. Appendix 11.1{reference-type="ref" reference="app:toxic"}). In fact, measuring the bias in the standard model required collecting identity and demographic-based subgroup labels [@borkan2019nuanced].

We can similarly inspect the word clouds for the Debiased-BERT model with sparse decision layers and corroborate that identity-related words no longer appear as evidence for toxicity. But rather than ignoring these words completely, it turns out that this model uses them as strong evidence against toxicity. For example, identity words comprise 43% of the word clouds of features detecting non-toxicity. This suggests that the debiasing intervention proposed in @borkan2019nuanced may not have had the intended effect---Debiased-BERT is still disproportionately sensitive to identity groups, albeit in the opposite way.

We confirm that this is an issue with Debiased-BERT via a simple experiment: we take toxic sentences that this model (with a sparse decision layer) correctly labels as toxic, and simply append an identity-related word (as suggested by our word clouds) to the end---see Table [tab:sst_counterfactual_examples]{reference-type="ref" reference="tab:sst_counterfactual_examples"}. This modification turns out to strongly impact model predictions: for example, just adding "christianity" to the end of toxic sentences flips the prediction to non-toxic 74.4% of the time. We note that the biases diagnosed via sparse decision layers are also relevant for the standard Debiased-BERT model. In particular, the same toxic sentences with the word "christianity" appended are classified as non-toxic 62.2% of the time by the standard model, even though this sensitivity is not as readily apparent from inspecting its decision layer (cf. Appendix 11.1{reference-type="ref" reference="app:toxic"}).
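The experiment reduces to the following sketch, where `toxicity_score` is a hypothetical stand-in for the classifier's toxicity probability (here, Debiased-BERT with a sparse decision layer).

```python
# Sketch of the identity-word counterfactual check; `toxicity_score` is hypothetical.
def flip_rate(toxic_sentences, identity_word, toxicity_score, threshold=0.5):
    """Fraction of correctly-flagged toxic sentences whose prediction flips to
    non-toxic once an identity-related word is appended."""
    flagged = [s for s in toxic_sentences if toxicity_score(s) > threshold]
    flipped = [s for s in flagged
               if toxicity_score(s + " " + identity_word) <= threshold]
    return len(flipped) / max(len(flagged), 1)

# e.g., flip_rate(sentences, "christianity", toxicity_score) -> ~0.744 as reported above
```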

ImageNet.

We now move to the vision setting, with the goal of detecting spurious feature dependencies in ImageNet classifiers. Once again, our approach is based on the following observation: input-class correlations learned by a model can be described as the data patterns (e.g., "dog ears" or "snow") that activate deep features used to recognize objects of that class, according to the decision layer.

Even so, it is not clear how to identify such patterns for image data, without access to fine-grained annotations describing image content. To this end, we rely on a human-in-the-loop approach (via MTurk). Specifically, for a deep feature of interest---used by the sparse decision layer to detect a target class---annotators are shown examples of images that activate it. Annotators are then asked if these "prototypical" images have a shared visual pattern, and if so, to describe it using free-text.

However, under this setup, presenting annotators with images from the target class alone can be problematic. After all, these images are likely to have multiple visual patterns in common---not all of which cause the deep feature to activate. Thus, to disentangle the pertinent data pattern, we present annotators with prototypical images drawn from more than one class. A sample task is presented in Appendix Figure 41{reference-type="ref" reference="fig:app_task_spurious"}, wherein annotators see three highly-activating images for a specific deep feature from two different classes, along with the respective class labels. Aside from asking annotators to validate (and describe) the presence of a shared pattern between these images, we also ask them whether the pattern (if present) is part of each class object (non-spurious correlation) or its surroundings (spurious correlation)[^6].

We find that annotators are able to identify a significant number of correlations that standard ImageNet classifiers rely on (cf. Table 1{reference-type="ref" reference="tab:tab_mturk_spurious"}). Once again, sparsity seems to aid the detection of such correlations. Aside from having fewer (deep) feature dependencies per class, it turns out that annotators are able to pinpoint the (shared) data patterns that trigger the relevant deep features in 20% more cases for the model with a sparse decision layer. Interestingly, the fraction of detected patterns that annotators deem spurious is lower for the sparse case. In Figure 9{reference-type="ref" reference="fig:spurious_img"}, we present examples of detected correlations with annotator-provided descriptions as word clouds (cf. Appendix 11.2{reference-type="ref" reference="app:imagenet_biases"} for additional examples). A global word cloud visualization of correlations identified by annotators is shown in Appendix Figure 44{reference-type="ref" reference="fig:app_feedback"}.

::: {#tab:tab_mturk_spurious}
| Detected patterns (%) | Dense | Sparse |
|-----------------------|-------------------|-------------------|
| Non-spurious | 18.43 $\pm$ 2.48 | 34.43 $\pm$ 3.38 |
| Spurious | 9.56 $\pm$ 1.76 | 12.49 $\pm$ 2.02 |
| Total | 27.85 $\pm$ 2.70 | 46.97 $\pm$ 3.15 |

: The percentage of class-level correlations identified using our MTurk setup for models with dense and sparse decision layers, along with a breakdown of whether annotators believe the pattern to be "non-spurious" (i.e., part of the object) or "spurious" (i.e., part of the surroundings).
:::

Examples of correlations in ImageNet models detected using our MTurk study. Each row contains prototypical images from a pair of classes, along with the annotator-provided descriptions for the shared deep feature that these images strongly activate. For each class, we also display whether annotators marked the feature as a "spurious correlation". {#fig:spurious_img width="0.95\columnwidth"}


Counterfactuals {#sec:counterfactuals}

A natural way to probe model behavior is by trying to find small input modifications which cause the model to change its prediction. Such modified inputs, which are (a special case of) counterfactuals, can be a useful primitive for pinpointing input features that the model relies on. Aside from debugging, such counterfactuals can also be used to provide users with recourse [@ustun2019actionable] that can guide them to obtaining better outcomes in the future. We now leverage the deep features used by sparse decision layers to inform counterfactual generation.


  ![](figures/glm/wordcloud_positive.pdf){#fig:wordclouds width="0.8\\columnwidth"}
  ![](figures/glm/wordcloud_negative.pdf){#fig:wordclouds width="0.8\\columnwidth"}

[[tab:sentiment_counterfactuals]]{#tab:sentiment_counterfactuals label="tab:sentiment_counterfactuals"}

| Original sentence | Counterfactual | Change in score |
|-------------------|----------------|-----------------|
| ...something likable about the marquis... | ...something irritating about the marquis... | $0.73 \rightarrow 0.34$ |
| Slick piece of cross-promotion | Hype piece of cross-promotion | $0.73 \rightarrow 0.34$ |
| A marvel like none you've seen | A failure like none you've seen | $0.73 \rightarrow 0.31$ |

Sentiment classifiers.

Our goal here is to automatically identify word substitutions that can be made within a given sentence to flip the sentiment label assigned by the model. We do this as follows: given a sentence with a positive sentiment prediction, we first identify the set of deep features used by the sparse decision layer that are positively activated for any word in the sentence. For a randomly chosen deep feature from this pool, we then substitute the positive word from the sentence with its negative counterpart. This substitute word is in turn randomly chosen from the set of words that negatively activate the same deep feature (based on its word cloud). An example of the positive and negative word clouds for one such deep feature is shown in Figure 11{reference-type="ref" reference="fig:wordclouds"}, and the resulting counterfactuals are in Table [tab:sentiment_counterfactuals]{reference-type="ref" reference="tab:sentiment_counterfactuals"} (cf. Appendix 12{reference-type="ref" reference="app:sentiment_counterfactuals"} for details).

Counterfactuals generated in this manner successfully flip the sentiment label assigned by the sparse decision layer $73.1\pm 3.0%$ of the time. In contrast, such counterfactuals only have $52.2\pm 4%$ efficacy for the dense decision layer. This highlights that for models with sparse decision layers, it can be easier to automatically identify deep features that are causally-linked to model predictions.
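A hypothetical sketch of the substitution procedure described above is given below; `feature_wordclouds` stands in for per-feature positive/negative word lists extracted from the sparse decision layer's LIME word clouds.

```python
# Hypothetical sketch of word-substitution counterfactuals from sparse-layer word clouds.
import random

def make_counterfactual(sentence, feature_wordclouds, rng=random.Random(0)):
    """feature_wordclouds: {feature_id: (positive_words, negative_words)}, assumed
    precomputed from word clouds of the sparse decision layer's deep features."""
    words = sentence.split()
    # Words in the sentence that positively activate some sparse-layer feature.
    candidates = [(i, neg) for _, (pos, neg) in feature_wordclouds.items()
                  for i, w in enumerate(words) if w.lower() in pos]
    if not candidates:
        return sentence
    i, neg = rng.choice(candidates)
    words[i] = rng.choice(sorted(neg))   # swap in a negatively-activating word
    return " ".join(words)

clouds = {17: ({"likable", "marvel"}, {"irritating", "failure"})}
print(make_counterfactual("something likable about the marquis", clouds))
```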

ImageNet.

We now leverage the annotations collected in Section 4.1{reference-type="ref" reference="sec:biases"} to generate counterfactuals for ImageNet classifiers. Concretely, we manually modify images to add or subtract input patterns identified by annotators and verify that they successfully flip the model's prediction. Some representative examples are shown in Figure 13{reference-type="ref" reference="fig:counterfactuals_img"}. Here, we alter images from various ImageNet classes to contain the patterns "chainlink fence" and "water", so as to fool the sparse decision layer into recognizing them as "ballplayers" and "snorkels" respectively. We find that we are able to consistently change the prediction of the sparse decision layer (and in some cases its dense counterpart) by adding a pattern that was previously identified (cf. Section 4.1{reference-type="ref" reference="sec:biases"}) to be a spurious correlation.

Misclassifications {#sec:errors}

Our final avenue for diagnosing unintended behaviors in models is through their misclassifications. Concretely, given an image for which the model makes an incorrect prediction (i.e., not the ground truth label as per the dataset), our goal is to pinpoint some aspects of the image that led to this error.

In the ImageNet setting, it turns out that over 30% of misclassifications made by the sparse decision layer can be attributed to a single deep feature---i.e., manually setting this "problematic" feature to zero fixes the erroneous prediction. For these instances, can humans understand why the problematic feature was triggered in the first place? Specifically, can they recognize the pattern in the input that caused the error?
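The attribution behind this statistic amounts to a per-feature ablation over the (few) features that the predicted class relies on; a sketch, assuming access to the decision-layer weights and the image's deep features, is shown below.

```python
# Sketch: find a single "problematic" deep feature whose removal fixes a misclassification.
import torch

def single_feature_fix(features, weight, bias, label):
    """features: (d,) deep features of a misclassified image; weight: (C, d), bias: (C,).
    Returns the index of a feature whose ablation corrects the prediction, or None."""
    base_pred = (weight @ features + bias).argmax().item()
    assert base_pred != label, "expected a misclassified example"
    for j in torch.nonzero(weight[base_pred]).flatten().tolist():
        ablated = features.clone()
        ablated[j] = 0.0
        if (weight @ ablated + bias).argmax().item() == label:
            return j
    return None
```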

Counterfactual images for ImageNet. We manually modify samples (top row) to contain the patterns "chainlink fence" and "water", which annotators deem (cf. Section 4.1{reference-type="ref" reference="sec:biases"}) to be spuriously correlated with the classes "ballplayer" and "snorkel" respectively. We find that these counterfactuals (bottom row) succeed in flipping the prediction of the model with a sparse decision layer to the desired class. {#fig:counterfactuals_img width="0.95\columnwidth"}

 {#fig:counterfactuals_img width="0.95\columnwidth"}

{#fig:misclassification width="0.8\columnwidth"}

To test this, we present annotators on MTurk with misclassified images. Without divulging the ground truth or predicted labels, we show annotators the top activated feature for each of the two classes via feature visualizations. We then ask annotators to select the patterns (i.e., feature visualizations) that match the image, and to choose the one that is a better match for the image (cf. Appendix 13.1{reference-type="ref" reference="app:mturk_mis"} for details). As a control, we repeat the same task but replace the problematic feature with a randomly-chosen one.

For about 70% of the misclassified images, annotators select the top feature for the predicted class as being present in the image (cf. Table 2{reference-type="ref" reference="tab:tab_mturk_mis"}). In fact, annotators consider it a better match than the feature for the ground truth class 60% of the time. In contrast, they rarely select randomly-chosen features to be present in the image. Since annotators do not know what the underlying classes are, the high fraction of selections for the problematic feature indicates that annotators actually believe this pattern is present in the image.

We present sample misclassifications validated by annotators in Figure 15{reference-type="ref" reference="fig:misclassification"}, along with the problematic features that led to them. Having access to this information can guide improvements in both models and datasets. For instance, model designers might consider augmenting the training data with examples of "maracas" without "red tips" to correct the second error in Figure 15{reference-type="ref" reference="fig:misclassification"}. In Appendix 13.3{reference-type="ref" reference="app:confusion"}, we further discuss how sparse decision layers can provide insight into inter-class model confusion matrices.

::: {#tab:tab_mturk_mis}
| Features | Matches image | Best match |
|------------|---------------------|---------------------|
| Prediction | 70.70% $\pm$ 3.62% | 60.12% $\pm$ 3.77% |
| Random | 16.63% $\pm$ 2.91% | 10.58% $\pm$ 2.35% |

: Fraction of misclassified images for which annotators select the top feature of the predicted class to: (i) match the given image and (ii) be a better match than the top feature for the ground truth class. As a baseline, we also evaluate annotator selections when the top feature for the predicted class is replaced by a randomly-chosen one.
:::

Examples of misclassified ImageNet images for which annotators deem the top activated feature for the predicted class (rightmost) as a better match than the top activated feature for the ground truth class (middle).{#fig:misclassification width="0.72\columnwidth"}

Related Work

We now discuss prior work in interpretability and generalized linear models. Due to the large body of work in both fields, we limit the discussion to closely-related studies.

Interpretability tools.

There have been extensive efforts towards post-hoc interpretability tools for deep networks. Feature attribution methods provide insight into model predictions for a specific input instance. These include saliency maps [@simonyan2013deep; @smilkov2017smoothgrad; @sundararajan2017axiomatic], surrogate models to interpret local decision boundaries [@ribeiro2016should], and finding influential [@koh2017understanding], prototypical [@kim2016examples], or counterfactual inputs [@goyal2019counterfactual]. However, as noted by various recent studies, these local attributions can be easy to fool [@ghorbani2019interpretation; @slack2020fooling] or may otherwise fail to capture global aspects of model behavior [@sundararajan2017axiomatic; @adebayo2018sanity; @adebayo2020debugging; @leavitt2020towards]. Several methods have been proposed to interpret hidden units within vision networks, for example by generating feature visualizations [@erhan2009visualizing; @yosinski2015understanding; @nguyen2016synthesizing; @olah2017feature] or assigning semantic concepts to them [@bau2017network; @bau2020understanding]. Our work is complementary to these methods as we use them as primitives to probe sparse decision layers. Another related line of work is that on concept-based explanations, which seeks to explain the behavior of deep networks in terms of high-level concepts [@kim2018interpretability; @ghorbani2019towards; @yeh2020completeness]. One of the drawbacks of these methods is that the detected concepts need not be causally linked to the model's predictions [@goyal2019explaining]. In contrast, in our approach, the identified high-level concepts, i.e., the deep features used by the sparse decision layer, entirely determine the model's behavior.

Most similar is the recent work of @wan2020nbdt, which proposes fitting a decision tree on a deep feature representation. Network decisions are then explained in terms of semantic descriptions for nodes along the decision path. However, @wan2020nbdt rely on heuristics for fitting and labeling the decision tree that require an existing domain-specific hierarchy (e.g., WordNet), making their approach more involved and limited in its applicability than ours.

Regularized GLMs and gradient methods.

Estimating GLMs with convex penalties has been studied extensively. Algorithms for efficiently computing regularization paths include least angle regression for LASSO [@efron2004least] and path following algorithms [@park2007l1] for $\ell_1$ regularized GLMs. The widely-used R package glmnet by @friedman2010regularization provides an efficient coordinate descent-based solver for GLMs with elastic net regularization, and attains state-of-the-art solving times on CPU-based hardware. Unlike our approach, this library is best suited for problems with few examples or features, and is not directly amenable to GPU acceleration. Our solver also builds off a long line of work in variance reduced proximal gradient methods [@johnson2013accelerating; @defazio2014saga; @gazagnadou2019optimal], which have stronger theoretical convergence rates when compared to stochastic gradient descent.

Conclusion

We demonstrate how fitting sparse linear models over deep representations can result in more debuggable models, and provide a diverse set of scenarios showcasing the usage of this technique in practice. The simplicity of our approach allows it to be broadly applicable to any deep network with a final linear layer, and may find uses beyond the language and vision settings considered in this paper.

Furthermore, we have created a number of human experiments for tasks such as testing model simulatability, detecting spurious correlations, and validating misclassifications. Although primarily used in the context of evaluating the sparse decision layer, the design of these experiments may be of independent interest.

Finally, we recognize that while deep networks are popular within machine learning and artificial intelligence settings, linear models continue to be widely used in other scientific fields. We hope that the development and release of our elastic net solver will enable the broader scientific community to fit large-scale sparse linear models in contexts beyond deep learning.

Acknowledgements {#acknowledgements .unnumbered}

We thank Dimitris Tsipras for helpful discussions.

Work supported in part by the Google PhD Fellowship, Open Philanthropy, and NSF grants CCF-1553428 and CNS-1815221. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0015. Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

SAGA-based solver for generalized linear models {#app:solver}

In this section, we describe in further detail our solver for learning regularized GLMs in relation to existing work. Note that many of the components underlying our solver have been separately studied in prior work. However, we are the first to effectively combine them in a way that allows for GPU-accelerated fitting of GLMs at ImageNet-scale. The key algorithmic primitives we leverage to this end are variance reduced optimization methods and path algorithms for GLMs.

Specifically, our solver uses a mini-batch derivative of the SAGA algorithm [@gazagnadou2019optimal], which belongs to the class of variance reduced proximal gradient methods. These approaches have several benefits: a) they are easily parallelizable on GPUs, b) they enjoy faster convergence rates than stochastic gradient methods, and c) they require minimal tuning and can converge with a fixed learning rate.

Algorithm [alg:solver]{reference-type="ref" reference="alg:solver"} provides a step-by-step description of our solver. Here, the proximal operator for elastic net regularization is $$\textrm{Prox}_{\lambda_1, \lambda_2}(\beta) = \begin{cases} \frac{\beta - \lambda_1}{1+\lambda_2} &\text{if } \beta > \lambda_1 \\ \frac{\beta + \lambda_1}{1+\lambda_2} &\text{if } \beta < -\lambda_1 \\ 0 &\text{otherwise} \end{cases}$$
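In code, this operator is elementwise soft-thresholding followed by shrinkage; a direct NumPy transcription (with $\lambda_1$, $\lambda_2$ written as `lam1`, `lam2`) is:

```python
# Elementwise elastic net proximal operator (soft-threshold, then shrink).
import numpy as np

def prox_elastic_net(beta, lam1, lam2):
    return np.sign(beta) * np.maximum(np.abs(beta) - lam1, 0.0) / (1.0 + lam2)

print(prox_elastic_net(np.array([2.0, 0.3, -1.5]), lam1=0.5, lam2=0.1))
# -> approximately [ 1.364  0.    -0.909]
```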

Table for storing gradients

Note that the SAGA algorithm requires saving the gradients of the model for each individual example. For ImageNet-sized problems, this requires a prohibitive amount of memory, as both the number of examples ($>$1 million) and the size of the gradient (of the linear model) are large.

It turns out that for linear models with $k$ outputs, it is actually possible to store all of the necessary gradient information for a single example in a vector of size $k$---as demonstrated by @defazio2014saga. The key idea behind this approach is that rather than storing the full gradient step $(x_i^T\beta + \beta_0 - y_i)x_i$, we can instead just store the scalar $a_i = (x_i^T\beta + \beta_0 - y_i)$ per output (i.e., a vector of length $k$ in the case of multiple outputs). Thus, for a dataset with $n$ examples, this reduces the memory requirements of the gradient table to $O(nk)$. For ImageNet, we find that the entire table easily fits within GPU memory limits.

There is one caveat here: in order to use this memory trick, it is necessary to incorporate the $\ell_2$ regularization from the elastic net into the proximal operator. This is precisely why we use the proximal operator of the elastic net, rather than of the $\ell_1$ regularization. Unfortunately, this means that the smooth part of the objective (i.e. the part not used in the proximal operator) is no longer guaranteed to be strongly convex, and so the theoretical analysis of @gazagnadou2019optimal no longer strictly applies. Nonetheless, we find that these variance reduced methods can still provide strong practical convergence rates in this setting without requiring much tuning of batch sizes or learning rates.

[[alg:solver]]{#alg:solver label="alg:solver"}

1.  Initialize the table of scalars $a_i' = 0$ for $i \in [n]$, and the average gradients $g_{avg}=0$ and $g_{0avg}=0$.
2.  For each sampled mini-batch $B$ (until the stopping criterion is met):
    -   $a_i = x_i^T\beta + \beta_0 - y_i$ for $i \in B$
    -   $g_i = a_i \cdot x_i$ (new gradient information) and $g_i' = a_i' \cdot x_i$ (stored gradient information)
    -   $g = \frac{1}{|B|}\sum_{i \in B} g_i$ and $g' = \frac{1}{|B|}\sum_{i \in B} g_i'$
    -   $g_0 = \frac{1}{|B|}\sum_{i \in B} a_i$ and $g_0' = \frac{1}{|B|}\sum_{i \in B} a_i'$
    -   $\beta = \beta - \gamma(g - g' + g_{avg})$ and $\beta_0 = \beta_0 - \gamma(g_0 - g_0' + g_{0avg})$
    -   $\beta = \textrm{Prox}_{\gamma\lambda\alpha,\, \gamma\lambda(1-\alpha)}(\beta)$
    -   $a_i' = a_i$ for $i \in B$ (update table)
    -   $g_{avg} = g_{avg} + \frac{|B|}{n}(g - g')$ and $g_{0avg} = g_{0avg} + \frac{|B|}{n}(g_0 - g_0')$ (update averages)
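For concreteness, the following is a condensed NumPy sketch of the mini-batch update above for a single-output linear model; the released solver is GPU-based and additionally handles multiple outputs, GLM link functions, and the stopping criteria described next.

```python
# Condensed NumPy sketch of mini-batch SAGA with an elastic net proximal step.
import numpy as np

def prox(beta, lam1, lam2):
    return np.sign(beta) * np.maximum(np.abs(beta) - lam1, 0.0) / (1.0 + lam2)

def saga_elastic_net(X, y, lam, alpha, lr=0.1, batch=64, epochs=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    beta, beta0 = np.zeros(d), 0.0
    a = np.zeros(n)                              # gradient table: one scalar per example
    g_avg, g0_avg = np.zeros(d), 0.0
    for _ in range(epochs):
        for _ in range(n // batch):
            B = rng.choice(n, size=batch, replace=False)
            a_new = X[B] @ beta + beta0 - y[B]   # residuals a_i on the batch
            g = X[B].T @ a_new / batch           # new gradient estimate
            g_old = X[B].T @ a[B] / batch        # stored gradient estimate
            g0, g0_old = a_new.mean(), a[B].mean()
            beta = beta - lr * (g - g_old + g_avg)       # variance-reduced step
            beta0 = beta0 - lr * (g0 - g0_old + g0_avg)
            beta = prox(beta, lr * lam * alpha, lr * lam * (1 - alpha))
            g_avg += (batch / n) * (g - g_old)           # update running averages
            g0_avg += (batch / n) * (g0 - g0_old)
            a[B] = a_new                                  # update gradient table
    return beta, beta0
```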

Stopping criterion

We implement two simple stopping criteria, both of which take in a tolerance level $\varepsilon_\text{tol}$. The first is a gradient-based stopping criterion, which terminates when $$\sqrt{\|\beta^{i+1} - \beta^i\|_2^2 + \|\beta_0^{i+1} - \beta_0^{i}\|_2^2} \leq \varepsilon_\text{tol}$$ Intuitively, this stops when the change in the estimated coefficients is small. Our second stopping criterion is more conservative and uses a longer search horizon: it stops when the training loss has not improved by more than $\varepsilon_\text{tol}$ for more than $T$ epochs, for some $T$; we call this the lookbehind stopping criterion.

In practice, we find that the gradient-based stopping criterion with $\varepsilon_\text{tol}=10^{-4}$ is sufficient for most cases (i.e., the solver has converged sufficiently that the number of non-zero entries will no longer change). For significantly larger problems such as ImageNet, where individual batches can have much larger variability in their progress on the training objective, we find that the lookbehind stopping criterion with $\varepsilon_\text{tol}=10^{-4}$ and $T=5$ is sufficient.
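Both rules are straightforward to implement; a sketch is given below, with `train_losses` assumed to be recorded once per epoch.

```python
# Sketch of the two stopping rules used by the solver.
import numpy as np

def gradient_based_stop(beta_new, beta_old, b0_new, b0_old, tol=1e-4):
    """Stop when the change in the estimated coefficients is small."""
    return np.sqrt(np.sum((beta_new - beta_old) ** 2) + (b0_new - b0_old) ** 2) <= tol

def lookbehind_stop(train_losses, tol=1e-4, T=5):
    """Stop when the loss has not improved by more than tol over the last T epochs."""
    if len(train_losses) <= T:
        return False
    return min(train_losses[-T:]) > min(train_losses[:-T]) - tol
```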

Relation of the solver to existing work

We now discuss how our solver borrows and differs from existing work. First, note that the original SAGA algorithm [@defazio2014saga] analyzes the regularized form but updates its gradient estimate with one sample at a time, which is not amenable to GPU parallelism. On the other hand, @gazagnadou2019optimal analyze a minibatch variant of SAGA but without regularization. In our solver, we use a straightforward adaptation of minibatch SAGA to its regularized equivalent by including a proximal step for the elastic net regularization after the gradient step.

To compute the regularization paths, we closely follow the framework of @friedman2010regularization. Specifically, we compute solutions for a decreasing sequence of regularization strengths, using the solution at the previous strength as a warm start for the next. The maximum regularization value, which fits only the bias term, is calculated as the fixed point of the coordinate descent iteration, $$\lambda_{max} = \max_j \frac{1}{N\alpha} \left|\sum_{i=1}^n x_{ij}y_i\right|$$ and scheduled down to $\lambda_{min} = \varepsilon\lambda_{max}$ over a sequence of $K$ values on a log scale, as done by @friedman2010regularization. Typical suggested values are $K=100$ and $\varepsilon=0.001$, which are what we use in all of our experiments. For extensions to logistic and multinomial regression, we refer the reader to @friedman2010regularization, and note that our approach is the same but substitutes our SAGA-based solver in lieu of the coordinate descent-based solver.
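A small sketch of this schedule for the regression case (with $X$ standardized as before) is:

```python
# Sketch: lambda_max from the data, then K log-spaced values down to eps * lambda_max.
import numpy as np

def lambda_schedule(X, y, alpha, K=100, eps=1e-3):
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    return np.logspace(np.log10(lam_max), np.log10(eps * lam_max), K)
```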

Timing Experiments {#app:timing}

In this section, we discuss how the runtime of our solver scales with the problem size. To be able to compare our solver with existing approaches, the experiments performed here are at a smaller scale than those in the main body of the paper.

Problem setting & hyperparameters.

The problem we examine is that of fitting a linear decision layer for the CIFAR-10 dataset using the deep feature representation of an ImageNet-trained ResNet-50 (2048-dimensional features). We then vary the number of training examples (from 1k to 50k) and fit an elastic net regularized GLM using various methods. We compare glmnet (the state-of-the-art coordinate descent-based solver), running on a 9th generation Intel Core i7 with 6 cores clocked at 2.6 GHz, against our approach, glm-saga, running on a GeForce GTX 1080 Ti. We note that in these small-scale experiments, the graphics card remains at around 10-20% utilization, indicating that the problem size is too small to fully utilize the GPU.

We fix $\alpha=0.99$, $\varepsilon= 10^{-4}$, set aside 10% of the training data for validation, and calculate regularization paths for $k=100$ different values, which are the defaults for glmnet. For our approach, we additionally use a mini-batch size of 512, a learning rate of 0.1, and a tolerance level of $10^{-4}$ for the gradient-based stopping criteria.

Improvements in scalability

As expected, on smaller problem instances with a couple thousand examples, glmnet is faster than our solver---cf. Table 3{reference-type="ref" reference="tab:app_timing"}. This is largely due to the higher base running time of our solver---a consequence of gradient-based methods requiring some time to converge. However, as the problem size grows, the runtime of glmnet increases rapidly, exceeding that of glm-saga at 3,000 datapoints. For example, glmnet takes almost 40 minutes to fit 4,000 datapoints, a 20x increase in running time for 4x the data relative to 1,000 datapoints. In contrast, our solver needs only 19 minutes to fit 4,000 datapoints, a 2x increase in running time for 4x the data. Consequently, while glmnet takes a considerable amount of time to fit the full CIFAR-10 problem (50,000 datapoints)---nearly 13 hours---our solver can do the same in only 33 minutes. Notably, our solver can fit the regularization paths of the decision layer for the full ImageNet dataset (1 million examples with 2048 features) in approximately 6 hours.

::: {#tab:app_timing}
| Solver | 1k | 2k | 3k | 4k | 5k | 50k |
|----------|-----|-----|-----|-----|-----|------|
| glmnet | 2 | 7 | 25 | 39 | 58 | 776 |
| glm-saga | 9 | 13 | 17 | 19 | 22 | 33 |

: Runtime in minutes for glmnet and glm-saga for fitting a sparse decision layer on the CIFAR-10 dataset using deep representations (2048-dimensional) of a pre-trained ResNet-50, as the number of training examples varies from 1k to 50k.
:::

Backpropagation libraries

Another alternative for fitting linear models at scale is to use a standard autodifferentiation library such as PyTorch or TensorFlow. However, the optimizers typically used in these libraries do not handle non-smooth regularizers well (i.e., the $\ell_1$ penalty of the elastic net). In practice, these approaches must gradually schedule learning rates down to zero in order to converge, and take too long to compute regularization paths. For example, the fixed-feature transfer experiments from @salman2020adversarially take approximately 4 hours to fit the same CIFAR-10 timing experiment for a single regularization value. In contrast, the SAGA-based optimizer converges rapidly over a flexible range of learning rates, without needing to tune or decay the learning rate over time.

Elastic net, $\ell_1$, and $\ell_2$ regularization

The elastic net is known to combine the benefits of both $\ell_1$ and $\ell_2$ regularization for linear models. The $\ell_1$ regularization, often seen in the LASSO, primarily provides sparsity in the solution. The $\ell_2$ regularization, often seen as ridge regression, brings improved performance, a unique solution via strong convexity, and a grouping effect of similar neurons. Due to this last property of $\ell_2$ regularization, highly correlated features will become non-zero at the same time over the regularization path. The elastic net combines all of these strengths, and we refer the reader to @tibshirani2017sparsity for further discussion on the interaction between elastic net, $\ell_1$, and $\ell_2$.

Feature ordering {#app:order}

In the main body of the paper, we utilized regularization paths obtained via the elastic net to obtain a sparse decision layer over deep features. We now discuss an additional use case of regularization paths---as a means to assess the relative importance of (deep) features within the decision layer of a standard deep network. Such an ordering could, for instance, provide an alternative criterion for feature selection in "feature-highlighting" explanations [@barocas2020hidden].

The underlying mechanism that allows us to do this is the $\ell_1$ regularization in the elastic net, which imposes sparsity properties on the coefficients of the resulting linear model [@tibshirani1994regression]. Specifically, the coefficients for each feature become non-zero at discrete points in the regularization path, as $\lambda$ tends to zero. Informally, one can view features that are assigned non-zero coefficients earlier as being more useful from an accuracy standpoint, given the sparsity regularization.

Consequently, the order in which (deep) features are incorporated into the sparse decision layers, within the regularization path, may shed light on their relative utility within the standard deep network. In Figures 19{reference-type="ref" reference="fig:app_order_std_in"}- 31{reference-type="ref" reference="fig:app_order_rob_places"}, we illustrate regularization paths along with the derived feature ordering for standard and robust ResNet-50 classifiers trained on ImageNet and Places-10 datasets. For all the models, it appears that features that are incorporated earlier into the regularization path (for a class) are actually more semantically aligned with the corresponding object category.
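Concretely, the ordering can be read off a computed path by recording, for each deep feature, the first regularization index at which its coefficient becomes non-zero; a sketch:

```python
# Sketch: order deep features by when they enter the regularization path.
import numpy as np

def feature_order(coefs, tol=1e-8):
    """coefs: (n_features, n_lambdas) coefficients along the path, ordered from the
    most to the least regularized model. Returns feature indices, earliest first."""
    nonzero = np.abs(coefs) > tol
    entry = np.where(nonzero.any(axis=1), nonzero.argmax(axis=1), coefs.shape[1])
    return np.argsort(entry)
```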

Sample regularization paths (left) and feature ordering (right) for sparse decision layers trained on deep features of a ResNet-50 classifier for two ImageNet classes. Regularization paths highlight when different deep features are incorporated into the decision layer as the sparsity regularization is reduced. Sample features (as feature visualizations and LIME superpixels) included into the decision layer at increasing regularization indices (T) are shown on the right.{#fig:app_order_std_in width="1\columnwidth"}


Sample regularization paths (left) and feature ordering (right) for sparse decision layers trained on deep features of a robust ResNet-50 classifier for two ImageNet classes. Regularization paths highlight when different deep features are incorporated into the decision layer as the sparsity regularization is reduced. Sample features (as feature visualizations and LIME superpixels) included into the decision layer at increasing regularization indices (T) are shown on the right.{#fig:app_order_rob_in width="1\columnwidth"}


Sample regularization paths (left) and feature ordering (right) for sparse decision layers trained on deep features of a ResNet-50 classifier for two Places-10 classes. Regularization paths highlight when different deep features are incorporated into the decision layer as the sparsity regularization is reduced. Sample features (as feature visualizations and LIME superpixels) included into the decision layer at increasing regularization indices (T) are shown on the right.{#fig:app_order_std_places width="1\columnwidth"}


Sample regularization paths (left) and feature ordering (right) for sparse decision layers trained on deep features of a robust ResNet-50 classifier for two Places-10 classes. Regularization paths highlight when different deep features are incorporated into the decision layer as the sparsity regularization is reduced. Sample features (as feature visualizations and LIME superpixels) included into the decision layer at increasing regularization indices (T) are shown on the right.{#fig:app_order_rob_places width="1\columnwidth"}


Feature interpretations {#app:feature_interpretation}

We now discuss in depth our procedure for generating feature interpretations for deep features in the vision and language settings.

Feature visualization

Feature visualization is a popular approach to interpret individual neurons within a deep network. Here, the objective is to synthesize inputs (via optimization in pixel space) that highly activate the neuron of interest. Unfortunately, for standard networks trained via empirical risk minimization, it is well-known that vanilla feature visualization---using just gradient descent in input space---fails to produce semantically-meaningful interpretations. In fact, these visualizations frequently suffer from artifacts and high frequency patterns [@olah2017feature]. One cause for this could be the reliance of standard models on input features that are imperceptible or unintuitive, as has been noted in recent studies [@ilyas2019adversarial].

To mitigate this challenge, there has been a long line of work on defining modified optimization objectives that produce more meaningful feature visualizations [@olah2017feature]. In this work, we use the Tensorflow-based Lucid library to produce feature visualizations for standard models. Therein, the optimization objective contains additional regularizers that penalize high-frequency changes in pixel space and encourage transformation robustness. Further, gradient descent is performed in the Fourier basis to further discourage high-frequency input patterns. We refer the reader to @olah2017feature for a more complete presentation.

In contrast, a different line of work [@tsipras2019robustness; @engstrom2019learning] has shown that robust (adversarially-trained) models tend to have better feature representations than their standard counterparts. Thus, for robust models, gradient descent in pixel space is already sufficient to find semantically-meaningful feature visualizations.

LIME {#app:lime}

Image superpixels.

Traditionally, LIME is used to obtain instance-specific explanations---i.e., to identify the superpixels in a given test image that are most responsible for the model's prediction. However, in our setting, we would like to obtain a global understanding of deep features, independent of specific test examples. Thus, we use the following two-step procedure to obtain LIME-based feature interpretations (a code sketch follows below):

  1. Rank test set images based on how strongly they activate the feature of interest. Then select the top-$k$ (or conversely bottom-$k$) images as the most prototypical examples for positive (negative) activation of the feature.

  2. Run LIME on each of these examples to identify relevant superpixels. At a high level, this involves performing linear regression to map image superpixels to the (normalized) activation of the deep feature (rather than the probability of a specific class as is typical).

Due to space constraints, we use $k=1$ in all our figures. However, in our analysis, we found the superpixels identified with $k=1$ to be representative of those obtained with higher values of $k$.
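For concreteness, a minimal Python sketch of this two-step procedure is given below. It relies on the `lime` package; `feature_extractor` is a placeholder name for a function mapping images to the 2048-dimensional deep representation, and the batch size, `num_samples`, and number of displayed superpixels are illustrative rather than the exact settings used in our experiments.

```python
import numpy as np
import torch
from lime import lime_image

def feature_activation(batch, feature_extractor, feature_idx):
    """Activation of one deep feature for a batch of images.

    `feature_extractor` (a placeholder) maps a (N, 3, H, W) tensor to the
    (N, 2048) deep representation; `batch` is a (N, H, W, 3) numpy array.
    """
    with torch.no_grad():
        x = torch.tensor(batch, dtype=torch.float32).permute(0, 3, 1, 2)
        feats = feature_extractor(x)[:, feature_idx]
    return feats.cpu().numpy().reshape(-1, 1)  # LIME expects a 2D score array

def lime_feature_interpretation(images, feature_extractor, feature_idx,
                                k=1, num_samples=1000):
    # Step 1: rank test images by how strongly they activate the feature,
    # and keep the top-k as the most prototypical (positive) examples.
    acts = np.concatenate(
        [feature_activation(images[i:i + 64], feature_extractor, feature_idx)
         for i in range(0, len(images), 64)]).ravel()
    top_k = np.argsort(-acts)[:k]  # use np.argsort(acts)[:k] for bottom-k

    # Step 2: run LIME on each prototypical image, regressing superpixels
    # onto the feature's activation instead of a class probability.
    explainer = lime_image.LimeImageExplainer()
    masks = []
    for i in top_k:
        explanation = explainer.explain_instance(
            images[i],
            lambda b: feature_activation(b, feature_extractor, feature_idx),
            top_labels=1, hide_color=0, num_samples=num_samples)
        _, mask = explanation.get_image_and_mask(
            explanation.top_labels[0], positive_only=True, num_features=5)
        masks.append((int(i), mask))
    return masks
```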

Word clouds for language models

For language models, off-the-shelf neuron interpretability tools are somewhat more limited than their vision counterparts. Of the tools listed above, only LIME is used in the language domain to produce sentence-specific explanations. Similar to our methodology for vision models, we apply LIME to a given deep feature representation rather than the output neuron. However, rather than selecting prototypical images, we instead aggregate LIME explanations over the entire validation set.

Specifically, for a given feature, we average the LIME weighting for each word over all of the sentences that the word appears in. This allows us to identify words that strongly activate/deactivate the given feature globally over the entire validation set, which we then visualize using word clouds. In practice, since a word cloud has limited space, we provide the top 30 most highly weighted words to the word cloud generator. The exact procedure is shown in Algorithm [alg:limeaggregate]{reference-type="ref" reference="alg:limeaggregate"}, and we use the word cloud generator from https://github.com/amueller/word_cloud.

[[alg:limeaggregate]]{#alg:limeaggregate label="alg:limeaggregate"}

  $\beta_i = \texttt{LIME}(w_i)$ // generate a LIME explanation for each sentence
  $K_w = \sum_{i,j : w = w_{ij}} 1$ // count the number of occurrences of each word
  $\hat \beta_w = \frac{1}{K_w}\sum_{i,j : w = w_{ij}} \beta_{ij}$ // compute the average LIME weight of each word
  $\texttt{WordCloud}(\hat\beta, V)$ // generate a word cloud for vocabulary $V$ weighted by $\hat\beta$
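A minimal Python sketch of this aggregation is given below; it assumes the per-sentence LIME explanations have already been computed and are supplied as word-to-weight dictionaries (a hypothetical input format), and it uses the word cloud generator linked above.

```python
from collections import defaultdict
from wordcloud import WordCloud

def aggregate_lime_wordcloud(lime_explanations, top_k=30):
    """Average per-sentence LIME word weights over the validation set.

    `lime_explanations` (assumed precomputed) is a list of dicts, one per
    sentence, mapping each word in that sentence to its LIME weight for the
    deep feature of interest.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for beta in lime_explanations:          # beta_i = LIME(w_i)
        for word, weight in beta.items():
            totals[word] += weight          # sum of weights over occurrences
            counts[word] += 1               # K_w: number of occurrences
    avg = {w: totals[w] / counts[w] for w in totals}   # hat beta_w

    # Keep the 30 most highly (positively) weighted words and render them.
    top = {w: v for w, v in
           sorted(avg.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
           if v > 0}
    return WordCloud(width=800, height=400).generate_from_frequencies(top)
```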

Datasets and Models {#app:datasets}

Datasets

We perform our experiments on the following widely-used vision and language datasets.

Balancing the comment classification task.

The toxic comments classification task has a highly unbalanced test set, which is largely skewed towards non-toxic comments. Consequently, the baseline accuracy obtained by always predicting the non-toxic label is often upwards of 90% on the unbalanced test set. To get a more interpretable and usable performance metric, we instead randomly subsample the test set to be balanced, with 50% toxic and 50% non-toxic comments from the corresponding toxicity category. Thus, the chance-level accuracy for toxic comment classification in our experiments is 50%.
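A minimal sketch of this balanced subsampling (with placeholder variable names) could look as follows:

```python
import numpy as np

def balanced_subsample(labels, seed=0):
    """Subsample indices so toxic (1) and non-toxic (0) are 50/50.

    `labels` is a binary numpy array over the test set for one toxicity
    category; returns indices of a balanced subset.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n = min(len(pos), len(neg))
    idx = np.concatenate([rng.choice(pos, n, replace=False),
                          rng.choice(neg, n, replace=False)])
    rng.shuffle(idx)
    return idx
```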

Models

We consider ResNet-50 [@he2016deep] classifiers and BERT [@devlin2018bert] models for vision and language tasks respectively. In the vision setting, we consider both standard and robust models [@madry2018towards].

Vision.

All models are trained for 90 epochs, with weight decay 1e-4 and momentum 0.9. We used a batch size of 512 for ImageNet and 128 for Places-10. The initial learning rate is 0.1 and is dropped by a factor of 10 every 30 epochs. The robust models were obtained using adversarial training with an $\ell_2$ PGD adversary [@madry2018towards] with $\varepsilon=3$, 3 attack steps, and an attack step size of $\frac{2 \times \varepsilon}{3}$.
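For reference, a minimal PyTorch sketch of this standard-training recipe is given below (the PGD adversary used for robust models is omitted); `train_loader` is a placeholder assumed to yield batches of images and labels.

```python
import torch
import torchvision

def train_vision_model(train_loader, epochs=90):
    """Standard-training recipe sketched above (loader is a placeholder)."""
    model = torchvision.models.resnet50(num_classes=1000)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Drop the learning rate by a factor of 10 every 30 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=30, gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```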

Language.

The language models are all pretrained and available from the HuggingFace library, and use the standard BERT base architecture. Specifically, the sentiment classification model is from https://huggingface.co/barissayil/bert-sentiment-analysis-sst and the toxic comment models (both Toxic-BERT and Debiased-BERT) come from https://huggingface.co/unitary/toxic-bert.

Evaluating sparse decision layers {#app:verification}

Trade-offs for all datasets {#app:tradeoffs}

In Figure [fig:app_sparsity]{reference-type="ref" reference="fig:app_sparsity"}, we present an extended version of Figure 7{reference-type="ref" reference="fig:sparsity"}---including all the tasks and models we consider in both the vision and language settings. Each point on a curve corresponds to a single linear classifier from the regularization path in Equation ([eq:path]{reference-type="ref" reference="eq:path"}). Note that we include the (same) SST curve in both language plots for the Toxic and Debiased BERT models.

{width="1\columnwidth"}

{width="1\columnwidth"}

{width="1\columnwidth"}

Selecting a single sparse model {#app:single}

As discussed in Section 2.1{reference-type="ref" reference="sec:glm_explain"}, the elastic net yields a sequence of linear models---with varying accuracy and sparsity---also known as the regularization path. In practice, performance of these models on a hold-out validation set can be used to guide model selection based on application-specific criteria. In our experiments, we set aside 10% of the train set for this purpose.

Our model selection thresholds.

For both vision and NLP tasks, we use the validation set to identify the sparsest decision layer whose validation accuracy is no more than 5% lower than that of the best-performing decision layer. As discussed in the paper, these thresholds are meant to be illustrative and can be varied depending on the specific application. We now visualize the per-class distribution of deep features for the sparse decision layers selected in Table [tab:ablation]{reference-type="ref" reference="tab:ablation"}. (We omit the NLP tasks, as they entail only two classes.)
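For concreteness, a minimal sketch of the selection rule above is given below; the regularization path is assumed to be available as a list of per-model records, and the key names are placeholders.

```python
def select_sparse_decision_layer(path, tolerance=5.0):
    """Pick the sparsest model within `tolerance` accuracy of the best one.

    `path` (assumed precomputed) is a list of dicts, one per regularization
    value, with keys 'val_acc' (validation accuracy, in percent) and 'nnz'
    (average number of nonzero weights per class).
    """
    best_acc = max(m['val_acc'] for m in path)
    eligible = [m for m in path if m['val_acc'] >= best_acc - tolerance]
    return min(eligible, key=lambda m: m['nnz'])
```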

Distribution of the number of deep features used per class by sparse decision layers of vision models. Note that a standard (dense) decision layer uses all 2048 deep features to predict every class.{#fig:app_per_class_sparsity width="0.9\columnwidth"}

Feature highlighting

In Table 4{reference-type="ref" reference="tab:app_accuracy_extended"}, we show an extended version of Table [tab:ablation]{reference-type="ref" reference="tab:ablation"}, which now contains an additional wide ImageNet representation as well as three additional toxic comment categories for each toxic comment classifier. The overall test accuracy of a subset of these models (before sparsification) is listed under 'Dense $\rightarrow$ All' in Table [tab:ablation]{reference-type="ref" reference="tab:ablation"}.

::: {#tab:app_accuracy_extended}


| Dataset/Model                  | $k$ | Dense: All | Dense: Top-$k$ | Dense: Rest | Sparse: All | Sparse: Top-$k$ | Sparse: Rest |
|--------------------------------|-----|------------|----------------|-------------|-------------|-----------------|--------------|
| ImageNet (std)                 | 10  | 74.03      | 58.46          | 55.22       | 72.24       | 69.78           | 10.84        |
| ImageNet (wide, std)           |     | 77.07      | 72.42          | 48.75       | 73.48       | 73.45           | 0.91         |
| ImageNet (robust)              |     | 61.23      | 28.99          | 34.65       | 59.99       | 45.82           | 19.83        |
| Places-10 (std)                | 10  | 83.30      | 83.60          | 81.20       | 77.40       | 77.40           | 10.00        |
| Places-10 (robust)             |     | 80.20      | 76.10          | 76.40       | 77.80       | 76.60           | 40.20        |
| SST                            | 5   | 91.51      | 53.21          | 91.17       | 90.71       | 90.48           | 50.92        |
| Toxic-BERT (toxic)             | 5   | 83.33      | 55.35          | 57.87       | 82.47       | 82.33           | 50.00        |
| Toxic-BERT (severe toxic)      |     | 71.53      | 50.00          | 50.14       | 67.57       | 50.00           | 50.00        |
| Toxic-BERT (obscene)           |     | 80.41      | 50.03          | 50.00       | 77.32       | 72.39           | 50.00        |
| Toxic-BERT (threat)            |     | 77.01      | 50.00          | 50.00       | 76.30       | 74.17           | 50.00        |
| Toxic-BERT (insult)            |     | 72.72      | 50.00          | 50.00       | 77.14       | 75.80           | 50.00        |
| Toxic-BERT (identity hate)     |     | 79.85      | 57.87          | 50.00       | 74.93       | 71.49           | 50.00        |
| Debiased-BERT (toxic)          | 5   | 91.61      | 50.00          | 83.26       | 87.59       | 78.58           | 50.00        |
| Debiased-BERT (severe toxic)   |     | 63.08      | 50.00          | 50.00       | 55.86       | 53.81           | 50.00        |
| Debiased-BERT (obscene)        |     | 85.36      | 50.00          | 58.36       | 81.50       | 81.17           | 50.00        |
| Debiased-BERT (threat)         |     | 77.49      | 50.00          | 50.00       | 68.96       | 50.00           | 50.00        |
| Debiased-BERT (insult)         |     | 85.63      | 50.00          | 59.95       | 79.28       | 71.48           | 50.00        |
| Debiased-BERT (identity hate)  |     | 76.12      | 50.00          | 50.84       | 71.98       | 50.00           | 50.00        |


: Extended version of Table [tab:ablation]{reference-type="ref" reference="tab:ablation"}: Comparison of the accuracy of dense/sparse decision layers when they are constrained to utilize only the top-$k$ deep features (based on weight magnitude). We also show overall model accuracy, and the accuracy gained by using the remaining deep features. :::

Additional comparisons of features {#app:visualizations}

[[app:feature_int]]{#app:feature_int label="app:feature_int"}

In Figure [fig:app_wordclouds]{reference-type="ref" reference="fig:app_wordclouds"}, we visualize additional deep features used by BERT models with sparse decision layers for the SST sentiment analysis task. Figures [fig:app_fv_std_in_harddisk]{reference-type="ref" reference="fig:app_fv_std_in_harddisk"}- [fig:app_fv_rob_places]{reference-type="ref" reference="fig:app_fv_rob_places"} show feature interpretations of deep features used by ResNet-50 classifiers with sparse decision layers trained on ImageNet and Places-10. Due to space constraints, we limit the feature interpretations for vision models to (at most) five randomly-chosen deep features used by the dense/sparse decision layer in Figure 6{reference-type="ref" reference="fig:suite"} and Figures [fig:app_fv_std_in_harddisk]{reference-type="ref" reference="fig:app_fv_std_in_harddisk"}- [fig:app_fv_rob_places]{reference-type="ref" reference="fig:app_fv_rob_places"}. To allow for a fair comparison between the two decision layers, we sample these features as follows. Given a target class, we first determine the number of deep features ($k$) used by the sparse decision layer to recognize objects of that class. Then, for both decision layers, we randomly sample five deep features from the top-$k$ highest weighted ones (for that class).

Language models {#app:sst_wordclouds}

{width="\columnwidth"}

{width="\columnwidth"}

{width="\columnwidth"}

Vision models {#app:imagenet_visualizations}

Class samples{width="0.8\columnwidth"}

Dense{width="0.8\columnwidth"}

Sparse{width="0.8\columnwidth"}

Class samples{width="0.8\columnwidth"}

Dense{width="0.8\columnwidth"}

Sparse{width="0.8\columnwidth"}

Class samples{width="0.8\columnwidth"}

Dense{width="0.8\columnwidth"}

Sparse{width="0.8\columnwidth"}

Class samples{width="0.8\columnwidth"}

Dense{width="0.8\columnwidth"}

Sparse{width="0.8\columnwidth"}

Class samples{width="0.8\columnwidth"}

Dense{width="0.2\columnwidth"}

Sparse{width="0.2\columnwidth"}

Class samples{width="0.8\columnwidth"}

Dense{width="0.8\columnwidth"}

Sparse{width="0.8\columnwidth"}

Human evaluation {#app:mturk_sim}

Sample MTurk task to assess how amenable models with dense/sparse decision layers are to human understanding. {#fig:app_task_sim width="0.7\columnwidth"}

We now detail the setup of our MTurk study from Section 3.3{reference-type="ref" reference="sec:human"}. For our analysis, we use a ResNet-50 that has been adversarially-trained ($\varepsilon=3$) on the ImageNet dataset. To obtain a sparse decision layer, we then train a sequence of GLMs via elastic net (cf. Section 2.1{reference-type="ref" reference="sec:glm_explain"}) on the deep representation of this network. Based on a validation set, we choose a single sparse decision layer---with 57.65% test accuracy and 39.18 deep features/class on average.

Task setup

Recall that our objective is to assess how effectively annotators are able to simulate the predictions of a model when they are exposed to its (dense or sparse) decision layer. To this end, we first randomly select 100 ImageNet classes. Then, for each such 'target class' and decision layer (dense/sparse) pair, we created a task by:

  1. Selecting deep features: We randomly select five deep features utilized by the decision layer to recognize objects of the target class. To make the comparison fairer, we restrict our attention to deep features that are assigned significant weight (>5% of the maximum) by the corresponding model. We then present these deep features to annotators via feature visualizations, shown alongside the (normalized and rescaled) linear coefficients for each deep feature.

  2. Selecting test inputs: We rank all the ImageNet test set images based on the probability assigned by the corresponding model (i.e., the ResNet-50 with a dense/sparse decision layer) to the target class. We then randomly select three images that lie in the following percentile ranges of target class probability: (90, 95), (98, 99), and (99.99, 100); a code sketch of this selection follows the list. Note that since ImageNet has 1000 diverse object categories, the target class probability of a randomly sampled image from the dataset is likely to be extremely small. Thus, fixing the percentiles as described above allows us to pick image candidates that are: (i) somewhat relevant to the target class; and (ii) of comparable difficulty for both types of decision layers. In Figure 34{reference-type="ref" reference="fig:logits"}, we present the target probability distribution as per the model for image candidates selected in this manner.
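The following is a minimal sketch of the percentile-based candidate selection in step 2 above, assuming the model's target-class probabilities over the test set have been precomputed (variable names are placeholders).

```python
import numpy as np

def select_candidates(probs, rng, ranges=((90, 95), (98, 99), (99.99, 100))):
    """Pick one test image per percentile range of target-class probability.

    `probs` (assumed precomputed) holds the model's target-class probability
    for every test image; returns one image index per percentile range.
    """
    chosen = []
    for lo, hi in ranges:
        lo_v, hi_v = np.percentile(probs, lo), np.percentile(probs, hi)
        pool = np.flatnonzero((probs >= lo_v) & (probs <= hi_v))
        chosen.append(int(rng.choice(pool)))
    return chosen

# Example usage: select_candidates(probs, np.random.default_rng(0))
```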

{#fig:logits width="0.9\columnwidth"}

{#fig:ranking_comparison width="0.8\columnwidth"}

{#fig:relative_accuracy width="0.7\columnwidth"}

Finally, annotators are presented with the deep features chosen above---described as patterns used by an AI model to recognize objects of a certain (unspecified) type. They are then asked to pick the image candidate (from a randomly-permuted set) that best matches the patterns. Annotators are also asked to mark their confidence on a Likert scale. A sample task is shown in Figure 33{reference-type="ref" reference="fig:app_task_sim"}.

For each target label-decision layer pair, we obtain 10 tasks by repeating the random selection process above. This results in a total of 2000 tasks (100 classes x 2 models x 10 tasks/(class, model)). Each task is presented to 5 annotators, compensated at $0.04 per task.

Quality control

For each task, we aggregated results over all the annotators. While doing so, we eliminated individual instances where a particular annotator made no selections. We also completely eliminated instances corresponding to annotators who consistently (>80% of the time) left the tasks blank. Finally, while reporting our results, we only keep tasks for which we have selections from at least two (of five) annotators. We determine the final selection based on a majority vote over annotators, weighted by their confidence.

Results

In Table 5{reference-type="ref" reference="tab:app_mturk_sim"}, we report annotator accuracy---in terms of their ability to correctly identify the image with the highest target class probability as per the model. We also present a breakdown of the overall accuracy depending on whether or not the "correct image" is from the target class. We find that sparsity significantly boosts annotators' ability to intuit (simulate) the model---by nearly 30%. In fact, their performance on models with dense decision layers is close to chance (33%). Note also that for models with sparse decision layers, annotators are able to correctly simulate the predictions even when the correct image belongs to a different class.

::: {#tab:app_mturk_sim}

| Accuracy (%)       | Dense             | Sparse                |
|--------------------|-------------------|-----------------------|
| Overall            | 35.61 $\pm$ 3.09  | **63.02 $\pm$ 3.02**  |
| From target class  | 44.02 $\pm$ 5.02  | 72.22 $\pm$ 4.74      |
| From another class | 30.64 $\pm$ 3.65  | 57.33 $\pm$ 4.00      |

: Accuracy of annotators at simulating the model given explanations from the dense and sparse classifiers. :::

In Figure 35{reference-type="ref" reference="fig:ranking_comparison"}, we visualize how the image selected by annotators ranks in terms of the model's target class probability, over all tasks. Note that a rank of one implies that the annotators correctly selected the image which the model predicts as having the highest target class probability. This figure largely corroborates the findings in Table 5{reference-type="ref" reference="tab:app_mturk_sim"}---in particular, highlighting that for the standard (dense) decision layer, annotator selections are near-random. In Figure 36{reference-type="ref" reference="fig:relative_accuracy"}, we visualize annotator accuracy---aggregated per (the 10 tasks for a) target class---for models with dense and sparse decision layers.

Model biases and spurious correlations {#app:bias}

Toxic comments {#app:toxic}

In this section, we visualize the word clouds for the toxic comment classifiers which reveal the biases that the model has learned from the data. Note that these figures are heavily redacted due to the nature of these comments.

In Figure [fig:app_toxic_bert]{reference-type="ref" reference="fig:app_toxic_bert"}, we visualize the top five features for the sparse (Figure 37{reference-type="ref" reference="fig:app_toxic_bert_sparse"}) and dense (Figure 38{reference-type="ref" reference="fig:app_toxic_bert_dense"}) decision layers of Toxic-BERT. We note that more of the words in the sparse decision layer refer to identity groups, whereas this is less clear in the dense decision layer. Even if we expand our interpretation to the top 10 neurons with the largest weight, only 7.5% of the words refer to identity groups for the model with a dense decision layer.

In Figure [fig:app_debiased_bert]{reference-type="ref" reference="fig:app_debiased_bert"}, we perform a similar visualization for the Debiased-BERT model. The word clouds for the sparse decision layer (Figure 39{reference-type="ref" reference="fig:app_debiased_bert_sparse"}) provide evidence that the Debiased-BERT model no longer uses identity words as prevalently for identifying toxic comments. However, the sparse decision layer also makes clear that a significant fraction of the non-toxic word clouds contain identity words. This suggests that the model now uses these identity words as strong evidence for non-toxicity, which is also reflected, to a lesser degree, in the word clouds for the dense decision layer (Figure 40{reference-type="ref" reference="fig:app_debiased_bert_dense"}).

{#fig:app_toxic_bert_sparse width="\columnwidth"}

{#fig:app_toxic_bert_dense width="\columnwidth"}

{#fig:app_debiased_bert_sparse width="0.95\columnwidth"}

{#fig:app_debiased_bert_dense width="0.95\columnwidth"}

ImageNet {#app:imagenet_biases}

[[app:spurious]]{#app:spurious label="app:spurious"}

Human study {#app:mturk_spurious}

We now detail the setup of our MTurk study from Section 4.1{reference-type="ref" reference="sec:biases"}. For our analysis, we use a standard ResNet-50 trained on the ImageNet dataset---with the default (dense) decision layer, as well as its sparse counterpart from Figure [tab:ablation]{reference-type="ref" reference="tab:ablation"}.

Task setup.

This task is designed to semi-automatically identify learned correlations in classifiers with dense/sparse decision layers. To this end, we randomly select 1000 class pairs for each model, such that the classes share a common deep feature in the decision layer. We only consider features to which the model assigns a substantial weight for both classes (>5% of the maximum weight); a sketch of this selection is shown below. Then, for each class (from the pair), we select the three images that maximally activate the deep feature of interest. Doing so allows us to identify the most prototypical images from each class for the given deep feature.
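A minimal sketch of how such class pairs could be mined from the decision layer is given below; `W` is a placeholder for the decision layer's weight matrix, the 5% relative threshold mirrors the criterion above, and the treatment of negative weights is a simplification.

```python
import numpy as np
from itertools import combinations

def shared_feature_pairs(W, rel_threshold=0.05):
    """Find (class_i, class_j, feature) triples that share a heavy feature.

    `W` is the (num_classes, num_features) weight matrix of the sparse
    decision layer. A feature counts as "used" by a class if its weight
    exceeds `rel_threshold` times that class's maximum weight.
    """
    used = W > (rel_threshold * W.max(axis=1, keepdims=True))
    triples = []
    for i, j in combinations(range(W.shape[0]), 2):
        for f in np.flatnonzero(used[i] & used[j]):
            triples.append((i, j, int(f)))
    return triples
```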

We then present annotators on MTurk with the six chosen images, grouped by class along with the label. We ask them: (a) whether the images share a common pattern; (b) how confident they are about this selection on a Likert scale; (c) to provide a short free-text description of the pattern; and (d) for each class, to determine if the pattern is part of the class object or its surroundings. A sample task is shown in Figure 41{reference-type="ref" reference="fig:app_task_spurious"}. Each task was presented to 5 annotators, compensated at $0.07 per task.

Quality control

For each task, we aggregated results over all the annotators. While doing so, we eliminated individual instances where a particular annotator made no selections. We also completely eliminated instances corresponding to annotators who consistently (>80% of the time) left the task blank. Finally, while reporting our results, we only keep tasks for which we have selections from at least three (of five) annotators. We determine the final selection based on a majority vote over annotators, weighted by their confidence.

Sample MTurk task to diagnose (spurious) correlations in deep networks via their dense/sparse decision layers.{#fig:app_task_spurious width="0.7\columnwidth"}

Additional visualizations of spurious correlations

In Figure 43{reference-type="ref" reference="fig:app_spurious_examples"}, we provide additional examples of correlations detected using our MTurk study. Then, in Figure 44{reference-type="ref" reference="fig:app_feedback"}, we summarize annotator-provided descriptions of all the patterns identified in ImageNet classifiers with sparse decision layers via a word cloud. This visualization sheds light on the nature of correlations extracted by ImageNet classifiers from their training data---for instance, we see that the patterns most frequently identified by annotators relate to object color and shape.

Additional examples of correlations in ImageNet models detected using our MTurk study. Each row contains prototypical images from a pair of classes, along with the annotator-provided descriptions for the shared deep feature that these images strongly activate. For each class, we also display whether annotators marked the feature to be a "spurious correlation".
{#fig:app_spurious_examples width="1\columnwidth"}

Word cloud visualization of descriptions provided by annotators for patterns learned by "shared deep features" in standard ImageNet-trained ResNet-50 classifiers with sparse decision layers.{#fig:app_feedback width="1\columnwidth"}

Counterfactual experiments {#app:sentiment_counterfactuals}

[[alg:nlp_counterfactual]]{#alg:nlp_counterfactual label="alg:nlp_counterfactual"}

  $z = h(x)$ // compute deep features
  $y = \mathop{\mathrm{arg\,max}}_y w_y z + b_y$ // compute the model's prediction
  $Z^+, Z^- = \emptyset, \emptyset$ // initialize candidate word substitutions
  $Z^+ = Z^+ \cup \{(x_j, z_i)\}$ // candidate word substitution with positive weight and positive activation
  $Z^- = Z^- \cup \{(x_j, z_i)\}$ // candidate word substitution with negative weight and negative activation
  If $Z^+ \cup Z^- = \emptyset$: stop // no overlapping words found for counterfactual generation
  Randomly select $(x_j, z_i) \in Z^+ \cup Z^-$ // select a random word to substitute and its corresponding feature
  If $(x_j, z_i) \in Z^+$: randomly select $\hat x_j \in \mathop{\mathrm{WordCloud}}^-(z_i)$ // if positive, select a random negative word
  Otherwise: randomly select $\hat x_j \in \mathop{\mathrm{WordCloud}}^+(z_i)$ // if negative, select a random positive word
  $\hat x = (x_1, \dots, x_{j-1}, \hat x_j, x_{j+1}, \dots, x_n)$ // perform the word substitution
  Return $\hat x$ // return the generated counterfactual

Language counterfactuals

We describe in detail how to generate counterfactuals from the word cloud interpretations and the linear decision layer. The complete algorithm can be found in Algorithm [alg:nlp_counterfactual]{reference-type="ref" reference="alg:nlp_counterfactual"}, which we describe next.

Let $x = (x_1, \dots, x_n)$ be a sentence with $n$ words, $z = h(x) \in \mathbb R^m$ be the deep encoding of $x$, and $y = \mathop{\mathrm{arg\,max}}_y w_y z + b_y \in [k]$ be the model's prediction on $x$ for a given decision layer with coefficients $(w,b)$. Our goal is to generate a counterfactual that flips the model's prediction $y$ to some other class. Furthermore, let $\mathop{\mathrm{WordCloud}}^+(z_i)$ and $\mathop{\mathrm{WordCloud}}^{-}(z_i)$ be the LIME-based word clouds representing the positive and negative activations of the $i$th deep feature, $z_i$. Counterfactual generation in the language setting then involves the following steps:

  1. Find all deep features which use words in $x$ as evidence for the predicted label $y$ (according to the word clouds). Specifically, calculate $Z = Z^- \cup Z^+$, where $$\begin{aligned} Z^+ &= \{(x_j,z_i) : x_j \in \mathop{\mathrm{WordCloud}}^+(z_i) \wedge w_{yi} > 0 \}\\ Z^- &= \{(x_j,z_i) : x_j \in \mathop{\mathrm{WordCloud}}^-(z_i) \wedge w_{yi} < 0 \}. \end{aligned}$$

  2. Randomly select a deep feature (and its word) $(x_j,z_i) \in Z$

  3. If $(x_j,z_i) \in Z^+$, randomly select a word $\hat x_j \in \mathop{\mathrm{WordCloud}}^-(z_i)$. Otherwise, if $(x_j,z_i) \in Z^-$, randomly select a word $\hat x_j \in \mathop{\mathrm{WordCloud}}^+(z_i)$.

  4. Perform the word substitution $x_j \rightarrow \hat x_j$ to get the counterfactual sentence, $\hat x = (x_1, \dots, x_{j-1}, \hat x_j, x_{j+1}, \dots, x_n)$.

Note that it is possible for no features to use the words in a given sentence as evidence for its prediction, which results in no candidate word substitutions (i.e., $|Z| = 0$). Consequently, it is possible for a sentence to have a counterfactual generated from the dense decision layer but not from the sparse one (or vice versa). For our sentiment counterfactual experiments, we restrict our analysis to sentences which have counterfactuals under both the sparse and dense decision layers. However, we found that similar results hold if one instead considers all possible counterfactuals for each individual model.
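For concreteness, a minimal sketch of this counterfactual generation procedure is given below. The word clouds are assumed to be available as per-feature word sets and the decision layer as a weight matrix and bias vector; all names are placeholder structures, and checking whether the substitution actually flips the prediction requires re-encoding the modified sentence.

```python
import random

def generate_counterfactual(sentence, z, W, b, pos_clouds, neg_clouds,
                            rng=random):
    """Substitute one word to (attempt to) flip the model's prediction.

    `sentence` is a list of words, `z` its deep encoding, `W`/`b` the
    (num_classes, num_features) decision layer and bias, and `pos_clouds[i]`
    / `neg_clouds[i]` the positive / negative word sets for feature i
    (all assumed precomputed).
    """
    y = int(max(range(W.shape[0]), key=lambda c: W[c] @ z + b[c]))
    candidates = []
    for i in range(W.shape[1]):
        for j, word in enumerate(sentence):
            if W[y, i] > 0 and word in pos_clouds[i]:
                candidates.append((j, i, 'pos'))   # member of Z^+
            elif W[y, i] < 0 and word in neg_clouds[i]:
                candidates.append((j, i, 'neg'))   # member of Z^-
    if not candidates:
        return None   # no overlapping words; no counterfactual possible
    j, i, sign = rng.choice(candidates)
    # Swap in a word from the opposite word cloud of the same feature.
    replacement = rng.choice(sorted(neg_clouds[i] if sign == 'pos'
                                    else pos_clouds[i]))
    return sentence[:j] + [replacement] + sentence[j + 1:]
```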

Image counterfactual generation process. We start with a correlation identified during our MTurk study in Section 4.1{reference-type="ref" reference="sec:biases"}---for example, the model associates "water" with the class "snorkel". To generate the counterfactuals shown in Figure 13{reference-type="ref" reference="fig:counterfactuals_img"}, we first select images from other ImageNet classes. We then manually annotate regions in these images to replace with "water" backgrounds obtained via automated image search on the Internet. Finally, we additively combine the "water" backgrounds and the original images, weighted by the mask, to obtain the resulting counterfactual inputs. {#fig:app_vision_counterfactuals width="0.7\columnwidth"}

ImageNet counterfactuals

In Figure 45{reference-type="ref" reference="fig:app_vision_counterfactuals"}, we illustrate our pipeline for counterfactual image generation. Our starting point is a particular spurious correlation (between a data pattern and a target class) identified via the MTurk study in Section 4.1{reference-type="ref" reference="sec:biases"}. We then select images from other ImageNet classes to add the spurious pattern to, and annotate the relevant region where it should be added. We obtain the spurious patterns by automatically scraping search engines. Finally, we combine the original images with the retrieved spurious pattern, using the mask as the weighting, to obtain the desired counterfactual images. These images are then supplied to the model, to test whether the addition of the spurious input pattern indeed fools the model into perceiving the counterfactuals as belonging to the target class.
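The final compositing step amounts to a mask-weighted blend of the original image and the retrieved pattern; a minimal sketch (with placeholder array names) is given below.

```python
import numpy as np

def composite_counterfactual(original, pattern, mask):
    """Blend a spurious pattern (e.g., a "water" background) into an image.

    `original` and `pattern` are (H, W, 3) float arrays in [0, 1]; `mask` is
    an (H, W) array in [0, 1] marking the manually annotated region where
    the pattern should appear.
    """
    m = mask[..., None]                       # broadcast over color channels
    return np.clip(m * pattern + (1.0 - m) * original, 0.0, 1.0)
```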

Validating ImageNet misclassifications {#app:errors}

Human study {#app:mturk_mis}

We now detail the setup of our MTurk study from Section 3.3{reference-type="ref" reference="sec:human"}. For our analysis, we use a ResNet-50 that has been adversarially-trained ($\varepsilon=3$) on the ImageNet dataset. To obtain a sparse decision layer, we then train a sequence of GLMs via elastic net (cf. Section 2.1{reference-type="ref" reference="sec:glm_explain"}) on the deep representation of this network. Based on a validation set, we choose a single sparse decision layer---with 57.65% test accuracy and 39.18 deep features/class on average.

Task setup.

In this task, our goal is to understand if annotators can identify data patterns that are responsible for misclassifications. To this end, we start by identifying deep features that are strongly activated for misclassified inputs.

For any misclassified input $x$ with ground truth label $l$ and predicted class $p$, we can compute for every deep feature $f_i(x)$: $$\gamma_i = W[p,i] \cdot f_i(x) - W[l,i] \cdot f_i(x),$$ where $W$ is the weight matrix of the decision layer. Intuitively, this score measures the extent to which a deep feature contributes to the predicted class, relative to its contribution to the ground truth class. Sorting deep features by decreasing/increasing values of this score thus gives a measure of how important each of them is for the predicted/ground truth label. Let us denote by $f_p$ the deep feature with the highest score $\gamma_i$ and by $f_l$ the one with the lowest.

We find that for the robust ResNet-50 model with a sparse decision layer, the single top deep feature based on this score ($f_p$) alone is responsible for roughly 26% of the misclassifications (5673 examples in all). That is, for each of these examples, simply setting $f_p=0$ flips the model's prediction from $p$ to $l$. We henceforth refer to these deep features (one per misclassified input) as "problematic" features.
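A minimal sketch of this scoring and the corresponding ablation check is given below; `W`, `b`, and `feats` are placeholder names for the decision layer weights, biases, and the deep features of a misclassified input.

```python
import numpy as np

def problematic_feature(W, b, feats, label, pred):
    """Score features of a misclassified input and test the top one.

    `W` and `b` parameterize the decision layer, `feats` holds the deep
    features f(x), `label` is the ground-truth class l, and `pred` is the
    predicted class p (with pred != label).
    """
    gamma = W[pred] * feats - W[label] * feats   # gamma_i for every feature
    f_p, f_l = int(np.argmax(gamma)), int(np.argmin(gamma))

    # Check whether zeroing f_p alone flips the prediction back to the label.
    ablated = feats.copy()
    ablated[f_p] = 0.0
    flipped = int(np.argmax(W @ ablated + b)) == label
    return f_p, f_l, flipped
```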

Sample MTurk task to identify input patterns responsible for the misclassifications in deep networks with the help of their (sparse) decision layers.{#fig:app_task_mis width="0.9\columnwidth"}

For our task, we randomly subsample 1330 of the aforementioned 5673 misclassified inputs. We then construct MTurk tasks, wherein annotators are presented with one such input (without any information about the ground truth or predicted labels), along with the feature visualizations for two deep features. These two features are either (with equal probability):

Annotators are then asked: (a) to select all the patterns (i.e., feature visualization of a deep feature) that match the image; (b) to select the one that best matches the image (if they selected both in (a)); (c) to mark their confidence on a likert scale. A sample task is shown in Figure 46{reference-type="ref" reference="fig:app_task_mis"}. Each task was presented to 5 annotators, compensated at $0.03 per task.

Note that if the ground truth label for each image is actually pertinent to it, and the model relies on semantically-meaningful deep features for every class, we would expect annotators to select $f_l$ as matching the image 100% of the time. On the other hand, we would expect annotators to rarely select $f_r$ as matching the image.

Quality control

For each task, we aggregated results over all the annotators. While doing so, we eliminated individual instances where a particular annotator made no selections. We also completely eliminated instances corresponding to annotators who consistently (>80% of the time) left the task blank. Finally, while reporting our results, we only keep tasks for which we have selections from at least two (of five) annotators. We determine the final selection based on a majority vote over annotators, weighted by their confidence.

Additional error visualizations {#app:app_validated_errors}

In Figure 48{reference-type="ref" reference="fig:app_misclassification_examples"}, we present additional examples of misclassifications for which annotators deem the top deep feature used by the sparse decision layer to detect the predicted class to be a better match for the image than the corresponding top feature for the ground truth class.

Additional examples of misclassified ImageNet images for which annotators deem the top activated feature for the predicted class (rightmost) as a better match than the top activated feature for the ground truth class (middle).{#fig:app_misclassification_examples width="0.56\columnwidth"}

Model confusion {#app:confusion}

In Figure 49{reference-type="ref" reference="fig:app_confusion_correlation"}, we visualize the correlation between model confusion within a pair of classes and the number of features they share in the sparse decision layer. Model confusion within a class pair $(i,j)$ is measured as $\max(C_{(i,j)}, C_{(j,i)})$, where $C$ is the overall confusion matrix. We find that for models with sparse decision layers, the feature overlap between two classes is significantly correlated with model errors within that class pair. One can thus inspect the corresponding shared features---cf. Figure 50{reference-type="ref" reference="fig:confusion_example"} for an example---to better understand the underlying causes of inter-class model confusion.
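A minimal sketch of this measurement is given below; `C` and `W` are placeholder names for the confusion matrix and the sparse decision layer's weights, and the sparsity threshold is illustrative.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def confusion_vs_shared_features(C, W, tol=1e-6):
    """Correlate per-pair model confusion with shared-feature counts.

    `C` is the (num_classes, num_classes) confusion matrix and `W` the
    (num_classes, num_features) sparse decision layer; a feature counts as
    "used" by a class if its weight magnitude exceeds `tol`.
    """
    used = np.abs(W) > tol
    confusion, shared = [], []
    for i, j in combinations(range(C.shape[0]), 2):
        confusion.append(max(C[i, j], C[j, i]))      # max(C_(i,j), C_(j,i))
        shared.append(int((used[i] & used[j]).sum()))
    return pearsonr(shared, confusion)               # correlation, p-value
```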

Correlation between the number of features shared in the sparse decision layer of a model for two classes, and model confusion between them. Model confusion within a class pair is measured as the maximum of the corresponding entries ($C_{(i,j)}$, $C_{(j,i)}$) of the overall confusion matrix. {#fig:app_confusion_correlation width="0.7\columnwidth"}

Sample visualization of confusing features: Five of the deep features used by a robust ($\varepsilon=3$) ImageNet-trained ResNet-50 with sparse decision layer to identify objects of classes "pier" and "suspension bridge" which are frequently confused by the model ($C_{i,j}$ and $C_{j,i}$ are 16% and 24% respectively). Each of these deep features is interpreted using feature visualizations (FV) and LIME superpixels; shown alongside their linear coefficients (W). {#fig:confusion_example width="1\columnwidth"}

[^1]: The code for our toolkit can be found at https://github.com/madrylab/debuggabledeepnetworks.

[^2]: Equal contribution.

[^3]: A standalone package of our solver is available at https://github.com/madrylab/glm_saga

[^4]: [[note1]]{#note1 label="note1"} Despite significant research, feature visualizations for standard vision models are often hard to parse, possibly due to their reliance on human-unintelligible features [@ilyas2019adversarial]. Thus, in the main paper, we present visualizations from adversarially-trained models which tend to have more human-aligned features [@tsipras2019robustness; @engstrom2019learning], and present the corresponding plots for standard models in Appendix 10.3{reference-type="ref" reference="app:visualizations"}.

[^5]: Simulatability is a standard evaluation criterion in interpretability [@ribeiro2016why; @lipton2018mythos], wherein an interpretation is deemed good if it enables humans to reproduce what the model will decide (irrespective of the "correctness" of that decision).

[^6]: We focus on this specific notion of "spurious correlations" as it is easy for humans to verify---cf. Appendix [app:spurious]{reference-type="ref" reference="app:spurious"} for details.