-
source
arxiv
-
source_type
latex
-
converted_with
pandoc
-
paper_version
1912.07242v1
-
title
More Data Can Hurt for Linear Regression: Sample-wise Double Descent
-
authors
["Preetum Nakkiran"]
-
date_published
2019-12-16 08:28:26+00:00
-
data_last_modified
2019-12-16 08:28:26+00:00
-
abstract
In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing "double-descent" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.
-
author_comment
null
-
journal_ref
null
-
doi
null
-
primary_category
stat.ML
-
categories
["stat.ML","cs.LG","cs.NE","math.ST","stat.TH"]
-
citation_level
0
-
alignment_text
pos
-
confidence_score
1.0
-
main_tex_filename
main.tex
-
bibliography_bbl
\begin{thebibliography}{}
\bibitem[Advani and Saxe, 2017]{advani2017high}
Advani, M.~S. and Saxe, A.~M. (2017).
\newblock High-dimensional dynamics of generalization error in neural networks.
\newblock {\em arXiv preprint arXiv:1710.03667}.
\bibitem[Bartlett et~al., 2019]{bartlett2019benign}
Bartlett, P.~L., Long, P.~M., Lugosi, G., and Tsigler, A. (2019).
\newblock Benign overfitting in linear regression.
\newblock {\em arXiv preprint arXiv:1906.11300}.
\bibitem[Belkin et~al., 2018]{belkin2018reconciling}
Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018).
\newblock Reconciling modern machine learning and the bias-variance trade-off.
\newblock {\em arXiv preprint arXiv:1812.11118}.
\bibitem[Belkin et~al., 2019]{belkin2019two}
Belkin, M., Hsu, D., and Xu, J. (2019).
\newblock Two models of double descent for weak features.
\newblock {\em arXiv preprint arXiv:1903.07571}.
\bibitem[Bibas et~al., 2019]{bibas2019new}
Bibas, K., Fogel, Y., and Feder, M. (2019).
\newblock A new look at an old problem: A universal learning approach to linear
regression.
\newblock {\em arXiv preprint arXiv:1905.04708}.
\bibitem[Deng et~al., 2019]{deng2019model}
Deng, Z., Kammoun, A., and Thrampoulidis, C. (2019).
\newblock A model of double descent for high-dimensional binary linear
classification.
\newblock {\em arXiv preprint arXiv:1911.05822}.
\bibitem[Dereziński et~al., 2019]{dereziski2019exact}
Dereziński, M., Liang, F., and Mahoney, M.~W. (2019).
\newblock Exact expressions for double descent and implicit regularization via
surrogate random design.
\newblock {\em arXiv preprint arXiv:1912.04533}.
\bibitem[Geiger et~al., 2019]{geiger2019jamming}
Geiger, M., Spigler, S., d'Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G.,
and Wyart, M. (2019).
\newblock Jamming transition as a paradigm to understand the loss landscape of
deep neural networks.
\newblock {\em Physical Review E}, 100(1):012115.
\bibitem[Hastie et~al., 2019]{hastie2019surprises}
Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R.~J. (2019).
\newblock Surprises in high-dimensional ridgeless least squares interpolation.
\newblock {\em arXiv preprint arXiv:1903.08560}.
\bibitem[Lampinen and Ganguli, 2018]{lampinen2018analytic}
Lampinen, A.~K. and Ganguli, S. (2018).
\newblock An analytic theory of generalization dynamics and transfer learning
in deep linear networks.
\newblock {\em arXiv preprint arXiv:1809.10374}.
\bibitem[Liang and Rakhlin, 2018]{liang2018just}
Liang, T. and Rakhlin, A. (2018).
\newblock Just interpolate: Kernel ``ridgeless'' regression can generalize.
\newblock {\em arXiv preprint arXiv:1808.00387}.
\bibitem[Liang et~al., 2019]{liang2019risk}
Liang, T., Rakhlin, A., and Zhai, X. (2019).
\newblock On the risk of minimum-norm interpolants and restricted lower
isometry of kernels.
\newblock {\em arXiv preprint arXiv:1908.10292}.
\bibitem[Mar{\v{c}}enko and Pastur, 1967]{marvcenko1967distribution}
Mar{\v{c}}enko, V.~A. and Pastur, L.~A. (1967).
\newblock Distribution of eigenvalues for some sets of random matrices.
\newblock {\em Mathematics of the USSR-Sbornik}, 1(4):457.
\bibitem[Mei and Montanari, 2019]{mei2019generalization}
Mei, S. and Montanari, A. (2019).
\newblock The generalization error of random features regression: Precise
asymptotics and double descent curve.
\newblock {\em arXiv preprint arXiv:1908.05355}.
\bibitem[Mitra, 2019]{Mitra2019UnderstandingOP}
Mitra, P.~P. (2019).
\newblock Understanding overfitting peaks in generalization error: Analytical
risk curves for l2 and l1 penalized interpolation.
\newblock {\em ArXiv}, abs/1906.03667.
\bibitem[Muthukumar et~al., 2019]{muthukumar2019harmless}
Muthukumar, V., Vodrahalli, K., and Sahai, A. (2019).
\newblock Harmless interpolation of noisy data in regression.
\newblock {\em arXiv preprint arXiv:1903.09139}.
\bibitem[Nakkiran et~al., 2019]{deep}
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I.
(2019).
\newblock Deep double descent: Where bigger models and more data hurt.
\newblock {\em arXiv preprint arXiv:1912.02292}.
\bibitem[Neal et~al., 2018]{neal2018modern}
Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien,
S., and Mitliagkas, I. (2018).
\newblock A modern take on the bias-variance tradeoff in neural networks.
\newblock {\em arXiv preprint arXiv:1810.08591}.
\bibitem[Opper, 1995]{opper1995statistical}
Opper, M. (1995).
\newblock Statistical mechanics of learning: Generalization.
\newblock {\em The Handbook of Brain Theory and Neural Networks}, pages 922--925.
\bibitem[Opper, 2001]{opper2001learning}
Opper, M. (2001).
\newblock Learning to generalize.
\newblock {\em Frontiers of Life}, 3(part 2):763--775.
\bibitem[Spigler et~al., 2018]{spigler2018jamming}
Spigler, S., Geiger, M., d'Ascoli, S., Sagun, L., Biroli, G., and Wyart, M.
(2018).
\newblock A jamming transition from under- to over-parametrization affects loss
landscape and generalization.
\newblock {\em arXiv preprint arXiv:1810.09665}.
\bibitem[Xu and Hsu, 2019]{xu2019number}
Xu, J. and Hsu, D.~J. (2019).
\newblock On the number of variables to use in principal component regression.
\newblock In {\em Advances in Neural Information Processing Systems}, pages
5095--5104.
\end{thebibliography}
-
bibliography_bib
@article{deep,
title={Deep Double Descent: Where Bigger Models and More Data Hurt},
author={Nakkiran, Preetum and Kaplun, Gal and Bansal, Yamini and Yang, Tristan and Barak, Boaz and Sutskever, Ilya},
journal={arXiv preprint arXiv:1912.02292},
year={2019}
}
@article{deng2019model,
title={A Model of Double Descent for High-dimensional Binary Linear Classification},
author={Deng, Zeyu and Kammoun, Abla and Thrampoulidis, Christos},
journal={arXiv preprint arXiv:1911.05822},
year={2019}
}
@inproceedings{
aditi,
title={When Covariate-shifted Data Augmentation Increases Test Error And How to Fix It},
author={Anonymous},
booktitle={Submitted to International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=ByxduJBtPB},
note={under review}
}
@article{lampinen2018analytic,
title={An analytic theory of generalization dynamics and transfer learning in deep linear networks},
author={Lampinen, Andrew K and Ganguli, Surya},
journal={arXiv preprint arXiv:1809.10374},
year={2018}
}
@misc{dereziski2019exact,
title={Exact expressions for double descent and implicit regularization via surrogate random design},
author={Michał Dereziński and Feynman Liang and Michael W. Mahoney},
year={2019},
eprint={1912.04533},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@article{marvcenko1967distribution,
title={Distribution of eigenvalues for some sets of random matrices},
author={Mar{\v{c}}enko, Vladimir A and Pastur, Leonid Andreevich},
journal={Mathematics of the USSR-Sbornik},
volume={1},
number={4},
pages={457},
year={1967},
publisher={IOP Publishing}
}
@inproceedings{xu2019number,
title={On the number of variables to use in principal component regression},
author={Xu, Ji and Hsu, Daniel J},
booktitle={Advances in Neural Information Processing Systems},
pages={5095--5104},
year={2019}
}
@article{spigler2018jamming,
title={A jamming transition from under- to over-parametrization affects loss landscape and generalization},
author={Spigler, Stefano and Geiger, Mario and d'Ascoli, St{\'e}phane and Sagun, Levent and Biroli, Giulio and Wyart, Matthieu},
journal={arXiv preprint arXiv:1810.09665},
year={2018}
}
@article{neal2018modern,
title={A modern take on the bias-variance tradeoff in neural networks},
author={Neal, Brady and Mittal, Sarthak and Baratin, Aristide and Tantia, Vinayak and Scicluna, Matthew and Lacoste-Julien, Simon and Mitliagkas, Ioannis},
journal={arXiv preprint arXiv:1810.08591},
year={2018}
}
@article{mei2019generalization,
title={The generalization error of random features regression: Precise asymptotics and double descent curve},
author={Mei, Song and Montanari, Andrea},
journal={arXiv preprint arXiv:1908.05355},
year={2019}
}
@misc{hastie2019surprises,
title={Surprises in High-Dimensional Ridgeless Least Squares Interpolation},
author={Trevor Hastie and Andrea Montanari and Saharon Rosset and Ryan J. Tibshirani},
year={2019},
eprint={1903.08560},
archivePrefix={arXiv},
primaryClass={math.ST}
}
@article{Mitra2019UnderstandingOP,
title={Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation},
author={Partha P. Mitra},
journal={ArXiv},
year={2019},
volume={abs/1906.03667}
}
@article{bibas2019new,
title={A New Look at an Old Problem: A Universal Learning Approach to Linear Regression},
author={Bibas, Koby and Fogel, Yaniv and Feder, Meir},
journal={arXiv preprint arXiv:1905.04708},
year={2019}
}
@article{opper2001learning,
title={Learning to generalize},
author={Opper, Manfred},
journal={Frontiers of Life, 3(part 2), pp.763-775.},
year={2001}
}
@article{opper1995statistical,
title={Statistical mechanics of learning: Generalization},
author={Opper, Manfred},
journal={The Handbook of Brain Theory and Neural Networks, 922-925.},
year={1995}
}
@article{belkin2019two,
title={Two models of double descent for weak features},
author={Belkin, Mikhail and Hsu, Daniel and Xu, Ji},
journal={arXiv preprint arXiv:1903.07571},
year={2019}
}
@article{geiger2019jamming,
title={Jamming transition as a paradigm to understand the loss landscape of deep neural networks},
author={Geiger, Mario and Spigler, Stefano and d'Ascoli, St{\'e}phane and Sagun, Levent and Baity-Jesi, Marco and Biroli, Giulio and Wyart, Matthieu},
journal={Physical Review E},
volume={100},
number={1},
pages={012115},
year={2019},
publisher={APS}
}
@article{muthukumar2019harmless,
title={Harmless interpolation of noisy data in regression},
author={Muthukumar, Vidya and Vodrahalli, Kailas and Sahai, Anant},
journal={arXiv preprint arXiv:1903.09139},
year={2019}
}
@article{bartlett2019benign,
title={Benign Overfitting in Linear Regression},
author={Bartlett, Peter L and Long, Philip M and Lugosi, G{\'a}bor and Tsigler, Alexander},
journal={arXiv preprint arXiv:1906.11300},
year={2019}
}
@article{advani2017high,
title={High-dimensional dynamics of generalization error in neural networks},
author={Advani, Madhu S and Saxe, Andrew M},
journal={arXiv preprint arXiv:1710.03667},
year={2017}
}
@article{geiger2019scaling,
title={Scaling description of generalization with number of parameters in deep learning},
author={Geiger, Mario and Jacot, Arthur and Spigler, Stefano and Gabriel, Franck and Sagun, Levent and d'Ascoli, St{\'e}phane and Biroli, Giulio and Hongler, Cl{\'e}ment and Wyart, Matthieu},
journal={arXiv preprint arXiv:1901.01608},
year={2019}
}
@article{liang2018just,
title={Just interpolate: Kernel ``ridgeless'' regression can generalize},
author={Liang, Tengyuan and Rakhlin, Alexander},
journal={arXiv preprint arXiv:1808.00387},
year={2018}
}
@article{belkin2018does,
title={Does data interpolation contradict statistical optimality?},
author={Belkin, Mikhail and Rakhlin, Alexander and Tsybakov, Alexandre B},
journal={arXiv preprint arXiv:1806.09471},
year={2018}
}
@article{belkin2018understand,
title={To understand deep learning we need to understand kernel learning},
author={Belkin, Mikhail and Ma, Siyuan and Mandal, Soumik},
journal={arXiv preprint arXiv:1802.01396},
year={2018}
}
@article{belkin2018reconciling,
title={Reconciling modern machine learning and the bias-variance trade-off},
author={Belkin, Mikhail and Hsu, Daniel and Ma, Siyuan and Mandal, Soumik},
journal={arXiv preprint arXiv:1812.11118},
year={2018}
}
@article{liang2019risk,
title={On the risk of minimum-norm interpolants and restricted lower isometry of kernels},
author={Liang, Tengyuan and Rakhlin, Alexander and Zhai, Xiyu},
journal={arXiv preprint arXiv:1908.10292},
year={2019}
}
-
arxiv_citations
{"1710.03667":true,"1906.11300":true,"1812.11118":true,"1903.07571":true,"1905.04708":true,"1911.05822":true,"1809.10374":true,"1808.00387":true,"1908.10292":true,"1908.05355":true,"1906.03667":true,"1903.09139":true,"1912.02292":true,"1810.08591":true,"1810.09665":true,"1901.01608":true,"1806.09471":true,"1802.01396":true}
-
alignment_newsletter
{"source":"alignment-newsletter","source_type":"google-sheets","converted_with":"python","venue":"arXiv","newsletter_category":"Deep learning","highlight":false,"newsletter_number":"AN #77","newsletter_url":"https://mailchi.mp/d2f2d15b7114/an-77-double-descent-a-unification-of-statistical-theory-and-modern-ml-practice","summarizer":"Rohin","summary":"This paper demonstrates the presence of double descent (in the size of the dataset) for _unregularized linear regression_. In particular, we assume that each data point x is a vector in independent samples from Normal(0, σ^2), and the output is y = βx + ε. Given a dataset of (x, y) pairs, we would like to estimate the unknown β, under the mean squared error loss, with no regularization.\n\nIn this setting, when the dimensionality d of the space (and thus number of parameters in β) is equal to the number of training points n, the training data points are linearly independent almost always / with probability 1, and so there will be exactly one β that solves the n linearly independent equalities of the form βx = y. However, such a β must also be fitting the noise variables ε, which means that it could be drastically overfitted, with very high norm. For example, imagine β = [1, 1], so that y = x1 + x2 + ε, and in our dataset x = (-1, 3) is mapped to y = 3 (i.e. an ε of +1), and x = (0, 1) is mapped to y = 0 (i.e. an ε of -1). Gradient descent will estimate that β = [-3, 0], which is going to generalize very poorly.\n\nAs we decrease the number of training points n, so that d > n, there are infinitely many settings of the d parameters of β that satisfy the n linearly independent equalities, and gradient descent naturally chooses the one with minimum norm (even without regularization). This limits how bad the test error can be. Similarly, as we increase the number of training points, so that d < n, there are too many constraints for β to satisfy, and so it ends up primarily modeling the signal rather than the noise, and so generalizing well.","opinion":"Basically what's happening here is that at the interpolation threshold, the model is forced to memorize noise, and it has only one way of doing so, which need not generalize well. However, past the interpolation threshold, when the model is overparameterized, there are _many_ models that successfully memorize noise, and gradient descent \"correctly\" chooses one with minimum norm. This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. Here that's happening because gradient descent chooses the minimum norm estimator that fits the noise; perhaps something similar is happening with neural nets.\n\nThis explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training \"fixes\" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. 
(Note that regularization is _not_ such an explanation, because regularization applies throughout training, and doesn't \"come into effect\" after the interpolation threshold.)","prerequisites":"nan","read_more":"nan","paper_version":"1912.07242v1","arxiv_id":"1912.07242","title":"More Data Can Hurt for Linear Regression: Sample-wise Double Descent","authors":["Preetum Nakkiran"],"date_published":"2019-12-16 08:28:26+00:00","data_last_modified":"2019-12-16 08:28:26+00:00","url":"http://arxiv.org/abs/1912.07242v1","abstract":"In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing \"double-descent\" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.","author_comment":"None","journal_ref":"None","doi":"None","primary_category":"stat.ML","categories":"['stat.ML', 'cs.LG', 'cs.NE', 'math.ST', 'stat.TH']","individual_summary":"Title: More Data Can Hurt for Linear Regression: Sample-wise Double Descent\nAuthors: Preetum Nakkiran\nPaper abstract: In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing \"double-descent\" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.\nSummary: This paper demonstrates the presence of double descent (in the size of the dataset) for _unregularized linear regression_. In particular, we assume that each data point x is a vector in independent samples from Normal(0, σ^2), and the output is y = βx + ε. Given a dataset of (x, y) pairs, we would like to estimate the unknown β, under the mean squared error loss, with no regularization.\n\nIn this setting, when the dimensionality d of the space (and thus number of parameters in β) is equal to the number of training points n, the training data points are linearly independent almost always / with probability 1, and so there will be exactly one β that solves the n linearly independent equalities of the form βx = y. However, such a β must also be fitting the noise variables ε, which means that it could be drastically overfitted, with very high norm. For example, imagine β = [1, 1], so that y = x1 + x2 + ε, and in our dataset x = (-1, 3) is mapped to y = 3 (i.e. an ε of +1), and x = (0, 1) is mapped to y = 0 (i.e. an ε of -1). 
Gradient descent will estimate that β = [-3, 0], which is going to generalize very poorly.\n\nAs we decrease the number of training points n, so that d > n, there are infinitely many settings of the d parameters of β that satisfy the n linearly independent equalities, and gradient descent naturally chooses the one with minimum norm (even without regularization). This limits how bad the test error can be. Similarly, as we increase the number of training points, so that d < n, there are too many constraints for β to satisfy, and so it ends up primarily modeling the signal rather than the noise, and so generalizing well.\nMy opinion: Basically what's happening here is that at the interpolation threshold, the model is forced to memorize noise, and it has only one way of doing so, which need not generalize well. However, past the interpolation threshold, when the model is overparameterized, there are _many_ models that successfully memorize noise, and gradient descent \"correctly\" chooses one with minimum norm. This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. Here that's happening because gradient descent chooses the minimum norm estimator that fits the noise; perhaps something similar is happening with neural nets.\n\nThis explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training \"fixes\" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. (Note that regularization is _not_ such an explanation, because regularization applies throughout training, and doesn't \"come into effect\" after the interpolation threshold.)","paper_text":"","text":"Highlights\n[Deep Double Descent](https://openai.com/blog/deep-double-descent/) *(Preetum Nakkiran et al)* (summarized by Rohin): This blog post provides empirical evidence for the existence of the *double descent* phenomenon, proposed in an earlier paper summarized below. Define the *effective model complexity* (EMC) of a training procedure and a dataset to be the maximum size of training set such that the training procedure achieves a *train* error of at most ε (they use ε = 0.1). Let's suppose you start with a small, underparameterized model with low EMC. Then initially, as you increase the EMC, the model will achieve a better fit to the data, leading to lower test error. However, once the EMC is approximately equal to the size of the actual training set, then the model can \"just barely\" fit the training set, and the test error can increase or decrease. Finally, as you increase the EMC even further, so that the training procedure can easily fit the training set, the test error will once again *decrease*, causing a second descent in test error. 
This unifies the perspectives of statistics, where larger models are predicted to overfit, leading to increasing test error with higher EMC, and modern machine learning, where the common empirical wisdom is to make models as big as possible and test error will continue decreasing.\nThey show that this pattern arises in a variety of simple settings. As you increase the width of a ResNet up to 64, you can observe double descent in the final test error of the trained model. In addition, if you fix a large overparameterized model and change the number of epochs for which it is trained, you see another double descent curve, which means that simply training longer can actually *correct overfitting*. Finally, if you fix a training procedure and change the size of the dataset, you can see a double descent curve as the size of the dataset decreases. This actually implies that there are points in which *more data is worse*, because the training procedure is in the critical interpolation region where test error can increase. Note that most of these results only occur when there is *label noise* present, that is, some proportion of the training set (usually 10-20%) is given random incorrect labels. Some results still occur without label noise, but the resulting double descent peak is quite small. The authors hypothesize that label noise leads to the effect because double descent occurs when the model is misspecified, though it is not clear to me what it means for a model to be misspecified in this context.\n**Rohin's opinion:** While I previously didn't think that double descent was a real phenomenon (see summaries later in this email for details), these experiments convinced me that I was wrong and in fact there is something real going on. Note that the settings studied in this work are still not fully representative of typical use of neural nets today; the label noise is the most obvious difference, but also e.g. ResNets are usually trained with higher widths than studied in this paper. So the phenomenon might not generalize to neural nets as used in practice, but nonetheless, there's *some* real phenomenon here, which flies in the face of all of my intuitions.\nThe authors don't really suggest an explanation; the closest they come is speculating that at the interpolation threshold there's only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can \"choose\" from the various models that all fit the data, and that \"choice\" leads to better generalization. But this doesn't make sense to me, because whatever is being used to \"choose\" the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well. (For example, if you think that regularization is providing a simplicity bias that leads to better generalization, the regularization should also help models at the interpolation threshold, since you always regularize throughout training.)\nPerhaps one explanation could be that in order for the regularization to work, there needs to be a \"direction\" in the space of model parameters that doesn't lead to increased training error, so that the model can move along that direction towards a simpler model. Each training data point defines a particular direction in which training error will increase. 
So, when the number of training points is equal to the number of parameters, the training points just barely cover all of the directions, and then as you increase the number of parameters further, that starts creating new directions that are not constrained by the training points, allowing the regularization to work much better. (In fact, the [original paper](https://arxiv.org/abs/1812.11118), summarized below, *defined* the interpolation threshold as the point where number of parameters equals the size of the training dataset.) However, while this could explain model-wise double descent and training-set-size double descent, it's not a great explanation for epoch-wise double descent.\n**Read more:** [Paper: Deep Double Descent: Where Bigger Models and More Data Hurt](http://arxiv.org/abs/1912.02292)\nTechnical AI alignment\n \n\nProblems\n[Comment on Coherence arguments do not imply goal directed behavior](https://www.lesswrong.com/posts/EnN7cm3KaRrEAuWfa/comment-on-coherence-arguments-do-not-imply-goal-directed) *(Ronny Fernandez)* (summarized by Rohin): I [have argued](https://www.alignmentforum.org/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-imply-goal-directed-behavior) ([AN #35](https://mailchi.mp/bbd47ba94e84/alignment-newsletter-35)) that coherence arguments that argue for modeling rational behavior as expected utility maximization do not add anything to AI risk arguments. This post argues that there is a different way in which to interpret these arguments: we should only model a system to be an EU maximizer if it was the result of an optimization process, such that the EU maximizer model is the best model we have of the system. In this case, the best way to predict the agent is to imagine what we would do if we had its goals, which leads to the standard convergent instrumental subgoals.\n**Rohin's opinion:** This version of the argument seems to be more a statement about our epistemic state than about actual AI risk. For example, I know many people without technical expertise who anthropomorphize their laptops as though they were pursuing some goal, but they don't (and shouldn't) worry that their laptops are going to take over the world. More details in [this comment](https://www.lesswrong.com/posts/EnN7cm3KaRrEAuWfa/comment-on-coherence-arguments-do-not-imply-goal-directed#CsRXodmiBfZ9wCZwr).\nAI strategy and policy\n[How does the offense-defense balance scale?](https://www.tandfonline.com/doi/full/10.1080/01402390.2019.1631810) *(Ben Garfinkel et al)* (summarized by Flo): The offense-defense balance that characterises how easy it is to successfully attack others can affect what kinds of conflicts break out and how often that happens. This paper analyses how growing capabilities on both sides affect that balance. For example, consider an idealized model of cyber defense with a fixed set of vulnerabilities that are discovered independently by attackers and defenders. The attacker will initially be able to use almost all of the vulnerabilities they found. This is because, with only a small percentage of vulnerabilities discovered by both sides, the defender is unlikely to have found the same ones as the attacker. Marginal increases of the defender's capabilities are unlikely to uncover vulnerabilities used by the attacker in this regime, such that attacks become easier as both sides invest resources. 
Once most vulnerabilities have been found by both sides, this effect reverses as marginal investments by the attacker become unlikely to uncover vulnerabilities the defender has not fixed yet.\nThis pattern, where increasingly growing capabilities first favour offense but lead to defensive stability in the long run, dubbed **OD-scaling** seems to be common and can be expected to be found whenever there are **multiple attack vectors**, the attacker only needs to break through on some of them and the defender enjoys **local defense superiority**, meaning that with sufficient coverage by the defender for a given attack vector, it is almost impossible for the attacker to break through.\nBecause the use of digital and AI systems can be scaled up quickly, scale-dependent shifts of the offense-defense balance are going to increase in importance as these systems become ubiquitous.\n**Flo's opinion:** I found it quite surprising that the paper mentions a lack of academic consensus about whether or not offensive advantage is destabilizing. Assuming that it is, OD-scaling might provide a silver lining concerning cybersecurity, provided things can be scaled up sufficiently. These kinds of dynamics also seem to put a natural ceiling on arms races: above a certain threshold, gains in capabilities provide advantage to both sides such that resources are better invested elsewhere.\nOther progress in AI\n \n\nDeep learning\n[Reconciling modern machine learning practice and the bias-variance trade-off](https://arxiv.org/abs/1812.11118) *(Mikhail Belkin et al)* (summarized by Rohin): This paper first proposed double descent as a general phenomenon, and demonstrated it in three machine learning models: linear predictors over random Fourier features, fully connected neural networks with one hidden layer, and forests of decision trees. Note that they define the interpolation threshold as the point where the number of parameters equals the number of training points, rather than using something like effective model complexiy.\nFor linear predictors over random Fourier features, their procedure is as follows: they generate a set of random features, and then find the linear predictor that minimizes the squared loss incurred. If there are multiple predictors that achieve zero squared loss, then they choose the one with the minimum L2 norm. The double descent curve for a subset of MNIST is very pronounced and has a huge peak at the point where the number of features equals the number of training points.\nFor the fully connected neural networks on MNIST, they make a significant change to normal training: prior to the interpolation threshold, rather than training the networks from scratch, they train them from the final solution found for the previous (smaller) network, but after the interpolation threshold they train from scratch as normal. With this change, you see a very pronounced and clear double descent curve. However, if you always train from scratch, then it's less clear -- there's a small peak, which the authors describe as \"clearly discernible\", but to me it looks like it could be noise.\nFor decision trees, if the dataset has n training points, they learn decision trees of size up to n leaves, and then at that point (the interpolation threshold) they switch to having ensembles of decision trees (called forests) to get more expressive function classes. 
Once again, you can see a clear, pronounced double descent curve.\n**Rohin's opinion:** I read this paper back when summarizing [Are Deep Neural Networks Dramatically Overfitted?](https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html) ([AN #53](https://mailchi.mp/534f448d6c8b/alignment-newsletter-53)) and found it uncompelling, and I'm really curious how the ML community correctly seized upon this idea as deserving of further investigation while I incorrectly dismissed it. None of the experimental results in this paper are particularly surprising to me, whereas double descent itself is quite surprising.\nIn the random Fourier features and decision trees experiments, there is a qualitative difference in the *learning algorithm* before and after the interpolation threshold, that suffices to explain the curve. With the random Fourier features, we only start regularizing the model after the interpolation threshold; it is not surprising that adding regularization helps reduce test loss. With the decision trees, after the interpolation threshold, we start using ensembles; it is again not at all surprising that ensembles help reduce test error. (See also [this comment](https://www.alignmentforum.org/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent#6bqY4oSFncLTJgYdm).) So yeah, if you start regularizing (via L2 norm or ensembles) after the interpolation threshold, that will help your test error, but in practice we regularize throughout the training process, so this should not occur with neural nets.\nThe neural net experiments also have a similar flavor -- the nets before the interpolation threshold are required to reuse weights from the previous run, while the ones after the interpolation threshold do not have any such requirement. When this is removed, the results are much more muted. The authors claim that this is necessary to have clear graphs (where training risk monotonically decreases), but it's almost certainly biasing the results -- at the interpolation threshold, with weight reuse, the test squared loss is ~0.55 and test accuracy is ~80%, while without weight reuse, test squared loss is ~0.35 and test accuracy is ~85%, a *massive* difference and probably not within the error bars.\nSome speculation on what's happening here: neural net losses are nonconvex and training can get stuck in local optima. A pretty good way to get stuck in a local optimum is to initialize half your parameters to do something that does quite well while the other half are initialized randomly. So with weight reuse we might expect getting stuck in worse local optima. However, it looks like the training losses are comparable between the methods. Maybe what's happening is that with weight reuse, the half of parameters that are initialized randomly memorize the training points that the good half of the parameters can't predict, which doesn't generalize well but does get low training error. 
Meanwhile, without weight reuse, all of the parameters end up finding a good model that does generalize well, for whatever reason it is that neural nets do work well.\nBut again, note that the authors were right about double descent being a real phenomenon, while I was wrong, so take all this speculation with many grains of salt.\n[More Data Can Hurt for Linear Regression: Sample-wise Double Descent](https://arxiv.org/abs/1912.07242) *(Preetum Nakkiran)* (summarized by Rohin): This paper demonstrates the presence of double descent (in the size of the dataset) for *unregularized linear regression*. In particular, we assume that each data point x is a vector in independent samples from Normal(0, σ^2), and the output is y = βx + ε. Given a dataset of (x, y) pairs, we would like to estimate the unknown β, under the mean squared error loss, with no regularization.\nIn this setting, when the dimensionality d of the space (and thus number of parameters in β) is equal to the number of training points n, the training data points are linearly independent almost always / with probability 1, and so there will be exactly one β that solves the n linearly independent equalities of the form βx = y. However, such a β must also be fitting the noise variables ε, which means that it could be drastically overfitted, with very high norm. For example, imagine β = [1, 1], so that y = x1 + x2 + ε, and in our dataset x = (-1, 3) is mapped to y = 3 (i.e. an ε of +1), and x = (0, 1) is mapped to y = 0 (i.e. an ε of -1). Gradient descent will estimate that β = [-3, 0], which is going to generalize very poorly.\nAs we decrease the number of training points n, so that d > n, there are infinitely many settings of the d parameters of β that satisfy the n linearly independent equalities, and gradient descent naturally chooses the one with minimum norm (even without regularization). This limits how bad the test error can be. Similarly, as we increase the number of training points, so that d < n, there are too many constraints for β to satisfy, and so it ends up primarily modeling the signal rather than the noise, and so generalizing well.\n**Rohin's opinion:** Basically what's happening here is that at the interpolation threshold, the model is forced to memorize noise, and it has only one way of doing so, which need not generalize well. However, past the interpolation threshold, when the model is overparameterized, there are *many* models that successfully memorize noise, and gradient descent \"correctly\" chooses one with minimum norm. This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. Here that's happening because gradient descent chooses the minimum norm estimator that fits the noise; perhaps something similar is happening with neural nets.\nThis explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training \"fixes\" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. 
(Note that regularization is *not* such an explanation, because regularization applies throughout training, and doesn't \"come into effect\" after the interpolation threshold.)\n[Understanding “Deep Double Descent”](https://www.alignmentforum.org/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent) *(Evan Hubinger)* (summarized by Rohin): This post explains deep double descent (in more detail than my summaries), and speculates on its relevance to AI safety. In particular, Evan believes that deep double descent shows that neural nets are providing strong inductive biases that are crucial to their performance -- even *after* getting to ~zero training loss, the inductive biases *continue* to do work for us, and find better models that lead to lower test loss. As a result, it seems quite important to understand the inductive biases that neural nets use, which seems particularly relevant for e.g. [mesa optimization and pseudo alignment](https://arxiv.org/abs/1906.01820) ([AN #58](https://mailchi.mp/92b3a9458c2d/an-58-mesa-optimization-what-it-is-and-why-we-should-care)).\n**Rohin's opinion:** I certainly agree that neural nets have strong inductive biases that help with their generalization; a clear example of this is that neural nets can learn [randomly labeled data](https://arxiv.org/abs/1611.03530) (which can never generalize to the test set), but nonetheless when trained on correctly labeled data such nets do generalize to test data. Perhaps more surprising here is that the inductive biases help even *after* fully capturing the data (achieving zero training loss) -- you might have thought that the data would swamp the inductive biases. This might suggest that powerful AI systems will become simpler over time (assuming an inductive bias towards simplicity). However, this is happening in the regime where the neural nets are overparameterized, so it makes sense that inductive biases would still play a large role. I expect that in contrast, powerful AI systems will be severely underparameterized, simply because of *how much data* there is (for example, [the largest GPT-2 model still underfits the data](https://blog.openai.com/better-language-models/) ([AN #46](https://mailchi.mp/c48f996a5db5/alignment-newsletter-46))).\n[Uniform convergence may be unable to explain generalization in deep learning](https://locuslab.github.io/2019-07-09-uniform-convergence/) *(Vaishnavh Nagarajan)* (summarized by Rohin): This post argues that existing generalization bounds cannot explain the empirical success of neural networks at generalizing to the test set.\n\"What?\", you say if you're like me, \"didn't we already know this? Generalization bounds depend on your hypothesis space being sufficiently small, but [neural nets can represent any reasonable function](https://en.wikipedia.org/wiki/Universal_approximation_theorem)? 
And even if you avoid that by considering the size of the neural net, we know that empirically [neural nets can learn randomly labeled data](https://arxiv.org/abs/1611.03530), which can never generalize; surely this means that you can't explain generalization without reference to some property of the dataset, which generalization bounds typically don't do?\"\nIt turns out that the strategy has been to prove generalization bounds that depend on the *norm of the weights of the trained model* (for some norm that depends on the specific bound), which gets around both these objections, since the resulting bounds are independent of the number of parameters, and depend on the trained model (which itself depends on the dataset). However, when these bounds are evaluated on a simple sphere-separation task, they *increase* with the size of the training dataset, because the norms of the trained models increase.\nOkay, but can we have a stronger argument than mere empirical results? Well, all of these bounds depend on a *uniform convergence bound*: a number that bounds the absolute difference between the train and test error for *any* model in your hypothesis space. (I assume the recent generalization bounds only consider the hypothesis space \"neural nets with norms at most K\", or some suitable overapproximation of that, and this is how they get a not-obviously-vacuous generalization bound that depends on weight norms. However, I haven't actually read those papers.)\nHowever, no matter what hypothesis space these bounds choose, to get a valid generalization bound the hypothesis space must contain (nearly) all of the models that would occur by training the neural net on a dataset sampled from the underlying distribution. What if we had the actual smallest such hypothesis space, which only contained the models that resulted from an actual training run? The authors show that, at least on the sphere-separation task, the uniform convergence bound is still extremely weak. Let's suppose we have a training dataset S. Our goal is now to find a model in the hypothesis space which has a high absolute difference between actual test error, and error in classifying S. (Recall that uniform convergence requires you to bound the absolute difference for *all* models in your hypothesis class, not just the one trained on S.) The authors do so by creating an \"adversarial\" training dataset S' that also could have been sampled from the underlying distribution, and training a model on S'. This model empirically gets S almost completely wrong. Thus, this model has low test error, but high error in classifying S, which forces the uniform convergence bound to be very high.\n**Rohin's opinion:** I enjoyed this blog post a lot (though it took some time to digest it, since I know very little about generalization bounds). It constrains the ways in which we can try to explain the empirical generalization of neural networks, which I for one would love to understand. 
Hopefully future work will explore new avenues for understanding generalization, and hit upon a more fruitful line of inquiry.\n**Read more:** [Paper](https://arxiv.org/abs/1902.04742)\n[Understanding the generalization of ‘lottery tickets’ in neural networks](https://ai.facebook.com/blog/understanding-the-generalization-of-lottery-tickets-in-neural-networks) *(Ari Morcos et al)* (summarized by Flo): The [lottery ticket hypothesis](https://arxiv.org/abs/1903.01611) ([AN #52](https://mailchi.mp/1e757d9b05cb/alignment-newsletter-52)) states that a randomly initialized dense or convolutional neural network contains (sparse) subnetworks, called \"winning tickets\", which can be trained to achieve performance similar to the trained base network while requiring a lot less compute.\nThe blogpost summarizes facebook AI's recent investigations of the generalization of winning tickets and the generality of the hypothesis. Because winning tickets are hard to find, we would like to reuse the ones we have found for similar tasks. To test whether this works, the authors trained classifiers, pruned and reset them to obtain winning tickets on different image datasets and then trained these on other datasets. Winning tickets derived from similar datasets relevantly outperform random subnetworks after training and ones derived from larger or more complex datasets generalize better. For example, tickets from ImageNet are consistently among the best and tickets from CIFAR-100 generalize better than those from CIFAR-10.\nExperiments in natural language processing and reinforcement learning suggest that the lottery ticket hypothesis is not just a peculiarity of image classification: for example, the performance of a large transformer model could be recovered from a winning ticket with just a third of the original weights, whereas random tickets with that amount of weights performed quite a bit worse. The analysis of simple shallow neural networks in a student-teacher setting is used as a toy model: when a larger student network is trained to mimic a smaller teacher with the same amount of layers, **student specialization** happens: some of the student's neurons learn to imitate single neurons of the teacher. This can be seen to happen more often and faster if the student neuron is already close to the teacher neuron at initialization. If the student network is large enough, every teacher neuron will be imitated by some student neuron and these student neurons collectively form a winning ticket.\n**Flo's opinion:** I enjoyed reading this blogpost and like the idea of using winning tickets for transfer learning. I would have been quite surprised if they had found that the lottery ticket hypothesis was specific to image classification, as similar to pretraining, winning tickets seem to provide an inductive bias constraining the set of features that can be learnt during training to more useful ones. 
I do not think that further research into that direction will directly help with quickly training models for novel tasks unless the tickets can be identified very efficiently which seems like a harder optimization problem than just training a network by gradient descent.\n[Recent Progress in the Theory of Neural Networks](https://www.alignmentforum.org/posts/KrQvZM8uFjSTJ7hq3/recent-progress-in-the-theory-of-neural-networks-1) *(interstice)*\nNews\n[AI Safety Camp Toronto](https://aisafetycamp.com/ai-safety-camp-toronto/) (summarized by Rohin): The next [AI safety camp](https://aisafetycamp.com/2018/06/06/the-first-ai-safety-camp-onwards/) ([AN #10](https://mailchi.mp/d1a19c140226/alignment-newsletter-10)) will be held in early May, in Toronto. Apply [here](https://docs.google.com/forms/d/e/1FAIpQLSfZdu--EII061-KwWDSK6hZ5rtLCpBarKszw9btMs1dO1NOFA/viewform) by Jan 5."}
abstract: |
In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing "double descent" phenomena in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.
author:
- |
Preetum Nakkiran
Harvard University
bibliography:
- refs.bib
title: |
More Data Can Hurt for Linear Regression:
Sample-wise Double Descent
Introduction
Common statistical intuition suggests that more data should never harm the performance of an estimator. It was recently highlighted in [@deep] that this may not hold for overparameterized models: there are settings in modern deep learning where training on more data actually hurts. In this note, we analyze a simple setting to understand the mechanisms behind this behavior.
[Figure 1 (fig:exp): experiment. Expected test MSE of the gradient descent estimator as the number of samples $n$ varies, for fixed dimension $d$.]
[Figure 2 (fig:theory): theory. Approximate theoretical predictions for the test MSE as $n$ varies.]
We focus on well-specified linear regression with Gaussian covariates, and we analyze the test risk of the minimum-norm ridgeless regression estimator, or equivalently, the estimator found by gradient descent on the least squares objective. We show that as we increase the number of samples, performance is non-monotonic: the test risk first decreases, then increases, before decreasing again.
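As a quick numerical check of this equivalence (a sketch of our own, with arbitrary illustrative dimensions $n = 20$, $d = 50$), plain gradient descent initialized at zero on the least squares objective converges to the minimum-norm solution $X^\dagger y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                # overparameterized: n < d
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent from zero on ||X b - y||^2.
b = np.zeros(d)
step = 0.5 / np.linalg.norm(X, ord=2) ** 2   # safely below 1 / lambda_max(X^T X)
for _ in range(20000):
    b -= step * 2 * X.T @ (X @ b - y)

b_min_norm = np.linalg.pinv(X) @ y           # minimum-norm interpolating solution
print(np.max(np.abs(b - b_min_norm)))        # ~0 up to numerical error
```

The reason is that gradient descent started at zero never leaves the row space of $X$, so among all interpolating solutions it can only converge to the one of minimum norm.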
Such "double-descent" behavior has been observed in the test risk as a function of model size in a variety of machine learning settings [@opper1995statistical; @opper2001learning; @advani2017high; @belkin2018reconciling; @spigler2018jamming; @geiger2019jamming; @deep]. Many of these works are motivated by understanding the test risk as a function of model size, for a fixed number of samples. In this work, we take a complementary view and understand the test risk as a function of sample size, for a fixed model. We hope that understanding such simple settings can eventually lead to understanding the general phenomenon, and lead us to design learning algorithms which make the best use of data (and in particular, are monotonic in samples).
We note that similar analyses appear in recent works, which we discuss in Section 1.1 -- our focus is to highlight the sample non-monotonicity implicit in these works, and give intuitions for the mechanisms behind it. We specifically refer the reader to [@hastie2019surprises; @mei2019generalization] for analysis in a setting most similar to ours.
Organization.
We first define the linear regression setting in Section 2. Then in Section 3 we state the form of the estimator found by gradient descent, and give intuitions for why this estimator has a peak in test risk when the number of samples is equal to the ambient dimension. In Section 3.1, we decompose the expected excess risk into bias and variance contributions, and we state approximate expressions for the bias, variance, and excess risk as a function of samples. We show that these approximate theoretical predictions closely agree with practice, as in Figures 1 and 2.
The peak in test risk turns out to be related to the conditioning of the data matrix, and in Section 3.2 we give intuitions for why this matrix is poorly conditioned in the "critical regime", but well conditioned outside of it. We also analyze the marginal effect of adding a single sample on the test risk, in Section 3.3. We conclude with discussion and open questions in Section 4.
Related Works {#sec:related}
This work was inspired by the long line of work studying "double descent" phenomena in deep and shallow models. The general principle is that as the model complexity increases, the test risk of trained models first decreases and then increases (the standard U-shape), and then decreases again. The peak in test risk occurs in the "critical regime" when the models are just barely able to fit the training set. The second descent occurs in the "overparameterized regime", when the model capacity is large enough to contain several interpolants on the training data. This phenomenon appears to be fairly universal among natural learning algorithms, and is observed in simple settings such as linear regression, random features regression, classification with random forests, as well as modern neural networks. Double descent of test risk with model size was introduced in generality by [@belkin2018reconciling], building on similar behavior observed as early as [@opper1995statistical; @opper2001learning] and more recently by [@advani2017high; @neal2018modern; @spigler2018jamming; @geiger2019jamming]. A generalized double descent phenomenon was demonstrated on modern deep networks by [@deep], which also highlighted "sample-wise nonmonotonicity" as a consequence of double descent -- showing that more data can hurt for deep neural networks.
A number of recent works theoretically analyze the double descent behavior in simplified settings, often for linear models [@belkin2019two; @hastie2019surprises; @bartlett2019benign; @muthukumar2019harmless; @bibas2019new; @Mitra2019UnderstandingOP; @mei2019generalization; @liang2018just; @liang2019risk; @xu2019number; @dereziski2019exact; @lampinen2018analytic; @deng2019model]. At a high level, these works analyze the test risk of estimators in overparameterized linear regression with different assumptions on the covariates. We specifically refer the reader to [@hastie2019surprises; @mei2019generalization] for rigorous analysis in a setting most similar to ours. In particular, [@hastie2019surprises] considers the asymptotic risk of the minimum norm ridgeless regression estimator in the limit where dimension $d$ and number of samples $n$ are scaled as $d \to \infty, n = \gamma d$. We instead focus on the sample-wise perspective: a fixed large $d$, but varying $n$. In terms of technical content, the analysis technique is not novel to our work, and similar calculations appear in some of the prior works above. Our main contribution is highlighting the sample non-monotonic behavior in a simple setting, and elaborating on the mechanisms responsible.
While many of the above theoretical results are qualitatively similar, we highlight one interesting distinction: our setting is well-specified, and the bias of the estimator is monotone nonincreasing in the number of samples (see Equation [eqn:bias]{reference-type="ref" reference="eqn:bias"}, and also [@hastie2019surprises Section 3]). In contrast, for misspecified problems (e.g. when the ground-truth is nonlinear, but we learn a linear model), the bias can actually increase with the number of samples, in addition to the variance increasing (see [@mei2019generalization]).
Problem Setup {#sec:linear}
Consider the following learning problem: The ground-truth distribution $\mathcal{D}$ is $(x, y) \in \R^d \x \R$ with covariates $x \sim \mathcal{N}(0, I_d)$ and response $y = \langle x, \beta \rangle + \mathcal{N}(0, \sigma^2)$ for some unknown, arbitrary $\beta$ such that $||\beta||_2 \leq 1$. That is, the ground-truth is an isotropic Gaussian with observation noise. We are given $n$ samples $(x_i, y_i)$ from the distribution, and we want to learn a linear model $f_{\hat\beta}(x) = \langle x, \hat\beta\rangle$ for estimating $y$ given $x$. That is, we want to find $\hat\beta$ with small test mean squared error $$\begin{aligned}
R(\hat\beta) &:=
\E_{(x, y) \sim \mathcal{D}}[(\langle x, \hat\beta\rangle - y)^2]\\
&= ||\hat\beta- \beta||^2 + \sigma^2 \tag{for isotropic $x \sim \mathcal{N}(0, I_d)$}\end{aligned}$$
Suppose we do this by performing ridgeless linear regression. Specifically, we run gradient descent initialized at $0$ on the following objective (the empirical risk).
$$\begin{aligned}
\min_{\hat\beta} ||X\hat\beta- y||^2
\label{eqn:obj}\end{aligned}$$ where $X \in \R^{n \x d}$ is the data-matrix of samples $x_i$, and $y \in \R^n$ are the observations.
The solution found by gradient descent at convergence is $\hat\beta= X^\dagger y$, where $\dagger$ denotes the Moore--Penrose pseudoinverse[^1]. Figure 1{reference-type="ref" reference="fig:exp"} plots the expected test MSE of this estimator $\E_{X, y}[R(\hat\beta)]$ as we vary the number of train samples $n$. Note that it is non-monotonic, with a peak in test MSE at $n=d$.
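Below is a minimal simulation sketch of this experiment (not the paper's original code; the values of $d$, $\sigma$, the number of trials, and the grid of $n$ are hypothetical illustration choices). It estimates the expected test MSE of the pseudoinverse estimator and exhibits the peak near $n = d$.

```python
# Minimal simulation sketch: estimate the expected test MSE of the
# pseudoinverse estimator beta_hat = pinv(X) @ y as the number of samples n
# varies, for a fixed dimension d. All constants are illustrative.
import numpy as np

def expected_test_mse(n, d, sigma=0.1, trials=50, seed=None):
    rng = np.random.default_rng(seed)
    beta = np.ones(d) / np.sqrt(d)           # any ground-truth with ||beta||_2 <= 1
    mses = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))      # isotropic Gaussian covariates
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y     # minimum-norm least-squares solution
        # test MSE for isotropic x: ||beta_hat - beta||^2 + sigma^2
        mses.append(np.sum((beta_hat - beta) ** 2) + sigma ** 2)
    return np.mean(mses)

d = 200
for n in [20, 100, 180, 200, 220, 400]:
    print(n, expected_test_mse(n, d, seed=0))   # the risk peaks near n = d
```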
There are two surprising aspects of the test risk in Figure 1{reference-type="ref" reference="fig:exp"}, in the overparameterized regime ($n < d$):
-
The first descent: where test risk initially decreases even when we have fewer samples $n$ than dimensions $d$. This occurs because the bias decreases.
-
The first ascent: where test risk increases, and peaks when $n=d$. This is because the variance increases, and diverges when $n=d$.
When $n > d$, this is the classical underparameterized regime, and test risk is monotone decreasing with number of samples.
Thus overparameterized linear regression exhibits a bias-variance tradeoff: bias decreases with more samples, but variance can increase. Below, we elaborate on the mechanisms and provide intuition for this non-monotonic behavior.
Analysis {#sec:analysis}
The solution found by gradient descent, $\hat\beta= X^\dagger y$, has different forms depending on the ratio $n/d$. When $n \geq d$, we are in the "underparameterized" regime and there is a unique minimizer of the objective in Equation [eqn:obj]{reference-type="ref" reference="eqn:obj"}. When $n < d$, we are "overparameterized" and there are many minimizers of Equation [eqn:obj]{reference-type="ref" reference="eqn:obj"}. In fact, since $X$ is full rank with probability 1, there are many minimizers which interpolate, i.e. $X\hat\beta= y$. In this regime, gradient descent finds the minimizer with the smallest $\ell_2$ norm $||\hat\beta||_2$. That is, the solution can be written as
$$\begin{aligned}
\hat\beta = X^\dagger y =
\begin{dcases}
\argmin_{\substack{\beta: X\beta = y}} ||\beta||^2
& \text{when $n \leq d$ \quad {\bf (``Overparameterized'')}}\\
\argmin_{\beta} ||X\beta - y||^2
& \text{when $n > d$ \quad {\bf (``Underparameterized'')}}
\end{dcases}\end{aligned}$$
The overparameterized form yields insight into why the test MSE peaks at $n = d$. Recall that the observations are noisy, i.e. $y = X\beta + \eta$ where $\eta \sim N(0, \sigma^2 I_n)$. When $n \ll d$, there are many interpolating estimators $\{ \hat\beta: X\hat\beta= y \}$, and in particular there exist such $\hat\beta$ with small norm. In contrast, when $n = d$, there is exactly one interpolating estimator $(X\hat\beta= y)$, but this estimator must have high norm in order to fit the noise $\eta$. More precisely, consider $$\begin{aligned}
\hat\beta&= X^\dagger y
= X^\dagger(X\beta + \eta)
= \underbrace{X^\dagger X\beta}_{\text{signal}} +
\underbrace{X^\dagger \eta}_{\text{noise}}\end{aligned}$$ The signal term $X^\dagger X \beta$ is simply the orthogonal projection of $\beta$ onto the rowspace of $X$. When we are "critically parameterized" and $n \approx d$, the data matrix $X$ is very poorly conditioned, and hence the noise term $X^\dagger \eta$ has high norm, overwhelming the signal. This argument is made precise in Section 3.1{reference-type="ref" reference="sec:biasvar"}, and in Section 3.2{reference-type="ref" reference="sec:conditioning"} we give intuition for why $X$ becomes poorly conditioned when $n \approx d$.
The main point is that when $n = d$, forcing the estimator $\hat\beta$ to interpolate the noise will force it to have very high norm, far from the ground-truth $\beta$. (See also Corollary 1 of [@hastie2019surprises] for a quantification of this point).
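A quick numerical illustration of this decomposition (a sketch with hypothetical $d$ and $\sigma$): the norm of the signal term stays bounded, while the norm of the noise term blows up as $n$ approaches $d$.

```python
# Sketch: compare the norms of the signal term pinv(X) @ X @ beta and the
# noise term pinv(X) @ eta as n approaches d. Constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 200, 0.1
beta = np.ones(d) / np.sqrt(d)
for n in [50, 150, 190, 200]:
    X = rng.standard_normal((n, d))
    eta = sigma * rng.standard_normal(n)
    Xp = np.linalg.pinv(X)
    signal = Xp @ (X @ beta)    # projection of beta onto the rowspace of X
    noise = Xp @ eta            # blows up when X is poorly conditioned
    print(n, np.linalg.norm(signal), np.linalg.norm(noise))
```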
Excess Risk and Bias-Variance Tradeoffs {#sec:biasvar}
For ground-truth parameter $\beta$, the excess risk[^2] of an estimator $\hat\beta$ is: $$\begin{aligned}
\bar R(\hat\beta) &:=
\E_{(x, y) \sim \mathcal{D}}[(\langle x, \hat\beta\rangle - y)^2]
- \E_{(x, y) \sim \mathcal{D}}[(\langle x, \beta \rangle - y)^2]\\
&= \E_{x \sim \mathcal{N}(0, I), \eta \sim \mathcal{N}(0, \sigma^2)}[(\langle x, \hat\beta\rangle - \langle x, \beta \rangle + \eta)^2]
- \sigma^2\\
&= ||\hat\beta- \beta||^2 \end{aligned}$$
For an estimator $\hat\beta_{X, y}$ that is derived from samples $(X, y) \sim \mathcal{D}^n$, we consider the expected excess risk of $\hat\beta= \hat\beta_{X, y}$ in expectation over samples $(X, y)$: $$\begin{aligned}
\E_{X, y}[ \bar R(\hat\beta_{X, y}) ]
= \E_{X, y}[ ||\hat\beta - \beta||^2]
= \underbrace{ ||\beta - \E[\hat \beta]||^2}_{\text{Bias } B_n}
+ \underbrace{\E[|| \hat\beta- \E[\hat\beta] ||^2 ]}_{\text{Variance } V_n}\end{aligned}$$ where $B_n, V_n$ are the bias and variance of the estimator on $n$ samples.
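(This is the standard decomposition: expanding around $\E[\hat\beta]$, the cross term vanishes because $\E[\hat\beta - \E[\hat\beta]] = 0$: $$\begin{aligned}
\E[ ||\hat\beta - \beta||^2]
&= \E[ ||(\hat\beta - \E[\hat\beta]) + (\E[\hat\beta] - \beta)||^2]\\
&= \E[||\hat\beta - \E[\hat\beta]||^2] + ||\E[\hat\beta] - \beta||^2
+ 2\langle \E[\hat\beta - \E[\hat\beta]],\; \E[\hat\beta] - \beta\rangle\\
&= V_n + B_n \;.)\end{aligned}$$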
For the specific estimator $\hat\beta= X^\dagger y$ in the regime $n \leq d$, the bias and variance can be written as (see Appendix [sec:biascomputation]{reference-type="ref" reference="sec:biascomputation"}): $$\begin{aligned}
B_n &= ||\E_{X \sim \mathcal{D}^n}[\text{Proj}_{X^{\perp}}(\beta)]||^2 \label{eqn:bias}\\
V_n &=
\underbrace{\E_{X} [|| \text{Proj}_X(\beta) - \E_X[\text{Proj}_X(\beta)] ||^2]}_{(A)}
+
\sigma^2 \underbrace{\E_X[\Tr((XX^T)^{-1})]}_{(B)}
\label{eqn:variance}\end{aligned}$$ where $\text{Proj}_{X}$ is the orthogonal projector onto the rowspace of the data $X \in \R^{n \x d}$, and $\text{Proj}_{X^\perp}$ is the projector onto the orthogonal complement of the rowspace.
From Equation [eqn:bias]{reference-type="ref" reference="eqn:bias"}, the bias is non-increasing with samples ($B_{n+1} \leq B_n$), since an additional sample can only grow the rowspace: $X_{n+1}^{\perp} \subseteq X_n^{\perp}$. The variance in Equation [eqn:variance]{reference-type="ref" reference="eqn:variance"} has two terms: the first term (A) is due to the randomness of $X$, and is bounded. But the second term (B) is due to the randomness in the noise of $y$, and diverges when $n \approx d$ since $X$ becomes poorly conditioned. This trace term is responsible for the peak in test MSE at $n=d$.
We can also approximately compute the bias, variance, and excess risk.
[[claim:main]]{#claim:main label="claim:main"} Let $\gamma := \frac{n}{d} < 1$ be the underparameterization ratio. The bias and variance are: $$\begin{aligned}
B_n &= (1-\gamma)^2||\beta||^2\\
V_n &\approx \gamma(1-\gamma)||\beta||^2
+ \sigma^2 \frac{\gamma}{1-\gamma}\end{aligned}$$ And thus the expected excess risk for $\gamma < 1$ is: $$\begin{aligned}
\E[\bar R (\hat\beta) ]
&\approx (1-\gamma)||\beta||^2
+ \sigma^2 \frac{\gamma}{1-\gamma}\\
&= \left(1-\frac{n}{d}\right)||\beta||^2
+ \sigma^2 \frac{n}{d-n}\end{aligned}$$
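(The excess-risk expression is simply the sum of the bias and variance above: $B_n + V_n \approx (1-\gamma)^2||\beta||^2 + \gamma(1-\gamma)||\beta||^2 + \sigma^2 \frac{\gamma}{1-\gamma} = (1-\gamma)||\beta||^2 + \sigma^2 \frac{\gamma}{1-\gamma}$.)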
These approximations are not exact because they hold asymptotically in the limit of large $d$ (when scaling $n = \gamma d$), but may deviate for finite samples. In particular, the bias $B_n$ and term (A) of the variance can be computed exactly for finite samples: $\text{Proj}_X$ is simply a projector onto a uniformly random $n$-dimensional subspace, so $\E[ \text{Proj}_X(\beta)] = \gamma \beta$, and similarly $\E[||\text{Proj}_X(\beta)||^2] = \gamma ||\beta||^2$. The trace term (B) is nontrivial to understand for finite samples, but converges[^3] to $\frac{\gamma}{1-\gamma}$ in the limit of large $n, d$ (e.g. Lemma 3 of [@hastie2019surprises]). In Section 3.3{reference-type="ref" reference="sec:single"}, we give intuitions for why the trace term converges to this.
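A finite-sample check of the exact projection identities above (a sketch with hypothetical $d$ and $n$; here $\text{Proj}_X(\beta)$ is computed as $X^\dagger X \beta$):

```python
# Sketch: check E[Proj_X(beta)] = gamma * beta and
# E[||Proj_X(beta)||^2] = gamma * ||beta||^2 by Monte Carlo.
# Dimensions and trial counts are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, n = 400, 100                      # gamma = 0.25
gamma = n / d
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)         # ||beta||_2 = 1
projs = []
for _ in range(200):
    X = rng.standard_normal((n, d))
    projs.append(np.linalg.pinv(X) @ (X @ beta))   # Proj_X(beta)
projs = np.array(projs)
print(np.linalg.norm(projs.mean(axis=0) - gamma * beta))   # close to 0
print(np.mean(np.sum(projs ** 2, axis=1)), gamma)          # ~ gamma * ||beta||^2
```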
For completeness, the bias, variance, and excess risk in the underparameterized regime are given in [@hastie2019surprises Theorem 1] as:
[[claim:underparam]]{#claim:underparam label="claim:underparam"} Let $\gamma := \frac{n}{d} > 1$ be the underparameterization ratio. The bias and variance are: $$\begin{aligned}
B_n = 0 \quad, \quad
V_n \approx \frac{\sigma^2}{\gamma - 1}\end{aligned}$$
Figure [fig:main]{reference-type="ref" reference="fig:main"} shows that Claims [claim:main]{reference-type="ref" reference="claim:main"} and [claim:underparam]{reference-type="ref" reference="claim:underparam"} agree with the excess risk experimentally even for finite $d=1000$.
Conditioning of the Data Matrix {#sec:conditioning}
Here we give intuitions for why the data matrix $X \in \R^{n \x d}$ is well conditioned for $n \ll d$, but has small singular values for $n \approx d$.
Near Criticality
First, let us consider the effect of adding a single sample when $n=(d-1)$. For simplicity, assume the first $(d-1)$ samples $x_i$ are just the standard basis vectors, scaled appropriately. That is, assume the data matrix $X \in \R^{(d-1) \x d}$ is $$X = \mqty[ \sqrt{d}\, I_{d-1} & 0].$$ This has all non-zero singular values equal to $\sqrt{d}$. Then, consider adding a new isotropic Gaussian sample $x_{n+1} \sim \mathcal{N}(0, I_d)$. Split this into coordinates as $x_{n+1} = (g_1, g_2) \in \R^{d-1} \x \R$. The new data matrix is $$X_{n+1} = \mqty[ \sqrt{d}\, I_{d-1} & 0 \\
g_1^T & g_2]$$ We claim that $X_{n+1}$ has small singular values. Indeed, consider left-multiplication by $v^T := \mqty[g_1^T & -\sqrt{d}]$: $$\begin{aligned}
v^T X_{n+1} =
\mqty[g_1^T & -\sqrt{d}]
\mqty[ \sqrt{d}\, I_{d-1} & 0 \\
g_1^T & g_2]
= \mqty[0 & -\sqrt{d}\, g_2]\end{aligned}$$ Thus, $||v^T X_{n+1}||^2 \approx d$, while $||v||^2 \approx 2d$. Since $X_{n+1}$ is full-rank, it must have a singular value less than roughly $\frac{1}{\sqrt{2}}$. That is, adding a new sample has shrunk the minimum non-zero singular value of $X$ from $\sqrt{d}$ to less than a constant.
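A small numerical sketch of this construction (the value $d = 500$ is a hypothetical choice):

```python
# Sketch: append one Gaussian row to X = [sqrt(d) I_{d-1}, 0] and watch the
# smallest non-zero singular value collapse from sqrt(d) to order 1.
import numpy as np

rng = np.random.default_rng(0)
d = 500
X = np.hstack([np.sqrt(d) * np.eye(d - 1), np.zeros((d - 1, 1))])
x_new = rng.standard_normal(d)
X_new = np.vstack([X, x_new])
print(np.linalg.svd(X, compute_uv=False).min())      # = sqrt(d) ~ 22.4
print(np.linalg.svd(X_new, compute_uv=False).min())  # typically order 1 or smaller
```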
The intuition here is: although the new sample $x_{n+1}$ adds rank to the existing samples, it does so in a very fragile way. Most of the $\ell_2$ mass of $x_{n+1}$ is contained in the span of the existing samples, and $x_{n+1}$ only contains a small component outside of this subspace. This causes $X_{n+1}$ to have small singular values, which in turn causes the ridgeless regression estimator (which applies $X^\dagger$) to be sensitive to noise.
A more careful analysis shows that the singular values are actually even smaller than the above simplification suggests --- since in the real setting, the matrix $X$ was already poorly conditioned even before the new sample $x_{n+1}$. In Section 3.3{reference-type="ref" reference="sec:single"} we calculate the exact effect of adding a single sample to the excess risk.
Far from Criticality
When $n \ll d$, the data matrix $X$ does not have singular values close to $0$. One way to see this is to notice that since our data model treats features and samples symmetrically, $X$ is well conditioned in the regime $n \ll d$ for the same reason that standard linear regression works in the classical underparameterized regime $n \gg d$ (by "transposing" the setting).
More precisely, since $X$ is full rank, its smallest non-zero singular value can be written as $$\sigma_{\text{min}}(X) =
\min_{v \in \R^n: ||v||_2 = 1} ||v^T X||_2$$ Since $X$ has entries i.i.d. $\mathcal{N}(0, 1)$, for every fixed vector $v$ we have $\E_X[ ||v^T X||^2 ] = d ||v||^2 = d$. Moreover, for $d = \Omega(n)$ uniform convergence holds, and $||v^T X||^2$ concentrates around its expectation for all vectors $v$ in the $\ell_2$ ball. Thus: $$\sigma_{\text{min}}(X)^2
\approx
\E_X\left[ \min_{v \in \R^n: ||v||_2 = 1} ||v^T X||^2 \right]
\approx
\min_{v} \E_X[ ||v^TX||^2 ]
= d$$
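A quick sanity check (a sketch with hypothetical $n$ and $d$): the smallest singular value of an $n \x d$ standard Gaussian matrix with $n \ll d$ is of order $\sqrt{d}$; more precisely it concentrates near $\sqrt{d} - \sqrt{n}$, which approaches $\sqrt{d}$ as $n/d \to 0$.

```python
# Sketch: smallest singular value of an n x d Gaussian matrix with n << d.
# Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5000
X = rng.standard_normal((n, d))
s_min = np.linalg.svd(X, compute_uv=False).min()
print(s_min ** 2, (np.sqrt(d) - np.sqrt(n)) ** 2, d)   # s_min^2 is of order d
```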
Effect of Adding a Single Sample {#sec:single}
Here we show how the trace term of the variance in Equation [eqn:variance]{reference-type="ref" reference="eqn:variance"} changes with increasing samples. Specifically, the following claim shows how $\Tr((XX^T)^{-1})$ grows when we add a new sample to $X$.
[[claim:trace]]{#claim:trace label="claim:trace"} Let $X \in \R^{n \x d}$ be the data matrix after $n$ samples, and let $x \in \R^d$ be the $(n+1)$th sample. The new data matrix is $X_{n+1} = \mqty[X \\ x^T]$, and $$\Tr((X_{n+1}X_{n+1}^T)^{-1})
=
\Tr[(XX^T)^{-1}]
+ \frac{1 + ||(X^T)^\dagger x||^2}{|| \mathrm{Proj}_{X^\perp}(x)||^2}$$
::: {.proof}
Proof. By computation in Appendix 5.2{reference-type="ref" reference="app:trace"}. ◻
:::
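This identity is easy to verify numerically (a sketch; the dimensions $n = 50$, $d = 200$ are hypothetical):

```python
# Sketch: numerical check of the trace identity in the claim above.
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 200
X = rng.standard_normal((n, d))
x = rng.standard_normal(d)
X_new = np.vstack([X, x])

lhs = np.trace(np.linalg.inv(X_new @ X_new.T))
proj_perp = x - np.linalg.pinv(X) @ (X @ x)           # Proj_{X^perp}(x)
rhs = np.trace(np.linalg.inv(X @ X.T)) \
      + (1 + np.sum((np.linalg.pinv(X.T) @ x) ** 2)) / np.sum(proj_perp ** 2)
print(lhs, rhs)   # agree up to floating-point error
```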
If we heuristically assume the denominator concentrates around its expectation, $||\text{Proj}_{X^\perp}(x)||^2 \approx d-n$, then we can use Claim [claim:trace]{reference-type="ref" reference="claim:trace"} to estimate the expected effect of a single sample:
$$\begin{aligned}
\E_{x}[ \Tr((X_{n+1}X_{n+1}^T)^{-1}) ]
&\approx
\Tr[(XX^T)^{-1}]
+ \frac{1 + \E_{x}||(XX^T)^{-1}Xx||^2}{d-n}\\
&=
\Tr[(XX^T)^{-1}]
+ \frac{1 + \Tr[(XX^T)^{-1}]}{d-n}\\
&=
\Tr[(XX^T)^{-1}]\left(1 + \frac{1}{d-n}\right)
+ \frac{1}{d-n}
\label{eqn:tr_single}\end{aligned}$$ where the second line uses $\E_x[xx^T] = I_d$, so that $\E_{x}||(XX^T)^{-1}Xx||^2 = \Tr(X^T(XX^T)^{-2}X) = \Tr[(XX^T)^{-1}]$.
We can further estimate the growth by taking a continuous limit for large $d$. Let $F(\frac{n}{d}) := \E[ \Tr((X_{n}X_{n}^T)^{-1})]$. Then for $\gamma := \frac{n}{d}$, Equation [eqn:tr_single]{reference-type="ref" reference="eqn:tr_single"} yields the differential equation $$\begin{aligned}
\frac{d F(\gamma)}{d\gamma}
= (1-\gamma)^{-1} F + (1-\gamma)^{-1}\end{aligned}$$ which is solved by $F(\gamma) = \frac{\gamma}{1-\gamma}$. This heuristic derivation that $\E[\Tr((XX^T)^{-1})] \to \frac{\gamma}{1-\gamma}$ is consistent with the rigorous asymptotics given in [@hastie2019surprises Lemma 3] and used in Claim [claim:main]{reference-type="ref" reference="claim:main"}.
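As a sanity check (a sketch with a hypothetical fixed $d$), the Monte Carlo average of $\Tr((XX^T)^{-1})$ tracks $\frac{\gamma}{1-\gamma}$ across a range of $\gamma$:

```python
# Sketch: compare the Monte Carlo average of Tr((X X^T)^{-1}) against the
# heuristic limit gamma / (1 - gamma). Constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 400
for gamma in [0.25, 0.5, 0.75, 0.9]:
    n = int(gamma * d)
    traces = []
    for _ in range(50):
        X = rng.standard_normal((n, d))
        traces.append(np.trace(np.linalg.inv(X @ X.T)))
    print(gamma, np.mean(traces), gamma / (1 - gamma))
```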
Discussion {#sec:discuss}
We hope that understanding such simple settings can eventually lead to understanding the general behavior of overparameterized models in machine learning. We consider it extremely unsatisfying that the most popular technique in modern machine learning (training an overparameterized neural network with SGD) can be nonmonotonic in samples [@deep]. We hope that a greater understanding here could help develop learning algorithms which make the best use of data (and in particular, are monotonic in samples).
In general, we believe it is interesting to understand when and why learning algorithms are monotonic -- especially when we don't explicitly enforce them to be.
Acknowledgements {#acknowledgements .unnumbered}
We especially thank Jacob Steinhardt and Aditi Raghunathan for discussions and suggestions that motivated this work. We thank Jarosław Błasiok, Jonathan Shi, and Boaz Barak for useful discussions throughout this work, and we thank Gal Kaplun and Benjamin L. Edelman for feedback on an early draft.
This work was supported in part by NSF awards CCF 1565264, CNS 1618026, and CCF 1715187, a Simons Investigator Fellowship, and a Simons Investigator Award.
Appendix: Computations
Bias and Variance
The computations in this section are standard.
[[sec:biascomputation]]{#sec:biascomputation label="sec:biascomputation"} Assume the data distribution and problem setting from Section 2{reference-type="ref" reference="sec:linear"}.
For samples $(X, y)$, the estimator is: $$\begin{aligned}
\hat\beta= X^\dagger y
=
\begin{cases}
X^T(XX^T)^{-1}y & \text{when $n \leq d$}\\
(X^TX)^{-1}X^Ty & \text{when $n > d$}
\end{cases}\end{aligned}$$
For $n \leq d$, the bias and variance of the estimator $\hat\beta= X^\dagger y$ are $$\begin{aligned}
B_n &= ||\E_{X \sim \mathcal{D}^n}[\text{Proj}_{X^{\perp}}(\beta)]||^2 \\
V_n &=
\underbrace{\E_{X} [|| \text{Proj}_X(\beta) - \E_X[\text{Proj}_X(\beta)] ||^2]}_{(A)}
+
\sigma^2 \underbrace{\E_X[\Tr((XX^T)^{-1})]}_{(B)}\end{aligned}$$
::: {.proof}
Proof. Bias. Note that $$\begin{aligned}
\beta - \E[\hat\beta] &=
\beta - \E_{X, \eta}[ X^T(XX^T)^{-1}(X\beta + \eta) ]\\
&= \E_X[(I - X^T(XX^T)^{-1}X)\beta]\\
&= \E_X[Proj_{X^\perp}(\beta)]\end{aligned}$$
Thus the bias is $$\begin{aligned}
B_n &= ||\beta - \E[\hat \beta]||^2\\
&= ||\E_{X_n}[Proj_{X_n^{\perp}}(\beta)]||^2\end{aligned}$$
Variance.
$$\begin{aligned}
V_n &= \E_{\hat\beta}[|| \hat\beta- \E[\hat\beta] ||^2 ]\\
&= \E_{X,\eta}[
||
X^T(XX^T)^{-1}(X\beta + \eta)
-\E_{X}[X^T(XX^T)^{-1}X\beta]
||^2 ]\\
&=
\E_{X, \eta} [||
(S - \bar{S})\beta
+
X^T(XX^T)^{-1}\eta
||^2 ] \tag{$S := X^T(XX^T)^{-1}X, \bar{S} := \E[S]$}\\
&=
\E_{X} [||
(S - \bar{S})\beta||^2]
+
\E_{X, \eta}[||X^T(XX^T)^{-1}\eta ||^2 ]\\
&=
\E_{X} [||
(S - \bar{S})\beta||^2]
+
\sigma^2 \E_X[\Tr((XX^T)^{-1})]\end{aligned}$$
Notice that $S$ is the projection onto the rowspace of $X$, i.e. $S = Proj_{X}$. Thus,
$$\begin{aligned}
V_n
&=
\E_{X} [|| Proj_X(\beta) - \E_X[Proj_X(\beta)] ||^2]
+
\sigma^2 \E_X[\Tr((XX^T)^{-1})]\end{aligned}$$ ◻
:::
Trace Computations {#app:trace}
::: {.proof}
Proof of Claim [claim:trace]{reference-type="ref" reference="claim:trace"}. Let $X \in \R^{n \x d}$ be the data matrix after $n$ samples, and let $x \in \R^d$ be the $(n+1)$th sample. The new data matrix is $X_{n+1} = \mqty[X \\ x^T]$, and $$X_{n+1}X_{n+1}^T
=
\mqty[XX^T & Xx\\
x^TX^T & x^Tx]$$
Now by Schur complements:
$$\begin{aligned}
(X_{n+1}X_{n+1}^T)^{-1}
&=
\mqty[XX^T & Xx\\
x^TX^T & x^Tx]^{-1}\\
&=
\mqty[
(XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1}
& -(XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1}\frac{Xx}{||x||^2}\\
\dots
&
(||x||^2 - x^TX^T(XX^T)^{-1}Xx )^{-1}
]\end{aligned}$$
Thus $$\begin{aligned}
\Tr((X_{n+1}X_{n+1}^T)^{-1})
&=
\Tr((XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1})
+
(||x||^2 - x^TX^T(XX^T)^{-1}Xx )^{-1}\\
&=
\Tr((XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1})
+
(x^T(x - Proj_X(x)) )^{-1}\\
&=
\Tr((XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1})
+
\frac{1}{||Proj_{X^\perp}(x)||^2}\end{aligned}$$
By Sherman--Morrison: $$\begin{aligned}
(XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1}
&=
(XX^T)^{-1} + \frac{(XX^T)^{-1}Xxx^TX^T(XX^T)^{-1}}{||x||^2 - x^TX^T(XX^T)^{-1}Xx}\\
\implies
\Tr\left[ (XX^T - \frac{Xxx^TX^T}{||x||^2})^{-1} \right]
&=
\Tr[(XX^T)^{-1}]
+ \frac{||(XX^T)^{-1}Xx||^2}{||Proj_{X^\perp}(x)||^2}\end{aligned}$$
Finally, we have
$$\begin{aligned}
\Tr((X_{n+1}X_{n+1}^T)^{-1})
&=
\Tr[(XX^T)^{-1}]
+ \frac{1 + ||(XX^T)^{-1}Xx||^2}{||Proj_{X^\perp}(x)||^2}\end{aligned}$$ or equivalently:
$$\Tr((X_{n+1}X_{n+1}^T)^{-1})
=
\Tr[(XX^T)^{-1}]
+ \frac{1 + ||(X^T)^\dagger x||^2}{||Proj_{X^\perp}(x)||^2}$$
or equivalently: $$\Tr((X_{n+1}X_{n+1}^T)^{-1})
=
\Tr[(XX^T)^{-1}]
+ \frac{1 + ||\gamma||^2}{||Proj_{X^\perp}(x)||^2}
\quad \text{ where }
\gamma := \argmin_v || X^T v - x ||^2$$ ◻
:::
[^1]: To see this, notice that the iterates of gradient descent lie in the row-space of $X$.
[^2]: For clarity, we consider the excess risk, which omits the unavoidable additive $\sigma^2$ error in the true risk.
[^3]: For large $d$, the spectrum of $(XX^T)$ is understood by the Marchenko--Pastur law [@marvcenko1967distribution]. Lemma 3 of [@hastie2019surprises] uses this to show that $\Tr((XX^T)^{-1}) \to \frac{\gamma}{1-\gamma}$.