This post is a follow-up to Understanding "Deep Double Descent".
I was talking to Rohin at NeurIPS about my post on double descent, and he asked the very reasonable question of why exactly I think double descent is so important. I realized that I hadn’t fully explained that in my previous post, so the goal of this post is to further address the question of why you should care about double descent from an AI safety standpoint. This post assumes you’ve read my Understanding "Deep Double Descent" post, so you should read that first before reading this if you haven’t already.
Specifically, I think double descent demonstrates a result that is, in my opinion, very important yet counterintuitive: larger models can actually be simpler than smaller models. On its face, this sounds somewhat crazy: how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent: in the double descent paradigm, larger models with zero training error generalize better than smaller models with zero training error because they do better on SGD’s inductive biases. And if you buy that SGD’s inductive biases are approximately simplicity, that means that larger models with zero training error are simpler than smaller models with zero training error.
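To make the claim above a bit more concrete, here is a minimal sketch in a toy linear setting rather than a neural network. It rests on assumptions that are not from the post: "simplicity" is measured by the L2 norm of the weights, the random features and noise labels are hypothetical, and the minimum-norm least-squares solution stands in for SGD's inductive bias. For linear least squares, gradient descent started from zero is known to converge to the minimum-norm interpolating solution, but that says nothing definitive about deep networks. Because the feature sets are nested, every wider model can reproduce any narrower interpolator by zero-padding it, so the minimum norm can only shrink as width grows.

```python
import numpy as np

def min_norm_interpolator(features, targets):
    """Return the minimum-L2-norm weights w with features @ w == targets."""
    return np.linalg.pinv(features) @ targets

rng = np.random.default_rng(0)
n_points, max_width = 20, 400
all_features = rng.normal(size=(n_points, max_width))  # hypothetical random features
targets = rng.normal(size=n_points)                    # hypothetical noisy labels

widths = [25, 50, 100, 200, 400]  # all wider than n_points, so all reach zero training error
norms, residuals = [], []
for width in widths:
    features = all_features[:, :width]  # nested: wider models contain narrower ones
    weights = min_norm_interpolator(features, targets)
    norms.append(float(np.linalg.norm(weights)))
    residuals.append(float(np.max(np.abs(features @ weights - targets))))

for width, norm in zip(widths, norms):
    print(f"width={width:4d}  weight norm={norm:.3f}")
```

Every model here fits the training data exactly, yet the weight norm falls as the model gets wider. That is one (norm-based) sense in which a larger zero-training-error model can be "simpler" than a smaller one.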
Obviously, larger models do have more parameters than smaller ones, so if that’s your measure of simplicity, larger models will always be more complicated, but for other measures of simplicity that’s not necessarily the case. For example, it could hypothetically be the case that larger models have lower Kolmogorov complexity. Though I don’t actually think that’s true in the case of K-complexity, I think that’s only for the boring reason that model weights have a lot of noise. If you had a way of somehow only counting the "essential complexity," I suspect larger models would actually have lower K-complexity.
Really, what I’m trying to do here is dispel what I see as the myth that as ML models get more powerful simplicity will stop mattering for them. In a Bayesian setting, it is a fact that the impact of your prior on your posterior (for those regions where your prior is non-zero[1]) becomes negligible as you update on more and more data. I have sometimes heard it claimed that as a consequence of this result, as we move to doing machine learning with ever larger datasets and ever bigger models, the impact of our training processes’ inductive biases will become negligible. However, I think that’s quite wrong, and I think double descent does a good job of showing why, because all of the performance gains you get past the interpolation threshold are coming from your implicit prior.[2] Thus, if you suspect modern ML to mostly be in that regime, what will matter in terms of which techniques beat out other techniques is how good they are at compressing their data into the "actually simplest" model that fits it.
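The Bayesian fact mentioned above (the prior washing out as data accumulates) can be sketched with a coin-flip example. This is only an illustration under assumptions I am adding: a Beta-Bernoulli model, two hypothetical and strongly opposed priors, and data that is exactly 70% heads. It shows the fixed-hypothesis-space setting where the "priors stop mattering" intuition comes from, not the past-the-interpolation-threshold regime the post argues is different.

```python
def posterior_mean(prior_heads, prior_tails, heads, tails):
    """Mean of the Beta posterior over a coin's bias after observing flips."""
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)

# Two strongly opposed (hypothetical) priors over the coin's bias.
skeptic = (1, 9)    # Beta(1, 9): expects mostly tails
believer = (9, 1)   # Beta(9, 1): expects mostly heads

gaps = []
for n_flips in (10, 100, 10_000):
    heads = 7 * n_flips // 10   # data: exactly 70% heads
    tails = n_flips - heads
    gap = abs(posterior_mean(*believer, heads, tails)
              - posterior_mean(*skeptic, heads, tails))
    gaps.append(gap)
    print(f"n={n_flips:6d}  disagreement between posteriors={gap:.4f}")
```

In this toy model the disagreement is 8 / (10 + n), so it shrinks toward zero as the number of flips n grows. The argument in the post is that this intuition does not carry over once model capacity is large enough that many hypotheses fit the data equally well.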
Furthermore, even just from the simple Bayesian perspective, I suspect you can still get double descent. For example, suppose your training process looks like the following: you have some hypothesis class that keeps getting larger as you train and at each time step you select the best a posteriori hypothesis. I think that this setup will naturally yield a double descent for noisy data: first you get a "likelihood descent" as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second "prior descent" where you’re selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.
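One possible way to write down the setup above is the following. The notation is my own assumption rather than anything from the post: H_t is the hypothesis class at step t, D is the training data, and I_t is the set of hypotheses in H_t that fit D exactly. The second line additionally assumes that, past the interpolation threshold, selection is effectively restricted to interpolating hypotheses, among which the likelihood is (approximately) constant.

```latex
% Growing hypothesis class and best a posteriori selection at each step
H_1 \subseteq H_2 \subseteq \cdots, \qquad
h_t = \arg\max_{h \in H_t} \; P(D \mid h)\, P(h)

% Past the interpolation threshold, with I_t = \{ h \in H_t : h \text{ fits } D \}
h_t \approx \arg\max_{h \in I_t} P(h), \qquad
\max_{h \in I_t} P(h) \;\le\; \max_{h \in I_{t+1}} P(h)
\quad \text{since } I_t \subseteq I_{t+1}
```

Under those assumptions, the "likelihood descent" corresponds to the first factor improving as H_t grows, and the "prior descent" corresponds to the last inequality: once the likelihood is saturated, a larger class can only raise the best available prior probability.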
All of this is only for models with zero training error, however—before you reach zero training error larger models can certainly have more essential complexity than smaller ones. That being said, if you don’t do very many steps of training then your inductive biases will also matter a lot because you haven’t updated that much on your data yet. In the double descent framework, the only region where your inductive biases don’t matter very much is right on the interpolation threshold—before the interpolation threshold or past it they should still be quite relevant.
Why does any of this matter from a safety perspective, though? Ever since I read Belkin et al. I’ve had double descent as part of my talk version of "Risks from Learned Optimization" because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.[3]
[1] Negating the impact of the prior not having support over some hypotheses requires realizability (see Embedded World-Models).
[2] Note that double descent happens even without explicit regularization, so the prior we’re talking about here is the implicit one imposed by the architecture you’ve chosen and the fact that you’re training it via SGD.
[3] Which is exactly what you should expect if you think Occam’s razor is the right prior: if two hypotheses have the same likelihood but one generalizes better, according to Occam’s razor it must be because it’s simpler.
One caveat worth noting about double descent – it only appears if you train far longer than necessary, i.e. "train forever". If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models. To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
Evan’s response (copied from a direct message, before I was approved to post here): It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point which is interesting re double descent, though, is what it says about *why* bigger models are better. That is, not only can bigger models fit larger datasets, according to the double descent story there’s also a meaningful sense in which bigger models have better inductive biases. The idea I’m objecting to is that there’s a sharp change from one regime (larger family of models) to the other (better inductive bias). I’d say that both factors smoothly improve performance over the full range of model sizes. I don’t fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.
Does anyone know if double descent happens when you look at the posterior predictive rather than just the output of SGD? I wouldn’t be too surprised if it does, but before we start talking about the Bayesian perspective, I’d like to see evidence that this isn’t just an artifact of using optimization instead of integration.
Planned summary for the Alignment newsletter:
Yep, that’s exactly my model.
If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
True for the Bayesian case, though unclear in the ML case: I think it’s quite plausible that current ML underweights the implicit prior of SGD relative to maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).
I think we would benefit from tabooing the word "simple". It seems to me that when people use the word "simple" in the context of ML, they are usually referring to either smoothness/Lipschitzness or minimum description length. But it’s easy to see that these metrics don’t always coincide. A random walk is smooth, but its minimum description length is long. A tall square wave is not smooth, but its description length is short. L2 regularization makes a model smoother without reducing its description length. Quantization reduces a model’s description length without making it smoother. I’m actually not aware of any argument that smoothness and description length are or should be related—it seems like this might be an unexamined premise.
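The random walk versus square wave contrast can be checked with a rough sketch. Both measures here are crude proxies that I am assuming for illustration: the largest jump between neighbouring samples stands in for (non-)smoothness, and the zlib-compressed size stands in for description length (it is only an upper bound on it). The signal length, step sizes, and wave height are hypothetical choices.

```python
import array
import random
import zlib

def max_jump(signal):
    """Crude smoothness proxy: largest change between neighbouring samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

def compressed_size(signal):
    """Crude description-length proxy: zlib-compressed size in bytes."""
    raw = array.array("h", signal).tobytes()  # 16-bit signed samples
    return len(zlib.compress(raw, 9))

random.seed(0)
length = 4096

# Random walk: every step is +1 or -1, so it is smooth but unpredictable.
walk = [0]
for _ in range(length - 1):
    walk.append(walk[-1] + random.choice((-1, 1)))

# Tall square wave: jumps of 1000, but a short repeating pattern.
square = [1000 if (i // 64) % 2 else 0 for i in range(length)]

for name, signal in (("random walk", walk), ("square wave", square)):
    print(f"{name:12s} max jump={max_jump(signal):5d}  "
          f"compressed bytes={compressed_size(signal)}")
```

Under these proxies the two orderings disagree: the random walk has the smaller jumps but the larger compressed size, while the square wave has the larger jumps but compresses to far fewer bytes.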
Based on your paper, the argument for mesa-optimizers seems to be about description length. But if SGD’s inductive biases target smoothness, it’s not clear why we should expect SGD to discover mesa-optimizers. Perhaps you think smooth functions tend to be more compressible than functions which aren’t smooth. I don’t think that’s enough. Imagine a Venn diagram where compressible functions are a big circle. Mesa-optimizers are a subset, and the compressible functions discovered by SGD are another subset. The question is whether these two subsets are overlapping. Pointing out that they’re both compressible is not a strong argument for overlap: "all cats are mammals, and all dogs are mammals, so therefore if you see a cat, it’s also likely to be a dog".
When I read your paper, I get a sense that optimizers outperform by allowing one to collapse a lot of redundant functionality into a single general method. It seems like maybe it’s the act of compression that gets you an agent, not the property of being compressible. If our model is a smooth function which could in principle be compressed using a single general method, I’m not seeing why the reapplication of that general method in a very novel context is something we should expect to happen.
BTW I actually do think minimum description length is something we’ll have to contend with long term. It’s just too useful as an inductive bias. (Eliminating redundancies in your cognition seems like a basic thing an AGI will need to do to stay competitive.) But I’m unconvinced SGD possesses the minimum description length inductive bias. Especially if e.g. the flat minima story is the one that’s true (as opposed to e.g. the lottery ticket story).
Also, I’m less confident that what I wrote above applies to RNNs.
I just edited the last sentence to be clearer in terms of what I actually mean by it.
What double descent definitely says is that for a fixed dataset, larger models with zero training error are simpler than smaller models with zero training error. I think it also says somewhat more than that, which is that larger models do have a real tendency towards being better at finding simpler models in general. That being said, the dataset on which the concept of a dog in your head was trained is presumably way larger than that of any ML model, so even if your brain is really good at implementing Occam’s razor and finding simple models, your model is still probably going to be more complicated.