How To Think About Overparameterized Models

https://www.lesswrong.com/posts/GfEmhoyS6ztk7GQAp/how-to-think-about-overparameterized-models

Contents

Ridges, Not Peaks

First things first: when optimizing ML models, we usually have some objective function where perfectly predicting every point in the training set yields the best possible score. In overparameterized models, we have enough parameters that training indeed converges to zero error, i.e. all data points in the training set are matched perfectly. Let’s pick one particular prediction setup to think about, so we can stick some equations on this. We have a bunch of (x, y) data points, and we want to predict y given x. Our ML model has some parameters \theta, and its prediction on a point x^{(n)} is f(x^{(n)}, \theta). In order to perfectly predict every data point in the training set, \theta must satisfy the equations \forall n: y^{(n)} = f(x^{(n)}, \theta) Assuming y^{(n)} is one-dimensional (i.e. just a number), and we have N data points, this gives us N equations. If \theta is k-dimensional, then we have N equations with k variables. If the number of variables is much larger than the number of equations (i.e. k >> N, parameter-dimension much greater than number of data points), then this system of equations will typically have many solutions. In fact, assuming there are any solutions at all, we can prove there are infinitely many—an entire high-dimensional surface of solutions in \theta-space. Proof: let \theta^* be a solution. If we make a small change d\theta^, then f(x^{(n)}, \theta) changes by \nabla_\theta f(x^{(n)}, \theta^) \cdot d\theta^. For all the equations to remain satisfied, after shifting \theta^ \rightarrow \theta^* + d\theta^, these changes must all be zero: \forall n: 0 = \nabla_\theta f(x^{(n)}, \theta^) \cdot d\theta^* Key thing to notice: this is a set of linear equations. There are still N equations and still k variables (this time d\theta^* rather than \theta), and since they’re linear, there are guaranteed to be at least k-N independent directions along which we can vary d\theta^* while still solving the equations (i.e. the right nullspace of the matrix \nabla_\theta f(x^{(n)}, \theta^*) has dimension at least k-N). These directions point exactly along the local surface on which the equations are solved. Takeaway: we have an entire surface of dimension (at least) k-N, sitting in the k-dimensional \theta-space, on which all points in the training data are predicted perfectly. What does this tell us about the shape of the objective function more generally? Well, we have this (at least) k-N dimensional surface on which the objective function achieves its best possible value. Everywhere else, it will be lower. The "global optimum" is not a point at the top of a single peak, but rather a surface at the high point of an entire high-dimensional ridge. So: picture ridges, not peaks. Ridges are harder to draw, ok?Before we move on, two minor comments on generalizing this model.

Priors and Sampling, Not Likelihoods and Estimation

So there’s an entire surface of optimal points. Obvious next question: if all of these points are optimal, what determines which one we pick? Short answer: mainly initial parameter values, which are typically randomly generated. Conceptually, we randomly sample trained parameter values from the perfect-prediction surface. To do that, we first sample some random initial parameter values, and then we train them—roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values. The key problem is to figure out what distribution of final (trained) parameter values results from the initial distribution of parameter values. One key empirical result: during training, the parameters in large overparameterized models tend to change by only a small amount. (There’s a great visual of this in this post. It’s an animation showing weights changing over the course of training; for the larger nets, they don’t visibly change at all.) In particular, this means that linear/​quadratic approximations (i.e. Taylor expansions) should work very well. For our purposes, we don’t even care about the details of the ridge-shape. The only piece which matters is that, as long as we’re close enough for quadratic approximations around the ridge to work well, the gradient will be perpendicular to the directions along which the ridge runs. So, gradient descent will take us from the initial point, to whatever point on the perfect-prediction surface is closest (under ordinary Euclidean distance) to the initial point. Stochastic gradient descent (as opposed to pure gradient descent) will contribute some noise—i.e. diffusion along the ridge-direction—but it should average out to roughly the same thing. From there, figuring out the distribution from which we effectively sample our trained parameter values is conceptually straightforward. For each point \theta^* on the perfect-prediction surface, add up the probability density of the initial parameter distribution at all the points which are closer to \theta^* than to any other point on the perfect-prediction surface. We can break this up into two factors:

Example: Overparameterized Linear Regression

As an example, let’s run a plain old linear regression. We’ll use an overparameterized model which is equivalent to a traditional linear regression model, in order to make the relationship clear. We have 100 (x, y) pairs, which look like this: I generated these with a "true" slope of 1, i.e. y = 1*x + noise, with standard normal noise.

Traditional-Style Regression

We have one parameter, c, and we fit a model y^{(n)} = cx^{(n)} + \xi^{(n)}, with standard normal-distributed noise \xi^{(n)}. This gives log likelihood logP[y|a] = -\frac{1}{2} \sum_n (y^{(n)} - cx^{(n)})^2 … plus some constants. We choose c^* to maximize this log-likelihood. In this case, c^* = 1.010, so the line looks like this:

(Slightly) Overparameterized Regression

We use the exact same model, y^{(n)} = cx^{(n)} + \xi^{(n)}, but now we explicitly consider the \xi^{(n)} terms "parameters". Now our parameters are (c, \xi^{(1)}, …, \xi^{(N)}), and we’ll initialize them all as samples from a standard normal distribution (so our "prior" on the noise terms is the same distribution assumed in the previous regression). We then optimize (c, \xi^{(1)}, …, \xi^{(N)}) to minimize the sum-of-squared-errors \frac{1}{2} \sum_n (y^{(n)} - cx^{(n)} - \xi^{(n)})^2 This ends up approximately the same as a Bayesian update on \forall n: y^{(n)} = cx^{(n)} - \xi^{(n)}, and our final c-value 1.046 is not an estimate, but rather a sample from the posterior. Although the "error" in our c-posterior-sample here is larger than the "error" in our c-estimate from the previous regression, the implied line is visually identical: Note that our model here is only slightly overparameterized; k=N+1, so the perfect prediction surface is one-dimensional. Indeed, the perfect prediction surface is a straight line in (c, \xi^{(1)}, …, \xi^{(N)}) - space, given by the equations y^{(n)} = cx^{(n)} + \xi^{(n)}.

(Very) Overparameterized Regression

Usually, we say that the noise terms are normal because they’re a sum of many small independent noise sources. To make a very overparameterized model, let’s make those small independent noise sources explicit: y^{(n)} = cx^{(n)} + \sqrt{\frac{3}{N}}\sum_{i=0}^{100} \xi^{(n)}_i. Our parameters are c and the whole 2D array of \xi’s, with standard normal initialization on c, and Uniform(-1, 1) initialization on \xi. (The \sqrt{\frac{3}{N}} is there to make the standard deviation equivalent to the original model.) As before, we minimize sum-of-squared-errors. This time our c-value is 1.031. The line still looks exactly the same. This time, we’re much more overparameterized—we have k = 100N + 1, so the perfect prediction surface has dimension 99N + 1. But conceptually, it still works basically the same as the previous example. Code for all these is here. In all these examples, the underlying probabilistic models are (approximately) identical. The latter two (approximately) sample from the posterior, rather than calculating a maximum-log-likelihood parameter estimate, but as long as the posterior for the slope parameter is very pointy, the result is nearly the same. The main difference is just what we call a "parameter" and optimize over, rather than integrating out.

Comment

https://www.lesswrong.com/posts/GfEmhoyS6ztk7GQAp/how-to-think-about-overparameterized-models?commentId=RCxznbo5d74HNLwnL

somehow I missed this post and only caught it now. This was helpful for a few things.

  • That I should think of some algorithms primarily as populating a space with the given data and then ‘deciding’ on the topology of the space

  • That ‘the valley of bad X’ is the inverse of a ‘goldilocks zone’

  • That overfitting can be thought of as occurring in a valley of bad parameterization.

https://www.lesswrong.com/posts/GfEmhoyS6ztk7GQAp/how-to-think-about-overparameterized-models?commentId=n6iPvcWJ3NGFJEwJn

Just read erhan10a.pdf (mlr.press) (Why Does Unsupervised Pre-training Help Deep Learning?, 2010) for a paper club at work. This post helped me understand and simplify the material, quite a lot. Thanks!