The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

https://www.lesswrong.com/posts/sCFm9yH3cohgxDtFu/the-variational-characterization-of-kl-divergence-error

Epistemological Status: Correct, probably too technical for Less Wrong, but these results are interesting and relevant.

There’s been a flurry of posts on generalization recently. Some talk about simplicity priors and others about the bias of SGD. However, I think it’s worth pointing out that there is already an alternative that works well enough to provide non-vacuous bounds for neural networks. Namely, I’m talking about the PAC-Bayes approach.

The twist I’ll give in this brief note is to motivate the bound from a biological perspective and then derive the full PAC-Bayes bound. The first part relies on the error threshold for replication and the patching interpretation of the KL-divergence. The second part relies on the Donsker-Varadhan variational characterization of KL-divergence.

Error Threshold

A replicate \hat R of an object R is an approximate copy (up to mutation). The environment determines whether or not an object gets to replicate via a fitness S(R). Suppose I is an instruction set or parameterization of R. When R replicates, I is copied.

We begin with a prior distribution over instructions \pi_0, and as the population replicates the distribution changes to \pi. The amount of information needed to achieve this modification is the patching cost, and it is equal to D_{KL}(\pi \Vert \pi_0).

Only information from the fitness landscape L(\pi) is useful for converging to the fittest individual(s). On the other hand, the amount of information available from the environment is equal to D_{KL}(L(\pi) \Vert L(\pi_0)). Thus, transitions are favorable only when we have,

D_{KL}(\pi \Vert \pi_0) < D_{KL}(L(\pi) \Vert L(\pi_0))

As an example, suppose that the space of instructions is the set of boolean strings of length |I| and that mutation flips each bit of the string independently with probability \epsilon. Suppose that I always survives and that our fitness is simply whether or not the replicate survives. This means L(I) = 1 and L(\hat I) is a Bernoulli random variable with parameter S. Then we have,

D_{KL}(I \Vert \hat I) = \sum_{i = 1}^{|I|} p_{I}(i) \log(p_{I}(i) / p_{\hat I}(i)) = \sum_{i = 1}^{|I|} \log(1/(1-\epsilon)) \simeq \epsilon \cdot |I|
D_{KL}(L(\hat I) \Vert L(I)) = \log(1 / S)
\Rightarrow \epsilon \cdot |I| + \log(S) < 0
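To make this concrete, here is a minimal numerical sketch in Python of the threshold condition \epsilon \cdot |I| + \log(S) < 0. The genome length, mutation rate, and survival probability below are numbers I made up for illustration.

```python
import math

def patch_cost(genome_len, eps):
    # Exact KL between the parent string and its mutated copy: each of the
    # |I| bits contributes log(1/(1-eps)), which is ~eps for small eps.
    return genome_len * math.log(1.0 / (1.0 - eps))

def environment_info(survival_prob):
    # Information supplied by the fitness landscape: log(1/S).
    return math.log(1.0 / survival_prob)

def transition_is_favorable(genome_len, eps, survival_prob):
    # Favorable only when the patching cost is below the available information.
    return patch_cost(genome_len, eps) < environment_info(survival_prob)

# Illustrative numbers: a 1000-bit genome, per-bit mutation rate 1e-4,
# and a survival probability of 0.8 for the replicate.
print(patch_cost(1000, 1e-4))                    # ~0.100, close to eps*|I|
print(environment_info(0.8))                     # ~0.223
print(transition_is_favorable(1000, 1e-4, 0.8))  # True: the transition is favorable
```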

PAC-Bayes Bound

Now suppose that the instructions are parameterizations for some sort of learning model and that the fitness is the training metric. In particular, the survival rate could be a classification error metric. If we are given n examples that induce an error rate of \epsilon, we have the requirement,

D_{KL}(\pi \Vert \pi_0) < n \cdot \log(1/(1-\epsilon))
\Rightarrow 1 - e^{-\frac{1}{n} D_{KL}(\pi \Vert \pi_0)} < \epsilon

This implies that using a lot of information to update our estimate of the optimal model parameters will require a high error rate.

Specifically, the amount of information used must grow sublinearly in the amount of data obtained if we want a low error rate, i.e., if we are to avoid an error catastrophe: a situation where the model continues to update despite already being at the optimum.
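A quick way to see the scaling (again with made-up numbers): the inequality above puts a floor of 1 - e^{-D_{KL}/n} on the achievable error rate, and the floor becomes severe once the information absorbed is comparable to the number of examples.

```python
import math

def error_floor(kl_nats, n_examples):
    # Smallest error rate compatible with absorbing kl_nats of information
    # from n_examples, per D_KL < n * log(1/(1-eps)).
    return 1.0 - math.exp(-kl_nats / n_examples)

# Hypothetical numbers with n = 1000 examples:
for kl in (10, 100, 1000):
    print(kl, round(error_floor(kl, n_examples=1000), 3))
# 10   -> 0.010
# 100  -> 0.095
# 1000 -> 0.632
```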

At this point it’s natural to wonder whether a sublinear relationship is also sufficient. This is precisely the content of the PAC-Bayes theorem. To show this, first note that for any f \in L^{\infty},

\langle f, \pi \rangle = \langle \log(e^f), \pi \rangle
= \left\langle \log \left(\frac{\pi}{\pi_0} \frac{\pi_0}{\pi} e^f \right), \pi \right\rangle
= \left\langle \log \left(\frac{\pi}{\pi_0} \right), \pi \right\rangle + \left\langle \log \left( \frac{\pi_0}{\pi} e^f \right), \pi \right\rangle
= D_{KL}(\pi \Vert \pi_0) + \left\langle \log \left( \frac{\pi_0}{\pi} e^f \right), \pi \right\rangle \quad \text{(Definition)}
\le D_{KL}(\pi \Vert \pi_0) + \log \left( \left\langle \frac{\pi_0}{\pi} e^f, \pi \right\rangle \right) \quad \text{(Jensen)}
= D_{KL}(\pi \Vert \pi_0) + \log \left( \langle e^f, \pi_0 \rangle \right)
\Rightarrow \langle f, \pi \rangle \le D_{KL}(\pi \Vert \pi_0) + \log \left( \langle e^f, \pi_0 \rangle \right)

As an aside, the astute reader might recognize this as a Young inequality between dual functions, in which case we have the famous relation,

\Rightarrow D_{KL}(\pi \Vert \pi_0) = \sup_{f \in L^{\infty}} \langle f, \pi \rangle - \log \left(\langle e^f , \pi_0\rangle \right) \quad \text{(Donsker-Varadhan)}
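As a sanity check, here is a small numerical sketch (arbitrary discrete distributions and an arbitrary f of my own choosing) of the change-of-measure inequality, together with the fact that the Donsker-Varadhan supremum is attained at f = \log(\pi / \pi_0):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary distributions on a 5-point space and an arbitrary bounded f.
pi0 = rng.random(5); pi0 /= pi0.sum()
pi = rng.random(5); pi /= pi.sum()
f = rng.normal(size=5)

kl = float(np.sum(pi * np.log(pi / pi0)))

lhs = float(np.dot(f, pi))                     # <f, pi>
rhs = kl + np.log(np.dot(np.exp(f), pi0))      # D_KL + log <e^f, pi_0>
assert lhs <= rhs + 1e-12

# The supremum in the Donsker-Varadhan formula is attained at f* = log(pi/pi0):
f_star = np.log(pi / pi0)
dv_value = np.dot(f_star, pi) - np.log(np.dot(np.exp(f_star), pi0))
assert np.isclose(dv_value, kl)
```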

Either way, it should now be clear that if we take f to be a scaled generalization error such as,

f = \lambda \left(\frac{1}{n} \sum_{i = 1}^n l(h(x_i), y_i) - \mathbb{E}[l(h(x), y)] \right) = \lambda (\hat L(h) - L(h))

then the left-hand side of our Young inequality yields an empirical loss term, similar to what we had before. The major difference is that we now have a contribution from a cumulant function, which we can bound using the Markov and Hoeffding inequalities,

\langle e^f, \pi_0 \rangle \le_{1 - \delta} \frac{1}{\delta} \mathbb{E}[\langle e^{f}, \pi_0 \rangle] \quad \text{(Markov)}
= \frac{1}{\delta} \langle \mathbb{E}[e^{f}], \pi_0 \rangle
\le \frac{1}{\delta} \langle e^{\lambda^2 / 2n}, \pi_0 \rangle = \frac{1}{\delta} e^{\lambda^2 / 2n}

Here the expectation is over the sample; exchanging it with the integral over hypotheses is only legitimate for the data-independent prior \pi_0, which is exactly why the cumulant term is evaluated against \pi_0 rather than \pi.
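Here is a quick Monte Carlo sketch of the cumulant bound (Bernoulli 0/1 losses, with a sample size, temperature \lambda, and true error rate that I picked arbitrarily); for losses in [0, 1] the empirical MGF of \lambda(\hat L(h) - L(h)) indeed stays below e^{\lambda^2 / 2n}:

```python
import numpy as np

rng = np.random.default_rng(1)

n, lam, p = 50, 5.0, 0.3        # sample size, temperature lambda, true risk L(h)
trials = 100_000

# Simulate the 0/1 losses of a fixed hypothesis h with true risk L(h) = p.
losses = rng.binomial(1, p, size=(trials, n))
gap = losses.mean(axis=1) - p   # \hat L(h) - L(h) for each simulated dataset

mgf_estimate = float(np.mean(np.exp(lam * gap)))
hoeffding_bound = float(np.exp(lam**2 / (2 * n)))

print(mgf_estimate, hoeffding_bound)   # the estimate sits well below the bound
assert mgf_estimate <= hoeffding_bound
```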

Now we put everything together and then optimize over \lambda.

\hat L(\pi) - L(\pi) = \frac{1}{\lambda} \langle f , \pi \rangle \le \frac{1}{\lambda} \left[ D_{KL}(\pi \Vert \pi_0) + \log \left( \langle e^f , \pi_0 \rangle \right) \right]
\le_{1 - \delta} \frac{D_{KL}(\pi \Vert \pi_0) + \log \left( \frac{1}{\delta} \right)}{\lambda} + \frac{\lambda}{2n}

Optimizing over \lambda (and applying the same argument to -f to control the other direction of the gap), we obtain,

L(\pi) \le \hat L(\pi) + \sqrt{\cfrac{D_{KL}(\pi \Vert \pi_0)+\log \left( \frac{1}{\delta} \right)}{2n}} \quad \text{(PAC-Bayes)}
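Putting the pieces together, here is a minimal helper (the function name and the numbers below are hypothetical, chosen just for illustration) that evaluates the right-hand side of the final bound:

```python
import math

def pac_bayes_bound(empirical_loss, kl_nats, n_examples, delta=0.05):
    # Upper bound on the true risk L(pi) holding with probability >= 1 - delta:
    # L(pi) <= L_hat(pi) + sqrt((D_KL(pi || pi_0) + log(1/delta)) / (2n)).
    complexity = math.sqrt((kl_nats + math.log(1.0 / delta)) / (2 * n_examples))
    return empirical_loss + complexity

# Hypothetical numbers: a posterior that spent 500 nats of "patching cost"
# relative to the prior, evaluated on 60,000 training examples.
print(pac_bayes_bound(empirical_loss=0.02, kl_nats=500.0, n_examples=60_000))
# ~0.085; the bound stays non-vacuous as long as the KL is well below n.
```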

Comment

https://www.lesswrong.com/posts/sCFm9yH3cohgxDtFu/the-variational-characterization-of-kl-divergence-error?commentId=FequpY4hCfBtt6m3Q

Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how L(\pi) and \hat{L}(\pi) depend on \pi_0 (if they do) and how to interpret the final inequality in this light? In particular, I am wondering because the bound seems to be best when \pi = \pi_0. Minor comment: I think n = m?

Comment

https://www.lesswrong.com/posts/sCFm9yH3cohgxDtFu/the-variational-characterization-of-kl-divergence-error?commentId=8ZNagJG2E4CjSk8hT

The term \pi is meant to be a posterior distribution after seeing data. If you have a good prior you could take \pi = \pi_0; however, note that L(\pi) could then be high. You want to trade off the cost of updating the prior against the loss reduction.

For example, say we have a neural network. Then our prior would be the initialization and the posterior would be the distribution of outputs from SGD.

(Btw thanks for the correction)

Comment

Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high \pi_0^n probability (i.e., as you say, \pi_0 is the data), while the learning bound or loss reduction is given for \pi.

Comment

https://www.lesswrong.com/posts/sCFm9yH3cohgxDtFu/the-variational-characterization-of-kl-divergence-error?commentId=i55sCDZekNyEFLw3K

I’m still confused about the part where you use the Hoeffding inequality—how is the lambda in that step and the lambda in the loss function "the same lambda"?

Comment

https://www.lesswrong.com/posts/sCFm9yH3cohgxDtFu/the-variational-characterization-of-kl-divergence-error?commentId=SQQDf2Np8pJczyTti

Because f = \lambda \cdot \Delta L. They are the same. Does that help?