Laplace Approximation

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation

The last couple posts compared some specific models for 20000 rolls of a die. This post will step back, and talk about more general theory for Bayesian model comparison.

The main problem is to calculate P[data | model] for some model. The model will typically give the probability of observed data x (e.g. die rolls) based on some unobserved parameter values \theta (e.g. the p’s in the last two posts), along with a prior distribution over \theta. We then need to compute

P[data|model] = \int_{\theta} P[data|\theta] dP[\theta]

which will be a hairy high-dimensional integral.

Some special model structures allow us to simplify the problem, typically by factoring the integral into a product of one-dimensional integrals. But in general, we need some method for approximating these integrals. The two most common approximation methods used in practice are Laplace approximation around the maximum-likelihood point, and MCMC (see e.g. here for an application of MCMC to Bayes factors). We’ll mainly talk about Laplace approximation here; in practice, MCMC mostly works well in the same cases, assuming the unobserved parameters are continuous.
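To make that integral concrete, here is a minimal sketch of the brute-force approach, for a model simple enough that the integral is one-dimensional and has a closed form to check against. The coin-flip model, the numbers, and the variable names are illustrative assumptions for this sketch, not from the posts in this sequence:

```python
# Brute-force numerical sketch of P[data|model] for a toy model:
# n coin flips with h heads, and a uniform prior over the heads-frequency p.
import numpy as np
from scipy.special import betaln
from scipy.integrate import trapezoid

n, h = 100, 62                       # made-up data: 62 heads in 100 flips

# P[data|model] = int P[data|p] dP[p], with dP[p] = dp (uniform prior on [0,1]).
# Work on a log scale: likelihoods like 10^-70 underflow floating point.
p = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = h * np.log(p) + (n - h) * np.log(1 - p)
m = log_lik.max()
log_evidence = np.log(trapezoid(np.exp(log_lik - m), p)) + m

# For this model the integral is a Beta function, so we can check the answer.
exact = betaln(h + 1, n - h + 1)
print(log_evidence, exact)           # both ~ -68.5; agreement to ~6 digits
```

Note the max-subtraction trick inside the integral: exponentiating log-likelihoods directly would underflow, which is exactly the "very pointy" behavior the next section exploits.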

Laplace Approximation

Here’s the idea of Laplace approximation. First, posterior distributions tend to be very pointy. This is mainly because independent probabilities multiply, so probabilities tend to scale exponentially with the number of data points. Think of the probabilities we calculated in the last two posts, with values like 10^{-70} or 10^{-20}; that’s the typical case. If we’re integrating over a function with values like that, we can basically just pay attention to the region around the highest value; other regions will have exponentially small weight.

Laplace’s trick is to use a second-order approximation within that high-valued region. Specifically, since probabilities naturally live on a log scale, we’ll take a second-order approximation of the log likelihood around its maximum point (the first-order term vanishes, since \theta_{max} is a maximum). Thus:

\int_{\theta} e^{\ln P[data|\theta]} dP[\theta] \approx \int_{\theta} e^{\ln P[data|\theta_{max}] + \frac{1}{2}(\theta-\theta_{max})^T \left(\left.\frac{d^2 \ln P}{d\theta^2}\right|_{\theta_{max}}\right)(\theta-\theta_{max})} dP[\theta]

If we assume that the prior dP[\theta] is uniform (i.e. dP[\theta] = d\theta), then this looks like a normal distribution on \theta with mean \theta_{max} and covariance given by the inverse of minus the Hessian of the log likelihood. (It turns out that, even for non-uniform dP[\theta], we can just transform \theta so that the prior looks uniform near \theta_{max}, and transform it back when we’re done.) The result:

\int_{\theta} e^{\ln P[data|\theta]} dP[\theta] \approx P[data|\theta_{max}] \; p[\theta_{max}] \; (2\pi)^{\frac{k}{2}} \det\left(-\left.\frac{d^2 \ln P}{d\theta^2}\right|_{\theta_{max}}\right)^{-\frac{1}{2}}

Let’s walk through each of those pieces:

- P[data|\theta_{max}] is the likelihood of the data at the maximum-likelihood parameter values \theta_{max}.
- p[\theta_{max}] is the prior density at \theta_{max}.
- (2\pi)^{\frac{k}{2}} \det(\ldots)^{-\frac{1}{2}} is the volume of the peak: the normalization constant of a k-dimensional Gaussian whose covariance is the inverse of minus the Hessian, where k is the number of parameters (i.e. the dimension of \theta).
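As a sketch of the formula in action, here is the one-dimensional case, reusing the illustrative coin-flip setup from the earlier sketch (so k = 1 and \theta_{max} has a closed form). In higher dimensions you’d find \theta_{max} with an optimizer and get the Hessian from autodiff; a finite difference suffices here:

```python
# Laplace approximation for the toy coin-flip model (k = 1, uniform prior).
import numpy as np

n, h, k = 100, 62, 1                 # same made-up data as before

def log_lik(p):
    return h * np.log(p) + (n - h) * np.log(1 - p)

p_max = h / n                        # maximum-likelihood point (closed form here)

# Second derivative of the log likelihood at the max, by finite difference.
eps = 1e-5
hess = (log_lik(p_max + eps) - 2 * log_lik(p_max) + log_lik(p_max - eps)) / eps**2

log_prior_at_max = 0.0               # uniform prior on [0,1]: density p[theta] = 1

# ln( P[data|theta_max] * p[theta_max] * (2*pi)^(k/2) * det(-H)^(-1/2) )
log_evidence = (log_lik(p_max) + log_prior_at_max
                + 0.5 * k * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(np.atleast_2d(-hess))))
print(log_evidence)                  # ~ -68.5, matching the exact Beta integral
```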

Laplace Complexity Penalty

The Laplace approximation contains our first example of an explicit complexity penalty. The idea of a complexity penalty is that we first find the maximum log likelihood \ln P[data|\theta_{max}], maybe add a term for our \theta-prior \ln p[\theta_{max}], and that’s the "score" of our model. But more general models, with more free parameters, will always score higher, leading to overfitting. To counterbalance that, we calculate some numerical penalty which is larger for more complex models (i.e. those with more free parameters) and subtract that penalty from the raw score. In the case of Laplace approximation, a natural complexity penalty drops out as soon as we take the log of the approximation formula:

\ln P[data|model] \approx \ln P[data|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln \det\left(-\left.\frac{d^2 \ln P}{d\theta^2}\right|_{\theta_{max}}\right)

The last two terms are the complexity penalty. The Hessian’s eigenvalues typically scale linearly with the number of data points, so the penalty grows roughly like \frac{k}{2} \ln(\text{number of data points}): each extra free parameter is charged for how precisely the data pins it down.
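To see the penalty do real work, here is a hedged sketch comparing a zero-parameter model ("the coin is fair") against the one-parameter biased-coin model from the earlier sketches, on mildly unbalanced made-up data. The zero-parameter model needs no integral and pays no penalty:

```python
# Complexity penalty at work: fair coin (k = 0) vs biased coin (k = 1).
import numpy as np

def laplace_log_evidence(n, h):
    """Laplace approximation for the biased-coin model (k = 1, uniform prior,
    so the ln p[theta_max] term is zero and is omitted)."""
    p = h / n
    log_lik_max = h * np.log(p) + (n - h) * np.log(1 - p)
    neg_hessian = h / p**2 + (n - h) / (1 - p)**2    # -d^2 lnP / dp^2 at the max
    return log_lik_max + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(neg_hessian)

n, h = 100, 52                       # made-up data: only mildly unbalanced
fair = n * np.log(0.5)               # no free parameters: no integral, no penalty
biased = laplace_log_evidence(n, h)
print(fair, biased)                  # ~ -69.3 vs ~ -71.3
```

Here the biased model’s maximum log likelihood (about -69.2) actually beats the fair coin’s (about -69.3), but its roughly 2.1-nat penalty flips the comparison: the fair coin wins on evidence.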

Comment

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation?commentId=EbpWW3FP3pWkMszsA

For an introduction to MCMC aimed at a similar target audience, I found this explanation helpful.

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation?commentId=yjpJayinHjrPPRpvf

Thanks for this sequence, I’ve read each post 3 or 4 times to try to properly get it. Am I right in thinking that in order to replace dP[\theta] with d\theta we not only require a uniform prior but also that \theta span unit volume?

Comment

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation?commentId=5es7xerfdseBFrNYi

Correct. In general, dP[\theta] = p[\theta]d\theta is the probability density of \theta, so if it’s uniform on a unit volume then p[\theta] = 1.

The main advantage of this notation is that it’s parameterization-independent. For example: in a coin-flipping example, we could have a uniform prior over the frequency of heads p_H, so dP[p_H] = dp_H. But then, we could re-write that frequency in terms of the odds o_H = \frac{p_H}{1-p_H}, so we’d get p_H = \frac{o_H}{1+o_H} and

dP[o_H] = dP[p_H] = dp_H = d\left(\frac{o_H}{1+o_H}\right) = \frac{do_H}{(1+o_H)^2}

So the probability density p[p_H] = 1 is equivalent to the density p[o_H] = \frac{1}{(1+o_H)^2}. (That first step, dP[o_H] = dP[p_H], is because these two variables contain exactly the same information in two different forms; that’s the parameterization independence. After that, it’s math: substitute and differentiate.)

(Notice that the uniform prior on p_H is not uniform over o_H. This is one of the main reasons why "use a uniform prior" is not a good general-purpose rule for choosing priors: it depends on what parameters we choose. Cartesian and polar coordinates give different "uniform" priors.)

The moral of the story is that, when dealing with continuous probability densities, the fundamental "thing" is not the density function p[\theta] but the density times the differential p[\theta]d\theta, which we call dP[\theta]. This is important mainly when changing coordinates: if we have some coordinate change \theta(\phi), then p[\theta(\phi)] d\theta(\phi) = p[\phi]d\phi, but p[\theta(\phi)] \neq p[\phi].

If anybody wants an exercise with this: try transforming

\int_{\theta} e^{\ln P[data|\theta]} dP[\theta] = \int_{\theta} e^{\ln P[data|\theta]} p[\theta] d\theta

to a different coordinate system. Apply Laplace’s approximation in both systems, and confirm that they yield the same answer. (This should mainly involve applying the chain rule twice to the Hessian; if you get stuck, remember that \theta_{max} is a maximum point and consider what that implies.)
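For anyone who’d rather check the exercise numerically before (or after) doing it by hand, here is a sketch applying the Laplace formula in both coordinate systems, reusing the illustrative coin-flip numbers from the earlier sketches; the chain-rule step uses exactly the hint above:

```python
# Check that the Laplace approximation is parameterization-independent:
# same answer in p_H coordinates and in odds coordinates o_H = p_H/(1-p_H).
import numpy as np

n, h = 100, 62                       # same made-up data as earlier sketches
log_lik_max = h * np.log(h / n) + (n - h) * np.log(1 - h / n)

def laplace(log_prior_density, neg_hessian):
    # ln( P[data|theta_max] * p[theta_max] * sqrt(2*pi / (-d^2 lnP/dtheta^2)) )
    return log_lik_max + log_prior_density + 0.5 * np.log(2 * np.pi / neg_hessian)

# Coordinates 1: theta = p_H, uniform prior (density 1).
p = h / n
H_p = h / p**2 + (n - h) / (1 - p)**2        # -d^2 lnP / dp^2 at the max
ans_p = laplace(np.log(1.0), H_p)

# Coordinates 2: theta = o_H, prior density 1/(1+o)^2 as derived above.
# Chain rule: d^2 lnP/do^2 = (d^2 lnP/dp^2)(dp/do)^2 + (d lnP/dp)(d^2 p/do^2),
# and the second term vanishes because d lnP/dp = 0 at the maximum.
o = p / (1 - p)
dp_do = 1 / (1 + o)**2
ans_o = laplace(np.log(1 / (1 + o)**2), H_p * dp_do**2)

print(ans_p, ans_o)   # agree to floating-point precision:
                      # the (1+o)^2 factors in prior and Hessian cancel exactly
```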