Laplace Approximation

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation

The last couple posts compared some specific models for 20000 rolls of a die. This post will step back, and talk about more general theory for Bayesian model comparison.

The main problem is to calculate P[data | model] for some model. The model will typically give the probability of observed data x (e.g. die rolls) based on some unobserved parameter values \theta (e.g. the p’s in the last two posts), along with a prior distribution over \theta. We then need to compute

P[data|model] = \int_{\theta} P[data|\theta] dP[\theta]

which will be a hairy high-dimensional integral.

Some special model structures allow us to simplify the problem, typically by factoring the integral into a product of one-dimensional integrals. But in general, we need some method for approximating these integrals. The two most common approximation methods used in practice are Laplace approximation around the maximum-likelihood point, and MCMC (see e.g. here for an application of MCMC to Bayes factors). We’ll mainly talk about Laplace approximation here; in practice, MCMC mostly works well in the same cases, assuming the unobserved parameters are continuous.
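To make that integral concrete, here is a minimal sketch of the brute-force approach, for a model simple enough that the integral is one-dimensional and has a closed form to check against. The coin-flip model, the numbers, and the variable names are illustrative assumptions for this sketch, not from the posts in this sequence:

```python
# Brute-force numerical sketch of P[data|model] for a toy model:
# n coin flips with h heads, and a uniform prior over the heads-frequency p.
import numpy as np
from scipy.special import betaln
from scipy.integrate import trapezoid

n, h = 100, 62                       # made-up data: 62 heads in 100 flips

# P[data|model] = int P[data|p] dP[p], with dP[p] = dp (uniform prior on [0,1]).
# Work on a log scale: likelihoods like 10^-70 underflow floating point.
p = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = h * np.log(p) + (n - h) * np.log(1 - p)
m = log_lik.max()
log_evidence = np.log(trapezoid(np.exp(log_lik - m), p)) + m

# For this model the integral is a Beta function, so we can check the answer.
exact = betaln(h + 1, n - h + 1)
print(log_evidence, exact)           # both ~ -68.5; agreement to ~6 digits
```

Note the max-subtraction trick inside the integral: exponentiating log-likelihoods directly would underflow, which is exactly the "very pointy" behavior the next section exploits.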

Laplace Approximation

Here’s the idea of Laplace approximation. First, posterior distributions tend to be very pointy. This is mainly because independent probabilities multiply, so probabilities tend to scale exponentially with the number of data points. Think of the probabilities we calculated in the last two posts, with values like 10^{-70} or 10^{-20}; that’s the typical case. If we’re integrating over a function with values like that, we can basically just pay attention to the region around the highest value; other regions will have exponentially small weight.

Laplace’s trick is to use a second-order approximation within that high-valued region. Specifically, since probabilities naturally live on a log scale, we’ll take a second-order approximation of the log likelihood around its maximum point (the first-order term vanishes, since \theta_{max} is a maximum). Thus:

\int_{\theta} e^{\ln P[data|\theta]} dP[\theta] \approx \int_{\theta} e^{\ln P[data|\theta_{max}] + \frac{1}{2}(\theta-\theta_{max})^T \left(\left.\frac{d^2 \ln P}{d\theta^2}\right|_{\theta_{max}}\right)(\theta-\theta_{max})} dP[\theta]

If we assume that the prior dP[\theta] is uniform (i.e. dP[\theta] = d\theta), then this looks like a normal distribution on \theta with mean \theta_{max} and covariance given by the inverse of minus the Hessian of the log likelihood. (It turns out that, even for non-uniform dP[\theta], we can just transform \theta so that the prior looks uniform near \theta_{max}, and transform it back when we’re done.) The result:

\int_{\theta} e^{\ln P[data|\theta]} dP[\theta] \approx P[data|\theta_{max}] \; p[\theta_{max}] \; (2\pi)^{\frac{k}{2}} \det\left(-\left.\frac{d^2 \ln P}{d\theta^2}\right|_{\theta_{max}}\right)^{-\frac{1}{2}}

Let’s walk through each of those pieces:

- P[data|\theta_{max}] is the likelihood of the data at the maximum-likelihood parameter values \theta_{max}.
- p[\theta_{max}] is the prior density at \theta_{max}.
- (2\pi)^{\frac{k}{2}} \det(\ldots)^{-\frac{1}{2}} is the volume of the peak: the normalization constant of a k-dimensional Gaussian whose covariance is the inverse of minus the Hessian, where k is the number of parameters (i.e. the dimension of \theta).
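As a sketch of the formula in action, here is the one-dimensional case, reusing the illustrative coin-flip setup from the earlier sketch (so k = 1 and \theta_{max} has a closed form). In higher dimensions you’d find \theta_{max} with an optimizer and get the Hessian from autodiff; a finite difference suffices here:

```python
# Laplace approximation for the toy coin-flip model (k = 1, uniform prior).
import numpy as np

n, h, k = 100, 62, 1                 # same made-up data as before

def log_lik(p):
    return h * np.log(p) + (n - h) * np.log(1 - p)

p_max = h / n                        # maximum-likelihood point (closed form here)

# Second derivative of the log likelihood at the max, by finite difference.
eps = 1e-5
hess = (log_lik(p_max + eps) - 2 * log_lik(p_max) + log_lik(p_max - eps)) / eps**2

log_prior_at_max = 0.0               # uniform prior on [0,1]: density p[theta] = 1

# ln( P[data|theta_max] * p[theta_max] * (2*pi)^(k/2) * det(-H)^(-1/2) )
log_evidence = (log_lik(p_max) + log_prior_at_max
                + 0.5 * k * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(np.atleast_2d(-hess))))
print(log_evidence)                  # ~ -68.5, matching the exact Beta integral
```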

Laplace Complexity Penalty

The Laplace approximation contains our first example of an explicit complexity penalty. The idea of a complexity penalty is that we first find the maximum log likelihood \ln P[data|\theta_{max}], maybe add a term for our \theta-prior \ln p[\theta_{max}], and that’s the "score" of our model. But more general models, with more free parameters, will always score higher, leading to overfitting. To counterbalance that, we calculate some numerical penalty which is larger for more complex models (i.e. those with more free parameters) and subtract that penalty from the raw score. In the case of Laplace approximation, a natural complexity penalty drops out as soon as we take the log of the approximation formula:

\ln P[data|model] \approx \ln P[data|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln \det\left(-\left.\frac{d^2 \ln P}{d\theta^2}\right|_{\theta_{max}}\right)

The last two terms are the complexity penalty. The Hessian’s eigenvalues typically scale linearly with the number of data points, so the penalty grows roughly like \frac{k}{2} \ln(\text{number of data points}): each extra free parameter is charged for how precisely the data pins it down.
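To see the penalty do real work, here is a hedged sketch comparing a zero-parameter model ("the coin is fair") against the one-parameter biased-coin model from the earlier sketches, on mildly unbalanced made-up data. The zero-parameter model needs no integral and pays no penalty:

```python
# Complexity penalty at work: fair coin (k = 0) vs biased coin (k = 1).
import numpy as np

def laplace_log_evidence(n, h):
    """Laplace approximation for the biased-coin model (k = 1, uniform prior,
    so the ln p[theta_max] term is zero and is omitted)."""
    p = h / n
    log_lik_max = h * np.log(p) + (n - h) * np.log(1 - p)
    neg_hessian = h / p**2 + (n - h) / (1 - p)**2    # -d^2 lnP / dp^2 at the max
    return log_lik_max + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(neg_hessian)

n, h = 100, 52                       # made-up data: only mildly unbalanced
fair = n * np.log(0.5)               # no free parameters: no integral, no penalty
biased = laplace_log_evidence(n, h)
print(fair, biased)                  # ~ -69.3 vs ~ -71.3
```

Here the biased model’s maximum log likelihood (about -69.2) actually beats the fair coin’s (about -69.3), but its roughly 2.1-nat penalty flips the comparison: the fair coin wins on evidence.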

Comment

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation?commentId=EbpWW3FP3pWkMszsA

For an introduction to MCMC aimed at a similar target audience, I found this explanation helpful.

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation?commentId=yjpJayinHjrPPRpvf

Thanks for this sequence, I’ve read each post 3 or 4 times to try to properly get it. Am I right in thinking that in order to replace dP[\theta] with d\theta we not only require a uniform prior but also that \theta span unit volume?

Comment

https://www.lesswrong.com/posts/9BqvY7tpqrD4BC6dL/laplace-approximation?commentId=5es7xerfdseBFrNYi

Correct. In general, dP[\theta] = p[\theta]d\theta is the probability density of \theta, so if it’s uniform on a unit volume then p[\theta] = 1.

The main advantage of this notation is that it’s parameterization-independent. For example: in a coin-flipping example, we could have a uniform prior over the frequency of heads p_H, so dP[p_H] = dp_H. But then, we could re-write that frequency in terms of the odds o_H = \frac{p_H}{1-p_H}, so we’d get p_H = \frac{o_H}{1+o_H} and

dP[o_H] = dP[p_H] = dp_H = d\left(\frac{o_H}{1+o_H}\right) = \frac{do_H}{(1+o_H)^2}

So the probability density p[p_H] = 1 is equivalent to the density p[o_H] = \frac{1}{(1+o_H)^2}. (That first step, dP[o_H] = dP[p_H], is because these two variables contain exactly the same information in two different forms; that’s the parameterization independence. After that, it’s math: substitute and differentiate.)

(Notice that the uniform prior on p_H is not uniform over o_H. This is one of the main reasons why "use a uniform prior" is not a good general-purpose rule for choosing priors: it depends on what parameters we choose. Cartesian and polar coordinates give different "uniform" priors.)

The moral of the story is that, when dealing with continuous probability densities, the fundamental "thing" is not the density function p[\theta] but the density times the differential p[\theta]d\theta, which we call dP[\theta]. This is important mainly when changing coordinates: if we have some coordinate change \theta(\phi), then p[\theta(\phi)] d\theta(\phi) = p[\phi]d\phi, but p[\theta(\phi)] \neq p[\phi].

If anybody wants an exercise with this: try transforming

\int_{\theta} e^{\ln P[data|\theta]} dP[\theta] = \int_{\theta} e^{\ln P[data|\theta]} p[\theta] d\theta

to a different coordinate system. Apply Laplace’s approximation in both systems, and confirm that they yield the same answer. (This should mainly involve applying the chain rule twice to the Hessian; if you get stuck, remember that \theta_{max} is a maximum point and consider what that implies.)
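For anyone who’d rather check the exercise numerically before (or after) doing it by hand, here is a sketch applying the Laplace formula in both coordinate systems, reusing the illustrative coin-flip numbers from the earlier sketches; the chain-rule step uses exactly the hint above:

```python
# Check that the Laplace approximation is parameterization-independent:
# same answer in p_H coordinates and in odds coordinates o_H = p_H/(1-p_H).
import numpy as np

n, h = 100, 62                       # same made-up data as earlier sketches
log_lik_max = h * np.log(h / n) + (n - h) * np.log(1 - h / n)

def laplace(log_prior_density, neg_hessian):
    # ln( P[data|theta_max] * p[theta_max] * sqrt(2*pi / (-d^2 lnP/dtheta^2)) )
    return log_lik_max + log_prior_density + 0.5 * np.log(2 * np.pi / neg_hessian)

# Coordinates 1: theta = p_H, uniform prior (density 1).
p = h / n
H_p = h / p**2 + (n - h) / (1 - p)**2        # -d^2 lnP / dp^2 at the max
ans_p = laplace(np.log(1.0), H_p)

# Coordinates 2: theta = o_H, prior density 1/(1+o)^2 as derived above.
# Chain rule: d^2 lnP/do^2 = (d^2 lnP/dp^2)(dp/do)^2 + (d lnP/dp)(d^2 p/do^2),
# and the second term vanishes because d lnP/dp = 0 at the maximum.
o = p / (1 - p)
dp_do = 1 / (1 + o)**2
ans_o = laplace(np.log(1 / (1 + o)**2), H_p * dp_do**2)

print(ans_p, ans_o)   # agree to floating-point precision:
                      # the (1+o)^2 factors in prior and Hessian cancel exactly
```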