Why should the Occamian prior work so well in the real world? It’s a seemingly profound mystery that is asking to be dissolved. To begin with, I propose a Lazy Razor and a corresponding Lazy prior:
Given several competing models of reality, we should select the one that is easiest to work with.

This is merely a formulation of the obvious trade-off between accuracy and cost. I would rather have a bad prediction today than a good prediction tomorrow or a great prediction ten years from now. Ultimately, this prior will deliver a good model, because it will let you try out many different models fast. The concept of "easiness" may seem even more vague than "complexity", but I believe that in any specific context its measurement should be clear. Note that "easiness" is measured in man-hours, dollars, and so on; it is not to be confused with "hardness" in the sense of P and NP. If you still don’t know how to measure "easiness" in your context, you should use the Lazy prior to choose an "easiness" measurement procedure. To break the recursive loop, know that the Laziest of all models is called "pulling numbers out of your ass".

Now let’s return to the first question. Why should the Occamian prior work so well in the real world? The answer is: it doesn’t, not really. Of all the possible priors, the Occamian prior holds no special place. Its greatest merit is that it often resembles the Lazy prior in the probabilities it offers. Indeed, it is easy to see that a random model with a billion parameters is disliked by both priors, and that a model with two parameters is loved by both. (Its second greatest merit, by the way, is being easy to work with.)

Note that the priors are not interchangeable. One case where they disagree is on making use of existing resources. Suppose mathematics has derived powerful tools for working with A-theory but not B-theory. Then the Lazy prior would suggest that a complex model based on A-theory may be preferable to a simpler one based on B-theory. Or suppose some process took millions of years to produce abundant and powerful meat-based computers. Then the Lazy prior would suggest that we make use of them in our models, regardless of their complexity, while the Occamian prior would object.
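To make the razor concrete, here is a minimal sketch of how a Lazy model search might look, assuming each candidate’s cost can be estimated in man-hours up front. The function names, the cost estimates, and the accuracy check are illustrative placeholders rather than anything prescribed by the razor itself.

```python
# A toy sketch of the Lazy razor as a search procedure (illustrative only):
# rank candidate models by estimated cost in man-hours, try the cheapest
# first, and move on only if its predictions are not good enough.

def lazy_model_search(candidates, good_enough):
    """candidates: iterable of (name, estimated_cost_in_man_hours, build_and_test),
    where build_and_test() builds the model and returns its accuracy, and
    good_enough(accuracy) says whether that accuracy suffices for now."""
    for name, cost, build_and_test in sorted(candidates, key=lambda c: c[1]):
        accuracy = build_and_test()   # a bad prediction today, available right now
        if good_enough(accuracy):     # stop as soon as the cheap model is adequate
            return name, accuracy
    return None                       # nothing cheap enough worked; rethink the budget
```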
I mentioned to Eliezer once a few years ago that a weak form of Occam’s razor, across a countable hypothesis space, is inevitable. This observation was new to him so it seems worth reproducing here.

Suppose P(i) is any prior whatsoever over a countable hypothesis space, for example the space of Turing machines. The condition that \sum_i P(i) = 1, and in particular the condition that this sum converges, implies that for every \epsilon > 0 we can find N such that \sum_{i \ge N} P(i) < \epsilon; in other words, the total probability mass of hypotheses with sufficiently large indices gets arbitrarily small. If the indices index hypotheses by increasing complexity, this implies that the total probability mass of sufficiently complicated hypotheses gets arbitrarily small, no matter what the prior is.

The real kicker is that "complexity" can mean absolutely anything in the argument above; that is, the indexing can be arbitrary and the argument will still apply. And it sort of doesn’t matter; any indexing has the property that it will eventually exhaust all of the sufficiently "simple" hypotheses, according to any other definition of "simplicity," because there aren’t enough "simple" hypotheses to go around, and so must eventually have the property that the hypotheses being indexed get more and more "complicated," whatever that means.

So, roughly speaking, weak forms of Occam’s razor are inevitable because there just aren’t as many "simple" hypotheses as "complicated" ones, whatever "simple" and "complicated" mean, so "complicated" hypotheses just can’t have that much probability mass individually. (And in turn the asymmetry between simple and complicated is that simplicity is bounded but complexity isn’t.)

There’s also an anthropic argument for stronger forms of Occam’s razor that I think was featured in a recentish post: worlds in which Occam’s razor doesn’t work are worlds in which intelligent life probably couldn’t have evolved.
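A small numerical illustration of the argument (a toy example, not part of the original comment): pick any concrete prior over an enumeration of hypotheses, here P(i) = 1/(i(i+1)) purely because it sums to exactly 1, and find, for each \epsilon, an index beyond which all remaining hypotheses together hold less than \epsilon of the probability mass.

```python
# Toy demonstration that the tail of any prior over a countable hypothesis
# space must shrink to zero.  The specific prior P(i) = 1/(i*(i+1)) is an
# arbitrary illustrative choice; it sums to exactly 1 over i = 1, 2, 3, ...

def prior(i: int) -> float:
    return 1.0 / (i * (i + 1))

def cutoff_index(eps: float) -> int:
    """Smallest N such that hypotheses with index >= N hold total mass < eps."""
    covered, i = 0.0, 1
    while 1.0 - covered >= eps:
        covered += prior(i)
        i += 1
    return i

for eps in (0.1, 0.01, 0.001):
    # Everything indexed past this point is, collectively, less likely than eps,
    # however the hypotheses happen to be ordered within the tail.
    print(f"eps = {eps}: hypotheses with index >= {cutoff_index(eps)} share < {eps} mass")
```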
Comment
Doesn’t the speed prior diverge quite rapidly from the universal prior? There are many short programs of length n which take a long time to compute their final result—up to BB(n) timesteps, specifically...
Comment
Yes, the two priors aren’t as close as I might have implied. But there are still many cases where they agree. For example, given a random 6-state TM and a random 7-state TM, both the Lazy and the Occamian prior will usually prefer the 6-state machine. By the way, if I had to simulate these TMs by hand, I would care a lot about computation time; but now that we have cheap computers, computation time has a smaller coefficient, and the time for building the TM matters more. This is how it works: "easiness" is measured in man-hours, not just in the number of steps the TM makes.
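A rough sketch of that weighting, with entirely made-up numbers: treat "easiness" as total man-hours, with a per-step coefficient that is large when simulating by hand and tiny on a cheap computer.

```python
# A rough sketch of the cost weighting described above (all numbers made up):
# "easiness" is total man-hours, and the per-step coefficient is large when
# simulating a Turing machine by hand but tiny on a cheap computer.

def easiness_cost(states, steps, hours_per_step, hours_per_state=2.0):
    build_hours = hours_per_state * states   # time spent designing the machine
    run_hours = steps * hours_per_step       # time spent on the simulation itself
    return build_hours + run_hours

by_hand     = easiness_cost(states=6, steps=10**6, hours_per_step=1 / 120)  # ~30 s per step
by_computer = easiness_cost(states=6, steps=10**6, hours_per_step=1e-9)     # steps are ~free

# By hand, computation time dominates the cost; on a computer, build time does.
print(by_hand, by_computer)
```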
How do you define "works well" for a prior? I argue that for most things, the universal prior (everything is equally likely) works about as well as the Lazy prior or Occam’s prior, because all non-extreme priors are overwhelmed with evidence (including evidence from other agents) very rapidly. All three fail in the tails, but do just fine for the majority of uses. Now, if you are talking about a measure of model simplicity and likelihood of applying to novel situations, rather than probability of prediction, then it’s not clear that the universal prior is usable, but it’s also not clear that the Lazy prior is better or worse than Occam’s.
Comment
By "works well" I mean that we find whatever model we were looking for. Note that I didn’t say "eventually" (all priors work "eventually", unless they assign too many 0 probabilities).
Comment
That seems susceptible to circularity. If we are looking for a simple model, we will get one. But what if we are looking for a true model? Is the simplest model necessarily true?
Comment
We aren’t looking for a simple model, we are looking for a model that generates accurate predictions. For instance, we could have two agents with two different priors independently working on the same problem (e.g. weather forecasting) for a fixed amount of time, and then see which of them found a more accurate model. Then, whoever wins gets to say that his prior is better. Nothing circular about it.
"Why should the Occamian prior work so well in the real world?" A different way of phrasing Occam’s Razor is "Given several competing models of reality, the most likely one is the one that involves multiplying together the fewest probabilities." That’s because each additional level of complexity is adding another probability that needs to be multiplied together. It’s a simple result of probability.
Comment
I believe that the Occamian prior should hold true in any universe where the laws of probability hold. I don’t see any reason why not, since the assumption behind it is that all the individual levels of complexity of different models have roughly the same probability.
Comment
Laws of probability say that P(A \cap B) \leq P(A). I suspect that to you "Occam’s Razor" refers to this law (I don’t think that’s the usual interpretation, but it’s reasonable). However, this law does not make a prior. It does not say anything about whether we should prefer a 6-state Turing machine to a 100-state TM when building a model. Try using the laws of probability to decide that.
Comment
That is indeed what it means in my mind. I agree that it was bad wording. Perhaps something more along the lines of "should work well."
Comment
Not at all. I’m repeating a truism: to make a claim about the territory, you should look at the territory. "The Occamian prior works well" is an empirical claim about the real world (though it’s not easy to measure). "Probabilities need to be multiplied" is a lot less empirical (it’s about as empirical as 2+2=4). Therefore the former shouldn’t follow from the latter.
I have a feeling that you are mixing up probability and decision theory. Given some observations, there are two separate questions when considering possible explanations / models: how probable each model is, and which model is the best one to adopt and work with, given its costs and payoffs.
Comment
You are correct that the Lazy prior largely encodes considerations of utility maximization. My core point isn’t that the Lazy prior is some profound idea. Instead, my core point is that the Occamian prior is not profound either. It has only a few real merits. One minor merit is that it is simple to describe and to reason about, which makes it a high-utility choice of prior, at least for theoretical discussions. But the greatest merit of the Occamian prior is that it vaguely resembles the Lazy prior. That is, it also encodes some of the same considerations of utility maximization.

I’m suggesting that, whenever someone talks about the power of Occam’s razor or the mysterious simplicity of nature, what is happening is in fact this: the person did not bother to do proper utility calculations, the Occamian prior encoded some of those calculations by construction, and therefore the person managed to reach a high-utility result with less effort. With that in mind, I asked what prior would serve this purpose even better, and arrived at the Lazy prior. The idea of encoding these considerations in a prior may seem like an error of some kind, but the choice of a prior is subjective by definition, so it should be fine.

(Thanks for the comment. I found it useful. I hadn’t explicitly considered this criticism when I wrote the post, and I feel that I now understand my own view better.)
Comment
Is it? It seems a rather straightforward consequence of how knowledge works: information allows you to establish probabilistic beliefs, and then probability theory explains pretty simply what Occam’s Razor is.
Comment
See my earlier reply to a similar comment.