Probability as Minimal Map

https://www.lesswrong.com/posts/Lz2nCYnBeaZyS68Xb/probability-as-minimal-map

Probability-Computing Circuits

We want to show that P[q | X] throws away no information relevant to q: I(q; X) = I(q; P[q | X]). We’ll start with a short side-quest to prove a lemma: P[q | P[q | X] = p] = p.

Here’s an interpretation: suppose you work late trying to calculate the probability of q based on some giant pile of data X. You finally compute the result: P[q | X] = p. You go to bed feeling victorious, but wake up to find that your computer has died and your giant pile of data is lost. However, you still remember the final result p - even though you don’t know X, you do know that P[q | X] = p. Based on this information, what probability should you assign to q? Presumably the answer should be p. We can directly prove this in a few lines:

P[q \mid P[q|X] = p] = \frac{P[q,\, P[q|X] = p]}{P[P[q|X] = p]} = \frac{\sum_x I[P[q|X=x] = p]\, P[q | X = x]\, P[X = x]}{\sum_x I[P[q|X=x] = p]\, P[X = x]}

Note that the indicator function I[.] in the main sum is zero whenever P[q | X = x] is not p, so we can replace P[q | X = x] with p in that sum:

= \frac{\sum_x I[P[q|X=x] = p]\, p\, P[X = x]}{\sum_x I[P[q|X=x] = p]\, P[X = x]} = p

We can also write this in the more compact form P[q | P[q | X]] = P[q | X], which we’ll use later.

To build more intuition, let’s expand on this result. Imagine a circuit (aka deterministic causal diagram or computational DAG) which takes in data X and outputs P[q | X]. We’ll think about some arbitrary cross-section of that circuit—some set of internal nodes {g_i} which cut the circuit in two, so that g mediates the interaction between X and P[q | X]. We can write this as P[q | X] = f(g(X)) for some function f.

Claim: for any cross-section g, P[q | g(X)] = P[q | X]. Intuitively, we can imagine that we’re partway through calculating P[q | X]: we’ve calculated all the g’s but haven’t finished the full computation yet, when suddenly our computer dies and we lose all the original data X. As long as we still have the g’s around, we’re good—we just shrug and continue the calculation, since the g’s were all we needed to finish up anyway. I won’t write out the proof for this one, but it works exactly like the previous proof.

Notice that our cross-section theorem has two interesting corner cases: all of X is a cross-section, and P[q | X] by itself is a cross-section. The former yields the trivial statement P[q | X] = P[q | X], while the latter yields our lemma from earlier: P[q | P[q | X] = p] = p.

Armed with our lemma, let’s return to the main problem: show that P[q | X] throws away no information relevant to q, formalized as I(q; X) = I(q; P[q | X]). We’ll start by applying an identity from information theory:

I(q; X) = H(q) - H(q | X)

where H denotes entropy. By our lemma, conditioning on P[q | X] yields the same distribution over q as conditioning on X itself, so the two conditional entropies are equal: H(q | P[q | X]) = H(q | X). Substituting:

I(q; P[q | X]) = H(q) - H(q | P[q | X]) = H(q) - H(q | X) = I(q; X)

… which is what we wanted.
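To make the lemma concrete, here is a minimal numerical sketch (not from the post; the distribution below is made up for illustration). It builds a small joint distribution over a binary query q and a four-valued data variable X, deliberately giving two x-values the same posterior so that the event P[q|X] = p genuinely lumps distinct data points together, then checks that conditioning on that event returns p:

```python
# Toy check of the lemma P[q | P[q|X] = p] = p.
# All numbers are hypothetical; x = 0 and x = 2 share a posterior on purpose,
# so conditioning on "P[q|X] = 0.25" mixes two different x-values.
import numpy as np

p_x = np.array([0.1, 0.2, 0.3, 0.4])            # P[X = x]
p_q_given_x = np.array([0.25, 0.7, 0.25, 0.7])  # P[q | X = x]
p_joint = p_q_given_x * p_x                     # P[q, X = x]

for p in np.unique(p_q_given_x):
    mask = np.isclose(p_q_given_x, p)           # the event "P[q|X] = p"
    p_q_given_p = p_joint[mask].sum() / p_x[mask].sum()
    print(f"P[q | P[q|X] = {p}] = {p_q_given_p:.4f}")
    assert np.isclose(p_q_given_p, p)           # the lemma: this equals p
```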

The Data Processing Inequality

Next, we’d like to show that P[q | X] is a minimal representation of the information in X relevant to q. Formally, for any g(X), if g throws away no information relevant to q, then P[q | X] is a function of g(X): I(q; X) = I(q; g(X)) implies P[q | X] = f(g(X)) for some f. In particular, we’ll show that P[q | g(X)] = P[q | X], so we can choose f(g*) = P[q | g(X) = g*]. Intuitively, this says that if g(X) throws out no information from X relevant to q, then it yields the same probability for q.

To prove this, we’ll use a theorem from information theory called the data processing inequality (DPI). Intuitively, the DPI says that if we start with the data X, we cannot gain any new information about q just by processing X. Formally, for any function g:

I(q; X) \geq I(q; g(X))

… with equality iff q and X are independent conditional on g(X) (so that X tells us nothing else about q once we know g(X)). The proof and a bit more discussion can be found in Cover & Thomas’ textbook on page 34.

We’re interested in the equality case I(q; X) = I(q; g(X)), so the DPI tells us that q and X are independent conditioned on g(X):

P[q, X | g(X)] = P[q | g(X)] P[X | g(X)]

But we can always just expand via the chain rule instead (noting that g(X) is a deterministic function of X, so conditioning on it adds nothing once we already know X):

P[q, X | g(X)] = P[q | X, g(X)] P[X | g(X)] = P[q | X] P[X | g(X)]

Equate these two expressions for P[q, X | g(X)], and we find

P[q | g(X)] = P[q | X]

… which is what we wanted.
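As a quick sanity check of the equality case, here is a sketch with made-up numbers, reusing the toy distribution from the previous section’s example. The function g below collapses exactly those x-values which share a posterior, so it should preserve I(q; X) and hence, by the argument above, the posterior itself:

```python
# Toy check that a "lossless" g attains equality in the data processing
# inequality. The distribution is hypothetical; g lumps x = 0 with x = 2 and
# x = 1 with x = 3, which share posteriors, so no information about q is lost.
import numpy as np

p_x = np.array([0.1, 0.2, 0.3, 0.4])
p_q_given_x = np.array([0.25, 0.7, 0.25, 0.7])
joint = np.stack([(1 - p_q_given_x) * p_x, p_q_given_x * p_x])  # rows: q = 0, 1

def mutual_info(j):
    # I(q; X) in nats, for a joint table with rows indexing q, columns x
    pq = j.sum(axis=1, keepdims=True)
    px = j.sum(axis=0, keepdims=True)
    nz = j > 0
    return (j[nz] * np.log(j[nz] / (pq @ px)[nz])).sum()

g = np.array([0, 1, 0, 1])  # g(x): a cross-section collapsing equivalent x's
joint_g = np.stack([np.bincount(g, weights=joint[q]) for q in (0, 1)])

print(mutual_info(joint), mutual_info(joint_g))  # equal: I(q;X) = I(q;g(X))
```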

Why Is This Interesting?

We’ve shown that the probability P[q | X] summarizes all the information in X relevant to q, and throws out as much irrelevant information as possible. Why does this matter?

First, hopefully this provides some intuition for interpreting a probability P[q | X] as a representation of the information in X relevant to q. In short: probabilities directly represent information. This interpretation makes sense even in the absence of "agents" with "beliefs", or "independent experiments" repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations—we can talk about simple mechanical processes which produce "maps" of the world, and the probabilistic calculations embedded in those processes.

Second, this is a natural first step in characterizing map-making processes. The data processing inequality also suggests a natural next step: sufficient statistics, which serve as minimal "lossless" maps when our data are independent samples from a distribution, and our queries can ask anything at all about the parameters of the distribution (see the sketch below). From there, we could further generalize to approximate sufficient statistics—maps which throw out a small amount of information relevant to the queries, in cases where there is no "lossless" map smaller than X. That said, this still only talks about half of the full map-making process: the data-processing half. There’s still the data-collection half to address—the process which generates X in the first place, and somehow ties it to the territory. And of course, when map-making processes are embedded in the real world, there might not be a clean separation between data-collection and data-processing, so we need to deal with that somehow.

Third, the interpretation of probabilities as a minimal map—or a representation of information—suggests possible avenues to relax traditional probability to deal with things which are time-like separated from us, e.g. for handling self-reference and advanced decision-theoretic problems. In particular, it suggests strategically throwing away information to deal with self-referential tomfoolery—but continuing to use probabilities to represent the reduced information remaining.
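As a small illustration of the sufficient-statistic idea (a hedged sketch, not from the post; the coin-flip datasets below are made up): for iid coin flips with a Beta(1, 1) prior on the bias, the posterior over the bias depends on the data only through the number of heads, so the head count is a lossless "map" of the data for any query about the bias:

```python
# Sketch: the head count is a sufficient statistic for iid coin flips.
# Hypothetical datasets with different orderings but the same count yield
# identical posteriors over the coin's bias (a Beta(1, 1) prior is assumed).
from scipy.stats import beta

x1 = [1, 0, 1, 1, 0, 1]  # two made-up datasets...
x2 = [1, 1, 1, 0, 0, 1]  # ...with the same sufficient statistic: 4 heads

for x in (x1, x2):
    heads, tails = sum(x), len(x) - sum(x)
    posterior = beta(1 + heads, 1 + tails)    # conjugate Beta posterior
    print(posterior.mean(), posterior.std())  # identical for both datasets
```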

Comment

https://www.lesswrong.com/posts/Lz2nCYnBeaZyS68Xb/probability-as-minimal-map?commentId=cTY3QuMSzvkMa6Xpk

I hope that I’m not becoming that annoying person who redirects every conversation towards their own pet theory even when it’s of minimal relevance, but I think there are some very strong links here with my theory of logical counterfactuals as forgetting or erasing information (https://www.lesswrong.com/posts/BRuWm4GxcTNPn4XDX/deconfusing-logical-counterfactuals).

In fact, I wasn’t quite aware of how strong these similarities were before I read this post. Both probabilities* and logical counterfactuals don’t intrinsically exist in the universe; instead they seem to draw their existence from our epistemic situation. Zoom into the universe enough and everything is deterministic; there is only one thing that an agent ever could have chosen. Zoom out and there are multiple possibilities.

I’m currently writing a post where I’ll argue that logical counterfactuals are an answer and not a question, and I think this applies to probability too. Both exist in the map and not in the territory; this means that they are human-created terms that exist for human purposes. Arguing about what they are or aren’t in the abstract indicates confusion; instead, we need to choose a purpose (or question), and then we can choose the notion appropriate to that. This is part of why these notions come apart on questions like Sleeping Beauty and Counterfactual Mugging.

*Okay, it kind of does under some theories of quantum mechanics, but much of the discussion of probability uses probability to deal with a lack of complete knowledge, as opposed to intrinsically existing uncertainty, which is of a different kind and can’t be explained by quantum mechanics.

Comment

https://www.lesswrong.com/posts/Lz2nCYnBeaZyS68Xb/probability-as-minimal-map?commentId=p7Fe5LTiviTGnKrzE

That is highly relevant, and is basically where I was planning to go with my next post. In particular, see the dice problem in this comment—sometimes throwing away information requires randomizing our probability distribution. I suspect that this idea can be used to rederive Nash equilibria in a somewhat-more-embedded-looking scenario, e.g. our opponent making its decision by running a copy of us to see what we do. Thanks for pointing out the relevance of your work, I’ll probably use some of it.

Comment

https://www.lesswrong.com/posts/Lz2nCYnBeaZyS68Xb/probability-as-minimal-map?commentId=ftDArZxGjphihFyKJ

> In particular, it suggests strategically throwing away information to deal with self-referential tomfoolery—but continuing to use probabilities to represent the reduced information remaining.

[Emphasis added.] There didn’t seem to be any examples of that in this post. This:

P[q | P[q | X] = p] = p

seemed like moving self-referential/higher-level probability down to the level we care about. (It’s also one of the few things that seems reasonable to accept as an axiom.)

Comment

https://www.lesswrong.com/posts/Lz2nCYnBeaZyS68Xb/probability-as-minimal-map?commentId=L3GXc9rmPdrdsNCr4

Throwing away information to deal with self-reference is a natural strategy in general; the ideas in this post specifically suggest that we use probabilities to represent the information which remains.