Intuitions on Universal Behavior of Information at a Distance

https://www.lesswrong.com/posts/CxEbvETK2WNfHw7v9/intuitions-on-universal-behavior-of-information-at-a

Intuition on Distributed Information

Fun fact: with three random variables X, Y, Z, it’s possible for X to contain zero information about Z, and Y to contain zero information about Z, but for the pair (X, Y) together to contain nonzero information about Z. Classic example: X and Y are uniform random bits, and Z = X \oplus Y, the exclusive-or of the other two. A priori, the distribution P[Z] is 50/50. If we learn that X is 0 (without knowing Y) then it’s still 50/50; if we learn that X is 1 (without knowing Y) then it’s also still 50/50. Likewise for Y; learning either variable by itself does not change our distribution for Z. Noise in either input wipes out the signal from the other input. We can also generalize this example to more variables: X_1 … X_n are uniform random bits, and Z = X_1 \oplus … \oplus X_n. If there’s even just one input X_i whose value we don’t know, then we have zero information about Z, even if we know the values of all the other inputs.

On the other hand, it is not the case in general that all information is always wiped out by noise in a few variables. Suppose we have a whole bunch of earthquake-sensors in an area, and one of them drops offline; this does not significantly inhibit our ability to detect earthquakes in the area. In this case, each individual sensor contains information on its own—there’s even redundant information across sensors, since they’re all presumably highly correlated.

We have two prototypical pictures here: one in which information is stored in the relationship between variables, and another in which information is stored in the variables themselves. Intuitively, information stored in relationships seems to be much more sensitive to noise—uncertainty in even just one variable can wipe out the signal from all the others. Thus, a hypothesis: in a large complex system, with lots of noisy unobserved variables mediating observable interactions, information stored in many-variable relationships tends to be wiped out. I’ll call information stored in relationships "distributed information". We can re-state our hypothesis as: in a complex system, noise tends to wipe out many-variable distributed information. (In an earlier draft, I called it "entanglement information" in analogy to quantum; this produces the familiar-sounding-to-physicists hypothesis that "noise tends to wipe out many-variable entanglement".)
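To make the XOR example concrete, here’s a quick numeric check (a sketch of my own, not code from the post): it enumerates the joint distribution and confirms that X and Y each carry zero bits about Z, while the pair carries one full bit.

```python
from itertools import product
from math import log2

# Joint distribution of (X, Y, Z) with X, Y uniform bits and Z = X XOR Y.
# Only 4 of the 8 possible outcomes have nonzero probability.
joint = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

def mutual_info(joint, a_idx, b_idx):
    """I(A; B) in bits, where a_idx/b_idx pick out coordinates of each outcome tuple."""
    pa, pb, pab = {}, {}, {}
    for outcome, pr in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] = pa.get(a, 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
        pab[(a, b)] = pab.get((a, b), 0.0) + pr
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items())

print(mutual_info(joint, (0,), (2,)))    # I(X; Z)      = 0.0
print(mutual_info(joint, (1,), (2,)))    # I(Y; Z)      = 0.0
print(mutual_info(joint, (0, 1), (2,)))  # I((X,Y); Z)  = 1.0
```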

A Measure of Distributed Information

In order to formalize this idea, we first need an information-theoretic notion of "information stored in the relationship between variables". Although the notion is qualitatively discussed in the context of multivariate mutual information, surprisingly I have not seen an actual formulation which separates distributed info from redundancy; multivariate mutual information just adds the two together. After trying a few approaches, here’s what I’ve settled on.

We have variables X_1 … X_n and Y; we want to talk about the information about Y contained in X, i.e. the mutual information I(X; Y). We’d like to decompose this into two parts: the information contained in individual variables X_i, and the information stored only across multiple variables. But we can’t just add up the information I(X_i; Y) stored in each individual variable to estimate the non-distributed info, because some of that information is redundant—we’d be double-counting.

So, strategy: let’s build a model which breaks the entanglement, forcibly throwing out the distributed info without throwing out any information in the individual variables. How do we do that? Start with the model which generated all of our variables (i.e. the experiment). Now, imagine running n independent copies of that model (i.e. n runs of the experiment), but condition on the outcome Y, i.e. select only runs with the same value of Y. Finally, imagine that each X_i came from a different run. This breaks the relationships between the X_i, while still maintaining all the information between any individual X_i and Y. Write it out, and this model is essentially a Naive Bayes approximation: it says that

P[Y | X] \approx \tilde{P}[Y | X] = \frac{1}{Z(X)} P[Y] \prod_i P[X_i | Y]

… where Z(X) is a normalizer. This makes sense—the Naive Bayes approximation ignores information contained in interactions between the X_i.

Now, we define the "distributed information" I_D as the amount of information thrown out by this approximation:

I_D(Y; X_1, …, X_n) := E_X[D_{KL}(P[Y | X] || \tilde{P}[Y | X])]

… where D_{KL} is the KL divergence. The "non-distributed information" is then whatever mutual information is not distributed, i.e. I(Y; X) - I_D(Y; X_1, …, X_n).

Now, let’s imagine that X_1 … X_k are observables, and X_{k+1} … X_n are hidden. How does the distributed information of X_1 … X_k about Y when the hidden variable values are known compare to the distributed information when the hidden variable values are unknown? We’d guess that unknown hidden variables wipe out some of the distributed information in the observables, and we can prove that using the convexity of KL-divergence:

I_D(Y; X_1, …, X_k) = E_{X_1 … X_k}[D_{KL}(P[Y | X_1 … X_k] || \tilde{P}[Y | X_1 … X_k])]
\leq E_{X_1 … X_k}[E_{X_{k+1} … X_n}[D_{KL}(P[Y | X] || \tilde{P}[Y | X])]]
= E_{X_{k+1} … X_n}[I_D(Y; X_1, …, X_k | X_{k+1} … X_n)]
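To see the definition in action, here’s a toy computation of I_D for the earlier XOR example (my own sketch of the definition above, not code from the post). The exact posterior P[Y | X] is deterministic, while the Naive Bayes posterior is uniform, so all one bit of I(Y; X) turns out to be distributed:

```python
from itertools import product
from math import log2

# Model: X1, X2 uniform bits, Y = X1 XOR X2. Joint stored as P[(x1, x2, y)].
P = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def posterior_exact(x1, x2):
    """P[Y | X1=x1, X2=x2], by enumerating the joint."""
    w = {y: P.get((x1, x2, y), 0.0) for y in (0, 1)}
    z = sum(w.values())
    return {y: w[y] / z for y in (0, 1)}

def posterior_naive_bayes(x1, x2):
    """Naive Bayes posterior, proportional to P[Y] * P[X1|Y] * P[X2|Y]."""
    p_y = {y: sum(pr for o, pr in P.items() if o[2] == y) for y in (0, 1)}
    def p_xi_given_y(i, xi, y):
        return sum(pr for o, pr in P.items() if o[i] == xi and o[2] == y) / p_y[y]
    w = {y: p_y[y] * p_xi_given_y(0, x1, y) * p_xi_given_y(1, x2, y) for y in (0, 1)}
    z = sum(w.values())
    return {y: w[y] / z for y in (0, 1)}

def kl_bits(p, q):
    """KL divergence D_KL(p || q) in bits."""
    return sum(p[y] * log2(p[y] / q[y]) for y in p if p[y] > 0)

# I_D = E_X[ D_KL( P[Y|X] || NB posterior ) ]; X is uniform over its 4 values here.
i_d = sum(0.25 * kl_bits(posterior_exact(x1, x2), posterior_naive_bayes(x1, x2))
          for x1, x2 in product([0, 1], repeat=2))
print(i_d)  # 1.0: every bit of I(Y; X) is distributed in the XOR example
```

For comparison, if the X_i were conditionally independent given Y, the Naive Bayes posterior would be exact and I_D would come out to zero.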
In other words: distributed information can only decrease (in expectation over the hidden variables) as we integrate them out of the model. Just as hypothesized, those noisy unobserved variables wipe out distributed information. Another way to look at it: as we integrate out hidden variables, Naive Bayes becomes a better approximation for the remaining variables. Now, we haven’t actually shown how much distributed information is lost as we integrate out hidden variables. But we’ll punt that problem to a future post. For now, suppose that as we integrate out unobserved variables, all distributed information falls to zero. What would such a distribution look like?
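For a concrete case where integrating out a hidden variable wipes out all of the distributed information (my example, not from the post): take X_1, X_2 observed, X_3 hidden, all uniform bits, and Y = X_1 \oplus X_2 \oplus X_3. Conditional on any value of X_3, the relationship between (X_1, X_2) and Y is just the XOR example again, so I_D(Y; X_1, X_2 | X_3) = 1 bit. But integrating X_3 out leaves Y completely independent of (X_1, X_2), so I_D(Y; X_1, X_2) = 0: the inequality above can be strict, and in this case maximally so.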

Zero Distributed Information

Just based on the definition, "zero distributed information" of X_1 … X_n about Y means that the Naive Bayes approximation is exact:

P[Y | X] = \frac{1}{Z(X)} P[Y] \prod_i P[X_i | Y]

(A quick numeric sanity check of this exactness condition appears at the end of this section.) But we don’t just want zero distributed information of some particular variables with respect to one other variable. We want zero distributed information of any subset of observable variables with respect to any other observable variable:

P[X_i | X_S] = \frac{1}{Z(X_S)} P[X_i] \prod_{j \in S} P[X_j | X_i]

… where X_S is any subset of the observable variables, and X_i is any observable not in X_S. (Without loss of generality, I’ll usually use the name Y for the variable X_i whose probability we’re applying Naive Bayes to, in order to make the formulas a bit easier to follow.)

At this point, we can throw math at the problem. Skip the next two subsections ("First Trick" and "Second Trick") if you just want the summary and intuition.

First Trick

The first big condition we need to satisfy is that Naive Bayes still holds after integrating out one more variable, i.e. it holds for both P[Y | X_S] and P[Y | X_{S'}], where S' = S \cup \{j\} for some index j \notin S, j \neq i. We can calculate via the usual rules of probability:

P[Y | X_S] = \sum_{X_j} P[Y | X_{S'}] P[X_j | X_S]

Applying Naive Bayes to both sides and cancelling common terms (the factor P[Y] \prod_{k \in S} P[X_k | Y] appears on both sides and can be pulled out of the sum, since it does not involve X_j) then yields:

\frac{1}{Z(X_S)} = \sum_{X_j} \frac{1}{Z(X_{S'})} P[X_j | Y] P[X_j | X_S]

Key thing to notice: the left-hand side does not depend on Y, so the Y-dependence on the right must somehow cancel out. There are two "obvious" ways for this to happen:
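As an aside, here is the numeric sanity check promised near the top of this section (a sketch of mine, not code from the post): if the observables are conditionally independent given Y, then the Naive Bayes formula for P[Y | X] is exact, so the distributed information is zero.

```python
from itertools import product

# A small conditionally-independent model (parameters chosen arbitrarily):
# P[y, x1, x2] = P[y] * P[x1|y] * P[x2|y].
p_y = {0: 0.3, 1: 0.7}
p_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_x1[y][x1]
p_x2 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}  # p_x2[y][x2]
P = {(y, x1, x2): p_y[y] * p_x1[y][x1] * p_x2[y][x2]
     for y, x1, x2 in product([0, 1], repeat=3)}

for x1, x2 in product([0, 1], repeat=2):
    # Exact posterior, by enumerating the joint.
    w = {y: P[(y, x1, x2)] for y in (0, 1)}
    exact = {y: w[y] / sum(w.values()) for y in (0, 1)}
    # Naive Bayes posterior, built from the joint's own marginals.
    marg_y = {y: sum(pr for o, pr in P.items() if o[0] == y) for y in (0, 1)}
    def cond(i, xi, y):
        return sum(pr for o, pr in P.items() if o[i] == xi and o[0] == y) / marg_y[y]
    w_nb = {y: marg_y[y] * cond(1, x1, y) * cond(2, x2, y) for y in (0, 1)}
    nb = {y: w_nb[y] / sum(w_nb.values()) for y in (0, 1)}
    assert all(abs(exact[y] - nb[y]) < 1e-12 for y in (0, 1))

print("Naive Bayes posterior is exact: I_D = 0 for this model")
```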

Quick Test: System of Normals

We’ll build a system of normal random variables. Each variable X_i is given by

X_i = Z_i + \sum_{j < i} \theta_{ij} X_j

… with Z_i a standard normal, and each \theta_{ij} either zero with probability 1-p (used to tune the density of connections) or uniform between 0 and s (used to tune the strength of connections). We make the system reasonably large, with reasonably low connection-density p, and randomly pick some subset of the X_i to consider "observed"; a minimal simulation sketch of this construction appears at the end of this section. (Note: this is not a particularly careful rendition of the assumptions made earlier. That’s fine; it’s supposed to be a quick test, and sensitivity to the exact conditions is something we care about anyway.)

Assuming the zero distributed information picture applies, what should this look like? Well, we should see a mix of two behaviors:
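Here is the simulation sketch of the construction above (my own code; the specific values of n, p, and s are illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 0.1, 1.0  # system size, connection density, connection strength

# theta[i, j] for j < i: zero with probability 1 - p, else uniform on [0, s].
theta = np.tril(rng.uniform(0.0, s, size=(n, n)), k=-1)
theta *= rng.random((n, n)) < p

def sample(num_samples):
    """Draw samples via X_i = Z_i + sum_{j<i} theta_ij * X_j, with Z_i ~ N(0, 1)."""
    Z = rng.standard_normal((num_samples, n))
    X = np.zeros_like(Z)
    for i in range(n):
        X[:, i] = Z[:, i] + X[:, :i] @ theta[i, :i]
    return X

X = sample(10_000)
observed = rng.choice(n, size=10, replace=False)  # random subset treated as "observed"
print(X[:, observed].shape)  # (10000, 10)
```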

Comment

https://www.lesswrong.com/posts/CxEbvETK2WNfHw7v9/intuitions-on-universal-behavior-of-information-at-a?commentId=fKEjqieCo8988Be9E

This is so so good! I think this is my favorite post of yours. It’s just teeming with possibilities and connections. Melting/freezing and magnetization connections are very interesting. This almost seems similar to what happens with brain plasticity increase (due to mania, being excited, or substances).

https://www.lesswrong.com/posts/CxEbvETK2WNfHw7v9/intuitions-on-universal-behavior-of-information-at-a?commentId=4367pJZHoayZ9MPFg

"with three random variables X, Y, Z, it’s possible for X to contain zero information about Z, and Y to contain zero information about Z, but for the pair (X,Y) together to contain nonzero information about Z."

There’s a name for this and I can never freakin remember it. I need to make a mnemonic the next time I find it. Music jingles:

If you think your x and y vary independently
And neither x nor y can explain your z
But x and y together create a neat story
then you’ve got a ________!

Comment

https://www.lesswrong.com/posts/CxEbvETK2WNfHw7v9/intuitions-on-universal-behavior-of-information-at-a?commentId=PT8BML9n9T43yk9dC

Pairwise independent but triple-wise dependent*?

*The phrase "mutually dependent" seems more common, but it’s more general.

EDIT: Easier-to-remember version: "Dependent triple that’s pairwise independent."