Abstractions as Redundant Information

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information

Contents

Meta

Advice for readers: I try to keep the more-dense math in a few specific sections and the appendices, which can all be skimmed/​skipped if you just want a conceptual picture. Epistemic status: I’m aiming for physics-level rigor here. The proofs involve some shenanigans with infinite limits, and I expect them to contain subtle errors. However, I also expect that the results will accurately predict reality in practice, and that whatever subtle errors the proofs contain can be patched by the sort of mathematician who enjoys dealing with tricky limit shenanigans. Thankyou to Rob Miles and Adam Shimi for suggesting I try Excalidraw.

Basic Idea: Redundant Information Is Conserved Under Resampling

Suppose I sample the genomes of two random humans, G_1 and G_2.What information is redundant across these two random variables? One intuitively-reasonable answer: if I threw away one of the two sequences, then the redundant information is whatever I could still figure out from the other. More generally, if I have a whole bunch of genomes from random humans, G_1 … G_N, the redundant information is whatever I could still figure out after throwing one of them away. To formalize "what I could still figure out after throwing one away", we’ll use the idea of resampling: I throw away the value of G_i, and then sample a new value for G_i, using my original probability distribution conditioned on all the other genomes. So, for instance, I throw away G_i, then I look at all the other genomes and see that in most places they’re the same—so when I sample my new G_i, I know that it should match all the other genomes in all those places. However, there are a few locations where the other genomes differ, e.g. maybe 10% of them contain a particular mutation. So, when I sample my new G_i, it will contain that mutation with roughly 10% probability (assuming there’s enough data to swamp the impact of my priors). Resampling: we throw out one genome, then resample it from a distribution informed by all the other genomes. Any information which is highly redundant across the genomes is conserved—e.g. the sequence prefix "ATGAA" stays the same.Now I repeat this process many times, each time throwing away one randomly-chosen genome and resampling it conditional on the others. Intuitively, I expect that the information conserved by this process will be whatever information is highly redundant, so that approximately-zero loss occurs at each step. General method:

Conceptual Example: Gear

Suppose I have a gear, spinning around in a gearbox. I look at a few nanometer-size patches on the gear’s surface, just a hundred-ish atoms each, and measure the (approximate) position and velocity of each atom in each little patch. Notation: random variable X_i gives the positions and velocities of each atom in patch i. What information is redundant across these different patches? Intuitively, I can look at any patch and average together the rotational velocities of the atoms about the gear’s center to get a reasonably-precise estimate of the gear’s overall rotation speed. If I throw away X_i, I can still precisely estimate the gear’s overall rotation from the other X’s. Then, when I resample X_i, I will resample it so that the average rotational velocity of the atoms in X_i matches the gear’s overall rotation.

Equations

Notation: we’ll use X^t for the variable-values after t steps of this process, and X^\infty for variable-values after running the process for an arbitrarily long time. So, we’re mainly interested in the mutual information between X^0 and X^\infty. The resampling process itself specifies the distribution P[X^\infty|X^0, M_{resample}]. We’ll use "model" variables M_{base} and M_{resample} to distinguish probabilities from the "base" distribution (which is only over X_1 … X_N) vs probabilities from the resampling process (which is over X_1^0 … X_N^0, X_1^1 … X_N^1, …, X_1^\infty … X_N^\infty). One potential point of confusion: both models contain a variable called "X", but these two variables are "in different scopes" (in the programmer’s sense); "X" means something different depending on which model it’s in. In the base model, X is just a single instance of our base random variables. In the resampling model, X consists of many instances X^t, one for each timestep t, and each individual instance is distributed the same as the base model’s X. We use the same variable name for both because there’s a conceptual correspondence between the two. The resampling process defines the full relationship between the two models: P[X^0_1 = x_1 … X^0_N = x_N| M_{resample}] = P[X_1 = x_1 … X_N = x_N| M_{base}] P[X^t_i = x_i’|M_{resample}, X^{t-1} = x] = P[X_i = x_i’|M_{base}, X_{\neq i} = x_{\neq i}] (assuming variable i is resampled at resampling-time t; for all the other variables, P[X^t_i = x_i’|M_{resample}, X^{t-1} = x] = I[x_i’ = x_i], since their values just stay the same) P[X|M_{resample}] = P[X^0|M_{resample}] \prod_t P[X^t|M_{resample}, X^{t-1}] These equations follow directly from the process outlined above, and define the distribution P[X|M_{resample}] in terms of the distribution P[X|M_{base}].

… So We’re Running MCMC?

If you’ve worked with Markov Chain Monte Carlo (MCMC) algorithms before, this should look familiar: we’re basically asking which information about the initial conditions will be conserved as we run a standard MCMC process. If you’ve worked with MCMC algorithms before, you might also guess that this question has two answers:

Worked Examples

Two Variables

Let’s start with a trivial example to show how any information at all can be conserved by the resampling process. We have two random variables, X_1 and X_2. Our model for the two variable values is:

Many Measurements Of One Thing

Let’s say we have a stick of length L, and take N conditionally-independent measurements X_1 … X_N of L. We’ll model each measurement as normally distributed with mean L and standard deviation \sigma, and we’ll assume that we have enough data that the prior on L doesn’t matter much (i.e. we’ll just pretend the prior is flat). To simplify the analysis a little, we’ll resample the variables in order rather than at random—i.e. we resample L conditional on all the measurements, then resample each of the measurements X_1 … X_N conditional on L, and repeat. When we resample L, we draw from a normal distribution with mean equal to the average of the measurements (i.e. \frac{1}{N} \sum_i X_i) and standard deviation \frac{1}{\sqrt{N}} \sigma, so our new L will be about \frac{1}{\sqrt{N}} \sigma away from the previous average. When we resample the measurements, their new average will be normally distributed with mean L and standard deviation \frac{1}{\sqrt{N}} \sigma, so the new average will be about \frac{1}{\sqrt{N}} \sigma away from the previous L. In other words: L and the measurement average follow a random walk, drifting about \sqrt{\frac{2}{N}} \sigma per timestep. Over T timesteps, they will drift a distance of about \sqrt{\frac{2T}{N}}\sigma. (In general, the distance a random walk drifts scales with \sqrt{T} rather than T, since it often wanders back on itself.) So: if we run the process for a number of steps T >> N, then all information about the initial conditions is lost. On the other hand, if the number of variables N >> T, then the drift is close to zero, so L and the measurement average are approximately conserved. The order in which we take our limits matters. Practically speaking, we’re looking for information which is approximately conserved, i.e. the "timescale" T over which it’s lost is large. So it makes sense to consider L and the measurement average as approximately-conserved when N is large. That’s our abstract information in this example.

Factorization

Now for our first theorem about this kind of resampling process. Imagine that our base distribution is local—i.e. each variable only "directly interacts with" a few "neighbor variables". When modeling a physical gear, for instance, each little chunk of metal only interacts directly with the chunks of metal spatially adjacent. Any longer-range interactions have to "go through" those direct interactions. Our theorem says that interactions are still local after controlling for the information conserved by the resampling process. In the gear example, after controlling for the high-level rotation of the gear, the remaining low-level vibrations and rattling are still local; the low-level details of each chunk of metal interact directly only with the low-level details in chunks spatially adjacent. Why does this matter? In general, locality is the main tool which lets us reason about our high-dimensional world at all. It means we can look at one part of the world, and understand what’s going on there without having to understand everything that’s happening in the whole universe. The factorization theorem says that this still applies when we condition on our high-level knowledge—in other words, we can "zoom in" on lower-level details, and add them to our high-level picture without having to understand all the other low-level details in the whole universe. Conditional on our high-level knowledge, any low-level information still has to flow through neighboring variables in order to influence things "far away" in the graph. That’s going to be key to our next theorem, which is the main item of interest in this post.

Formal Statement

**Resampler Conserves Locality: If the base distribution P[X | M_{base}] factors over some graph G, then so does the limiting resampling distribution **P[X^\infty | M_{resample}, X^0]. This factorization theorem applies to both undirected graphical models (i.e. Markov Random Fields) and directed graphical models (i.e. Bayes Nets/​Causal Models). See the appendices for a proof sketch. In the gear example, the graph G would be the adjacency graph for chunks of metal: each chunk is a node, and the edges show spatial adjacency. The factorization follows the standard formulas for factorization of Markov Random Fields or Bayes Nets, depending on which type of graphical model we’re using.

The Interesting Part: Resampler-Conserved Quantities Mediate Information At A Distance

Time for the big claim from earlier: in the gear example, conditional on the highly redundant information (i.e. the overall rotation of the gear), the low-level rattling of far-apart chunks of metal is statistically independent. More generally: assume that our base distribution factors on a graph G. Conditional on all the quantities perfectly conserved by the resampling process, variables far apart in G are independent. If you’ve read the Telephone Theorem, this is basically the same, but with one big upgrade: our "high-level summary" no longer depends on which notion of "far away" we use; the same summary applies to any sequence of nonoverlapping nested Markov blankets. We can take the information conserved by the resampling process to be the "high-level abstractions" for the whole model. Let’s unpack that. First, we’ll do a quick recap of the Telephone Theorem. We start with a large Causal Model/​Bayes Net: Each node is a random variable, and the arrows show direct causal influence; paths show indirect causal influence. If you’ve used MCMC before, you were probably picturing something like this already; we usually use MCMC with Bayes nets, because the locality structure makes it easy to resample variables (we only need to look at neighbor values rather than the values of all variables). Now, just like in the Telephone Theorem, we picture a sequence of nested Markov blankets M_1 … M_n in our model: Each possible sequence of blankets cuts the graph into pieces, with each piece only connected directly to the piece before and the piece after. A choice of sequence of blankets defines a notion of "far away"—i.e. if two variables are separated by a large number of "layers" of blankets in the sequence, then they are "far apart". In order for M_1 to have any mutual information with M_n, that information must propagate through each of the layers in between. The basic idea of the Telephone Theorem is that information is either perfectly conserved or completely lost as we move through enough layers; information can only propagate "far away" if the information can be perfectly computed from each layer individually. … but if some information can be perfectly computed from each layer individually, then that information will be conserved by our resampler. When resampling M_n, I can still perfectly calculate the information from M_{n+1} or M_{n-1}, and therefore the information will be perfectly conserved. So, the only information which can propagate far away is information which is perfectly conserved by resampling. That means that conditional on the information conserved by resampling, the mutual information must drop to zero. A slightly more formal version of that argument is in the appendices, but that’s the core idea.

Formal Statement

Let F satisfy P[X^\infty|M_{resample}, X^0] = P[X^\infty|M_{resample}, F(X^0)], i.e.F encodes the values of all conserved quantities in the resampling process. For any infinite sequence of nested nonoverlapping Markov blankets B_1, B_2, ... on the base model, the conditional mutual information MI(B_1, B_n | F(X)) \rightarrow 0 as n \rightarrow \infty.

Gear Example

In the gear example, our Markov blankets might be nested layers of metal: "Far away" then indicates moving through a large number of such layers. In order for information to propagate far away, we must be able to compute it from each layer—i.e. we can very precisely compute the overall rotation of the gear from the overall rotation of chunks of metal in each layer. So, when we resample a chunk of metal, the overall rotation will be conserved—it will be "stored in" the other chunks. This reasoning must apply to any information which propagates through many layers: if the information propagates far away, it will be conserved by resampling. So, assuming the overall rotation is the only quantity conserved by the resampling process, far-apart chunks of the gear must be independent given the overall rotation. Intuitively, this lets us factor the gear into a "global" component and a "local" component. The global component, the gear’s overall rotation, is redundantly represented; it can be estimated by looking at many different little chunks, and we expect these estimates to (approximately) agree. The local component captures everything else, and is guaranteed to be "local" in the sense that far apart pieces are independent.

Conclusion

We started with the intuition that abstraction is about redundant information: there are many different places from which we can learn the information, and many different places where we can apply the information to make predictions. That’s what makes abstractions generalizable and useful. Then, we showed that a formalization of this intuition based on resampling variables reproduces the main ideas of abstraction as information-relevant-at-a-distance. In particular, the resampling approach yields a better version of the Telephone Theorem. Personally, I came to all this in a different order: I noticed that the Telephone Theorem required redundancy of information, figured out the resampling thing, and only backed out the intuition of abstraction-as-redundant-information later. Nonetheless, it was an exciting thing to find: when different intuitive formulations of an idea turn out to be basically-equivalent, that’s strong evidence that we’re on the right track. In terms of applications, the locality results in particular are exactly the sort of thing which I’ve been looking for since my last update on testing the Natural Abstraction Hypothesis. In combination with the generalized Koopman-Pitman-Darmois theorem, they get us to the point where calculating abstractions from a base model is roughly-tractable, i.e. it can be done in something like O(N^3) time with respect to the size of the base model. That still isn’t quite efficient enough to handle big models, and unfortunately big models are exactly where we expect to see nontrivial results, so I’m still not quite at the point of running empirical tests. But it feels like the hard part is now past; there are some clear steps forward on the math, and I expect that those will basically close the gap to efficient calculation. On the theoretical side, the first leg of the Natural Abstraction Hypothesis says:

For most physical systems, the information relevant "far away" can be represented by a summary much lower-dimensional than the system itself. Assuming the proofs in this post basically hold up, and the loopholes aren’t critical, I think this claim is now basically proven. There’s still some operationalization to be done (e.g. the "dimension" of the summary hasn’t actually been addressed yet), loopholes to close (e.g. deterministic computation makes things tricky), some legwork to flesh it all out (e.g. including numerical approximation), various extensions (e.g. logical uncertainty), and a lot of distillation needed, but I think this math is enough to conclude that the basic claim is probably true in worlds following local laws (like ours). The basic idea is also useful even in nonlocal models, which I’ll hopefully write about in the not-too-distant future. That form is more readily applicable to clustering-style applications, like e.g. recognizing "pencils" as a kind of object.

Appendices

Proof Sketch: Factorization

We’ll use three facts. First, P[X^\infty | M_{resample}, X^0] is invariant under resampling any variable—i.e. the distribution reaches an equilibrium as we run the resampling process. I won’t actually prove that, because I don’t expect that there’s anything new involved; standard MCMC convergence proofs should largely carry over (allowing for conserved quantities, of course). Formally: P[X^t | M_{resample}, X^0] = P[X^{t-1} | M_{resample}, X^0] as t \rightarrow \infty Second, the resampler is local: P[X^t_i = x_i’|M_{resample}, X^{t-1} = x] = P[X_i = x_i’|M_{base}, X_{\neq i} = x_{\neq i}] … and P[X_i = x_i’|M_{base}, X_{\neq i} = x_{\neq i}] = P[X_i = x_i’|M_{base}, X_{nb(i)} = x_{nb(i)}] … so P[X^t_i = x_i’|M_{resample}, X^{t-1}{\neq i} = x{\neq i}] = P[X^t_i = x_i’|M_{resample}, X^{t-1}{nb(i)} = x{nb(i)}] … where X_{nb(i)} indicates the neighbors of i in the graph on which M_{base} factors. Third, X^{t-1}{\neq i} screens off X^0 from X^t (thus the "Markov Chain" part of "Markov Chain Monte Carlo). So, P[X^t |M{resample}, X^{t-1}{\neq i}] = P[X^t |M{resample}, X^0, X^{t-1}{\neq i}]. Combining these three, we find P[X^t_i |M{resample}, X^0, X^t_{\neq i}] = P[X^t_i |M_{resample}, X^0, X^t_{nb(i)}] as t \rightarrow \infty … i.e. the neighbors which screen off X_i from everything else in the base model also screen off X_i^\infty from everything else in X^\infty, given X^0. The Hammersley-Clifford Theorem tells us that this is a sufficient condition for P[X^\infty|M_{resample}, X^0] to factor over the graph G (modulo taking some limits). Note that the Hammersley-Clifford theorem applies to undirected graphical models. I won’t prove the extension of our theorem to Bayes Nets here because, again, there’s nothing particularly novel involved.

Proof Sketch: Resampler Telephone Theorem

Once we have locality of X^\infty given X^0, this theorem is easy: it’s exactly like the Telephone Theorem, but on the distribution P[X^\infty|M_{resample}, X^0] rather than P[X|M_{base}]. Because P[X^\infty|M_{resample}, X^0] factors over the same graph as P[X|M_{base}], we can pick any sequence of nonoverlapping nested Markov blankets B_1 … B_n in the base model, carry them over directly to X^\infty, and apply the Telephone Theorem argument:

Comment

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=k6afPtZA5MZcrAyPN

(Warning: I might not be describing this well. And might be stupid.) I feel like there’s an alternate perspective compared to what you’re doing, and I’m trying to understand why you don’t take that path. Something like: You’re taking the perspective that there is *one *Bayes net that we want to understand. And the alternative perspective is: there is a succession of "scenarios", and each is its own Bayes net, and there are deep regularities connecting them—for example they all obey the laws of physics, there’s some relation between the "chairs" in one scenario and future scenarios, etc. In the "multiple scenarios" perspective, we can frame everything in terms of building good generative models, that predict missing data from limited data, or predict the future from the past. It seems like "resampling" is unnecessary in this perspective; we can evaluate generative models by whether they work well in the next scenario. Or as another example, we would learn a generative model of a gear that involved rigid rotation plus small-amplitude vibration, and when we see something that seems to match that model, we would guess that the model is applicable here. Etc. Then a claim about far-away information would look something like "the generative model is structured as a bunch of sub-items with coordinates, and these items have local interactions". And then you can make a substantive claim: "This class of generative models is a very powerful model class, really good at making predictions given a fixed model complexity". Is that claim true? In some senses yes, but we can also immediately see limitations, e.g. "it is nighttime" is a useful generative model ingredient that doesn’t really have coordinates, and likewise "illegal", etc. The "redundant information" framing still applies; if a comparatively simple generative model can make lots of correct predictions, then clearly there was redundant information in what was predicted. Anyway, my question is: am I correct that we can define abstractions as "ingredients in generative models", and if so, why don’t you like that approach? (Or is it equivalent to your approach??)

Comment

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=FMFDBNmbaXeHhq9eG

You can think of everything I’m doing as occurring in a "God’s eye" model. I expect that an agent embedded in this God’s-eye model will only be able to usefully measure natural abstractions within the model. So, shifting to the agent’s perspective, we could say "holding these abstractions fixed, what possible models are compatible with them?". And that is indeed a direction I plan to go. But first, I want to get the nicest math I possibly can for computing the abstractions within a model, because the cleaner that is the cleaner I expect that computing possible models from the abstractions will be. … that was kinda rambly, but I guess the summary is "building good generative problems is the inverse problem for the approach I’m currently focused on, and I expect that the cleaner this problem is solved the easier it will be to handle its inverse problem".

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=4LBhmMayEbPgJe6hk

This seems similar to the SR model of scientific explanation.

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=4Epqng2mwgDCHkzHs

This post reminds me of a self-supervised machine learning model I wanted to make a while ago, but I never got around to.

Essentially the idea was to split the data up into a "small connected region" vs "everything else". For instance, if the data was images, I’d pick out a small patch of the image vs the rest of the image. And then I’d predict the distribution of the small region from the everything else. The logic was that one could then invert the process, by taking a patch and predicting the distribution that this patch came from; this would tell you the features of the patch that correlate with far-away information.

More formally, my idea was to have two neural networks, which I called F and G, where F takes an image and outputs a feature vector, and G takes a feature vector and somehow functions as a generative model (in my original proposal I chose G to Image-GPT, but you could also imagine training G as a GAN or something). Let x be an image, C(x) be a small region from that image, and let M(x) be most of x with C(x) masked away, and let P(y|G(f)) be the probability of y according to G conditioned on f.

If we then consider the expression P(C(x)|G(F(M(x)))), then this essentially amounts to, how well is a region of an image predicted by features "far away". So you can optimize this to get a model that does the sort of thing that your post talks about.

My idea was then that F might be kinda messy because id there’s a lot of redundant information, F doesn’t need to capture all copies of it, and so we probably shouldn’t use F. But since G is a generative model, it is incentivized to put in the redundant information to each of its output variables where the information would occur. So we would expect G to be quite robust.

If we wanted to abstract the important faraway features of an image while leaving behind the unimportant noise, we could thus just invert G, essentially asking "what faraway-relevant features might account for this patch?".

I wrote more discussion about it in the #general channel of the EleutherAI discord back the 12th-13th of March, 2021.

Comment

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=jdcj7QWz7jho8P2io

Looks like all the followups to BEiT masked autoencoder image modeling are doing similar things: https://​​arxiv.org/​​abs/​​2202.03382 https://​​arxiv.org/​​abs/​​2202.03026 https://​​arxiv.org/​​abs/​​2202.04200

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=TtyWMLbnLqYswaruB

This might not work depending on the details of how "information" is specified in these examples, but would this model of abstractions consider "blob of random noise" a good abstraction? On the one hand, different blobs of random noise contain no information about each other on a particle level—in fact, they contain no information about anything on a particle level, if the noise is "truly" random. And yet they seem like a natural category, since they have "higher-level properties" in common, such as unpredictability and idk maybe mean/​sd of particle velocities or something. This is basically my attempt to produce an illustrative example for my worry that mutual information might not be sufficient to capture the relationships between abstractions that make them good abstractions, such as "usefulness" or other higher-level properties.

Comment

https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information?commentId=7gZLx72MkdeMSzGFt

If they have mean/​sd in common (as in e.g. a Gaussian clustering problem), then the mean/​sd are exactly the abstract information. If they’re all completely independent, without any latents (like mean/​sd) at all, then the blob itself is not a natural abstraction, at least if we’re staying within an information-theoretic playground. I do expect this will eventually need to be extended beyond mutual information, especially to handle the kinds of abstractions we use in math (like "groups", for instance). My guess is that most of the structure will carry over; Bayes nets and mutual information have pretty natural category-theoretic extensions as I understand it, and I expect that roughly the same approach and techniques I use here will extend to that setting. I don’t personally have enough expertise there to do it myself, though.