Generalizing Koopman-Pitman-Darmois

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois

The Theorems

We’ll cover three generalized KPD theorems, each with different assumptions on the structure of P[X|\Theta]: independent but non-identical variables, variables which factor according to a causal model/Bayes net, and IID variables.

Independent But Non-Identical KPD

The simplest starting point for these theorems is parameter estimation from independent-but-not-identically-distributed variables. (If you want a concrete image in mind, picture scientists trying to estimate the values of some physical constants using a bunch of different experiments.)

Theorem

Let (X_1, …, X_n) be continuous random variables, conditionally independent given \Theta (i.e. P[X|\Theta] = \prod_i P[X_i|\Theta]). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to \Theta only if the distribution has the form P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_{i \not \in E} g_i(X_i)} \prod_{i \not \in E} P[X_i|\Theta^0] \prod_{i \in E} P[X_i|\Theta] … where E is a set of at most D "exception" indices, f(\Theta) and the g_i are at-most-D-dimensional functions, \Theta^0 is an arbitrary fixed reference value of the parameters, and Z(\Theta) is a normalizer. (Throughout, we assume the relevant densities are differentiable and that the region of nonzero probability density does not vary with \Theta; see the comments below.)

Proof

The first key idea is the Minimal Map Theorem (also proved more elegantly here): if we want to summarize all the information in X which is relevant to \Theta, then the posterior distribution (\theta \mapsto P[\Theta = \theta|X]) is a sufficient statistic. It’s also "minimal" in the sense that the posterior distribution can be calculated from any other sufficient statistic, so any other sufficient statistic must contain at least as much information (this can be proven via the Data Processing Inequality). So, given our hypothetical fixed-dimensional summary statistic G(X) (where X is a vector of all the data points), we can write F(G(X)) = (\theta \mapsto P[\Theta = \theta|X]) … for some function F. The right-hand side is a data structure containing the values of P[\Theta = \theta|X] for EVERY \theta-value; we can think of it as a vector indexed by \theta. (In the case where \theta has m possible values \theta^1, …, \theta^m, any map (\theta \mapsto f(\theta)) with real-valued f can be represented as a length-m vector v in which v_i is f(\theta^i).)

Now, the key property underlying (this version of) the KPD is conditional independence of X, i.e. P[X|\theta] = \prod_i P[X_i|\theta]. We want to leverage that factorization, so we’ll transform the distribution into a representation we can factor. First, transform the distribution to a log-odds representation: logOdds(F(G(X))) = (\theta \mapsto ln\frac{P[\Theta = \theta|X]}{P[\Theta = \theta^0|X]}) … for some arbitrary reference parameters \theta^0. Next, we apply Bayes’ Rule: = (\theta \mapsto ln\frac{P[X|\Theta = \theta]}{P[X|\Theta = \theta^0]} + ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]}) … then subtract off the prior ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]} from both sides, since it’s not a function of X: logOdds(F(G(X))) - (\theta \mapsto ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]}) = (\theta \mapsto ln\frac{P[X|\Theta = \theta]}{P[X|\Theta = \theta^0]}) To clean up the notation, we’ll define F’(G) := logOdds(F(G)) - (\theta \mapsto ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]}).

Finally, we can apply the factorization (remember, that was the point of all these transformations) and get F’(G(X)) = \sum_i (\theta \mapsto ln\frac{P[X_i|\Theta = \theta]}{P[X_i|\Theta = \theta^0]}) The right-hand side of this expression is a sum of vectors indexed by \theta. Each vector is a function of just one of the X_i, and the left-hand side says that we can express the sum in terms of a D-dimensional summary G(X). So, this is an Additive Summary Equation (see appendix below for details). In fact, it’s a particularly simple Additive Summary Equation: each f_i depends only on x_i, and each x_i has no neighbors besides itself.

The Additive Summary Equation Theorem then tells us that the equation is solvable (i.e. a D-dimensional summary statistic exists) only if ln\frac{P[X|\Theta = \theta]}{P[X|\Theta = \theta^0]} = \sum_{i \in B} (ln\frac{P[X_i|\Theta = \theta]}{P[X_i|\Theta = \theta^0]}) + U(\theta) \sum_{i’ \not \in B}g_{i’}(x_{i’}) + C(\theta) … for some B consisting of at-most D X-indices, and some U(\theta) of dimension at-most D. Exponentiating both sides and simplifying a bit, this condition becomes P[X|\Theta = \theta] = \frac{1}{Z(\theta)} e^{U(\theta) \sum_{i’ \not \in B} g_{i’}(x_{i’})} \prod_{i \in B} P[X_i|\Theta = \theta] \prod_{i’ \not \in B} P[X_{i’}|\Theta = \theta^0] … which is the desired form. (Here B is exactly the exception set E from the theorem statement; since each X_i is its own only neighbor, N(B) = B.)
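As a sanity check on the log-odds manipulation above, here is a small numerical sketch (my own, not from the post), assuming a discrete grid of \theta values and some arbitrary per-point Gaussian-style likelihoods: the posterior log-odds vector, minus the prior log-odds, equals the sum of per-data-point log-likelihood-ratio vectors.

```python
import numpy as np

# Minimal numerical sketch of the log-odds decomposition used in the proof.
# theta is restricted to a small discrete grid; the data points are
# conditionally independent given theta but not identically distributed.

rng = np.random.default_rng(0)
thetas = np.linspace(-2.0, 2.0, 5)            # grid of parameter values; index 0 plays the role of theta^0
prior = np.full(len(thetas), 1 / len(thetas))

n = 10
x = rng.normal(size=n)                        # observed data (the identity holds for any data)

def log_lik_i(i, x_i, theta):
    # Each data point has its own (illustrative) relationship to theta.
    return -0.5 * (x_i - (i + 1) * 0.1 * theta) ** 2

# Per-point log-likelihood-ratio vectors, indexed by theta:
#   (theta -> ln P[X_i | theta] / P[X_i | theta^0])
per_point = np.array([[log_lik_i(i, x[i], t) - log_lik_i(i, x[i], thetas[0])
                       for t in thetas] for i in range(n)])

# Unnormalized log posterior over the theta grid, via Bayes' rule.
log_post = np.log(prior) + per_point.sum(axis=0)
log_odds = log_post - log_post[0]             # posterior log-odds relative to theta^0
prior_log_odds = np.log(prior) - np.log(prior[0])

# The identity F'(G(X)) = sum_i (theta -> ln P[X_i|theta]/P[X_i|theta^0]):
assert np.allclose(log_odds - prior_log_odds, per_point.sum(axis=0))
print("posterior log-odds minus prior log-odds equals the sum of per-point vectors")
```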

Key Takeaways

The main takeaway from the theorem is the idea of "exception" variables, which circumvent the exponential family form, and the idea that the number of exception variables is limited by the dimensionality of the sufficient statistic. The main takeaways from the proof are that the posterior distribution (the "minimal map") is always a sufficient statistic, and that a log-odds transformation turns conditional independence of X into an Additive Summary Equation, which does the real work.
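To make the "exception variable" idea concrete, here is a hypothetical example of my own (not from the post): one data point is Cauchy-distributed around \theta, which is not exponential family, while the rest are Gaussian around \theta. A low-dimensional sufficient statistic still exists, but it must carry the exception variable separately, consistent with the theorem's form with E = {1} and D = 2.

```python
import numpy as np

# Illustrative sketch of an "exception" variable (my example, not the post's).
# X_1 | theta is Cauchy(theta) -- not exponential family -- while X_2..X_n are
# N(theta, 1). A sufficient statistic still exists, but it must carry X_1
# separately: G(x) = (x_1, sum_{i>=2} x_i), so D = 2 and the exception set is {1}.

rng = np.random.default_rng(4)
n = 15
x = np.concatenate([rng.standard_cauchy(size=1), rng.normal(size=n - 1)])

def log_lik(x, theta):
    # Log density up to constants that cancel in likelihood ratios.
    cauchy = -np.log(np.pi * (1.0 + (x[0] - theta) ** 2))   # exception variable
    gauss = -0.5 * np.sum((x[1:] - theta) ** 2)             # exponential-family part
    return cauchy + gauss

def G(x):
    return (x[0], x[1:].sum())

# Two datasets with the same G(x) give the same likelihood ratio for every theta.
y = x.copy()
y[1] += 1.0
y[2] -= 1.0                      # same sum over the non-exception variables
assert G(x)[0] == G(y)[0] and np.isclose(G(x)[1], G(y)[1])

for theta in [-1.5, 0.2, 3.0]:
    lr_x = log_lik(x, theta) - log_lik(x, 0.0)
    lr_y = log_lik(y, theta) - log_lik(y, 0.0)
    assert np.isclose(lr_x, lr_y)
print("likelihood ratios depend on the data only through G(x) = (x_1, sum of the rest)")
```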

Causal Model/Bayes Net KPD

Now we address a much more general case: factorization according to a Bayes Net/Causal Model. We still imagine estimating some parameters \Theta, but our data is no longer conditionally independent: conditional on the parameters, data points can now interact directly. However, the direct interactions are sparse—i.e. most data points interact directly with only a few other data points (though they may indirectly interact with everything). This would be the case if, for example, we are trying to estimate parameters of some complicated physical system in real-time from a single instance of the system.

Theorem

Let (X_1, …, X_n) be continuous random variables, whose distribution conditional on \Theta forms a Bayes Net on a DAG (i.e. P[X|\Theta] = \prod_i P[X_i|X_{pa(i)}, \Theta], where pa(i) denotes the nodes with edges directly into i in the DAG). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to \Theta only if the distribution has the form P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_{i \not \in E} g_i(x_i, x_{pa(i)}) } \prod_{i \not \in E} P[X_i|X_{pa(i)}, \Theta=\theta^0] \prod_{i \in E} P[X_i|X_{pa(i)}, \Theta=\theta] … where E = N(B) \cup ch(N(B)) for some index set B with |B| \leq D (so the "exception" variables form a neighborhood around at most D variables; N denotes the Markov blanket and ch denotes children in the DAG), f(\Theta) and the g_i are at-most-D-dimensional functions, \theta^0 is an arbitrary fixed reference value of the parameters, and Z(\Theta) is a normalizer.

Proof

This follows broadly the same structure as the proof for the Independent But Non-Identical KPD, so we’ll speed through the parts which are the same and just focus on differences. As in that proof, we apply the Minimal Map Theorem, then apply a log odds transformation, and factor the distribution. The factorization is now P[X|\Theta] = \prod_i P[X_i|X_{pa(i)}, \Theta], so the summary equation is F’(G(X)) = (\theta \mapsto ln\frac{P[X|\Theta=\theta]}{P[X|\Theta = \theta^0]}) = \sum_i (\theta \mapsto ln\frac{P[X_i|X_{pa(i)}, \Theta = \theta]}{P[X_i|X_{pa(i)}, \Theta = \theta^0]})

The neighbors N(j) of a variable X_j are no longer trivial; they are exactly the Markov blanket of X_j (i.e. parents, children, and other parents of X_j’s children in the Bayes net). The terms which depend on X_{N(j)} are the factors corresponding to neighbors plus children-of-neighbors, i.e. N(j) \cup ch(N(j)).

The Additive Summary Equation Theorem then says that a D-dimensional summary exists only if ln\frac{P[X|\Theta=\theta]}{P[X|\Theta = \theta^0]} = \sum_{i \in E} (ln\frac{P[X_i|X_{pa(i)}, \Theta = \theta]}{P[X_i|X_{pa(i)}, \Theta = \theta^0]}) + U(\theta) \sum_{i’ \not \in E} g_{i’}(x_{i’}, x_{pa(i’)}) + C(\theta) … where E = N(B) \cup ch(N(B)), and |B| is at-most D.

Exponentiating and simplifying a bit yields: P[X|\Theta] = \frac{1}{Z(\Theta)} e^{U(\Theta) \sum_{i’ \not \in E} g_{i’}(x_{i’}, x_{pa(i’)}) } \prod_{i \not \in E} P[X_i|X_{pa(i)}, \Theta^0] \prod_{i \in E} P[X_i|X_{pa(i)}, \Theta] … which is the desired form.
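For a concrete instance of the Bayes-net form, here is a sketch of my own (an assumption-laden example, not from the post): a Gaussian Markov chain X_i = \Theta X_{i-1} + noise. Its log-likelihood ratio against \theta^0 = 0 depends on the data only through a 2-dimensional statistic built from per-factor terms g_i(x_i, x_{pa(i)}) = (x_i x_{i-1}, x_{i-1}^2), i.e. f(\Theta) = (\Theta, -\Theta^2/2) with no exception variables.

```python
import numpy as np

# Illustrative check (my example): a Gaussian Markov chain X_i = theta * X_{i-1} + noise
# has a low-dimensional sufficient statistic built from per-factor terms
# g_i(x_i, x_pa(i)), matching the Bayes-net exponential form.

def log_lik_ratio(x, theta, theta0=0.0):
    # ln P[x|theta] / P[x|theta0] for the chain with unit-variance noise;
    # X_1 ~ N(0, 1) independent of theta, so its factor cancels.
    terms = -0.5 * (x[1:] - theta * x[:-1]) ** 2
    terms0 = -0.5 * (x[1:] - theta0 * x[:-1]) ** 2
    return np.sum(terms - terms0)

def summary(x):
    # 2-dimensional statistic G(x) = (sum x_i x_{i-1}, sum x_{i-1}^2).
    return np.array([np.sum(x[1:] * x[:-1]), np.sum(x[:-1] ** 2)])

rng = np.random.default_rng(1)
x = rng.normal(size=50)

# Verify the algebraic identity (with theta0 = 0):
#   ln P[x|theta]/P[x|theta0] = theta * G_1(x) - 0.5 * theta^2 * G_2(x)
for theta in [0.3, -0.7, 1.2]:
    g = summary(x)
    assert np.isclose(log_lik_ratio(x, theta), theta * g[0] - 0.5 * theta ** 2 * g[1])
print("log-likelihood ratio depends on x only through the 2D summary")
```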

Key Takeaways

The main takeaway here is the idea that exceptional variables occur in neighborhoods, so sparsity is a necessary condition for the theorem to say anything nontrivial. On the other hand, causal models are extremely general, sparsity is ubiquitous in the physical sciences, and a sparse causal model is all we need to apply the theorem. We should thus expect exponential family distributions to show up in most places where low-dimensional summaries exist, in practice. In particular, we should expect exponential family forms for most abstractions, in practice. Another important point: the methods used in the proof are "smooth", in the sense that small violations of the assumptions should produce only small violations of the conclusions. In other words, we should also expect distributions with approximate low-dimensional summaries, which approximately factor as a sparse Bayes Net, to approximately fit the exponential form.

IID KPD

When our variables have some symmetry, we can sometimes leverage it to eliminate the annoying "exception" variables. The basic requirement is that our distribution P[X|\Theta] is invariant under permuting some of the variables. For instance, if we have a time-symmetric Markov Chain, then we can replace every X_t with X_{t+1}, and the distribution remains the same. In the case of IID variables, we can swap any two variables X_i and X_j, and the distribution remains the same, i.e. P[X_i = x_i, X_j = x_j, …| \Theta] = P[X_j = x_i, X_i = x_j, … | \Theta]. If we can swap exception variables with non-exception variables, then we can sometimes eliminate the exceptions altogether.

Theorem

Let (X_1, …, X_n) be continuous random variables, conditionally independent and identically distributed given \Theta (i.e. P[X=x|\Theta] = \prod_i P[X_1=x_i|\Theta]). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to \Theta, with D < n, if-and-only-if the distribution has the form P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_i g(X_i)} \prod_i P[X_i|\Theta^0] … where f, g are some at-most D-dimensional summary functions.

Other than eliminating the exception terms, there are two differences from the independent-but-not-identically-distributed theorem: the g_i’s are all the same function g, and we explicitly require D < n. When D \geq n in the earlier theorem, all variables can be exceptions, so the theorem says nothing. For this theorem, we need at least one non-exceptional variable to swap the other variables with, so we require D < n.

Note that this theorem is strictly stronger than the original KPD as typically stated, which also applies to the IID case. The original KPD talks about existence of a finite-dimensional summary for an infinite stream of IID variables, whereas this version applies even to a finite set of variables, so long as the dimension of the sufficient statistic is smaller than the number of variables.
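As a concrete instance of the "if" direction (a standard textbook example, not specific to this post): IID Gaussian data with unknown mean and known variance fits the exponential form with f(\theta) = \theta, g(x) = x, \theta^0 = 0, so the one-dimensional statistic \sum_i x_i is sufficient whenever n > 1. A quick numerical check:

```python
import numpy as np

# Illustrative sketch: X_i IID N(theta, 1). The theorem's exponential form holds
# with f(theta) = theta, g(x) = x, theta^0 = 0, since
#   ln P[x|theta] - ln P[x|0] = theta * x - theta^2 / 2.

rng = np.random.default_rng(2)
x_a = rng.normal(size=20)
x_b = x_a.copy()
x_b[0] += 0.5
x_b[1] -= 0.5                      # same sum(x), different individual points

def log_lik(x, theta):
    # Log density of N(theta, 1), up to a constant that cancels in ratios.
    return -0.5 * np.sum((x - theta) ** 2)

for theta in [-1.0, 0.4, 2.3]:
    ratio_a = log_lik(x_a, theta) - log_lik(x_a, 0.0)
    ratio_b = log_lik(x_b, theta) - log_lik(x_b, 0.0)
    assert np.isclose(ratio_a, ratio_b)   # data enters only through sum(x)
print("the 1-dimensional statistic sum(x) is sufficient for theta")
```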

Proof

Since this is a special case of independent variables, we can start from our independent but non-identical theorem: P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_{i \not \in E} g_i(X_i)} \prod_{i \not \in E} P[X_i|\Theta^0] \prod_{i \in E} P[X_i|\Theta]

Since we assume D < n, there must be at least one non-exception variable. So, without loss of generality, assume X_1 \not \in E. For any exception variable i \in E, we can swap X_i with X_1 and integrate out all the other variables to find \frac{1}{Z’(\Theta)} e^{f(\Theta) \cdot g_1(x_1)} P[X_1=x_1|\Theta^0] P[X_i=x_i|\Theta] = P[X_1 = x_1, X_i = x_i |\Theta] = P[X_1 = x_i, X_i = x_1 |\Theta] = \frac{1}{Z’(\Theta)} e^{f(\Theta) \cdot g_1(x_i)} P[X_1=x_i|\Theta^0] P[X_i=x_1|\Theta] … then integrate out x_1: P[X_i=x_i|\Theta] = \frac{1}{Z^{\prime\prime}(\Theta)} e^{f(\Theta) \cdot g_1(x_i)} P[X_1=x_i|\Theta^0] (Minor note: we’re absorbing constants into Z(\Theta) from each integral, which is why it keeps gaining primes.)

For variables i \not \in E, we can also swap with X_1 in order to replace g_i with g_1. Swapping and integrating as before yields e^{f(\Theta) \cdot g_1(x_i)} P[X_1=x_i|\Theta^0] = e^{f(\Theta) \cdot g_i(x_i)} P[X_i=x_i|\Theta^0]

Substituting both of those into the original expression from the independent but non-identical theorem yields P[X=x|\Theta] = \frac{1}{Z^{\prime \prime \prime}(\Theta)} e^{f(\Theta) \cdot \sum_i g_1(x_i)} \prod_i P[X_1=x_i|\Theta^0] … which is the desired form.

Key Takeaways

While this is a minor improvement over the original KPD, the real point is to show how symmetry can remove the exception terms. I expect this to be an especially powerful tool in conjunction with the techniques in Writing Causal Models Like We Write Programs. That post illustrates how to turn recursively-defined functions into recursively-defined causal models. The recursive calls become exactly the sort of symmetries which could potentially be used to remove exception terms in the Causal Model/Bayes Net KPD.

Takeaways Summary

The original Koopman-Pitman-Darmois theorem says that a fixed-dimensional sufficient statistic exists for a stream of IID variables if-and-only-if the distribution is of exponential form. This generalizes a lot. Neither independence nor identical distribution is necessary; we need only a sparse causal model. However, this generalization comes with a cost: some "exception" variables may not follow the exponential form. As long as the system is sparse, and the number of variables is much larger than the dimension of the sufficient statistic, the number of exception variables will be relatively small, and the vast majority of variables in the system will satisfy the exponential form. The proofs should also generalize to approximations—systems with approximate sufficient statistics should be approximately exponential family, except for a few variables. We can also sometimes leverage symmetry to eliminate the exception variables. If the distribution is invariant under some permutation of the variables, then we can try to permute exception variables to non-exception variables, in which case they must follow the exponential form.

Appendix: Additive Summary Equation Theorem

An Additive Summary Equation is a functional equation of the form F(G(x)) = f(x) = \sum_i f_i(x) … with F and G unknown, G of dimension D < dim(f), D < dim(x). In order to say anything nontrivial about the equation, we need each f-term f_i to only depend on a few x-components x_j, and each x-component to only influence a few f-terms. Given some x_j, we can pick out the (indices of) f-terms which it influences via the expression {i: \frac{\partial f_i}{\partial x_j} \not \equiv 0}. We’re also interested in the "neighbors" of x_j, i.e. x-components x_{j’} for which some f_i depends on both x_j and x_{j’}. We’ll write the (indices of) x_j’s neighbors as N(j); note that we do consider x_j a neighbor of itself.

The post on the Additive Summary Equation shows that the equation is solvable only if we can choose a set of at-most D components of x, called x_B, for which f can be expressed as f(x) = \sum_{i: \frac{\partial f_i}{\partial x_{N(B)}} \not \equiv 0} f_i(x) + U\sum_{i’}g_{i’}(x) + C … where the g_{i’}’s have output dimension at-most D, U is a constant dim(f) by D matrix, and C is a constant vector. In particular, we can choose g_{i’}(x) = U^\dagger (f_{i’}(x) - f_{i’}(x^0)), C = \sum_{i’} f_{i’}(x^0) … for some x^0, with i’ ranging over terms not dependent on x_{N(B)}, i.e. {i’: \frac{\partial f_{i’}}{\partial x_{N(B)}} \equiv 0}.

Intuitively, this says that we can’t really say anything about terms dependent on x_{N(B)}. However, x_{N(B)} consists of at-most D variables plus their neighbors, so in a large sparse system, hopefully most terms will not depend on x_{N(B)}. And for all the terms not dependent on x_{N(B)}, the information content can be aggregated by summation—i.e. the sum \sum_{i’}g_{i’}(x).

Note that we may sometimes want to think about f(x) whose output dimension is uncountably infinite—i.e. f(x) returns a distribution over a continuous variable \theta. In that case, the dot product U^\dagger (f_{i’}(x) - f_{i’}(x^0)) defining g_{i’}(x) becomes an integral \int_\theta U^\dagger(\theta) (f_{i’}(x)(\theta) - f_{i’}(x^0)(\theta)) d\theta; everything else remains the same. For simplicity, this post’s notation implicitly assumes that \theta has finitely many values, but the generalization is always straightforward.
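To make the solvable case concrete, here is a toy numerical sketch of my own (with D = 1 and no terms dependent on x_{N(B)}): when every term f_i points along a single fixed direction U, the whole sum is recoverable from the scalar summary \sum_i g_i(x_i), matching the U\sum_{i’}g_{i’}(x) + C part of the conclusion.

```python
import numpy as np

# Toy instance of an Additive Summary Equation (illustrative construction):
# each term f_i(x) = U @ g_i(x_i) + c_i points along a fixed direction U,
# so the sum is recoverable from the D = 1 summary G(x) = sum_i g_i(x_i).

dim_f = 7                        # output dimension of f (think: number of theta values)
n = 12                           # number of x-components
rng = np.random.default_rng(3)

U = rng.normal(size=(dim_f, 1))  # constant dim(f)-by-D matrix, D = 1
c = rng.normal(size=(n, dim_f))  # constant offsets c_i

def g_i(i, x_i):
    return np.array([np.sin(x_i + i)])    # arbitrary 1-D per-component functions

def f(x):
    # f(x) = sum_i f_i(x_i), with f_i(x_i) = U @ g_i(i, x_i) + c_i
    return sum(U @ g_i(i, x[i]) + c[i] for i in range(n))

def G(x):
    return sum(g_i(i, x[i]) for i in range(n))   # the D-dimensional summary

def F(g_val):
    return U @ g_val + c.sum(axis=0)             # reconstructs f from the summary

x = rng.normal(size=n)
assert np.allclose(F(G(x)), f(x))
print("f(x) is recoverable from the 1-dimensional summary G(x)")
```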

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=RYJQeyxWu8GmduBmd

You’ve left out the condition for the theorem to be true that the region with non-zero probability density can’t vary with the parameter. That’s how, for example, the uniform distribution over (0, \theta) can have a scalar sufficient statistic (the maximum data point) despite it not being an exponential family model.

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=gQrGKsZh8kvEj72KX

Good catch, thanks.

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=Sfu2PqeAW9DvWomzX

I’m a bit curious about what job "dimension" is doing here. Given that I can map an arbitrary vector in \mathbb{R}^n to some point in \mathbb{R} via a bijective measurable map (https://en.wikipedia.org/wiki/Standard_Borel_space#Kuratowski’s_theorem), it would seem that the KPD theorem is false. Is there some other notion of "sufficient statistic complexity" hiding behind the idea of dimensionality, or am I missing something?

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=FX8gvXSpc8WjqZcc6

There’s a smoothness assumption. I assumed differentiability, although that could be weakened somewhat. (The assumption is hidden in the sister post to this one, The Additive Summary Equation.)

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=LtPJvTCXsjaogDY7F

Is there a generalization of "sufficient statistic" that applies to summaries which grow as the log (or polynomial of the log) in the size of the data?

Despite the fact that your summary keeps growing, this seems like it might also be a useful definition. In the limit of infinite data, a log(n) summary is vanishingly small relative to the size of the data you have.

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=fzYMsNZ2EroJuMRsq

Good question. The theorems here should apply directly to that situation. The summary will eventually be lower-dimensional than the data, and that’s all we need for these theorems to apply. At any data size n sufficiently large that the O(log n) summary dimension is smaller than the data dimension, the distribution of those n data points must be exponential family.

Comment

I see! To reiterate: for a fixed n, having a smaller summary dimension means that the distribution of those n points is part of the exponential family.

It seems like there are three cases here:

The size of the summary remains constant as the number n of samples grows. The distribution of these samples remains constant. Once enough data is collected, you can estimate the summary statistics and infer the overall distribution.

The size of the summary grows as O(log(n)). The distribution of these samples changes for each n as the size of the summary grows. You converge on the correct distribution in the limit of infinite n. (This seems weird/incorrect, I might be missing something here. I am trying to think of a concrete example.)

Degenerate case: the size of the required "summary" grows as fast as n (or faster), so you are better off just using the data itself as the summary.

Comment

For the second case, each data point can measure something different, possibly correlated with each other, and related in different ways to the parameters we’re trying to estimate. For instance, maybe we’re trying to estimate some parameters of a car, so we measure the wheel sizes, axle length, number of gears, engine cylinder volume, etc, etc. Every now and then we measure something which gives us a totally different "kind" of information from the other things we measured—something which forces a non-exponential-family update. When that happens, we have to add a new summary component. Eventually other data points may measure the same "kind" of information and also contribute to that component of the summary. But over time, it becomes more and more rare to measure something which no other data point has captured before, so we add summary dimensions more and more slowly. (Side note: there’s no reason I know why O(log n) growth would be special here; the qualitative story would be similar for any sub-linear summary growth.)