Generalizing Koopman-Pitman-Darmois

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois

The Theorems

We’ll cover three generalized KPD theorems, each with different assumptions on the structure of P[X|\Theta]: independent but non-identical variables, variables which factor according to a causal model/Bayes net, and IID variables.

Independent But Non-Identical KPD

The simplest starting point for these theorems is parameter estimation from independent-but-not-identically-distributed variables. (If you want a concrete image in mind, picture scientists trying to estimate the values of some physical constants using a bunch of different experiments.)

Theorem

Let (X_1, …, X_n) be continuous random variables, conditionally independent given \Theta (i.e. P[X|\Theta] = \prod_i P[X_i|\Theta]). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to \Theta only if the distribution has the form P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_{i \not \in E} g_i(X_i)} \prod_{i \not \in E} P[X_i|\Theta^0] \prod_{i \in E} P[X_i|\Theta] … where E is a set of at most D "exception" indices, f(\Theta) and the g_i are at-most-D-dimensional functions, \Theta^0 is an arbitrary fixed reference value of the parameters, and Z(\Theta) is a normalizer. (Throughout, we assume the relevant densities are differentiable and that the region of nonzero probability density does not vary with \Theta; see the comments below.)

Proof

The first key idea is the Minimal Map Theorem (also proved more elegantly here): if we want to summarize all the information in X which is relevant to \Theta, then the posterior distribution (\theta \mapsto P[\Theta = \theta|X]) is a sufficient statistic. It’s also "minimal" in the sense that the posterior distribution can be calculated from any other sufficient statistic, so any other sufficient statistic must contain at least as much information (this can be proven via the Data Processing Inequality). So, given our hypothetical fixed-dimensional summary statistic G(X) (where X is a vector of all the data points), we can write F(G(X)) = (\theta \mapsto P[\Theta = \theta|X]) … for some function F. The right-hand side is a data structure containing the values of P[\Theta = \theta|X] for EVERY \theta-value; we can think of it as a vector indexed by \theta. (In the case where \theta has m possible values \theta^1, …, \theta^m, any map (\theta \mapsto f(\theta)) with real-valued f can be represented as a length-m vector v in which v_i is f(\theta^i).)

Now, the key property underlying (this version of) the KPD is conditional independence of X, i.e. P[X|\theta] = \prod_i P[X_i|\theta]. We want to leverage that factorization, so we’ll transform the distribution into a representation we can factor. First, transform the distribution to a log-odds representation: logOdds(F(G(X))) = (\theta \mapsto ln\frac{P[\Theta = \theta|X]}{P[\Theta = \theta^0|X]}) … for some arbitrary reference parameters \theta^0. Next, we apply Bayes’ Rule: = (\theta \mapsto ln\frac{P[X|\Theta = \theta]}{P[X|\Theta = \theta^0]} + ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]}) … then subtract off the prior ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]} from both sides, since it’s not a function of X: logOdds(F(G(X))) - (\theta \mapsto ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]}) = (\theta \mapsto ln\frac{P[X|\Theta = \theta]}{P[X|\Theta = \theta^0]}) To clean up the notation, we’ll define F’(G) := logOdds(F(G)) - (\theta \mapsto ln \frac{P[\Theta = \theta]}{P[\Theta = \theta^0]}).

Finally, we can apply the factorization (remember, that was the point of all these transformations) and get F’(G(X)) = \sum_i (\theta \mapsto ln\frac{P[X_i|\Theta = \theta]}{P[X_i|\Theta = \theta^0]}) The right-hand side of this expression is a sum of vectors indexed by \theta. Each vector is a function of just one of the X_i, and the left-hand side says that we can express the sum in terms of a D-dimensional summary G(X). So, this is an Additive Summary Equation (see appendix below for details). In fact, it’s a particularly simple Additive Summary Equation: each f_i depends only on x_i, and each x_i has no neighbors besides itself.

The Additive Summary Equation Theorem then tells us that the equation is solvable (i.e. a D-dimensional summary statistic exists) only if ln\frac{P[X|\Theta = \theta]}{P[X|\Theta = \theta^0]} = \sum_{i \in B} (ln\frac{P[X_i|\Theta = \theta]}{P[X_i|\Theta = \theta^0]}) + U(\theta) \sum_{i’ \not \in B}g_{i’}(x_{i’}) + C(\theta) … for some B consisting of at-most D X-indices, and some U(\theta) of dimension at-most D. Exponentiating both sides and simplifying a bit, this condition becomes P[X|\Theta = \theta] = \frac{1}{Z(\theta)} e^{U(\theta) \sum_{i’ \not \in B} g_{i’}(x_{i’})} \prod_{i \in B} P[X_i|\Theta = \theta] \prod_{i’ \not \in B} P[X_{i’}|\Theta = \theta^0] … which is the desired form. (Here B is exactly the exception set E from the theorem statement; since each X_i is its own only neighbor, N(B) = B.)
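As a sanity check on the log-odds manipulation above, here is a small numerical sketch (my own, not from the post), assuming a discrete grid of \theta values and some arbitrary per-point Gaussian-style likelihoods: the posterior log-odds vector, minus the prior log-odds, equals the sum of per-data-point log-likelihood-ratio vectors.

```python
import numpy as np

# Minimal numerical sketch of the log-odds decomposition used in the proof.
# theta is restricted to a small discrete grid; the data points are
# conditionally independent given theta but not identically distributed.

rng = np.random.default_rng(0)
thetas = np.linspace(-2.0, 2.0, 5)            # grid of parameter values; index 0 plays the role of theta^0
prior = np.full(len(thetas), 1 / len(thetas))

n = 10
x = rng.normal(size=n)                        # observed data (the identity holds for any data)

def log_lik_i(i, x_i, theta):
    # Each data point has its own (illustrative) relationship to theta.
    return -0.5 * (x_i - (i + 1) * 0.1 * theta) ** 2

# Per-point log-likelihood-ratio vectors, indexed by theta:
#   (theta -> ln P[X_i | theta] / P[X_i | theta^0])
per_point = np.array([[log_lik_i(i, x[i], t) - log_lik_i(i, x[i], thetas[0])
                       for t in thetas] for i in range(n)])

# Unnormalized log posterior over the theta grid, via Bayes' rule.
log_post = np.log(prior) + per_point.sum(axis=0)
log_odds = log_post - log_post[0]             # posterior log-odds relative to theta^0
prior_log_odds = np.log(prior) - np.log(prior[0])

# The identity F'(G(X)) = sum_i (theta -> ln P[X_i|theta]/P[X_i|theta^0]):
assert np.allclose(log_odds - prior_log_odds, per_point.sum(axis=0))
print("posterior log-odds minus prior log-odds equals the sum of per-point vectors")
```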

Key Takeaways

The main takeaway from the theorem is the idea of "exception" variables, which circumvent the exponential family form, and the idea that the number of exception variables is limited by the dimensionality of the sufficient statistic. The main takeaways from the proof are that the posterior distribution (the "minimal map") is always a sufficient statistic, and that a log-odds transformation turns conditional independence of X into an Additive Summary Equation, which does the real work.
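To make the "exception variable" idea concrete, here is a hypothetical example of my own (not from the post): one data point is Cauchy-distributed around \theta, which is not exponential family, while the rest are Gaussian around \theta. A low-dimensional sufficient statistic still exists, but it must carry the exception variable separately, consistent with the theorem's form with E = {1} and D = 2.

```python
import numpy as np

# Illustrative sketch of an "exception" variable (my example, not the post's).
# X_1 | theta is Cauchy(theta) -- not exponential family -- while X_2..X_n are
# N(theta, 1). A sufficient statistic still exists, but it must carry X_1
# separately: G(x) = (x_1, sum_{i>=2} x_i), so D = 2 and the exception set is {1}.

rng = np.random.default_rng(4)
n = 15
x = np.concatenate([rng.standard_cauchy(size=1), rng.normal(size=n - 1)])

def log_lik(x, theta):
    # Log density up to constants that cancel in likelihood ratios.
    cauchy = -np.log(np.pi * (1.0 + (x[0] - theta) ** 2))   # exception variable
    gauss = -0.5 * np.sum((x[1:] - theta) ** 2)             # exponential-family part
    return cauchy + gauss

def G(x):
    return (x[0], x[1:].sum())

# Two datasets with the same G(x) give the same likelihood ratio for every theta.
y = x.copy()
y[1] += 1.0
y[2] -= 1.0                      # same sum over the non-exception variables
assert G(x)[0] == G(y)[0] and np.isclose(G(x)[1], G(y)[1])

for theta in [-1.5, 0.2, 3.0]:
    lr_x = log_lik(x, theta) - log_lik(x, 0.0)
    lr_y = log_lik(y, theta) - log_lik(y, 0.0)
    assert np.isclose(lr_x, lr_y)
print("likelihood ratios depend on the data only through G(x) = (x_1, sum of the rest)")
```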

Causal Model/Bayes Net KPD

Now we address a much more general case: factorization according to a Bayes Net/Causal Model. We still imagine estimating some parameters \Theta, but our data is no longer conditionally independent: conditional on the parameters, data points can now interact directly. However, the direct interactions are sparse—i.e. most data points interact directly with only a few other data points (though they may indirectly interact with everything). This would be the case if, for example, we are trying to estimate parameters of some complicated physical system in real-time from a single instance of the system.

Theorem

Let (X_1, …, X_n) be continuous random variables, whose distribution conditional on \Theta forms a Bayes Net on a DAG (i.e. P[X|\Theta] = \prod_i P[X_i|X_{pa(i)}, \Theta], where pa(i) denotes the nodes with edges directly into i in the DAG). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to \Theta only if the distribution has the form P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_{i \not \in E} g_i(x_i, x_{pa(i)}) } \prod_{i \not \in E} P[X_i|X_{pa(i)}, \Theta=\theta^0] \prod_{i \in E} P[X_i|X_{pa(i)}, \Theta=\theta] … where E = N(B) \cup ch(N(B)) for some index set B with |B| \leq D (so the "exception" variables form a neighborhood around at most D variables; N denotes the Markov blanket and ch denotes children in the DAG), f(\Theta) and the g_i are at-most-D-dimensional functions, \theta^0 is an arbitrary fixed reference value of the parameters, and Z(\Theta) is a normalizer.

Proof

This follows broadly the same structure as the proof for the Independent But Non-Identical KPD, so we’ll speed through the parts which are the same and just focus on differences. As in that proof, we apply the Minimal Map Theorem, then apply a log odds transformation, and factor the distribution. The factorization is now P[X|\Theta] = \prod_i P[X_i|X_{pa(i)}, \Theta], so the summary equation is F’(G(X)) = (\theta \mapsto ln\frac{P[X|\Theta=\theta]}{P[X|\Theta = \theta^0]}) = \sum_i (\theta \mapsto ln\frac{P[X_i|X_{pa(i)}, \Theta = \theta]}{P[X_i|X_{pa(i)}, \Theta = \theta^0]})

The neighbors N(j) of a variable X_j are no longer trivial; they are exactly the Markov blanket of X_j (i.e. parents, children, and other parents of X_j’s children in the Bayes net). The terms which depend on X_{N(j)} are the factors corresponding to neighbors plus children-of-neighbors, i.e. N(j) \cup ch(N(j)).

The Additive Summary Equation Theorem then says that a D-dimensional summary exists only if ln\frac{P[X|\Theta=\theta]}{P[X|\Theta = \theta^0]} = \sum_{i \in E} (ln\frac{P[X_i|X_{pa(i)}, \Theta = \theta]}{P[X_i|X_{pa(i)}, \Theta = \theta^0]}) + U(\theta) \sum_{i’ \not \in E} g_{i’}(x_{i’}, x_{pa(i’)}) + C(\theta) … where E = N(B) \cup ch(N(B)), and |B| is at-most D.

Exponentiating and simplifying a bit yields: P[X|\Theta] = \frac{1}{Z(\Theta)} e^{U(\Theta) \sum_{i’ \not \in E} g_{i’}(x_{i’}, x_{pa(i’)}) } \prod_{i \not \in E} P[X_i|X_{pa(i)}, \Theta^0] \prod_{i \in E} P[X_i|X_{pa(i)}, \Theta] … which is the desired form.
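For a concrete instance of the Bayes-net form, here is a sketch of my own (an assumption-laden example, not from the post): a Gaussian Markov chain X_i = \Theta X_{i-1} + noise. Its log-likelihood ratio against \theta^0 = 0 depends on the data only through a 2-dimensional statistic built from per-factor terms g_i(x_i, x_{pa(i)}) = (x_i x_{i-1}, x_{i-1}^2), i.e. f(\Theta) = (\Theta, -\Theta^2/2) with no exception variables.

```python
import numpy as np

# Illustrative check (my example): a Gaussian Markov chain X_i = theta * X_{i-1} + noise
# has a low-dimensional sufficient statistic built from per-factor terms
# g_i(x_i, x_pa(i)), matching the Bayes-net exponential form.

def log_lik_ratio(x, theta, theta0=0.0):
    # ln P[x|theta] / P[x|theta0] for the chain with unit-variance noise;
    # X_1 ~ N(0, 1) independent of theta, so its factor cancels.
    terms = -0.5 * (x[1:] - theta * x[:-1]) ** 2
    terms0 = -0.5 * (x[1:] - theta0 * x[:-1]) ** 2
    return np.sum(terms - terms0)

def summary(x):
    # 2-dimensional statistic G(x) = (sum x_i x_{i-1}, sum x_{i-1}^2).
    return np.array([np.sum(x[1:] * x[:-1]), np.sum(x[:-1] ** 2)])

rng = np.random.default_rng(1)
x = rng.normal(size=50)

# Verify the algebraic identity (with theta0 = 0):
#   ln P[x|theta]/P[x|theta0] = theta * G_1(x) - 0.5 * theta^2 * G_2(x)
for theta in [0.3, -0.7, 1.2]:
    g = summary(x)
    assert np.isclose(log_lik_ratio(x, theta), theta * g[0] - 0.5 * theta ** 2 * g[1])
print("log-likelihood ratio depends on x only through the 2D summary")
```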

Key Takeaways

The main takeaway here is the idea that exceptional variables occur in neighborhoods, so sparsity is a necessary condition for the theorem to say anything nontrivial. On the other hand, causal models are extremely general, sparsity is ubiquitous in the physical sciences, and a sparse causal model is all we need to apply the theorem. We should thus expect exponential family distributions to show up in most places where low-dimensional summaries exist, in practice. In particular, we should expect exponential family forms for most abstractions, in practice. Another important point: the methods used in the proof are "smooth", in the sense that small violations of the assumptions should produce only small violations of the conclusions. In other words, we should also expect distributions with approximate low-dimensional summaries, which approximately factor as a sparse Bayes Net, to approximately fit the exponential form.

IID KPD

When our variables have some symmetry, we can sometimes leverage it to eliminate the annoying "exception" variables. The basic requirement is that our distribution P[X|\Theta] is invariant under permuting some of the variables. For instance, if we have a time-symmetric Markov Chain, then we can replace every X_t with X_{t+1}, and the distribution remains the same. In the case of IID variables, we can swap any two variables X_i and X_j, and the distribution remains the same, i.e. P[X_i = x_i, X_j = x_j, …| \Theta] = P[X_j = x_i, X_i = x_j, … | \Theta]. If we can swap exception variables with non-exception variables, then we can sometimes eliminate the exceptions altogether.

Theorem

Let (X_1, …, X_n) be continuous random variables, conditionally independent and identically distributed given \Theta (i.e. P[X=x|\Theta] = \prod_i P[X_1=x_i|\Theta]). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to \Theta, with D < n, if-and-only-if the distribution has the form P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_i g(X_i)} \prod_i P[X_i|\Theta^0] … where f, g are some at-most D-dimensional summary functions.

Other than eliminating the exception terms, there are two differences from the independent-but-not-identically-distributed theorem: the g_i’s are all the same function g, and we explicitly require D < n. When D \geq n in the earlier theorem, all variables can be exceptions, so the theorem says nothing. For this theorem, we need at least one non-exceptional variable to swap the other variables with, so we require D < n.

Note that this theorem is strictly stronger than the original KPD as typically stated, which also applies to the IID case. The original KPD talks about existence of a finite-dimensional summary for an infinite stream of IID variables, whereas this version applies even to a finite set of variables, so long as the dimension of the sufficient statistic is smaller than the number of variables.
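As a concrete instance of the "if" direction (a standard textbook example, not specific to this post): IID Gaussian data with unknown mean and known variance fits the exponential form with f(\theta) = \theta, g(x) = x, \theta^0 = 0, so the one-dimensional statistic \sum_i x_i is sufficient whenever n > 1. A quick numerical check:

```python
import numpy as np

# Illustrative sketch: X_i IID N(theta, 1). The theorem's exponential form holds
# with f(theta) = theta, g(x) = x, theta^0 = 0, since
#   ln P[x|theta] - ln P[x|0] = theta * x - theta^2 / 2.

rng = np.random.default_rng(2)
x_a = rng.normal(size=20)
x_b = x_a.copy()
x_b[0] += 0.5
x_b[1] -= 0.5                      # same sum(x), different individual points

def log_lik(x, theta):
    # Log density of N(theta, 1), up to a constant that cancels in ratios.
    return -0.5 * np.sum((x - theta) ** 2)

for theta in [-1.0, 0.4, 2.3]:
    ratio_a = log_lik(x_a, theta) - log_lik(x_a, 0.0)
    ratio_b = log_lik(x_b, theta) - log_lik(x_b, 0.0)
    assert np.isclose(ratio_a, ratio_b)   # data enters only through sum(x)
print("the 1-dimensional statistic sum(x) is sufficient for theta")
```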

Proof

Since this is a special case of independent variables, we can start from our independent but non-identical theorem: P[X|\Theta] = \frac{1}{Z(\Theta)} e^{f(\Theta) \cdot \sum_{i \not \in E} g_i(X_i)} \prod_{i \not \in E} P[X_i|\Theta^0] \prod_{i \in E} P[X_i|\Theta]

Since we assume D < n, there must be at least one non-exception variable. So, without loss of generality, assume X_1 \not \in E. For any exception variable i \in E, we can swap X_i with X_1 and integrate out all the other variables to find \frac{1}{Z’(\Theta)} e^{f(\Theta) \cdot g_1(x_1)} P[X_1=x_1|\Theta^0] P[X_i=x_i|\Theta] = P[X_1 = x_1, X_i = x_i |\Theta] = P[X_1 = x_i, X_i = x_1 |\Theta] = \frac{1}{Z’(\Theta)} e^{f(\Theta) \cdot g_1(x_i)} P[X_1=x_i|\Theta^0] P[X_i=x_1|\Theta] … then integrate out x_1: P[X_i=x_i|\Theta] = \frac{1}{Z^{\prime\prime}(\Theta)} e^{f(\Theta) \cdot g_1(x_i)} P[X_1=x_i|\Theta^0] (Minor note: we’re absorbing constants into Z(\Theta) from each integral, which is why it keeps gaining primes.)

For variables i \not \in E, we can also swap with X_1 in order to replace g_i with g_1. Swapping and integrating as before yields e^{f(\Theta) \cdot g_1(x_i)} P[X_1=x_i|\Theta^0] = e^{f(\Theta) \cdot g_i(x_i)} P[X_i=x_i|\Theta^0]

Substituting both of those into the original expression from the independent but non-identical theorem yields P[X=x|\Theta] = \frac{1}{Z^{\prime \prime \prime}(\Theta)} e^{f(\Theta) \cdot \sum_i g_1(x_i)} \prod_i P[X_1=x_i|\Theta^0] … which is the desired form.

Key Takeaways

While this is a minor improvement over the original KPD, the real point is to show how symmetry can remove the exception terms. I expect this to be an especially powerful tool in conjunction with the techniques in Writing Causal Models Like We Write Programs. That post illustrates how to turn recursively-defined functions into recursively-defined causal models. The recursive calls become exactly the sort of symmetries which could potentially be used to remove exception terms in the Causal Model/Bayes Net KPD.

Takeaways Summary

The original Koopman-Pitman-Darmois theorem says that a fixed-dimensional sufficient statistic exists for a stream of IID variables if-and-only-if the distribution is of exponential form. This generalizes a lot. Neither independence nor identical distribution is necessary; we need only a sparse causal model. However, this generalization comes with a cost: some "exception" variables may not follow the exponential form. As long as the system is sparse, and the number of variables is much larger than the dimension of the sufficient statistic, the number of exception variables will be relatively small, and the vast majority of variables in the system will satisfy the exponential form. The proofs should also generalize to approximations—systems with approximate sufficient statistics should be approximately exponential family, except for a few variables. We can also sometimes leverage symmetry to eliminate the exception variables. If the distribution is invariant under some permutation of the variables, then we can try to permute exception variables to non-exception variables, in which case they must follow the exponential form.

Appendix: Additive Summary Equation Theorem

An Additive Summary Equation is a functional equation of the form F(G(x)) = f(x) = \sum_i f_i(x) … with F and G unknown, G of dimension D < dim(f), D < dim(x). In order to say anything nontrivial about the equation, we need each f-term f_i to only depend on a few x-components x_j, and each x-component to only influence a few f-terms. Given some x_j, we can pick out the (indices of) f-terms which it influences via the expression {i: \frac{\partial f_i}{\partial x_j} \not \equiv 0}. We’re also interested in the "neighbors" of x_j, i.e. x-components x_{j’} for which some f_i depends on both x_j and x_{j’}. We’ll write the (indices of) x_j’s neighbors as N(j); note that we do consider x_j a neighbor of itself.

The post on the Additive Summary Equation shows that the equation is solvable only if we can choose a set of at-most D components of x, called x_B, for which f can be expressed as f(x) = \sum_{i: \frac{\partial f_i}{\partial x_{N(B)}} \not \equiv 0} f_i(x) + U\sum_{i’}g_{i’}(x) + C … where the g_{i’}’s have output dimension at-most D, U is a constant dim(f) by D matrix, and C is a constant vector. In particular, we can choose g_{i’}(x) = U^\dagger (f_{i’}(x) - f_{i’}(x^0)), C = \sum_{i’} f_{i’}(x^0) … for some x^0, with i’ ranging over terms not dependent on x_{N(B)}, i.e. {i’: \frac{\partial f_{i’}}{\partial x_{N(B)}} \equiv 0}.

Intuitively, this says that we can’t really say anything about terms dependent on x_{N(B)}. However, x_{N(B)} consists of at-most D variables plus their neighbors, so in a large sparse system, hopefully most terms will not depend on x_{N(B)}. And for all the terms not dependent on x_{N(B)}, the information content can be aggregated by summation—i.e. the sum \sum_{i’}g_{i’}(x).

Note that we may sometimes want to think about f(x) whose output dimension is uncountably infinite—i.e. f(x) returns a distribution over a continuous variable \theta. In that case, the dot product U^\dagger (f_{i’}(x) - f_{i’}(x^0)) defining g_{i’}(x) becomes an integral \int_\theta U^\dagger(\theta) (f_{i’}(x)(\theta) - f_{i’}(x^0)(\theta)) d\theta; everything else remains the same. For simplicity, this post’s notation implicitly assumes that \theta has finitely many values, but the generalization is always straightforward.
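To make the solvable case concrete, here is a toy numerical sketch of my own (with D = 1 and no terms dependent on x_{N(B)}): when every term f_i points along a single fixed direction U, the whole sum is recoverable from the scalar summary \sum_i g_i(x_i), matching the U\sum_{i’}g_{i’}(x) + C part of the conclusion.

```python
import numpy as np

# Toy instance of an Additive Summary Equation (illustrative construction):
# each term f_i(x) = U @ g_i(x_i) + c_i points along a fixed direction U,
# so the sum is recoverable from the D = 1 summary G(x) = sum_i g_i(x_i).

dim_f = 7                        # output dimension of f (think: number of theta values)
n = 12                           # number of x-components
rng = np.random.default_rng(3)

U = rng.normal(size=(dim_f, 1))  # constant dim(f)-by-D matrix, D = 1
c = rng.normal(size=(n, dim_f))  # constant offsets c_i

def g_i(i, x_i):
    return np.array([np.sin(x_i + i)])    # arbitrary 1-D per-component functions

def f(x):
    # f(x) = sum_i f_i(x_i), with f_i(x_i) = U @ g_i(i, x_i) + c_i
    return sum(U @ g_i(i, x[i]) + c[i] for i in range(n))

def G(x):
    return sum(g_i(i, x[i]) for i in range(n))   # the D-dimensional summary

def F(g_val):
    return U @ g_val + c.sum(axis=0)             # reconstructs f from the summary

x = rng.normal(size=n)
assert np.allclose(F(G(x)), f(x))
print("f(x) is recoverable from the 1-dimensional summary G(x)")
```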

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=RYJQeyxWu8GmduBmd

You’ve left out the condition for the theorem to be true that the region with non-zero probability density can’t vary with the parameter. That’s how, for example, the uniform distribution over (0, \theta) can have a scalar sufficient statistic (the maximum data point) despite it not being an exponential family model.

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=gQrGKsZh8kvEj72KX

Good catch, thanks.

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=Sfu2PqeAW9DvWomzX

I’m a bit curious about what job "dimension" is doing here. Given that I can map an arbitrary vector in \mathbb{R}^n to some point in \mathbb{R} via a bijective measurable map (https://en.wikipedia.org/wiki/Standard_Borel_space#Kuratowski’s_theorem), it would seem that the KPD theorem is false. Is there some other notion of "sufficient statistic complexity" hiding behind the idea of dimensionality, or am I missing something?

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=FX8gvXSpc8WjqZcc6

There’s a smoothness assumption. I assumed differentiability, although that could be weakened somewhat. (The assumption is hidden in the sister post to this one, The Additive Summary Equation.)

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=LtPJvTCXsjaogDY7F

Is there a generalization of "sufficient statistic" that applies to summaries which grow as the log (or polynomial of the log) in the size of the data?

Despite the fact that your summary keeps growing, this seems like it might also be a useful definition. In the limit of infinite data, a log(n) summary is vanishingly small relative to the size of the data you have.

Comment

https://www.lesswrong.com/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois?commentId=fzYMsNZ2EroJuMRsq

Good question. The theorems here should apply directly to that situation. The summary will eventually be lower-dimensional than the data, and that’s all we need for these theorems to apply. At any data size n sufficiently large that the O(log n) summary dimension is smaller than the data dimension, the distribution of those n data points must be exponential family.

Comment

I see! To reiterate: for a fixed n, having a smaller summary dimension means that the distribution of those n points is part of the exponential family.

It seems like there are three cases here:

The size of the summary remains constant as the number n of samples grows. The distribution of these samples remains constant. Once enough data is collected, you can estimate the summary statistics and infer the overall distribution.

The size of the summary grows as O(log(n)). The distribution of these samples changes for each n as the size of the summary grows. You converge on the correct distribution in the limit of infinite n. (This seems weird/incorrect, I might be missing something here. I am trying to think of a concrete example.)

Degenerate case: the size of the required "summary" grows as fast as n (or faster), so you are better off just using the data itself as the summary.

Comment

For the second case, each data point can measure something different, possibly correlated with each other, and related in different ways to the parameters we’re trying to estimate. For instance, maybe we’re trying to estimate some parameters of a car, so we measure the wheel sizes, axle length, number of gears, engine cylinder volume, etc, etc. Every now and then we measure something which gives us a totally different "kind" of information from the other things we measured—something which forces a non-exponential-family update. When that happens, we have to add a new summary component. Eventually other data points may measure the same "kind" of information and also contribute to that component of the summary. But over time, it becomes more and more rare to measure something which no other data point has captured before, so we add summary dimensions more and more slowly. (Side note: there’s no reason I know why O(log n) growth would be special here; the qualitative story would be similar for any sub-linear summary growth.)