How to Throw Away Information in Causal DAGs

https://www.lesswrong.com/posts/zFGGHGfhYsGNnh7Kp/how-to-throw-away-information-in-causal-dags

When constructing a high-level abstract causal DAG from a low-level DAG, one operation which comes up quite often is throwing away information from a node. This post is about how to do that. First, how do we throw away information from random variables in general? Sparknotes:

Modifying Children

We still have one conceptual question to address: when we replace X_i by f(X_i), how do we modify the child nodes of X_i to use f(X_i) instead?

The first and most important answer is: it doesn’t matter, so long as whatever they do is consistent with f(X_i). For instance, suppose X_i ranges over {-1, 0, 1}, and f(X_i) = X_i^2. When f(X_i) = 1, the children can act as though X_i were −1 or 1, and it doesn’t matter which, so long as they don’t act like X_i = 0. As long as the children’s behavior is consistent with the information in f(X_i), we will be able to support long-range queries.

There is one big catch, however: the children do need to all behave as if X_i had the same value, whatever value they choose. The joint distribution P[X_{ch(i)} | X_{sp(i)}, f(X_i)] (where ch(i) = children of i and sp(i) = spouses of i) must be equal to P[X_{ch(i)} | X_{sp(i)}, X_i^*] for some value X_i^* consistent with f(X_i). The simplest way to achieve this is to pick a particular "representative" value X_i^*(f^*) for each possible value f^* of f(X_i), so that f(X_i^*(f^*)) = f^*.

Example: in the digital circuit case, we would pick one representative "high" voltage (for instance the supply voltage V_{DD}) and one representative "low" voltage (for instance the ground voltage V_{SS}). X_i^*(f(X_i)) would then map any high voltages to V_{DD} and any low voltages to V_{SS}.

Once we have our representative value function X_i^*(f(X_i)), we just have the children use X_i^*(f(X_i)) in place of X_i. If we want, we could even simplify one step further: we could just choose f to spit out representative values directly. That convention is cleaner for proofs and algorithms, but a bit more confusing for human usage and examples.
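As a rough illustration of the representative-value idea, here is a minimal Python sketch built on the {-1, 0, 1} example with f(X_i) = X_i^2. The two child mechanisms (child_a, child_b), the spouse input, and the particular choice of representatives are all hypothetical, made up purely for illustration; the post does not specify them.

```python
DOMAIN = (-1, 0, 1)


def f(x):
    """Coarse-graining from the example: keep x^2, discard the sign."""
    return x * x


# Assumed choice of representatives: one value of X_i per value of f(X_i).
# Picking -1 instead of 1 for f = 1 would work equally well.
REPRESENTATIVE = {0: 0, 1: 1}


def representative(f_value):
    """Return the representative x* for a given value f* of f(X_i)."""
    return REPRESENTATIVE[f_value]


# Hypothetical deterministic child mechanisms, for illustration only.
def child_a(x, spouse):
    return x + spouse


def child_b(x):
    return 2 * x


def children_given_f(f_value, spouse):
    """All children read the same representative value in place of X_i."""
    x_star = representative(f_value)
    return child_a(x_star, spouse), child_b(x_star)


def consistent_joint_outcomes(f_value, spouse):
    """Joint child outcomes reachable from a single x with f(x) == f_value."""
    return {(child_a(x, spouse), child_b(x)) for x in DOMAIN if f(x) == f_value}


# The representatives must be consistent with f: f(x*(f*)) == f*.
assert all(f(representative(f(x))) == f(x) for x in DOMAIN)

# Children sharing one representative give a consistent joint outcome.
assert children_given_f(1, spouse=0) in consistent_joint_outcomes(1, spouse=0)

# The catch: children that pick different values (-1 for one, 1 for the
# other) produce a joint outcome that no single value of X_i could produce.
mixed = (child_a(-1, 0), child_b(1))
assert mixed not in consistent_joint_outcomes(1, spouse=0)
```

The last assertion is the point of the "one big catch": each child individually behaves consistently with f(X_i) = 1, but the pair jointly does not.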

Comment

https://www.lesswrong.com/posts/zFGGHGfhYsGNnh7Kp/how-to-throw-away-information-in-causal-dags?commentId=YSw3TCnMdCSYRo2NE

Instead of saying "f(X) contains all information in X relevant to Y", it would be better to say that f(X) contains all information in X that is relevant to Y if you don’t condition on anything. If you condition on some additional random variable Z, f(X) may no longer contain all relevant information.

Example:

Let X_1, X_2, Z be i.i.d. binary uniform random variables, i.e. each of the variables takes the value 0 with probability 0.5 and the value 1 with probability 0.5. Let X = (X_1, X_2) be a random variable. Let Y = X_1 \oplus X_2 \oplus Z be another random variable, where \oplus is the xor operation. Let f be the function f(X) = f((X_1, X_2)) = X_2.

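Then f(X) contains all information in X that is relevant to Y: since Z is an independent uniform bit, Y is independent of X, so X carries no information about Y in the first place. But if we know the value of Z, then Y is determined by X_1 \oplus X_2, which f(X) = X_2 does not pin down, so f(X) no longer contains all information in X that is relevant to Y.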
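The example can be checked by brute-force enumeration. The sketch below is not from the original comment; the helper names (prob_y_is_1, f_is_sufficient) are made up for illustration. It compares P[Y | X] with P[Y | f(X)], once without conditioning on Z and once with.

```python
from fractions import Fraction
from itertools import product


def f(x):
    """The summary from the example: keep only X_2."""
    return x[1]


# Eight equally likely outcomes, stored as (x, z, y) with x = (x1, x2)
# and y = x1 xor x2 xor z.
OUTCOMES = [((x1, x2), z, x1 ^ x2 ^ z) for x1, x2, z in product((0, 1), repeat=3)]


def prob_y_is_1(condition):
    """P[Y = 1 | condition], where condition is a predicate on (x, z)."""
    matching = [y for x, z, y in OUTCOMES if condition(x, z)]
    return Fraction(sum(matching), len(matching))


def f_is_sufficient(given_z):
    """Check P[Y | X] == P[Y | f(X)], optionally also conditioning on Z.

    Y is binary, so comparing P[Y = 1 | ...] compares the whole distribution.
    """
    for x0, z0, _ in OUTCOMES:
        full = prob_y_is_1(
            lambda x, z: x == x0 and (not given_z or z == z0)
        )
        coarse = prob_y_is_1(
            lambda x, z: f(x) == f(x0) and (not given_z or z == z0)
        )
        if full != coarse:
            return False
    return True


print(f_is_sufficient(given_z=False))  # f(X) suffices when Z is unknown
print(f_is_sufficient(given_z=True))   # f(X) is not enough once Z is known
```

Without conditioning on Z, both conditionals are uniform, so the check passes; once Z is known, P[Y | X, Z] is deterministic while P[Y | f(X), Z] is still uniform, so it fails.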

Comment

https://www.lesswrong.com/posts/zFGGHGfhYsGNnh7Kp/how-to-throw-away-information-in-causal-dags?commentId=3Sw7gozRNt46Xwrty

Good point, thanks.