How to Throw Away Information

https://www.lesswrong.com/posts/DEDcFw6zWfW9nb2YM/how-to-throw-away-information

Contents

Coin-Sum Problem

I flip two coins, and observe both outcomes—either 0 or 1 for each. I want to keep as much information as possible, while throwing away all information from the observation relevant to their sum. How should I "update" my distribution over the outcomes? We’ll write the outcome of our coinflip as B = (B_1, B_2) ("B" for "bit" or "binary" or "Bernoulli"), and our final information-removed distribution as P[B | B, /∑B] - so the notation /∑B indicates that we should throw out all info about the sum (the "/" is meant to evoke e.g. a group quotient operation). Note that this kind of "remove information" operation does not follow the usual rules of Bayesian probability. At first, I reason as follows:
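For concreteness, here is one candidate forgetting map (my own construction for illustration, not necessarily the one the post derives): keep B_1 whenever the two coins differ, and output an independent fair random bit when they agree. A short exact computation confirms that the kept value S is independent of ∑B while still carrying half a bit about B — which matches H(B | ∑B) = 0.5 bits, the most that could possibly be kept:

```python
import itertools
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(A; B) in bits, from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Candidate forgetting map: S = B1 if the coins differ, else an
# independent fair noise bit R (an assumption for illustration).
joint_s_sum = defaultdict(float)  # joint distribution of (S, B1 + B2)
joint_s_b = defaultdict(float)    # joint distribution of (S, (B1, B2))
for b1, b2, r in itertools.product((0, 1), repeat=3):
    p = 1 / 8  # two fair coins and one fair noise bit
    s = b1 if b1 != b2 else r
    joint_s_sum[(s, b1 + b2)] += p
    joint_s_b[(s, (b1, b2))] += p

print(mutual_information(joint_s_sum))  # 0.0: S carries no info about the sum
print(mutual_information(joint_s_b))    # 0.5: S keeps half a bit about B
```

Intuitively: when the coins differ, B_1 is uniform regardless of the sum, so revealing it leaks nothing about ∑B; when they agree, anything kept would reveal the sum, so only noise can be emitted.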

Generalizing

The general form of the problem:

Why Is This Interesting?

A lot of tricky problems in decision theory feel like the agent should choose to throw away information. Chris Leong argues that this is a useful way to think about logical counterfactuals in general, and I’ve heard other people express similar intuitions. The approach in this post offers a way to formalize and quantify "forgetting", in a form potentially useful for these kinds of applications.

Along similar lines: the sort of randomization used in game theory feels different from the sort of "randomness" involved in uncertainty. The former is utilitarian, and used to confuse external opponents. The latter is epistemic, and only relevant to internal beliefs. The randomization involved in throwing away information feels similar to game-theoretic randomness, yet it’s in the sort of informational framework usually used for reasoning about uncertainty—suggesting that these two types of randomness could be handled in a more unified conceptual framework.

For example: suppose agent 1 and agent 2 play a zero-sum game. Agent 2 chooses its action by running a copy of agent 1’s source code, then outputting whatever action beats the copy’s action. What should agent 1 do? One possible approach is for agent 1 to throw away any information about its copy’s action which is contained in its own action—so that the copy’s behavior yields no information about agent 1’s behavior. Randomization then enters the picture naturally, as a consequence of throwing away information. (This doesn’t seem like quite the right decision-making rule in general, but it feels like something similar could work—the main idea is to make "throw away information" an action in the game itself, and push game-theoretic randomization of choices into the info-throw-away action.) One could imagine re-deriving Nash equilibria via agents throwing away information as an action, without any other source of randomness. Whether that can actually work is another open problem.

Comment

https://www.lesswrong.com/posts/DEDcFw6zWfW9nb2YM/how-to-throw-away-information?commentId=hXokztZjKaWdNqSaL

The bound isn’t always achievable if you don’t know Y. Suppose X, Y ∈ {1,2,3} with X ≠ Y, uniform over the 6 possible outcomes. You find out that X=1 (without loss of generality). You must construct a function S such that S(2)=S(3)=1, because as far as you know, Y could be 2 or 3, and you have to be able to reconstruct X from S and Y. But then we just take the most common output of S, and that tells us X=1, so Y ≠ 1. (If you had some other procedure for calculating X from S and Y, then you can make a new function that does that, and call it S; so an arbitrary function from Y to X ∪ {err} is the most general form of S.)
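The exhaustive check behind this argument is small enough to run directly (a sketch under the comment's setup, with `ERR` as my label for the error output): for each value of X, enumerate every function S consistent with reconstructing X from S and Y, and confirm that S's most common output always equals X, so publishing S necessarily reveals that Y ≠ X:

```python
from collections import Counter

ERR = "err"   # hypothetical label for the "can't reconstruct" output
VALUES = (1, 2, 3)

def valid_functions(x):
    """Every S consistent with having observed X = x.

    Reconstructing X from (S, Y) forces S(y) = x for both values y that
    remain possible for Y; S's output at y = x is unconstrained."""
    return [{y: (x if y != x else free) for y in VALUES}
            for free in VALUES + (ERR,)]

for x in VALUES:
    for s in valid_functions(x):
        # x appears at least twice among S's three outputs, so it is the
        # strict majority value: S alone pins down X.
        assert Counter(s.values()).most_common(1)[0][0] == x
print("every admissible S reveals X, hence that Y != X")
```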

Comment

https://www.lesswrong.com/posts/DEDcFw6zWfW9nb2YM/how-to-throw-away-information?commentId=aagujs9oQAE3D4ron

Nice argument—I’m convinced that the bound isn’t always achievable. See my response to cousin_it’s comment for some related questions.

https://www.lesswrong.com/posts/DEDcFw6zWfW9nb2YM/how-to-throw-away-information?commentId=BDtGsCFBp5qBL4mjp

Yeah, it seems that if Y isn’t uniquely determined by X, we can’t reversibly erase Y from X. Let’s say we flip two coins, X = the first, Y = the sum. Since S depends only on X and some randomness, knowing S is equivalent to knowing some distribution over X. If that distribution is non-informative about Y, it must be (1/2, 1/2). But then we can’t reconstruct X from S and Y.
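This argument can be verified by scanning all binary channels from X (a sketch; the parametrization P[S=1 | X=0] = a, P[S=1 | X=1] = b is my own): whenever the resulting S is independent of Y = the sum, it turns out to be independent of X as well, so nothing is kept:

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(A; B) in bits, from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def build(a, b):
    """Joint dists of (S, Y) and (S, X): X and C are fair coins,
    Y = X + C, and S is a noisy binary function of X alone with
    P[S=1 | X=0] = a and P[S=1 | X=1] = b."""
    sy, sx = defaultdict(float), defaultdict(float)
    for x in (0, 1):
        for c in (0, 1):
            p_one = a if x == 0 else b
            for s, p_s in ((1, p_one), (0, 1 - p_one)):
                p = 0.25 * p_s
                sy[(s, x + c)] += p
                sx[(s, x)] += p
    return sy, sx

# Scan a grid of channels: S is uninformative about Y only when a == b,
# in which case it is uninformative about X too.
for i in range(11):
    for j in range(11):
        sy, sx = build(i / 10, j / 10)
        if mutual_information(sy) < 1e-9:
            assert mutual_information(sx) < 1e-9
print("any S independent of Y retains no information about X")
```

The key step is visible in the conditionals: P[S=1 | Y=0] = a and P[S=1 | Y=2] = b, so independence from Y forces a = b, which makes S pure noise with respect to X.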

Comment

https://www.lesswrong.com/posts/DEDcFw6zWfW9nb2YM/how-to-throw-away-information?commentId=b8oEqcvto3rWzTDqq

Nice example—I’m convinced that the bound isn’t always achievable. So the next questions are:

  • Is there a nice way to figure out how much information we can keep, in any particular problem, when Y is not observed?

  • Is there some standard way of constructing S which achieves optimal performance in general when Y is not observed? My guess is that optimal performance would be achieved by throwing away all information contained in X about the distribution P[Y|X]. We always know that distribution just from observing X, and throwing away all info about that distribution should throw out all info about Y, and the minimal map interpretation suggests that that’s the least information we can throw out.

https://www.lesswrong.com/posts/DEDcFw6zWfW9nb2YM/how-to-throw-away-information?commentId=FvKrwt5BngXrQSEqb

I might be misunderstanding something or have made a mistake, and I’m not going to try to figure it out since the post is old and maybe not alive anymore, but isn’t the following a counterexample to the claim that the method of constructing S described above does what it’s supposed to do?

Let X and Y be independent coin flips. Then S will be computed as follows:

X=0, Y=0 maps to the uniform distribution on {{0:0, 1:0}, {0:0, 1:1}}
X=0, Y=1 maps to the uniform distribution on {{0:0, 1:0}, {0:1, 1:0}}
X=1, Y=0 maps to the uniform distribution on {{0:1, 1:0}, {0:1, 1:1}}
X=1, Y=1 maps to the uniform distribution on {{0:0, 1:1}, {0:1, 1:1}}

But since X is independent of Y, we want S to contain full information about X (plus some noise perhaps), so the support of S given X=1 must not overlap with the support of S given X=0. But it does: for example, {0:0, 1:1} has positive probability of occurring both for X=1, Y=1 and for X=0, Y=0. I.e., conditional on S = {0:0, 1:1}, X is still Bernoulli.
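The claimed overlap is easy to verify numerically (a sketch of the commenter's construction, encoding the function {0:a, 1:b} as the tuple (a, b)): sampling S uniformly from the functions f with f(Y) = X, the posterior on X given S = {0:0, 1:1} is indeed still fifty-fifty:

```python
import itertools
from collections import defaultdict

# The described method: given (X, Y), sample S uniformly from the
# functions f: {0,1} -> {0,1} satisfying f(Y) = X.
joint = defaultdict(float)  # joint probability P[S = f, X = x]
for x, y in itertools.product((0, 1), repeat=2):
    consistent = [f for f in itertools.product((0, 1), repeat=2)
                  if f[y] == x]
    for f in consistent:
        joint[(f, x)] += 0.25 / len(consistent)  # X, Y uniform, independent

f = (0, 1)  # the function {0:0, 1:1} from the comment
print(joint[(f, 0)], joint[(f, 1)])  # prints 0.125 0.125: equal masses,
# so conditional on S = f, X is still a fair coin
```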