The Variational Characterization of Expectation

https://www.lesswrong.com/posts/iKXatRoTEKFGoTB5b/the-variational-characterization-of-expectation

Epistemological Status: Fairly confident. This is much closer to the expected minimal map I had in mind ;)

The expectation of a random variable X given Y is the best point estimate of X given Y under squared error loss.

Say we have a random variable X, but are only allowed to summarize the outcomes with a single number e_X. What should we pick so that the expected squared error is minimized?

\min_{e_X} \mathbb{E}[(X - e_X)^2] \iff \partial_{e_X} \mathbb{E}[(X - e_X)^2] = 0 \iff -2\,\mathbb{E}[X - e_X] = 0 \iff \mathbb{E}[X] = e_X

Thus, in this sense, the expectation of a random variable is the best point estimate of its outcomes.
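As a quick numerical check, here is a minimal sketch in plain numpy (the exponential distribution is an arbitrary choice) that scans constant estimates e_X and confirms the squared-error minimizer lands on the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # arbitrary distribution, true mean = 2

# Empirical squared error of each constant point estimate e_X.
candidates = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((x - e) ** 2) for e in candidates])

print(candidates[mse.argmin()])  # ~2.0
print(x.mean())                  # the sample mean, essentially the same number
```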

A major reason a variational characterization is interesting is that it creates a tunnel to optimization theory. The expected minimal map presented here can ‘average out’ irrelevant details, allowing you to filter down to just the relevant ones in a streamlined manner.

When the random variables have binary outcomes we can use the conditional expectation to characterize probability, this time without any reference to information. So there are at least two ways of giving a variational characterization of probability.

Lemma 1 (Optimal Prediction): Let \chi_U \in L_2 be a random variable. The best point representation of the outcome of \chi_U given an observation O is \mathbb{E}[\chi_U | O]. Moreover, the optimal point representation of \chi_U is invariant under the pull-back \eta : O \to \mathbb{E}[\chi_U | O] of the conditioning.

Corollary: When \chi_U indicates a binary answer to a query we have a characterization of probability.
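Here is a minimal sketch of the corollary in plain numpy (the particular probabilities are my own toy numbers): when \chi_U is a 0/1 indicator and the observation O is discrete, minimizing the squared error by brute force, separately for each value of O, recovers the conditional probabilities P(\chi_U = 1 | O):

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete observation O with three levels and a binary indicator chi.
o = rng.choice(3, size=200_000, p=[0.4, 0.4, 0.2])
p_chi1 = np.array([0.8, 0.3, 0.5])                 # P(chi = 1 | O = k)
chi = (rng.random(o.size) < p_chi1[o]).astype(float)

# A function e(O) is just one number per level of O, so the minimization over
# e in L_2 splits into one scalar problem per level; solve each by grid search.
grid = np.linspace(0.0, 1.0, 201)
best = [grid[np.argmin([np.mean((chi[o == k] - g) ** 2) for g in grid])]
        for k in range(3)]

print(best)                                    # ~[0.8, 0.3, 0.5]
print([chi[o == k].mean() for k in range(3)])  # the conditional probabilities
```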

The second claim is a fancy way of saying that our optimal estimate is an optimal way to condition our expectation. It’s somewhat like cheating on a test. You can either read the question Q and then predict A via \mathbb{E}[A | Q], or you can be handed the best prediction \mathbb{E}[A | Q] directly and then predict A via \mathbb{E}[A | \mathbb{E}[A | Q]].
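To see this invariance concretely, here is a sketch with exact arithmetic on a toy joint distribution (my own numbers): two values of Q receive the same prediction \mathbb{E}[A | Q], and conditioning on that prediction instead of on Q returns the prediction unchanged:

```python
import numpy as np

# Toy joint distribution: Q takes three values, A is a binary answer.
p_q = np.array([0.5, 0.3, 0.2])      # P(Q = q)
p_a1_q = np.array([0.9, 0.2, 0.2])   # P(A = 1 | Q = q); q = 1 and q = 2 coincide

eta = p_a1_q                          # eta(q) = E[A | Q = q] = P(A = 1 | Q = q)

# Condition on the value of eta(Q) instead of on Q itself.
for v in np.unique(eta):
    mask = eta == v
    # P(A = 1 | eta(Q) = v), computed exactly by lumping the matching q's.
    p_a1_given_eta = (p_q[mask] * p_a1_q[mask]).sum() / p_q[mask].sum()
    print(v, p_a1_given_eta)          # the two columns agree: E[A | eta(Q)] = E[A | Q]
```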

Suppose that instead of cheating we answer some other question that gives us a hint for the answer. Now we have something like,

Q \to A_1 \to A_2

We have a question, we get a hint, we answer. From the above lemma, the best guess for A_2 is \mathbb{E}[A_2 | A_1], but we don’t know A_1. Our best guess for A_1 is \mathbb{E}[A_1 | Q] so the best we can do is,

\mathbb{E}[A_2 | Q] \approx \mathbb{E}[A_2 | \mathbb{E}[A_1 | Q]]
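The ≈ is genuine. Once the hint A_1 takes more than two values, its conditional mean can throw away information about A_2. Here is a sketch with exact arithmetic on a hand-picked chain Q \to A_1 \to A_2 (my own numbers) in which both values of Q produce the same \mathbb{E}[A_1 | Q], so conditioning on it loses the distinction between them:

```python
import numpy as np

# Q -> A_1 -> A_2 with a three-valued hint A_1, chosen so that both values of Q
# give the same conditional mean E[A_1 | Q] even though the distributions differ.
p_q = np.array([0.5, 0.5])                  # P(Q = q)
p_a1_q = np.array([[0.5, 0.0, 0.5],         # P(A_1 = a1 | Q = 0)
                   [0.0, 1.0, 0.0]])        # P(A_1 = a1 | Q = 1)
p_a2_a1 = np.array([0.1, 0.2, 0.9])         # P(A_2 = 1 | A_1 = a1)

a1_vals = np.array([0.0, 1.0, 2.0])
eta1 = p_a1_q @ a1_vals                     # E[A_1 | Q = q]; here [1.0, 1.0]

exact = p_a1_q @ p_a2_a1                    # E[A_2 | Q = q]
# Conditioning on eta1(Q) lumps together every q with the same eta1 value.
approx = np.array([(p_q[eta1 == v] @ exact[eta1 == v]) / p_q[eta1 == v].sum()
                   for v in eta1])

print(exact)   # [0.5, 0.2]
print(approx)  # [0.35, 0.35] -- an approximation, not an identity
```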

On a multi-part problem we might have something like,

(Q = A_0) \to A_1 \to \ldots \to A_{n-1} \to A_n

So we take the approximations as,

\hat A_k = \mathbb{E}[A_k | \hat A_{k-1}] = \text{argmin}_{e_k \in L_2} \ \mathbb{E}[(A_k - e_k(\hat A_{k-1}))^2]

In practice, we’ll often want or need to restrict the class of functions we know how to optimize over. In such cases extending the Q/A process is the only way to guarantee that the final answer is always close to the optimum. At this point we can ‘forget’ about the intermediates and optimize,

\hat A_n = \text{argmin}_{e_1, \ldots, e_n \in \mathcal{E}} \ \mathbb{E}[(A_n - e_n \circ \ldots \circ e_2 \circ e_1(Q))^2]

If the function class is appropriate, this is, quite literally, the optimization problem associated with training a neural network. Each function then (approximately) averages out irrelevant features of its input to create more relevant features for prediction in its output.
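For concreteness, here is a minimal numpy sketch of that end-to-end problem (my own illustration, not code from the post): two parametric maps, an affine-plus-tanh layer and an affine readout, are composed and the squared error against A_n is minimized by gradient descent, so the learned composition approximates \mathbb{E}[A_n | Q] (here \sin(Q)):

```python
import numpy as np

rng = np.random.default_rng(3)

# Data: Q uniform on [-2, 2], A_n = sin(Q) + noise, so E[A_n | Q] = sin(Q).
q = rng.uniform(-2.0, 2.0, size=2_000)
a = np.sin(q) + 0.1 * rng.standard_normal(q.size)

# Restricted class: e_1 is affine + tanh (R -> R^H), e_2 is affine (R^H -> R).
H = 16
w1 = 0.5 * rng.standard_normal(H); b1 = np.zeros(H)
w2 = 0.5 * rng.standard_normal(H); b2 = 0.0
lr = 0.05

for _ in range(10_000):
    h = np.tanh(q[:, None] * w1 + b1)    # e_1(Q), shape (N, H)
    pred = h @ w2 + b2                   # e_2(e_1(Q)), shape (N,)
    d = 2.0 * (pred - a) / q.size        # gradient of the mean squared error w.r.t. pred

    # Backpropagate the squared-error loss through the composition.
    gw2 = h.T @ d
    gb2 = d.sum()
    gz = (d[:, None] * w2) * (1.0 - h ** 2)
    gw1 = (gz * q[:, None]).sum(axis=0)
    gb1 = gz.sum(axis=0)

    w1 -= lr * gw1; b1 -= lr * gb1
    w2 -= lr * gw2; b2 -= lr * gb2

pred = np.tanh(q[:, None] * w1 + b1) @ w2 + b2
print(np.mean((pred - a) ** 2))          # should approach the noise variance (~0.01)
print(np.mean((pred - np.sin(q)) ** 2))  # should be small: pred ~ E[A_n | Q]
```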

Proof of Lemma 1: The proof works for non-binary random variables; however, the probabilistic interpretation is then lost. Conditional expectation for random variables in L_2 is characterized by,

P(\chi_U = 1 | O) = \mathbb{E}[\chi_U | O] = \text{argmin}_{e_{\chi_U} \in L_2} \mathbb{E}[(\chi_U - e_{\chi_U}(O))^2]

The minimizer exists and is unique by the Hilbert projection theorem. Moreover,

\mathbb{E}[\chi_U | O] = \mathbb{E}[\chi_U | \eta(O)] = \mathbb{E}[\chi_U | \mathbb{E}[\chi_U | O]] \iff \text{argmin}_{e_{\chi_U} \in L_2} \mathbb{E}[(\chi_U - e_{\chi_U}(O))^2] = \text{argmin}_{e_1 \in L_2} \mathbb{E}[(\chi_U - e_1\left( \mathbb{E}[\chi_U | O]\right))^2]

So all that’s left is to verify that the pull-back doesn’t affect the outcome, i.e. that the two minimization problems yield the same estimate. We have,

\text{argmin}_{e_1 \in L_2} \mathbb{E}[(\chi_U - e_1\left( \mathbb{E}[\chi_U | O]\right))^2]
= \text{argmin}_{e_1 \in L_2} \mathbb{E} \left[ \left(\chi_U - e_1\left( \text{argmin}_{e_2 \in L_2} \mathbb{E} [(\chi_U - e_2(O))^2] \right) \right)^2 \right]
= \text{argmin}_{e_1 \in L_2} \mathbb{E} \left[ \left(\chi_U - e_1 \circ e_2 (O) \right)^2 \right]
= \text{argmin}_{e_1, e_2 \in L_2} \mathbb{E} \left[ \left(\chi_U - e_1 \circ e_2 (O) \right)^2 \right]
= \text{argmin}_{e_{\chi_U} \in L_2} \mathbb{E} \left[ \left(\chi_U - e_{\chi_U} (O) \right)^2 \right]

Here e_1 \circ e_2 abbreviates composition with the fixed minimizer e_2 from the line above; the last two equalities hold because any composition e_1 \circ e_2 is itself a function of O, so the joint minimum can do no better than the unrestricted one, while taking e_1 to be the identity already attains it, and uniqueness of the minimizer then forces all the estimates to agree. Therefore, the probability conditioned on the observation is the optimal point estimate, and pulling back the observation to just the point estimate leaves the estimate unchanged. \square