Minimal Map Constraints

https://www.lesswrong.com/posts/q42kYLo8ayHrJxdyS/minimal-map-constraints

TL;DR: I verify that the minimal map lemma can be used as a criterion for functional optimization. Specifically, I derive a broad class of consistent loss functions according to the criterion. I also give an application of the result to determining how to train a neural representation.

Say we receive data (x,y) \sim X \times \lbrace -1, 1 \rbrace where the x are observations and y are binary class labels. We wish to learn a map h \in \mathcal{H} that properly classifies the data. By the minimal map lemma this is informationally equivalent to learning the map \zeta \circ p(y | x) where \zeta is an arbitrary isomorphism. We can make a stylistic choice on the model class by imposing the constraint,

\zeta^{-1}(h) + \zeta^{-1}(-h) = 1
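
For instance (a minimal numerical sketch; the logistic link used here is one convenient choice of \zeta^{-1}, not a requirement), the sigmoid satisfies this symmetry constraint:

    import numpy as np

    def zeta_inv(h):
        # Logistic link: maps a score h to a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-h))

    h = np.linspace(-5.0, 5.0, 11)
    # The constraint zeta^{-1}(h) + zeta^{-1}(-h) = 1 holds pointwise.
    assert np.allclose(zeta_inv(h) + zeta_inv(-h), 1.0)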

Consider the term y \cdot h(x). If the prediction is correct then y \cdot h(x) > 0; otherwise y \cdot h(x) < 0. However, we’re either right or wrong with probability one. Therefore, if \zeta^{-1}(y \cdot h(x)) > 1/2 then the label and the classifier most likely agree. In other words, we should choose the y that maximizes the probability of agreement.

In practice, this means we calculate,

\iota(y \cdot h(x)) = \frac{1}{2}(1-\text{sign}(y \cdot h(x)))

to collapse probabilities into binary decisions. This is a threshold that returns one when the labeled class is the less likely one under the classifier and zero otherwise. This is effectively the zero-one loss.
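
Concretely (a small sketch; the score value is hypothetical):

    import numpy as np

    def iota(margin):
        # Zero-one indicator: 1 if the label and the classifier disagree, 0 if they agree.
        return 0.5 * (1.0 - np.sign(margin))

    h_x = 2.3                  # hypothetical classifier score h(x)
    print(iota(1 * h_x))       # 0.0 -- label y = 1 agrees with sign(h(x))
    print(iota(-1 * h_x))      # 1.0 -- label y = -1 disagrees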

Theorem: Consider the population risk,

R_{\text{0-1}}(h) = \int_{X \times Y} \iota(h) p(1 | x) + \iota(-h) (1 - p(1 | x)) \ d\mu(x)

The optimal decision rule is given by \text{sign}(h) = \text{sign}(p(1 | x) - 1/2).

Proof: At each x the integrand is \iota(h) p(1 | x) + \iota(-h) (1 - p(1 | x)), so choosing \text{sign}(h) = 1 incurs 1 - p(1 | x) while choosing \text{sign}(h) = -1 incurs p(1 | x). The risk is therefore minimized pointwise by setting \text{sign}(h) = 1 exactly when p(1 | x) > 1 - p(1 | x), which implies that p(1 | x) > 1/2. Thus, the optimal decision rule is given as stated.
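
For a single x this is easy to check directly (a tiny sketch; the value of p(1 | x) is hypothetical):

    p = 0.8                               # hypothetical p(1 | x)
    risk_plus, risk_minus = 1 - p, p      # conditional 0-1 risk of predicting +1 vs -1
    best = 1 if risk_plus < risk_minus else -1
    assert best == (1 if p > 0.5 else -1) # matches sign(p(1 | x) - 1/2)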

As stated before, if \zeta^{-1}(y \cdot h(x)) > 1/2 then we should take that as our classification. The trick now is to replace \iota \to \phi with the most general class of loss functions such that the optimal classifier remains unchanged. So we’ll consider,

R_{\phi}(h) = \int_{X \times Y} \phi \circ \zeta^{-1}(h) \eta(x) + \phi \circ \zeta^{-1}(-h) (1 - \eta(x)) \ d\mu(x)

where we’ve introduced a twice differentiable loss function \phi: \mathbb{R} \to \mathbb{R} and abbreviated the conditional probability p(1 | x) as \eta(x).
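
Since \zeta^{-1}(y \cdot h(x)) equals \zeta^{-1}(h(x)) for y = 1 and \zeta^{-1}(-h(x)) for y = -1, this population risk is just the expectation of \phi \circ \zeta^{-1}(y \cdot h(x)) over the data distribution. An empirical estimate then looks like the following (a minimal sketch; the logistic link and the toy linear scorer are hypothetical stand-ins):

    import numpy as np

    def zeta_inv(h):
        # Illustrative logistic link satisfying zeta^{-1}(h) + zeta^{-1}(-h) = 1.
        return 1.0 / (1.0 + np.exp(-h))

    def empirical_phi_risk(phi, h, x, y):
        # Monte Carlo estimate of R_phi: average of phi(zeta^{-1}(y * h(x))).
        return np.mean(phi(zeta_inv(y * h(x))))

    # Hypothetical example: a linear scorer and the loss phi(eta) = -log(eta).
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = np.where(x + 0.1 * rng.normal(size=100) > 0, 1, -1)
    h = lambda x: 2.0 * x
    print(empirical_phi_risk(lambda eta: -np.log(eta), h, x, y))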

Definition: A loss function \phi is consistent if the optimal classifier for R_{\phi} is equivalent to the optimal classifier for R_{\text{0-1}}.

We’ll also need a lemma to proceed,

Lemma: If the functional derivative \nabla_h R exists, then a local minimizer h^* satisfies,

\langle \nabla_h R(h^*), h - h^* \rangle \ge 0, \ \forall h \in \mathcal{H}

Proof: This is the first order condition for optimality.

Now we can state the main result.

Theorem: A loss function \phi: \mathbb{R} \to \mathbb{R} is consistent if it is of the form,

\phi(\eta) = C(\eta) + (1-\eta) C'(\eta)

where C(\eta) = C(1 - \eta) is a twice differentiable strictly convex function, \phi is evaluated at \hat \eta = \zeta^{-1}(h(x)), and h respects the minimal map property according to the constraint,

\zeta^{-1}(h) + \zeta^{-1}(-h) = 1

and the optimal classifier h^* satisfies,

\zeta^{-1} \circ h^*(x) = p(1 | x)

Proof:

To minimize the population risk we need the functional derivative, taken in the direction of an arbitrary test function \psi.

\langle \nabla_{h} R, \psi \rangle = \lim_{\epsilon \to 0} \cfrac{R(h + \epsilon \psi) - R(h)}{\epsilon} = \left[\cfrac{d}{d \epsilon} R(h + \epsilon \psi) \right]_{\epsilon = 0} = \int_{X \times Y} \left[\partial_h \phi \circ \zeta^{-1}(h) \, \eta(x) + \partial_h \phi \circ \zeta^{-1}(-h) \, (1 - \eta(x)) \right] \psi(x) \ d\mu(x)

Now, we want to fix a class of \phi with the assumption that \text{sign}(h) will determine our classifier. Any particulars beyond this are irrelevant to producing a 0-1 optimal classifier. This simplifies the optimality condition to,

\text{sign}\left(\partial_h \phi \circ \zeta^{-1}(h) \, \eta + \partial_h \phi \circ \zeta^{-1}(-h) \, (1- \eta)\right) = \text{sign}(h)

\iff \text{sign}\left(\phi'(\hat \eta) \, \partial_h \zeta^{-1}(h) \, \eta - \phi'(1-\hat \eta) \, \partial_h \zeta^{-1}(h) \, (1- \eta)\right) = \text{sign}(h)

\iff \text{sign}\left(\phi'(\hat \eta) \, \eta - \phi'(1-\hat \eta) \, (1- \eta)\right) = \text{sign}(h)
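
The first equivalence uses the chain rule together with the constraint \zeta^{-1}(-h) = 1 - \zeta^{-1}(h), which gives,

\partial_h \left[\phi \circ \zeta^{-1}(-h)\right] = -\phi'(1-\hat \eta) \, \partial_h \zeta^{-1}(h)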

Note that we assume here that \zeta^{-1} is differentiable and increasing with \lim_{h \to \infty } \zeta^{-1}(h) = 1, so the positive factor \partial_h \zeta^{-1}(h) can be pulled out of the \text{sign}. If we have,

\phi'(\hat \eta) = f(\hat \eta) \, C''(\hat \eta) \quad \text{s.t.} \quad C''(\eta) > 0

then we can do this trick again: by the symmetry C(\eta) = C(1-\eta) we have C''(\hat \eta) = C''(1-\hat \eta), and this common positive factor can also be pulled out of the \text{sign}. So the derivative of our loss function is an arbitrary function times the second derivative of a strictly convex function. Thus,

\text{sign}\left(\phi'(\hat \eta) \, \eta - \phi'(1-\hat \eta) \, (1- \eta)\right) = \text{sign}(h)

\iff \text{sign}\left(f(\hat \eta) \, C''(\hat \eta) \, \eta - f(1-\hat \eta) \, C''(1-\hat \eta) \, (1- \eta)\right) = \text{sign}(h)

\iff \text{sign}\left(f(\hat \eta) \, \eta - f(1-\hat \eta) \, (1- \eta)\right) = \text{sign}(h)

Recall that \hat \eta(x) = \zeta^{-1}(h(x)) so at \hat \eta = \eta we must be at an optimum if we’re to respect the minimal map property and treat \zeta^{-1} \circ h as a probability. This is the invertible link property. To satisfy this property we need,

\cfrac{f(\hat \eta)}{f(1-\hat\eta)} = \cfrac{1-\eta}{\eta} \ \Rightarrow \ f(\eta) \cong (1 - \eta) \ \text{or} \ f(\eta) \cong 1/\eta

Substitution makes it clear that the solution we want is f(\eta) = (1 - \eta); the other solution is degenerate. We’re almost done: now we solve for \phi with an integration by parts,

\phi(\eta) = \int (1-\eta) \, C''(\eta) \ d\eta = (1-\eta) C'(\eta) + \int C'(\eta) \ d\eta = C(\eta) + (1-\eta) C'(\eta)

\square
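
A quick symbolic sanity check of the identity \phi'(\eta) = (1-\eta) C''(\eta) used in the proof (a minimal sketch; C is a generic symbolic function standing in for any admissible choice):

    import sympy as sp

    eta = sp.symbols('eta')
    C = sp.Function('C')

    # phi as given by the theorem: phi(eta) = C(eta) + (1 - eta) * C'(eta)
    phi = C(eta) + (1 - eta) * sp.diff(C(eta), eta)

    # Its derivative collapses to (1 - eta) * C''(eta), i.e. f(eta) = 1 - eta.
    assert sp.simplify(sp.diff(phi, eta) - (1 - eta) * sp.diff(C(eta), eta, 2)) == 0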

Examples:

Take C(\eta) = H(\eta) = -\cfrac{1}{\log(2)}\left(\eta \log \eta + (1-\eta) \log(1-\eta)\right), the binary entropy in bits, and \zeta^{-1}(v) = \cfrac{e^v}{1 + e^v}. Then \phi(\eta) = C(\eta) + (1-\eta) C'(\eta) = -\cfrac{\log \eta}{\log(2)}, and for the margin v = y \cdot h(x) we have,

\phi \circ \zeta^{-1}(v) = \cfrac{1}{\log(2)} \log(1 + e^{-v})

which is the logistic loss (in bits).
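
A quick numerical check of the invertible link property for this example (a minimal sketch; the value of \eta and the grid search are just for illustration):

    import numpy as np

    def zeta_inv(v):
        return 1.0 / (1.0 + np.exp(-v))

    def phi(eta_hat):
        # phi(eta) = -log2(eta), obtained from C = binary entropy in bits.
        return -np.log2(eta_hat)

    eta = 0.7                              # true conditional probability p(1 | x)
    h_grid = np.linspace(-6.0, 6.0, 4001)  # candidate scores h(x)
    cond_risk = phi(zeta_inv(h_grid)) * eta + phi(zeta_inv(-h_grid)) * (1.0 - eta)
    h_star = h_grid[np.argmin(cond_risk)]

    # At the minimizer, zeta^{-1}(h*) recovers eta, as the theorem predicts.
    print(zeta_inv(h_star))                # approximately 0.7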

More abstractly, for a neural representation we compute layers as,

\vec h_{j+1} = \sigma(W_j \vec h_j)

where \sigma is an activation for each layer \vec h_j. Previously I showed that if a neural network is to be interpreted as a probabilistic decision process then we need \sigma = \zeta \circ p where \zeta is invertible and p can be an associated probability.

When we arrive at the output layer, we need to apply a loss function. Given the above analysis it would be best to change the activation to \zeta^{-1} and then apply a consistent loss \phi from above. This ensures that what is going into \phi is indeed a probability.
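
A minimal sketch of what this looks like at the output layer (the two-layer network, the tanh hidden activation, the logistic link, and the entropy-derived \phi are all illustrative choices, not prescriptions):

    import numpy as np

    def zeta_inv(v):
        # Output activation: converts the final score into a probability.
        return 1.0 / (1.0 + np.exp(-v))

    def phi(eta_hat):
        # Consistent loss derived from C = binary entropy (in bits): phi(eta) = -log2(eta).
        return -np.log2(eta_hat)

    def loss(score, label):
        # label in {-1, 1}; by the constraint, zeta_inv(label * score) is the predicted
        # probability of the observed label, so phi is applied to a probability.
        return phi(zeta_inv(label * score))

    # Hypothetical forward pass: the hidden layer uses sigma, the output layer uses zeta_inv.
    rng = np.random.default_rng(0)
    W1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)
    x = rng.normal(size=3)
    h1 = np.tanh(W1 @ x)        # sigma for the hidden layer (illustrative choice)
    score = w2 @ h1             # final score h(x)
    print(loss(score, 1))       # loss if the true label were y = 1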