This post has mostly been copied from something I posted on Less Wrong over a year ago. I am reposting it here because I want to reference it in a future post related to the two-update problem. Here, I propose a logical prior. I do not think this prior is nearly as good as the modified Demski prior, or even the original Demski prior, but I am posting it because it has some interesting properties.
Given a consistent but incomplete theory, how should one choose a random model of that theory?
My proposal is rather simple. Just assign probabilities to sentences in such a way that if an adversary were to choose a model, your Worst Case Bayes Score would be maximized. This assignment of probabilities represents a probability distribution on models, and we choose a model randomly from that distribution. However, it will take some work to show that what I just described even makes sense. We need to show that the Worst Case Bayes Score can be maximized, that such a maximum is unique, and that the resulting assignment of probabilities to sentences represents an actual probability distribution. This post gives the necessary definitions and proves these three facts.
Finally, I will show that a given probability assignment is coherent if and only if it is impossible to change the probability assignment in a way that simultaneously improves the Bayes Score by an amount bounded away from 0 in all models. This is nice because it gives us a measure of how far a probability assignment is from being coherent. Namely, we can define the "incoherence" of a probability assignment to be the supremum amount by which you can simultaneously improve the Bayes Score in all models. This could be a useful notion, since we usually cannot compute a coherent probability assignment, so in practice we need to work with incoherent probability assignments which approach a coherent one.
Now, let’s move on to the formal definitions and proofs.
Fix some language L, for example the language of first order set theory. Fix a consistent theory T of L, for example ZFC. Fix a nowhere zero probability measure \mu on L, for example \mu(\phi)=2^{-\ell(\phi)}, where \ell(\phi) is the number of bits necessary to encode \phi.
A probability assignment on L is any function from L to the interval [0,1]. Note that this can be any function, and does not have to represent a probability distribution. Given a probability assignment P on L and a model M of T, we define the Bayes Score of P with respect to M by
\mbox{Bayes}(M,P)=\sum_{M\models \phi}\log_2(P(\phi))\mu(\phi)+\sum_{M\models\neg \phi}\log_2(1-P(\phi))\mu(\phi).
We define the Worst Case Bayes Score \mbox{WCB}(P) to be the infimum of \mbox{Bayes}(M,P) over all models M of T. Let \mathbb{P} denote the probability assignment that maximizes the function \mbox{WCB}. We will show that this maximum exists and is unique, so \mathbb{P} is well defined.
In fact, \mathbb{P} is also coherent, meaning that there exists a probability distribution on the set of all models of T such that \mathbb{P}(\phi) is exactly the probability that a randomly chosen model satisfies \phi. Since the natural definition of a measurable subset of models comes from unions and intersections of the sets of all models satisfying a given sentence, we can think of \mathbb{P} as an actual probability distribution on the set of all models of T.
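Before the proofs, it may help to see the proposal numerically. The sketch below is only a toy propositional stand-in for the construction: the four sentences, the measure \mu, and the crude subgradient-ascent optimizer are all assumptions made for this illustration, not part of the definition above.

```python
import numpy as np

# Toy setting: "models" are the four truth assignments to two atoms A, B,
# and L is replaced by four propositional sentences with weights mu.
models = [(a, b) for a in (0, 1) for b in (0, 1)]
sentences = [
    lambda a, b: a,          # A
    lambda a, b: b,          # B
    lambda a, b: a and b,    # A and B
    lambda a, b: a or b,     # A or B
]
mu = np.array([0.4, 0.3, 0.2, 0.1])  # illustrative nowhere-zero measure

def bayes(model, p):
    """The mu-weighted log score of the assignment p in the given model."""
    a, b = model
    truth = np.array([float(s(a, b)) for s in sentences])
    return np.sum(mu * (truth * np.log2(p) + (1 - truth) * np.log2(1 - p)))

def wcb(p):
    """Worst Case Bayes Score: the minimum over all models."""
    return min(bayes(m, p) for m in models)

# Crude subgradient ascent on WCB (a minimum of concave functions, hence concave).
p = np.full(len(sentences), 0.5)
for _ in range(5000):
    worst = min(models, key=lambda m: bayes(m, p))  # a current worst-case model
    a, b = worst
    truth = np.array([float(s(a, b)) for s in sentences])
    grad = mu * (truth / p - (1 - truth) / (1 - p)) / np.log(2)
    p = np.clip(p + 0.01 * grad, 1e-6, 1 - 1e-6)

print("approximate maximin assignment:", np.round(p, 3))
print("worst-case Bayes Score:", round(wcb(p), 4))
```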
First, we must show that there exists a probability assignment P which maximizes \mbox{WCB}.
Note that \mbox{Bayes}(M,P) either diverges to -\infty, or converges to a non-positive real number. If P is the identically 1/2 function, then \mbox{WCB}(P)=-1, so there is at least one P for which \mbox{WCB}(P) is finite. This means that when maximizing \mbox{WCB}(P), we need only consider P for which \mbox{Bayes}(M,P) converges to a number between -1 and 0 for all M.
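To spell out why \mbox{WCB}(P)=-1 for the identically 1/2 function: every model decides every sentence, so each \phi contributes \log_2(1/2)\mu(\phi) regardless of which way it is decided, and hence for every M,
\mbox{Bayes}(M,1/2)=\sum_{\phi\in L}\log_2(1/2)\mu(\phi)=-\sum_{\phi\in L}\mu(\phi)=-1,
using the fact that \mu is a probability measure on L.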
Assume by way of contradiction that there is no P which maximizes \mbox{WCB}. Then there must be some supremum value m such that \mbox{WCB} can get arbitrarily close to m, but never equals or surpasses m. Consider an infinite sequence of probability assignments \{P_i\} such that \mbox{WCB}(P_i)\rightarrow m. By a diagonal argument (there are only countably many sentences, and [0,1] is compact), we may pass to a subsequence of \{P_i\} in order to assume that \{P_i(\phi)\} converges for every sentence \phi. Let P be such that P_i(\phi)\rightarrow P(\phi) for all \phi.
By assumption, \mbox{WCB}(P) must be less than m, so we can take a model M for which \mbox{Bayes}(M,P)<m. Then there exists a finite subset S of L such that \mbox{Bayes}_S(M,P)<m, where
\mbox{Bayes}_S(M,P)=\sum_{\phi\in S, M\models \phi}\log_2(P(\phi))\mu(\phi)+\sum_{\phi\in S, M\models\neg \phi}\log_2(1-P(\phi))\mu(\phi).
Note that in order to keep the Bayes Score at least -1, any P_i must satisfy 2^{-1/\mu(\phi)}\leq P_i(\phi)\leq 1 if M\models \phi, and 0\leq P_i(\phi)\leq 1-2^{-1/\mu(\phi)} if M\models\neg\phi. Consider the space of all functions f from S to [0,1] satisfying these inequalities. Since this is a product of finitely many closed and bounded intervals, the space is compact. Further, \mbox{Bayes}_S(M,f) is a continuous function of f, defined everywhere on this compact set. Therefore,
\lim_{i\rightarrow\infty}\mbox{Bayes}_S(M,P_i)=\mbox{Bayes}_S(M,P)<m.
However, clearly \mbox{WCB}(P_i)\leq\mbox{Bayes}(M,P_i)\leq\mbox{Bayes}_S(M,P_i), so
\lim_{i\rightarrow\infty}\mbox{WCB}(P_i)<m,
contradicting our assumption that \mbox{WCB}(P_i) converges to m.
Next, we will show that there is a unique probability assignment which maximizes \mbox{WCB}. Assume by way of contradiction that there are two distinct probability assignments, P_1 and P_2, which maximize \mbox{WCB}. Consider the probability assignment P_3, given by
P_3(\phi)=\frac{\sqrt{P_1(\phi)P_2(\phi)}}{\sqrt{P_1(\phi)P_2(\phi)}+\sqrt{(1-P_1(\phi))(1-P_2(\phi))}}.
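(In odds form, P_3 is just the geometric midpoint of P_1 and P_2: whenever P_1(\phi) and P_2(\phi) lie strictly between 0 and 1,
\frac{P_3(\phi)}{1-P_3(\phi)}=\sqrt{\frac{P_1(\phi)}{1-P_1(\phi)}\cdot\frac{P_2(\phi)}{1-P_2(\phi)}},
so the log-odds of P_3 are the average of the log-odds of P_1 and P_2.)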
It is quick to check that this definition satisfies both
\log_2(P_3(\phi))\geq \frac{\log_2(P_1(\phi))+\log_2(P_2(\phi))}{2}
and
\log_2(1-P_3(\phi))\geq \frac{\log_2(1-P_1(\phi))+\log_2(1-P_2(\phi))}{2},
and in both cases equality holds only when P_1(\phi)=P_2(\phi).
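To spell out the check for the first inequality (the second is symmetric, with each P_i(\phi) replaced by 1-P_i(\phi)): exponentiating, it says exactly that P_3(\phi)\geq\sqrt{P_1(\phi)P_2(\phi)}, which holds because, by the Cauchy–Schwarz inequality, the denominator in the definition of P_3 satisfies
\sqrt{P_1(\phi)P_2(\phi)}+\sqrt{(1-P_1(\phi))(1-P_2(\phi))}\leq\sqrt{P_1(\phi)+(1-P_1(\phi))}\sqrt{P_2(\phi)+(1-P_2(\phi))}=1,
with equality exactly when P_1(\phi)=P_2(\phi).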
Therefore, we get that for any fixed model, M,
\mbox{Bayes}(M,P_3)\geq \frac{\mbox{Bayes}(M,P_1)+\mbox{Bayes}(M,P_2)}{2}.
By looking at the improvement coming from a single sentence \phi with P_1(\phi)\neq P_2(\phi), we see that
\mbox{Bayes}(M,P_3)-\frac{\mbox{Bayes}(M,P_1)+\mbox{Bayes}(M,P_2)}{2}
is actually bounded below by some \delta>0 which does not depend on M: such a \phi contributes a strictly positive amount whether M\models\phi or M\models\neg\phi, and every other sentence contributes a non-negative amount. This means that
\mbox{WCB}(P_3)\geq \frac{\mbox{WCB}(P_1)+\mbox{WCB}(P_2)}{2}+\delta>\mbox{WCB}(P_1),
which contradicts the fact that P_1 and P_2 maximize \mbox{WCB}.
This means that there is a unique probability assignment, \mathbb{P}, which maximizes \mbox{WCB}, but we still need to show that \mathbb{P} is coherent. For this, we will use the fact that \mathbb{P} is coherent if and only if \mathbb{P} assigns probability 0 to every contradiction, probability 1 to every tautology, and satisfies \mathbb{P}(\phi)=\mathbb{P}(\phi\wedge\psi)+\mathbb{P}(\phi\wedge\neg\psi) for all \phi and \psi.
Clearly \mathbb{P} assigns probability 0 to every contradiction, since otherwise we could increase the Bayes Score in all models by the same amount by updating that probability to 0. Similarly \mathbb{P} clearly assigns probability 1 to all tautologies.
If \mathbb{P}(\phi)\neq\mathbb{P}(\phi\wedge\psi)+\mathbb{P}(\phi\wedge\neg\psi), then we update all three probabilities as follows:
\mathbb{P}(\phi)\mapsto \frac{1}{1+\frac{1-\mathbb{P}(\phi)}{\mathbb{P}(\phi)}(2^{-x/\mu(\phi)})},
\mathbb{P}(\phi\wedge\psi)\mapsto \frac{1}{1+\frac{1-\mathbb{P}(\phi\wedge\psi)}{\mathbb{P}(\phi\wedge\psi)}(2^{x/\mu(\phi\wedge\psi)})},
and
\mathbb{P}(\phi\wedge\neg\psi)\mapsto \frac{1}{1+\frac{1-\mathbb{P}(\phi\wedge\neg\psi)}{\mathbb{P}(\phi\wedge\neg\psi)}(2^{x/\mu(\phi\wedge\neg\psi)})},
where x is the unique real number such that the three new probabilities satisfy \mathbb{P}(\phi)=\mathbb{P}(\phi\wedge\psi)+\mathbb{P}(\phi\wedge\neg\psi). This correction increases the Bayes Score by the same amount in all models, and therefore increases \mbox{WCB}, contradicting the maximality of \mbox{WCB}(\mathbb{P}). Therefore \mathbb{P} is coherent, as desired.
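One way to see why the change in Bayes Score is the same in every model: each of the three updates is a shift in log-odds, multiplying the odds of \phi by 2^{x/\mu(\phi)} and the odds of \phi\wedge\psi and \phi\wedge\neg\psi by 2^{-x/\mu(\phi\wedge\psi)} and 2^{-x/\mu(\phi\wedge\neg\psi)} respectively. If the odds of a sentence \theta are multiplied by 2^{c/\mu(\theta)}, turning P(\theta) into P'(\theta), then the \theta-term of the Bayes Score changes by
\mu(\theta)\log_2\frac{P'(\theta)}{P(\theta)}\mbox{ if }M\models\theta,\qquad \mu(\theta)\log_2\frac{P'(\theta)}{P(\theta)}-c\mbox{ if }M\models\neg\theta.
A model of T either satisfies \phi and falsifies exactly one of the two conjunctions (extra term -(-x)=x), or falsifies \phi together with both conjunctions (extra term -(x-x-x)=x), so the total change is the same in every model.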
Finally, we show that a given probability assignment P is coherent if and only if it is impossible to simultaneously improve the Bayes Score in all models by an amount bounded away from 0. The above proof that \mathbb{P} is coherent actually gives one direction, since the only fact it used about \mathbb{P} is that one cannot simultaneously improve its Bayes Score in all models by an amount bounded away from 0. For the other direction, assume by way of contradiction that P is coherent, and that there exist a Q and an \epsilon>0 such that \mbox{Bayes}(M,Q)-\mbox{Bayes}(M,P)>\epsilon for all M.
In particular, since P is coherent, it represents a probability distribution on models, so we can choose a random model M from the distribution P. If we do so, we must have
\mathbb{E}(\mbox{Bayes}(M,Q))-\mathbb{E}(\mbox{Bayes}(M,P))>0.
However, this contradicts the well-known fact that the expected Bayes Score is maximized by reporting the honest probabilities corresponding to the actual distribution from which M is chosen.
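This is just the statement that the logarithmic scoring rule is proper. Concretely, for each sentence \phi, the expected difference between the \phi-terms of the two scores is
\mu(\phi)\left(P(\phi)\log_2\frac{Q(\phi)}{P(\phi)}+(1-P(\phi))\log_2\frac{1-Q(\phi)}{1-P(\phi)}\right)\leq 0,
which is -\mu(\phi) times the Kullback–Leibler divergence between the Bernoulli distributions with parameters P(\phi) and Q(\phi); summing over \phi gives \mathbb{E}(\mbox{Bayes}(M,Q))\leq\mathbb{E}(\mbox{Bayes}(M,P)).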
Coming up, I will present a list of open problems related to this prior.