Subagent perfect minimax

https://www.lesswrong.com/posts/5bd75cc58225bf0670375309/subagent-perfect-minimax

% operators that are separated from the operand by a space

% operators that require brackets

% operators that require parentheses

% Paper specific

This post continues the study of minimax forecasting.

The minimax decision rule has the pathology that, when events are sufficiently "optimistic", behavior can become highly suboptimal. This is analogous to off-policy irrational behavior in Nash equilibria of games in extensive form. In order to remedy the problem, we introduce a refinement called "subagent perfect minimax," somewhat analogous to subgame perfect equilibria and other related solution concepts. It is possible to prove existence and, when the model factorizes, dynamic consistency.

The proofs are omitted, but we can easily provide them if necessary.

##Motivation

Consider the following class of environments: during the first step, the agent gains 1$ or action b which leads to gaining $$0$.

The minimax payoff of this class is 0$ in the first step, choose b if you gained 1, which is the same as the worst-case (p=0$) payoff of \pi^*.

Note that the minimax environment has p=0 and \pi only differs from \pi^* on histories which are impossible in that environment. Moreover, if we considered the class p \in [\epsilon,1] for some \epsilon > 0, the minimax payoff would be $$(1+\epsilon)$, and (\pi^*) would be the sole minimax policy. Therefore, in order to eliminate the pathology, we need to formulate a stability condition that ensures any admissible history is treated as occurring with at least infinitesimal probability.

##Results

Now consider the forecasting setting, with action set \Act, observation set \Obs, time discount function \gamma: \Nats \rightarrow \Reals^{\geq 0} and reward function r: \His \rightarrow \Reals. As before, \gamma and r define the utility function u: (\Pol) \times \Env \rightarrow \Reals.

Consider \Phi \in \Mod and denote

\Adm:={x \in \Obs^+ \mid \ \exists \mu \in \Phi: \mu(x\Env) > 0}

#Definition 1

Consider X \subseteq \Adm finite.\pi^* \in \MixPol is called an X-stable minimax policy for \Phi when there are sequences {\epsilon^{(n)}: X \rightarrow (0,1)}{n \in \Nats} and {\pi^{(n)} \in \MixPol}{n \in \Nats} s.t. \epsilon^{(n)} \rightarrow 0, \pi^{(n)} \rightarrow \pi^* and \pi^{(n)} is a minimax policy for

\Phi^{(n)}:={\mu \in \Phi \mid \forall x \in \Obs^*, o \in \Obs: xo \in X \implies \mu(xo\Env) \geq \epsilon^{(n)}(xo) \mu(x\Env)}

In particular, the definition assumes \Phi^{(n)} is non-empty.

#Proposition 1

For any X \subseteq \Adm finite, there exists \pi^* \in \MixPol which is an X-stable minimax policy for \Phi.

#Proposition 2

For any X \subseteq \Adm finite and \pi^* \in \MixPol an X-stable minimax policy for \Phi, {\pi^*} is in particular a minimax policy for \Phi.

#Proposition 3

Consider X \subseteq \Adm finite, \pi^* \in \MixPol an X-stable minimax policy for \Phi and {x \in \Obs^} s.t. {\forall y \sqsubseteq x : y \in X \cup {\Estr_\Obs}}. Assume {\Phi} factorizes at {x} into {\Phi_1,\Phi_2 \in \Mod}. Denote {\Prj_1: (\Pol) \rightarrow (\bar{x}\Obs^ \rightarrow \Act)} and {\Prj_2: (\Pol) \rightarrow (x\Obs^* \rightarrow \Act)} the restriction mappings. Define {\pi_1^* := \Prj_{1*} \pi} and define {\pi_2^: \Act^{\Abs{x}} \rightarrow (x\Obs^ \rightarrow \Act)} by

\pi_2^(t):=\Prj_{2}(\pi^* \mid {s: \Pol \mid t_i=s(x_{<i})})

Then

\pi_2^* \in \Argmax{\pi_2: \Act^{\Abs{x}} \rightarrow (x\Obs^* \rightarrow \Act)} \min_{\mu \in \Phi_2} \E_{(\pi_1^* \otimes_x \pi_2) \times \mu}[u_{>\Abs{x}}]

Note that, thanks to X-stability, Proposition 3 doesn’t require the condition \min_{\mu \in \Phi_1} \mu(x\Env) > 0, as opposed to the case for arbitrary minimax policies (see Proposition 1 here).

#Definition 2

{\pi^* \in \MixPol} is called a subagent perfect minimax policy for {\Phi} when there is a sequence {{X^{(n)} \subseteq \Adm}{n \in \Nats}} s.t. each {X^{(n)}} is finite, {X^{(n)} \subseteq X^{(n+1)}} and {\bigcup{n \in \Nats} X^{(n)} = \Adm}, and a sequence {\pi^{(n)} \in \MixPol} s.t.{\pi^{(n)} \rightarrow \pi} and {\pi^{(n)}} is an {X^{(n)}}-stable minimax policy for {\Phi}.

#Proposition 4

There exists \pi^* \in \MixPol which is a subagent perfect minimax policy for \Phi.

#Proposition 5

For \pi^* \in \MixPol a subagent perfect minimax policy for \Phi, {\pi^*} is in particular a minimax policy for \Phi.

#Proposition 6

Consider \pi^* \in \MixPol a subagent perfect minimax policy for \Phi and {x \in \Adm}. Assume {\Phi} factorizes at {x} into {\Phi_1,\Phi_2 \in \Mod}. Then

\pi_2^* \in \Argmax{\pi_2: \Act^{\Abs{x}} \rightarrow (x\Obs^* \rightarrow \Act)} \min_{\mu \in \Phi_2} \E_{(\pi_1^* \otimes_x \pi_2) \times \mu}[u_{>\Abs{x}}]