Contents

Background
Results
Proofs \text{This}\newcommand{\Comment}[1]{}

% operators that are separated from the operand by a space \DeclareMathOperator{\Sgn}{sgn} \DeclareMathOperator{\Supp}{supp} \DeclareMathOperator{\Dom}{dom}

% autosize delimiters \newcommand{\AP}[1]{\left(#1\right)} \newcommand{\AB}[1]{\left[#1\right]} \newcommand{\AC}[1]{\left{#1\right}} \newcommand{\APM}[2]{\left(#1;\middle\vert;#2\right)} \newcommand{\ABM}[2]{\left[#1;\middle\vert;#2\right]} \newcommand{\ACM}[2]{\left{#1;\middle\vert;#2\right}}

% operators that require brackets \newcommand{\Pa}[2]{\underset{#1}{\operatorname{Pr}}\AB{#2}} \newcommand{\CP}[3]{\underset{#1}{\operatorname{Pr}}\ABM{#2}{#3}} \newcommand{\PP}[2]{\underset{\substack{#1 \ #2}}{\operatorname{Pr}}} \newcommand{\PPP}[3]{\underset{\substack{#1 \ #2 \ #3}}{\operatorname{Pr}}} \newcommand{\E}[1]{\underset{#1}{\operatorname{E}}} \newcommand{\Ea}[2]{\underset{#1}{\operatorname{E}}\AB{#2}} \newcommand{\CE}[3]{\underset{#1}{\operatorname{E}}\ABM{#2}{#3}} \newcommand{\EE}[2]{\underset{\substack{#1 \ #2}}{\operatorname{E}}} \newcommand{\EEE}[3]{\underset{\substack{#1 \ #2 \ #3}}{\operatorname{E}}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\I}[1]{\underset{#1}{\operatorname{I}}} \newcommand{\CI}[3]{\underset{#1}{\operatorname{I}}\ABM{#2}{#3}} \newcommand{\Ia}[2]{\underset{#1}{\operatorname{I}}\AB{#2}} \newcommand{\II}[2]{\underset{\substack{#1 \ #2}}{\operatorname{I}}} \newcommand{\III}[3]{\underset{\substack{#1 \ #2 \ #3}}{\operatorname{I}}}

% operators that require parentheses \newcommand{\En}{\operatorname{H}} \newcommand{\Ena}[1]{\operatorname{H}\AP{#1}} \newcommand{\PS}[1]{\mathcal{P}\AP{#1}}

\newcommand{\D}{\mathrm{d}} \newcommand{\KL}[2]{\operatorname{D}{\mathrm{KL}}\AP{#1\middle\vert\middle\vert#2}} \newcommand{\RD}[3]{\operatorname{D}{#1}\AP{#2\middle\vert\middle\vert#3}} \newcommand{\Dtv}{\operatorname{d}{\text{tv}}} \newcommand{\Dtva}[1]{\operatorname{d}{\text{tv}}\AP{#1}}

\newcommand{\Argmin}[1]{\underset{#1}{\operatorname{arg,min}},} \newcommand{\Argmax}[1]{\underset{#1}{\operatorname{arg,max}},}

\newcommand{\Nats}{\mathbb{N}} \newcommand{\Ints}{\mathbb{Z}} \newcommand{\Rats}{\mathbb{Q}} \newcommand{\Reals}{\mathbb{R}} \newcommand{\Coms}{\mathbb{C}}

\newcommand{\Estr}{\boldsymbol{\lambda}}

\newcommand{\Lim}[1]{\lim_{#1 \rightarrow \infty}} \newcommand{\LimInf}[1]{\liminf_{#1 \rightarrow \infty}} \newcommand{\LimSup}[1]{\limsup_{#1 \rightarrow \infty}}

\newcommand{\Abs}[1]{\left\vert #1 \right\vert} \newcommand{\Norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\Floor}[1]{\left\lfloor #1 \right\rfloor} \newcommand{\Ceil}[1]{\left\lceil #1 \right\rceil} \newcommand{\Chev}[1]{\left\langle #1 \right\rangle} \newcommand{\Quote}[1]{\left\ulcorner #1 \right\urcorner}

\newcommand{\K}{\xrightarrow{\text{k}}} \newcommand{\PF}{\xrightarrow{\circ}}

\newcommand{\F}[2]{\AC{#1\rightarrow#2}}

% Paper specific

\newcommand{\A}{\mathcal{A}} \newcommand{\St}{\mathcal{S}} \newcommand{\XS}{\mathcal{X}} \newcommand{\O}{\mathcal{O}} \newcommand{\T}{\mathcal{T}} \newcommand{\R}{\mathcal{R}} \newcommand{\Env}{e}

\DeclareMathOperator{\Sta}{st} \DeclareMathOperator{\Rew}{rw}

\newcommand{\Ut}{\operatorname{U}} \newcommand{\V}{\operatorname{V}} \newcommand{\Hy}{\mathcal{H}} \newcommand{\Rg}{\mathrm{R}}

\newcommand{\G}{\text{glob}} \newcommand{\PD}{\dim_\text{p}} \newcommand{\EA}[2]{\underset{#1}{\operatorname{E}}^*\AB{#2}} is the first in a series of essays that aim to derive a regret bound for DRL that depends on finer attributes of the prior than just the number of hypotheses. Specifically, we consider the entropy of the prior and a certain learning-theoretic dimension parameter. As a "by product", we derive a new regret bound for ordinary RL without resets and without traps. In this chapter, we open the series by demonstrating the latter, under the significant simplifying assumption that the MDPs are deterministic.

Background

The regret bound we previously derived for DRL grows as a power law with the number of hypotheses. In contrast, the RL literature is usually concerned with considering all transition kernels on a fixed state space satisfying some simple assumption (e.g. a bound on the diameter or bias span). In particular, the number of hypotheses is uncountable. While the former seems too restrictive (although it’s relatively simple to generalize it to *countable *hypothesis classes), the latter seems too coarse. Indeed, we expect a universally intelligent agent to detect *patterns *in the data, i.e. follow some kind of simplicity prior rather than a uniform distribution over transition kernels. The underlying technique of our proof was lower bounding the information gain in a single episode of posterior sampling by the expected regret incurred during this episode. Although we have not previously stated it in this form, the resulting regret bound depends on the *entropy *of the prior (we considered a uniform prior instead). This idea (unbeknownst to us) appeared earlier in Russo and Van Roy. Moreover, later Russo and Van Roy used to it formulate a generalization of posterior sampling they call "information-directed sampling" the can produce far better regret bounds in certain scenarios. However, to the best of our knowledge, this technique was not previously used to analyze reinforcement learning (as opposed to bandits). Therefore, it seems natural to derive such an "entropic" regret bound for ordinary RL, before extending it to DRL. Now, Osband and Van Roy derived a regret bound for priors supported on some space of transition kernels as a function of its Kolmogorov dimension and "eluder" dimension (the latter introduced previously by Russo and Van Roy). They also consider a continuous state space. This is a finer approach than considering nearly arbitrary transition kernels on a fixed state space, but it still doesn’t distinguish between different priors with the same support. Our new results involve a parameter similar to eluder dimension, but instead of Kolmogorov dimension we use entropy (in the following chapters we will see that Kolmogorov dimension is, in some sense, an upper bound on entropy). As opposed to Osband and Van Roy, we currently limit ourselves to finite state spaces, but on the other hand we consider no resets (at the price of a "no traps" assumption).
In this chapter we derive the entropic regret bound for *deterministic *MDPs. (In the following we will call them deterministic decision processes (DDPs) since they have little to do with Markov.) This latter restriction significantly simplifies the analysis. In the following chapters, we will extend it to stochastic MDPs, however the resulting regret bound will be somewhat weaker.

Results

We start by introducing a new learning-theoretic concept of dimension. It is similar to eluder dimension, but is adapted to the discrete deterministic setting and also somewhat stronger (i.e. smaller: more environment classes are low dimensional w.r.t. this concept than w.r.t. eluder dimension). Definition 1 Consider sets A, B and F \subseteq \F{A}{B} non-empty. Given C \subseteq A \times B and a^ \in A, we say that a^* is F-dependent of C when for any f,g:A\rightarrow B s.t. for any (a,b)\in C it holds f(a)=g(a)=b, we have f\AP{a^}=g\AP{a^}.* Otherwise, we say that a^ is F-independent of C.* The prediction dimension of F (denoted \PD{F}) is the supremum of the set of n\in\Nats for which there is a sequence \AC{\AP{a_k\in A, b_k\in B}}{k\in[n]} s.t. for every k\in[n], a_k is F-independent of \AC{\AP{a_j,b_j}}{j\in[k]}. Fix non-empty finite sets \St (the set of states) and \A (the set of actions). Denote \XS:=\St\times[0,1], where the second factor is regarded as the space of reward values. Given \Env:\St\times\A\rightarrow\XS, \T^\Env:\St\times\A\rightarrow\St and \R^\Env:\St\times\A\rightarrow[0,1] s.t. \Env(s,a)=\AP{\T^\Env(s,a),\R^\Env(s,a)}, we regard \T^\Env as the (deterministic) transition kernel and \R^\Env as the reward function associated with "DDP hypothesis" e. This allows us to speak of the dimension of a DDP hypothesis class (i.e. some \Hy\subseteq\F{\St\times\A}{\XS}). We now give some examples for dimensions of particular function/hypothesis classes. Proposition 1 *Given any A, B and F\subseteq \F{A}{B}, we have \PD{F} < \Abs{F}. * Proposition 2 *Given any *A, B *and *F\subseteq \F{A}{B}, *we have \PD{F}\leq\Abs{A}. In particular, given \Hy as above, \dim{\Hy}\leq\Abs{\St}\cdot\Abs{\A}. * Proposition 3 *We now consider deterministic Markov decision processes that are cellular automata. Consider finite sets X (the set of cells), M (the set of neighbor types) and a mapping \nu: X \times M \rightarrow X (which cell is the neighbor of which cell). For example, M might be a subset of a group acting on X. More specifically, *X can be \AP{\Ints/n\Ints}^d acting on itself, corresponding to a d-dimensional toroidal cellular automaton of size n. *Consider another set C (the set of cell states) and suppose that \St=\F{X}{C}. Given any s\in\St, and x\in X, define s_x:X\rightarrow C by s_x(m):=s\AP{\nu(x,m)}. Given any *\T:\F{M}{C}\times\A\rightarrow C, *define *\T^{\text{glob}}:\St\times\A\rightarrow\St by \T^\text{glob}(s,a)(x):=\T\AP{s_x,a}. *Given any \R:\F{M}{C}\rightarrow[0,1], define \R^\text{glob}:\St\rightarrow[0,1] by \R^\text{glob}(s):=\frac{1}{\Abs{X}}\sum_{x\in X}\R\AP{s_x}. Define \Hy by \Hy:=\ACM{(\T^\text{glob},\R^\text{glob})}{\T:\F{M}{C}\times\A\rightarrow C,\ \R:\F{M}{C}\rightarrow[0,1]}That is, \Hy is the set of transition kernels and reward functions that are local in the sense defined by \nu. Then, \PD{\Hy}\leq\Abs{C}^{\Abs{M}}\AP{\Abs{\A}+1}. * In Proposition 3, it might seem like, although the rules of the automaton are local, the influence of the agent is necessarily global, because the dependence on the action appears in all cells. However, this is not really a restriction: the state of the cells can encode a particular location for the agent and the rules might be such that the agent’s influence is local around this location. More unrealistic is the full observability. Dealing with partially observable cellular automata is outside the scope of this essay. The central lemma in the proof of the regret bound for RL is a regret bound in its own right, in the setting of (deterministic) contextual bandits. Since this lemma might be of independent interest, we state it already here. Let \St (contexts), \A (arms) and \O (outcomes) be non-empty sets. Fix a function \R:\O\rightarrow[0,1] (the reward function). For any c\in\St^\omega (a fixed sequence of contexts), \Env:\St\times\A\rightarrow\O (outcome rule) and \pi:\AP{\St\times\O}^\times\St\K\A (policy), we define \Env c\pi\in\Delta\O^\omega to be the resulting distribution over outcome histories. Given \gamma\in[0,1), we define \Ut_\gamma:\O^\omega\rightarrow[0,1] (the utility function) by \Ut_\gamma\AP{o}:=(1-\gamma)\sum_{n=0}^\infty{\gamma^n \R\AP{o_n}}Lemma 1 Consider a countable non-empty set of hypotheses \Hy\subseteq\F{\St\times\A}{\O} and some \zeta\in\Delta\Hy (the prior). For each s\in\St, define \Hy_s\subseteq\F{\A}{\O} by \Hy_s:=\ACM{\Env:\A\rightarrow\O}{\Env(a)=\Env'(s,a),\ \Env'\in\Hy}Let D:=\max_{s\in\St}{\PD{\Hy_s}} and suppose that \A is countable [this assumption is to simplify the proof and is not really necessary]. Then, there exists \pi^\dagger:\AP{\St\times\O}^\times\St\K\A s.t. for any c\in\St^\omega and \gamma\in(0,1) \Ea{\Env\sim\zeta}{(1-\gamma)\sum_{n=0}^\infty{\gamma^n \max_{a\in\A}{\R\AP{\Env\AP{c_n,a}}}}-\Ea{\Env c\pi^\dagger}{\Ut_\gamma}}\leq\sqrt{\frac{16D\En(\zeta)}{\ln{2}}\cdot(1-\gamma)}Note that the expression on the left hand side is the Bayesian regret. On the right hand side, \En(\zeta) stands for the Shannon entropy of \zeta. In particular we have \En(\zeta)\leq\ln{\Abs{\Hy}}\leq\Abs{\St}\cdot\Abs{\A}\ln{\Abs{\O}}Also, it’s not hard to see that \Abs{\Hy}\leq\Abs{\O}^{\PD{\Hy}}\leq\Abs{\O}^{D\Abs{\St}} and therefore D\En(\zeta)\leq\AP{\PD{\Hy}}^2\ln{\Abs{\O}}D\En(\zeta)\leq D^2\Abs{\St}\ln{\Abs{\O}}Finally, the policy \pi^\dagger we actually consider in the proof is Thompson sampling. Now we proceed to studying reinforcement learning. First, we state a regret bound for RL with resets. Now \St stands for the set of states and \A for the set of actions. We fix a sequence of initial states c\in\St^\omega. For any \Env:\St\times\A\rightarrow\XS (environment), \pi:\XS^\times\XS\K\A (policy), and T\in\Nats^+ we define \Env\pi\in\Delta\XS^\omega to be the resulting distribution over histories, assuming that the state is reset to c_n and the reward to 0 every time period of length T+1. In particular, we have \Pa{x\sim\Env c[T]\pi}{\forall n \in \Nats: x_{n(T+1)}=\AP{c_n,0}}=1Given \gamma\in[0,1) we define \Ut_\gamma:\XS^\omega\rightarrow[0,1] \Ut_\gamma\AP{sr}:=(1-\gamma)\sum_{n=0}^\infty{\gamma^n r_n}Theorem 1 Consider a countable non-empty set of hypotheses \Hy\subseteq\F{\St\times\A}{\XS} and some \zeta\in\Delta\Hy. Let D:=\PD{\Hy}. Then, for any T\in\Nats^+ and \gamma\in[0,1), there exists \pi^\dagger_{\gamma}:\XS^\times\XS\K\A s.t. for any c\in\St^\omega \Ea{\Env\sim\zeta}{\max_{\pi:\XS^\times\XS\rightarrow\A}\Ea{\Env c[T]\pi}{\Ut_\gamma}-\Ea{\Env c[T]\pi^\dagger_{T,\gamma}}{\Ut_\gamma}}\leq\sqrt{\frac{16D\En(\zeta)}{\ln{2}}\cdot(T+1)(1-\gamma)}Note that \Env c[T]\pi is actually a probability measure concentrated on a single history, sincee \pi is deterministic: we didn’t make it explicit only to avoid introducing new notation. Finally, we give the regret bound without resets. For any \Env:\St\times\A\rightarrow\XS and \pi:\XS^\times\XS\K\A, we define \Env\pi\in\Delta\XS^\omega to be the resulting distribution over histories, given initial state s_0 and no resets. Theorem 2 *Consider a countable non-empty set of hypotheses \Hy\subseteq\F{\St\times\A}{\XS} and some \zeta\in\Delta\Hy. Let D:=\PD{\Hy}. Assume that for any *\Env\in\Hy and s\in\St, \A_e^0(s)=\A [\A^0 was defined here in "Definition 1"; so was the value function \V(s,x) used below] (i.e. there are no traps). For any \gamma\in[0,1) we define \tau(\gamma) by \tau(\gamma):=\Ea{e\sim\zeta}{\max_{s\in\St}\sup_{x\in(\gamma,1)}\Abs{\frac{\D{\V_e(s,x)}}{\D{x}}}} Then, for any \gamma\in[0,1) s.t.1-\gamma\ll1 there exists \pi^\dagger_{\gamma}:\XS^\times\XS\K\A s.t. \Ea{\Env\sim\zeta}{\max_{\pi:\XS^\times\XS\rightarrow\A}\Ea{\Env\pi}{\Ut_\gamma}-\Ea{\Env\pi^\dagger_\gamma}{\Ut_\gamma}}=O\AP{\sqrt[3]{D\En(\zeta)\cdot\AP{\tau(\gamma)+1}(1-\gamma)}}Note that \tau(\gamma) decreases with \gamma so this factor doesn’t make the qualitative dependence on \gamma any worse. Both Theorem 1 and Theorem 2 have anytime variants in which the policy doesn’t depend on \gamma at the price of a slightly (within a constant factor) worse regret bound, but for the sake of brevity we don’t state them (our ultimate aim is DRL which is not anytime anyway). In Theorem 2 we didn’t specify the constant, so it is actually true verbatim without the \gamma dependence in \pi^\dagger_\gamma (but we still leave the dependence to simplify the proof a little). It is also possible to spell out the assumption on \gamma in Theorem 2.

Proofs

Definition A.1 Consider sets A, B and F \subseteq \F{A}{B} non-empty. A sequence \AC{\AP{a_k\in A, b_k\in B}}{k\in[n]} *is said to be F-independent, when for every k\in[n], a_k is F-independent of \AC{\AP{a_j,b_j}}{j\in[k]}. Definition A.2 Consider sets A, B. Given C \subseteq A \times B and a^ \in A, suppose f,g:A\rightarrow B are s.t.:

*For any (a,b)\in C, f(a)=g(a)=b *
f\AP{a^}=g\AP{a^} Then, f,g are said to shatter \AP{C,a^}. * Given sequences \AC{\AP{a_k\in A, b_k\in B}}{k\in[n]} and \AC{\AP{f_k,g_k:A\rightarrow B}}{k\in[n]}, (f,g) is said to shatter (a,b) when for any k\in[n], \AP{f_k,g_k} shatters \AP{\AC{\AP{a_j\in A, b_j\in B}}{j\in[k]},a_k}. Proof of Proposition 1 Consider \AC{\AP{a_k\in A,b_k\in B}}{k\in[n]} an F-independent sequence. We will now construct G\subseteq F s.t. \Abs{G}=n+1, which is sufficient. By the definition of F-independence, there is a sequence \AC{\AP{f_k,g_k\in F}}{k\in[n]} that shatters (a,b). We have f_k\AP{a_k}\ne g_k\AP{a_k} and therefore either f_k\AP{a_k}\ne b_k or g_k\AP{a_k}\ne b_k. Without loss of generality, assume that for each k\in[n], f_k\AP{a_k}\ne b_k. It follows that for any k\in[n] and j\in[k], f_j\AP{a_j}\ne b_j=f_k\AP{a_j} and therefore f_j\ne f_k. If n=0 then there is nothing to prove since F is non-empty, hence we can assume n > 0. For any k\in[n-1], f_k\AP{a_k}\ne b_k=g{n-1}\AP{a_k} and therefore f_k\ne g_{n-1}. Also, f_{n-1}\AP{a_{n-1}}\ne g_{n-1}\AP{a_{n-1}} and therefore f_{n-1}\ne g_{n-1}. We now take G=\AC{f_0, f_1 \ldots f_{n-1}, g_{n-1}}, completing the proof. \blacksquare Proof of Proposition 2 Consider \AC{\AP{a_k\in A,b_k\in B}}{k\in[n]} an F-independent sequence and \AC{\AP{f_k,g_k\in F}}{k\in[n]} that shatters it. For any k\in[n] and j\in[k], we have f_k\AP{a_j}=g_k\AP{a_j} but f_k\AP{a_k}\ne g_k\AP{a_k}, implying that a_j\ne a_k. It follows that \Abs{A}\geq n. \blacksquare Proof of Proposition 3 Consider \AC{\AP{s_k\in\St,a_k\in\A,t_k\in\St,r_k\in[0,1]}}{k\in[n]} an \Hy-independent sequence and \AC{\AP{\T^\G_k,\R^\G_k,\tilde{\T}^\G_k,\tilde{R}^\G_k}}{k\in[n]} that shatters it. Define A,B\subseteq[n] by A:=\ACM{k\in[n]}{\T_k^\G\AP{s_k,a_k}\ne\tilde{\T}_k^\G\AP{s_k,a_k}}B:=\ACM{k\in[n]}{\R_k^\G\AP{s_k}\ne\tilde{\R}_k^\G\AP{s_k}}Since \AP{\T^\G, \R^\G, \tilde{\T}^\G, \tilde{\R}^\G} shatters \AP{s,a,t,r}, we have A\cup B=[n]. Consider any k\in A. Obviously, there is x_k\in X s.t. \T_k^\G\AP{s_k,a_k}\AP{x_k}\ne\tilde{\T}k^\G\AP{s_k,a_k}\AP{x_k}Denote \sigma_k:=\AP{s_k}{x_k}\in C^M. We have \T_k\AP{\sigma_k,a_k}\ne\tilde{\T}_k\AP{\sigma_k,a_k}. On the other hand, for any j\in A\cap[k], the shattering implies \T_k^\G\AP{s_j,a_j}=\tilde{\T}_k^\G\AP{s_j,a_j} and in particular \T_k\AP{\sigma_j,a_j}=\tilde{\T}k\AP{\sigma_j,a_j}. Therefore, \AP{\sigma_k,a_k}\ne\AP{\sigma_j,a_j}. We conclude that \Abs{A} \leq \Abs{C}^\Abs{M}\cdot\Abs{\A}. Now consider any k\in B. Define f_k\in\Reals^{\F{M}{C}} by \AP{f_k}\sigma:=\frac{\Abs{\ACM{x\in X}{\AP{s_k}_x=\sigma}}}{\Abs{X}}We have f_k\cdot\R_k=\R_k^\G\AP{s_k}\ne\tilde{\R}_k^\G\AP{s_k}=f_k\cdot\tilde{\R}_kf_k\cdot\AP{\R_k-\tilde{\R}_k}\ne 0(These are dot products in \Reals^{\F{M}{C}}.) On the other hand, for any j\in B\cap[k], the shattering implies f_j\cdot\R_k=f_j\cdot\tilde{\R}k and therefore f_j\cdot\AP{\R_k-\tilde{\R}k}=0. Therefore, f_k is not in the linear span of \AC{f_j}{j\in B\cap[k]} and hence the set \AC{f_k}{k\in B} is linearly independent. We conclude that \Abs{B}\leq\dim{\Reals^{\F{M}{C}}}=\Abs{C}^\Abs{M}. Putting everything together, we get n\leq\Abs{A}+\Abs{B}\leq\Abs{C}^\Abs{M}\AP{\Abs{\A}+1}. \blacksquare Proposition A.1 Consider a countable set A, a set B, a non-empty countable set F\subseteq\F{A}{B}, some f^:A\rightarrow B, some \zeta\in\Delta F and some \xi\in\Delta A. Denote p:=\Pa{(a,f)\sim\xi\times\zeta}{f(a)\ne f^(a)}Then, for any q\in(0,1), we can choose A^\circ\subseteq A and F^\circ\subseteq F s.t. \xi\AP{A^\circ}\geq 1-q, \zeta\AP{F^\circ}\geq 1-\frac{p}{q}\PD{F} and for any f,g\in F^\circ and a\in A^\circ, f(a)=g(a). Proof of Proposition A.1 We define A^\circ by A^\circ:=\ACM{a\in A}{\Pa{f\sim\zeta}{f(a)\ne f^(a)} \leq \frac{p}{q}}The fact that \xi\AP{A^\circ}\geq 1-q follows from the definition of p. Denote n:=\Abs{A^\circ}\in\Nats\cup\AC{\omega} and enumerate A^\circ as A^\circ=\AC{a_k}{k\in[n]}. For each k\in[n], define F_k\subseteq F recursively by F_0:=F\F{k+1}:=\begin{cases}F_k\text{ if }\forall f,g\in F_k:f\AP{a_k}=g\AP{a_k}\\ACM{f\in F_k}{f\AP{a_k}=f^\AP{a_k}}\text{ otherwise}\end{cases}Set F^\circ:=\bigcap_{k\in[n]}F_k. Define I\subseteq[n] by I:=\ACM{k\in[n]}{F_{k+1}\ne F_k}Denote m:=\Abs{I}. For any i\in[m], we denote by k_i the i-th number in I in increasing order. Denote a^i:=a{k_i} and b^i:=f^\AP{a^i}. By the definition of F_k, for each i\in[m] we can choose f_i,g_i\in F{k_i} s.t. f_i\AP{a^_i}\ne g_i\AP{a^i}. Moreover, it also follows from the definition that for every i\in[m] and j\in[i], f_i\AP{a^_j}=g_i\AP{a^j}=b^j (using only the fact that f_i,g_i\in F{k_i} and k_i > k_j). Therefore, (f,g) shatters \AP{a^,b^*} and hence m\leq\PD{F}. By the definition of A^\circ, for any k\in I, \zeta\AP{F{k+1}}\geq\zeta\AP{F_k}-\frac{p}{q}. On the other hand, for any k\in[n]\setminus I, F{k+1}=F_k and in particular \zeta\AP{F{k+1}}=\zeta\AP{F_k}. We conclude that \zeta\AP{F^\circ}\geq\zeta\AP{F_0}-\frac{p}{q}m\geq 1-\frac{p}{q}\PD{F}\qquad\blacksquareProposition A.2 Consider a countable set A, a set B, a non-empty countable set F\subseteq\F{A}{B}, some \zeta\in\Delta F and some \xi\in\Delta A. Consider f^:A\rightarrow B s.t. * f^(a)\in\Argmax{b\in B}{\Pa{f\sim\zeta}{f(a)=b}}Then, \Pa{(a,f)\sim\xi\times\zeta}{f(a)\ne f^(a)}\leq\frac{1}{\ln{2}}\Ia{(a,f)\sim\xi\times\zeta}{f;a,f(a)}[\Ia{}{f;a,f(a)} stands for the mutual information between f and the joint random variables a,f(a).] Proof of Proposition A.2 By the chain rule for mutual information \Ia{(a,f)\sim\xi\times\zeta}{f;a,f(a)}=\Ia{(a,f)\sim\xi\times\zeta}{f;a}+\CI{(a,f)\sim\xi\times\zeta}{f;f(a)}{a}The first term on the right hand side obviously vanishes, and the second term can be expressed in terms of KL-divergence. For any a\in A, Define \operatorname{ev}a:F\rightarrow B by \operatorname{ev}a(f):=f(a). We get \Ia{(a,f)\sim\xi\times\zeta}{f;a,f(a)}=\Ea{(a,f)\sim\xi\times\zeta}{\KL{\delta{f(a)}}{\operatorname{ev}{a*}\zeta}}=\Ea{(a,f)\sim\xi\times\zeta}{\ln{\frac{1}{\zeta\AP{\operatorname{ev}a^{-1}\AP{f(a)}}}}}\Ia{(a,f)\sim\xi\times\zeta}{f;a,f(a)}\geq\Ea{(a,f)\sim\xi\times\zeta}{\ln{\frac{1}{\zeta\AP{\operatorname{ev}a^{-1}\AP{f(a)}}}};f(a)\ne f^(a)}For any a\in A and f\in F, \zeta\AP{\operatorname{ev}_a^{-1}\AP{f(a)}}\leq\zeta\AP{\operatorname{ev}_a^{-1}\AP{f^(a)}} by definition of f^. If f(a)\ne f^(a), then \zeta\AP{\operatorname{ev}a^{-1}\AP{f(a)}}+\zeta\AP{\operatorname{ev}a^{-1}\AP{f^(a)}}=\zeta\AP{\operatorname{ev}_a^{-1}\AP{\AC{f(a),f^(a)}}}\leq 1It follows that, in this case, \zeta\AP{\operatorname{ev}a^{-1}\AP{f(a)}}\leq\frac{1}{2}. We conclude \Ia{(a,f)\sim\xi\times\zeta}{f;a,f(a)}\geq\AP{\ln{2}}\Pa{(a,f)\sim\xi\times\zeta}{f(a)\ne f^*(a)}\qquad\blacksquareProposition A.3 Consider a countable set \A, a set \O, a non-empty countable set \Hy\subseteq \F{\A}{\O}, some \zeta\in\Delta\Hy and some \R:\O\rightarrow[0,1]. Let \Pi:\Hy\rightarrow\A be s.t. \Pi(e)\in\Argmax{a\in A}{\R\AP{e(a)}} Then, \Ea{e\sim\zeta}{\max{a\in A}{\R\AP{e(a)}}-\Ea{a\sim\Pi*\zeta}{\R\AP{e(a)}}}\leq4\sqrt{\frac{\PD{\Hy}}{\ln{2}}\cdot\Ia{(a,e)\sim\Pi*\zeta\times\zeta}{e;a,e(a)}}Proof of Proposition A.3 Denote \xi:=\Pi\zeta, D:=\PD{\Hy} and \Gamma:=\Ia{(a,e)\sim\xi\times\zeta}{e;a,e(a)}. By Proposition A.2, there is e^:\A\rightarrow\O s.t. \Pa{(a,e)\sim\xi\times\zeta}{e(a)\ne e^(a)}\leq\frac{\Gamma}{\ln{2}}By Proposition A.1 (setting q:=\sqrt{\frac{\Gamma D}{\ln{2}}}), there are \A^\circ\subseteq \A and \Hy^\circ\subseteq\Hy s.t. \xi\AP{\A^\circ}\geq1-\sqrt{\frac{\Gamma D}{\ln{2}}}, \zeta\AP{\Hy^\circ}\geq1-\sqrt{\frac{\Gamma D}{\ln{2}}} and for any e_1,e_2\in\Hy^\circ and a\in\A, e_1(a)=e_2(a). Define \Hy^:=\Hy^\circ\cap\Pi^{-1}\AP{\A^\circ}. We have \zeta\AP{\Hy^*}\geq1-2\sqrt{\frac{\Gamma D}{\ln{2}}}. Denote L:=\Ea{e\sim\zeta}{\max{a\in A}{\R\AP{e(a)}}-\Ea{a\sim\xi}{\R\AP{e(a)}}}. For brevity, we will also use the notation \EA{e}{X}:=\Ea{e\sim\zeta}{X;e\in\Hy^}. We get, using the bound on \zeta\AP{\Hy^} L\leq\EA{e}{\max_{a\in A}{\R\AP{e(a)}}-\Ea{a\sim\xi}{\R\AP{e(a)}}}+2\sqrt{\frac{\Gamma D}{\ln{2}}}L\leq\EA{e}{\EA{e'}{\R\AP{e\AP{\Pi(e)}}-\R\AP{e\AP{\Pi\AP{e'}}}}}+4\sqrt{\frac{\Gamma D}{\ln{2}}}Using the properties of \Hy^\circ and \A^\circ we get L\leq\EA{e}{\EA{e'}{\R\AP{e'\AP{\Pi(e)}}-\R\AP{e'\AP{\Pi\AP{e'}}}}}+4\sqrt{\frac{\Gamma D}{\ln{2}}}The expression inside the expected values in the first term on the right hand side is negative, by the property of \Pi. We conclude L\leq4\sqrt{\frac{\Gamma D}{\ln{2}}}\qquad\blacksquareProof of Lemma 1 For each s\in\St we choose some \Pi^s:\Hy\rightarrow\A s.t. \Pi^s(e)\in\Argmax{a\in A}{\R\AP{e\AP{s,a}}}We now take \pi^\dagger to be Thompson sampling. More formally, we construct a probability space (\Omega,P) with random variables K:\Omega\rightarrow\Hy (the true environment) and \AC{J_n:\Omega\rightarrow\F{\St^n}{\Hy}}{n\in\Nats} (the hypothesis sampled on round n). We define \AC{A_n:\Omega\rightarrow\F{\St^{n+1}}{\A}}{n\in\Nats} (the action taken on round n) and \AC{\Theta_n:{\Omega\rightarrow\F{\St^{n+1}}{\O}}}{n\in\Nats} (the observation made on round n) by A_n(s):=\Pi^{s_n}\AP{J_n\AP{s{:n}}}\\Theta_n(s):=K\AP{s_n,A_n(s)}We also define \AC{\Hy_n:\Omega\rightarrow\F{\St^n}{\Hy}}{n\in\Nats} by \Hy_n\AP{s}:=\ACM{e\in\Hy}{\forall m\in[n]:e\AP{s_m,A_m\AP{s{:m}}}=\Theta_n\AP{s_{:m}}}\Hy_n is thus the set of hypotheses consistent with the previous outcomes K\AP{s_m,\Pi^{s_m}\AP{J_m}}. The random variables K,J have to satisfy K_P=\zeta (the distribution of the true environment is the prior) and \CP{}{J_n(s)=e}{K,\AC{J_m\AP{s_{:m}}}{m\in[n]}}=\CP{\zeta}{e}{\Hy_n(s)}That is, the distribution of J_n conditional on K and J_m for m\in[n] is given by the prior \zeta updated by the previous outcomes.\pi^\dagger is then defined s.t. for any so\in\AP{\St\times\O}^n, t\in\St and a\in\A: \pi^\dagger\APM{a}{so,t}:=\CP{}{A_n\AP{st}=a}{\forall m\in[n]:o_m=\Theta_m\AP{s{:m}}}Now we need to prove the regret bound. We denote the Bayesian regret by \Rg: \Rg(c,\gamma):=\Ea{\Env\sim\zeta}{(1-\gamma)\sum_{n=0}^\infty{\gamma^n \max_{a\in\A}{\R\AP{\Env\AP{c_n,a}}}}-\Ea{\Env c\pi^\dagger}{\Ut_\gamma}}The construction of \pi^\dagger implies that \Rg(c,\gamma)=(1-\gamma)\sum_{n=0}^\infty\gamma^n\Ea{}{\max_{a\in\A}\R\AP{K\AP{c_n,a}}-\R\AP{K\AP{c_n,A_n\AP{c_{:n+1}}}}}Define \AC{Z_n:\Omega\rightarrow\F{\St^n}{\Delta\Hy}}{n\in\Nats} (the belief state of the agent on round n) and \AC{\Xi_n:\Omega\rightarrow\F{\St^{n+1}}{\Delta\A}}{n\in\Nats} (the distribution over actions on round n) by Z_n(s):=\zeta\mid\Hy_n(s)\\Xi_n(s):=\Pi^{s_n}*Z_n(s)Using expectation conditional on Z_n we can rewrite the equation for \Rg(\gamma) as \Rg(c,\gamma)=(1-\gamma)\sum{n=0}^\infty\gamma^n\Ea{}{\Ea{e\sim Z_n\AP{c_{:n}}}{\max_{a\in\A}\R\AP{e\AP{c_n,a}}-\Ea{a\sim\Xi_n\AP{c_{:n+1}}}{\R\AP{e\AP{c_n,a}}}}}It follows that \Rg(c,\gamma)\leq\sqrt{(1-\gamma)\sum_{n=0}^\infty\gamma^n\Ea{}{\Ea{e\sim Z_n\AP{c_{:n}}}{\max_{a\in\A}\R\AP{e\AP{c_n,a}}-\Ea{a\sim\Xi_n\AP{c_{:n+1}}}{\R\AP{e\AP{c_n,a}}}}^2}}We now apply Proposition A.3 and get \Rg(c,\gamma)\leq\sqrt{\frac{16D}{\ln{2}}\cdot(1-\gamma)\sum_{n=0}^\infty\gamma^n\Ea{}{\Ia{(a,e)\sim\Xi_n\AP{c_{:n+1}}\times Z_n\AP{c_{:n}}}{e;a,e(a)}}}\ \Rg(c,\gamma)\leq\sqrt{\frac{16D}{\ln{2}}\cdot(1-\gamma)\sum_{n=0}^\infty\gamma^n\Ea{}{\En\AP{Z_{n}\AP{c_{:n}}}-\En\AP{Z_{n+1}\AP{c_{:n+1}}}}}\ \Rg(c,\gamma)\leq\sqrt{\frac{16D\En(\zeta)}{\ln{2}}\cdot(1-\gamma)}\qquad\blacksquareGiven any x\in\XS, we will use the notation x=\AP{\Sta{x}\in\St,\Rew{x}\in[0,1]}. Proposition A.4 Fix T\in\Nats^+ and consider a non-empty set of DDP hypotheses \Hy\subseteq\F{\St\times\A}{\XS}. For each e:\St \times\A\rightarrow\XS we define e^\sharp:\St\times\A^T\rightarrow\XS^T recursively by \Env^\sharp(s,\pi)0:=\Env\AP{s,\pi_0}\ \Env^\sharp(s,\pi){n+1}=\Env\AP{\Sta{\Env^\sharp(s,\pi)n},\pi{n+1}}(That is, \Env^\sharp(s,\pi) is the history resulting from DDP \Env interacting with action sequence \pi starting from initial state s.) Define* \Hy^\sharp:=\ACM{\Env^\sharp}{\Env\in\Hy}Then, for any s\in\St, \PD{\Hy^\sharp_{s}}\leq\PD{\Hy}. Here, the subscript s has the same meaning as in Lemma 1. Proof of Proposition A.4 Fix s\in\St. Let \AC{\AP{\pi_k\in\A^\sharp,h_k\in\XS^T}}{k\in[n]} be an \Hy^\sharp_s-independent sequence. We will now construct a sequence \AC{\AP{s^_k\in\St,a^k\in\A,x^_k\in\XS,\Env_k\in\Hy,\tilde{\Env}k\in\Hy}}{k\in[n]}s.t \AP{\Env,\tilde{\Env}} shatters \AP{s^,a^,x^}. The latter implies that \AP{s^,a^,x^*} is an \Hy-independent sequence, establishing the claim. For brevity, we will use the notation f\pi:=f^\sharp(s,\pi) (here, f:\St\times\A\rightarrow\XS and \pi:\St\times\Nats\rightarrow\A). For each k\in[n] define m_k\in[T] by m_k:=\min\ACM{l}{\exists f,\tilde{f}\in\Hy: \AP{f\pi_k}{:l+1}\ne(\tilde{f}\pi_k){:l+1}\land\forall j\in[k]:f\pi_j=\tilde{f}\pi_j=h_j}Note that the set above is indeed non-empty because (\pi,h) is \Hy^\sharp_s-independent. Given g\in\XS^T, we will use the notational convention \Sta{g_{-1}}:=s. Choose \Env_k,\tilde{\Env}k s.t.\AP{\Env_k\pi_k}{:m_k+1}\ne\AP{\tilde{\Env}k\pi_k}{:m_k+1} and \forall j\in[k]:\Env_k\pi_j=\tilde{\Env}k\pi_j=h_j. Obviously \AP{\Env_k\pi_k}{:m_k}=\AP{\tilde{\Env}k\pi_k}{:m_k}. We set s^k:=\Sta{\AP{\Env_k\pi_k}{m_k-1}} and a^k:=\AP{\pi_k}{m_k}. Clearly \Env_k\AP{s^_k,a^k}\ne\tilde{\Env}k\AP{s^_k,a^k}. If k<n-1 then there is f\in\Hy s.t. \forall j\in[k+1]:f\pi_j=h_j (by independence). Therefore, \Sta{\AP{h_k}{m_k-1}}=s^*k: otherwise \Sta{\AP{f\pi_k}{m_k-1}}\ne\Sta{\AP{\Env_k\pi_k}{m_k-1}} and we get a contradiction with the minimality of m_k. We now set x^k:=\AP{h_k}{m_k}. If k=n, we choose x^k\in\XS arbitrarily. For any k\in[n] and j\in[k], \Env_k\pi_j=\tilde{\Env}k\pi_j=h_j and therefore \Env_k\AP{s^_k,a^k}=\tilde{\Env}k\AP{s^_k,a^k}=x^*k. Proof of Theorem 1 Let \A^\sharp:=\A^T, \O^\sharp:=\XS^T and \Hy^\sharp be defined as in Proposition A.4. By Proposition A.4, we have \max{s\in\St}{\PD{\Hy^\sharp_s}}\leq D. Define \zeta^\sharp\in\Delta\Hy^\sharp as the pushforward of \zeta by the \sharp operator. Define \R^\sharp\gamma:\O^\sharp\rightarrow[0,1] by \R^\sharp\gamma(h):=\frac{\gamma(1-\gamma)}{1-\gamma^{T+1}}\sum{n=0}^{T-1}\gamma^n{\Rew{h_n}}Denote also \gamma^\sharp:=\gamma^{T+1}. It is easy to see that given any \eta\in\O^{\sharp\omega} \Ut\gamma\AP{\prod{n=0}^\infty\AP{c_n,0}\eta_n}=\AP{1-\gamma^\sharp}\sum{n=0}^\infty{\gamma^{\sharp n}\R^\sharp_\gamma\AP{\eta_n}}The product is in the string concatenation sense. Applying Lemma 1 to all the "\sharp" objects (with \St and c unaffected), we get \pi^\dagger_{T,\gamma} s.t. \Ea{\Env\sim\zeta}{\max_{\pi:\XS^\times\XS\rightarrow\A}\Ea{\Env c[T]\pi}{\Ut_\gamma}-\Ea{\Env c[T]\pi^\dagger_{T,\gamma}}{\Ut_\gamma}}\leq\sqrt{\frac{16D\En\AP{\zeta^\sharp}}{\ln{2}}\cdot\AP{1-\gamma^\sharp}}Here, we used the fact that the optimal policy for a DDP with fixed initial state (and any time discount function) can be taken to be a fixed sequence of actions. Observing that \En\AP{\zeta^\sharp}\leq\En(\zeta) and 1-\gamma^{T+1}\leq(T+1)(1-\gamma), we get the desired result. \blacksquare The following is a minor variant of what was called "Proposition B.2" here, and we state it here without proof (since the proof is essentially the same). Proposition A.5 Consider a DDP e:\St\times\A\rightarrow\XS, policies \pi^0,\pi^:\XS^\times\XS\K\A and T\in\Nats^+. Suppose that for any h\in\XS^ and x\in\XS \Supp{\pi^0(h,s)}\subseteq\A^0_e(s)Suppose further that \pi^ is optimal for e with time discount \gamma and any initial state. Denote* \tau_e(\gamma):=\max_{s\in\St}\sup_{x\in(\gamma,1)}\Abs{\frac{\D{\V_e(s,x)}}{\D{x}}}Given any s\in\St and policy \pi, denote es\pi\in\Delta\XS^\omega the distribution over histories of resulting from \pi interacting with e starting from initial state s. Finally, define \Ut_{T,\gamma}:\XS^\omega\rightarrow[0,1] by \Ut_{T,\gamma}(h):=\frac{1-\gamma}{1-\gamma^T}\sum_{n=0}^{T-1}\gamma^n\Rew{h_n}Then \Ea{e\pi^}{\Ut_\gamma}\leq\AP{1-\gamma^T}\sum_{n=0}^\infty\gamma^{nT}\Ea{h\sim e\pi^0}{\Ea{e\Sta{h_{nT}\pi^}}{\Ut_{T,\gamma}}}+\frac{2\tau_e(\gamma)\gamma^T(1-\gamma)}{1-\gamma^T}Proof of Theorem 2 It is not hard to see that Theorem 1 can be extended to the setting where the initial state sequence c\in\St^\omega is chosen in a way that depends on the history. This is because if c is chosen adversarially (i.e. in order to maximize regret), the history doesn’t matter (in other words, we get a repeated zero-sum game in which, on each round, the initial state is chosen by the adversary and the policy is chosen by the agent after seeing the initial state; this game clearly has a pure Nash equilibrium). In particular, we can let c be just the states naturally resulting from the interaction of the DDP with the policy. Let \pi^{e\gamma} be the optimal policy for DDP e and time discount \gamma. We get \Ea{\Env\sim\zeta}{\AP{1-\gamma^T}\sum{n=0}^\infty\gamma^{nT}\Ea{h\sim e\pi^\dagger_{T,\gamma}}{\Ea{e\Sta{h_{nT}\pi_{e\gamma}^}}{\Ut_{T,\gamma}}}-\Ea{\Env\pi^\dagger_{T-1,\gamma}}{\Ut_\gamma}}\leq\\sqrt{\frac{16D\En(\zeta)}{\ln{2}}\cdot T(1-\gamma)}+\frac{1-\gamma}{1-\gamma^{T}}Here, the second term on the right hand side comes from the rewards at time moments divisible by T, which were set to 0 in Theorem 1. Denote \Rg(T,\gamma):=\Ea{\Env\sim\zeta}{\max_{\pi:\XS^*\times\XS\rightarrow\A}\Ea{\Env\pi}{\Ut_\gamma}-\Ea{\Env\pi^\dagger_{T-1,\gamma}}{\Ut_\gamma}}Applying Proposition A.5, we get \Rg(T,\gamma)\leq\sqrt{\frac{16D\En(\zeta)}{\ln{2}}\cdot T(1-\gamma)}+\frac{1-\gamma}{1-\gamma^{T}}+\frac{2\tau(\gamma)\gamma^T(1-\gamma)}{1-\gamma^T}Assume that \gamma^{T-1}\geq\frac{1}{2}. Then \frac{1-\gamma}{1-\gamma^T}=\frac{1}{\sum_{n=0}^{T-1}\gamma^n}\leq\frac{2}{T}\\Rg(T,\gamma)\leq\sqrt{\frac{16D\En(\zeta)}{\ln{2}}\cdot T(1-\gamma)}+\frac{4\tau(\gamma)+2}{T}Now set T:=\Ceil{\sqrt[3]{\frac{\AP{\tau(\gamma)+1}^2}{D\En(\zeta)\cdot(1-\gamma)}}}It is easy to see that the assumption \gamma^{T-1}\geq\frac{1}{2} is now justified for 1-\gamma\ll1. We get \Rg(T,\gamma)=O\AP{\sqrt[3]{D\En(\zeta)\cdot\AP{\tau(\gamma)+1}(1-\gamma)}}

id

XNkqLW8mFE3xCyfnb
authors

Vanessa Kosoy
score

3
omega_karma

2
votes

2
date_published

2018-09-05T21:25

https://www.lesswrong.com/posts/zTf946PQwN2AN3X3Y/entropic-regret-i-deterministic-mdps?commentId=XNkqLW8mFE3xCyfnb

I was fixing bugs in the LaTeX and accidentally pressed "save draft" instead of "post", after which I had to "post" again to make it reappear, and thereby bumped up the date. My apologies for the disturbance in the aether.

Comment

id

oLRd5o9n8h3bdSyEK
authors

habryka
score

2
omega_karma

1
votes

1
date_published

2018-09-06T01:11

https://www.lesswrong.com/posts/zTf946PQwN2AN3X3Y/entropic-regret-i-deterministic-mdps?commentId=oLRd5o9n8h3bdSyEK

I can quickly fix it. Do you remember what the original date was?

id

sbACg9fuLR9dkqCcm
authors

Vanessa Kosoy
score

1
omega_karma

1
votes

1
date_published

2018-09-06T14:33

https://www.lesswrong.com/posts/zTf946PQwN2AN3X3Y/entropic-regret-i-deterministic-mdps?commentId=sbACg9fuLR9dkqCcm

If I’m not mistaken, it was August 16.

Entropic Regret I: Deterministic MDPs

Background

Results

Proofs

Comment

Comment

Comment