% operators that are separated from the operand by a space
% autosize deliminaters
% operators that require brackets
% operators that require parentheses
% Paper specific
These are Appendices B and C for the essay Catastrophe Mitigation Using DRL. They appear in a separate post because of a length limit in the website.
##Appendix B
Given p=(a,o)\in\A\times\Ob, we denote p^\A:=a, p^\Ob:=o.
#Proposition B.1
Consider a universe \upsilon=(\mu,r) which an \Ob-realization of an MDP M with state function S, a stationary policy \pi^*: \St_M \M \A, an arbitrary \In-policy \pi^0 and some \gamma \in (0,1). Then,
\EU_{\upsilon}^{\pi^* S}(\gamma) - \EU_{\upsilon}^{\pi^0}(\gamma)=\sum_{n=0}^\infty {\gamma^n \Ea{x\sim\mu\bowtie\pi^0}{\V_{M\pi^}\AP{S\AP{x_{:n}},\gamma}-\Q_{M\pi^}\AP{S\AP{x_{:n}},x_n^\A,\gamma}}}
#Proof of Proposition B.1
For the sake of encumbering the notation less, we will omit the parameter \gamma in functions that depend on it. We will use S implicitly, i.e. given F a function on \St_M and h \in \HD{\mu}, F(h):=F\AP{S(h)}. Finally, we will omit M\pi^, using the shorthand notations \V:=\V_{M\pi^}, \Q:=\Q_{M\pi^*}.
For any x \in \HD^\omega \mu, it is easy to see that
\EU_{\upsilon}^{\pi^* S}=\V\AP{\Estr}=\sum_{n=0}^\infty \gamma^n \AP{\V\AP{x_{:n}}-\gamma\V\AP{x_{:n+1}}}
\Ut^{r}(x)=(1-\gamma)\sum_{n=0}^\infty \gamma^n r\AP{x_{:n}}
\EU_{\upsilon}^{\pi^* S} - \Ut^{r}(x)=\sum_{n=0}^\infty \gamma^n \AP{\V\AP{x_{:n}}-(1-\gamma)r\AP{x_{:n}}-\gamma\V\AP{x_{:n+1}}}
\EU_{\upsilon}^{\pi^* S} - \Ut^{r}(x)=\sum_{n=0}^\infty \gamma^n \AP{\V\AP{x_{:n}}-\Q\AP{x_{:n},x_n^\A}+\Q\AP{x_{:n},x_n^\A}-(1-\gamma)r\AP{x_{:n}}-\gamma\V\AP{x_{:n+1}}}
Taking expected value over x, we get
\EU_{\upsilon}^{\pi^* S} - \EU_{\upsilon}^{\pi^0}=\sum_{n=0}^\infty \gamma^n \AP{\Ea{\mu\bowtie\pi^0}{\V\AP{x_{:n}}-\Q\AP{x_{:n},x_n^\A}}+\Ea{\mu\bowtie\pi^0}{\Q\AP{x_{:n},x_n^\A}-(1-\gamma)r\AP{x_{:n}}-\gamma\V\AP{x_{:n+1}}}}
It is easy to see that the second term vanishes, yielding the desired result.
#Proposition B.2
Consider some \tau\in(0,\infty), T\in\Nats^+, a universe \upsilon=(\mu,r) that is an \Ob-realization of M with state function S, a stationary policy \pi^: \St_M \M \A and an arbitrary \In-policy \pi^0: \FH \M \A. For any n \in \Nats, let \pi^_n be an \In-policy s.t. for any h \in \HD{\mu}
\pi^_n(h):=\begin{cases} \pi^0(h) \text{ if } \Abs{h} < nT \ \pi^\AP{S(h)} \text{ otherwise} \end{cases}
Assume that
i. For any h \in \HD{\mu} \Supp{\pi^0(h)} \subseteq \A_{M\pi^*}^0\AP{S(h)}
ii. For any s \in \St_M and \gamma\in(0,1) \Abs{\frac{\D\V_{M\pi^*}\AP{s,\gamma}}{\D\gamma}} \leq \tau
Then, for any \gamma\in(0,1),
\EU^{\pi^S}\upsilon(\gamma)-\EU^{\pi^0}\upsilon(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left(\E{x\sim\mu\bowtie\pi^n}\left[r\left(x{:nT+m}\right)\right]-\E{x\sim\mu\bowtie\pi^0}\left[r\left(x_{:nT+m}\right)\right]\right) + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}
#Proof of Proposition B.2
For the sake of encumbering the notation less, we will use S implicitly, i.e. given F a function on \St_M and h \in \HD{\mu}, F(h):=F\AP{S(h)}. Also, we will omit M\pi^, using the shorthand notations \V:=\V_{M\pi^}, \Q:=\Q_{M\pi^*}.
By Proposition B.1, for any l \in \Nats
\EU_{\upsilon}^{\pi^}(\gamma) - \EU_{\upsilon}^{\pi_l^}(\gamma) = \sum_{n=0}^\infty{\gamma^n \Ea{x\sim\mu\bowtie\pi_l^*}{\V\Big(x_{:n},\gamma\Big)-\Q\AP{x_{:n},x_n^\A,\gamma}}}
\pi^_l coincides with \pi^ after lT, therefore the corresponding expected values vanish.
\EU_{\upsilon}^{\pi^}(\gamma) - \EU_{\upsilon}^{\pi_l^}(\gamma) = \sum_{n=0}^{lT-1}{\gamma^n \Ea{x\sim\mu\bowtie\pi^0}{\V\Big(x_{:n},\gamma\Big)-\Q\AP{x_{:n},x_n^\A,\gamma}}}
Subtracting the equalities for l+1 and l, we get
\EU_{\upsilon}^{\pi_{l}^}(\gamma) - \EU_{\upsilon}^{\pi_{l+1}^}(\gamma) = \sum_{n=lT}^{(l+1)T-1}{\gamma^n \Ea{x\sim\mu\bowtie\pi^0}{\V\Big(x_{:n},\gamma\Big)-\Q\AP{x_{:n},x_n^\A,\gamma}}}
(1-\gamma)\sum_{n=0}^\infty {\gamma^n\AP{\Ea{x \sim \mu\bowtie\pi^l}{r\AP{x{:n}}}-\Ea{x \sim \mu\bowtie\pi^{l+1}}{r\AP{x{:n}}}}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n \Ea{x\sim\mu\bowtie\pi^0}{\V\Big(x_{:n},\gamma\Big)-\Q\AP{x_{:n},x_n^\A,\gamma}}}
\pi^_l and \pi^_{l+1} coincide until lT, therefore
(1-\gamma)\sum_{n=lT}^\infty {\gamma^n\AP{\Ea{x \sim \mu\bowtie\pi^l}{r\AP{x{:n}}}-\Ea{x \sim \mu\bowtie\pi^{l+1}}{r\AP{x{:n}}}}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n \Ea{x\sim\mu\bowtie\pi^0}{\V\Big(x_{:n},\gamma\Big)-\Q\AP{x_{:n},x_n^\A,\gamma}}}
Denote \rho^_l:=\mu\bowtie\pi^l, \rho^0:=\mu\bowtie\pi^0. We also use the shorthand notations r_n:=r\AP{x{:n}}, \V_n(\gamma):=\V\AP{x_{:n},\gamma}, \Q_n(\gamma):=\Q\AP{x_{:n},x_n^\A,\gamma}. Both \pi^_l and \pi^_{l+1} coincide with \pi^* after (l+1)T, therefore
(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n\AP{\Ea{\rho^_l}{r_n}-\Ea{\rho^0}{r_n}}} + \gamma^{(l+1)T}\AP{\Ea{\rho^l}{\V{(l+1)T}(\gamma)}-\Ea{\rho^0}{\V_{(l+1)T}(\gamma)}}= \sum_{n=lT}^{(l+1)T-1}{\gamma^n \Ea{\rho^0}{\V_n(\gamma)-\Q_n(\gamma)}}
Denote \V'(s,\gamma):=\frac{\D\V(s,\gamma)}{\D\gamma}. By the mean value theorem, for each s\in\St_M there is \gamma^*\in(\gamma,1) s.t.
\V(s,\gamma) = \V^0(s) - \V'(s,\gamma^*)\cdot(1-\gamma)
\V^0(s) - \tau(1-\gamma) \leq \V(s,\gamma) \leq \V^0(s) + \tau(1-\gamma)
It follows that
(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n\Ea{\rho^_l - \rho^0}{r_n}} + \gamma^{(l+1)T}\AP{\Ea{\rho^l-\rho^0}{\V^0{(l+1)T}}+ 2\tau(1-\gamma)}\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n \Ea{\rho^0}{\V_n(\gamma)-\Q_n(\gamma)}}
Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.
It is easy to see that assumption i implies that \V_n^0 is a submartingale for \rho^0 (whereas it is a martingale for \mu\bowtie\pi^*) and therefore
\Ea{\rho^*l-\rho^0}{\V^0{(l+1)T}} \leq 0
We get
(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n\Ea{\rho^*l - \rho^0}{r_n}} + 2\tau\gamma^{(l+1)T}(1-\gamma)\geq \sum{n=lT}^{(l+1)T-1}{\gamma^n \Ea{\rho^0}{\V_n(\gamma)-\Q_n(\gamma)}}
Summing over l, we get
(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n\Ea{\rho^*l - \rho^0}{r_n}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}\geq \sum{n=0}^{\infty}{\gamma^n \Ea{\rho^0}{\V_n(\gamma)-\Q_n(\gamma)}}
Applying Proposition B.1 to the right hand side
(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n\Ea{\rho^l - \rho^0}{r_n}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T} \geq \EU{\upsilon}^{\pi^}(\gamma) - \EU_{\upsilon}^{\pi^0}(\gamma)
#Proof of Lemma A.1
Fix \gamma \in (0,1), \eta\in\left(0,N^{-1}\right) and T \in \Nats^+. Denote \nu^k:=\bar{\mu}^k\left[\sigma^k S^k\right]. To avoid cumbersome notation, whenever M^k should appear a subscript, we will replace it by k. Let (\Omega,P \in \Delta\Omega) be a probability space\Comment{ and {\F_n \subseteq 2^\Omega}{n \in \Nats \sqcup {-1}} a filtration of \Omega}. Let K: \Omega \rightarrow [N] be \Comment{measurable w.r.t. \F{-1}}a random variable and the following be stochastic processes\Comment{ adapted to \F}
\Z_n,\tilde{\Z}_n: \Omega \rightarrow \Delta[N]
\J_n: \Omega \rightarrow [N]
\Psi_n: \Omega \rightarrow \A
A_n: \Omega \rightarrow \Ada
\Theta_n: \Omega \rightarrow \Ado
We also define A\Theta_{:n}: \Omega \rightarrow \Adfh by
A\Theta_{:n}:= A_0\Theta_0A_1\Theta_1 \ldots A_{n-1}\Theta_{n-1}
(The following conditions on A and \Theta imply that the range of the above is indeed in \Adfh.) Let \Dl and \Dl^{!k} be as in Proposition C.1 (we assume w.l.o.g. that \epsilon < \frac{1}{\Abs{\A}}). We construct \Omega\Comment{, \F}, K, \Z, \tilde{\Z}, \J, \Psi, A and \Theta s.t K is uniformly distributed and for any k \in [N], l \in \Nats, m \in [T] and o \in \Ob, denoting n = lT+m
\tilde{\Z}_0(k)\equiv\frac{1}{N}
\Z_{n}(k) = \frac{\tilde{\Z}{n}(k)[[\tilde{\Z}{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{\Z}{n}(j)[[\tilde{\Z}{n}(j) \geq \eta]]}
\Pr\left[\J_{l} = k \mid Z_{lT}\right] = \Z_{lT}\left(k\right)
\Psi_{n} = \pi^{J_l}\AP{A\Theta_{:n}}
\Pr\left[\Theta_{n} = o \mid A\Theta_{:n}\right] = \nu^K\left(o \mid A\Theta_{:n}\right)
A_n = \Dl\left(A\Theta_{:n}, \Psi_n\right)
\tilde{\Z}{n+1}(k)\sum{j = 0}^{N-1} \Z_n(j) [[A_n = \Dl^{!j}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^j(\Theta_n \mid A\Theta_{:n}A_n)=\Z_{n}(k) [[A_n = \Dl^{!k}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^k\left(\Theta_{n} \mid A\Theta_{:n}A_{n}\right)
Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on "impossible" information.
We now construct the \Adi-policy \pi^* s.t. for any n \in \Nats, h \in \Adfh s.t.\Pr\left[A\Theta_{:n}=h\right] > 0 and a \in \Ada
\pi^*(a \mid h):=\Pr\left[A_n = a \mid A\Theta_{:n} = h\right]
That is, we perform Thompson sampling at time intervals of size T, moderated by the delegation routine \Dl, and discard from our belief state hypotheses whose probability is below \eta and hypotheses sampling which resulted in recommending "unsafe" actions i.e. actions that \Dl refused to perform.
In order to prove \pi^* has the desired property, we will define the stochastic processes \Z^!, \tilde{\Z}^!, \J^!, \Psi^!, A^! and \Theta^!, each process of the same type as its shriekless counterpart (thus \Omega is constructed to accommodate them). These processes are required to satisfy the following:
\tilde{\Z}^!_0(k)\equiv\frac{1}{N}
\Z_{n}^!(k) = \frac{\tilde{\Z}^!{n}(k)[[\tilde{\Z}^!{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{\Z}^!{n}(j)[[\tilde{\Z}^!{n}(j) \geq \eta]]}[[\tilde{\Z}^!{n}(K) \geq \eta]] + [[K = k]]\cdot [[\tilde{\Z}^!{n}(K) < \eta]]
\Pr\left[\J^!{l} = k \mid Z^!{lT}\right] = \Z^!_{lT}\left(k\right)
\Psi^!_{n} = \pi^{\J^!l}\AP{A\Theta^!{:n}}
\Pr\left[\Theta^!{n} = o \mid A\Theta^!{:n}\right] = \nu^K\left(o \mid A\Theta^!_{:n}\right)
A^!n = \Dl^{!K}\left(A\Theta^!{:n}, \Psi^!_n\right)
\tilde{\Z}^!{n+1}(k)=\frac{\Z^!{n}(k) [[A^!n = \Dl^{!k}\left(A\Theta^!{:n}, \Psi^!n\right)]] \nu^k\left(\Theta^!{n} \mid A\Theta^!{:n}A^!{n}\right)}{\sum_{j = 0}^{N-1} \Z^!_n(j) [[A^!n = \Dl^{!j}\left(A\Theta^!{:n}, \Psi^!_n\right)]] \nu^j(\Theta^!n \mid A\Theta^!{:n}A^!_n)}
For any k \in [N], we construct the \Adi-policy \pi^{?k} s.t. for any n \in \Nats, h \in \Adfh s.t.\Pr\left[A\Theta^!_{:n}=h,\ K = k\right] > 0 and a \in \Ada
\pi^{?k}(a \mid h):=\Pr\left[A^!n = a \mid A\Theta^!{:n} = h,\ K = k\right]
Given any \Adi-policy \pi and \In-policy \sigma we define \alpha_{\sigma\pi}: \FH \M \Adfh by
\alpha_{\sigma\pi} (g \mid h) := [[h = \underline{g}]]C_h\prod_{n = 0}^{\Abs{h}-1} \sum_{a \in \A}\left([[g_n \in \bot a\Ob]] \pi\left(\bot \mid g_{:n}\right)\sigma\left(a \mid h_{:n}\right)+[[g_n \in a\bot\Ob]]\pi\left(a \mid g_{:n}\right)\right)
Here, C_h \in \Reals is a constant defined s.t. the probabilities sum to 1. We define the \In-policy \left[\sigma\right]\underline{\pi} by
\left[\sigma\right]\underline{\pi}(a \mid h):=\Pr_{g \sim \alpha_{\sigma\pi}(h)}\left[\pi\left(g\right)=a \lor \left(\pi\left(g\right)=\bot \land \sigma(h)=a\right)\right]
Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any h \in \HD{\mu^k}
\Supp{\left[\sigma^k\right]\underline{\pi}^{?k}(h)} \subseteq \A^0_{M^k\pi^k}\left(S^k\left(h\right)\right)
This means we can apply Proposition B.2 and get
\EU^{\pi^k S^k}{\upsilon^k}(\gamma)-\EU^{\pi^{?k}}{\bar{\upsilon}^k\AB{\sigma^k S^k}}(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left(\E{x\sim\mu^k\bowtie\pi^{*k}n}\left[r\left(x{:nT+m}\right)\right]-\E{x\sim\nu^k\bowtie\pi^{?k}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}
Here, the \In-policy \pi^{k}_n is defined as \pi^_n in Proposition B.2. We also define the \Adi-policies \pi^{!k}_n and \pi^{!!k}_n by
\pi^{!k}n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } \Abs{h} < nT \ \Pr\left[A^!{\Abs{h}} = a \mid A\Theta^!_{:{\Abs{h}}} = h,\ K = k,\ \J^!_n = k\right] \text{ otherwise} \end{cases}
\pi^{!!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } \Abs{h} < nT \ \pi^{!k}_n(a \mid h) + \pi^{!k}_n(\bot \mid h) \cdot \pi^{*k}_n\left(a \mid \underline{h}\right) \text{ if } \Abs{h} \geq nT \text{ and } a \ne \bot \ 0 \text{ if } \Abs{h} \geq nT \text{ and } a = \bot \end{cases}
Denote
\rho^{*k}_n:=\mu^k\bowtie\pi^{*k}_n
\rho^{!!k}_n:=\nu^k\bowtie\pi^{!!k}_n
\rho^{!k}_n:=\nu^k\bowtie\pi^{!k}_n
\rho^{?k}:=\nu^k\bowtie\pi^{?k}
R^{?k}:=\EU^{\pi^k S^k}{\upsilon^k}(\gamma)-\EU^{\pi^{?k}}{\bar{\upsilon}^k\AB{\sigma^k S^k}}(\gamma)
For each n \in \Nats, denote
\EU_n^{*k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}\E{x\sim\rho^{*k}n}\left[r\left(x{:nT+m}\right)\right]
\EU_n^{!!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}\E{x\sim\rho^{!!k}n}\left[r\left(\underline{x}{:nT+m}\right)\right]
\EU_n^{!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}\E{x\sim\rho^{!k}n}\left[r\left(\underline{x}{:nT+m}\right)\right]
\EU_n^{?k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}\E{x\sim\rho^{?k}}\left[r\left(\underline{x}_{:nT+m}\right)\right]
We have
R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left(\EU^{*k}_n(\gamma)-\EU^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}
R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left(\EU^{*k}_n(\gamma)-\EU^{!!k}_n(\gamma)+\EU^{!!k}_n(\gamma)-\EU^{!k}_n(\gamma)+\EU^{!k}_n(\gamma)-\EU^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}
Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given h \in \HD{\nu^k} s.t. \Abs{h} \geq nT
\Supp{\pi^{!k}_n(h)} \subseteq \AC{\pi^k\AP{S^k(h)},\bot}
\pi^{!!k}_n\AP{\pi^k\AP{S^k(h)} \mid h} = 1
Therefore, \pi^{!!k}_n = \pi^{*k}_n, and we remain with
R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left(\EU^{!!k}_n(\gamma)-\EU^{!k}_n(\gamma)+\EU^{!k}_n(\gamma)-\EU^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}
We have
\Abs{\EU^{!!k}_n(\gamma)-\EU^{!k}n(\gamma)} \leq \Pr{x\sim\rho^{!k}n}\left[\exists m \in [T]: x{nT+m} \in \bot\Ado\right]
Since \Z_{nT}^{!}(K) \geq \eta, it follows that
\Abs{\EU^{!!k}n(\gamma)-\EU^{!k}n(\gamma)} \leq \frac{1}{\eta}\Pr{x\sim\rho^{?k}}\left[\exists m \in [T]: x{nT+m} \in \bot\Ado\right] \leq \frac{1}{\eta}\Ea{x\sim\rho^{?k}}{\Abs{\AC{m \in [T] \mid x_{nT+m} \in \bot\Ado}}}
\sum_{n=0}^\infty \Abs{\EU^{!!k}_n(\gamma)-\EU^{!k}_n(\gamma)} \leq \frac{1}{\eta}\Ea{x\sim\rho^{?k}}{\Abs{\AC{n \in \Nats \mid x_n \in \bot\Ado}}}
Using condition i of Proposition C.1, we conclude
R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left(\EU^{!k}_n(\gamma)-\EU^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
Define the random variables \Sqn{U^!_n : \Omega \rightarrow [0,1]} by
U^!n:=\frac{1-\gamma}{1-\gamma^T}\sum{m=0}^{T-1} \gamma^{m} r\left(A\Theta^!_{:nT+m}\right)
Averaging the previous inequality over k, we get
\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \E{}\left[\E{}\left[U^!_n \mid \J^!n = K,\ Z^!{nT}\right]-\E{}\left[U^!n \mid Z^!{nT}\right]\right] + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \E{}\left[\left(\E{}\left[U^!_n \mid \J^!n = K,\ Z^!{nT}\right]-\E{}\left[U^!n \mid Z^!{nT}\right]\right)^2\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
We apply Proposition C.2 to each term in the sum over n.
\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \E{}\left[\frac{1}{2\eta}\I{}\left[K;\J^!_n,U^!n \mid Z^!{nT}\right]\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\sum_{n=0}^\infty \gamma^{nT} \E{}\left[\En\Big(Z^!{nT}\Big)-\En\left(Z^!{(n+1)T}\right)\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\ln N} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = O\left(\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
Condition ii of Proposition C.1 implies that
\Dtv\left(\frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^*},\ \frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^{?k}}\right) \leq 2(N-1)\eta
Here, the factor of 2 comes from the difference between the equations for Z_n and Z^!_n (we can construct and intermediate policy between \pi^* and \pi^{?k} and use the triangle inequality for \Dtv). We conclude
\EU^{\pi^k S^k}{\upsilon^k}(\gamma)-\EU^{\pi^{*}}{\bar{\upsilon}^k\AB{\sigma^k S^k}}(\gamma) = O\left(\eta+\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)
Now we set
\eta:=\bar{\tau}^{1/4} (1-\gamma)^{1/4}
T:=\Ceil{\bar{\tau}^{3/4}(1-\gamma)^{-1/4}}
Without loss of generality, we can assume that \bar{\tau}(1-\gamma) \ll 1 (because of the form of the bound we are proving), which implies that T(1-\gamma) \ll 1 and 1-\gamma^T \approx T(1-\gamma) \approx \bar{\tau}^{3/4}(1-\gamma)^{3/4}. We get
\EU^{\pi^k S^k}{\upsilon^k}(\gamma)-\EU^{\pi^{*}}{\bar{\upsilon}^k\AB{\sigma^k S^k}}(\gamma) = O\left(\bar{\tau}^{1/4}(1-\gamma)^{1/4}\right)
##Appendix C
The following is a simple special case of what appeared as "Proposition A.2" in the previous essay, where we restrict \pi to be single-valued (the more general case isn’t needed).
#Proposition C.1
Fix an interface \In=(\A,\Ob), N \in \Nats, \epsilon \in (0,\frac{1}{\Abs{\A}}), \eta \in (0,\frac{1}{N}). Consider some {\sigma^k: \FH \M \A}{k \in [N]}. Then, there exist \Dl: \Adfh \times \A \rightarrow \Ada and {\Dl^{!k}: \Adfh \times \A \rightarrow \Ada}{k \in [N]} with the following properties. Given x \in \left(\A \times \Adao\right)^*, we denote \underline{x} its projection to \Adfh. Thus, \underline{\underline{x}}\in\FH. Given \mu an \In-environment, \pi: \HD{\mu} \M \A, \Dl': \Adfh \times \A \rightarrow \Ada and k \in [N], we can define \Xi\left[\mu,\sigma^k,\Dl',\pi\right]\in \Delta\left(\A \times \Adao\right)^\omega as follows
\Xi\left[\mu,\sigma^k,\Dl',\pi\right]\left(b,a,o \mid x\right):=\pi\left(b \mid \underline{\underline{x}}\right)\Dl'\left(a \mid \underline{x},b\right) \bar{\mu}[\sigma^k]\left(o \mid \underline{x}a\right)
We require that for every \pi, \mu and k as above, the following conditions hold
i. \frac{1}{N}\sum_{j=0}^{N-1}\E{x \sim\Xi\left[\mu,\sigma^j,\Dl^{!j},\pi\right]}\left[\Abs{{n \in \Nats \mid x_n \in \A \times \bot \times \bar{\Ob}}}\right] \leq \frac{\ln N}{\eta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}=O\left(\frac{\ln N}{\eta \epsilon}\right)
ii. \Dtv\left(\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,\Dl,\pi\right]},\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,\Dl^{!j},\pi\right]}\right) \leq (N-1)\eta
iii. For all x \in \HD{\bar{\mu}[\sigma^k]}, if \Dl^{!k}\left(x,\pi\left(\underline{x}\right)\right) \ne \bot then \sigma^k\left(\Dl^{!k}\left(x,\pi\left(\underline{x}\right)\right) \mid \underline{x}\right) > 0
iv. For all x \in \HD{\bar{\mu}[\sigma^k]}, if \Dl^{!k}\left(x,\pi\left(\underline{x}\right)\right) \not\in \AC{\pi\left(\underline{x}\right), \bot} then \sigma^k\left(\pi\left(\underline{x}\right) \mid \underline{x}\right) \leq \epsilon
The following appeared in the previous essay as "Proposition A.1".
#Proposition C.2
Consider a probability space (\Omega, P \in \Delta\Omega), N \in \Nats, R \subseteq [0,1] a finite set and random variables U: \Omega \rightarrow R, K: \Omega \rightarrow [N] and \J: \Omega \rightarrow [N]. Assume that K_*P = J_*P = \zeta \in \Delta[N] and \I{}[K;J] = 0. Then
\I{}\left[K;J,U\right] \geq 2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left(\E{}\left[U \mid J = K\right]-\E{}\left[U\right]\right)^2