Towards reflection with relative optimal predictor schemes

https://www.lesswrong.com/posts/5bd75cc58225bf067037502b/towards-reflection-with-relative-optimal-predictor-schemes

Contents

We construct a family or oracles relative to which all estimation problems which have approximate solutions in exponential time are generatable. Relatively to these oracles, we can use the optimal predictor scheme \Lambda for assigning expectation values to arbitrary deterministic computations within time that is superquasipolynomial in the logarithm of the computation time. In particular, this should allow reflection in the sense of Garrabrant once we succeed to derandomize \Lambda relative to the same oracle.

Results

New notation

Given O \subseteq \Words and appropriate sets X and Y, A: X \xrightarrow{O} Y denotes a pair consisting of O and an oracle machine that, supplied with oracle O, halts on every input in X and produces an output in Y. In particular, it defines a function from X to Y in the obvious way.

For every n \in \Nats, we fix a prefix free universal oracle machine UO_n with one program tape, n input tapes and one output tape (we are only going to use a finite number of ns). Given O \subseteq \Words, ev_{O,n}: \Nats \times \Words^{n+1} \xrightarrow{O} \Words is the following. When ev_{O,n}^k(Q,x_1 \ldots x_n) is computed, Q is interpreted as a program for UO_n and Q(x_1 \ldots x_n) is executed with oracle O for time k. The resulting output is produced.

The notation ev_O^k(Q,x_1 \ldots x_n) means ev_{O,n}^k(Q,x_1 \ldots x_n).

Given A: X \xrightarrow{O} Y and X' \subseteq X, T_A(X') will denote the maximal runtime of A on elements of X'. Q_A(X') will denote the set of queries made to the oracle when applying A to elements of X'.

All the theory of predictors with logarithmic advice can be relativized with respect to an arbitrary oracle. We denote the corresponding concepts using the oracle as a superscript. For example, the relativized version of a \Delta(poly,log)-optimal predictor schemes is \Delta(poly,log)^O-optimal predictor schemes etc.

Given \mu a word ensemble, \Supp{\mu} denotes \bigcup_{k \in \Nats} \Supp \mu^k. We previously defined a distributional estimation problem to be a pair (f,\mu) where \mu is a word ensemble and f is a function from \Words to [0,1]. From now on, we allow f to be defined only on \Supp \mu. This doesn’t affect any of the previous results.

Omitting parentheses after an error space will mean zero advice. For example a \Delta-sampler is a \Delta(0)-sampler.

We start by constructing an exponential time estimation problem which is complete in the class of exponential time estimation problem with samplable word ensembles with respect to \Delta-pseudo-invertible reductions (the concept of \Delta-pseudo-invertible reductions was previously introduced for unidistributional problems but it is defined in the same day for distributional problems, see Appendix A). Everything is relative to an arbitrary fixed oracle.

Construction 1

Fix O \subseteq \Words.

Define \mu_{Samp(O)} to be the following word ensemble. Sampling \mu_{Samp(O)}^k is equivalent to sampling S, F and y independently from U^k and producing (k,S,F,ev_O^k(S,k,y)).

Define f_{EXP(O)}: \Supp \mu_{Samp(O)} \rightarrow [0,1] by

f_{EXP(O)}(k,S,F,x):=\beta(ev_O^{2^k}(F,x))

Theorem 1

Fix error spaces \Delta^1, \Delta^2 of ranks 1 and 2 respectively. Assume that for any \delta \in \Delta^1 the function \delta'(k,j):=\delta(k) is in \Delta^2. Consider (f,\mu) a distributional estimation problem. Suppose q: \Nats \rightarrow \Nats is a polynomial, F: \Words \xrightarrow{O} [0,1] and \hat{S}=(S,r_S) is a (\Delta^1)^O-sampler for \mu s.t.

(i) E_{\mu^k}[(F(x)-f(x))^2] \in \Delta^1

(ii) \forall k \in \Nats: T_F(\Supp \mu^k) \leq 2^{q(k)}

(iii) \forall k \in \Nats: |F| \leq q(k)

(iv) \forall k \in \Nats: T_S(\WordsLen{r_S(k)}) \leq q(k)

(v) \forall k \in \Nats: |S| \leq q(k)

It is possible to choose a polynomial p: \Nats \rightarrow \Nats and \tilde{S} a (poly,0)^O-scheme of signature \boldsymbol{1} \rightarrow \Words s.t.

(i) \forall k \in \Nats: p(k) \geq q(k)

(i) \forall k \in \Nats, y \in \WordsLen{p(k)}: \tilde{S}^{p(k)}(y) = S^k(y_{<r_S(k)})

(ii) \forall k \in \Nats: T_{\tilde{S}}(\WordsLen{p(k)}) \leq p(k)

Denote \hat{\zeta}{\tilde{S},F,p}=(\zeta{\tilde{S},F,p},r_{\tilde{S},F,p}) to the \Words-valued (poly,0)-scheme defined by

r_{\tilde{S},F,p}(k) = 2 p(k) - |\tilde{S}| - |F|

\forall k \in \Nats, x \in \Words, y_1 \in \WordsLen{p(k)-|\tilde{S}|}, y_2 \in \WordsLen{p(k)-|F|}: \zeta_{\tilde{S},F,p}^k(x,y_1 y_2)=(p(k),\tilde{S} y_1, F y_2,x)

Then, \hat{\zeta}{\tilde{S},F,p} is a \Delta^2-pseudo-invertible reduction of (f,\mu) to (f{EXP(O)},\mu_{Samp(O)}).

The proofs of this and subsequent results are in Appendix B.

In particular, if (f_{EXP(O)},\mu_{Samp(O)}) has a uniform \Delta^2(poly,log)^O-optimal predictor scheme then we can construct a \Delta^2(poly,log)^O-optimal predictor scheme for any \Delta^1-samplable distributional estimation problem which has an exponential time approximation with error in \Delta^1. We don’t expect this to be true for reasonable error spaces and O=\varnothing, but as the following construction (drawing heavily from Heller) shows it is true for some oracles.

Construction 2

Consider O \subseteq \Words, p: \Nats \rightarrow \Nats a polynomial and \sigma: \Nats \times \Words \xrightarrow{O} \Words. We give a recursive definition of O_{p,k}^\sigma \subseteq \Words for k \in \Nats.

Recursion base:

O_{p,0}^\sigma := {0x \mid x \in O}

Denote

O_{p,k}^{\sigma,+}:={1w\alpha \mid |\alpha|=k \land \exists S,F,y \in \WordsLen{k}, z \in \WordsLen{p(k)}: \sigma^k(w) = zSFy \land f_{EXP(O_{p,k}^\sigma)}^k(S,F,ev_{O_{p,k}^\sigma}^k(S,k,y)) \geq \beta(\alpha)}

O_{p,k}^{\sigma,-}:=(\Words \setminus O_{p,k}^\sigma) \cap (Q_{f_{EXP(O_{p,k}^\sigma)}}(\Supp \mu_{Samp(O_{p,k}^\sigma)}^k) \cup Q_{ev_{O_{p,k}^\sigma}}(\WordsLen{k} \times {k} \times \WordsLen{k}))

Recursion step:

O_{p,k+1}^\sigma:=O_{p,k}^\sigma \cup O_{p,k}^{\sigma,+} \setminus \bigcup_{i \leq k} O_{p,i}^{\sigma,-}

Finally, we define

O_p^\sigma := \bigcup_{k \in \Nats} O_{p,k}^\sigma

Construction 3

Define \Delta_{pow}^1 to be the set of functions \delta: \Nats \rightarrow \Reals^{\geq 0} s.t.

\exists \epsilon > 0: \LimInf{k} k^\epsilon \delta(k) = 0

It is easy to see \Delta_{pow}^1 is an error space.

Theorem 2

Consider O \subseteq \Words and \sigma: \Nats \times \Words \xrightarrow{O} \Words s.t.\sigma^k is a bijection on \WordsLen{3k+p(k)}. Then, for any sufficiently large polynomial p: \Nats \rightarrow \Nats, (f_{EXP(O_p^\sigma)},\mu_{Samp(O_p^\sigma)}) is (\Delta_{pow}^1)^{O_p^\sigma}-generatable.

Note 1

For \sigma the identity function, (f_{EXP(O_p^\sigma)},\mu_{Samp(O_p^\sigma)}) admits a Monte Carlo algorithm. This is too strong for our purposes since it means there is no logical uncertainty in exponential time problems. Such oracles can be regarded as bounded analogues of reflective oracles.

Note 2

It is easy to see that O_p^\sigma \in EXP^O.

Theorem 2 is insufficient by itself to enable reflection with optimal predictor schemes relative to O_p^\sigma since \Lambda is a random algorithm. Potentially this can be solved by derandomization but at present we don’t know whether under what conditions derandomization with respect to O_p^\sigma is possible.

Appendix A

Fix \Delta an error space of rank 2.

Definition A

Consider (f, \mu), (g, \nu) distributional estimation problems, \hat{\zeta}=(\zeta,r_{\zeta},a_{\zeta}) a \Words-valued (poly,log)-bischeme.\hat{\zeta} is called a \Delta-pseudo-invertible reduction of (f, \mu) to (g, \nu) when there is a polynomial p: \Nats \rightarrow \Nats s.t. the following conditions hold:

(i) E_{\mu^k \times U^{r_\zeta(k,j)}}[(g(\hat{\zeta}^{kj}(x))-f(x))^2] \in \Delta

(ii) Pr_{\mu^k \times U^{r_\zeta(k,j)}}[\nu^{p(k)}(\hat{\zeta}^{kj}(x))=0] \in \Delta

(iii) There is M > 0 and \hat{R}=(R,r_R,a_R) a \Rats \cap [0,M]-valued (poly,log)-bischeme s.t.

E_{\nu^{p(k)} \times U^{r_R(k,j)}}[(\hat{R}^{kj}(y)-\frac{Pr_{\mu^k \times U^{r_\zeta(k,j)}}[\hat{\zeta}^{kj}(x)=y]}{\nu^{p(k)}(y)})^2] \in \Delta

(iv) There is a \Words-valued (poly,log)-scheme \hat{\xi}=(\xi,r_\xi,a_\xi) s.t.

E_{\mu^k \times U^{r_\zeta(k,j)}}[\sum_{x' \in \Words} |Pr_{U^{r_\xi(k,j)}}[\hat{\xi}^{kj}(\hat{\zeta}^{kj}(x,z),w)=x'] - Pr_{\mu^k \times U^{r_\zeta(k,j)}}[x''=x' \mid \hat{\zeta}^{kj}(x'',z')=\hat{\zeta}^{kj}(x,z)]|] \in \Delta

Such \hat{\xi} is called a \Delta-pseudo-inverse of \hat{\zeta}.

\hat{\zeta}=(\zeta,r_{\zeta},a_{\zeta}) a \Words-valued (poly,log)-scheme is called a \Delta-pseudo-invertible reduction of (f, \mu) to (g, \nu) when it becomes such when adding trivial dependence on j.

Theorem A

Suppose there is a polynomial h: \Nats^2 \rightarrow \Nats s.t. h^{-1} \in \Delta. Consider (f, \mu), (g, \nu) distributional estimation problems, \hat{\zeta} a \Delta-pseudo-invertible reduction of (f, \mu) to (g, \nu) and \hat{P}_g a \Delta(poly,log)-optimal predictor scheme for (g, \nu). Define \hat{P}_f by \hat{P}^{kj}_f(x):= \hat{P}^{kj}_g(\hat{\zeta}^{kj}(x)). Then, \hat{P}_f is a \Delta(poly,log)-optimal predictor scheme for (f, \mu).

Theorem A.1 is proved exactly the same way as Theorem 8 for unidistributional problems and we leave out the details.

Appendix B

Proof of Theorem 1

We need to show the 4 defining conditions of \Delta^2-pseudo-invertible reductions.

(i) E_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[(f_{EXP(O)}(\hat{\zeta}{\tilde{S},F,p}^k(x))-f(x))^2] = E{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[(f_{EXP(O)}^{p(k)}(\tilde{S}y_1, Fy_2, x)-f(x))^2]

E_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[(f_{EXP(O)}(\hat{\zeta}{\tilde{S},F,p}^k(x))-f(x))^2] = E{\mu^k}[(\beta(ev_O^{2^{p(k)}}(F,x))-f(x))^2]

E_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[(f_{EXP(O)}(\hat{\zeta}{\tilde{S},F,p}^k(x))-f(x))^2] = E{\mu^k}[(F(x)-f(x))^2]

Using assumption (i), we get the required result.

(ii) Pr_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(\hat{\zeta}{\tilde{S},F,p}^k(x))=0] = Pr{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(p(k),\tilde{S}y_1,Fy_2,x)=0]

Pr_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(\hat{\zeta}{\tilde{S},F,p}^k(x))=0] = Pr{\mu^k}[x \not\in S^k(\WordsLen{r_S(k)})]

Pr_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(\hat{\zeta}{\tilde{S},F,p}^k(x))=0] = \sum{x \in \Words \setminus S^k(\WordsLen{r_S(k)})}\mu^k(x)

Pr_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(\hat{\zeta}{\tilde{S},F,p}^k(x))=0] = \sum{x \in \Words \setminus S^k(\WordsLen{r_S(k)})} |\mu^k(x) - Pr_{\WordsLen{r_S(k)}}[S^k(y)=x]|

Pr_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(\hat{\zeta}{\tilde{S},F,p}^k(x))=0] \leq \sum{x \in \Words} |\mu^k(x) - Pr_{\WordsLen{r_S(k)}}[S^k(y)=x]|

Since S is a (\Delta^1)^O-sampler for \mu, we get

Pr_{\mu^k \times U^{r_{\tilde{S},F,p}(k)}}[\mu_{Samp(O)}^{p(k)}(\hat{\zeta}_{\tilde{S},F,p}^k(x))=0] \in \Delta^1

(iii) Define \hat{R} by

\hat{R}^{kj}(i,S',F',x) = \begin{cases}0 \text{ if } i \ne p(k) \ 0 \text{ if } \tilde{S} \text{ is not a prefix of } S' \ 2^{|\tilde{S}|+|F|} \text{ in other cases} \end{cases}

Consider k \in \Nats, x' \in \Words, y_1 \in \WordsLen{p(k)-|\tilde{S}|}, y_2 \in \WordsLen{p(k)-|F|}.

\frac{Pr_{\mu^k \times U^{r_\zeta(k,j)}}[\hat{\zeta}{\tilde{S},F,p}^k(x)=(p(k),\tilde{S} y_1,F y_2,x')]}{\mu{Samp(O)}^{p(k)}(p(k),\tilde{S} y_1,F y_2,x')} = \frac{2^{-p(k)+|\tilde{S}|} 2^{-p(k)+|F|} Pr_{U^{r_S(k)}}[S^k(y)=x]}{2^{-p(k)} 2^{-p(k)} Pr_{U^{r_S(k)}}[S^k(y)=x]}=2^{|\tilde{S}|+|F|}

(iv) Define \hat{\xi} by

\hat{\xi}^{kj}(i,S',F',x)=x

Proposition B

Consider O \subseteq \Words, p: \Nats \rightarrow \Nats a polynomial and \sigma: \Nats \times \Words \xrightarrow{O} \Words. Then

(i) \forall k \in \Nats, i > k, x \in \Words: x \in O_{p,k}^\sigma \implies x \in O_{p,i}^\sigma \land x \in O_p^\sigma

(ii) \forall x \in \Words: 0x \in O_p^\sigma \iff x \in O

(iii) \forall k \in \Nats, \forall x \in \WordsLen{4k+p(k)}: 1x \in O_p^\sigma \iff 1x \in O_{p,k}^{\sigma,+} \setminus \bigcup_{i \leq k} O_{p,i}^{\sigma,-}

(iv) \forall k \in \Nats, S \in \WordsLen{k}, y \in \WordsLen{k}: ev_{O_p^\sigma}^k(S,y)=ev_{O_{p,k}^\sigma}^k(S,y)

(v) \forall k \in \Nats, F \in \WordsLen{k}, x \in \Supp{\mu_{Samp(O_p^\sigma)}^k}: ev_{O_p^\sigma}^{2^k}(F,x)=ev_{O_{p,k}^\sigma}^{2^k}(F,x)

Proof of Proposition B

(i) By induction: x \in O_{p,k+1}^{\sigma,-} implies x \not\in O_{p,k}^\sigma.

(ii) If 0x \not\in O then 0x \not\in O_p^\sigma since all words in O_{p,k}^{\sigma,+} begin with 1. If 0x \in O then 0x \in O_p^\sigma by (i).

(iii) If 1x \in O_{p,k}^{\sigma,+} \setminus \bigcup_{i \leq k} O_{p,i}^{\sigma,-} then 1x \in O_{p,k+1}^\sigma and by (i) 1x \in O_p^\sigma. If 1x \in O_p^\sigma then there is n \in \Nats s.t.1x \in O_{p,n}^{\sigma,+} \setminus \bigcup_{i \leq n} O_{p,i}^{\sigma,-} and n=k because any word in O_{p,n}^{\sigma,+} has length 4n + p(n) + 1.

(iv) All oracle queries made in ev_{O_{p,k}^\sigma}^k(S,y) return the same values when addressed to O_{p,i}^\sigma for i > k: queries that return 0 keep returning 0 since they are included in O_{p,k}^{\sigma,-} and queries that return 1 keep returning 1 by (i).

(v) Note that (iv) implies \Supp{\mu_{Samp(O_p^\sigma)}^k}=\Supp{\mu_{Samp(O_{p,k}^\sigma)}^k} and apply the same argument as in (iv) to ev_{O_{p,i}^\sigma}^{2^k}(F,x).

Proof of Theorem 2

We describe \hat{G} the (\Delta_{pow}^1)^{O_p^\sigma}-generator for (f_{EXP(O_p^\sigma)},\mu_{Samp(O_p^\sigma)}). \hat{G}=(G,3k+p(k)) is the (poly,0)^{O_p^\sigma}-scheme of signature \boldsymbol{1} \rightarrow \Words \times [0,1] defined as follows. Consider the computation of {G}^k(w).

zSFy := \sigma^k(w) is computed, where |S|=|F|=|y|=k. x := ev_{O_p^\sigma}^k(S,k,y) is computed. For each i \in {0,1 \ldots k}, we compute \alpha_i defined as \frac{i}{k} rounded to k binary digits.\alpha := \frac{1}{k+1} |{i \in {0,1 \ldots k} \mid 1w\alpha_i \in O_p^\sigma}| is computed within k binary digits. The output is ((k,S,F,x),\alpha).

\hat{G}1 reproduces \mu{Samp(O_p^\sigma)} precisely since \sigma^k preserves U^{3k+p(k)}.

It is easy to see that there is a polynomial q s.t. for any p we have |\bigcup_{i \leq k} O_{p,i}^{\sigma,-}| \leq 2^{q(k)}. Take any polynomial p s.t. \LimInf{k} (3k + p(k) - q(k)) = +\infty. By Lemma C, (iii) we have

\alpha = \frac{1}{k+1} (|{i \in {0,1 \ldots k} \mid 1w\alpha_i \in O_{p,k}^{\sigma,+}}| - |{i \in {0,1 \ldots k} \mid 1w\alpha_i \in O_{p,k}^{\sigma,+} \cap \bigcup_{n \leq k} O_{p,n}^{\sigma,-}}|)

The first term approximates ev_{O_{p,k}^\sigma}^{2^k}(F,x) within error \frac{1}{k+1} \in \Delta_{pow}^1 by the construction of O_{p,k}^{\sigma,+}. By Lemma C, (v) it approximates f_{EXP(O_p^\sigma)}(k,S,F,x)=ev_{O_p^\sigma}^{2^k}(F,x) within the same error. The second terms vanishes for all but a negligible fraction of the w-s by the properties of p and q.