I’ve personally had a hard time accepting the ‘seeking power’ terminology used in some of the discussion around instrumental convergence. I can’t ask my teachers about power or instrumental convergence. However, I do often ask questions about Rademacher complexity since I study machine learning algorithms. This is effectively a rough-draft articulation of why I do that.
Background
Instrumental convergence is the idea that there are states or strategies that are useful for achieving a broad set of rewards. A somewhat canonical example might be the paper-clip maximizer.
At a high level, paper-clip maximizers seem like a potential threat. First, we argue that some smaller collection of strategies would be useful for achieving a larger set of goal states. Second, we observe that this smaller collection of states/strategies is somehow ‘bad’. In our example, this might look like turning Earth into a factory to produce paper-clips.
Of course, without a proper grounding of the various terms it’s easy to overreach and prove too much. Formalizations of instrumental convergence have been proposed previously. Since optimal policies always converge to a limiting cycle on the graph, work on maximum mean-weight cycles is also relevant. Mathieu and Wilson study the distribution of such returns on complete graphs. In our context, they show that the limiting cycles with high return have constant-order length \Theta(1), which means the length does not scale with the size of the graph. A subsequent work clarifies that the length is constant with constant probability, and is otherwise O((\log n)^3).
Turner et al. mathematically formalize the relevant notions for AI and then propose something more: that optimal policies tend to seek power. Power is defined (roughly) as the expected optimal value of a state under a distribution of reward functions,
\text{POWER}(s) \sim \mathbb{E}_{r} \left[ \sup_{\rho_s} \ \langle \rho_s, r \rangle \right], \quad \rho_s \text{ a valid occupancy measure for } s
A useful example is to think of having an agent precommit to a limiting cycle by choosing its initial state. If the MDP has n states s \in S and is deterministic, then we may assign a cycle function C: S \to \mathbb{N}^n which counts the number of cycles of each length reachable from a given state. It’s clear enough that if one state s includes at least as many cycles of every length as s', then the power of s is at least that of s'. We can generalize this slightly,
Proposition: Let \Delta C(s,s')_i = C(s)_i - C(s')_i. Suppose the rewards are i.i.d. Then we have \text{POWER}(s) \ge \text{POWER}(s') whenever,
\sum_{i = 1}^j \Delta C(s,s')_i \ge 0, \quad \forall j \in \lbrace 1, \ldots, n \rbrace
and the inequality is strict whenever \Delta C(s,s')_i > 0 for some i \in \lbrace 1, \ldots, n \rbrace. This is just saying that shorter cycles are more valuable than longer cycles, in the sense that any cycle of length L can substitute for any cycle of length K \ge L.
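As a quick sanity check on the proposition, here is a small Monte Carlo sketch. It leans on a few simplifying assumptions of my own: the cycles reachable from each state are disjoint, the rewards are i.i.d. standard normal, and the value of precommitting to a cycle is its average per-step reward. The two cycle profiles below, lengths (1, 2) versus (2, 3), satisfy the partial-sum condition, so the first state should come out at least as powerful.

import numpy as np

rng = np.random.default_rng(0)

def estimate_power(cycle_lengths, n_samples=200_000):
    # E[ max over available cycles of the cycle's mean reward ] under i.i.d. N(0,1) rewards.
    # The mean of L i.i.d. N(0,1) rewards is distributed as N(0, 1/L).
    best = np.full(n_samples, -np.inf)
    for L in cycle_lengths:
        best = np.maximum(best, rng.normal(scale=1.0 / np.sqrt(L), size=n_samples))
    return best.mean()

print(estimate_power([1, 2]))   # state s:  one 1-cycle and one 2-cycle (larger)
print(estimate_power([2, 3]))   # state s': one 2-cycle and one 3-cycle (smaller)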
The general idea of measuring the average optimization ability of a learning algorithm is extremely well studied. In particular, such measures are commonly used to provide generalization guarantees for machine learning algorithms. For example, we can introduce unnormalized \Sigma-complexity measures defined on a set A as,
\mathcal{C}_{\Sigma}(A) := \mathbb{E}_{\sigma \sim \Sigma} \left[ \sup_{a \in A} \langle a , \sigma \rangle \right]
where we’ll typically take a and \sigma to be vectors in \mathbb{R}^m. The most common \Sigma is the Rademacher distribution which is given as,
P(\sigma_i = +1) = P(\sigma_i = -1) = 1/2
Using this distribution gives us the Rademacher complexity of the set. We could also use a Gaussian distribution in which case we obtain the Gaussian complexity. These complexities are related,
\frac{\mathcal{C}_{\text{Gaussian}}(A)}{2 \sqrt{\log(m)}} \le \mathcal{C}_{\text{Rad}}(A) \le \sqrt{\frac{\pi}{2}} \ \mathcal{C}_{\text{Gaussian}}(A)
which means that the different complexity measures roughly correlate.
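To make this concrete, here is a minimal Monte Carlo sketch that estimates both complexities for a small, arbitrary finite set A of vectors in \mathbb{R}^m and prints the sandwich relation above. The helper name complexity and the particular A are just illustrative choices of mine.

import numpy as np

rng = np.random.default_rng(0)

def complexity(A, sampler, n_samples=100_000):
    # Monte Carlo estimate of C_Sigma(A) = E[ sup_{a in A} <a, sigma> ]
    # for a finite set A given as the rows of a matrix.
    sigma = sampler(size=(n_samples, A.shape[1]))
    return np.max(sigma @ A.T, axis=1).mean()

A = rng.normal(size=(5, 10))                                   # five arbitrary vectors in R^10
rademacher = lambda size: rng.choice([-1.0, 1.0], size=size)   # P(+1) = P(-1) = 1/2

rad = complexity(A, rademacher)
gau = complexity(A, rng.normal)
m = A.shape[1]
print(gau / (2 * np.sqrt(np.log(m))), "<=", rad, "<=", np.sqrt(np.pi / 2) * gau)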
Applications to Instrumental Convergence
Here I’ll give definitions modeled on those used in the power-seeking paper and a recent post on broad-convergence.
Definition: The optimality probability of A relative to B under distribution \Sigma is
p_{\Sigma}(A \ge B) := P_{\sigma \sim \Sigma} \left( \sup_{a \in A} \ \langle a, \sigma \rangle \ge \sup_{b \in B} \ \langle b , \sigma \rangle \right)
Definition: We’ll say B is \Sigma-instrumentally convergent over A whenever p_{\Sigma}(A \ge B) \le 1/2.
For an MDP we might take \Sigma to be a distribution over state-based reward functions, and the sets A and B to be the sets of feasible (discounted) state-occupancy measures that have initial full support on states s_A and s_B, respectively.
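As an illustrative sketch, the optimality probability is easy to estimate by Monte Carlo when small finite sets of vectors stand in for the occupancy-measure sets and a standard Gaussian plays the role of \Sigma (both of which are my own choices here, as are the helper name and the particular A and B).

import numpy as np

rng = np.random.default_rng(0)

def optimality_probability(A, B, n_samples=100_000):
    # Monte Carlo estimate of p_Sigma(A >= B) under standard-normal sigma.
    sigma = rng.normal(size=(n_samples, A.shape[1]))
    return np.mean(np.max(sigma @ A.T, axis=1) >= np.max(sigma @ B.T, axis=1))

A = rng.normal(size=(3, 8))
B = rng.normal(size=(6, 8))        # with more options, B will typically be optimal more often
print(optimality_probability(A, B))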
Definition: Assuming the rewards are zero-mean, the \Sigma-power of a state is given as,
\text{POWER}_{\Sigma}(s) = \mathcal{C}_{\Sigma}(I_s)
where I_s is the set of feasible (discounted) state-occupancy measures that have initial full support on state s.
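For a small deterministic MDP we can enumerate I_s directly by rolling out every deterministic stationary policy and then estimate the \Sigma-power by Monte Carlo. The sketch below is only meant to be suggestive: it assumes standard-normal rewards, truncates the discounted sum at a finite horizon, and the helper names and example MDP are mine.

import itertools
import numpy as np

rng = np.random.default_rng(0)

def occupancy_measures(transitions, start, gamma=0.9, horizon=300):
    # Discounted state-occupancy measure of every deterministic stationary policy
    # in a small deterministic MDP, given as {state: [successor states]}.
    states = sorted(transitions)
    index = {s: i for i, s in enumerate(states)}
    measures = []
    for choice in itertools.product(*(range(len(transitions[s])) for s in states)):
        policy = {s: transitions[s][choice[i]] for i, s in enumerate(states)}
        rho, s, discount = np.zeros(len(states)), start, 1.0
        for _ in range(horizon):
            rho[index[s]] += discount
            discount *= gamma
            s = policy[s]
        measures.append(rho)
    return np.array(measures)

def sigma_power(I_s, n_samples=100_000):
    # POWER_Sigma(s) = C_Sigma(I_s), estimated under standard-normal rewards.
    r = rng.normal(size=(n_samples, I_s.shape[1]))
    return np.max(r @ I_s.T, axis=1).mean()

# A three-state deterministic MDP: from state 0 you may stay or move to 1, and so on.
transitions = {0: [0, 1], 1: [1, 2], 2: [2, 0]}
print(sigma_power(occupancy_measures(transitions, start=0)))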
Theorem: Assume I_a and I_b have positive \Sigma-power. Then, if the \Sigma-power of I_b exceeds that of I_a, I_b is more likely to be optimal under the reward distribution \Sigma. Moreover, if I_b is only slightly more likely to be optimal than I_a, then its \Sigma-power cannot exceed that of I_a by much.
We’ll use the following lemma to help us,
Lemma: Let \Delta(\sigma) = \sup_{a \in A} \ \langle a, \sigma \rangle - \sup_{b \in B} \ \langle b , \sigma \rangle. Suppose that the distribution of \Delta(\sigma) under \sigma \sim \Sigma is sub-Gaussian with scale parameter \nu_R^2 and that \mathcal{C}_{\Sigma}(A) < \mathcal{C}_{\Sigma}(B). Then we have,
p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B))^2}{2\nu_R^2}}
Proof (of the theorem): Applying the lemma with A = I_a and B = I_b yields,
p_{\Sigma}(I_a \ge I_b) \le e^{\cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}} \quad \Rightarrow \quad p_{\Sigma}(I_b > I_a) \ge 1 - e^{\cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}}
\Rightarrow \log p_{\Sigma}(I_a \ge I_b) \le \cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}
\Rightarrow \text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a) \le \nu_R \sqrt{2 \log \left( \frac{1}{p_{\Sigma}(I_a \ge I_b)} \right)} \le \nu_R \sqrt{2 \log \left( \frac{1}{1 - p_{\Sigma}(I_b > I_a)} \right)}
Thus, relatively optimal policies have bounded power difference. \square
All that remains is to prove the lemma,
Proof (of the lemma): For any \lambda > 0 we have,
p_{\Sigma}(A \ge B) \le \mathbb{E}_{\sigma \sim \Sigma} \left[ \mathbb{I}\left(\sup_{a \in A} \ \langle a, \sigma \rangle \ge \sup_{b \in B} \ \langle b , \sigma \rangle \right) \right]
\le \mathbb{E}_{\sigma \sim \Sigma} \left[ e^{\lambda \left(\sup_{a \in A} \ \langle a, \sigma \rangle - \sup_{b \in B} \ \langle b , \sigma \rangle \right) } \right] \quad \text{(0-1 loss bound)}
= \mathbb{E}_{\sigma \sim \Sigma} \left[ e^{\lambda \Delta(\sigma) } \right] = \mathbb{E}_{\delta} \left[ e^{\lambda \delta } \right], \quad \delta \overset{d}{=} \Delta(\sigma), \ \sigma \sim \Sigma \quad \text{(change of variables)}
\le e^{\lambda \mathbb{E}[\delta] + \frac{\lambda^2 \nu_R^2}{2} } = e^{\lambda (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)) + \frac{\lambda^2 \nu_R^2}{2} } \quad \text{(sub-Gaussian MGF bound)}
We may optimize the bound over \lambda, obtaining \lambda^* = \frac{\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A)}{\nu_R^2}, which is positive by assumption.
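Explicitly, substituting \lambda^* back into the exponent gives
\lambda^* (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)) + \frac{(\lambda^*)^2 \nu_R^2}{2} = - \frac{(\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{\nu_R^2} + \frac{(\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{2\nu_R^2} = - \frac{(\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{2\nu_R^2}
This implies that,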
p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{2\nu_R^2}}
\square
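Finally, a rough numerical check of the lemma. Assuming \Sigma is standard Gaussian, \Delta(\sigma) is Lipschitz in \sigma with constant at most \max_{a \in A} \lVert a \rVert + \max_{b \in B} \lVert b \rVert, so Gaussian concentration supplies a (loose) admissible value for \nu_R; the empirical optimality probability should then sit below the resulting bound. The particular sets A and B below are arbitrary choices of mine, scaled so that \mathcal{C}_{\Sigma}(B) > \mathcal{C}_{\Sigma}(A).

import numpy as np

rng = np.random.default_rng(0)

# Two finite sets of vectors standing in for I_a and I_b, with C_Sigma(B) > C_Sigma(A).
A = rng.normal(size=(3, 10))
B = 2.0 * rng.normal(size=(6, 10))

n_samples = 200_000
sigma = rng.normal(size=(n_samples, 10))      # Gaussian Sigma (an assumption)
sup_A = np.max(sigma @ A.T, axis=1)
sup_B = np.max(sigma @ B.T, axis=1)

C_A, C_B = sup_A.mean(), sup_B.mean()         # Monte Carlo complexity / power estimates
p_emp = np.mean(sup_A >= sup_B)               # empirical p_Sigma(A >= B)

# Loose stand-in for nu_R: the Lipschitz constant of Delta, via Gaussian concentration.
nu = np.linalg.norm(A, axis=1).max() + np.linalg.norm(B, axis=1).max()
bound = np.exp(-(C_B - C_A) ** 2 / (2 * nu ** 2))
print(f"empirical p(A >= B) = {p_emp:.4f}, bound = {bound:.4f}")

The bound is far from tight with such a crude stand-in for \nu_R, but the qualitative story matches: the set with the larger estimated power is optimal far more often.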