Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via

Link post

Retargetability. Is it possible, using only a microscopic perturbation to the system, to change the system such that it is still an optimizing system but with a different target configuration set? A system containing a robot with the goal of moving a vase to a certain location can be modified by making just a small number of microscopic perturbations to key memory registers such that the robot holds the goal of moving the vase to a different location and the whole vase/robot system now exhibits a tendency to evolve towards a different target configuration. In contrast, a system containing a ball rolling towards the bottom of a valley cannot generally be modified by any microscopic perturbation such that the ball will roll to a different target location.

(I don’t think that "microscopic" is important for my purposes; the constraint is not physical size, but changes in a single parameter to the policy-selection procedure.)

I’m going to start from the naive view on power-seeking arguments requiring optimality (i.e. what I thought early this summer) and explain the importance of *retargetable* policy-selection functions. I’ll illustrate this notion via satisficers, which randomly select a plan that exceeds some goodness threshold. Satisficers are retargetable, and so they have orbit-level instrumental convergence: for most variations of every utility function, satisficers incentivize power-seeking in the situations covered by my theorems. Many procedures are retargetable, including every procedure which depends only on the expected utility of different plans. I think that alignment is hard in the expected utility framework not because agents will *maximize* too hard, but because all expected utility procedures are extremely retargetable—and thus easy to "get wrong." Lastly: the unholy grail of "instrumental convergence for policies trained via reinforcement learning." I’ll state a formal criterion and some preliminary thoughts on where it applies.
*The linked Overleaf paper draft contains complete proofs and incomplete explanations of the formal results.*

Retargetable policy-selection processes tend to select policies which seek power

To understand a range of retargetable procedures, let’s first orient towards the picture I’ve painted of power-seeking thus far. In short:

Since power-seeking tends to lead to larger sets of possible outcomes—staying alive lets you do more than dying does—the agent must seek power to reach most outcomes. The power-seeking theorems say that for the vast, vast, vast majority of variants of every utility function over outcomes, the max of a larger^{\text{Footnote: similarity}} set of possible outcomes is greater than the max of a smaller set of possible outcomes. Thus, optimal agents will tend to seek power. But I want to step back. What I call "the power-seeking theorems" aren’t really about optimal choice. They’re about two facts.

Orbit tendencies apply to many decision-making procedures

For example, suppose the agent is a *satisficer*. I’ll define this as: the agent uniformly randomly selects an outcome lottery with expected utility exceeding some threshold t.

**Definition: Satisficing.** For finite X\subseteq C\subsetneq \mathbb{R}^d and utility function \mathbf{u}\in\mathbb{R}^d, define \mathrm{Satisfice}_t(X,C \mid \mathbf{u}) := \frac{|X \cap \{\mathbf{c}\in C \mid \mathbf{c}^\top \mathbf{u}\geq t\}|}{|\{\mathbf{c}\in C \mid \mathbf{c}^\top \mathbf{u}\geq t\}|}, with the function returning 0 when the denominator is 0. \mathrm{Satisfice}_t returns the probability that the agent selects a \mathbf{u}-satisficing outcome lottery from X.

And you know what? Those ever-so-*suboptimal* satisficers are also "twice as likely" to choose elements from F_B than from F_A.

**Fact.** \mathrm{Satisfice}_t(\{🍌,🍎\}, \{🍌,🍎,🍒\}\mid \mathbf{u}) \geq_\text{most}^2 \mathrm{Satisfice}_t(\{🍒\}, \{🍌,🍎,🍒\}\mid \mathbf{u}).

Why? Here are the two key properties that \mathrm{Satisfice}_t has:
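As a concrete illustration of the definition and the Fact, here is a small Python sketch. The fruits are deterministic outcome lotteries (standard basis vectors over d = 3 outcomes); the utility vector `u` and threshold `t` are arbitrary choices I made for the example:

```python
import itertools

def satisfice(X, C, u, t):
    """P(a t-satisficer picks an element of X from choice set C under utility u)."""
    satisficing = [c for c in C if sum(ci * ui for ci, ui in zip(c, u)) >= t]
    if not satisficing:
        return 0.0  # the definition returns 0 when no lottery satisfices
    return sum(1 for c in satisficing if c in X) / len(satisficing)

# Deterministic outcome lotteries over d = 3 outcomes: banana, apple, cherry.
banana, apple, cherry = (1, 0, 0), (0, 1, 0), (0, 0, 1)
C = [banana, apple, cherry]
t = 1.0

# The Fact compares Satisfice_t({banana, apple}, C | u) against
# Satisfice_t({cherry}, C | u) across the whole orbit of one utility function.
u = (0.0, 2.0, 3.0)
favors_pair = sum(
    satisfice([banana, apple], C, v, t) >= satisfice([cherry], C, v, t)
    for v in itertools.permutations(u)
)
print(favors_pair, "of", 6)  # 6 of 6: every orbit element favors the two-element set
```

On this particular orbit the two-element set is (weakly) favored on every permutation of u, consistent with the \geq_\text{most}^2 claim.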

(1) Weakly increasing under joint permutation of its arguments

\mathrm{Satisfice}_t doesn’t care what "label" an outcome lottery has—just its expected utility. Suppose that for utility function \mathbf{u}, 🍒 is one of two \mathbf{u}-satisficing elements: 🍒 has a \frac{1}{2} chance of being selected by the \mathbf{u}-satisficer. Then \phi_1 \cdot 🍒 = 🍎 has a \frac{1}{2} chance of being selected by the (\phi_1\cdot \mathbf{u})-satisficer. If you swap which fruit you’re considering, and you also swap the utility for that fruit to match, then that fruit’s selection probability remains the same. More precisely:

\begin{align}\mathrm{Satisfice}_t(\{🍒\}, \{🍌,🍎,🍒\} \mid \mathbf{u})&=\mathrm{Satisfice}_t(\phi_1\cdot \{🍒\}, \phi_1\cdot \{🍌,🍎,🍒\} \mid \phi_1 \cdot \mathbf{u})\\ &=\mathrm{Satisfice}_t(\{🍎\},\{🍌,🍎,🍒\}\mid \phi_1\cdot \mathbf{u}). \end{align}

In a sense, \mathrm{Satisfice}_t is not "biased" against 🍎: by changing the utility function, you can advantage 🍎 so that it’s now as probable as 🍒 was before. Optional notes on this property:

(2) Order-preserving on the first argument

Satisficers must select an outcome lottery from a superset with at least as great a probability as from any of its subsets. Formally, if X'\subseteq X, then it must hold that \mathrm{Satisfice}_t(X', C \mid \mathbf{u}) \leq \mathrm{Satisfice}_t(X, C \mid \mathbf{u}). And indeed this holds: supersets can only contain at least as great a fraction of C’s satisficing elements.
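Both properties can be verified mechanically on a small example. The sketch below uses deterministic lotteries over d = 3 outcomes and one arbitrary utility function of my choosing; it checks (1) joint-permutation invariance and (2) superset monotonicity:

```python
import itertools

def satisfice(X, C, u, t=1.0):
    """P(a t-satisficer picks an element of X from choice set C under utility u)."""
    sat = [c for c in C if sum(ci * ui for ci, ui in zip(c, u)) >= t]
    return sum(1 for c in sat if c in X) / len(sat) if sat else 0.0

def permute(v, phi):
    """Reindex a vector's coordinates by the permutation phi."""
    return tuple(v[i] for i in phi)

C = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
u = (0.0, 2.0, 3.0)
X = [C[2]]  # the singleton containing the third outcome

# (1) Joint permutation invariance: relabel outcomes in X, C, and u at once;
#     selection probabilities are unchanged.
for phi in itertools.permutations(range(3)):
    assert satisfice(X, C, u) == satisfice(
        [permute(c, phi) for c in X],
        [permute(c, phi) for c in C],
        permute(u, phi),
    )

# (2) Order preservation: X' ⊆ X implies no greater selection probability.
for r in range(len(C) + 1):
    for Xbig in itertools.combinations(C, r):
        for s in range(r + 1):
            for Xsmall in itertools.combinations(Xbig, s):
                assert satisfice(list(Xsmall), C, u) <= satisfice(list(Xbig), C, u)

print("both properties hold on this example")
```

Permuting every coordinate consistently preserves all the dot products \mathbf{c}^\top\mathbf{u}, which is why property (1) holds exactly here.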

And that’s all.

If (1) and (2) hold for a function, then that function will obey the orbit tendencies. Let me show you what I mean. As illustrated by Table 1 in the linked paper, the power-seeking theorems apply to:

But that’s not all. There’s more. If the agent makes decisions only based on the expected utility of different plans,^\text{Footnote: EU} then the power-seeking theorems apply. And I’m not just talking about EU maximizers. I’m talking about *any* function which depends only on expected utility: EU minimizers, agents which choose plans if and only if their EU equals 1, agents which grade plans based on how close their EU is to some threshold value. There is no clever EU-based scheme which doesn’t have orbit-level power-seeking incentives.

Suppose n is large, that most outcomes in B are bad, and that the agent makes decisions according to expected utility. Then alignment is hard because for every way things could go right, there are at least n ways things could go wrong! And n can be huge. In a previous toy example, it equaled 10^{182}. It doesn’t matter whether the decision-making procedure f is rational, or anti-rational, or Boltzmann-rational, or satisficing, or randomly choosing outcomes, or only choosing outcome lotteries with expected utility equal to 1: there are more ways to choose elements of B than there are ways to choose elements of A.

These results also have closure properties. For example, closure under mixing decision procedures, as when the agent has a 50% chance of selecting Boltzmann-rationally and a 50% chance of satisficing. Or even more exotic transformations: suppose the probability of f choosing something from X is proportional to \text{P($X$ is Boltzmann-rational under $\mathbf{u}$)}\cdot\text{P($X$ satisfices $\mathbf{u}$)}+\text{P($X$ is optimal for $\mathbf{u}$)}. Then the theorems still apply. **There is no possible way to combine EU-based decision-making functions so that orbit-level instrumental convergence doesn’t apply to their composite.**

To "escape" these incentives, you have to make the theorems fail to apply. Here are a few ways:

Lastly, we maybe don’t want to *escape* these incentives entirely, because we probably want smart agents which will seek power *for us*. I think that empirically, the power-requiring outcomes of B are mostly induced by the agent first seeking power over humans.
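To illustrate how even an anti-rational EU-based rule obeys the orbit tendencies, here is a hedged sketch in a toy d = 3 setting (all names and numbers invented for the example). The decision rule f is an EU *minimizer*, and B contains n = 2 copies of A via the swaps exchanging the first outcome with each of the others:

```python
import itertools

def eu(c, u):
    """Expected utility of outcome lottery c under utility vector u."""
    return sum(ci * ui for ci, ui in zip(c, u))

def f(X, C, u):
    """EU minimizer: picks uniformly among the lowest-EU elements of C."""
    worst_eu = min(eu(c, u) for c in C)
    worst = [c for c in C if eu(c, u) == worst_eu]
    return sum(1 for c in worst if c in X) / len(worst)

e1, e2, e3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)
C = [e1, e2, e3]
A, B = [e1], [e2, e3]  # B contains n = 2 copies of A

u = (0.0, 2.0, 3.0)
orbit = list(itertools.permutations(u))
b_at_least_a = sum(f(B, C, v) >= f(A, C, v) for v in orbit)
a_exceeds_b = sum(f(A, C, v) > f(B, C, v) for v in orbit)
print(b_at_least_a, a_exceeds_b)  # 4 2: B wins on twice as many orbit elements
```

The minimizer is as far from rational as possible, yet across the orbit it still selects from the larger set B at least as often as from A on twice as many utility functions, exactly the orbit-level pattern described above.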

Retargetable training processes produce instrumental convergence

These results let us start talking about the incentives of real-world trained policies. In an appendix, I work through a specific example of how Q-learning on a toy problem provably exhibits orbit-level instrumental convergence. The problem is small enough that I computed the probability that each final policy is trained.

Realistically, we aren’t going to get a closed-form expression for the distribution over policies learned by PPO with randomly initialized deep networks trained via SGD with learning rate schedules and dropout and intrinsic motivation, etc. But we don’t need it. These results give us a formal criterion for when policy-training processes will tend to produce policies with convergent instrumental incentives. The idea is: consider some set of reward functions, and let B contain n copies of A. Then if, for each reward function in the set, you can retarget the training process so that B’s copy of A is at least as likely as A was originally, these reward functions will tend to produce trained policies which go to B.

For example, if agents trained on objectives R tend to go right, switching reward from right-states to left-states also pushes the trained policies to go left. This can happen when changing the reward changes what was "attractive" about going right, so that it’s now "attractive" to go left. Suppose we’re training an RL agent to go right in MuJoCo, with reward equal to its x-coordinate. If you permute the reward so that high y-values are rewarded instead, the trained policies should nearly perfectly symmetrically reflect that change. Insofar as x-maximizing policies were trained before, y-maximizing policies will be trained now.

This criterion is going to be a bit of a mouthful. The basic idea is that when the training process can be redirected such that trained agents induce a variety of outcomes, then most objective functions will train agents which *do induce* those outcomes. In other words: orbit-level instrumental convergence will hold.
**Theorem: Training retargetability criterion.** Suppose the agent interacts with an environment with d potential outcomes (e.g. world states or observation histories). Let P be a probability distribution over joint parameter space \Theta, and let \mathrm{train}:\Theta \times \mathbb{R}^d \to \Delta(\Pi) be a policy training procedure which takes in a parameter setting and a utility function u\in\mathbb{R}^d, and which produces a probability distribution over policies. Let \mathfrak{U}\subseteq \mathbb{R}^d be a set of utility functions which is closed under permutation. Let A,B be sets of outcome lotteries such that B contains n copies of A via \phi_1,...,\phi_n. We quantify the probability that the trained policy induces an element of outcome lottery set X\subseteq \mathbb{R}^d: f(X\mid u):= \mathbb{P}_{\substack{\theta \sim P,\ \pi\sim \mathrm{train}(\theta,u)}}\left(\text{$\pi$ does something in $X$}\right). If \forall u \in \mathfrak{U}, i\in \{1,...,n\}: f(A\mid u)\leq f(\phi_i\cdot A\mid \phi_i\cdot u), then f(B\mid u)\geq_\text{most}^n f(A\mid u).

**Proof.** If X'\subseteq X, then f(X'\mid u)\leq f(X\mid u) by the monotonicity of probability, and so (2): order preservation on the first argument holds. By assumption, (1): increasing under joint permutation holds. Therefore, Lemma B.6 (in the linked paper) implies the desired result. QED.

This criterion is testable. Although we can’t test all reward functions, we can test how retargetable the training process is in simulated environments for a variety of reward functions. If it can’t retarget easily for reasonable objectives, then we conclude^{\text{FN: retarget}} that instrumental convergence isn’t arising from retargetability at the training-process level.

Let’s think about Minecraft. (Technically, the theorems don’t apply to Minecraft yet. The theorems can handle partial observability with utility over observation histories, *or* full observability with reward over world states, but not yet partial observability with reward over world states. But I think it’s illustrative.) We could reward the agent for ending up in different chunks of a Minecraft world. Here, retargeting often looks like "swap which chunk gets which reward." We could consider all chunks within 1 million blocks of the agent, and reward the agent for being in one of them.

The retargetability criterion also accounts for reward shaping guiding the learning process to hard-to-reach parts of the state space. If the agent needs less reward shaping to reach these parts of the state space, the training criterion will hold for larger sets of reward functions.

Why cognitively bounded planning agents obey the power-seeking theorems

Planning agents are more "top-down" than RL training, but a Monte Carlo tree search agent still isn’t e.g. approximating Boltzmann-rational leaf node selection. A bounded agent won’t consider *all* of the possible trajectories it can induce. Maybe it just knows how to induce some subset of available outcome lotteries C'\subsetneq C. Then, considering only the things it knows how to do, it selects one e.g. Boltzmann-rationally (sometimes it will fail to choose the highest-EU plan, but it is more likely to choose higher-utility plans). As long as {power-seeking things the agent knows how to do} contains n copies of {non-power-seeking things the agent knows how to do}, the theorems still apply. I think this is a reasonable model of bounded cognition.
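A minimal sketch of this bounded-cognition model, with invented numbers: the agent selects Boltzmann-rationally among only the outcome lotteries it knows how to induce (`C_known`, a strict subset of what is available), so higher-EU known plans are exponentially more probable but never certain, and plans outside C' get no probability at all:

```python
import math

def boltzmann_probs(C_known, u, beta=1.0):
    """Boltzmann-rational selection over only the plans the agent knows about."""
    eus = [sum(ci * ui for ci, ui in zip(c, u)) for c in C_known]
    weights = [math.exp(beta * e) for e in eus]
    Z = sum(weights)
    return [w / Z for w in weights]

# Four deterministic outcomes; the bounded agent only knows how to induce three.
C = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
C_known = C[:3]
u = (1.0, 2.0, 0.0, 10.0)  # the best outcome (EU 10) lies outside C_known entirely

probs = boltzmann_probs(C_known, u, beta=1.0)
print([round(p, 3) for p in probs])
# Higher-EU known plans get more mass, but suboptimal picks keep nonzero probability.
```

The theorems then apply to the restricted choice set C' in place of C, provided the copy-containment condition holds among the plans the agent actually knows about.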

Discussion

Conclusion

I discussed how a wide range of agent cognition types and agent production processes are retargetable, and why that might be bad news. I showed that in many situations where power is possible, retargetable policy-production processes tend to produce policies which gain that power. In particular, these results seem to rule out a huge range of expected-utility-based rules. The results also let us reason about instrumental convergence at the trained-policy level.

I now think that more instrumental convergence comes from the practical retargetability of how we design agents. If there were more ways we could have counterfactually messed up, it’s more likely *a priori* that we *actually* messed up. The way I currently see it: either we have to really know what we’re doing, or we want processes where it’s somehow hard to mess up. Since these theorems are crisply stated, I want to more closely inspect the ways in which alignment proposals can violate the assumptions which ensure extremely strong instrumental convergence.

Thanks to Ruby Bloom, Andrew Critch, Daniel Filan, Edouard Harris, Rohin Shah, Adam Shimi, Nisan Stiennon, and John Wentworth for feedback.

Footnotes

**FN: Similarity.** Technically, we aren’t just talking about a cardinality inequality—about staying alive letting the agent do *more things* than dying—but about similarity-via-permutation of the outcome lottery sets. I think it’s OK to round this off to cardinality inequalities when informally reasoning using the theorems, keeping in mind that sometimes results won’t formally hold without a stronger precondition.

**FN: Row.** I assume that permutation matrices are in row representation: (\mathbf{P}_\phi)_{ij}=1 if i=\phi(j) and 0 otherwise.

**FN: EU.** Here’s a bit more formality for what it means for an agent to make decisions only based on expected utility. This definition basically says that f can be expressed in terms of the expected utilities of the set elements—the output will only depend on expected utility. **Theorem: Retargetability of EU decision-making.** Let A,B\subseteq C \subsetneq\mathbb{R}^d be such that B contains n copies of A via \phi_i such that \phi_i \cdot C = C. For X\subseteq C, let f(X,C \mid \mathbf{u}) be an EU/cardinality function, such that f returns the probability of selecting an element of X. Then f(B,C \mid \mathbf{u})\geq_\text{most}^n f(A,C \mid \mathbf{u}).

**FN: Retargetability.** The trained policies could conspire to "play dumb" and pretend to not be retargetable, so that we would be more likely to actually deploy one of them.

Worked example: instrumental convergence for trained policies

Consider a simple environment with three actions: Up, Right, Down.

*Probably optimal policies.* By running tabular Q-learning with \epsilon-greedy exploration for e.g. 100 steps with resets, we have a high probability of producing an optimal policy for any reward function. Suppose that all Q-values are initialized at -100, and let learning rate \alpha=1 and discount \gamma=1. This is basically a bandit problem: to learn an optimal policy, at worst, the agent just has to try each action once. For e.g. a sparse reward function on the Down state (1 reward on the Down state and 0 elsewhere), there is a very small probability (precisely, \frac{2}{3}(1-\frac{\epsilon}{2})^{99}) that the optimal action (Down) is never taken. In that case, symmetry shows that the agent has an equal chance of learning either Up or Right. But with high probability, the learned policy will output Down.

For any sparse reward function and for any action a, this produces decision function f(\{\mathbf{e}_{s_a}\},\{\mathbf{e}_s\mid s\in \mathcal{S}\}\mid\mathbf{r}):=\begin{cases}\frac{1}{3}(1-\frac{\epsilon}{2})^{99} &\text{if $a$ is $\mathbf{r}$-suboptimal}\\ 1-\frac{2}{3}(1-\frac{\epsilon}{2})^{99} &\text{if $a$ is $\mathbf{r}$-optimal.}\end{cases}

f is invariant to joint involution by \phi_1 := (\mathbf{e}_{s_\texttt{Down}}\ \ \mathbf{e}_{s_\texttt{Right}}) and \phi_2 := (\mathbf{e}_{s_\texttt{Down}}\ \ \mathbf{e}_{s_\texttt{Up}}). That is, \begin{align}f(\{\mathbf{e}_{s_\texttt{Down}}\},\{\mathbf{e}_s\mid s\in \mathcal{S}\}\mid\mathbf{r})&=f(\phi_1\cdot\{\mathbf{e}_{s_\texttt{Down}}\},\phi_1\cdot\{\mathbf{e}_s\mid s\in \mathcal{S}\}\mid\phi_1\cdot\mathbf{r})\\ &=f(\{\mathbf{e}_{s_\texttt{Right}}\},\{\mathbf{e}_s\mid s\in \mathcal{S}\}\mid\phi_1\cdot\mathbf{r}). \end{align} And similarly for \phi_2. That is: changing the optimal state also changes which state is more probably selected by f. This means we’ve satisfied condition (1) above. f is additive on union for its first argument, and so it meets condition (2): order preservation.

Therefore, for this policy training procedure, learned policies for sparse reward functions will be *twice as likely* to navigate to an element of \{\mathbf{e}_{s_\texttt{Up}}, \mathbf{e}_{s_\texttt{Right}}\} as to an element of \{\mathbf{e}_{s_\texttt{Down}}\}! This is a formal argument that a stochastic policy training procedure has certain tendencies across a class of reward functions, and I’m excited to be able to make it.

As the environment grows bigger and the training procedure more complex, we’ll have to consider questions like "what are the inductive biases of large policy networks?", "what role does reward shaping play for this objective, and is the shaping at least as helpful for its permuted variants?", and "to what extent are different parts of the world harder to reach?". For example, suppose there are a trillion actions, and two of them lead to the Right state above. Half of the remaining actions lead to Up, and the rest lead to Down:

- 2 actions transition right to chocolate.
- \frac{1}{2}(10^{12}-2) actions transition up to candy.
- \frac{1}{2}(10^{12}-2) actions transition down to hug.

Q-learning is ridiculously unlikely to ever go Right, and so the symmetry breaks. In the limit, tabular Q-learning on a finite MDP will learn an optimal policy, and then the normal theorems will apply. But in the finite-step regime, no such guarantee holds, and so *the available action space* can violate condition (1): increasing under joint permutation.
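The worked example can also be simulated directly. This is an illustrative reimplementation with my own parameter choices (e.g. \epsilon = 0.5, which the text leaves unspecified), not the exact procedure behind the closed-form probabilities above:

```python
import random

ACTIONS = ("Up", "Right", "Down")

def train(reward, steps=100, eps=0.5, rng=random):
    """Tabular Q-learning on a one-step problem: alpha=1, Q initialized to -100."""
    Q = dict.fromkeys(ACTIONS, -100.0)
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.choice(ACTIONS)                                   # explore
        else:
            best = max(Q.values())
            a = rng.choice([x for x in ACTIONS if Q[x] == best])      # greedy w/ ties
        Q[a] = reward[a]  # alpha = 1, gamma = 1: Q-value is just the observed reward
    best = max(Q.values())
    return rng.choice([x for x in ACTIONS if Q[x] == best])           # learned policy

rng = random.Random(0)
reward = {"Up": 0.0, "Right": 0.0, "Down": 1.0}  # sparse reward on Down
runs = [train(reward, rng=rng) for _ in range(2000)]
print(runs.count("Down") / len(runs))  # close to 1: Down is almost always learned
```

Permuting the sparse reward to Up or Right shifts the learned-policy distribution symmetrically, which is the joint-permutation property the appendix argument relies on.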

Appendix: tracking key limitations of the power-seeking theorems

From last time:

  1. don’t deal with the agent’s uncertainty about what environment it’s in.

I want to think about this more, especially for online planning agents. (The training retargetability criterion black-boxes the agent’s uncertainty.)

Comment

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=NaT6TAoC9CpuxFjje

> Appendix: tracking key limitations of the power-seeking theorems

I want to say that there’s another key limitation:

> Let \mathfrak{U}\subseteq \mathbb{R}^d be a set of utility functions which is closed under permutation.

It seems like a rather central assumption to the whole approach, but in reality people seem to tend to specify "natural" utility functions in some sense (e.g. generally continuous, being functions of only a few parameters, etc.). I feel like for most forms of natural utility functions, the basic argument will still hold, but I’m not sure how far it generalizes.

Comment

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=sNbmKfxMW3JK4r8Bg

Right, I was intending "3. [these results] don’t account for the ways in which we might practically express reward functions" to capture that limitation.

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=cAq4xNknTCRooEXtf

You write

> This point may seem obvious, but cardinality inequality is insufficient in general. The set copy relation is required for our results

Could you give a toy example of this being insufficient (I’m assuming the "set copy relation" is the "B contains n copies of A" requirement)? How does the "B contains n copies of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large numbers of outcomes).

Comment

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=vj2QADqdTTAesgReF

> Could you give a toy example of this being insufficient (I’m assuming the "set copy relation" is the "B contains n copies of A" requirement)?

A := \{(1\ 0\ 0)\}, B := \{(0\ .3\ .7), (0\ .7\ .3)\}. Less opaquely, see the technical explanation for this counterexample, where the right action leads to two trajectories, and up leads to a single one.

> How does the "B contains n copies of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large numbers of outcomes).

For this, I think we need to zoom out to a causal DAG (with choice nodes) picture of the world, over some reasonable abstractions. It’s just too unnatural to pick out deception subgraphs in an MDP, as far as I can tell, but maybe there’s another version of the argument. If the AI cares about things-in-the-world, then if it were a singleton it could set many nodes to desired values independently. For example, the nodes might represent variable settings for different parts of the universe—what’s going on in the asteroid belt, in Alpha Centauri, etc. But if it has to work with other agents (or, heaven forbid, be subjugated by them), it has fewer degrees of freedom in what-happens-in-the-universe. You can map copies of the "low control" configurations to the "high control" configurations several times, I think. (I think it should be possible to make precise what I mean by "control," in a way that should fairly neatly map back onto POWER-as-average-optimal-value.) So this implies a push for "control." One way to get control is manipulation or deception or other trickery, and so deception is one possible way this instrumental convergence "prophecy" could be fulfilled.

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=fRcjrTvBhk6ixzv2k

Table 1 of the paper (pg. 3) is a very nice visual of the different settings. For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be the case where it’s not greater/equal to its involution? Is this when the options in B are originally more optimal? Also, that theorem requires each involution to be greater/equal than the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?

Comment

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=8CntbW2zYGrMAyQTm

For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be the case where it’s not greater/​equal to it’s involution? Is this when the options in B are originally more optimal? I don’t think I understand the question. Can you rephrase? Also, that theorem requires each involution to be greater/​equal than the original. Is this just to get a lower bound on the n-multiple or do less-than involutions not add anything? Less-than involutions aren’t guaranteed to add anything. For example, if f(a)=1 iff a goes left and 0 otherwise, any involutions to plans going right will be 0, and all orbits will unanimously agree that left is greater f-value.

Comment

> I don’t think I understand the question. Can you rephrase?

Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand.

https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via?commentId=Dmg67bKxL4DYybvcK

Addendum: One lesson to take away is that quantilization doesn’t just depend on the base distribution being safe to sample from unconditionally. As the theorems hint, quantilization’s viability depends on base(plan | plan does something interesting) also being safe with high probability, because we could (and probably would) resample the agent until we get something interesting. In this post’s terminology, A := {safe interesting things}, B := {power-seeking interesting things}, and C := A ∪ B ∪ {uninteresting things}.
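A quick sketch of that point, with made-up proportions: if "interesting" plans are rare in the base distribution and mostly power-seeking, then resampling-until-interesting concentrates on them even though an unconditional sample is almost always safe:

```python
import random

# Made-up base distribution: 97% uninteresting, 1% safe-interesting (A),
# 2% power-seeking-interesting (B); C is the whole support.
BASE = (["uninteresting"] * 97
        + ["safe-interesting"] * 1
        + ["power-seeking-interesting"] * 2)

def sample_until_interesting(rng):
    """Resample the base distribution until the plan does something interesting."""
    while True:
        plan = rng.choice(BASE)
        if plan != "uninteresting":
            return plan

rng = random.Random(0)
draws = [sample_until_interesting(rng) for _ in range(3000)]
frac_power = draws.count("power-seeking-interesting") / len(draws)
print(round(frac_power, 2))  # ≈ 2/3: rare unconditionally, dominant conditionally
```

Unconditionally, a power-seeking plan has only 2% probability; conditioned on being interesting, it has 2/3, which is why the conditional distribution is the one that has to be safe.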