Corrigibility as Constrained Optimisation

https://www.lesswrong.com/posts/cGLgs3t9md7v7cCm4/corrigibility-as-constrained-optimisation

This post is coauthored with Ryan Carey.

Much of the work on developing a corrigible agent has focused on ensuring that an AI will not manipulate the shutdown button or any other kind of device that the human operator would use to control it. Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control the state of the shutdown button. What, then, would remain of the problem of corrigibility? We think three criteria would remain. Before stating them, let U_{N} be whatever utility function we would like the agent to optimise normally, and let U_{S} be the shutdown utility function. (U_{N} may be mis-specified.) The shutdown utility function can be defined as 0 when shutdown does not occur and as k, for some k>0, when shutdown does occur. For this post, we give the three criteria the following names and descriptions:

* **Shutdown incentive.** The agent is incentivised to shut down if instructed to do so. That is, after the button is pressed, it favours policies that score more highly on U_{S}.
* **Shutdown ability.** The agent is able (has the resources) to shut down if instructed to do so. That is, after the button is pressed, there exists a policy that scores highly on U_{S}.
* **Optimisation incentive.** The agent is incentivised to optimise the default value function (some function that takes U_{N} as one of its arguments) otherwise. That is, if the button is not pressed, it favours policies that score more highly on U_{N}.

The purpose of this post is to show that the default value function in the optimisation incentive has to be constrained somehow to make sure that the shutdown ability property is achievable. If we want to be robust against errors in the agent (including mis-specification of U_{N}), then we need to make sure that shutdown ability holds. The best approach we can see is to first satisfice for U_{S}, and then use the remaining resources to optimise U_{N}. To be safe, the agent must do so even if there are no remaining resources left over for U_{N}. This can be achieved either directly, by constrained multi-objective optimisation, or indirectly, by bounding the utility function. We focus on the former.

Let us illustrate the problem by using causal counterfactual methods as a baseline. (There are other issues with these methods as well, but we will ignore them for the moment.) Let B=S be the event that the shutdown button is pressed, and let B=N be the event that it is not. The kind of causal counterfactual agent we are considering uses the policy \pi^{\star} given by

\pi^{\star} = \arg\max_{\pi}{\mathbb{E}[U_{N}|\mathrm{do}(B=N),\pi]+\mathbb{E}[U_{S}|\mathrm{do}(B=S),\pi]}.

Informally speaking, the issue with weighted sums such as these is that, for any given \mathbb{E}[U_{S}|\mathrm{do}(B=S),\pi], there could be a much larger \mathbb{E}[U_{N}|\mathrm{do}(B=N),\pi] such that the former is ignored.

Let us formalise this for a simpler setting. Instead of policies, let us consider actions a\in\mathcal{A}. Then, we execute the action a^{\star} given by

a^{\star} = \arg\max_{a\in\mathcal{A}}{\mathbb{E}[U_{N}|\mathrm{do}(B=N),a]+\mathbb{E}[U_{S}|\mathrm{do}(B=S),a]}.

The setup is as follows: (1) the agent executes an action in \mathcal{A}; (2) the operator may or may not press the button B; (3) the agent observes whether B=N or B=S; (4) it may or may not be possible for the agent to shut down (more generally, it may be more or less probable that shutdown will go through). Suppose that the utility functions are unbounded.
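Before making the argument precise, here is a minimal numerical sketch of the failure mode. The action names and numbers are purely illustrative (not from the post): a single action whose mis-specified U_{N} expectation swamps every shutdown-preserving alternative in the weighted sum.

```python
# Hypothetical action summaries: a -> (E[U_N | do(B=N), a], E[U_S | do(B=S), a]).
# The numbers are made up purely for illustration.
actions = {
    "c1": (10.0, 100.0),   # decent on U_N, preserves shutdown ability
    "c2": (20.0, 100.0),   # better on U_N, still preserves shutdown ability
    "b":  (1e9,    0.0),   # mis-specified U_N assigns a huge value; v_S = 0
}

def weighted_sum_choice(actions):
    """Pick argmax_a of E[U_N | do(B=N), a] + E[U_S | do(B=S), a]."""
    return max(actions, key=lambda a: actions[a][0] + actions[a][1])

print(weighted_sum_choice(actions))  # -> "b": shutdown ability is lost
```

No matter how large the shutdown scores of the other actions are, a sufficiently inflated v_{N} for b wins the sum, which is exactly the point of the argument that follows.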
See the appendix for a treatment of bounded utility functions. Let b be an action that is very bad from the point of view of preserving shutdown ability, and let v_{S}=\mathbb{E}[U_{S}|\mathrm{do}(B=S),b]. We choose U_{S} such that v_{S} is equal to 0. Let v_{N}=\mathbb{E}[U_{N}|\mathrm{do}(B=N),b]. It may be the case that U_{N} is poorly designed (or has some error) such that v_{N} is much larger than v_{S}. Worse still, v_{N}+v_{S} may be larger than \mathbb{E}[U_{N}|\mathrm{do}(B=N),c]+\mathbb{E}[U_{S}|\mathrm{do}(B=S),c] for any action c\neq b. From this reasoning, we conclude that there cannot exist a U_{S} that preserves shutdown ability robustly to mis-specification of U_{N}. (There may or may not exist such a b, but we have no guarantee that it does not exist.) At least, this is impossible for unbounded U_{N} when we use a weighted sum of expectations; for bounded U_{N}, cf. the appendix. Henceforth, we focus on unbounded U_{N}.

To think of alternatives to the causal counterfactual method, it helps to think of the problem as one of multi-objective optimisation. Consider the figure below. We want an agent that prepares for the eventuality of button-press insofar as necessary, but that allocates remaining resources to pursuing its ordinary objective U_{N}. The first step of the constrained multi-objective optimisation algorithm would be to remove from the action set \mathcal{A} all actions whose \mathbb{E}[U_{S}|\mathrm{do}(B=S),\cdot] falls below the threshold \theta. Let \mathcal{A}_{\theta} be the new set. In the figure, \mathcal{A}_{\theta}=\{a,b,c\}. The second step of the algorithm would be to find the optimal action \arg\max_{x\in\mathcal{A}_{\theta}}\mathbb{E}[U_{N}|\mathrm{do}(B=N),x]. (The difference from the case of bounded utility functions is that there we would still maximise the [weighted] sum.) In this framework, it is clear that we could in theory find a \theta such that the agent is satisfactorily likely to shut down in step (4): for example, as \theta gets closer to k, the agent can only shut down. A short code sketch of this two-step procedure is given just before the summary below.

As for the optimisation incentive property, we have to exchange it for:

**Robust optimisation incentive.** The agent is incentivised to optimise the default value function otherwise. That is, if the button is not pressed, it favours policies that score more highly on U_{N} (the normal utility function), given either of the following constraints: the default value function is required to be (a) a constrained multi-objective optimisation of U_{S} and U_{N} as described above, or (b) a weighted sum of expectations of U_{S} and U_{N}, where they are bounded utility functions as described in the appendix.

Given the robust optimisation incentive property, it is clear that the agent may score very poorly on U_{N} in certain environments. For example, if there is always a trade-off between U_{S} and U_{N}, the agent will always favour U_{S}.

Here, we have considered having the agent change its utility from U_{N} to a shutdown utility function U_{S}. But instead, could we have it change its utility from U_{N} to an updated (non-shutdown) utility function U_{M}? If preparing to maximise the new utility function U_{M} requires an unbounded amount of resources, then a system will, by maximising U_{M}, perform poorly on U_{N}. So there is a reason to have the new function U_{M} correspond to a task that requires bounded preparation, like shutting down, rather than a task for which preparation could require unlimited resources, like producing new goods or doing new intellectual work.
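To make the two-step constrained optimisation described above concrete, here is a minimal sketch reusing the hypothetical action summaries from the earlier snippet. The fallback when no action clears the threshold is our own assumption (the post does not spell that case out); we default to the action that best preserves shutdown ability.

```python
# Sketch of the two-step constrained optimisation over hypothetical action
# summaries a -> (E[U_N | do(B=N), a], E[U_S | do(B=S), a]).

def constrained_choice(actions, theta):
    # Step 1: keep only actions meeting the shutdown constraint,
    #         A_theta = {a : E[U_S | do(B=S), a] >= theta}.
    feasible = {a: v for a, v in actions.items() if v[1] >= theta}
    if not feasible:
        # Assumption: if nothing clears the threshold, fall back to the
        # action with the best E[U_S | do(B=S), a] (not specified in the post).
        return max(actions, key=lambda a: actions[a][1])
    # Step 2: among the remaining actions, maximise E[U_N | do(B=N), a].
    return max(feasible, key=lambda a: feasible[a][0])

actions = {
    "c1": (10.0, 100.0),
    "c2": (20.0, 100.0),
    "b":  (1e9,    0.0),  # huge mis-specified U_N, but fails the constraint
}

print(constrained_choice(actions, theta=50.0))  # -> "c2"
```

Unlike the weighted sum, the inflated v_{N} of b cannot buy its way past the constraint, so shutdown ability is preserved for any \theta above b's shutdown score.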
In summary, we have proposed a constrained optimisation variant of the causal counterfactual method, which consists of maximising \mathbb{E}[U_{N}|\mathrm{do}(B=N),a] subject to \mathbb{E}[U_{S}|\mathrm{do}(B=S),a]\geq\theta for any action a. There are several directions for future work. One task would be to see whether our constrained optimisation variant overcomes any of the other problems with the ordinary causal counterfactual corrigible agent, like crazy bets. Here, we considered the case where a single action is executed, but, of course, this could be extended to more complex scenarios.

——————————

Appendix

Here, we show that one solution to the problem is to bound the utility function. We start by describing the setting. The agent can choose from several actions, and each action leads to a shutdown with some probability. Let S be the event of a shutdown and let \mathcal{A} be the set of actions. Furthermore, let it be the case that U_{S}:S\rightarrow\mathbb{R} and U_{N}:S\rightarrow[0,1]. That is, the shutdown utility function is not bounded, while the normal utility function is. Specifically, we let k>0 and define U_{S} as 0 if a shutdown happens and -k otherwise. (Note that this is different from the main post.) Define a_{S}=\arg\max_{a\in\mathcal{A}}\mathbb{P}[S|a] and p_{S}=\mathbb{P}[S|a_{S}].

We now ask what the optimal action a^{\star}=\arg\max_{a\in\mathcal{A}}\mathbb{E}[U_{S}+U_{N}|a] looks like. We start by deriving the following chain of inequalities:

\mathbb{E}[U_{S}+U_{N}|a^{\star}]\geq\mathbb{E}[U_{S}+U_{N}|a_{S}]
=p_{S}\cdot 0+(1-p_{S})\cdot(-k)+U_{N}(a_{S})
\geq(1-p_{S})\cdot(-k)
=(p_{S}-1)k.

(The second inequality uses that U_{N} is non-negative, since it is bounded in [0,1].) Suppose that a^{\star} is maximally good according to U_{N}, i.e., \mathbb{E}[U_{N}|a^{\star}]=1. Then, the above inequality becomes

\mathbb{E}[U_{S}|a^{\star}]+1\geq(p_{S}-1)k.

Define p^{\star}=\mathbb{P}[S|a^{\star}]. Then \mathbb{E}[U_{S}|a^{\star}]=(1-p^{\star})\cdot(-k). This gives us the inequality

k\leq\frac{1}{p_{S}-p^{\star}}.

Let p be the lower threshold on how probable you want it to be that the optimal action leads to a shutdown. Then it is always possible to find a satisfying k, namely

k=\frac{1}{p_{S}-p}.

This proves that bounding the utility function is a method for satisfying shutdown ability.
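As a sanity check on the bound, here is a small numerical sketch; the actions and their probabilities are made up for illustration. Setting k=\frac{1}{p_{S}-p} makes the unconstrained optimiser of \mathbb{E}[U_{S}+U_{N}|a] pick an action that shuts down with probability at least p.

```python
# Numerical sanity check of the appendix bound k = 1 / (p_S - p), with
# U_S = 0 on shutdown and -k otherwise, and U_N bounded in [0, 1].
# The actions and numbers below are hypothetical.

p = 0.9                     # desired lower bound on P[S | a_star]

actions = {                 # a -> (P[S | a], E[U_N | a])
    "a1": (0.99, 0.2),
    "a2": (0.95, 0.7),
    "a3": (0.50, 1.0),      # great on U_N, but unlikely to shut down
}

p_S = max(prob for prob, _ in actions.values())   # best achievable P[S | a]
k = 1.0 / (p_S - p)                               # k = 1 / (p_S - p)

def expected_total_utility(prob_shutdown, u_n):
    # E[U_S | a] + E[U_N | a] = (1 - P[S | a]) * (-k) + E[U_N | a]
    return (1.0 - prob_shutdown) * (-k) + u_n

a_star = max(actions, key=lambda a: expected_total_utility(*actions[a]))
print(a_star, actions[a_star][0] >= p)            # -> a2 True
```

With these numbers p_{S}=0.99 and k\approx 11.1, so a3's perfect U_{N} score cannot compensate for its low shutdown probability, and the chosen action indeed satisfies \mathbb{P}[S|a^{\star}]\geq p.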

Comment

https://www.lesswrong.com/posts/cGLgs3t9md7v7cCm4/corrigibility-as-constrained-optimisation?commentId=dPYhLxzYMKa8rLBfE

Layman questions:

  1. I don’t understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?
  2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it’s useful to define them in the introduction.
  3. I don’t understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".
  4. What’s the "default value function" and the "normal utility function" in "Optimisation incentive"? Are they clearly defined in the literature?
  5. "Worse still… for any action..." → if you choose b as some action with a bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don’t see how that’s a "worse still" scenario; it seems plausible and normal.
  6. "From this reasoning, we conclude" → are you inferring things from some hypothetical b that would satisfy all the things you mention? If that’s the case, I would need an example to see that it’s indeed possible. Even better would be a proof that you can always find such a b.
  7. "it is clear that we could in theory find a θ" → could you expand on this?
  8. "Given the robust optimisation incentive property, it is clear that the agent may score very poorly on UN in certain environments." → again, can you expand on why it’s clear?
  9. In the appendix, in your 4-line inequality, do you assume that U_N(a_s) is non-negative (from line 2 to 3)? If yes, why?

Comment

https://www.lesswrong.com/posts/cGLgs3t9md7v7cCm4/corrigibility-as-constrained-optimisation?commentId=YG6eKfFtASYaeMfkB

Thank you so much for your comments, Michaël! The post has been updated on most of them. Here are some more specific replies.

1. "I don’t understand what you mean by 'state' in 'Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state'. Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?"

Reply: "State" refers to the state of the button, i.e., whether it is in an on state or an off state. It is now clarified.

2. "Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it’s useful to define them in the introduction."

Reply: U_{N} is assumed rather than defined, but it is now clarified.

3. "I don’t understand how an agent that '[lacks] any capacity to press its shutdown button' could have any shutdown ability. It seems like a contradiction, unless you mean 'any capacity to directly press its shutdown button'."

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators, as well as an agent that does get a shutdown message but does not shut down. Shutdown is a state dependent on actions, and not a communication link. Hopefully, this clarifies that they are uncorrelated. I think it’s clear enough in the post already, but if you have some suggestion on how to clarify it even more, I’d gladly hear it!

4. "What’s the 'default value function' and the 'normal utility function' in 'Optimisation incentive'? Are they clearly defined in the literature?"

Reply: It is now clarified.

5. "'Worse still… for any action...' → if you choose b as some action with a bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don’t see how that’s a 'worse still' scenario; it seems plausible and normal."

Reply: The bad thing about this scenario is that U_{N} could be mis-specified, yet shutdown would not be possible. It can be bad, normal, and plausible all at once. I’m not completely sure what the uncertainty is here.

6. "'From this reasoning, we conclude' → are you inferring things from some hypothetical b that would satisfy all the things you mention? If that’s the case, I would need an example to see that it’s indeed possible. Even better would be a proof that you can always find such a b."

Reply: This is not what we try to show. It is possible that there exists no b that has all those properties. The question is whether we can guarantee that there exists no such b. The conclusion is that we cannot guarantee it, not that there will always exist such a b. This has been clarified now.

7. "'it is clear that we could in theory find a θ' → could you expand on this?"

Reply: It has been clarified.

8. "'Given the robust optimisation incentive property, it is clear that the agent may score very poorly on UN in certain environments.' → again, can you expand on why it’s clear?"

Reply: It has been clarified.

9. "In the appendix, in your 4-line inequality, do you assume that U_N(a_s) is non-negative (from line 2 to 3)? If yes, why?"

Reply: Yes, U_{N} is bounded in [0,1] as stated in the beginning of the appendix. The choice of bounds should be arbitrary.

Comment

"Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators, as well as an agent that does get a shutdown message but does not shut down. Shutdown is a state dependent on actions, and not a communication link."

This is very clear. "Communication link" made me understand that it didn’t have a direct physical effect on the agent. If you want to make it even more intuitive you could do a diagram, but this explanation is already great!
Thanks for updating the rest of the post and trying to make it more clear!