Thoughts on Dangerous Learned Optimization

https://www.lesswrong.com/posts/rzJ9FgCoxuqSR2zb5/thoughts-on-dangerous-learned-optimization

Confusions

Training with RL vs supervised learning

It seems fairly clear that an RL agent can exhibit ‘goal optimization’ behavior, but it seems much less clear that a big (non-RL) network like GPT-N would do this. For RL we are training the system to achieve goals by taking actions in the environment, and so goal optimization is a good strategy for performing well on this objective. But there are also ways to train GPT systems with RL algorithms, and in that case it seems the GPT system could develop ‘goal optimization’. I am confused about this, and about which part of training with RL (as opposed to standard supervised learning) leads to goal optimization. *It could be that RL algorithms train the system to optimize for rewards across time, while other training doesn’t.* This seems similar to myopia (the property that a system doesn’t attempt to optimize past parameter updates); if the system isn’t trained with any concept of time relevant to its reward, then it seems more likely to behave myopically. I definitely don’t think this is a watertight method for achieving a myopic system, but rather that it seems more difficult for non-myopia to develop if we only train with supervised learning. This seems fairly confusing though; it’s not intuitive that training a network on a very similar task but with a different algorithm would cause it to develop/not develop goal optimization. This makes me believe I might be missing something, and that maybe goal optimization can easily happen without RL. But it’s also possible RL algorithms make goal optimization more likely, perhaps by explicitly considering time or by making it instrumentally useful to consider the value of future states.
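To make the contrast concrete, here is a minimal, self-contained sketch of a per-token supervised objective next to an RL-style return that accumulates discounted reward over future steps. Everything in it (`ToyLM`, `reward_fn`, the tiny vocabulary) is invented for illustration and isn't from the post or any particular training setup.

```python
# Hypothetical toy sketch: contrasting a next-token supervised loss with an
# RL-style discounted return. All names here are illustrative assumptions.
import math
import random


class ToyLM:
    """A stand-in 'language model' over a tiny vocabulary (uniform, for simplicity)."""
    VOCAB = ["a", "b", "c"]

    def log_prob(self, next_token, context):
        # Uniform distribution: every token in the vocabulary is equally likely.
        return math.log(1.0 / len(self.VOCAB))

    def sample(self, context):
        return random.choice(self.VOCAB)


def supervised_loss(model, tokens):
    # Supervised (next-token) objective: each prediction is scored only against
    # the immediately following token; the loss never refers to later rewards
    # or to the consequences of earlier predictions.
    losses = [-model.log_prob(tokens[t + 1], tokens[: t + 1])
              for t in range(len(tokens) - 1)]
    return sum(losses) / len(losses)


def rl_return(model, prompt, reward_fn, gamma=0.99, horizon=20):
    # RL-style objective: a sampled trajectory is credited with discounted
    # reward accumulated over future steps, so the objective explicitly values
    # how earlier actions affect later states.
    tokens, ret = list(prompt), 0.0
    for t in range(horizon):
        tokens.append(model.sample(tokens))
        ret += (gamma ** t) * reward_fn(tokens)
    return ret  # a policy-gradient update would reinforce actions in proportion to this


if __name__ == "__main__":
    lm = ToyLM()
    print(supervised_loss(lm, ["a", "b", "c", "a"]))
    print(rl_return(lm, ["a"], reward_fn=lambda toks: float(toks[-1] == "b")))
```

The point of the sketch is only structural: the RL-style return ties earlier actions to later consequences, which is the ‘rewards across time’ ingredient speculated about above, while the supervised loss has no such dependence.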

Optimization isn’t a discrete property

Optimization is generally defined as a property of a network’s cognition, rather than its outputs/behavior. This means we could have a system whose behavior is identical to an optimizer’s without it actually using this style of cognition. So why is ‘optimization’ a useful concept, if it doesn’t describe a model’s behavior? It’s useful because it can help us predict what a model will do off-distribution.
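As a toy illustration (invented here, not from the post): the two functions below have identical outputs on a small ‘training distribution’, but only one of them searches internally, and knowing which cognition is inside is exactly what lets you predict the off-distribution behavior.

```python
# Hypothetical toy example: same behavior on-distribution, different cognition,
# and therefore different (predictable) behavior off-distribution.

def searcher(x):
    # Internal search: return the integer in [0, 100] whose square is closest to x.
    return min(range(101), key=lambda c: abs(c * c - x))


# A lookup table memorized only over the "training distribution" 0..9.
TABLE = {x: searcher(x) for x in range(10)}


def memorizer(x):
    # Identical outputs to `searcher` on 0..9, but no internal search;
    # off-distribution it just falls back to a default answer.
    return TABLE.get(x, 0)


# On-distribution: behaviorally indistinguishable.
assert all(searcher(x) == memorizer(x) for x in range(10))

# Off-distribution: the system that searches keeps "optimizing"; the one that
# memorized does not.
print(searcher(10_000), memorizer(10_000))  # -> 100 0
```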

Are people actually worried about this?

I am also very unsure as to whether people are actually worried about dangerous goal optimization happening with non-RL models. I have seen some talk about GPT mesa-optimizing, or deceptively aligned subnetworks in a larger model, and I don’t think these possibilities are particularly likely or dangerous. But I also don’t know how common or serious this worry is.

Conclusion

The term ‘optimization’ is often used to refer to an AI system optimizing for a specific goal, or to an AI system performing some kind of internal search process. In standard supervised learning, search processes seem likely and also don’t seem dangerous. I don’t see why goal optimization would happen in supervised learning, but I do think it is likely in RL. I think that any talk about mesa-optimization in language models or supervised learning needs to explain why we think a supervised-learning system would develop external goals rather than just (safe) internal search processes.

*Thanks to Vinah Hiremath and Justis Mills for feedback on this post. This work is supported by CEEALAR.*

Comment

https://www.lesswrong.com/posts/rzJ9FgCoxuqSR2zb5/thoughts-on-dangerous-learned-optimization?commentId=piTXubyeLyqdRF7CZ

I’m curious if you have thoughts about Eliezer’s scattered arguments on the dangerousness of optimization in the recent very-long-chats (first one). It seems to me that one relevant argument I can pin onto him is something like "well, but have you imagined what it would be like if this supposedly-benign optimization actually solved hard problems?" Like, it’s easy to say the words "A giant look-up table that has the same input-output function as a human, on the data it actually receives." But just saying the words isn’t imagining what this thing would actually be like. First, such a look-up table would be super-universally vast. But I think that’s not even the most important thing to think about when imagining it; the more important question is "how did this thing get made, since we’re not allowed to just postulate it into existence?" I interpret Eliezer as arguing that if you have to somehow make a giant look-up table that has the input-output behavior of a powerful optimizer on some dataset, practically speaking you’re going to end up with something that is also a powerful optimizer in many other domains, not something that safely draws from a uniform distribution over off-distribution behaviors.

Comment

https://www.lesswrong.com/posts/rzJ9FgCoxuqSR2zb5/thoughts-on-dangerous-learned-optimization?commentId=XkwDST6DvfZPFAKNA

My initial thought is that I don’t see why this powerful optimizer would attempt to optimize things in the world, rather than just do some search thing internally. I agree with your framing of "how did this thing get made, since we’re not allowed to just postulate it into existence?". I can imagine a language model which manages to output words which cause strokes in whoever reads its outputs, but I think you’d need a pretty strong case for why this would be made in practice by the training process. Say you have some powerful optimizer language model which answers questions. If you ask a question which is off its training distribution, I would expect it either to answer the question really well (e.g. it generalises properly), or to kinda break and answer the question badly. I don’t expect it to break in such a way that it suddenly decides to optimize for things in the real world. This would seem like a very strange jump to make, from ‘answer questions well’ to ‘attempt to change the state of the world according to some goal’. But if we trained the LM on ‘have a good ongoing conversation with a human’, such that the model was trained with reward over time and its behaviour affected its inputs (because it’s a conversation), then I think it might do dangerous optimization, because it was already performing optimization to affect the state of the world. And so a distributional shift could cause this goal optimization to be ‘pointed in the wrong direction’, or uncover places where the human and AI goals become unaligned (even though they were aligned on the training distribution).