Safety via selection for obedience

https://www.lesswrong.com/posts/7jNveWML34EsjCD4c/safety-via-selection-for-obedience

In a previous post, I argued that it’s plausible that "the most interesting and intelligent behaviour [of AGIs] won’t be directly incentivised by their reward functions"—instead, "many of the selection pressures exerted upon them will come from emergent interaction dynamics". If I’m right, and the easiest way to build AGI is using open-ended environments and reward functions, then we should be less optimistic about using scalable oversight techniques for the purposes of safety—since capabilities researchers won’t need good oversight techniques to get to AGI, and most training will occur in environments in which good and bad behaviour aren’t well-defined anyway. In this scenario, the best approach to improving safety might involve structural modifications to training environments to change the emergent incentives of agents, as I’ll explain in this post. My default example of the power of structural modifications is the evolution of altruism in humans. Consider Fletcher and Doebeli’s model of the development of altruism, which relies on assortment in repeated games—that is, when players with a tendency to cooperate end up playing together more often than random chance predicts. In humans, some of the mechanisms which lead to assortment are:

Comment

https://www.lesswrong.com/posts/7jNveWML34EsjCD4c/safety-via-selection-for-obedience?commentId=WJqTsZ4y5bmiSMb2K

Anything that humans would understand is a small subset of the space of possible languages. In order for A to talk to B in english, at some point, there has to be selection against A and B talking something else. One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn’t think is english. (Or some sort of text GAN that is just trained to tell A’s messages from real text) This still wouldn’t enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secrete messages encoded into the first letter of each word.