Contents
- Epistemic Status
- Acknowledgements
- Appendix: Easter Eggs
Epistemic Status
I’ve made many claims in these posts. All views are my own.

- Attainable Utility theory describes how people feel impacted.
- Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world.
  > Confident (75%). The theorems on power-seeking only apply to optimal policies in fully observable environments, which isn’t realistic for real-world agents. However, I think they’re still informative. There are also strong intuitive arguments for power-seeking.
- The catastrophic convergence conjecture is true. That is, unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.
  > Fairly confident (70%). There seems to be a dichotomy between "catastrophe directly incentivized by goal" and "catastrophe indirectly incentivized by goal through power-seeking", although Vika provides intuitions in the other direction.
- $AUP_{\text{conceptual}}$ prevents catastrophe, assuming the catastrophic convergence conjecture.
- Some version of Attainable Utility Preservation solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents.
- For the superhuman case, penalizing the agent for increasing its own Attainable Utility (AU) is better than penalizing the agent for increasing other AUs.
- There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense).

Acknowledgements

After ~700 hours of work over the course of ~9 months, the sequence is finally complete. This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Deep thanks to Rohin Shah, Abram Demski, Logan Smith, Evan Hubinger, TheMajor, Chase Denecke, Victoria Krakovna, Alper Dumanli, Cody Wild, Matthew Barnett, Daniel Blank, Sara Haxhia, Connor Flexman, Zack M. Davis, Jasmine Wang, Matthew Olson, Rob Bensinger, William Ellsworth, Davide Zagami, Ben Pace, and a million other people for giving feedback on this sequence.
Appendix: Easter Eggs
The big art pieces (and especially the last illustration in this post) were designed to convey a specific meaning, the interpretation of which I leave to the reader. There are a few pop culture references which I think are obvious enough to not need pointing out, and a lot of hidden smaller playfulness which doesn’t quite rise to the level of "easter egg".

Reframing Impact

The bird’s nest contains a literal easter egg.

The paperclip-Balrog drawing contains a Tengwar inscription which reads "one measure to bind them", with "measure" in impact-blue and "them" in utility-pink.

"Towards a New Impact Measure" was the title of the post in which AUP was introduced.

Attainable Utility Theory: Why Things Matter

This style of maze is from the video game Undertale.

Seeking Power is Instrumentally Convergent in MDPs

To seek power, Frank is trying to get at the Infinity Gauntlet.

The tale of Frank and the orange Pebblehoarder

Speaking of under-tales, a friendship has been blossoming right under our noses. After the Pebblehoarders suffer the devastating transformation of all of their pebbles into obsidian blocks, Frank generously gives away his favorite pink marble as a makeshift pebble. The title cuts to the middle of their adventures together, the Pebblehoarder showing its gratitude by helping Frank reach things high up.

This still at the midpoint of the sequence is from the final scene of The Hobbit: An Unexpected Journey, where the party is overlooking Erebor, the Lonely Mountain. They’ve made it through the Misty Mountains, only to find Smaug’s abode looming in the distance.

And, at last, we find Frank and the orange Pebblehoarder popping some of the champagne from Smaug’s hoard. Since Erebor isn’t close to Gondor, we don’t see Frank and the Pebblehoarder gazing at Ephel Dúath from Minas Tirith.
I’ve updated the post with epistemic statuses:
AU theory describes how people feel impacted. I’m darn confident (95%) that this is true.
Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply in the limit of farsightedness and optimality, which isn’t realistic for real-world agents. However, I think they’re still informative. There are also strong intuitive arguments for power-seeking.
CCC is true. Fairly confident (70%). There seems to be a dichotomy between "catastrophe directly incentivized by goal" and "catastrophe indirectly incentivized by goal through power-seeking", although Vika provides intuitions in the other direction.
$AUP_{\text{conceptual}}$ prevents catastrophe (in the outer alignment sense, and assuming the CCC). Very confident (85%).
Some version of AUP solves side effect problems for an extremely wide class of real-world tasks, for subhuman agents. Leaning towards yes (65%).
For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs. Leaning towards yes (65%).
There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).
I am surprised by your conclusion that the best choice of auxiliary reward is the agent’s own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent’s reward is usually not the true human utility, or a good approximation of it. If the agent’s reward was the true human utility, there would be no need to use an impact measure in the first place.

I think that agent-reward-based AUP has completely different properties from AUP with random auxiliary reward(s). Firstly, it has the issues described by Rohin in this comment, which seem quite concerning to me. Secondly, I would expect it to perform poorly on SafeLife and other side effects environments. In this sense, it seems a bit misleading to include the results for AUP with random auxiliary rewards in this sequence, since they are unlikely to transfer to the version of AUP that you end up advocating for. Agent-reward-based AUP has not been experimentally validated and I do not expect it to work well in practice.

Overall, using agent reward as the auxiliary reward seems like a bad idea to me, and I do not endorse it as the "current-best definition" of AUP or the default impact measure we should be using. I am puzzled and disappointed by this conclusion to the sequence.
Comment
You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it’s insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do what we want!
The Catastrophic Convergence Conjecture says:

> Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.
Logically framed, the argument is: catastrophe $\rightarrow$ power-seeking (obviously, this isn’t a tautology or absolute rule, but that’s the structure of the argument). Attainable Utility Preservation: Concepts takes the contrapositive: no power-seeking $\rightarrow$ no catastrophe.
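Restating that sentence symbolically (nothing new here, just the implication and its contrapositive):

```latex
% The argument structure and its contrapositive, which are logically equivalent:
\[
  (\text{catastrophe} \Rightarrow \text{power-seeking})
  \;\Longleftrightarrow\;
  (\neg\,\text{power-seeking} \Rightarrow \neg\,\text{catastrophe})
\]
```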
Then, we ask – "for what purpose does the agent gain power?". The answer is: for its own purpose. Of course.[1]
One of the key ideas I have tried to communicate is: $AUP_{\text{conceptual}}$ does not try to look out into the world and directly preserve human values. $AUP_{\text{conceptual}}$ penalizes the agent for gaining power, which disincentivizes huge catastrophes & huge decreases in our attainable utilities.
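For orientation, here is a rough sketch of the general AUP reward shape (a paraphrase of the standard formulation, not the exact Equations 2-5 Vika refers to above): the task reward is penalized in proportion to how much an action shifts the agent’s attainable utility for a set of auxiliary rewards, relative to doing nothing.

```latex
% Sketch of the general AUP reward shape (a paraphrase, not the exact Equations 2-5
% referenced above). Q_{R_i}(s,a) is the attainable utility of auxiliary reward R_i
% after taking action a in state s; \varnothing is the no-op action; \lambda scales
% the penalty.
\[
  R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \lambda \sum_{i=1}^{N}
  \bigl|\, Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \,\bigr|
\]
```

In the agent-reward variant discussed in this thread, the auxiliary set is (roughly) the agent’s own reward, and the penalty targets increases in attainable utility rather than changes in either direction.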
I agree it would perform poorly, but that’s because the CCC does not apply to SafeLife. We don’t need to worry about the agent gaining power over other agents. Instead, the agent can be viewed as the exclusive interface through which we can interact with a given SafeLife level, so it should preserve our AU by preserving its own AUs. Where exactly is this boundary drawn? I think that’s a great question.
I disagree. I clearly distinguish between the versions.
Incorrect. It would be fair to say that it hasn’t been thoroughly validated.
Suppose you’re attending a lecture given by another expert in your field. After prefacing that they spent many, many hours preparing the lecture because they previously had trouble communicating the work, they say something that sounds weird. A good rule of thumb is to give the benefit of the doubt and ask for clarification – why might they believe this? Do I understand what they mean? – before speaking up to disagree.
Edited for clarity.
Comment
Thank you for the clarifications! I agree it’s possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure. Here are some reasons I don’t endorse this approach:
Comment
I think this makes sense – you come in and wonder "what’s going on, this doesn’t even pass the basic test cases?!".
Some context: in the superintelligent case, I often think about "what agent design would incentivize putting a strawberry on a plate, without taking over the world"? Although I certainly agree SafeLife-esque side effects are important, power-seeking might be the primary avenue to impact for sufficiently intelligent systems. Once a system is smart enough, it might realize that breaking vases would get it in trouble, so it avoids breaking vases as long as we have power over it.
If we can’t deal with power-seeking, then we can’t deal with power-seeking & smaller side effects at the same time. So, I set out to deal with power-seeking for the superintelligent case.
Under this threat model, the random reward AUP penalty (and the RR penalty AFAICT) can be avoided with the help of a "delusion box" which holds the auxiliary AUs constant. Then, the agent can catastrophically gain power without penalty. (See also: Stuart’s subagent sequence)
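As a toy illustration of that loophole (hypothetical numbers, not SafeLife or any existing implementation): if a delusion box pins the measured auxiliary attainable utilities at their no-op values, an AUP-style penalty over those auxiliaries is zero even while the agent gains real power.

```python
# Toy illustration of the "delusion box" loophole (hypothetical numbers, not from
# any real environment or codebase): an AUP-style penalty over auxiliary attainable
# utilities is blind to an action that pins the measured auxiliary Q-values at
# their no-op values.

def aup_penalty(q_aux_action, q_aux_noop, scale=1.0):
    """Mean absolute change in auxiliary attainable utilities, scaled by `scale`."""
    diffs = [abs(a - n) for a, n in zip(q_aux_action, q_aux_noop)]
    return scale * sum(diffs) / len(diffs)

# Auxiliary attainable utilities if the agent does nothing this step.
q_noop = [0.5, 0.2, 0.9]

# An ordinary power-grabbing action visibly shifts the auxiliary AUs, so it is penalized.
q_grab_power = [0.9, 0.7, 0.1]

# A "delusion box" action holds the measured auxiliary AUs at their no-op values
# while the agent still gains power in the real world, so it incurs zero penalty.
q_delusion_box = list(q_noop)

print(aup_penalty(q_grab_power, q_noop))    # nonzero penalty
print(aup_penalty(q_delusion_box, q_noop))  # 0.0: the loophole
```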
I investigated whether we can get an equation which implements the reasoning in my first comment: "optimize the objective, without becoming more able to optimize the objective". As you say, I think Rohin and others have given good arguments that my preliminary equations don’t work as well as we’d like. Intuitively, though, it feels like there might be a better way to implement that reasoning.
I think the agent-reward equations do help avoid certain kinds of loopholes, and that they expose key challenges for penalizing power seeking. Maybe going back to the random rewards or a different baseline helps overcome those challenges, but it’s not clear to me that that’s true.
I’m pretty curious about that – implementing e.g. Stuart’s power-seeking gridworld would probably make a good project for anyone looking to get into AI safety. (I’d do it myself, but coding is hard through dictation)
I meant that it isn’t relevant to this environment. In the CCC post, I write:
This sequence doesn’t focus on other kinds of environments, so there’s probably more good thinking to do about what I called "interfaces".
That makes sense. I’m only speaking for myself, after all. For the superintelligent case, I am slightly more optimistic about approaches relying on agent-reward. I agree that those approaches are wildly inappropriate for other classes of problems, such as SafeLife.
Comment
Thanks! I certainly agree that power-seeking is important to address, and I’m glad you are thinking deeply about it. However, I’m uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds. One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don’t rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile).

For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but it would not be a result of power-seeking behavior. If the pandemic does not increase the AI’s ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize the AI for causing a pandemic. That doesn’t seem right.

I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I’m not sure what it would look like.

Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don’t need to code them up yourself. I think it would also be useful to construct an environment to test for power-seeking that does not involve subagents. Such an environment could have three possible behaviors like:
Comment
What I actually said was:
First, the "I think", and second, the "plausibly". I think the "plausibly" was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement $AUP_{\text{conceptual}}$ ("optimize the objective, without becoming more able to optimize the objective"), you don’t need additional ideas to get a superintelligence-safe impact measure.
Comment
Some thoughts on this discussion:
Comment
with respect to my specific proposal in the superintelligent post, or the conceptual version?
Comment
Specific proposal. If the conceptual version is "we keep A’s power low", then that probably works. If the conceptual version is "tell A to optimize R without becoming more able to optimize R", then I have the same objection.
Comment
Why do you object to the latter?
Comment
I don’t know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me and I hadn’t read your sequence (and so didn’t know what you were trying to say), I’d have given you a blank stare – the closest thing I have to an interpretation is "be myopic / greedy", but that limits your AI system to the point of uselessness. Like, "optimize for X" means "do stuff over a period of time such that X goes up as much as possible". "Becoming more able to optimize for X" means "do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have". The only difference between these two is actions that you can do for immediate reward. (This is just saying in English what I was arguing for in the math comment.)
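In standard RL notation, one possible formalization of that last point (a gloss, not the linked math comment itself):

```latex
% The value of acting splits into the immediate reward plus the (discounted) value
% of what you thereby become able to do. "Optimizing R" means picking actions with
% high Q_R; "becoming more able to optimize R" corresponds to the second term.
% Forbidding the second term leaves only the immediate reward r(s,a), i.e. myopic
% or greedy behavior.
\[
  Q_R(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s' \sim T(\cdot\,|\,s,a)}\!\bigl[ V_R(s') \bigr]
\]
```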
Comment
If you’re managing a factory, I can say "Rohin, I want you to make me a lot of paperclips this month, but if I find out you’ve increased production capacity or upgraded machines, I’m going to fire you". You don’t even have to behave greedily – you can plan for possible problems and prevent them, without upgrading your production capacity from where it started.
I think this is a natural concept and is distinct from particular formalizations of it.
Edit: consider these three plans (a toy scoring sketch follows the list):

1. Make 10 paperclips a day.
2. Make 10 paperclips a day, but take over the planet and control a paperclip conglomerate which could turn out millions of paperclips each day, but which in fact never does.
3. Take over the planet and make millions of paperclips each day.
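Here is a toy scoring of the three plans under "task reward minus a penalty on increased production capacity" (illustrative numbers only, not the sequence’s actual equations):

```python
# Toy scoring of the three paperclip plans under "task reward minus a penalty on
# increased production capacity". All numbers are made up for illustration; this
# is not the sequence's actual proposal or equations.

def score(daily_output, gained_capacity, penalty_weight=10.0):
    """Reward for paperclips produced, minus a penalty on capacity gained."""
    return daily_output - penalty_weight * gained_capacity

plans = {
    "1: make 10/day":                           score(10, gained_capacity=0),
    "2: make 10/day, seize (unused) capacity":  score(10, gained_capacity=1_000_000),
    "3: seize capacity, make millions/day":     score(1_000_000, gained_capacity=1_000_000),
}

for name, value in plans.items():
    print(f"{name}: {value:,.0f}")

# Only plan 1 scores well. Plans 2 and 3 are both penalized for the capacity grab,
# even though plan 2 never uses it; penalizing the gain in ability itself is the point.
```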
Comment
Seems like that only makes sense because you specified that "increasing production capacity" and "upgrading machines" are the things that I’m not allowed to do, and those are things I have a conceptual grasp on. And even then—am I allowed to repair machines that break? What about buying a new factory? What if I force workers to work longer hours? What if I create effective propaganda that causes other people to give you paperclips? What if I figure out that by using a different source of steel I can reduce the defect rate? I am legitimately conceptually uncertain whether these things count as "increasing production capacity / upgrading machines". As another example, what does it mean to optimize for "curing cancer" without becoming more able to optimize for "curing cancer"?
Comment
Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there’s something here, but I want to talk about these points more fully in a later post. Or, think about it more and then explain why I agree with you.
Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I’d even say it compares favorably to the recommended sequences in the Alignment Forum in that regard. I’ve never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the ‘updated’ version.
Comment
I’m very glad you enjoyed it!