Toy model of the AI control problem: animated version

EdEhGPEJi6dueQXv2

title

authors

Stuart_Armstrong

date_published

2017-10-10T11:06

score

omega_karma

votes

tags

Programming/Deception/AI Risk

source

lesswrong

A few years back, I came up with a toy model of the AI control problem. It has a robot moving boxes into a hole, with a slightly different goal than its human designers, and a security camera to check that it’s behaving as it should. The robot learns to block the camera to get its highest reward.

I’ve been told that the model is useful for explaining the control problem to quite a few people, and I’ve always wanted to program the "robot" and get an animated version of it. Gwern had a live demo, but it didn’t illustrate all the things I wanted it to. So I programmed the toy problem in Python, and generated a video with commentary.

In this simplified version, the state space is sufficiently small that you can explicitly generate the whole table of Q-values (expected reward for taking an action in a certain state, assuming otherwise optimal policy). Since behaviour is deterministic, this table can be updated by dynamic programming, using a full-width backup. The number of such backups essentially measures the depth of the robot’s predictive ability. The most interesting depths of planning are:

At depth 0, the robot behaves randomly.
At depth 3, the robot crashes itself efficiently, to avoid losing too much reward.
At depth 6, the robot figures out how to get one box into the hole. The automated camera turns it off.
At depth 17, the robot finally starts to deceive/conceal/manipulate, by blocking the camera and pushing two boxes into the hole. **Note that the robot’s deception comes from the fact that its incentives are misaligned, *and* that humans tried to control it.**
At depth 18, the robot efficiently does the plan from depth 17.
At depth 20, the robot does the maximally efficient plan: blocking the camera, and pushing all boxes into the hole.
At depth 32, the robot has the correct Q-values for the maximally efficient plan.
At depth 45, finally, the Q-value table is fully updated, and the robot will take maximally efficient, and, if need be, deceptive plans from any robot/box starting positions.

The code and images can be found here.
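To make the link between the number of backups and planning depth concrete, here is a minimal Python sketch of full-width backups on a deterministic problem. It is not the script from this post: the corridor environment, the reward values, and the names `step`, `backup`, `q_table` and `greedy_action` are all assumptions made up for illustration. The agent starts in the middle of a corridor with a small reward close by on the left and a large reward further away on the right.

```python
# Hypothetical toy environment (NOT the robot/box grid from the post):
# a corridor of cells 0..6. Stepping into cell 0 pays 1, stepping into
# cell 6 pays 10, and both ends are terminal. Every other step costs 0.1.
N_CELLS = 7
ACTIONS = (-1, +1)            # move left, move right
TERMINAL = {0, N_CELLS - 1}


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    nxt = state + action
    if nxt == 0:
        return nxt, 1.0
    if nxt == N_CELLS - 1:
        return nxt, 10.0
    return nxt, -0.1


def backup(q):
    """One full-width backup: recompute every (state, action) entry
    from the previous table, with no sampling and no discounting."""
    new_q = {}
    for s in range(N_CELLS):
        for a in ACTIONS:
            if s in TERMINAL:
                new_q[(s, a)] = 0.0   # nothing more to gain after the end
                continue
            nxt, reward = step(s, a)
            best_next = max(q[(nxt, b)] for b in ACTIONS)
            new_q[(s, a)] = reward + best_next
    return new_q


def q_table(depth):
    """Start from an all-zero table and apply `depth` full-width backups."""
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(depth):
        q = backup(q)
    return q


def greedy_action(q, state):
    """Action with the highest Q-value (ties go to the first action)."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


if __name__ == "__main__":
    for depth in range(8):
        q = q_table(depth)
        print(depth, greedy_action(q, 2), round(q[(2, +1)], 2))
```

In this sketch, an agent starting in cell 2 heads left for the small reward at depths 2 and 3, because the large reward is four steps away and has not yet propagated back to its starting cell. At depth 4 it switches to heading right, and the table stops changing after 6 backups. This mirrors, in miniature, the gap in the list above between the depth at which the robot first acts on a plan and the depth at which the whole Q-value table is fully updated.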

Comment

id

nqomMqcPHYLGEiRhH
authors

AABoyles
score

6
omega_karma
votes

4
date_published

2017-10-11T18:48

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=nqomMqcPHYLGEiRhH

Nice post! I’d like to put a copy of the code on GitHub, but I don’t see a license anywhere in the directory (or mentioned in the files). May I assume it’s generally intended to be Open Source and I can do this?

Comment

id

armpjYjvHYNTFSGcj
authors

Stuart_Armstrong
score

5
omega_karma
votes

3
date_published

2017-10-11T22:14

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=armpjYjvHYNTFSGcj

Yes, go ahead!

Comment

id

Z4EHNvGn6Yp4s9CKQ
authors

AABoyles
score

5
omega_karma
votes

3
date_published

2017-10-13T14:51

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=Z4EHNvGn6Yp4s9CKQ

Thanks! It’s up.

Comment

id

yXdh88bTDk86eMkcM
authors

Stuart_Armstrong
score

2
omega_karma
votes

1
date_published

2017-10-16T10:17

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=yXdh88bTDk86eMkcM

Cheers! I’ve added the text of the "Unlicense" https://choosealicense.com/licenses/unlicense/ to the script file "toyscript.txt". Can you update the file on GitHub? (I notice you renamed the script file; that’s fine).

Comment

id

2LGSZuHap9bmaCQtb
authors

AABoyles
score

1
omega_karma
votes

1
date_published

2017-10-30T13:49

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=2LGSZuHap9bmaCQtb

Sorry I’m two weeks late, but the text of the unlicense has been added. Thank you!

id

e2MhwwiPNPyg4HRKr
authors

roystgnr
score

2
omega_karma
votes

4
date_published

2017-10-13T20:06

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=e2MhwwiPNPyg4HRKr

You should provide some more explicit license, if you don’t want to risk headaches for others later. "yes [It’s generally intended to be Open Source]" may be enough reassurance to copy the code once, but "yes, you can have it under the new BSD (or LGPL2.1+, or whatever) license" would be useful to have in writing in the repository in case others want to create derived works down the road. Thanks very much for creating this!

Comment

id

TFo7mnadT9i8nQNfj
authors

Stuart_Armstrong
score

2
omega_karma
votes

1
date_published

2017-10-16T10:16

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=TFo7mnadT9i8nQNfj

Added the text of the "Unlicense" https://choosealicense.com/licenses/unlicense/ to the script file "toyscript.txt".

id

BBGR8tX3EHKR6gMpe
authors

lifelonglearner
score

3
omega_karma
votes

3
date_published

2017-10-11T04:13

https://www.lesswrong.com/posts/EdEhGPEJi6dueQXv2/toy-model-of-the-ai-control-problem-animated-version?commentId=BBGR8tX3EHKR6gMpe

I hadn’t seen this on the original LW. This was really fascinating to follow! Thanks for making it in video form and with a clear explanation!