Why I’m Worried About AI

https://www.lesswrong.com/posts/k8hvGAJWSKAeHwpnJ/why-i-m-worried-about-ai

Contents

Preamble

In this sequence of posts, I want to lay out why I am worried about risks from powerful AI and where I think the specific dangers come from. In general, I think it’s good for people to be able to form their own inside views of what’s going on, and not just defer to people. There are surprisingly few descriptions of actual risk models written down. I think writing down your own version of the AI risk story is good for a few reasons

How do we train neural networks?

In the current paradigm of AI, we train neural networks to be good at tasks, and then we deploy them in the real world to perform those tasks. We train neural networks on a training distribution

Optimization

When we train our AI systems we are optimizing them to perform well on the training distribution. By optimizing I mean that we are modifying these systems such that they do well on some objective.

Optimized vs Optimizers

It is important to make a distinction between something which is *optimized *and something which is an optimizer. When we train our AI systems, we end up with an optimized system; the system has been optimized to perform well on a task, be that cat vs dog classification, predicting the next word in a sentence, or achieving a high score at Breakout. These systems have been optimized to do well on the objective we have given them, but they themselves (probably) aren’t optimizers; they don’t have any notion of improving on an objective. Our cat vs dog classifier likely just has a bunch of heuristics which influence the relative likelihood of ‘cat’ or ‘dog’. Our Breakout agent is probably running an algorithm which looks like "The ball is at position X, the platform is at position Y, so take action A", and not something like "The ball is at position X, the platform is at position Y, if I take action A it will give me a better score than action B, so take action A". We did the optimizing with our training and ended up with an optimized system. However, there are reasons to expect that we will get ‘optimizers’ as we build more powerful systems which operate in complex environments. AI systems can solve a task in 2 main ways (although the boundary here is fuzzy)

What do we tell the optimizers to do?

Assuming that we get optimizers, we need to be able to tell them what to do. By this I mean when we train a system to achieve a goal, we want that goal to actually be one that we want. This is the "Outer Alignment Problem". The classic example here is that we run a paperclip factory, so we tell our optimizing AI to make us some paperclips. This AI has no notion of anything else that we want or care about so it would sacrifice literally anything to make more paperclips. It starts by improving the factory we already have and making it more efficient. This still isn’t making the maximal number of paperclips, so it commissions several new factories. The human workers are slow, so it replaces them with toilless robots. At some point, the government gets suspicious of all these new factories, so the AI uses its powers of superhuman persuasion to convince them this is fine, and in fact, this is in the interest of National Security. This is still very slow compared to the maximal rate of paperclip production, so the AI designs some nanobots which convert anything made of metal into paperclips. At this point, it is fairly obvious to the humans that something is very, very wrong, but this feeling doesn’t last very long because soon the iron in the blood of every human is used to make paperclips (approximately 3 paperclips per person). This is obviously a fanciful story, but I think it points at an important point; it’s not enough to tell the AI what to do, we also have to be able to tell it what not to do. Humans have pretty specific values, and it seems extremely difficult to specify. There are more plausible stories we can tell which lead to similarly disastrous results.

How do we actually put the objective into the AI?

There is an additional (and maybe harder) problem: even if we knew how to specify the thing that we want, how do we put that objective into the AI? This is the ‘Inner Alignment Problem". This is related to the generalization behavior of neural networks; a network could learn a wide range of functions which perform well on the training distribution, but it will only have learned what we want it to learn if it performs ‘well’ on unseen inputs. Currently, neural networks generalize surprisingly well;

Deception

There is an additional danger if an AI system is ‘deliberately’ attempting to obscure its objective/​intentions from the humans training it. The term ‘deception’ is often used to refer to two different things which could happen when we train AI systems, which I outline here. If the AI is being trained on a difficult task, it might be easier for the AI to trick the evaluator (maybe a human) into giving a high reward, rather than actually doing well on the task. I’ll call this ‘Goodhart deception’ because the AI is ‘Goodharting’ the reward rather than optimizing for what humans actually want. Importantly, this doesn’t require the AI to have any objective or be optimizing for anything, the behavior which led to high reward (tricking the human) was just reinforced. This seems bad, but not as catastrophically bad as the other type of deception might be. The other type of deception is if an optimizing AI system intentionally deceives the humans about its true goals. In this scenario, the AI system develops an objective which is not aligned with the human objective. Here the objective is extended across time, which seems potentially like the default for learned objectives. The AI knows that if it attempts to directly go for its objective then it will either be turned off or be modified to remove this objective. So the AI will ‘pretend’ to not have this goal and instead ‘play along’ and do well on the task it is being trained for. After training, when the AI is deployed into the world, it is free to defect and pursue its own (misaligned) objective. I’ll call this ‘consequentialist deception’ because the AI is acting as a consequentialist (taking actions because of their consequences in the world, rather than just using mechanistic heuristics), or maybe just ‘deception’. This requires 3 (possibly likely) things to happen

Recap

So to recap:

In the next two posts I will lay out a more concrete story of how things go wrong, and then list some of my current confusions. *Thanks to Adam Jermyn and Oly Sourbut for helpful feedback on this post. *

Comment

https://www.lesswrong.com/posts/k8hvGAJWSKAeHwpnJ/why-i-m-worried-about-ai?commentId=WNeJWDad4YXFqxCYi

Nitpick: to the extent you want to talk about the classic example, paperclip maximisers are as much meant to illustrate (what we would now call) inner alignment failure. See Arbital on Paperclip ("The popular press has sometimes distorted the notion of a paperclip maximizer into a story about an AI running a paperclip factory that takes over the universe. [...] The concept of a ‘paperclip’ is not that it’s an explicit goal somebody foolishly gave an AI, or even a goal comprehensible in human terms at all.") or a couple of EY tweet threads about it: 1, 2

https://www.lesswrong.com/posts/k8hvGAJWSKAeHwpnJ/why-i-m-worried-about-ai?commentId=qpob9zoCzHv89BwoF

I’m getting more and more worried because the software I have dealt with in real life (as opposed to read about in scifi) is so defectively stupid it’s actually evil (or else the programmers are). Of course often it’s deliberately evil, like scanners that won’t work because the incorporated printer is out of ink.