Directions and desiderata for AI alignment

https://www.lesswrong.com/posts/kphJvksj5TndGapuh/directions-and-desiderata-for-ai-alignment

Contents

I. Research directions

1. Reliability and robustness

Traditional ML algorithms optimize a model or policy to perform well on the training distribution. These models can behave arbitrarily badly when we move away from the training distribution. Similarly, they can behave arbitrarily badly on a small part of the training distribution. I think this is bad news:

2. Oversight /​ reward learning

ML systems are typically trained by optimizing some objective over the training distribution. For this to yield "good" behavior, the objective needs to sufficiently close to what we really want. I think this is also bad news:

3. Deliberation and amplification

Machine learning is usually applied to tasks where feedback is readily available. The research problem in the previous section aims to obtain quick feedback in general by using human judgments as the "gold standard." But this approach breaks down if we want to exceed human performance. For example, it is easy to see how we could use machine learning to train ML systems to make human-level judgments about urban planning, by training them to produce plans that sound good to humans. But if we want to train an ML system to make superhuman judgments about how to lay out a city, it’s completely unclear how we could do it — without spending billions of dollars trying out the system’s ideas and telling it which ones work. This is a problem for the same reasons discussed in the preceding section. If our society is driven by systems superhumanly optimizing short-term proxies for what we care about — such as how much they impress humans, or how much money they make—then we are liable to head off in a direction which does not reflect our values or leave us in meaningful control of the situation. If we lowered our ambitions and decide that superhuman performance is inherently unsafe, we would be leaving huge amounts of value on the table. Moreover, this would be an unstable situation: it could last only as long as everyone with access to AI coordinated to pull their punches and handicap their AI systems. I’m aware of two approaches to this problem that seem like they could scale:

II. Desiderata

I’m most interested in algorithms that are secure, competitive, and scalable, and I think that most research programs are very unlikely to deliver these desiderata (this is why the lists above are so short). Since these desiderata are doing a lot of work in narrowing down the space of possible research directions, it seems worthwhile to be thoughtful and clear about them. It would be easy to gloss over any of them as obviously unobjectionable, but I would be more interested in people pushing back on the strong forms than implicitly accepting a milder form.

1. Secure

Many pieces of software work "well enough" most of the time; we often learn this not by a deep analysis but by just trying it and seeing what happens. "Works well enough" often breaks down when an adversary enters the prediction. Whether or not that’s a good way to build AI, I think it’s a bad way to do alignment research right now. Instead, we should try to come up with alignment solutions that work in the least convenient world, when nature itself is behaving adversarially. Accomplishing this requires argument and analysis, and cannot be exclusively or based on empirical observation. AI systems obviously won’t work well in the worst case (there is no such thing as a free lunch) but it’s reasonable to hope that our AI systems will never respond to a bad input by actively trying to hurt us —at least as long as we remain in physical control of the computing hardware, and the training process, etc. Why does security seem important?

2. Competitive

It’s easy to avoid building an unsafe AI system (for example: build a spreadsheet instead). The only question is how much you have to sacrifice to do it. Ideally we’ll be able to build benign AI systems that are just as efficient and capable as the best AI that we could build by any means. That means: we don’t have to additional domain-specific engineering work to align our systems, benign AI doesn’t require too much more data or computation, and our alignment techniques don’t force us to use particular techniques or restrict our choices in other ways. (More precisely, I would consider an alignment strategy a success if the additional costs are sublinear: if the fraction of resources that need to be spent on alignment research and run-time overhead decreases as the AI systems become more powerful, converging towards 0.) Why is competitiveness important? A. It’s easy to tell when a solution is plausibly competitive, but very hard to tell exactly how uncompetitive an uncompetitive solution will be. For example, if a purported alignment strategy requires an AI not to use technique or development strategy X, it’s easy to tell that this proposal isn’t competitive in general, but very hard to know exactly how uncompetitive it is. As in the security case, it seems very easy to look into the fog of the future and say "well this seems like it will probably be OK" and then to turn out to be too optimistic. If we hold ourselves to the higher standard of competitiveness, it is much easier to stay honest. Relatedly, we want alignment solutions that work across an extremely large range of techniques not just because we are uncertain about which techniques will be important, but because generalizing across all of the situations we can foresee is a good predictor of working for situations we can’t foresee. B. You can’t unilaterally use uncompetitive alignment techniques; we would need global coordination to avoid trouble. If we don’t know how to build competitive benign AI, then users/​designers of AI systems have to compromise efficiency in order to maintain reliable control over those systems. The most efficient systems will by default be built by whoever is willing to accept the largest risk of catastrophe (or perhaps by actors who consider unaligned AI a desirable outcome). It may be possible to avert this kind of race to the bottom by effective coordination by e.g. enforcing regulations which mandate adequate investments in alignment or restrict what kinds of AI are deployed. Enforcing such controls domestically is already a huge headache. But internationally things are even worse: a country that handicapped its AI industry in order to proceed cautiously would face the risk of being overtaken by a less prudent competitor, and avoiding that race would require effective international coordination. Ultimately society will be able and willing to pay some efficiency cost to reliably align AI with human interests. But the higher that cost, the harder the coordination problem that we will need to solve. I think the research community should be trying to make that coordination problem as easy as possible. (Previous posts: prosaic AI alignment, a possible stance for AI control, efficient and safely scalable)

3. Scalable

Over time, we are acquiring more data, more powerful computers, richer model classes, better optimization algorithms, better exploration strategies, and so on. If we extrapolate these trends, we end up with very powerful models and policies. Many approaches to alignment break down at some point in this extrapolation. For example, if we train an RL agent with a reward function which imperfectly approximates what we want, it is likely to fail once the agent becomes sufficiently sophisticated — unless the reward function itself becomes more sophisticated in parallel. In contrast, let’s say that a technique is "scalable" if it continues to work just as well even when the underlying learning becomes much more powerful. (See also: Eliezer’s more colorful "omnipotence test.") This is another extremely demanding requirement. It rules out many possible approaches to alignment. For example, it probably rules out any approach that involves hand-engineering reward functions. More subtly, I expect it will rule out any approach that requires hand-engineering an informative prior over human values (though some day we will hopefully find a scalable approach to IRL). Why is scalability important?

Aside: feasibility

One might reject these desiderata because they seem too demanding: it would be great if we had a secure, competitive and scalable approach to alignment, but that might not be possible. I am interested in trying to satisfy these desiderata despite the fact that they are quite demanding, for two reasons:

III. Conclusion

I think there is a lot of research to be done on AI alignment; we are limited by a lack of time and labor rather than by a lack of ideas about how to make progress. Research relevant to alignment is already underway; researchers and funders interested in alignment can get a lot of mileage by supporting and fleshing out existing research programs in relevant directions. I don’t think it is correct to assume that if anyone is working on a problem then it is going to get solved — even amongst things that aren’t literally at the "no one else is doing it" level, there are varying degrees of neglect. At the same time, the goals of alignment are sufficiently unusual that we shouldn’t be surprised or concerned to find ourselves doing unusual research. I think that area #3 on deliberation and amplification is almost completely empty, and will probably remain pretty empty until we have clearer statements of the problem or convincing demonstrations of work in that area. I think the distinguishing feature of research motivated by AI alignment should be an emphasis on secure, competitive, and scalable solutions. I think these are very demanding requirements that significantly narrow down the space of possible approaches and which are rarely explicitly considered in the current AI community. It may turn out that these requirements are infeasible; if so, one key output of alignment research will be a better understanding of the key obstacles. This understanding can help guide less ambitious alignment research, and can help us prepare for a future in which we won’t have a completely satisfying solution to AI alignment. This post has mostly focused on research that would translate directly into concrete systems. I think there is also a need for theoretical research building better abstractions for reasoning about optimization, security, selection, consequentialism, and so on. It is plausible to me that we will produce acceptable systems with our current conceptual machinery, but if we want to convincingly analyze those systems then I think we will need significant conceptual progress (and better concepts may lead us to different approaches). I think that practical and theoretical research will be attractive to different researchers, and I don’t have strong views about their relative value. This was originally posted here on 6th February 2017. Tomorrow’s AI Alignment Forum sequences post will be ‘Human-AI Interaction’ by Rohin Shah in the sequence on Value Learning. The next post in this sequence will be ‘The reward engineering problem’ by Paul Christiano, on Tuesday 15th Jan.

Comment

https://www.lesswrong.com/posts/kphJvksj5TndGapuh/directions-and-desiderata-for-ai-alignment?commentId=oaQNwCpFgEvMKDJZ5

The three directions of reliability/​robustness, reward learning, and amplification seem great, though robustness seems particularly hard to achieve. While there is current work on adversarial training, interpretability and verification, even if all of the problems that researchers currently work on were magically solved, I don’t have a story for how that leads to robustness of (say) an agent trained by iterated amplification. I am more conflicted about the desiderata. They seem very difficult to satisfy, and they don’t seem strictly necessary to achieve good outcomes. The underlying view here is that we should aim for something that we know is sufficient to achieve good outcomes, and only weaken our requirements if we find a fundamental obstacle. My main issue with this view is that even if it is true that the requirements are impossible to satisfy, it seems very hard to know this, and so we may spend a lot of time trying to satisfy these requirements and most of that work ends up being useless. I can imagine that we try to figure out ways to achieve robustness for several years in order to get a secure AI system, and it turns out that this is impossible to do in a way where we know it is robust, but in practice any AI system that we train will be sufficiently robust that it never fails catastrophically. In this world, we keep trying to achieve robustness, never find a fundamental obstruction, but also never succeed at creating a secure AI system. Another way of phrasing this is that I am pessimistic about the prospects of conceptual thinking, which seems to be the main way by which we could find a fundamental obstruction. (Theory and empirical experiments can build intuitions about what is and isn’t hard, but given the complexities of the real world it seems unlikely that either would give us the sort of crystallized knowledge that you’re aiming for.) Phrased this way, I put less credence in this opinion, because I think there are a few examples of conceptual thinking being very important, though not that many.