Clarifying "AI Alignment"

https://www.lesswrong.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment

Analogy

Consider a human assistant who is trying their hardest to do what H (the human operator) wants. I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem. "Aligned" doesn’t mean "perfect":

Clarifications

Postscript on terminological history

I originally described this problem as part of "the AI control problem," following Nick Bostrom’s usage in Superintelligence, and used "the alignment problem" to mean "understanding how to build AI systems that share human preferences/values" (which would include efforts to clarify human preferences/values).

I adopted the new terminology after some people expressed concern with "the control problem." There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like "put the AI in a really secure box so it can’t cause any trouble"). There currently seems to be a tentative consensus in favor of this approach to the control problem.

I don’t have a strong view about whether "alignment" should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like "understanding what humans want," "solving philosophy," etc.

This post was originally published here on 7th April 2018. The next post in this sequence will be published on Saturday, and will be "An Unaligned Benchmark" by Paul Christiano. Tomorrow’s AI Alignment Sequences post will be the first in a short new sequence of technical exercises from Scott Garrabrant.

Comments

https://www.lesswrong.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment?commentId=oZKsZdAL27sBrnrDy

Nominating this primarily for Rohin’s comment on the post, which was very illuminating.

https://www.lesswrong.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment?commentId=jodWtsXNrEcu594rR

This post crystallized my view of what the "core problem" is (as I explained in a comment on this post). I think I had intuitions of this form before, but at the very least this post clarified them.