A space of proposals for building safe advanced AI

https://www.lesswrong.com/posts/S9GxuAEeQomnLkeNt/a-space-of-proposals-for-building-safe-advanced-ai

I liked Evan’s post on 11 proposals for safe AGI. However, I was a little confused about why he chose these specific proposals; it feels like we could generate many more by stitching together the different components he identifies, such as different types of amplification and different types of robustness tools. So I’m going to take a shot at describing a set of dimensions of variation which capture the key differences between these proposals, and thereby describe an underlying space of possible approaches to safety. Firstly I’ll quickly outline the proposals. Rohin’s overview of them is a good place to start—he categorises them as:

Comment

https://www.lesswrong.com/posts/S9GxuAEeQomnLkeNt/a-space-of-proposals-for-building-safe-advanced-ai?commentId=dsA2YugcbzREJx7Bz

This strikes me as a really interesting and innovative post, proposing a framework for systematically categorizing existing alignment proposals as well as helping to generate new ones. I’m kind of surprised that this post is almost 2 years old and yet only has one pingback and a few comments. Is there some other framework that has superseded this one, did people just forget about it, or is there not much comparative alignment work going on? One other framework I’ve seen that is kind of like this is "Training stories" from Evan Hubinger’s How do we become confident in the safety of a machine learning system? But that is more about evaluating alignment proposals (i.e. the very last part of the present post) rather than categorizing them along a consistent set of dimensions, which is the main focus here. So it actually serves a different purpose and isn’t much like this framework after all.

Comment

https://www.lesswrong.com/posts/S9GxuAEeQomnLkeNt/a-space-of-proposals-for-building-safe-advanced-ai?commentId=ZTkKHf8JQdmXLN2By

Debate: train M* to win debates against Amp(M).

I think Debate is closer to "train M* to win debates against itself as judged by Amp(M)".
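To make the two formulations easier to compare, here is a minimal self-contained sketch of one round of the debate game; all names and interfaces are hypothetical rather than anything taken from the post or the Debate paper. The same model M* argues both sides, and the judge slot is exactly what is at issue in the replies below: an unaided human H in the original formulation, or Amp(M) in the post's framing.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, utterance) pairs


@dataclass
class DebateRound:
    m_star: Callable[[str, Transcript, str], str]  # M*: (question, transcript so far, side) -> argument
    judge: Callable[[Transcript], str]             # judge: transcript -> winning side ("A" or "B")
    num_turns: int = 4

    def run(self, question: str) -> Tuple[Transcript, str]:
        transcript: Transcript = [("question", question)]
        for turn in range(self.num_turns):
            side = "A" if turn % 2 == 0 else "B"   # the same model M* plays both debaters
            transcript.append((side, self.m_star(question, transcript, side)))
        winner = self.judge(transcript)            # an unaided H, or Amp(M), depending on the framing
        return transcript, winner


# Toy usage: a trivial "debater" and a judge that always picks side "A".
toy = DebateRound(
    m_star=lambda q, t, side: f"[{side}] an argument about: {q}",
    judge=lambda t: "A",
)
print(toy.run("Is the bridge design safe?"))
```

Training would then reward whichever copy of M* the judge declares the winner; the disagreement below is only about who or what fills the judge slot.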

Comment

https://www.lesswrong.com/posts/S9GxuAEeQomnLkeNt/a-space-of-proposals-for-building-safe-advanced-ai?commentId=BLKWph67fd8irebDb

Wouldn’t it just be "train M* to win debates against itself as judged by H"? In the original formulation of debate, a human inspects the debate transcript without assistance.

Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)’s own standards.

Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.
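A rough sketch of that loop, again with purely hypothetical interfaces rather than code from the post: M* proposes an answer, Amp(M) (standing in for the judge plus one debater) cross-examines it, and M* is rewarded only if no flaw is surfaced.

```python
from typing import List, Optional, Protocol, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, utterance) pairs


class Answerer(Protocol):       # stands in for M*
    def answer(self, question: str) -> str: ...
    def respond(self, question: str, transcript: Transcript) -> str: ...


class Interrogator(Protocol):   # stands in for Amp(M), i.e. the judge plus one debater
    def challenge(self, question: str, transcript: Transcript) -> Optional[str]: ...
    def found_flaw(self, transcript: Transcript) -> bool: ...


def interrogation_episode(question: str, m_star: Answerer, amp_m: Interrogator,
                          max_exchanges: int = 5) -> Tuple[Transcript, float]:
    """M* answers the question; Amp(M) probes for flaws; M* is rewarded iff none are found."""
    transcript: Transcript = [("M*", m_star.answer(question))]
    for _ in range(max_exchanges):
        challenge = amp_m.challenge(question, transcript)
        if challenge is None:                      # Amp(M) has no further objections
            return transcript, 1.0                 # no flaw found: reward M*
        transcript.append(("Amp(M)", challenge))
        transcript.append(("M*", m_star.respond(question, transcript)))
    # Out of budget: Amp(M) decides whether the exchange exposed a genuine flaw.
    return transcript, 0.0 if amp_m.found_flaw(transcript) else 1.0
```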

Comment

I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)’s own standards.

This seems like a reasonable way to think of debate.

I think that, in practice (if this even means anything), the power of debate is quite bounded by the power of the human judge, so some other technique is needed to make the human capable of supervising complex debates, e.g. imitative amplification.
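One way to picture this, as a hedged sketch with hypothetical interfaces rather than a description of any existing system: the judge becomes an amplified human who can delegate subquestions about the transcript to the model M, so debates too complex for an unaided human can still be supervised; imitative amplification would then train M to imitate this human-plus-M composite.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, utterance) pairs from the debate


def amplified_judge(
    transcript: Transcript,
    human_decompose: Callable[[Transcript], List[str]],                 # H: transcript -> subquestions
    model_answer: Callable[[str], str],                                 # M: subquestion -> answer
    human_verdict: Callable[[Transcript, List[Tuple[str, str]]], str],  # H: transcript + help -> winner
) -> str:
    subquestions = human_decompose(transcript)                  # the human breaks the judging task down
    assistance = [(q, model_answer(q)) for q in subquestions]   # M answers the pieces the human can't
    return human_verdict(transcript, assistance)                # the human rules with M's help
```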