A summary of aligning narrowly superhuman models

https://www.lesswrong.com/posts/TSxAXeHHhgSxR5wGZ/a-summary-of-aligning-narrowly-superhuman-models

Short intro

Progress in ML could allow us to do technical alignment research that is closer to the "real problem" than specific toy examples. In particular, one of the most promising directions (among others, such as interpretability and truthful/honest AI) is discovering better methods of giving feedback to models more capable than us. This might be easy for some concrete, narrow tasks, such as playing Go (there is an algorithmic way to decide whether one model plays Go better than another), but for "fuzzier" tasks (such as "devise a fair economic policy") we can't evaluate the model against such a gold standard. A couple of examples of work falling into this category are listed under Example tasks below.
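To make the contrast concrete, here is a minimal sketch of the kind of "gold standard" evaluation that exists for a narrow task like Go but not for fuzzy tasks: simply pit two agents against each other and count wins. The play_game interface and the agents are hypothetical placeholders I'm introducing for illustration, not any particular Go implementation.

```python
# Minimal sketch: for a narrow task like Go, "which model is better" has an
# algorithmic answer. play_game and the agents are hypothetical placeholders.
from typing import Any, Callable

def win_rate(play_game: Callable[[Any, Any], int],
             agent_a: Any, agent_b: Any, n_games: int = 100) -> float:
    """Fraction of games agent_a wins; > 0.5 suggests A plays better than B."""
    wins = 0
    for i in range(n_games):
        # Alternate who moves first so the comparison is fair.
        if i % 2 == 0:
            wins += int(play_game(agent_a, agent_b) > 0)  # +1: first player won
        else:
            wins += int(play_game(agent_b, agent_a) < 0)  # -1: second player won
    return wins / n_games

# For fuzzy tasks like "devise a fair economic policy" there is no play_game()
# returning an objective result, so we fall back on (fallible) human feedback.
```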

Example tasks

Example tasks from Ajeya’s proposal in Open Philanthropy’s Request for Proposals include:

Existing work

There is already some work in this general direction, most notably OpenAI using human feedback to finetune GPT-3 for summarisation, web search, and instruction following via reward modelling. Anthropic recently published a similar paper, but on a more diverse set of tasks. Redwood Research is working on a concrete problem in this space, namely getting a large language model to output stories that never include someone getting injured.
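To illustrate what "reward modelling" from human feedback means here, below is a minimal, self-contained sketch of training a reward model on pairwise human preferences (a labeller marks one of two completions as better). The toy feature vectors, dimensions, and training loop are assumptions for illustration only, not OpenAI's or Anthropic's actual setup.

```python
# Toy sketch of reward-model training on pairwise human preferences.
# Random feature vectors stand in for encoded (prompt, completion) pairs; in
# practice the reward model would be a finetuned language model, not a linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: the human-preferred completion should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

chosen = torch.randn(64, 16)    # completions labellers preferred (toy data)
rejected = torch.randn(64, 16)  # completions labellers rejected (toy data)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model then serves as a cheap proxy for human judgement
# when finetuning the language model (e.g. with RL against the reward model).
```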

Possible approaches to alignment

In the original post, what counts as alignment is intentionally left very open-ended, as part of the point is exploring promising proposals. To make the research useful in the long term, projects should strive to be:

Test our approach: sandwiching

To test a given alignment approach in practice, we can try to establish a baseline performance using training data and feedback from "empowered" humans, and then try to get as close as possible to that performance with "non-empowered" humans using, and providing feedback to, our model. There are different ways to construct the "empowered" and "non-empowered" sets of humans (for example, domain experts versus lay people); a rough sketch of how such an evaluation might be structured follows.
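The skeleton below is my own hypothetical rendering of the sandwiching setup, not part of the original proposal: train one copy of the model with feedback from the empowered humans to get a ceiling, train another copy with feedback only from the non-empowered humans using the technique under test, and have the empowered humans judge both. All function names and the scoring scheme are placeholder assumptions.

```python
# Hypothetical skeleton of a sandwiching evaluation; every callable here is a
# placeholder for whatever training / evaluation pipeline a project actually uses.
from typing import Any, Callable, Dict, List

def sandwiching_eval(
    base_model: Any,
    tasks: List[str],
    empowered_feedback: Callable,      # e.g. domain experts
    nonempowered_feedback: Callable,   # e.g. lay people using the alignment technique
    train_with_feedback: Callable[[Any, List[str], Callable], Any],
    empowered_evaluate: Callable[[Any, List[str]], float],  # gold-standard judgement
) -> Dict[str, float]:
    # Ceiling: model supervised directly by the empowered humans.
    ceiling_model = train_with_feedback(base_model, tasks, empowered_feedback)
    # Candidate: model supervised only by non-empowered humans plus the technique.
    candidate_model = train_with_feedback(base_model, tasks, nonempowered_feedback)

    ceiling = empowered_evaluate(ceiling_model, tasks)
    candidate = empowered_evaluate(candidate_model, tasks)
    # The technique succeeds to the extent the candidate closes the gap to the ceiling.
    return {"ceiling": ceiling, "candidate": candidate, "gap": ceiling - candidate}
```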

Critiques and cruxes

This particular summary of the discussion reflects how I interpreted the arguments, and some of them are my own.

How relevant is currently available "narrow alignment work" to TAI alignment?

Most of the benefits of this line of work hinge on how good a "training ground" aligning current models is for aligning more powerful models. In my opinion, the following factors, if true, would make the case especially strong:

Short timelines

If something like the scaling hypothesis holds, and we can reach transformative AI with approaches fairly similar to current ones, then the systems we eventually need to align will look much like today's models, and alignment techniques developed on current models are far more likely to transfer directly.

Current models are already narrowly superhuman in a relevant sense

Fixing a broken pocket calculator (which is undoubtedly superhuman at arithmetic), or even making Google return better results, isn't traditionally considered AI alignment work. Some people view GPT-3 as more of a fact-retrieval engine; if that is the case, maybe "alignment work" on GPT-3 is more similar to working on Google Search. Or, to give a less theoretical example, one could argue that transformative AI is likely to be agentic, and since language models are not agentic, a number of issues (such as inner alignment) are simply not applicable. (The proposal does not limit itself to language models, but this counterargument applies to other current models as well.) That being said, models such as GPT-3 or AlphaZero might have superhuman qualities that are relevant, such as a superhumanly rich latent world model, or a superhuman ability to "evaluate" many plans/concepts at once (many more than a human can hold in working memory).

No phase changes

While sandwiching is an interesting idea, Eliezer advocates caution: maybe we can align current weak models with "weak" humans using some clever techniques, but there is a chance these techniques break down once the model "figures out how to manipulate humans", "hacks itself", "changes its own environment", or "is simply optimising harder". Even if future models are architecturally similar to current ones, if stronger capabilities alone introduce some entirely new type of risk, this line of work is less valuable. Deception seems like a prime example of a failure mode we can't encounter with current models. This is a subset of the previous point, but an important one.

Transparency tools are possible

If we had strong transparency tools, a lot of suggested concrete avenues for alignment work would open up: we could provide better feedback to our models, verify their inner alignment (in particular, prevent deception), and audit them for robustness failures (see Automated Auditing). If we have a chance of getting better at these techniques, concrete research in this direction involving current models is very valuable.

Neglectedness

One could argue that this would be done by industry anyway, as companies have a clear incentive to make their models more useful. Ajeya counters that while there is related work, it is not exactly aimed at solving alignment in the long term. (For example, industry would not evaluate its approaches through sandwiching, or would cut corners and choose hacky solutions instead of general ones.) In addition, by setting a precedent and demonstrating that some approaches work, pushing for more human-feedback research could:

Open-ended alignment research vs evaluating concrete proposals

It has been suggested that we could instead evaluate concrete conceptual proposals, such as ascription universality, automated auditing, making models more honest, HCH, or Debate. To the extent that these proposals are already implementable, they indeed seem like very good candidates for "aligning narrowly superhuman models". However, it might be that these proposals are hard to implement precisely without direct guidance from a conceptual researcher like Paul Christiano. It also strikes me as a strong argument that conceptual work can benefit from trying things out in practice, and that we might discover new approaches not previously considered. (Thus, allowing more open-ended exploration makes sense.)