High-stakes alignment via adversarial training [Redwood Research report]

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood

Contents

Motivation

Here are two dimensions along which you could simplify the alignment problem (similar to the decomposition at the top of this post, or as discussed here):

The project: highly reliable injury filter

We picked a simple task that was as crisply defined and easy as possible while still requiring human evaluators: Given three sentences from a story, our system must write one more sentence without introducing any injuries to any beings in the story. In the analogy, we consider injuries to be "catastrophes", and our aim is to build techniques that can reliably avoid them, both in-distribution (where they may be rare but still present) and out-of-distribution (where they may be caused by natural distribution shift or deliberate adversarial attack from another agent). We focused on building a classifier to reliably detect injuries. We used this classifier to filter generations from a language model to make sure the generations never involve injuries.[1] We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable. A representation of our adversarial training loop. Starting from an initial story dataset consisting of prompts and generator completions, we trained a classifier to detect injurious completions. We then iteratively attacked our classifier using unaugmented humans, automatically paraphrased previous adversarial examples, and tool-assisted human rewrites, while training on the resulting adversarial examples. Here are the main things that we found to be worthwhile:

Our most powerful attack: Tool-assisted rewriting

We augmented human contractors with a few tools to help them find adversarial examples.

We can set very conservative thresholds

We made our classifier conservative enough to reject over half of the proposed completions. Even so, our raters think that the unfiltered completions are better only about 51% of the time (50% would mean no quality impact). With a conservative threshold applied, our baseline classifier (with no adversarial training applied) is quite reliable. If you use it to filter completions from our generator on in-distribution prompts, we estimate that the output will still have an injury only 0.003% of the time (whereas the original dataset had a 2.5% rate of injuriousness).

Our tool-assisted attack seems quite strong

Given the baseline classifier’s 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).

Adversarial training improved adversarial robustness

After training on several thousand adversarial examples, classifiers become significantly harder to break. The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example. Also, later (more adversarially trained) classifiers defended against most of the adversarial data targeted at earlier classifiers, but earlier classifiers were vulnerable to most of the adversarial data targeted at later classifiers. That suggests that more adversarial training patches issues without introducing many new ones.

Adversarial training didn’t hurt in-distribution performance

On our in-distribution random test set (100,000 examples), it appeared that adversarial training doesn’t hurt performance. The final classifier gets an estimated failure rate of 0.002% (as opposed to 0.003% without adversarial training), although the error bars are large. It also didn’t help performance, which is not too surprising – the adversarial examples are quite different from a random prompt and completion from our dataset.

Surprisingly rich conceptual considerations

At first, this project seemed conceptually quite simple: we’d just start with a baseline model and do adversarial training on it to make it very reliable. However, many details were more subtle than we expected. We ended up spending a lot of time working through conceptual and methodological issues, and we think we got a lot out of it. Here are some examples:

Once we have a reliable classifier, is it important to train a reliable generator with it or is it fine to just filter?

We started looking into using RL fine-tuning to make a reliable generator that doesn’t need any further classifier filtering. This would have required us to invent specialized RL algorithms designed for high reliability. However, we realized that the main thing you get out of generator fine-tuning is just performance competitiveness (since it might be expensive to repeatedly sample from the generator until you find something the classifier accepts). Most likely, you could use an easier method to reduce the alignment tax to nearly zero: use a normal RL algorithm to get 99.9% reliability and then make your classifier be an extra head on the generator so you can evaluate it nearly for free. That way, you can likely do filtering with only 0.1% extra cost.[2]

Should you let adversaries modify the completion arbitrarily?

Given that our task was to produce a classifier that filters a generator, it might have been reasonable to try to restrict ourselves to adversarial examples that can be produced by picking an adversarial prompt and inducing the generator into saying something that gets past the classifier. Instead, we let our adversarial attacks contain arbitrary completions, even ones that the generator would not have been likely to produce, making the adversary’s task easier. This enabled a practical improvement in our project, since restricting contractors to completions the model would have significantly reduced the rate of adversarial example generation. But this simplification might become even more important in the context of future advanced systems: We expect that deceptively aligned agents would behave very differently in rare or hard-to-construct situations that trigger a treacherous turn. Thus, since we might not be able to identify every possible trigger of treacherous behavior, we will try to make our classifier reliable in diverse types of unusual situations, which will hopefully cause acceptable behavior in any dangerous scenarios that are likely to arise.

Is it worth trying to train a policy to generate adversarial examples?

One natural idea you might have for this project is to make an adversarial policy that produces inputs that the classifier fails on. However, we shouldn’t expect that to work. Assuming you’re starting with the same pretrained base model (which we didn’t in this work but is what we expect will be the case when training transformative AI models), then both the classifier and the policy should "know" roughly the same things. That means that if the policy can learn to reliably output some set of injurious examples, then (given equivalent training data) the classifier should also have been able to learn that all of those examples are injurious, and none of them should be adversarial examples. It certainly may be worth testing that this holds up in practice, but this consideration made it unpromising enough that we didn’t bother trying it. Overall, we were surprised at the extent to which working on a concrete engineering project helped us think through conceptual questions.

Future work

Here are some directions that we’re especially excited about: Stronger and better-characterized adversarial attacks: Better attacks could increase the volume of adversarial data and increase the space of vulnerabilities covered. There are various directions one could imagine: more tools to assist humans, strong active learning, or mostly-automated attacks. We’d want to more rigorously measure how well different attacks work. Better ways to measure reliability: We’d like to have better techniques both in-distribution (where we want to detect extremely rare failures) and out-of-distribution (where we might be measuring e.g. the worst attacks that can be found and want to be sure we’re covering the whole space) Relaxed adversarial training: By requiring adversaries to come up with specific failing examples, adversarial training might place too high a burden on them. Some adversaries might be able to tell that a model would fail in a hypothetical situation even if they can’t construct an input corresponding to the situation directly (probably due to computational constraints). To give a contrived example: A model could fail if it sees a valid Bitcoin blockchain that’s long enough that it suggests it’s the year 2030. Even if the adversary knew that, it couldn’t come up with a valid input. So we need to "relax" the adversary’s task to let it supply "pseudo-inputs" of some sort.
We think there is a lot of useful work that can and should be done in adversarial training and adversarial evaluation. Here are some ways that you might be able to help:

Comment

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=onKoHfawepdomDboH

One of the primary questions that comes to mind for me is "well, did this whole thing actually work?". If I understand the paper correctly, while we definitely substantially decreased the fraction of random samples that got misclassified (which always seemed very likely to happen, and I am indeed a bit surprised at only getting it to move ~3 OOMs, which my guess is mostly capability related, since you used small models), we only doubled the amount of effort necessary to generate an adversarial counterexample. A doubling is still pretty substantial, and stacking a few doublings on top of each other could potentially result in safety guarantees, but my guess is the first doubling is a lot cheaper to get, and so I feel tempted to say that at least the set of techniques explored here, did not substantially increase adversarial robustness, unless I am missing something. This doesn’t mean that there isn’t more potential work in this space that maybe does increase adversarial robustness, and I might be misreading the paper, but when I first heard about this project, I was definitely hoping about some approach that might increase robustness by many orders of magnitude, making it impossible for an unassisted person to generate a counterexample, and very very difficult to generate a counterexample even with transparency tools. But it seems like the difficulty of a human generating a counterexample stayed mostly the same, which was (at least to me) the primary outcome variable I was interested in. I am curious whether any of the authors disagree with this assessment. My guess is most of you must, since neither the post nor the paper seems to talk about this very much.

Comment

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=NozfyPacQKwty9xJF

Excellent question—I wish we had included more of an answer to this in the post. I think we made some real progress on the defense side—but I 100% was hoping for more and agree we have a long way to go. I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven’t actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We’re interested in people giving it a shot! :) Part of what’s going on here is that IMO we made more progress on the offense side than the defense side. We didn’t measure that as rigorously, but it seemed like the token substitution tool speeds up the attack by something like 10x. I think the attack tools are one of the parts of the project I’m most excited about, and we might push even harder on them in upcoming projects. I think you’re correct that the amount of improvement was pretty limited by capabilities. 4,900 labels (for tool-assisted adversarial examples) just aren’t many for a 304M-parameter model. That’s one reason we don’t find it totally disturbing that the offense is winning for now; with much bigger models sample efficiency should increase dramatically, and the baseline cost of constructing the model will increase as well, making it comparatively cheaper to spend more on adversarial training. That said, this was just our first go and I think we can make a lot more progress right away. It’ll help to have a crisper task definition that doesn’t demand that the model understand thousands of possible ways injury could occur. It’ll help to have attacks that can produce higher volumes of data and cover a larger portion of the space. I don’t think we know how to really kill it yet, but I’m excited to keep working on it.

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=oRBBgqkY2xjjrmsFG

I do not understand the "we only doubled the amount of effort necessary to generate an adversarial counterexample.". Aren’t we talking about 3oom?

Comment

Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example."

Interesting. Changing the in-distribution (3oom) does not influences much the out-distribution (*2)

Comment

I think that 3 orders of magnitude is the comparison between "time taken to find a failure by randomly sampling" and "time taken to find a failure if you are deliberately looking using tools."

Comment

I read Oli’s comment as referring to the 2.4% → 0.002% failure rate improvement from filtering.

Comment

Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?

Comment

Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=4FMTn2PhjL5Tebuda

Looks like you use gradient magnitude as your saliency score. I’ve looked at using saliency to guide counterfactual modifications to a text, though my focus was on aiding interpretability rather than adversarial robustness. (Paper).

I’ve found that the normgrad saliency score worked well for highlighting important tokens. I.e., saliency = torch.sum(torch.pow(embedding * gradient, 2)).

For more details on normgrad, see: https://​​arxiv.org/​​pdf/​​2004.02866.pdf

Comment

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=KffuPnnh5EauAm5Gt

Neat! We tried one or two other saliency scores but there’s definitely a lot more experimentation to be done.

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=foCvG7crrtWnmjLoe

Strongly upvoted. This sort of thing seems extremely valuable in actually solving alignment; thank you for a helpful and interesting post!

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=5ZHD7bvnmaR93NoKm

Did you consider using the approach described in Ethan Perez’s "Red Teaming LMs with LMs"? This would mean using a new generator model to build many prompts, having the original generator complete those prompts, and then having a classifier identify any injurious examples in the completion. The tricky part seems to be that this assumes the classifier’s judgements are correct. If you trained the classifier on the examples identified by this process, it would only generate examples that are already labeled correctly by the classifier. To escape this problem, you could have humans grade whether each classification was correct, and train the classifier on incorrectly classified examples. But if humans have to review each example, this might not be any more useful than your original data labeling process using examples from human-written stories. I suppose it wouldn’t really add much value. Would you agree? Are there any related circumstances where the approach would be more useful? Maybe this would be a better question for Ethan...

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=9notGvLN4jbvewbrs

Take after talking with Daniel: for future work I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that’s analagous to the long term situation we care most about. If you care about both kinds of error then the baseline of "set a reallly low classifier threshold" wouldn’t work, so you’d be starting from a regime where it was a lot easier to sample errors, hence it will be easier to measure differences in performance.

Comment

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=mz4avnvFxWMNdD7Jh

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don’t differ much in performance even at noticeably higher thresholds. For future projects, I think I’m most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they’re extremely rare.

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=hnbpcvJzLMsPGryCi

Super work! It must have required a crazy amount of technical and manual work. The fact that you manage to reduce the number of failures by 3 orders of magnitude is quite impressive. Did you separate train and test set at the beginning of the project? In conclusion, using a 300M parameter model to supervise a 10 times larger model seems to work well, it gives a lot of hope.

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=ipZ83jatgDhJFncfP

Cool paper, great to see the project worked out! (: One question: How do you know the contractors weren’t just answering randomly (or were confused about the task) in your "quality after filtering" experiments (Table 4)? Is there agreement across contractors about the quality of completions (in case they saw the same completions)?

Comment

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=JnHifosERtLrBwbwY

Thanks! :) Good question. Surge ran some internal auditing processes for all our quality data collection. We also checked 100 random comparisons ourselves for an earlier round of data and they seemed reasonable: we only disagreed with 5 of them, and 4 of those were a disagreement between "equally good" /​ "equally bad" and an answer one way or the other. (There were another 10 that seemed a bit borderline.) We don’t have interrater reliability numbers here, though—that would be useful.

Comment

Would you be interested in running a wider array of few-shot performance benchmarks? No performance loss on few shot generation is a bold but defensible claim, and it would be great to have stronger evidence for it. I’d be really interested in doing the legwork here if you’d find it useful.

Fantastic paper, makes a strong case that rejection sampling should be part of the standard toolkit for deploying pretrained LMs on downstream tasks.

Comment

Given that our classifier has only been trained on 3 sentence prompts + 1 sentence completions, do you think it can be applied to normal benchmarks? It may well transfer okay to other formats.

Comment

For sure, benchmarking can still be useful even if the classifier is less powerful than it could be. My main question is: How well does the generator model perform after rejection sampling? You could imagine that the rejection sampling process is destructive for output quality. But the initial results from your paper indicate the opposite—rejection sampling does not reduce human preference for generated outputs, so we would hope that it does not reduce benchmark performance either. For example, MultiRC is a popular reading comprehension benchmark where the generator model takes a prompt consisting of a text passage, a question about the passage, and a list of possible answers. The generator then labels each answer as true or false, and is graded on its accuracy. Evaluation on MultiRC would show us how the classifier affects the generator’s QA skills. GPT-Neo has already been evaluated on a wide suite of benchmarks. To prove that rejection sampling is performance competitive with unfiltered generation, you would not need to achieve SOTA performance on these benchmarks—you’d simply need to be competitive with unfiltered GPT-Neo.

Comment

My worry is that it’s not clear what exactly we would learn. We might get performance degradation just because the classifier has been trained on such a different distribution (including a different generator). Or it might be totally fine because almost none of those tasks involve frequent mention of injury, making it trivial. Either way IMO it would be unclear how the result would transfer to a more coherent setting. (ETA: That said, I think it would be cool to train a new classifier on a more general language modeling task, possibly with a different notion of catastrophe, and then benchmark that!)

Comment

Coming back to this: Your concern makes sense to me. Your proposal to train a new classifier for filtered generation to improve performance on other tasks seems very interesting. I think it might also be useful to simply provide a nice open-source implementation of rejection sampling in a popular generator repo like Facebook’s OPT-175B, so that future researchers can build on it. I’m planning on working on technical AI safety full-time this summer. Right now I’m busy applying to a few different programs, but I’ll definitely follow up on this idea with you.

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=9cF6vDtfK2XabELoE

I’d be very interested in what’d happen if you replaced the classifier’s base model (deberta-v3) with a much larger model like GPT-3, and then fine-tuned it with only the initial violence-classification data.

I’d hypothesize that if the Natural Abstractions Hypothesis were true, it would imply that the resulting classifier would perform much better at classifying every category of adversarial training example, whereas if NAH were false, the larger model’s concept of violence would still be alien and thus not cover the adversarial examples. (Note this doesn’t say anything about NAH in the limit, ie whether if we scaled the model up even more whether it’d eventually switch to an incomprehensible ontology.) Does someone who doesn’t believe in NAH agree with such an experiment’s implications, such that one of us can use the results to update our beliefs?

It’d be awesome if someone at RR could test this hypothesis.

Comment

https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood?commentId=wScDMWz34pHLNQyjY

I don’t think I believe a strong version of the Natural Abstractions Hypothesis, but nevertheless my guess is GPT-3 would do quite a bit better. We’re definitely interested in trying this.