Imitative Generalisation (AKA ‘Learning the Prior’)

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1

Tl;dr

We want to be able to supervise models with superhuman knowledge of the world and how to manipulate it. For this we need an overseer to be able to learn or access all the knowledge our models have, in order to understand the consequences of suggestions or decisions from the model. If the overseers don’t have access to all the same knowledge as the model, it may be easy for the model to deceive us, suggesting plans that look good to us but that may have serious negative consequences.

We might hope to access what the model knows just by training it to answer questions. However, we can only train on questions that humans are able to answer[1]. This gives us a problem that’s somewhat similar to the standard formulation of transduction: we have some labelled training set (questions humans can answer), and we want to transfer to an unlabelled dataset (questions we care about) that may be differently distributed.

We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if they know the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.

What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.

It’s worth noting there are enough serious open questions that imitative generalization is more of a research proposal than an algorithm!

This post is based on work done with Paul Christiano at OpenAI. Thanks very much to Evan Hubinger, Richard Ngo, William Saunders, Long Ouyang and others for helpful feedback, as well as Alice Fares for formatting help.

Goals of this post

This post tries to explain a simplified[2] version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the Prior’), and explain why a mechanism like this potentially addresses some of the safety problems with naïve approaches. First we’ll go through a simple example in a familiar domain, then explain the problems with the example. Then I’ll discuss the open questions for making Imitative Generalization actually work, and the connection with the Microscope AI idea. A more detailed explanation of exactly what the training objective is (with diagrams), and the correspondence with Bayesian inference, are in the appendix.

Example: using IG to avoid overfitting in image classification.

Here’s an example of using Imitative Generalization to get better performance on a standard ML task: image classification of dog breeds, with distributional shift.

Imagine we want to robustly learn to classify dog breeds, but the human labellers we have access to don’t actually know how to identify all the breeds[3], and we don’t have any identification guides or anything. However, we do have access to a labelled dataset D. We want to classify dogs in a different dataset D’, which is unlabelled.

One unfamiliar breed we want to learn to recognise is a husky. It happens that all the huskies in D are on snow, but in D’ some of them are on grass.

[Figure: an image from D, labelled ‘Husky’, and an OOD image from D’, labelled ‘???’.]

A NN architecture prior likely doesn’t favour the hypothesis ‘a husky is a large, fluffy dog that looks quite like a wolf’ over ‘if there are a lot of white pixels in the bottom half of the image, then it’s a husky’. These hypotheses perform equally well on the training data. So a naïve approach of fitting a model to D and then running it on D’ may easily misclassify huskies that are not on snow.

However, a human prior does favour the more sensible assumption (that the label husky refers to this fluffy wolf-like dog) over the other one (that the label husky refers to an image with many white pixels in the bottom half of the image). If we can use this human prior, we can avoid misclassifying huskies in D’ even if the two hypotheses perform equally well on D.

To apply the IG scheme here we’re going to jointly learn three things.

Ways that this specific example is unrealistic:

  • Firstly, our model may not be capable enough to learn the human likelihood/prior functions, even given plenty of IID examples. IG is easiest to analyze when we have ML capable of learning to imitate most IID human behavior. If our ML is more limited, the generalization will be determined by a combination of human capabilities and model capabilities.

  • This example isn’t very exciting, because classifying dogs is a problem that humanity has already solved. If we were actually doing this specific task in real life, we’d either give the workers a guide to identifying dog breeds, or let them look at D and learn the labels, and then label D’ for us. The IG scheme is only needed if this isn’t possible: for example, if there are no existing resources on how to identify dogs, and there are so many different dog breeds that it’s too hard to get our labellers to learn them without help from z*. Even then we might think that the labellers can just look at D and make their own notes on breed identification. IG is needed if this task is too difficult, e.g. if the relationship between visual characteristics and breeds is too complicated for humans to infer from a few examples.

Difficulties with this example that are also general problems

  • Optimising z is hard; we’d probably need a better way of representing z and exploring the space of zs than just searching over long strings of text. One way to improve might be to have our human labellers generate different hypotheses for what different breeds look like, then train a model to imitate this hypothesis generation.

  • Conveying the correct hypothesis in text might be hard. Explaining how to recognise different dog breeds is probably significantly harder than teaching a human by showing them some examples. Having human workers explicitly look for particular features that can be described in words may not be competitive with non-verbal, unconscious pattern-recognition. In this specific example, we might want to add visual information to z - for example, providing a diagram of key characteristics for each breed.

Key difficulties for IG

From the above problems, we can see that a core difficulty for this scheme is how to represent z. We want z to be able to capture everything that an NN could learn, and want it to be easy to optimise, but also need it to be human-understandable.

Text

One possibility is to have z be a long string of natural-language text. We have several problems:

Relationship with Microscope AI

My (Beth’s) paraphrasing of the Microscope AI proposal is as follows: We train ML models, then examine the representations they’ve learnt using our interpretability tools. We then use the knowledge and insight gained from these representations to help [amplified] humans do tasks, and never use the model directly. Also, if we spot problems with these representations we can prune or modify them to improve the usefulness, although this isn’t the central part of the proposal. This gives us safety properties in at least two ways:

Appendix

More detailed notation for training procedure

We have some space Z of objects that represent distributions over hypotheses.

We have some space of questions X and answers Y = {Yes, No}.

We have two sets of questions drawn from two different distributions: D (train) and D’ (test). The questions drawn from D have corresponding answers.

We train three models:

  • M^{prior}: Z \rightarrow [-\infty, 0] gives the log likelihood of z under the human prior.

  • M^L_{train}: Z \rightarrow (X \rightarrow (Y \rightarrow [-\infty, 0])) gives the log likelihood of some y given some x and z.

  • M^L_{test}: X \rightarrow (Y \rightarrow [-\infty, 0]) gives the log likelihood under z^* of some y given some x.

Step 1: We collect a dataset H^{prior} from humans by asking humans ‘how plausible is z?’ for the zs we’re searching over. We train M^{prior} to minimise loss on log(H^{prior}). We collect a dataset H^L_{train} from humans by asking humans ‘how plausible is y as an answer to x, assuming z?’ for each (x,y) \in D. We train M^L_{train}(z,x,y) to minimise loss on log(H^L_{train}).

Step 2: We search over Z to find z^* that maximises M^{prior}(z) + \sum_{(x,y) \in D} M^L_{train}(z,x,y).

Step 3: While we’re optimising z, we update H^{prior} and H^L_{train} to be on-distribution with the current z, and retrain M^{prior} and M^L_{train} appropriately.

Step 4: We collect a dataset H^L_{test} from humans by asking humans ‘how plausible is {yes/no} as an answer to x, assuming z^*?’ for each x in D’. We train M^L_{test} to minimise loss on log(H^L_{test}).

In practice, we’d want to share weights between M^{prior}, M^L_{train} and M^L_{test}.
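As a rough illustration of how Steps 1 to 4 fit together, here is a minimal Python sketch on a toy version of the husky example. Everything concrete in it is an assumption made for illustration: the two boolean features, the two candidate hypotheses, the numbers returned by `human_log_prior` and `human_log_likelihood` (which stand in for human judgements), and the use of lookup tables in place of trained models. It is not the training setup from the post, only the shape of the computation.

```python
import math

# Toy data (hypothetical): each image is reduced to two boolean features.
# D is labelled and every husky in it is on snow; D' is unlabelled.
D = [
    ({"wolf_like": True, "snow": True}, "yes"),    # husky on snow
    ({"wolf_like": False, "snow": False}, "no"),   # another breed on grass
]
D_PRIME = [
    {"wolf_like": True, "snow": False},   # husky on grass
    {"wolf_like": False, "snow": True},   # another breed on snow
]
ANSWERS = ("yes", "no")

# Candidate hypotheses z, written as text a human could read.
WOLF = "a husky is a large, fluffy, wolf-like dog"
SNOW = "many white pixels in the bottom half means husky"
Z = [WOLF, SNOW]


def human_log_prior(z):
    """Stand-in for asking a human 'how plausible is z?' (made-up values)."""
    return math.log(0.9) if z == WOLF else math.log(0.1)


def human_log_likelihood(z, x, y):
    """Stand-in for 'how plausible is y as an answer to x, assuming z?'."""
    feature = "wolf_like" if z == WOLF else "snow"
    implied = "yes" if x[feature] else "no"
    return math.log(0.95) if y == implied else math.log(0.05)


def freeze(x):
    """Make a feature dict hashable so it can be used as a table key."""
    return tuple(sorted(x.items()))


# Steps 1 and 3: collect human judgements for the zs being searched over and
# "train" M_prior and M_L_train on them (lookup tables stand in for ML).
M_prior = {z: human_log_prior(z) for z in Z}
M_L_train = {
    (z, freeze(x), y): human_log_likelihood(z, x, y) for z in Z for x, y in D
}


def score(z):
    """Step 2 objective: log prior plus summed log likelihood over D."""
    return M_prior[z] + sum(M_L_train[(z, freeze(x), y)] for x, y in D)


z_star = max(Z, key=score)

# Step 4: collect human judgements on D' assuming z_star, and "train" M_L_test.
M_L_test = {
    (freeze(x), y): human_log_likelihood(z_star, x, y)
    for x in D_PRIME
    for y in ANSWERS
}


def predict(x):
    """Most plausible answer for a question from D' under M_L_test."""
    return max(ANSWERS, key=lambda y: M_L_test[(freeze(x), y)])


print(z_star)
print([predict(x) for x in D_PRIME])
```

On this toy data both hypotheses fit D equally well, so the human prior alone decides z^*, and `predict` then labels the husky on grass as a husky. In a real setting the lookup tables would be replaced by models trained on samples of human judgements, and the search over Z would not be an exhaustive loop.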

Correspondence with Bayesian updating

Consider some Bayesian agent that has a prior over hypotheses, z^{prior}. D corresponds to data they want to update on, and D’ corresponds to some data they want to predict. They have some method log_likelihood() of computing the likelihood of some data given different distributions z over hypotheses; this corresponds to H^L_{train} and H^L_{test}.

If we replace H^{prior}(.) with -KL(.||z^{prior}), so that distributions further from the prior score worse, then what IG gives us is exactly the same as if the agent had performed the correct Bayesian update on all of D.

The agent’s predicted probability of (x’,y’) after seeing D is given by log_likelihood(z^*, x’, y’), with:

z^* = \arg\max_{z \in Z} [-KL(z||z^{prior}) + \sum_{(x,y) \in D} M^L_{train}(z,x,y)]

z^* corresponds to the agent’s posterior after seeing D.
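To make the correspondence a little more concrete, here is a short derivation sketch. It rests on assumptions that the post does not spell out: the answers in D are conditionally independent given a hypothesis h, the likelihood term for a distribution z is read as the expected log likelihood under h drawn from z, and the KL term enters as a penalty (with a negative sign). The symbols h, p and \mathcal{F} are introduced here for illustration only.

```latex
% Assumptions: z is a distribution over hypotheses h, z^{prior} is the prior,
% and p(D | h) = \prod_{(x,y) \in D} p(y | x, h).
\begin{align*}
\mathcal{F}(z)
  &= \mathbb{E}_{h \sim z}\Big[\sum_{(x,y) \in D} \log p(y \mid x, h)\Big]
     - \mathrm{KL}\big(z \,\|\, z^{prior}\big) \\
  &= \mathbb{E}_{h \sim z}\big[\log p(D \mid h) + \log z^{prior}(h) - \log z(h)\big] \\
  &= \mathbb{E}_{h \sim z}\big[\log p(h \mid D) + \log p(D) - \log z(h)\big] \\
  &= \log p(D) - \mathrm{KL}\big(z \,\|\, p(h \mid D)\big)
\end{align*}
% \log p(D) does not depend on z, and the remaining KL term is zero only when
% z equals the posterior, so \arg\max_z \mathcal{F}(z) = p(h \mid D).
```

Under these assumptions, maximising the score over all distributions z recovers the Bayesian posterior, which is the sense in which z^* plays the role of the agent’s posterior after seeing D.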

Diagrams

This is a picture of the whole scheme. The part circled in red is the output of the process: an ML model that we can use for answering questions in D’ without distributional shift.

There’s a lot going on here, so let’s break it down. The underlying structure is just induction: we have some prior and labelled dataset D; we use the prior and the likelihood to compute a score for z and we use this to optimise z. Then we use z^* to predict the labels in D’.

Using ML

We want to train models to imitate the human Prior and Likelihood functions. As long as we can sample H^{prior}(z) and H^L(d,z) for any z and d that we might want to evaluate our models on, we can ensure that the task is IID.

Making it easier for humans

Asking humans to directly report P(z) or P(d|z) is unreasonable. To make it easier for our labellers, we instead just ask them to compare pairs of ds or zs, and use this reward signal to fit M^L and M^{prior} (as in e.g. https://arxiv.org/pdf/1909.08593.pdf).

When we put these pieces into the induction setup above, we just need to compute the overall log likelihood log P(D|z) by summing log likelihoods for each d \in D, then add this to the log prior of z to get the posterior score for z.

Now we just need to learn to compute the likelihoods on examples from D’, given z^*. Separating out the likelihoods for D and D’ like this lets us ensure that each of the ML imitation tasks is IID.

Now the object circled in red is what we wanted all along: an ML model trained to label examples in D’, that approximates "How a human would answer if they’d looked at all the data in D".
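The idea of fitting a scoring model from pairwise comparisons can be sketched as follows. The post does not say which loss is used, so this example assumes a Bradley-Terry model, a common choice for pairwise preference data, fitted by plain gradient ascent. The item names `z_a`, `z_b`, `z_c` and the comparison counts are hypothetical.

```python
import math


def fit_scores(items, comparisons, steps=2000, lr=0.1):
    """Fit one real-valued score per item from (preferred, other) pairs.

    Assumes the Bradley-Terry model, P(a preferred to b) = sigmoid(s[a] - s[b]),
    and maximises the log likelihood of the comparisons by gradient ascent.
    """
    s = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for preferred, other in comparisons:
            p = 1.0 / (1.0 + math.exp(-(s[preferred] - s[other])))
            # d/ds log sigmoid(s[preferred] - s[other]) = +/- (1 - p)
            grad[preferred] += 1.0 - p
            grad[other] -= 1.0 - p
        for item in items:
            s[item] += lr * grad[item]
    return s


# Hypothetical labeller data: labellers prefer z_a to z_b three times out of
# four, and z_b to z_c three times out of four.
comparisons = (
    [("z_a", "z_b")] * 3 + [("z_b", "z_a")]
    + [("z_b", "z_c")] * 3 + [("z_c", "z_b")]
)
scores = fit_scores(["z_a", "z_b", "z_c"], comparisons)
print(sorted(scores, key=scores.get, reverse=True))
```

Scores fitted this way are only identified up to an additive constant: adding the same number to every score leaves all the comparison probabilities unchanged. That does not affect which z ranks highest on the prior term, but it is one reason the fitted values should not be read as calibrated log probabilities without further work.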

Notes

Comment

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=fjfKQYuhu8Nerpchw

A few things that I found helpful in reading this post:

  • I mentally replaced D with "the past" and D’ with "the future".

  • I mentally replaced z with "a guide to reasoning about the future".

This gives us a summary something like:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past, plus how well humans expect it to generalise to the future, plus immense amounts of interpretability work.

(Note that this summary was originally incorrect, and has been modified in response to Lanrian’s corrections below.)

Some concerns that arise from my understanding of this proposal:

  • It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements—this is the main reason that we need science.

  • z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can’t be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.

  • Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?

  • An agent can "generalise badly" because it’s not very robust, or because it’s actively pursuing goals that are misaligned with those of humans. It doesn’t seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?

Comment

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=6wYnteKCH7RLZ7xBu

> We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future

I don’t think this is right. I’ve put my proposed modifications in cursive:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don’t have ground-truth for the future, so we can’t test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future.

(It might be a good idea to share some parameters between the second and first network.)

Comment

Ooops, yes, this seems correct. I’ll edit mine accordingly.

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=P8YG2f8h252a2JMum

> Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?

As I understand it, there are two separate oracles. No oracle is rewarded for predicting correctly. One oracle is rewarded for coming up with good explanations. The other oracle is rewarded for correctly predicting the human’s guess, not the truth.

How do we predict the future with these two oracles? First, we search for the best explanation of the past. The best explanation of the past is one that 1. is a good explanation, and 2. leads a human who guesses with it to guess correctly. Then, we use the human-guess oracle to predict what the human guesses about the future, given the best explanation of the past.

Let’s say we are predicting the winner of a war given the number of soldiers. In the past, 50 won in 50 vs 5, and 150 won in 150 vs 15. In the future, there will be 50 vs 10. Three explanations of the past are suggested:

1. The side with more soldiers wins, otherwise random.
2. The side with an even number of soldiers wins against an odd number of soldiers, otherwise random.
3. If one side is exactly ten times as numerous as the other side, it wins, otherwise random.

All three explanations score perfectly against the past. 1 predicts 50 wins in the future; 2 and 3 predict random. IG prefers 1, because explanations 2 and 3 are crazy, although predictive of the past.

I think these are important differences:

1. The oracle is not trained to predict the future.
2. The explanation must be useful to a human, because the oracle predicts the human’s use of the explanation and can’t use the explanation directly.
3. The predicting oracle does not generate the explanation itself and has no control over it.
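The soldier example in this comment can be worked through numerically. The sketch below encodes the three explanations as functions returning the probability that the first side wins, which stands in for what a human would guess when handed each explanation. The prior values are hypothetical, chosen only so that the first explanation looks far more plausible than the other two; the comment itself gives no numbers.

```python
import math

# Past battles from the comment, as (side_a, side_b, side_a_won).
PAST = [(50, 5, True), (150, 15, True)]
FUTURE = (50, 10)


def more_soldiers(a, b):
    """Explanation 1: the larger side wins, otherwise random."""
    if a == b:
        return 0.5
    return 1.0 if a > b else 0.0


def even_beats_odd(a, b):
    """Explanation 2: an even-sized side beats an odd-sized one, otherwise random."""
    if a % 2 == b % 2:
        return 0.5
    return 1.0 if a % 2 == 0 else 0.0


def exactly_tenfold(a, b):
    """Explanation 3: a side exactly ten times the other wins, otherwise random."""
    if a == 10 * b:
        return 1.0
    if b == 10 * a:
        return 0.0
    return 0.5


# Hypothetical human prior over the explanations (made-up values).
EXPLANATIONS = {
    "more soldiers": (more_soldiers, 0.90),
    "even beats odd": (even_beats_odd, 0.05),
    "exactly tenfold": (exactly_tenfold, 0.05),
}


def log_likelihood(explain, battles):
    """Log probability the explanation assigns to the observed winners."""
    total = 0.0
    for a, b, a_won in battles:
        p = explain(a, b) if a_won else 1.0 - explain(a, b)
        total += math.log(p) if p > 0 else -math.inf
    return total


def score(name):
    """Log prior plus log likelihood on the past, as in the search for z*."""
    explain, prior = EXPLANATIONS[name]
    return math.log(prior) + log_likelihood(explain, PAST)


best = max(EXPLANATIONS, key=score)
forecasts = {name: fn(*FUTURE) for name, (fn, _) in EXPLANATIONS.items()}
print(best, forecasts)
```

All three explanations assign probability 1 to the past outcomes, so the likelihood term cannot separate them and the choice is made entirely by the prior. Only the first explanation then gives a confident forecast for 50 vs 10; the other two fall back to a coin flip.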

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=4tvzmZGJuAMYJBTs2

> It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.

Agree that humans are not necessarily great at assigning priors. The main response to this is that we don’t have a way to get better priors than an amplified human’s best prior. If amplified humans think the NN prior is better than their prior, they can always just use this prior. So in theory this should be both strictly better than the alternative, and the best possible prior we can use.

Science seems like it’s about collecting more data and measuring the likelihood, not changing the prior. We still need to use our prior: there are infinitely many scientific theories that fit the data, but we prefer ones that are simple and elegant.

> z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can’t be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.

One thing that helps a bit here is that we can use an amplified human. We also don’t need the human to calculate the prior directly, just to do things like assess whether some change makes the prior better or worse. But I’m not sure how much of a roadblock this is in practice, or what Paul thinks about this problem.

> Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?

Yeah, the important difference is that in this case there’s nothing that constrains the explanations to be the same as the actual reasoning the oracle is using, so the explanations you’re getting are not necessarily predictive of the kind of generalisation that will happen. In IG it’s important that the quality of z is measured by having humans use it to make predictions.

> An agent can "generalise badly" because it’s not very robust, or because it’s actively pursuing goals that are misaligned with those of humans. It doesn’t seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?

I’m not sure exactly what you’re asking. I think the proposal is motivated by something like: having the task be IID/being able to check arbitrary outputs from our model to make sure it’s generalising correctly buys us a lot of safety properties. If we have this guarantee, we only have to worry about rare or probabilistic defection, not that the model might be giving us misleading answers for every question we can’t check.

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=KbWTDcBYee7roTn2a

Planned summary for the Alignment Newsletter:

This post explains a simplified version of the scheme in Learning the prior (AN #109) with an image classification example.

A key issue for distributional shift is that neural nets assign significant "probability" to "crazy" hypotheses. Imagine that we want to train a neural net to classify dog breeds, and in our training dataset D all huskies are on snow, but on the test dataset D’ they may also be on grass. Then a neural net is perfectly happy with the hypothesis "if most of the bottom half of the image is white, then it is a husky", whereas humans would see that as crazy and would much prefer the hypothesis "a husky is a large, fluffy, wolf-like dog", even if they don’t know what a husky looks like.

Thus, we might say that the human "prior" over hypotheses is much better than the corresponding neural net "prior". So, let’s optimize our model using the human prior instead. In particular, we search for a hypothesis such that 1) humans think the hypothesis is likely (high human prior), and 2) the hypothesis leads humans to make good predictions on the training dataset D. Once we have this hypothesis, we have humans make predictions using that hypothesis on the test distribution D’, and train a model to imitate these predictions. We can then use this model to predict for the rest of D’. Notably, this model is now being used in an iid way (i.e. no distribution shift).

A key challenge here is how to represent the hypotheses that we’re optimizing over: they need to be amenable to ML-based optimization, but they also need to be interpretable to humans. A text-based hypothesis would likely be too cumbersome; it is possible that neural-net-based hypotheses could work if augmented by interpretability tools that let the humans understand the "knowledge" in the neural net (this is similar in spirit to Microscope AI (Chris Olah’s views on AGI safety)).

For more details on the setup, see the full post, or my previous summary.

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=MWeZXs8cdMWwbPj6j

The footnotes here seem broken. Any chance they could be fixed?

Comment

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=Cy86ZWw3sxpd2MyfG

Another very minor (but briefly confusing) nit: The notation in the ‘Example’ section is inconsistent between probabilities and log probabilities. It introduces H^{prior}(z) (etc.) as a probability, but then treats it as a log probability in the line starting with ‘We find the z^*’.

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=dDLCQaLdRPYQCdmSg

It seems like z* is meant to represent "what the human thinks the task is, based on looking at D". So why not just try to extract the posterior directly, instead of the prior and the likelihood separately? (And then it seems like this whole thing reduces to "ask a human to specify the task".)

Comment

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=eyeMDfqs4p947dZHv

We’re trying to address cases where the human isn’t actually able to update on all of D and form a posterior based on that. We’re trying to approximate ‘what the human posterior would be if they had been able to look at all of D’. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there. Does that make sense?

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=yc2QLuM98QWqPghM2

Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn’t give you any additional safety properties?

My understanding: After going through the process of finding z, you’ll have a z that’s probably too large for the human to fully utilise on their own, so you’ll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn’t have z, we could train an amplification/debate system on D’ anyway, while allowing the human and AIs to browse through D for any information that they need. I don’t see how the existence of z makes amplification or debate any more aligned, but it seems plausible that it could improve competitiveness a lot. Is that the intention?

Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (I.e., solve more problems with non-ridiculous amounts of compute; I’d be happy to count an exponential speedup as the latter.)

Comment

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=K3xKpjxHTdRkGiss2

> Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn’t give you any additional safety properties?

I think the distinction isn’t actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you’re just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the answer because of updating on a big dataset, you can either keep going through the big dataset when any question of this type comes up (very slow, so capability limitation), or not trust these answers (capability limitation), or just hope they’re correct (safety problem).

> Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (I.e., solve more problems with non-ridiculous amounts of compute; I’d be happy to count an exponential speedup as the latter.)

The latter :) I think the only way to get debate to be able to answer all the questions that debate+IG can answer is to include subtrees that are the size of your whole training dataset at arbitrary points in your debate tree, which I think counts as a ridiculous amount of compute.

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=YtmpuhXNPs9BXpQKg

FYI it looks like the footnote links are broken. (Linking to "about:blank...")

https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1?commentId=5mSEfa5PcNBZ4cWza

Is imitative generalization usually envisioned as a recursive many-iterations-process like IDA? Or is it just a single iteration of train the initial model → inspect and correct the priors → train the new model? Great post, by the way.