Iterated Distillation and Amplification

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1

Motivation: The alignment/​capabilities tradeoff

Assume that we want to train a learner A to perform some complex fuzzy task, e.g. "Be a good personal assistant." Assume that A is capable of learning to perform the task at a superhuman level; that is, if we could perfectly specify a "personal assistant" objective function and trained A to maximize it, then A would become a far better personal assistant than any human. There is a spectrum of possibilities for how we might train A to do this task. On one end, there are techniques which allow the learner to discover powerful, novel policies that improve upon human capabilities:

Core concept: Analogy to AlphaGoZero

The core idea of Paul’s scheme is similar to AlphaGoZero (AGZ): We use a learned model many times as a subroutine in a more powerful decision-making process, and then re-train the model to imitate those better decisions. AGZ’s policy network p is the learned model. At each iteration, AGZ selects moves by an expensive Monte Carlo Tree Search (MCTS) which uses policy p as its prior; p is then trained to directly predict the distribution of moves that MCTS ultimately settles on. In the next iteration, MCTS is run using the new, more accurate p, and p is trained to predict the eventual outcome of that process, and so on. After enough iterations, a fixed point is reached: p is unable to learn how running MCTS will change its current probabilities.

MCTS is an amplification of p: it uses p as a subroutine in a larger process that ultimately makes better moves than p alone could. In turn, p is a distillation of MCTS: it learns to directly guess the results of running MCTS, achieving comparable performance while short-cutting the expensive computation. The idea of IDA is to use the basic iterated distillation and amplification procedure in a much more general domain.
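The amplify/distill loop and its fixed point can be illustrated with a deliberately tiny sketch. It is not AGZ: the `amplify` function below is a hypothetical stand-in for MCTS that simply reweights the prior by a made-up value for each move, and `distill` nudges the policy toward the amplified distribution instead of training a network. The move values, step size, and tolerance are all assumptions chosen for illustration.

```python
import math


def amplify(prior, move_values):
    """Toy stand-in for MCTS: reweight the prior by exp(value) of each move."""
    weights = [p * math.exp(v) for p, v in zip(prior, move_values)]
    total = sum(weights)
    return [w / total for w in weights]


def distill(policy, target, step=0.5):
    """Move the policy part of the way toward the amplified distribution."""
    return [(1 - step) * p + step * t for p, t in zip(policy, target)]


def iterate_to_fixed_point(move_values, tolerance=1e-6, max_iterations=10_000):
    """Alternate amplify and distill until amplification barely changes the policy."""
    policy = [1.0 / len(move_values)] * len(move_values)  # uniform starting policy
    for iteration in range(1, max_iterations + 1):
        target = amplify(policy, move_values)
        change = max(abs(t - p) for t, p in zip(target, policy))
        if change < tolerance:
            return policy, iteration  # approximate fixed point reached
        policy = distill(policy, target)
    return policy, max_iterations


# Hypothetical move values; the second move is the best one.
policy, iterations = iterate_to_fixed_point([0.1, 0.5, 0.3])
print(policy, iterations)
```

In this toy setting the loop stops once amplification no longer changes the policy noticeably, which mirrors the fixed point described above; here that happens when nearly all probability sits on the highest-valued move.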

The IDA Scheme

IDA involves repeatedly improving a learned model through an amplification and distillation process over multiple iterations.

Amplification is interactive and human-directed in IDA

In AGZ, the amplification procedure is Monte Carlo Tree Search: it’s a simple and well-understood algorithm, and there’s a clear mechanism for how it improves on the policy network’s original choices (it traverses the game tree more deeply). But in IDA, amplification is not necessarily a fixed algorithm that can be written down once and repeatedly applied; it’s an interactive process directed by human decisions.

In most domains, humans are capable of improving their native capabilities by delegating to assistants (e.g. because CEOs can delegate tasks to a large team, they can produce orders of magnitude more output per day than they could on their own). This means if our learning procedure can create an adequate helper for the human, the human can use the AI to amplify their ability: this human/AI system may be capable of doing things that the human couldn’t manage on their own.

Below I consider the example of using IDA to build a superhuman personal assistant. Let A[t] refer to the state of the learned model after the end of iteration t; the initial agent A[0] is trained by a human overseer H.

Example: Building a superhuman personal assistant

H trains A[0] using a technique from the narrow end of the spectrum, such as imitation learning. Here we are imagining a much more powerful version of "imitation learning" than current systems are actually capable of: we assume that A[0] can acquire nearly human-level capabilities through this process. That is, the trained A[0] model executes all the tasks of a personal assistant as H would (including comprehending English instructions, writing emails, putting together a meeting schedule, etc).

Even though A[0] cannot discover any novel capabilities, it has two key advantages over H: it can run much faster, and many copies or versions of it can be run at once. We hope to leverage these advantages to construct a larger system, involving H and many copies of A[0], that will substantially improve on H’s capabilities while preserving alignment with H’s values.

H can use calls to A[0] (along with other tools such as external memory) to become a better personal assistant. For example, H could assign one copy of A[0] to figuring out the best time to schedule the client’s recurring team meetings, another copy to figure out what to order the client for lunch, another copy to balance the client’s personal budget, etc. H now has the ability to get very quick solutions to sub-problems that are roughly as good as the ones H would have come up with on their own over a longer time period, and can combine these results to make much better decisions than an unaided human.

Let Amplify(H, A[0]) refer to the larger system of H + many copies of A[0] + aids. Compared to A[0] alone, the Amplify(H, A[0]) system has much higher time and resource costs but its eventual decisions are much better. Moreover, because in each of its individual decisions each copy of A[0] continues to act just as a human personal assistant would act, we can hope that Amplify(H, A[0]) preserves alignment.

In the next iteration of training, the Amplify(H, A[0]) system takes over the role of H as the overseer. A[1] is trained with narrow and safe techniques to quickly reproduce the results of Amplify(H, A[0]). Because we assumed Amplify(H, A[0]) was aligned, we can hope that A[1] is also aligned if it is trained using sufficiently narrow techniques which introduce no new behaviors. A[1] is then used in Amplify(H, A[1]), which serves as an overseer to train A[2], and so on.

Pseudocode

def IDA(H):
    A <- random initialization
    repeat:
        A <- Distill(Amplify(H, A))

def Distill(overseer):
    """
    Returns an AI trained using narrow, robust techniques to perform
    a task that the overseer already understands how to perform.
    """

def Amplify(human, AI):
    """
    Interactive process in which human uses many calls to AI to
    improve on human's native performance at relevant task(s).
    """

What properties must hold for IDA to work?

The IDA scheme is a template with "slots" for Amplify and Distill procedures that have not been fully specified yet; in fact, they rely on capabilities we don’t yet have. Because IDA itself is not fully specified, it’s not clear what minimal set of properties are necessary for it to succeed.
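To make the template concrete, here is a minimal runnable sketch with one hypothetical way of filling the slots. Nothing in it comes from the post: the task (summing a list of integers), the toy `human` who can only answer lists of length at most one, the `amplify` rule of splitting a task in half, and the assumption that `distill` imitates its overseer perfectly are all invented for illustration.

```python
from typing import Callable, List, Optional

Task = List[int]
# An agent (or overseer) maps a task to an answer, or None if it cannot solve it.
Agent = Callable[[Task], Optional[int]]


def human(task: Task) -> Optional[int]:
    """Toy H: can only answer trivially small tasks without help."""
    if len(task) == 0:
        return 0
    if len(task) == 1:
        return task[0]
    return None


def amplify(overseer_human: Agent, agent: Agent) -> Agent:
    """Toy Amplify: H answers directly if possible, else delegates two halves."""
    def amplified(task: Task) -> Optional[int]:
        direct = overseer_human(task)
        if direct is not None:
            return direct
        middle = len(task) // 2
        left, right = agent(task[:middle]), agent(task[middle:])
        if left is None or right is None:
            return None  # the assistants were not capable enough yet
        return left + right  # H combines the two sub-answers
    return amplified


def distill(overseer: Agent) -> Agent:
    """Toy Distill: assumes perfect imitation, so it just wraps the overseer."""
    return lambda task: overseer(task)


def ida(h: Agent, initial_agent: Agent, iterations: int) -> Agent:
    """Runs A <- Distill(Amplify(H, A)) a fixed number of times."""
    agent = initial_agent
    for _ in range(iterations):
        agent = distill(amplify(h, agent))
    return agent


def unhelpful(task: Task) -> Optional[int]:
    """Stand-in for a randomly initialized A that cannot solve anything."""
    return None


agent_3 = ida(human, unhelpful, iterations=3)
print(agent_3([1, 2, 3, 4]))     # prints 10
print(agent_3([1, 2, 3, 4, 5]))  # prints None: still too long for this agent
```

In the first iteration the unhelpful agent contributes nothing, so the amplified system reduces to what the toy H can do alone; each later iteration doubles the list length the agent can handle. A real Distill would be a learning procedure with its own errors and costs, which this sketch ignores entirely.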

Achieving alignment and high capability

That said, here are some general properties which seem necessary — though likely not sufficient — for IDA agents to achieve robust alignment and high capability:

Achieving competitive performance and efficiency

Paul aims for IDA agents to be competitive with traditional RL agents in time and resource costs at runtime — this is a reasonable expectation because an IDA agent is ultimately just another learned model whose weights were tuned with an unusual training procedure. Resource and time cost during training is a more open question; I haven’t explored the assumptions that would have to hold for the IDA training process to be practically feasible or resource-competitive with other AI projects. This was originally posted here.

Comment

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=GeBLKN5FJjzDmdGqn

Based on the discussion between Vladimir Slepnev and Paul in this thread, the statements in this post suggesting that the first stage of IDA will produce nearly-human-level assistants ("we assume that A[0] can acquire nearly human-level capabilities through this process", "Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved") seem misleading. In the same thread, Paul says that he "will probably correct it", but as far as I can tell, neither the Medium post nor the version of the post in this sequence (which was published after the discussion) has been corrected.

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=WLTxpGdHoB7TzT6K7

In the pseudocode, it would make more sense to initialize A <- Distill(H), wouldn’t it? Otherwise, running Amplify with the randomly initialized A in the next step wouldn’t be helpful.

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=eeRiggQaSiduamRsi

I had this same thought, but my understanding (which is not solid) is that in the first iteration, since A is random, H can just ignore A and go with its own output (if my assistants are unhelpful, I can just try to perform the task all on my own). So Amplify(H, A) becomes H, which means A <- Distill(Amplify(H, A)) is basically A <- Distill(H), exactly as you suggested.

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=duA3mpGu6CD2D5cQg

"AGZ’s policy network p is the learned model."

I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network p, but uses a single network f_{\theta} which outputs both the learned policy p and the estimated probability v that the current player will win the game. Is this what the sentence is referring to?

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=K8ZT5vW9dK2LXXhs9

Yes, AGZ uses the same network for policy and value function.

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=HRZJ5SWahYxoh9GtE

When A[n+1] is supposed to imitate the output of (H, A[n]), I think IDA is safe, because I think imitation is safe. (If A[0] is a rock and (H, A[n]) is a group of one human and two A[n]’s, then A[n] is basically imitating a group of 2^{n} - 1 humans). If (H, A[n]) is supposed to provide a reward signal to A[n+1], which A[n+1] tries to optimize, I think this version of IDA is unsafe, for reasons similar to what Wei Dai expressed in a comment (on a post I now can’t find) taking issue with the inductive step in the original argument. Can we standardize different names for these two designs? Unless, is the latter version deprecated?
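The count in the parenthetical can be checked with a short recurrence. As a sketch, write N_n (a symbol introduced here, not in the comment) for the number of humans that A[n] effectively imitates, assuming A[0] contributes none and each amplified system consists of one human plus two copies of A[n]:

```latex
\begin{align*}
N_0 &= 0, \\
N_{n+1} &= 1 + 2 N_n
  \;\Longrightarrow\; N_{n+1} + 1 = 2\,(N_n + 1), \\
N_n + 1 &= 2^{n}\,(N_0 + 1) = 2^{n}
  \;\Longrightarrow\; N_n = 2^{n} - 1 .
\end{align*}
```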

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=BvuCeJPrXu89KDR6x

I think there are 2 mistakes in the pseudocode.

First mistake

what rmoehn said.

Second mistake

In the personal assistant example you say

In the next iteration of training, the Amplify(H, A[0]) system takes over the role of H as the overseer.

which implies that we do

H <- Amplify(H, A)

But in the pseudocode the original human overseer acts as the overseer all the time.

Suggested change of the pseudocode, which fixes both mistakes

def IDA(H):
    repeat:
        A ← Distill(H)
        H ← Amplify(H, A)

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=oWEoDSF8wrEuh83yx

I think H is always the same. In fact, H is a human, so it doesn’t make any sense to have code of the form H ← x. In every step, a new system A^{(t+1)} is trained by letting a regular human oversee it, where the human has access to the system A^{(t)}. Conversely, your code would imply that the human itself is replaced with something, and that thing then uses the system A^{(t)}. This does not happen. (Unless my understanding is wildly off; I’m only reading this sequence for the second time.)

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=92FzkKHZ9az8dezSH

I noticed that I have two distinct "mental pictures" for what the overseer is, depending on how the Distill procedure works (i.e. depending on the narrow technique used in the Distill procedure).

  1. For imitation learning and narrow inverse reinforcement learning: a "passive" overseer that just gets used as a template/target for imitation.

  2. For narrow reinforcement learning and in discussions about approval-directed agents: an "active" overseer that rates actions or provides rewards.

I wonder if this way of thinking about the overseer is okay/​correct, or if I’m missing something (e.g. maybe even in case (1), the overseer has a more active role than I can make out). Assuming this way of thinking about the overseer is okay, it seems like for case (1), the term "overseer" has connotations that extend beyond the role played by the overseer (i.e. it doesn’t really provide any oversight since it is passive).

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=WFid3XRuhqwZbwvwd

"Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices are (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards."

Wouldn’t it try to bring about states in which some action is particularly reasonable? Like the villain from that story who brings about a public threat in order to be seen defeating it.

https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification-1?commentId=RXHs4wjxhrKQg4JkY

Potentially, it depends on the time horizon and on how the rewards are calculated. The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive "human value function," i.e. ask a human "how good does state s seem?"). This reward function wouldn’t have that problem.

Comment

Maximizing the sum of differences in state value just maximizes state value again, which is exactly what narrow reinforcement learning was meant to get away from.
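The claim can be made explicit as a telescoping sum. This is a sketch assuming an undiscounted, finite horizon of T steps; the symbols s_t, r_t, and T are introduced here for illustration, with r_t = V(s_{t+1}) - V(s_t) being the reward proposed above:

```latex
\begin{align*}
\sum_{t=0}^{T-1} r_t
  &= \sum_{t=0}^{T-1} \bigl( V(s_{t+1}) - V(s_t) \bigr) \\
  &= V(s_T) - V(s_0).
\end{align*}
```

Since V(s_0) is fixed at the start, maximizing this sum is the same as maximizing V(s_T). How far that reaches depends on the horizon T; with T = 1 it reduces to picking the action whose next state looks best to the human.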

Comment

The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me. The difference with broad reinforcement learning is that you aren’t trying to evaluate actions you can’t understand by looking at the consequences you can observe.