An unaligned benchmark

https://www.lesswrong.com/posts/ZHXutm7KpoWEj9G2s/an-unaligned-benchmark

I. Model-based RL with MCTS

We train three systems in parallel:

- A generative model, which predicts future observations given a sequence of actions.
- A reward function, which scores (action, observation) sequences and is trained from human judgments about which sequences look better.
- A policy and value function, which propose actions and estimate future reward, and which are trained by running MCTS against the generative model and reward function (in the style of AlphaZero, but with a learned model of the environment in place of a game simulator).
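To make this concrete, here is a minimal sketch of the planning loop, with toy stand-ins for all three learned components. The class names, interfaces, and the crude search rule are my own illustration rather than anything specified above; the point is only the division of labor between a learned model used to simulate, a learned reward used to score simulated trajectories, and a policy/value pair that guides the search and is then trained to imitate its results.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    actions: list = field(default_factory=list)
    observations: list = field(default_factory=list)

class GenerativeModel:
    """Learned model of the environment: predicts the next observation."""
    def predict(self, traj, action):
        # Toy stand-in for a learned sequence model.
        return (len(traj.actions) * 31 + action * 17) % 100

class RewardFunction:
    """Trained from human judgments about which (action, observation) sequence looks better."""
    def score(self, traj):
        # Toy stand-in for a trained reward model.
        return sum(traj.observations) / max(len(traj.observations), 1)

class PolicyValue:
    """Proposes actions and estimates future reward; distilled from MCTS results."""
    def priors(self, traj, n_actions):
        return [1.0 / n_actions] * n_actions

    def value(self, traj):
        return 0.0

def mcts_action(model, reward, pv, traj, n_actions=4, simulations=64, depth=6):
    """Choose an action by simulating rollouts inside the learned model
    and scoring them with the learned reward function."""
    totals = [0.0] * n_actions
    counts = [0] * n_actions
    priors = pv.priors(traj, n_actions)
    for _ in range(simulations):
        # Crude prior/visit-count tradeoff standing in for a full UCT rule.
        first = max(range(n_actions),
                    key=lambda a: priors[a] / (1 + counts[a]) + 1e-3 * random.random())
        sim = Trajectory(list(traj.actions), list(traj.observations))
        action = first
        for _ in range(depth):
            obs = model.predict(sim, action)
            sim.actions.append(action)
            sim.observations.append(obs)
            action = random.randrange(n_actions)
        totals[first] += reward.score(sim) + pv.value(sim)
        counts[first] += 1
    return max(range(n_actions), key=lambda a: totals[a] / max(counts[a], 1))

# One planning step. In training, the MCTS results would also be used as targets
# to update PolicyValue, real experience would update the generative model, and
# human comparisons would update the reward function.
print(mcts_action(GenerativeModel(), RewardFunction(), PolicyValue(), Trajectory()))
```

The detail that matters for the rest of the post is that the reward function only ever sees (action, observation) sequences, and the policy and value function inherit whatever the search was optimizing.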

II. What goes wrong?

There are two classes of problems:

- The reward function doesn’t capture what we actually want, so that even faithful optimization of it leads somewhere bad (Problem 1 below).
- The learned policy and value function may fail off the training distribution, and in particular may end up competently pursuing the wrong goals (Problem 2 below).

Problem 1: Bad objective

The goal of the system is to produce (action, observation) sequences that look good to humans. I claim that optimizing this objective faithfully will lead to bad outcomes. As the system improves, the rationale for many individual actions will become incomprehensible to a human overseer. At that point the only option left to the human is to evaluate sequences of observations based on whether the consequences look good. The observations present a narrow view of the world, and I strongly suspect that the AI will find sequences of actions that make that narrow view look good without actually being good.

Control vs. intrinsic goodness. I think there are two strategies for defining a reward function:

- Reward states of the world that are intrinsically good, so that optimizing the reward pushes the world directly toward outcomes we want.
- Reward states in which humans remain in control of the situation, so that mistakes can still be noticed and corrected later.
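In either case, the reward is ultimately learned from human judgments of observation sequences. Here is a toy sketch of that step: fitting a reward model from pairwise comparisons using a Bradley-Terry style logistic loss. The feature map, the linear model, and the data are all stand-ins of my own (the post doesn’t commit to any particular training scheme); the relevant point is that everything the reward model knows about the world has to pass through the observation sequences being compared.

```python
import numpy as np

def features(observations):
    """Crude summary of an observation sequence; everything the reward model
    learns about the world has to pass through features like these."""
    obs = np.asarray(observations, dtype=float)
    return np.array([obs.mean(), obs.max(), float(len(obs))])

def preference_step(w, preferred, rejected, lr=0.1):
    """One gradient-descent step on -log sigmoid(r(preferred) - r(rejected))
    for a linear reward r(x) = w . features(x)."""
    diff = features(preferred) - features(rejected)
    p_correct = 1.0 / (1.0 + np.exp(-(w @ diff)))
    return w + lr * (1.0 - p_correct) * diff

# Hypothetical human comparisons: (sequence judged better, sequence judged worse).
comparisons = [([3, 4, 5], [1, 1, 2]), ([2, 6, 2], [0, 1, 0])]
w = np.zeros(3)
for _ in range(200):
    for better, worse in comparisons:
        w = preference_step(w, better, worse)
```

Anything the features don’t capture, including whatever the actions did to parts of the world the observations never show, is invisible to the learned reward.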

Problem 2: distributional shift (and optimization daemons)

Our training procedure produces a policy and value function, most likely represented as (really big) neural networks. At test time, we combine the policy and value function with MCTS to decide on actions. The value function and policy have been optimized to yield good performance on the data points we’ve seen so far, as judged by human evaluations. Unfortunately, there are likely to be a very large number of networks that encode the “wrong” goals but which also yield good performance. These networks will generalize poorly, and moreover when they fail to generalize they can result in an extremely powerful optimization process being pointed at the wrong objective.

A story about training. Originally the policy and value function don’t encode anything at all. Over time, they begin to encode a complicated soup of heuristics which is correlated with good performance. If we are training sufficiently powerful models, we hope they will eventually perform reasoning about outcomes. For example, the policy could learn to backwards chain from heuristics about what is valuable in order to decide which moves are good. This is what we are trying to do: the policy is supposed to backwards chain, since it’s the only part of the system that can use heuristics to prioritize the search.

What humans actually want is somewhat complicated, so it seems quite likely that it’s easier for models to pursue a complicated soup of heuristic goals than to understand exactly what we want. This is similar to the way in which humans acquired an extremely rich set of goals even though we were optimized according to evolutionary fitness. This is a complicated question, but I think that’s what the theoretical picture suggests, and I think historical experience with deep learning tends to support it.

As the system improves, the reward function encourages it to exhibit an increasingly precise understanding of what we want. Unfortunately, there are two ways to do this:

- The model’s internal goals can actually come to track what we want.
- The model can keep its soup of heuristic goals while learning what we want well enough to behave well during training, because behaving well is instrumentally useful for whatever it is actually pursuing.

A toy illustration of how the second kind of model can look identical during training and come apart off distribution is given below.
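The illustration is entirely my own construction, not from the post: the intended target depends on one feature, a spuriously correlated feature fits the training set just as well, and the two learned rules disagree as soon as the correlation is broken.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: feature A is what we actually care about, and
# feature B happens to be almost perfectly correlated with A.
n = 500
a_train = rng.normal(size=n)
b_train = a_train + 0.01 * rng.normal(size=n)
target_train = a_train  # the intended objective depends only on A

# Two hypotheses that both fit the training data essentially perfectly.
uses_a = lambda a, b: a
uses_b = lambda a, b: b

for name, h in [("uses A", uses_a), ("uses B", uses_b)]:
    err = np.mean((h(a_train, b_train) - target_train) ** 2)
    print(f"train MSE, model that {name}: {err:.4f}")

# Off-distribution, the correlation is broken and the two models diverge.
a_test = rng.normal(size=n)
b_test = rng.normal(size=n)  # no longer tied to A
for name, h in [("uses A", uses_a), ("uses B", uses_b)]:
    err = np.mean((h(a_test, b_test) - a_test) ** 2)
    print(f"test MSE,  model that {name}: {err:.4f}")
```

The analogue in the text is the policy and value function: many different internal goals produce the same behavior on the training distribution, and training alone can’t tell us which one we got.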

Problem 1.5: non-robust reward functions

There is another risk at the intersection between robustness and value specification. We may learn a model of human approval which is accurate on the training distribution, but which incorrectly assigns a very high value to some bad outcomes that didn’t appear in training. Indeed, recent experience with adversarial examples suggests that our models often behave very strangely on parts of the input space not visited in training, and that this problem can be hard to correct. Presumably some of these inputs would be assigned unusually high values (just as some would be assigned unusually low values).

In order to reach the most pathological cases, the agent needs significant control over its own observations, which in turn requires control over its environment. So even if the pathological inputs aren’t any worse than random on average, the ones that the system is able to reach will systematically be those where it has acquired enough influence to control its own observations.

We can potentially correct this problem by eliciting human feedback on the problematic observations and then adding them to the training set. We’re only in serious trouble if the reward function is really hard to fix. In that case, we may instead make it harder and harder for the system to modify its own observations. After a while, we might train a system which is only able to control its observations by physically overpowering us and causing a catastrophe.
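As a toy sketch of this dynamic (my construction, with a made-up approval model and a one-dimensional “observation”): fit an approval model on the observations seen during training, then let an agent that can steer its own observations optimize against it. The optimum lands far outside the training region, where the model’s scores no longer track real approval.

```python
import numpy as np

rng = np.random.default_rng(1)

# "True" human approval of a scalar observation, highest around 0.5.
def true_approval(x):
    return -(x - 0.5) ** 2

# Training only ever sees observations in [0, 0.4], where approval is increasing,
# so a simple linear approval model fits that region well.
x_train = rng.uniform(0.0, 0.4, size=200)
slope, intercept = np.polyfit(x_train, true_approval(x_train), deg=1)

def learned_approval(x):
    return slope * x + intercept

# An agent that can steer its own observations searches a much wider range
# and lands on the observation the learned model likes most.
candidates = np.linspace(-5.0, 5.0, 10001)
best = candidates[np.argmax(learned_approval(candidates))]

print(f"observation chosen by optimizing the learned model: {best:.2f}")
print(f"predicted approval there: {learned_approval(best):+.2f}")
print(f"actual approval there:    {true_approval(best):+.2f}")
```

On the training region the learned model is a fine approximation; the failure only appears once the system has enough influence to push its observations outside that region, which is exactly the regime described above.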

III. How the problem might be fixed

I think that my research has a chance of producing an AI that (a) is nearly as good as the benchmark, but (b) doesn’t do anything terrible. The main changes are:

- Replace direct human evaluation with evaluation by an amplified overseer, a human working together with the AI assistants they have already trained, so that the overseer can follow the rationale for the agent’s behavior rather than judging a narrow window of observations.
- Use techniques like adversarial training, interpretability, and verification to make the learned policy and value function reliable, so that they keep pursuing the intended objective away from the training distribution.
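To make the first change more concrete, here is a highly simplified sketch of amplification used as an evaluation procedure: the overseer answers a question by decomposing it into subquestions and delegating those to copies of the current model, instead of judging a raw observation stream directly. The decomposition, the agent stub, and the question format are all invented for illustration.

```python
# Hypothetical sketch of amplification as an evaluation procedure: an overseer
# splits a question into subquestions, delegates them to the current model,
# and combines the answers. Every interface here is a stand-in.

from typing import Callable, List

Agent = Callable[[str], str]

def weak_agent(question: str) -> str:
    """Stand-in for the current trained model answering a narrow question."""
    return f"<model's answer to: {question}>"

def decompose(question: str) -> List[str]:
    """Stand-in for the overseer breaking a hard evaluation into easier pieces."""
    return [
        f"What did the action sequence actually change, with respect to: {question}?",
        f"Would a human endorse those changes on reflection, with respect to: {question}?",
    ]

def amplify(agent: Agent, question: str, depth: int = 2) -> str:
    """Answer a question using the agent on subquestions; with depth > 1,
    the subquestions are themselves answered by an amplified agent."""
    if depth == 0:
        return agent(question)
    sub_answers = [amplify(agent, sub, depth - 1) for sub in decompose(question)]
    return f"overseer's judgment of '{question}' given {sub_answers}"

print(amplify(weak_agent, "Was this (action, observation) sequence actually good?"))
```

The intended contrast with the benchmark is that the evaluation no longer has to be made by an unaided human looking only at a narrow window of observations.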

My goals

In order to make all of that work, we’d need to solve a few research problems.

Reliability. Some combination of these techniques needs to successfully eliminate all bad behavior (and in particular to control optimization daemons).

Amplification. Amplification needs to be good enough for these three tasks: