Alignment Newsletter #51

https://www.lesswrong.com/posts/joxveLr8egM85YTy2/alignment-newsletter-51


Highlights

Towards Characterizing Divergence in Deep Q-Learning (Joshua Achiam et al): Q-Learning algorithms use the Bellman equation to learn the Q*(s, a) function, which is the long-term value of taking action a in state s. Tabular Q-Learning collects experience and updates the Q-value for each (s, a) pair independently. As long as each (s, a) pair is visited infinitely often and the learning rate is decayed properly, the algorithm is guaranteed to converge to Q*. Once we get to complex environments where you can't enumerate all of the states, we can't visit all of the (s, a) pairs, so the obvious approach is to approximate Q*(s, a). Deep Q-Learning (DQL) algorithms use neural nets for this approximation, and use some flavor of gradient descent to update the parameters of the net so that it comes closer to satisfying the Bellman equation. Unfortunately, this approximation can prevent the algorithm from ever converging to Q*. This paper studies the first-order Taylor expansion of the DQL update and identifies three factors that affect it: the distribution of (s, a) pairs from which you learn, the Bellman update operator, and the neural tangent kernel, a property of the neural net that specifies how information from one (s, a) pair generalizes to other (s, a) pairs. The theoretical analysis shows that as long as there is limited generalization between (s, a) pairs, and each (s, a) pair is visited infinitely often, the algorithm will converge. Inspired by this, they design PreQN, which explicitly seeks to minimize generalization across (s, a) pairs within the same batch. They find that PreQN leads to competitive and stable performance, despite not using any of the tricks that DQL algorithms typically require, such as target networks.

Rohin's opinion: I really liked this paper: it's a rare instance where I actually wanted to read the theory in the paper because it felt important for getting the high-level insight. The theory is particularly straightforward and easy to understand (which usually seems to be true when it leads to high-level insight). The design of the algorithm seems more principled than others, and the experiments suggest that this was actually fruitful. The algorithm is probably more computationally expensive per step than other algorithms, but that could likely be improved in the future. One thing that felt strange is that the proposed solution is basically to prevent generalization between (s, a) pairs, when the whole point of DQL algorithms is to generalize between (s, a) pairs, since you can't get experience from all of them. Since they only prevent generalization within a batch, they still generalize to (s, a) pairs outside the batch, but presumably that is just because they could only prevent generalization within the batch, not a deliberate design choice. Empirically the algorithm does seem to work, but it's still not clear to me why it works.
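To make the tabular baseline concrete, here is a minimal sketch of tabular Q-Learning with the Bellman backup (my illustration, not the paper's code; `env` is a hypothetical Gym-style environment with integer states):

```python
import numpy as np

# Minimal tabular Q-Learning sketch. `env` is assumed to expose reset()/step()
# in the usual Gym style and to use integer states in [0, n_states).
def tabular_q_learning(env, n_states, n_actions, episodes=1000,
                       alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration keeps visiting every (s, a) pair.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Bellman backup: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            # Each table entry is updated independently -- there is no
            # generalization between (s, a) pairs, which is why the tabular
            # convergence guarantee holds.
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

DQL replaces the table with a neural net, and the neural tangent kernel describes how an update to one (s, a) pair leaks into others.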

Technical AI alignment

Learning human intent

Deep Reinforcement Learning from Policy-Dependent Human Feedback (Dilip Arumugam et al): One obvious approach to human-in-the-loop reinforcement learning is to have humans provide an external reward signal that the policy optimizes. Previous work noted that humans tend to correct existing behavior, rather than providing an "objective" measurement of how good the behavior is (which is what a reward function is). They proposed Convergent Actor-Critic by Humans (COACH), where instead of using human feedback as a reward signal, they use it as the advantage function. This means that human feedback is modeled as specifying how good an action is relative to the "average" action that the agent would have chosen from that state. (It's an average because the policy is stochastic.) Thus, as the policy gets better, it will no longer get positive feedback on behaviors that it has successfully learned, which matches how humans give reinforcement signals. This work takes COACH and extends it to the deep RL setting, evaluating it on Minecraft. While the original COACH had an eligibility trace that helps "smooth out" human feedback over time, deep COACH requires an eligibility replay buffer. For sample efficiency, they first train an autoencoder to learn a good representation of the space (presumably using experience collected with a random policy), and feed these representations into the control policy. They add an entropy bonus so that the policy doesn't commit to a particular behavior, keeping it responsive to feedback, but select actions by always picking the action with maximal probability (rather than sampling from the distribution) in order to have interpretable, consistent behavior for the human trainers to provide feedback on. They evaluate on simple navigation tasks in the complex 3D environment of Minecraft, including a task where the agent must patrol the perimeter of a room, which cannot be captured by a state-based reward function.

Rohin's opinion: I really like the focus on figuring out how humans actually provide feedback in practice; it makes a lot of sense that we provide reinforcement signals that reflect the advantage function rather than the reward function. That said, I wish the evaluation had more complex tasks, and had involved human trainers who were not authors of the paper: it might have taken an hour or two of human time instead of 10-15 minutes, but it would have been a lot more compelling. Before continuing, I recommend reading about Simulated Policy Learning in Video Models below. As in that case, I think you get sample efficiency here by getting a lot of "supervision information" from the pixels used to train the VAE, though in this case it's by learning useful features rather than using the world model to simulate trajectories. (Importantly, in this setting we care about sample efficiency with respect to human feedback as opposed to environment interaction.) I think the techniques used there could help with scaling to more complex tasks. In particular, it would be interesting to see a variant of deep COACH that alternated between training the VAE with data from the learned control policy, and training the control policy with the new VAE features. One issue is that retraining the VAE would invalidate the previous control policy, but you could probably get around that (e.g. by also training the control policy to imitate itself while the VAE is being trained).
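A minimal sketch of the core COACH idea (my illustration, not the paper's code): a policy-gradient step in which the human's scalar feedback stands in for the advantage, shown here for a hypothetical linear-softmax policy over discrete actions.

```python
import numpy as np

def coach_update(theta, features, action, human_feedback, lr=0.01):
    """One COACH-style policy-gradient step for a toy linear-softmax policy.

    theta: (n_features, n_actions) policy parameters (hypothetical toy policy).
    features: (n_features,) state features (e.g. autoencoder codes in deep COACH).
    human_feedback: scalar signal from the trainer for the chosen action.
    """
    logits = features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of log pi(action | state) for a linear-softmax policy.
    grad_log_pi = np.outer(features, -probs)
    grad_log_pi[:, action] += features
    # Human feedback plays the role of the advantage A(s, a): once the policy
    # reliably takes good actions, the trainer stops rewarding them, so the
    # effective advantage (and hence the update) shrinks toward zero.
    return theta + lr * human_feedback * grad_log_pi
```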
From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following (Justin Fu et al): Rewards and language commands are more generalizable than policies: "pick up the vase" would make sense in any house, but the actions that navigate to and pick up a vase in one house would not work in another house. Based on this observation, this paper proposes that we have a dataset where, for several (language command, environment) pairs, we are given expert demonstrations of how to follow the command in that environment. For each data point, we can use IRL to infer a reward function, and use that to train a neural net that maps from the language command to the reward function. Then, at test time, given a language command, we can convert it to a reward function, after which we can use standard deep RL techniques to get a policy that executes the command. The authors evaluate on a 3D house domain with pixel observations, and two types of language commands: navigation and pick-and-place. During training, when IRL needs to be done, since deep IRL algorithms are computationally expensive they convert the task into a small, tabular MDP with known dynamics for which they can solve the IRL problem exactly, deriving a gradient that can then be applied in the observation space to train a neural net that, given image observations and a language command, predicts the reward. Note that this only needs to be done at training time: at test time, the reward function can be used in a new environment with unknown dynamics and image observations. They show that the learned rewards generalize to novel combinations of objects within a house, as well as to entirely new houses (though to a lesser extent).

Rohin's opinion: I think the success at generalization comes primarily from the MaxEnt IRL during training: it provides a lot of structure and inductive bias that means that the rewards on which the reward predictor is trained are "close" to the intended reward function. For example, in the navigation tasks, the demonstrations for a command like "go to the vase" will involve trajectories through the states of many houses that end at the vase. For each demonstration, MaxEnt IRL "assigns" positive reward to the states in the demonstration, and negative reward to everything else. However, once you average across demonstrations in different houses, the state with the vase gets a huge amount of positive reward (since it is in all trajectories), while all the other states are relatively neutral (since they will only be in a few trajectories, where the agent needed to pass through that point in order to get to the vase). So when this is "transferred" to the neural net via gradients, the neural net is basically "told" that high reward only happens in states that contain vases, which is a strong constraint on the learned reward. To be clear, this is not meant as a critique of the paper: indeed, I think when you want out-of-distribution generalization, you have to do it by imposing structure/inductive bias, and this is a new way to do it that I hadn't seen before.
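A minimal sketch of the test-time pipeline described above (my illustration; `reward_net` and `run_rl` are hypothetical stand-ins for the trained language-conditioned reward model and any standard deep RL algorithm):

```python
# Sketch of test-time use of a language-conditioned reward model.
# Hypothetical pieces: reward_net(observation, command) is the trained network
# that predicts reward from an image observation and a language command;
# run_rl(env, reward_fn) is any off-the-shelf deep RL algorithm.
def follow_command(command, env, reward_net, run_rl):
    # In a new house with unknown dynamics, the learned network serves as the
    # reward function for this command; no demonstrations are needed here.
    def reward_fn(observation):
        return reward_net(observation, command)
    # Standard deep RL then optimizes the predicted reward to produce a policy
    # that executes the command.
    return run_rl(env, reward_fn)
```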
Using Natural Language for Reward Shaping in Reinforcement Learning (Prasoon Goyal et al): This paper constructs a dataset for grounding natural language in Atari games, and uses it to improve performance on Atari. They have humans annotate short clips with natural language: for example, "jump over the skull while going to the left" in Montezuma's Revenge. They use this to build a model that predicts whether a given trajectory matches a natural language instruction. Then, while training an agent to play Atari, they have humans give the AI system an instruction in natural language. They use their natural language model to predict the probability that the trajectory matches the instruction, and add that as an extra shaping term in the reward. This leads to faster learning. (See the sketch below.)

ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning (Andrew Silva et al)
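Referring back to the Goyal et al. entry above, here is a minimal sketch of the language-based shaping idea (my illustration; the exact form of the shaping term in the paper may differ, and `match_probability` is a hypothetical stand-in for their trained language-grounding model):

```python
# Sketch of language-based reward shaping. match_probability(clip, instruction)
# is a hypothetical trained model returning the probability that a short
# trajectory clip matches a natural-language instruction such as
# "jump over the skull while going to the left".
def shaped_reward(env_reward, trajectory_clip, instruction,
                  match_probability, weight=0.1):
    # The agent still receives the environment reward, plus a bonus for
    # behaving consistently with the human's instruction, added as an extra
    # shaping term that speeds up learning.
    bonus = match_probability(trajectory_clip, instruction)
    return env_reward + weight * bonus
```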

Interpretability

Visualizing memorization in RNNs (Andreas Madsen): This is a short Distill article that showcases a visualization tool demonstrating how contextual information is used by various RNN units (LSTMs, GRUs, and nested LSTMs). The method is very simple: for each character in the context, they highlight the character in proportion to the gradient of the logits with respect to that character. Looking at this visualization allows us to see that GRUs are better at using long-term context, while LSTMs perform better for short-term contexts.

Rohin's opinion: I'd recommend you actually look at and play around with the visualization; it's very nice. The summary is short because the value of the work is in the visualization, not in the technical details.
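A minimal sketch of the highlighting computation (my illustration, not the article's code): take the gradient of the output logits with respect to the input characters (assumed to come from an autodiff framework) and turn it into per-character highlight intensities.

```python
import numpy as np

# Convert a gradient of the logits with respect to the (embedded) input
# characters into per-character highlight intensities in [0, 1].
def character_saliency(grad_wrt_inputs):
    """grad_wrt_inputs: (seq_len, embed_dim) gradient for one predicted character."""
    # Aggregate over the embedding dimension, then normalize so the most
    # influential context character gets intensity 1.
    magnitude = np.abs(grad_wrt_inputs).sum(axis=1)
    return magnitude / (magnitude.max() + 1e-8)
```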

Other progress in AI

Exploration

Learning Exploration Policies for Navigation (Tao Chen et al)

Deep Reinforcement Learning with Feedback-based Exploration (Jan Scholten et al)

Reinforcement learning

Towards Characterizing Divergence in Deep Q-Learning (Joshua Achiam et al): Summarized in the highlights!

Eighteen Months of RL Research at Google Brain in Montreal (Marc Bellemare): One approach to reinforcement learning is to predict the entire distribution of rewards from taking an action, instead of predicting just the expected reward. Empirically, this works better, even though in both cases we choose the action with the highest expected reward. This blog post provides an overview of work at Google Brain Montreal that attempts to understand this phenomenon. I'm only summarizing the part that most interested me. First, they found that in theory, distributional RL performs on par with or worse than standard RL when using either a tabular representation or linear features. They then tested this empirically on Cartpole, and found similar results: distributional RL performed worse when using tabular or linear representations, but better when using a deep neural net. This suggests that distributional RL "learns better representations". So, they visualize representations for RL on the four-room environment, and find that distributional RL captures more structured representations. Similarly, this paper showed that predicting value functions for multiple discount rates is an effective way to produce auxiliary tasks for Atari.

Rohin's opinion: This is a really interesting mystery with deep RL, and after reading this post I have a story for it. Note that I am far from an expert in this field, and it's quite plausible that if I read the papers cited in this post I could tell this story is false, but here's the story anyway. As we saw with PreQN earlier in this issue, one of the most important aspects of deep RL is how information about one (s, a) pair is used to generalize to other (s, a) pairs. I'd guess that the benefit from distributional RL is primarily that you get "good representations" that let you do this generalization well. With a tabular representation you don't do any generalization, and with a linear feature space the representation is hand-designed by humans to do this generalization well, so distributional RL doesn't help in those cases. But why does distributional RL learn good representations? I claim that it provides stronger supervision given the same amount of experience. With normal expected RL, the final layer of the neural net need only be useful for predicting the expected reward, but with distributional RL it must be useful for predicting all of the quantiles of the reward distribution. There may be "shortcuts" or "heuristics" that allow you to predict expected reward well because of spurious correlations in your environment, but it's less likely that those heuristics work well for all of the quantiles of the reward distribution. As a result, having to predict more things enforces a stronger constraint on what representations your neural net must have, and thus you are more likely to find good representations. This perspective also explains why predicting value functions for multiple discount rates helps with Atari, and why adding auxiliary tasks is often helpful (as long as the auxiliary task is relevant to the main task). The important aspect here is that all of the quantiles force the same neural net to learn good representations. If you instead had different neural nets predicting each quantile, each neural net would have roughly the same amount of supervision as in expected RL, so I'd expect that to work about as well as expected RL, maybe a little worse since quantiles are probably harder to predict than means. If anyone actually runs this experiment, please do let me know the result!
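A toy sketch of the supervision contrast (my illustration, not from the post): a scalar head produces one target per (state, action), while a quantile head produces many, all of which push gradients back into the same shared representation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, n_quantiles = 64, 4, 51

phi = rng.normal(size=n_features)  # shared representation phi(s) from the shared net

# Expected-value head: one output (and hence one training target) per action.
W_mean = rng.normal(size=(n_features, n_actions))
q_mean = phi @ W_mean                               # shape (n_actions,)

# Quantile head: n_quantiles outputs per action, so each sample imposes many
# more constraints, all flowing through the same shared representation phi.
W_quant = rng.normal(size=(n_features, n_actions, n_quantiles))
q_quant = np.einsum("f,faq->aq", phi, W_quant)      # shape (n_actions, n_quantiles)

print(q_mean.shape, q_quant.shape)                  # (4,) vs (4, 51)
```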
Diagnosing Bottlenecks in Deep Q-learning Algorithms (Justin Fu, Aviral Kumar et al): While the PreQN paper used a theoretical approach to tackle Deep Q-Learning algorithms, this one takes an empirical approach. Their results:

Deep learning

Measuring the Limits of Data Parallel Training for Neural Networks (Chris Shallue and George Dahl): Consider the relationship between the size of a single batch and the number of batches needed to reach a specific performance bound when using deep learning. If all that mattered for performance was the total number of examples that you take gradient steps on (i.e. the product of these two numbers), then you would expect a perfect inverse relationship between these two quantities, which would look like a line with negative slope on a log-log plot. In this case, we could scale batch sizes up arbitrarily far, and distribute them across as many machines as necessary, in order to reduce wall-clock training time. A 2x increase in batch size with twice as many machines would lead to a 2x decrease in training time. However, as you make batch sizes really large, you face the problem of stale gradients: if you had updated on the first half of the batch and then computed gradients on the second half of the batch, the gradients for the second half would be "better", because they were computed with respect to a better set of parameters. When this effect becomes significant, you no longer get the nice linear scaling from parallelization. This post studies the relationship empirically across a number of datasets, architectures, and optimization algorithms. They find that universally, there is initially a regime of perfect linear scaling as you increase batch size, followed by a region of diminishing marginal returns that ultimately leads to an asymptote where increasing batch size doesn't help at all with reducing wall-clock training time. However, the transition points between these regimes vary wildly, suggesting that there may be low-hanging fruit in the design of algorithms or architectures that explicitly aim to achieve very good scaling.

Rohin's opinion: OpenAI found (AN #37) that the best predictor of the maximum useful batch size was how noisy the gradient is. Presumably when you have noisy gradients, a larger batch size helps "average out" the noise across examples. Rereading their post, I notice that they mentioned the study I've summarized here and said that their results can help explain why there's so much variance in the transition points across datasets. However, I don't think it can explain the variance in transition points across architectures. Noisy gradients are typically a significant problem, and so it would be weird if the variance in transition points across architectures were explained by the noisiness of the gradient: that would imply that two architectures reach the same final performance even though one had the problem of noisy gradients while the other didn't. So there seems to be something left to explain here. That said, I haven't looked in depth at the data, so the explanation could be very simple. For example, maybe the transition points don't vary much across architectures and vary much more across datasets, and the variance across architectures is small enough that its effect on performance is dwarfed by all the other things that can affect the performance of deep learning systems. Or perhaps while the noisiness of the gradient is a good predictor of the maximum batch size, it still only explains, say, 40% of the effect, and so variance across architectures is totally compatible with factors other than the gradient noise affecting the maximum batch size.
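As a companion to the OpenAI result mentioned above, here is a rough sketch of the "simple" gradient noise scale they use as a predictor of the maximum useful batch size (my illustration under the assumption that it is roughly the trace of the per-example gradient covariance divided by the squared norm of the mean gradient; not the study's code):

```python
import numpy as np

# Rough estimate of a simple gradient noise scale, tr(Sigma) / |G|^2, from a
# batch of per-example gradients (assumed shape: (n_examples, n_params)).
def simple_noise_scale(per_example_grads):
    mean_grad = per_example_grads.mean(axis=0)
    # Trace of the per-example gradient covariance = sum of per-parameter variances.
    trace_cov = per_example_grads.var(axis=0, ddof=1).sum()
    # Noisier gradients (large trace relative to the mean gradient's norm) suggest
    # that larger batches remain useful for averaging out the noise.
    return trace_cov / (np.dot(mean_grad, mean_grad) + 1e-12)
```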

Comment

https://www.lesswrong.com/posts/joxveLr8egM85YTy2/alignment-newsletter-51?commentId=feWb77ARkD8dP678w

Rohin, thank you for the especially long and informative newsletter.

"When there are more samples, we get a lower validation loss [...]"

I guess you meant a higher validation loss?

Comment

https://www.lesswrong.com/posts/joxveLr8egM85YTy2/alignment-newsletter-51?commentId=hErjkw5cCAui4Fu62

No, I think lower is correct? More samples --> more data to fit to --> less chance to overfit to noise in the training data --> better performance on held-out validation data --> lower validation loss.