[AN #136]: How well will GPT-N perform on downstream tasks?

https://www.lesswrong.com/posts/HJMQg8MksHq5ipDpN/an-136-how-well-will-gpt-n-perform-on-downstream-tasks


HIGHLIGHTS

Extrapolating GPT-N performance (Lukas Finnveden) (summarized by Asya): This post describes the author’s insights from extrapolating the performance of GPT on the benchmarks presented in the GPT-3 paper (AN #102). The author compares cross-entropy loss (which measures how good a model is at predicting the next token) with benchmark performance normalized to the difference between random performance and the maximum possible performance. Since previous work (AN #87) has shown that cross-entropy loss scales smoothly with model size, data, and FLOP requirements, we can then look at the overall relationship between those inputs and benchmark performance. The author finds that most of the benchmarks scale smoothly and similarly with respect to cross-entropy loss. Three exceptions are arithmetic, scramble (shuffling letters around the right way), and ANLI (a benchmark generated adversarially against transformer-based language models), which don’t improve until the very end of the cross-entropy loss range. The author fits linear and s-shaped curves to these relationships, and uses them to extrapolate how well larger models would perform on these benchmarks and what it would cost to train them.
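To make the pipeline concrete, here is a minimal sketch of fitting an s-shaped curve to (cross-entropy loss, benchmark performance) pairs and inverting it. The toy numbers, the specific logistic form, and the helper names are illustrative assumptions, not values from the post.

```python
# Sketch: normalize benchmark accuracy, fit an s-shaped curve against
# cross-entropy loss, then invert the fit. All numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

def normalize(acc, random_acc, max_acc=1.0):
    """Rescale accuracy so 0 = random guessing and 1 = maximum possible performance."""
    return (acc - random_acc) / (max_acc - random_acc)

def s_curve(loss, a, b):
    """Normalized performance as a logistic function of cross-entropy loss."""
    return 1.0 / (1.0 + np.exp(a * (loss - b)))

# Toy (cross-entropy loss, raw accuracy) pairs standing in for models of increasing size.
losses = np.array([2.6, 2.4, 2.2, 2.0, 1.8])
accs = np.array([0.30, 0.38, 0.51, 0.62, 0.71])
norm_accs = normalize(accs, random_acc=0.25)

(a, b), _ = curve_fit(s_curve, losses, norm_accs, p0=[5.0, 2.2])

# Invert the fit: what loss would be needed for 90% of maximum performance?
target = 0.9
loss_needed = b + np.log(1.0 / target - 1.0) / a
print(f"cross-entropy loss needed for 90% normalized performance: {loss_needed:.2f}")
```

Combining such a fit with a scaling law that maps model size, data, and FLOP to cross-entropy loss is what lets the post turn benchmark targets into estimates of required scale and cost.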

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning (Valerie Chen et al) (summarized by Rohin): It is particularly challenging for RL agents to perform hierarchical tasks when there is only a sparse reward. One natural piece of feedback in this setting is instructions in natural language specifying the different subtasks needed to solve the task. In particular, this paper assumes we have access to a dataset of human demonstrations paired with natural language instructions for each subtask that they complete. We then have an architecture that first generates the language instruction for the current subtask given the final task and the current state, and then takes a low-level action computed from the current state and the language instruction. This is trained via imitation learning on the human demonstrations. Using a small Minecraft-inspired gridworld, the authors show that the language generation is crucial for good generalization: if the agent is trained on "cobblestone block" and "iron ingot", then it is able to generalize to "cobblestone ingot", as long as it was trained to generate the language instruction as well. Intuitively, the combinatorial structure of language leads to better generalization than direct imitation on low-level actions.

A Narration-based Reward Shaping Approach using Grounded Natural Language Commands (Nicholas Waytowich et al) (summarized by Rohin): One way to specify what an AI system should do is to simply specify it in natural language. If we have some way to map natural language instructions to states, then we could turn natural language into a reward function and use RL to optimize it. This paper proposes specifying a task by breaking it down into a sequence of steps to be completed. Given a mapping from natural language to states, they define a reward function that gives a positive reward every time the mapping detects that the agent has completed the next stage in the sequence of steps. They show that this outperforms vanilla reinforcement learning on a win/loss reward function in a StarCraft minigame. For the mapping of language to states, the authors use a mutual embedding model (MEM) they developed in previous work. The core idea is to write down programs that identify states matching a particular natural language instruction, use this to generate a dataset of states and the corresponding natural language instructions, and then train a model to map the natural language instructions to embeddings that are "close to" the embeddings of the corresponding states (which are produced by a CNN).

Rohin’s opinion: My understanding is that the MEM only handles the six natural language instructions used in the StarCraft minigame, and so is roughly equivalent to training six classifiers using the hardcoded programs to generate datasets. Thus, these two papers ultimately boil down to "decompose the task into six steps, train classifiers for these six steps, and then do RL where the reward function gives positive reward every time a particular step is marked as complete". However, this is primarily because the authors had to ground the natural language instructions themselves. If we could instead leverage a pretrained model which already grounds natural language, such as CLIP, then it seems like this approach could in fact save a lot of human effort in specifying what the AI system should do.
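To make the narration-based shaping concrete, here is a minimal sketch of a reward that pays out each time a language-grounded classifier detects that the next step in the instruction sequence has been completed. The classifier interface, the State stand-in, and the StarCraft-flavored example are illustrative assumptions, not the paper’s MEM implementation.

```python
# Sketch of narration-based reward shaping: +1 whenever the next natural
# language step is detected as complete. Detectors stand in for the grounded
# language-to-state mapping (the MEM in the paper).
from typing import Callable, List

State = dict  # stand-in for whatever state representation the environment exposes

class NarrationShapedReward:
    def __init__(self, steps: List[str], detectors: List[Callable[[State], bool]]):
        # steps[i] is a natural language instruction; detectors[i](state) returns
        # True when that instruction has been completed in the given state.
        assert len(steps) == len(detectors)
        self.steps = steps
        self.detectors = detectors
        self.next_step = 0

    def reward(self, state: State) -> float:
        # Positive reward only when the *next* step in the sequence is completed.
        if self.next_step < len(self.steps) and self.detectors[self.next_step](state):
            self.next_step += 1
            return 1.0
        return 0.0

# Hypothetical usage: replace the sparse win/loss reward with this shaped reward.
shaper = NarrationShapedReward(
    steps=["build a supply depot", "build a barracks"],
    detectors=[lambda s: s.get("supply_depots", 0) > 0,
               lambda s: s.get("barracks", 0) > 0],
)
```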
Learning Rewards from Linguistic Feedback (Theodore R. Sumers et al) (summarized by Rohin): This paper proposes another approach to reinforcement learning using natural language. After the agent plays an episode, we can ask a human for feedback in natural language. We then take their response, figure out what features of the environment the response mentions, and then use sentiment analysis to determine how to update the weights on the features. For sentiment analysis we can use an off-the-shelf classifier; the hard part is in determining the relevant environment feature vectors:

  1. Evaluative feedback is feedback about the trajectory the agent produced, for example "good job", so we can just use the features of this trajectory.
  2. Imperative feedback specifies what the agent should have done, e.g. "you should have gone to the top right corner". In this case, we must find the features consistent with the given instruction.
  3. Descriptive feedback provides feedback directly about the reward, for example "yellow objects are bad". In this case, we use a feature vector that has a 1 for every feature mentioned (in this case, the feature for yellow objects) and 0 everywhere else.

Types 2 and 3 require some domain knowledge in order to write down programs that map language to the relevant features. The environment the authors used was simple enough that they were able to do this. Once we have the feature vector f and the sentiment s, we perform a Bayesian update on our weight distribution. This is similar to the way we perform Bayesian updates on the reward distribution upon seeing a human action as evidence, as in Bayesian IRL (AN #132) or reward-rational implicit choice (AN #89). This model already performs reasonably well; by adding a couple of heuristics inspired by pragmatics (e.g. assuming that features that aren’t mentioned aren’t decision-relevant), the authors reach approximately human-level performance.
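Here is a minimal sketch of the kind of Bayesian update described above, maintaining a distribution over candidate linear reward weights and updating it from a feature vector f and a sentiment score s. The discrete grid of candidates and the logistic likelihood are illustrative assumptions, not the paper’s exact model.

```python
# Sketch: Bayesian update of a distribution over reward weights w given
# feedback features f and sentiment s in [-1, 1]. Feedback is treated as more
# likely under weights whose reward direction w . f agrees with the sentiment.
import numpy as np

rng = np.random.default_rng(0)
num_features = 3
candidates = rng.uniform(-1, 1, size=(500, num_features))   # candidate weight vectors
posterior = np.full(len(candidates), 1.0 / len(candidates))  # uniform prior

def update(posterior, candidates, f, s, temperature=1.0):
    likelihood = 1.0 / (1.0 + np.exp(-s * (candidates @ f) / temperature))
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# "Yellow objects are bad": descriptive feedback mentioning only the yellow feature.
f_yellow = np.array([0.0, 1.0, 0.0])  # 1 for each mentioned feature, 0 elsewhere
posterior = update(posterior, candidates, f_yellow, s=-1.0)

print("posterior mean weights:", posterior @ candidates)
```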

PREVENTING BAD BEHAVIOR

Avoiding Side Effects in Complex Environments (Alex Turner et al) (summarized by Zach): One proposal for impact regularization is attainable utility preservation (AUP) (AN #91), in which we view side effects as changes in the ability of an agent to optimize a variety of reward functions. By incentivizing the agent not to change the optimal value for a wide range of auxiliary reward functions, the agent may avoid decreasing the optimal value for the true reward. To test the claim that AUP is a suitable way to avoid side effects, the authors experiment in SafeLife (AN #91), an environment suite based on Conway’s "Game of Life". In the Game of Life, depending on how many live neighbors surround a cell, the cell either comes to life, dies, or retains its state. In SafeLife, the eight cells surrounding the agent’s cell are frozen and can be modified by the agent, so the agent can disturb dynamic patterns merely by approaching them. To measure side effects, the authors compare how the environment would have evolved without the agent to how it actually evolves with the agent present. The tasks are simple: either add or remove cells from a specified location. However, there are obstacles in the way that the agent could disturb. To implement AUP, the authors use a single randomly sampled reward function based on a downsampling of the observation space. As a baseline, the authors compare AUP against PPO. Generally, AUP achieves fewer side effects than PPO while still obtaining reasonable performance, though it does take longer to train. Additionally, the side effects incurred during the training of AUP rise to a peak before settling below the side-effect score of PPO. It’s also important to note that sampling multiple rewards for AUP has the counter-intuitive effect of increasing the side-effect score.

Zach’s opinion: This paper presents a clear approach to handling side effects and provides a fairly thorough analysis via experimentation. Having said that, I find the experimental findings to be mixed. Intuitively, adding more random rewards would decrease task performance and the number of side effects. However, this isn’t borne out in the data, which raises interesting questions about how best to sample random reward functions. Relatedly, the phenomenon of side effects increasing at the start of AUP training is worth further investigation.
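To make the AUP objective concrete, here is a minimal sketch of the shaped reward, assuming we already have a learned Q-function for the single auxiliary reward. The function names, the no-op normalization, and the penalty coefficient are illustrative assumptions; see the paper for the exact formulation.

```python
# Sketch of the AUP reward: task reward minus a penalty for changing the agent's
# ability to pursue an auxiliary goal, relative to doing nothing (the no-op).
def aup_reward(r_task, q_aux, state, action, noop_action, lam=0.01):
    penalty = abs(q_aux(state, action) - q_aux(state, noop_action))
    # Normalizing by the no-op value keeps the penalty comparable across states
    # (an assumption here; the paper spells out its own scaling).
    scale = max(q_aux(state, noop_action), 1e-8)
    return r_task - lam * penalty / scale
```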

ADVERSARIAL EXAMPLES

Adversarial examples for the OpenAI CLIP in its zero-shot classification regime and their semantic generalization (Stanislav Fort) (summarized by Rohin): CLIP is a model that was trained on a vast soup of image-caption data, and as a result can perform zero-shot image classification (for example, it gets 87% accuracy on CIFAR-10 out of the box). Does it also have adversarial examples in this zero-shot classification regime? This post shows that the answer is yes, and in fact these adversarial examples are easy to find. More interestingly, these adversarial examples persist even if you change the labels in a semantically meaningful way. For example, suppose you take an image X that is correctly classified as a cat and imperceptibly modify it to Y, which is now classified as a dog. If you then change the class names to "kitty" and "hound", the same X will be classified as a kitty while the same Y will be classified as a hound. This even works (though not as well) for labels like "domesticated animal which barks and is best friend". The author takes this as evidence that the adversarial image actually looks like the adversarial class to the neural net, rather than being a peculiar consequence of the specific label.

Rohin’s opinion: This seems like further validation of the broad view put forth in Adversarial Examples Are Not Bugs, They Are Features (AN #62).
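As a rough illustration of how such adversarial examples can be found, here is a PGD-style sketch against CLIP’s zero-shot classifier, assuming OpenAI’s clip package and PyTorch. The epsilon, step size, label prompts, and file name are illustrative; the linked post’s exact procedure may differ.

```python
# Sketch: targeted PGD attack on CLIP zero-shot classification, pushing a cat
# image toward the "dog" label while keeping the perturbation small.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a cat", "a photo of a dog"]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(labels).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)
delta = torch.zeros_like(image, requires_grad=True)
target = torch.tensor([1], device=device)  # index of "a photo of a dog"

eps, step = 8 / 255, 1 / 255
for _ in range(40):
    image_features = model.encode_image(image + delta)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T  # CLIP's usual logit scale
    loss = torch.nn.functional.cross_entropy(logits, target)
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()  # descend the loss toward the target class
        delta.clamp_(-eps, eps)            # keep the perturbation imperceptible
        delta.grad.zero_()

with torch.no_grad():
    adv = model.encode_image(image + delta)
    adv = adv / adv.norm(dim=-1, keepdim=True)
    print("adversarial image classified as:", labels[(adv @ text_features.T).argmax().item()])
```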

OTHER PROGRESS IN AI

MULTIAGENT RL

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design (Michael Dennis, Natasha Jaques et al) (summarized by Rohin): One argument for AI risk is that we have to specify some aspects of the training procedure, and if these are poorly specified, then bad outcomes may result. Typically we think of bad specification of the reward function as the risk, but this can also apply to environments: if we train a system in a simulated environment, then it may fail if the simulation is insufficiently similar to the real environment. A typical approach would be domain randomization: we randomly vary some parameters that control the behavior of the environment. Unfortunately, this can often create environments that are too easy: in a maze environment, randomization often produces mazes without enough walls. Another approach could be to choose the environment adversarially, so that the agent learns the skills needed for hard environments. Unfortunately, this can often make the environment unsolvable: in the maze environment, the goal may be unreachable from the initial position. The key idea of this paper is a method to create environments that are just on the edge of the agent’s abilities, by finding an environment that maximizes the agent’s regret: how poorly the agent performs relative to how well it could have done. To operationalize how well the agent "could have done", we also train an antagonist agent, and we then choose an environment that the antagonist performs well on but the protagonist performs poorly on. This results in environments that are solvable but challenging for the protagonist.
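Here is a minimal sketch of the regret objective at the heart of this approach. In the paper the environment adversary is itself trained with RL to maximize regret; in this toy version it simply picks the highest-regret candidate from a random batch, and the policies and return estimates are stand-ins.

```python
# Sketch: pick the environment that maximizes the protagonist's regret, i.e. the
# gap between what the antagonist achieves and what the protagonist achieves.
import random

def estimate_return(policy, env_params, episodes=8):
    """Stand-in for rolling out a policy in the parameterized environment."""
    return sum(policy(env_params) + random.gauss(0, 0.1) for _ in range(episodes)) / episodes

def regret(protagonist, antagonist, env_params):
    # Unsolvable environments give both agents low returns and hence low regret,
    # so they are not selected; trivially easy environments likewise.
    return estimate_return(antagonist, env_params) - estimate_return(protagonist, env_params)

def pick_environment(protagonist, antagonist, candidate_envs):
    return max(candidate_envs, key=lambda env: regret(protagonist, antagonist, env))

# Toy usage: an "environment" is just a difficulty parameter in [0, 1].
protagonist = lambda env: 1.0 - env        # the protagonist struggles on harder envs
antagonist = lambda env: 1.0 - 0.5 * env   # the antagonist copes better with difficulty
candidates = [random.random() for _ in range(32)]
print("chosen difficulty:", round(pick_environment(protagonist, antagonist, candidates), 2))
```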

NEWS

AI Safety Career Bottlenecks Survey (AI Safety Support) (summarized by Rohin): AI Safety Support have released a career bottlenecks survey that they will use to guide their work. You can take the survey here.

AISU 2021 (summarized by Rohin): The third AI safety unconference will take place online from April 23rd to April 28th, 2021. The registration deadline is April 13th.

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

Comment

https://www.lesswrong.com/posts/HJMQg8MksHq5ipDpN/an-136-how-well-will-gpt-n-perform-on-downstream-tasks?commentId=KZvidYjtxJ4D9idFT

The predicted cost for GPT-N parameter improvements is for the "classical Transformer" architecture? Recent updates like the Performer should require substantially less compute and therefore cost.

Comment

https://www.lesswrong.com/posts/HJMQg8MksHq5ipDpN/an-136-how-well-will-gpt-n-perform-on-downstream-tasks?commentId=ERNhhKxA2WJKbbP9m

Yes, in general you want to account for hardware and software improvements. From the original post:

Finally, it’s important to note that algorithmic advances are real and important. GPT-3 still uses a somewhat novel and unoptimised architecture, and I’d be unsurprised if we got architectures or training methods that were one or two orders of magnitude more compute-efficient in the next 5 years.

From the summary: "$100B - $1T at current prices, $1B - $10B given estimated hardware and software improvements over the next 5 - 10 years". The $1B - $10B number is meant to include things like the Performer.