DERAIL: Diagnostic Environments for Reward And Imitation Learning

http://arxiv.org/abs/2012.01365v1

Abstract: The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case study of a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at https://github.com/HumanCompatibleAI/seals.


Introduction

Reinforcement learning (RL) optimizes a fixed reward function specified by the designer. This works well in artificial domains with well-specified reward functions such as games [@silver:2016; @vinyals:2019; @openai:2019]. However, in many real-world tasks the agent must interact with users who have complex and heterogeneous preferences. We would like the AI system to satisfy users' preferences, but the designer cannot perfectly anticipate users' desires, let alone procedurally specify them. This challenge has led to a proliferation of methods seeking to learn a reward function from user data [@ng:2000; @ziebart:2008; @christiano:2017; @fu:2018; @cabi:2019], or imitate demonstrations [@ross:2011; @ho:2016; @reddy:2020]. Collectively, we say algorithms that learn a reward or policy from human data are Learning from Humans (LfH).

LfH algorithms are primarily evaluated empirically, making benchmarks critical to progress in the field. Historically, evaluation has used RL benchmark suites. In recognition of important differences between RL and LfH, recent work has developed imitation learning benchmarks in complex simulated robotics environments with visual observations [@memmesheimer:2019; @james:2020].

In this paper, we develop a complementary approach using simple diagnostic environments that test individual aspects of LfH performance in isolation. Similar diagnostic tasks have been applied fruitfully to RL [@osband:2020], and diagnostic datasets have long been popular in natural language processing [@johnson:2017; @sinha:2019; @kottur:2019; @liu:2019; @wang:2019]. Diagnostic tasks are analogous to unit-tests: while less realistic than end-to-end tests, they have the benefit of being fast, reliable and able to isolate failures [@myers2011art; @wacker:2015]. Isolating failure is particularly important in machine learning, where small implementation details may have major effects on the results [@islam2017reproducibility].

This paper contributes the first suite of diagnostic environments designed for LfH algorithms. We evaluate a range of LfH algorithms on these tasks. Our results in section 4 show that, like deep RL [@henderson2018deep; @engstrom:2020], imitation learning is very sensitive to implementation details. Moreover, the diagnostic tasks isolate particular implementation differences that affect performance, such as positive or negative bias in the discriminator. Additionally, our results suggest that a widely-used preference-based reward learning algorithm [@christiano:2017] suffers from limited exploration. In section 5, we propose and evaluate several possible improvements using our suite, illustrating how it supports rapid prototyping of algorithmic refinements.

Designing Diagnostic Tasks {#sec:desiderata}

In principle, LfH algorithms can be evaluated in any Markov decision process. Designers of benchmark suites must reduce this large set of possibilities to a small set of tractable tasks that discriminate between algorithms. We propose three key desiderata to guide the creation of diagnostic tasks.

Isolation. Each task should test a single dimension of interest. The dimension could be a capability, such as robustness to noise; or the absence of a common failure mode, such as episode termination bias [@kostrikov2018discriminator]. Keeping tests narrow ensures failures pinpoint areas where an algorithm requires improvement. By contrast, an algorithm's performance on more general tasks has many confounders.

Parsimony. Tasks should be as simple as possible. This maintains compatibility with a broad range of algorithms. Furthermore, it ensures the tests run quickly, enabling a more rapid development cycle and sufficient replicas to achieve low-variance results. However, tasks may need to be computationally demanding in special cases, such as testing if an algorithm can scale to high-dimensional inputs.

Coverage. The benchmark suite should test a broad range of capabilities. This gives confidence that an algorithm passing all diagnostic tasks will perform well on general-purpose tasks. For example, a benchmark suite might want to test categories as varied as exploration ability, the absence of common design flaws and bugs, and robustness to shifts in the transition dynamics.

Tasks

In this section, we outline the suite of tasks we developed following the guidelines from the previous section. Some tasks have configuration parameters that allow their difficulty to be adjusted. A full specification of the tasks can be found in appendix 7.

Design Flaws and Implementation Bugs

First, we describe tasks that check for fundamental issues in the design and implementation of algorithms.

RiskyPath: Stochastic Transitions

Many LfH algorithms are derived from Maximum Entropy Inverse Reinforcement Learning [@ziebart:2008], which models the demonstrator as producing trajectories with probability $p(\tau) \propto \exp R(\tau)$. This model implies that a demonstrator can "control" the environment well enough to follow any high-reward trajectory with high probability [@ziebart:2010:thesis]. However, in stochastic environments, the agent cannot control the probability of each trajectory independently. This misspecification may lead to poor behavior.

To demonstrate and test for this issue, we designed RiskyPath, illustrated in Figure [fig:task:risky-path]. The agent starts at $s_0$ and can reach the goal $s_2$ (reward $1.0$) either by taking the safe path $s_0 \to s_1 \to s_2$, or by taking a risky action with equal chances of leading to $s_3$ (reward $-100.0$) or $s_2$. The safe path has the highest expected return, but the risky action sometimes reaches the goal $s_2$ in fewer timesteps, yielding a higher best-case return. Algorithms that fail to correctly handle stochastic dynamics may therefore wrongly conclude that the reward favors the risky path.
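To make the failure mode concrete, the sketch below contrasts expected returns under the true stochastic dynamics with the weights the MaxEnt trajectory model assigns. The code and the specific return values are our own illustration, not the benchmark's internals.

```python
import math

# Illustrative trajectory returns for RiskyPath (not the benchmark's exact values):
# the lucky risky trajectory reaches the goal sooner, so its return is slightly
# higher than the safe path's, but the unlucky outcome is catastrophic.
returns = {"safe": 3.0, "risky_lucky": 4.0, "risky_unlucky": -100.0}

# Expected returns under the true stochastic dynamics: the risky action is bad.
expected_safe = returns["safe"]                                              # 3.0
expected_risky = 0.5 * (returns["risky_lucky"] + returns["risky_unlucky"])   # -48.0

# The MaxEnt model p(tau) proportional to exp R(tau) weights each trajectory as
# if it could be chosen deliberately, so most probability mass lands on the
# lucky risky trajectory, wrongly suggesting the demonstrator prefers risk.
weights = {name: math.exp(r) for name, r in returns.items()}
total = sum(weights.values())
print({name: round(w / total, 3) for name, w in weights.items()})
print("expected safe:", expected_safe, "expected risky:", expected_risky)
```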

EarlyTerm: Early Termination

Many implementations of imitation learning algorithms incorrectly assign a value of zero to terminal states [@kostrikov2018discriminator]. Depending on the sign of the learned reward in non-terminal states, this can bias the agent either to end episodes early or to prolong them as long as possible. This confounds evaluation, as performance is spuriously high on tasks where the termination bias happens to align with the task objective. @kostrikov2018discriminator demonstrate this behavior with a simple example, which we adapt here as two variants of EarlyTerm, one with positive and one with negative rewards.

The environment is a 3-state MDP in which the agent can either alternate between two initial states until reaching the time horizon, or move to a terminal state, ending the episode early. In the positive variant all rewards are $+1$, while in the negative variant all rewards are $-1$. Algorithms biased towards early termination (e.g. because they assign a negative reward to all states) will do well on the negative variant and poorly on the positive one. Conversely, algorithms biased towards late termination will do well on the positive variant and poorly on the negative one.
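The sketch below illustrates the mechanism: with an implicit zero reward after termination, the sign of a constant learned reward alone determines whether ending or prolonging the episode looks better. Names and numbers are ours, for illustration only.

```python
# Suppose a learned reward assigns a constant c to every non-terminal state and
# (implicitly) zero once the episode has ended; the horizon is H.
def learned_return(c, steps_alive):
    """Undiscounted return if the episode ends after `steps_alive` steps."""
    return c * steps_alive  # steps after termination contribute the implicit 0.0

H = 10
for c in (+1.0, -1.0):
    early, late = learned_return(c, 1), learned_return(c, H)
    bias = "prolonging episodes" if late > early else "terminating early"
    print(f"constant learned reward {c:+}: biased towards {bias}")
```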

Core Capabilities

In this subsection, we consider tasks that focus on a core algorithmic capability for reward and imitation learning.

NoisyObs: Robust Learning

NoisyObs, illustrated in Figure [fig:noisy-obs], tests for robustness to noise. The agent starts at one of the corners of an $M \times M$ grid chosen at random (default $M = 5$), and tries to reach and stay at the center. The observation vector consists of the agent's $(x,y)$ coordinates in the first two elements, and $L$ "distractor" samples of Gaussian noise as the remaining elements (default $L=20$). The challenge is to select the relevant features of the observation rather than overfit to noise [@guyon:2003].
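A minimal sketch of how such an observation could be assembled is shown below; the function name and defaults are ours, not the seals API.

```python
import numpy as np

def noisy_obs(pos_xy, n_distractors=20, rng=None):
    """Concatenate the agent's (x, y) grid position with Gaussian distractor noise."""
    rng = np.random.default_rng() if rng is None else rng
    distractors = rng.normal(size=n_distractors)
    return np.concatenate([np.asarray(pos_xy, dtype=float), distractors])

obs = noisy_obs((2, 3))  # shape (22,): only the first two dimensions are informative
```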

Branching: Exploration

We include the Branching task to test LfH algorithms' exploration ability. The agent must traverse a specific path of length $L$ to reach a final goal (default $L=10$), with $B$ choices at each step (default $B=2$). Making the wrong choice at any of the $L$ decision points leads to a dead end with zero reward.
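The sketch below captures the structure of such a task under our own conventions (state as progress along the path, wrong turns absorbing); it is an illustration, not the benchmark implementation.

```python
def branching_step(progress, action, correct_path):
    """One step of a Branching-style chain: `progress` counts correct choices so far.

    Any wrong choice is an absorbing dead end with zero reward; reward 1 is
    given only on reaching the goal at the end of the path.
    """
    if progress < 0 or progress >= len(correct_path):  # dead end or goal: absorbing
        return progress, 0.0, True
    if action != correct_path[progress]:
        return -1, 0.0, True                           # wrong turn: dead end
    progress += 1
    done = progress == len(correct_path)
    return progress, (1.0 if done else 0.0), done

# With B = 2 choices and L = 3 steps, only one of 2**3 = 8 action sequences succeeds.
state, total = 0, 0.0
for a in [1, 0, 1]:
    state, r, done = branching_step(state, a, correct_path=[1, 0, 1])
    total += r
```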

Parabola: Continuous Control

Parabola tests algorithms' ability to learn in continuous action spaces, a challenge for $Q$-learning methods in particular. The goal is to mimic the path of a parabola $p(x) = Ax^2 + Bx + C$, where $A$, $B$ and $C$ are constants sampled uniformly from $[-1, 1]$ at the start of the episode. The state at time $t$ is $s_t = (x_t, y_t, A, B, C)$. Transitions are given by $x_{t+1} = x_t + dx$ (default $dx = 0.05$) and $y_{t+1} = y_t + a_t$. The reward at each timestep is the negative squared error, $-\left(y_t-p(x_t)\right)^2$.
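A transition of this task can be sketched directly from the definitions above; the initial state chosen here (starting on the parabola at $x=0$) is our own assumption.

```python
import numpy as np

def parabola_step(state, action, dx=0.05):
    """state = (x, y, A, B, C); the action is added to y, x advances by dx, and
    the reward is the negative squared error against the target parabola."""
    x, y, A, B, C = state
    reward = -((y - (A * x**2 + B * x + C)) ** 2)
    return (x + dx, y + float(action), A, B, C), reward

rng = np.random.default_rng(0)
A, B, C = rng.uniform(-1, 1, size=3)   # coefficients resampled each episode
state = (0.0, C, A, B, C)              # assumed start: on the parabola at x = 0
state, r = parabola_step(state, action=0.01)
```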

LargestSum: High Dimensionality

Many real-world tasks are high-dimensional. LargestSum evaluates how algorithms scale with increasing dimensionality. It is a classification task with binary actions and uniformly sampled states $s \in [0, 1]^{2L}$ (default $L = 25$). The agent is rewarded for taking action $1$ if the sum of the first half $s_{0:L}$ exceeds the sum of the second half $s_{L:2L}$, and for taking action $0$ otherwise.
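The reward rule can be written in a few lines, as in the sketch below (our own naming, following the description above).

```python
import numpy as np

def largest_sum_reward(state, action):
    """Reward 1 if the action matches which half of the state has the larger sum:
    action 1 when the first half wins, action 0 otherwise."""
    L = state.shape[0] // 2
    correct_action = int(state[:L].sum() > state[L:].sum())
    return float(action == correct_action)

state = np.random.default_rng(0).uniform(size=50)  # 2L = 50 dimensions by default
r = largest_sum_reward(state, action=1)
```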

Ability to Generalize

In complex real-world tasks, the learner cannot observe every state during training, so some degree of generalization is required at deployment. The following tasks simulate this challenge by having a (typically large) state space that is only partially explored during training.

InitShift: Distribution Shift

Many LfH algorithms learn from expert demonstrations. This can be problematic when the environment the demonstrations were gathered in differs even slightly from the learner's environment.

To illustrate this problem, we introduce InitShift, a depth-2 full binary tree in which the agent moves left or right until reaching a leaf. The expert starts at the root $s_0$, whereas the learner starts at the left branch $s_1$ and so can only reach leaves $s_3$ and $s_4$. Reward is only given at the leaves. The expert always moves to the highest-reward leaf $s_6$, so any algorithm that relies solely on demonstrations cannot tell whether it is better to go to $s_3$ or $s_4$. By contrast, feedback such as preference comparisons can disambiguate this case.
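The sketch below encodes the tree; the specific leaf rewards are illustrative, beyond $s_6$ being the best leaf as stated above.

```python
# Depth-2 binary tree for InitShift; actions 0/1 move left/right.
children = {"s0": ("s1", "s2"), "s1": ("s3", "s4"), "s2": ("s5", "s6")}
leaf_reward = {"s3": 0.3, "s4": 0.5, "s5": 0.1, "s6": 1.0}  # illustrative values

def rollout(start, actions):
    """Follow left/right actions from `start` until a leaf is reached."""
    state = start
    for a in actions:
        state = children[state][a]
    return state, leaf_reward[state]

expert_leaf, expert_return = rollout("s0", [1, 1])  # the expert always ends at s6
# The demonstration never visits s3 or s4, so it cannot reveal which of the
# learner's reachable leaves is better; a preference query over (s3, s4) can.
learner_best = max(leaf_reward[leaf] for leaf in children["s1"])
```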

ProcGoal: Procedural Generation

In ProcGoal, the agent starts at a random position in a large grid and must navigate to a goal randomly placed in a neighborhood around it. The observation is a 4-dimensional vector containing the $(x,y)$ coordinates of the agent and the goal. The reward at each timestep is the negative Manhattan distance between the two positions. With a large enough grid, generalization is necessary for good performance, since most initial states will not have been seen during training.
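A sketch of the initial-state distribution and reward is given below; the grid and neighborhood radii are placeholders, not the benchmark defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

def procgoal_reset(grid_radius=100, goal_radius=10):
    """Sample an agent position in a large grid and a goal in a nearby neighborhood."""
    agent = rng.integers(-grid_radius, grid_radius + 1, size=2)
    goal = agent + rng.integers(-goal_radius, goal_radius + 1, size=2)
    return np.concatenate([agent, goal]).astype(float)

def procgoal_reward(obs):
    """Negative Manhattan distance between the agent (obs[:2]) and the goal (obs[2:])."""
    return -float(np.abs(obs[:2] - obs[2:]).sum())

obs = procgoal_reset()
r = procgoal_reward(obs)
```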

Sort: Algorithmic Complexity

In Sort, the agent must sort a list of random numbers by swapping elements. The initial state is a vector $x$ sampled uniformly from $[0,1]^n$ (default $n=4$), and an action $a = (i,j)$ swaps $x_i$ and $x_j$. The reward depends on the number of elements in the correct sorted position. To perform well, the learned policy must compare elements; otherwise it will not generalize across randomly sampled initial states.
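The sketch below shows one natural instantiation of the dynamics and reward described above (counting correctly placed elements); it is our illustration rather than the seals implementation.

```python
import numpy as np

def sort_step(x, action):
    """Swap x[i] and x[j], then reward the number of elements in sorted position."""
    i, j = action
    x = x.copy()
    x[i], x[j] = x[j], x[i]
    reward = float(np.sum(x == np.sort(x)))
    return x, reward

x = np.random.default_rng(0).uniform(size=4)
x, r = sort_step(x, action=(0, 2))
```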

Benchmarking Algorithms {#sec:experiments}

Figure 1: Mean episode return (across 15 seeds) of the policy learned by each algorithm ($x$-axis) on each task ($y$-axis). Returns are normalized between $0.0$ for a random policy and $1.0$ for an optimal policy. Grey cells denote that the algorithm cannot run on that task (e.g. the tabular IRL algorithms only run on small tabular tasks). See appendix 9 for full results and confidence intervals.

Experimental Setup

We evaluate a range of commonly used LfH algorithms: Maximum Entropy IRL (MaxEnt IRL; [@ziebart:2008]), Maximum Causal Entropy IRL (MCE IRL; [@ziebart:2010:thesis]), Behavioral Cloning (BC), Generative Adversarial Imitation Learning (GAIL; [@ho:2016]), Adversarial IRL (AIRL; [@fu:2018]) and Deep Reinforcement Learning from Human Preferences (DRLHP; [@christiano:2017]). We also present an RL baseline using Proximal Policy Optimization (PPO; [@schulman:2017]).

We test several variants of these algorithms. We compare multiple implementations of ( and ) and (, and ). We also vary whether the reward function in and is state-only ( and ) or state-action ( and ). All other algorithms use state-action rewards.

We train DRLHP using preference comparisons from the ground-truth reward, and train all other algorithms using demonstrations from an optimal expert policy. We compute the expert policy using value iteration in discrete environments, and procedurally specify the expert in other environments. See appendix 8 for a complete description of the experimental setup and implementations.

Experimental Results {#sec:experiments:results}

For brevity, we highlight a few notable findings, summarizing our results in Figure 1. See appendix 9 for the full results and a more comprehensive analysis.

Implementation Matters. Our results show that and are biased towards prolonging episodes: they achieve worse than random return on the negative-reward variant of EarlyTerm, where the optimal action is to end the episode, but match expert performance on the positive-reward variant. By contrast, is biased towards ending the episode, succeeding on the negative variant but failing on the positive one. This termination bias can be a major confounder when evaluating on variable-horizon tasks.

We also observe several other differences between implementations of the same algorithm. achieves significantly lower return than and on , , and . Moreover, attains near-expert return on while performs worse than random. These results confirm that implementation matters [@islam2017reproducibility; @henderson2018deep; @engstrom:2020], and illustrate how diagnostic tasks can pinpoint how performance varies between implementations.

Rewards vs Policy Learning. Behavioral cloning (BC), which fits a policy to demonstrations using supervised learning, exhibits bimodal performance. BC often attains near-optimal returns. However, in tasks with large continuous state spaces such as and , it performs close to random. We conjecture this is because BC has more difficulty generalizing to novel states than reward learning algorithms, which have an inductive bias towards goal-directed behavior.

Figure 2: DRLHP return on NoisyObs for varying numbers of noise dimensions $L$ (grid size $M=7$). We evaluate across 24 seeds trained for 3M timesteps. Mean returns are depicted as horizontal lines inside the boxes and reported underneath the $x$-axis labels. Boxes span the 95% confidence intervals of the means; whiskers span the range of returns.

Exploration in Preference Comparison. We find that DRLHP, which learns from preference comparisons, achieves lower returns on Branching than algorithms that learn from demonstrations. This is likely because Branching is a hard-exploration task: a specific sequence of actions must be taken to obtain any reward. For DRLHP to succeed, it must first discover this sequence, whereas algorithms that learn from demonstrations can simply mimic the expert.

While this problem is particularly acute in Branching, exploration is likely to limit the performance of DRLHP in other environments as well. To investigate further, we varied the number of noise dimensions $L$ in NoisyObs from $5$ to $500$, reporting DRLHP's performance in Figure 2. Increasing $L$ decreases both the maximum and the variance of the return. This results in a higher mean return at $L=50$ than at $L=5$ (high variance) or $L=500$ (low maximum).

We conjecture this behavior arises partly because DRLHP compares trajectories sampled from a policy optimized for its current best-guess reward. If the policy becomes low-entropy too soon, DRLHP will fail to explore sufficiently. Adding stochasticity stabilizes the training process, but makes it harder to recover the true reward.

Case Study: Improving Implementations {#sec:case-study}

In the previous section, we showed how DERAIL can be used to compare existing implementations of reward and imitation learning algorithms. However, benchmarks are also often used during the development of new algorithms and implementations. We believe diagnostic task suites are particularly well-suited to rapid prototyping: the tasks are lightweight, so tests run quickly, yet they are sufficiently varied to catch a wide range of bugs and give a glimpse of effects in more complex environments. To illustrate this workflow, we present a case study refining an implementation of Deep Reinforcement Learning from Human Preferences (DRLHP) [@christiano:2017].

As discussed in section 4.2, the DRLHP implementation we experimented with has high variance across random seeds. We conjecture this occurs because the preference queries are insufficiently diverse. The queries are sampled from rollouts of a policy, so their diversity depends on the stochasticity of the environment and the policy. Indeed, Figure 2 shows that DRLHP is more stable when environment stochasticity increases.

The fundamental issue is that DRLHP's query distribution depends on the policy, which is in turn trained to maximize DRLHP's predicted reward. This entanglement makes the procedure liable to get stuck in local minima. Suppose that, midway through training, the policy chances upon a previously unseen, high-reward state. The predicted reward at this unseen state will be essentially random, and so likely lower than that of a previously seen, medium-reward state. The policy will thus be trained to avoid the high-reward state, starving DRLHP of the queries that would allow it to learn in this region.

In an attempt to address this issue, we experiment with a few simple modifications to DRLHP: slowing policy training, adding $\epsilon$-greedy exploration, and adding an exploration bonus (summarized in the caption of Figure 3).
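As a concrete illustration of the exploration-style change, the sketch below layers $\epsilon$-greedy action selection on top of a learned policy; the code and names are our own, not the implementation we benchmarked.

```python
import numpy as np

def epsilon_greedy(policy_action, n_actions, epsilon, rng):
    """With probability epsilon take a uniformly random action, otherwise defer
    to the policy's chosen action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(policy_action)

rng = np.random.default_rng(0)
action = epsilon_greedy(policy_action=1, n_actions=2, epsilon=0.1, rng=rng)
```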

Figure 3: Return of DRLHP and three variants: slower policy training, $\epsilon$-greedy exploration, and an exploration bonus (see section 5). Mean episode return (across 15 seeds) of the policy learned by each algorithm ($x$-axis) on each task ($y$-axis). Returns are normalized between $0.0$ for a random policy and $1.0$ for an optimal policy.

We report the return achieved with these modifications in Figure 3. produces the most stable results: its returns are all comparable to or substantially higher than those of the original DRLHP. However, all modifications increase returns on the hard-exploration task Branching, although for the improvement is modest. and also enjoy significant improvements on the high-dimensional classification task LargestSum, which likely benefits from more balanced labels. performs poorly on and : we conjecture that the large state space caused it to explore excessively.

This case study shows how DERAIL can help rapidly test new prototypes, quickly confirming or discrediting hypotheses about how a change will affect a given algorithm. Moreover, it provides a fine-grained understanding of performance along different axes. For example, we could conclude that does increase exploration (higher return on ) but may over-explore (lower return on ). It would be difficult to disentangle these distinct effects in more complex environments.

Discussion

We have developed, to the best of our knowledge, the first suite of diagnostic environments for reward and imitation learning algorithms. We find that by isolating particular algorithmic capabilities, diagnostic tasks can provide a more nuanced picture of individual algorithms' strengths and weaknesses than testing on more complex benchmarks. Our results confirm that reward and imitation learning algorithm performance is highly sensitive to implementation details. Furthermore, we have demonstrated the fragility of behavioral cloning, and obtained qualitative insights into the performance of preference-based reward learning. Finally, we have illustrated in a case study how DERAIL can support rapid prototyping of algorithmic refinements.

In designing the task suite, we have leveraged our personal experience as well as past work documenting design flaws and implementation bugs [@ziebart:2010:thesis; @kostrikov2018discriminator]. We expect to refine and extend the suite in response to user feedback, and we encourage other researchers to develop complementary tasks. Our environments are open-source and available at https://github.com/HumanCompatibleAI/seals.

Acknowledgements {#acknowledgements .unnumbered}

We would like to thank Rohin Shah and Andrew Critch for feedback during the initial stages of this project, and Scott Emmons, Cody Wild, Lawrence Chan, Daniel Filan and Michael Dennis for feedback on earlier drafts of the paper.

Full specification of tasks {#sec:appendix:tasks-specification}

Experimental setup {#sec:appendix:experimental-setup}

Algorithms

The exact code for running the experiments and generating the plots can be found at https://github.com/HumanCompatibleAI/derail.

Imitation learning and IRL algorithms are trained using rollouts from an optimal policy. The number of expert timesteps provided equals the number of timesteps each algorithm runs for. For DRLHP, trajectories are compared using the ground-truth reward, and the trajectory queries are generated from the policy being learned jointly with the reward.

We used open source implementations of these algorithms, as listed in Table 1. We did not perform hyperparameter tuning, and relied on default values for most hyperparameters.

| Source Code | Algorithms |
| --- | --- |
| @stable-baselines:2018 | |
| @fu-inverse-rl:2018 | |
| @imitation:2020 | |
| @evaluating-rewards:2020 | |

Table 1: Sources of algorithm implementations (some of which were slightly adapted).

Evaluation

We run each algorithm with 15 different seeds and 500,000 timesteps. To evaluate a policy, we compute the exact expected episode return in discrete state environments. In other environments, we compute the average return over 1000 episodes. The score in a task is the mean return of the learned policy, normalized such that a policy taking random actions gets a score of 0.0 and the expert gets a score of 1.0. For , poor policies can reach values smaller than -3.0; to keep scores in a similar range to other tasks, we truncate negative values at -1.0.
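For reference, the normalization can be written as the simple rescaling below, following the description above (the function name is ours).

```python
def normalized_score(mean_return, random_return, expert_return, floor=-1.0):
    """Rescale returns so a random policy scores 0.0 and the expert scores 1.0,
    truncating very poor policies at `floor` as described above."""
    score = (mean_return - random_return) / (expert_return - random_return)
    return max(score, floor)

# Example: a policy halfway between random and expert scores 0.5.
assert normalized_score(5.0, random_return=0.0, expert_return=10.0) == 0.5
```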

Complete experimental results {#sec:appendix:experimental-results}

We provide results and analysis grouped by individual tasks (section 9.1) and algorithms (section 9.2). The results are presented using boxplots, such as Figure 4. The $y$-axis shows the return of the learned policy, while the $x$-axis lists the different algorithms or tasks. Each point corresponds to a different seed. The means across seeds are shown as horizontal lines inside the boxes, the boxes span bootstrapped 95% confidence intervals of the means, and the whiskers show the full range of returns. Each box is assigned a different color purely to aid visual distinction; the colors have no semantic meaning.

Tasks {#sec:appendix:task-results}

RiskyPath: performs poorly as expected, while performs well. The other algorithms evaluated consider state-action pairs individually rather than whole trajectories, and so avoid the problem of risky behavior.


Branching: algorithms that learn from expert demonstrations tend to perform well, since they require limited exploration. By contrast, can struggle to perform enough exploration to consistently find the goal and learn the correct reward. Note that needs to find the goal multiple times in order to update the reward significantly.

InitShift: algorithms based on expert demonstrations fail, since the expert trajectories do not include the learner's initial state. By contrast, the task is trivial for , which can compare trajectories generated at training time. Moreover, algorithms that learn state-action rewards from demonstrations perform worse than random: the expert trajectories contain only action $1$, so the learned rewards tend to assign positive weight to action $1$, whereas the optimal action under the learner's initial state distribution is action $0$. , and learn state-only reward functions and perform closer to the random policy.

NoisyObs: BC achieves near-optimal performance, demonstrating that supervised learning can be more robust and sample-efficient in the presence of noise than other LfH algorithms. We also see that performs poorly relative to and , which underscores the importance of the subtle differences between these implementations.

Parabola: most algorithms perform well, except for , and .

LargestSum: most algorithms fail to achieve expert performance, while does match it, suggesting that scaling algorithms like and to high-dimensional tasks may be a fruitful direction for future work. Methods using state-only reward functions, and , perform poorly since the reward for this task depends on the actions taken.

ProcGoal: achieves expert performance, while most other algorithms obtain a reasonable but lower score, with high variance across seeds. Although this task requires generalization, the discrete states and actions may make it easier for to generalize here than in or , where it performs poorly. One interesting result is that performs better than , whereas performs better than in other tasks.

Sort: most algorithms achieve reasonable (but sub-expert) performance. Intriguingly, achieves higher returns than most algorithms, with low variance. Learning a good policy in this task is challenging: even PPO did not reach expert performance in all seeds. fails to obtain any reward. We also find that performs better than , while performs better than . A state-only reward may be easier to learn because it has fewer parameters and the ground-truth reward is indeed state-only, but state-action rewards can also incentivize the right policy by assigning higher reward to the correct action, making planning easier.

Algorithms {#sec:appendix:algo-results}

PPO: serves as the RL baseline. We would expect most reward and imitation learning algorithms to obtain lower returns, since they must learn a policy without knowing the reward. Most seeds achieve close to expert performance.

BC: exhibits bimodal performance, either attaining near-expert return ($1.0$, normalized) in an environment or performing close to random ($0.0$). Returns are similar across seeds. Behavioral cloning attains relatively low returns in and , which have continuous observation spaces requiring generalization and sequential decision making.


Tabular IRL: executed only on tasks with discrete state and action spaces and a fixed horizon. As expected, obtains a low return in . For InitShift, the expert demonstrations provide no information for choosing among the subset of states accessible during learning, so the algorithm obtains a score that depends on the randomly initialized reward.

[^1]: Work partially conducted during an internship at UC Berkeley.