DERAIL: Diagnostic Environments for Reward And Imitation Learning

http://arxiv.org/abs/2012.01365v1

Abstract: The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case study of a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at https://github.com/HumanCompatibleAI/seals.


Introduction

Reinforcement learning (RL) optimizes a fixed reward function specified by the designer. This works well in artificial domains with well-specified reward functions such as games [@silver:2016; @vinyals:2019; @openai:2019]. However, in many real-world tasks the agent must interact with users who have complex and heterogeneous preferences. We would like the AI system to satisfy users' preferences, but the designer cannot perfectly anticipate users' desires, let alone procedurally specify them. This challenge has led to a proliferation of methods seeking to learn a reward function from user data [@ng:2000; @ziebart:2008; @christiano:2017; @fu:2018; @cabi:2019], or imitate demonstrations [@ross:2011; @ho:2016; @reddy:2020]. Collectively, we say algorithms that learn a reward or policy from human data are Learning from Humans (LfH).

LfH algorithms are primarily evaluated empirically, making benchmarks critical to progress in the field. Historically, evaluation has used RL benchmark suites. In recognition of important differences between RL and LfH, recent work has developed imitation learning benchmarks in complex simulated robotics environments with visual observations [@memmesheimer:2019; @james:2020].

In this paper, we develop a complementary approach using simple diagnostic environments that test individual aspects of LfH performance in isolation. Similar diagnostic tasks have been applied fruitfully to RL [@osband:2020], and diagnostic datasets have long been popular in natural language processing [@johnson:2017; @sinha:2019; @kottur:2019; @liu:2019; @wang:2019]. Diagnostic tasks are analogous to unit-tests: while less realistic than end-to-end tests, they have the benefit of being fast, reliable and able to isolate failures [@myers2011art; @wacker:2015]. Isolating failure is particularly important in machine learning, where small implementation details may have major effects on the results [@islam2017reproducibility].

This paper contributes the first suite of diagnostic environments designed for LfH algorithms. We evaluate a range of LfH algorithms on these tasks. Our results in section 4 show that, like deep RL [@henderson2018deep; @engstrom:2020], imitation learning is very sensitive to implementation details. Moreover, the diagnostic tasks isolate particular implementation differences that affect performance, such as positive or negative bias in the discriminator. Additionally, our results suggest that a widely-used preference-based reward learning algorithm [@christiano:2017] suffers from limited exploration. In section 5, we propose and evaluate several possible improvements using our suite, illustrating how it supports rapid prototyping of algorithmic refinements.

Designing Diagnostic Tasks {#sec:desiderata}

In principle, LfH algorithms can be evaluated in any Markov decision process. Designers of benchmark suites must reduce this large set of possibilities to a small set of tractable tasks that discriminate between algorithms. We propose three key desiderata to guide the creation of diagnostic tasks.

Isolation. Each task should test a single dimension of interest. The dimension could be a capability, such as robustness to noise; or the absence of a common failure mode, such as episode termination bias [@kostrikov2018discriminator]. Keeping tests narrow ensures failures pinpoint areas where an algorithm requires improvement. By contrast, an algorithm's performance on more general tasks has many confounders.

Parsimony. Tasks should be as simple as possible. This maintains compatibility with a broad range of algorithms. Furthermore, it ensures the tests run quickly, enabling a more rapid development cycle and sufficient replicas to achieve low-variance results. However, tasks may need to be computationally demanding in special cases, such as testing if an algorithm can scale to high-dimensional inputs.

Coverage. The benchmark suite should test a broad range of capabilities. This gives confidence that an algorithm passing all diagnostic tasks will perform well on general-purpose tasks. For example, a benchmark suite might want to test categories as varied as exploration ability, the absence of common design flaws and bugs, and robustness to shifts in the transition dynamics.

Tasks

In this section, we outline the suite of tasks we developed following the guidelines from the previous section. Some tasks have configuration parameters that allow their difficulty to be adjusted. A full specification of the tasks can be found in appendix 7.

Design Flaws and Implementation Bugs

First, we describe tasks that check for fundamental issues in the design and implementation of algorithms.

RiskyPath: Stochastic Transitions

Many LfH algorithms are derived from Maximum Entropy Inverse Reinforcement Learning [@ziebart:2008], which models the demonstrator as producing trajectories with probability $p(\tau) \propto \exp R(\tau)$. This model implies that a demonstrator can "control" the environment well enough to follow any high-reward trajectory with high probability [@ziebart:2010:thesis]. However, in stochastic environments, the agent cannot control the probability of each trajectory independently. This misspecification may lead to poor behavior.

To demonstrate and test for this issue, we designed RiskyPath, illustrated in Figure [fig:task:risky-path]. The agent starts at $s_0$ and can reach the goal $s_2$ (reward $1.0$) either by taking the safe path $s_0 \to s_1 \to s_2$, or by taking a risky action with equal chances of leading to $s_3$ (reward $-100.0$) or $s_2$. The safe path has the highest expected return, but the risky action sometimes reaches the goal $s_2$ in fewer timesteps, yielding a higher best-case return. Algorithms that fail to correctly handle stochastic dynamics may therefore wrongly conclude that the reward favors the risky path.
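To make the failure mode concrete, the sketch below contrasts expected returns under the true stochastic dynamics with the weights the MaxEnt trajectory model assigns. The code and the specific return values are our own illustration, not the benchmark's internals.

```python
import math

# Illustrative trajectory returns for RiskyPath (not the benchmark's exact values):
# the lucky risky trajectory reaches the goal sooner, so its return is slightly
# higher than the safe path's, but the unlucky outcome is catastrophic.
returns = {"safe": 3.0, "risky_lucky": 4.0, "risky_unlucky": -100.0}

# Expected returns under the true stochastic dynamics: the risky action is bad.
expected_safe = returns["safe"]                                              # 3.0
expected_risky = 0.5 * (returns["risky_lucky"] + returns["risky_unlucky"])   # -48.0

# The MaxEnt model p(tau) proportional to exp R(tau) weights each trajectory as
# if it could be chosen deliberately, so most probability mass lands on the
# lucky risky trajectory, wrongly suggesting the demonstrator prefers risk.
weights = {name: math.exp(r) for name, r in returns.items()}
total = sum(weights.values())
print({name: round(w / total, 3) for name, w in weights.items()})
print("expected safe:", expected_safe, "expected risky:", expected_risky)
```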

EarlyTerm: Early Termination

Many implementations of imitation learning algorithms incorrectly assign a value of zero to terminal states [@kostrikov2018discriminator]. Depending on the sign of the learned reward in non-terminal states, this can bias the agent either to end episodes early or to prolong them as long as possible. This confounds evaluation, as performance is spuriously high on tasks where the termination bias happens to align with the task objective. @kostrikov2018discriminator demonstrate this behavior with a simple example, which we adapt here as two variants of EarlyTerm, one with positive and one with negative rewards.

The environment is a 3-state MDP in which the agent can either alternate between two initial states until reaching the time horizon, or move to a terminal state, ending the episode early. In the positive variant all rewards are $+1$, while in the negative variant all rewards are $-1$. Algorithms biased towards early termination (e.g. because they assign a negative reward to all states) will do well on the negative variant and poorly on the positive one. Conversely, algorithms biased towards late termination will do well on the positive variant and poorly on the negative one.
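The sketch below illustrates the mechanism: with an implicit zero reward after termination, the sign of a constant learned reward alone determines whether ending or prolonging the episode looks better. Names and numbers are ours, for illustration only.

```python
# Suppose a learned reward assigns a constant c to every non-terminal state and
# (implicitly) zero once the episode has ended; the horizon is H.
def learned_return(c, steps_alive):
    """Undiscounted return if the episode ends after `steps_alive` steps."""
    return c * steps_alive  # steps after termination contribute the implicit 0.0

H = 10
for c in (+1.0, -1.0):
    early, late = learned_return(c, 1), learned_return(c, H)
    bias = "prolonging episodes" if late > early else "terminating early"
    print(f"constant learned reward {c:+}: biased towards {bias}")
```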

Core Capabilities

In this subsection, we consider tasks that focus on a core algorithmic capability for reward and imitation learning.

NoisyObs: Robust Learning

NoisyObs, illustrated in Figure [fig:noisy-obs], tests for robustness to noise. The agent starts at one of the corners of an $M \times M$ grid chosen at random (default $M = 5$), and tries to reach and stay at the center. The observation vector consists of the agent's $(x,y)$ coordinates in the first two elements, and $L$ "distractor" samples of Gaussian noise as the remaining elements (default $L=20$). The challenge is to select the relevant features of the observation rather than overfit to noise [@guyon:2003].
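A minimal sketch of how such an observation could be assembled is shown below; the function name and defaults are ours, not the seals API.

```python
import numpy as np

def noisy_obs(pos_xy, n_distractors=20, rng=None):
    """Concatenate the agent's (x, y) grid position with Gaussian distractor noise."""
    rng = np.random.default_rng() if rng is None else rng
    distractors = rng.normal(size=n_distractors)
    return np.concatenate([np.asarray(pos_xy, dtype=float), distractors])

obs = noisy_obs((2, 3))  # shape (22,): only the first two dimensions are informative
```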

Branching: Exploration

We include the Branching task to test LfH algorithms' exploration ability. The agent must traverse a specific path of length $L$ to reach a final goal (default $L=10$), with $B$ choices at each step (default $B=2$). Making the wrong choice at any of the $L$ decision points leads to a dead end with zero reward.
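The sketch below captures the structure of such a task under our own conventions (state as progress along the path, wrong turns absorbing); it is an illustration, not the benchmark implementation.

```python
def branching_step(progress, action, correct_path):
    """One step of a Branching-style chain: `progress` counts correct choices so far.

    Any wrong choice is an absorbing dead end with zero reward; reward 1 is
    given only on reaching the goal at the end of the path.
    """
    if progress < 0 or progress >= len(correct_path):  # dead end or goal: absorbing
        return progress, 0.0, True
    if action != correct_path[progress]:
        return -1, 0.0, True                           # wrong turn: dead end
    progress += 1
    done = progress == len(correct_path)
    return progress, (1.0 if done else 0.0), done

# With B = 2 choices and L = 3 steps, only one of 2**3 = 8 action sequences succeeds.
state, total = 0, 0.0
for a in [1, 0, 1]:
    state, r, done = branching_step(state, a, correct_path=[1, 0, 1])
    total += r
```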

Parabola: Continuous Control

Parabola tests algorithms' ability to learn in continuous action spaces, a challenge for $Q$-learning methods in particular. The goal is to mimic the path of a parabola $p(x) = Ax^2 + Bx + C$, where $A$, $B$ and $C$ are constants sampled uniformly from $[-1, 1]$ at the start of the episode. The state at time $t$ is $s_t = (x_t, y_t, A, B, C)$. Transitions are given by $x_{t+1} = x_t + dx$ (default $dx = 0.05$) and $y_{t+1} = y_t + a_t$. The reward at each timestep is the negative squared error, $-\left(y_t-p(x_t)\right)^2$.
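A transition of this task can be sketched directly from the definitions above; the initial state chosen here (starting on the parabola at $x=0$) is our own assumption.

```python
import numpy as np

def parabola_step(state, action, dx=0.05):
    """state = (x, y, A, B, C); the action is added to y, x advances by dx, and
    the reward is the negative squared error against the target parabola."""
    x, y, A, B, C = state
    reward = -((y - (A * x**2 + B * x + C)) ** 2)
    return (x + dx, y + float(action), A, B, C), reward

rng = np.random.default_rng(0)
A, B, C = rng.uniform(-1, 1, size=3)   # coefficients resampled each episode
state = (0.0, C, A, B, C)              # assumed start: on the parabola at x = 0
state, r = parabola_step(state, action=0.01)
```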

LargestSum: High Dimensionality

Many real-world tasks are high-dimensional. LargestSum evaluates how algorithms scale with increasing dimensionality. It is a classification task with binary actions and uniformly sampled states $s \in [0, 1]^{2L}$ (default $L = 25$). The agent is rewarded for taking action $1$ if the sum of the first half $s_{0:L}$ exceeds the sum of the second half $s_{L:2L}$, and for taking action $0$ otherwise.
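The reward rule can be written in a few lines, as in the sketch below (our own naming, following the description above).

```python
import numpy as np

def largest_sum_reward(state, action):
    """Reward 1 if the action matches which half of the state has the larger sum:
    action 1 when the first half wins, action 0 otherwise."""
    L = state.shape[0] // 2
    correct_action = int(state[:L].sum() > state[L:].sum())
    return float(action == correct_action)

state = np.random.default_rng(0).uniform(size=50)  # 2L = 50 dimensions by default
r = largest_sum_reward(state, action=1)
```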

Ability to Generalize

In complex real-world tasks, the learner cannot observe every state during training, so some degree of generalization is required at deployment. The following tasks simulate this challenge by having a (typically large) state space that is only partially explored during training.

InitShift: Distribution Shift

Many LfH algorithms learn from expert demonstrations. This can be problematic when the environment the demonstrations were gathered in differs even slightly from the learner's environment.

To illustrate this problem, we introduce InitShift, a depth-2 full binary tree in which the agent moves left or right until reaching a leaf. The expert starts at the root $s_0$, whereas the learner starts at the left branch $s_1$ and so can only reach leaves $s_3$ and $s_4$. Reward is only given at the leaves. The expert always moves to the highest-reward leaf $s_6$, so any algorithm that relies solely on demonstrations cannot tell whether it is better to go to $s_3$ or $s_4$. By contrast, feedback such as preference comparisons can disambiguate this case.
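The sketch below encodes the tree; the specific leaf rewards are illustrative, beyond $s_6$ being the best leaf as stated above.

```python
# Depth-2 binary tree for InitShift; actions 0/1 move left/right.
children = {"s0": ("s1", "s2"), "s1": ("s3", "s4"), "s2": ("s5", "s6")}
leaf_reward = {"s3": 0.3, "s4": 0.5, "s5": 0.1, "s6": 1.0}  # illustrative values

def rollout(start, actions):
    """Follow left/right actions from `start` until a leaf is reached."""
    state = start
    for a in actions:
        state = children[state][a]
    return state, leaf_reward[state]

expert_leaf, expert_return = rollout("s0", [1, 1])  # the expert always ends at s6
# The demonstration never visits s3 or s4, so it cannot reveal which of the
# learner's reachable leaves is better; a preference query over (s3, s4) can.
learner_best = max(leaf_reward[leaf] for leaf in children["s1"])
```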

ProcGoal: Procedural Generation

In ProcGoal, the agent starts at a random position in a large grid and must navigate to a goal randomly placed in a neighborhood around it. The observation is a 4-dimensional vector containing the $(x,y)$ coordinates of the agent and the goal. The reward at each timestep is the negative Manhattan distance between the two positions. With a large enough grid, generalization is necessary for good performance, since most initial states will not have been seen during training.
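A sketch of the initial-state distribution and reward is given below; the grid and neighborhood radii are placeholders, not the benchmark defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

def procgoal_reset(grid_radius=100, goal_radius=10):
    """Sample an agent position in a large grid and a goal in a nearby neighborhood."""
    agent = rng.integers(-grid_radius, grid_radius + 1, size=2)
    goal = agent + rng.integers(-goal_radius, goal_radius + 1, size=2)
    return np.concatenate([agent, goal]).astype(float)

def procgoal_reward(obs):
    """Negative Manhattan distance between the agent (obs[:2]) and the goal (obs[2:])."""
    return -float(np.abs(obs[:2] - obs[2:]).sum())

obs = procgoal_reset()
r = procgoal_reward(obs)
```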

Sort: Algorithmic Complexity

In Sort, the agent must sort a list of random numbers by swapping elements. The initial state is a vector $x$ sampled uniformly from $[0,1]^n$ (default $n=4$), and an action $a = (i,j)$ swaps $x_i$ and $x_j$. The reward depends on the number of elements in the correct sorted position. To perform well, the learned policy must compare elements; otherwise it will not generalize across randomly sampled initial states.
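The sketch below shows one natural instantiation of the dynamics and reward described above (counting correctly placed elements); it is our illustration rather than the seals implementation.

```python
import numpy as np

def sort_step(x, action):
    """Swap x[i] and x[j], then reward the number of elements in sorted position."""
    i, j = action
    x = x.copy()
    x[i], x[j] = x[j], x[i]
    reward = float(np.sum(x == np.sort(x)))
    return x, reward

x = np.random.default_rng(0).uniform(size=4)
x, r = sort_step(x, action=(0, 2))
```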

Benchmarking Algorithms {#sec:experiments}

Figure 1: Mean episode return (across 15 seeds) of the policy learned by each algorithm ($x$-axis) on each task ($y$-axis). Returns are normalized between $0.0$ for a random policy and $1.0$ for an optimal policy. Grey cells denote that the algorithm cannot run on that task (e.g. the tabular IRL algorithms only run on small tabular tasks). See appendix 9 for full results and confidence intervals.

Experimental Setup

We evaluate a range of commonly used LfH algorithms: Maximum Entropy IRL (MaxEnt IRL; [@ziebart:2008]), Maximum Causal Entropy IRL (MCE IRL; [@ziebart:2010:thesis]), Behavioral Cloning (BC), Generative Adversarial Imitation Learning (GAIL; [@ho:2016]), Adversarial IRL (AIRL; [@fu:2018]) and Deep Reinforcement Learning from Human Preferences (DRLHP; [@christiano:2017]). We also present an RL baseline using Proximal Policy Optimization (PPO; [@schulman:2017]).

We test several variants of these algorithms. We compare multiple implementations of ( and ) and (, and ). We also vary whether the reward function in and is state-only ( and ) or state-action ( and ). All other algorithms use state-action rewards.

We train DRLHP using preference comparisons from the ground-truth reward, and train all other algorithms using demonstrations from an optimal expert policy. We compute the expert policy using value iteration in discrete environments, and procedurally specify the expert in other environments. See appendix 8 for a complete description of the experimental setup and implementations.

Experimental Results {#sec:experiments:results}

For brevity, we highlight a few notable findings, summarizing our results in Figure 1. See appendix 9 for the full results and a more comprehensive analysis.

Implementation Matters. Our results show that and are biased towards prolonging episodes: they achieve worse than random return on the negative-reward variant of EarlyTerm, where the optimal action is to end the episode, but match expert performance on the positive-reward variant. By contrast, is biased towards ending the episode, succeeding on the negative variant but failing on the positive one. This termination bias can be a major confounder when evaluating on variable-horizon tasks.

We also observe several other differences between implementations of the same algorithm. achieves significantly lower return than and on , , and . Moreover, attains near-expert return on while performs worse than random. These results confirm that implementation matters [@islam2017reproducibility; @henderson2018deep; @engstrom:2020], and illustrate how diagnostic tasks can pinpoint how performance varies between implementations.

Rewards vs Policy Learning. Behavioral cloning (BC), which fits a policy to demonstrations using supervised learning, exhibits bimodal performance. BC often attains near-optimal returns. However, in tasks with large continuous state spaces such as and , it performs close to random. We conjecture this is because BC has more difficulty generalizing to novel states than reward learning algorithms, which have an inductive bias towards goal-directed behavior.

Figure 2: DRLHP return on NoisyObs for varying numbers of noise dimensions $L$ (grid size $M=7$). We evaluate across 24 seeds trained for 3M timesteps. Mean returns are depicted as horizontal lines inside the boxes and reported underneath the $x$-axis labels. Boxes span the 95% confidence intervals of the means; whiskers span the range of returns.

Exploration in Preference Comparison. We find that DRLHP, which learns from preference comparisons, achieves lower returns on Branching than algorithms that learn from demonstrations. This is likely because Branching is a hard-exploration task: a specific sequence of actions must be taken to obtain any reward. For DRLHP to succeed, it must first discover this sequence, whereas algorithms that learn from demonstrations can simply mimic the expert.

While this problem is particularly acute in Branching, exploration is likely to limit the performance of DRLHP in other environments as well. To investigate further, we varied the number of noise dimensions $L$ in NoisyObs from $5$ to $500$, reporting DRLHP's performance in Figure 2. Increasing $L$ decreases both the maximum and the variance of the return. This results in a higher mean return at $L=50$ than at $L=5$ (high variance) or $L=500$ (low maximum).

We conjecture this behavior arises partly because DRLHP compares trajectories sampled from a policy optimized for its current best-guess reward. If the policy becomes low-entropy too soon, DRLHP will fail to explore sufficiently. Adding stochasticity stabilizes the training process, but makes it harder to recover the true reward.

Case Study: Improving Implementations {#sec:case-study}

In the previous section, we showed how DERAIL can be used to compare existing implementations of reward and imitation learning algorithms. However, benchmarks are also often used during the development of new algorithms and implementations. We believe diagnostic task suites are particularly well-suited to rapid prototyping: the tasks are lightweight, so tests run quickly, yet they are sufficiently varied to catch a wide range of bugs and give a glimpse of effects in more complex environments. To illustrate this workflow, we present a case study refining an implementation of Deep Reinforcement Learning from Human Preferences (DRLHP) [@christiano:2017].

As discussed in section 4.2, the DRLHP implementation we experimented with has high variance across random seeds. We conjecture this occurs because the preference queries are insufficiently diverse. The queries are sampled from rollouts of a policy, so their diversity depends on the stochasticity of the environment and the policy. Indeed, Figure 2 shows that DRLHP is more stable when environment stochasticity increases.

The fundamental issue is that DRLHP's query distribution depends on the policy, which is in turn trained to maximize DRLHP's predicted reward. This entanglement makes the procedure liable to get stuck in local minima. Suppose that, midway through training, the policy chances upon a previously unseen, high-reward state. The predicted reward at this unseen state will be essentially random, and so likely lower than that of a previously seen, medium-reward state. The policy will thus be trained to avoid the high-reward state, starving DRLHP of the queries that would allow it to learn in this region.

In an attempt to address this issue, we experiment with a few simple modifications to DRLHP: slowing policy training, adding $\epsilon$-greedy exploration, and adding an exploration bonus (summarized in the caption of Figure 3).
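As a concrete illustration of the exploration-style change, the sketch below layers $\epsilon$-greedy action selection on top of a learned policy; the code and names are our own, not the implementation we benchmarked.

```python
import numpy as np

def epsilon_greedy(policy_action, n_actions, epsilon, rng):
    """With probability epsilon take a uniformly random action, otherwise defer
    to the policy's chosen action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(policy_action)

rng = np.random.default_rng(0)
action = epsilon_greedy(policy_action=1, n_actions=2, epsilon=0.1, rng=rng)
```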

Figure 3: Return of DRLHP and three variants: slower policy training, $\epsilon$-greedy exploration, and an exploration bonus (see section 5). Mean episode return (across 15 seeds) of the policy learned by each algorithm ($x$-axis) on each task ($y$-axis). Returns are normalized between $0.0$ for a random policy and $1.0$ for an optimal policy.

We report the return achieved with these modifications in Figure 3. produces the most stable results: its returns are all comparable to or substantially higher than those of the original DRLHP. However, all modifications increase returns on the hard-exploration task Branching, although for the improvement is modest. and also enjoy significant improvements on the high-dimensional classification task LargestSum, which likely benefits from more balanced labels. performs poorly on and : we conjecture that the large state space caused it to explore excessively.

This case study shows how DERAIL can help rapidly test new prototypes, quickly confirming or discrediting hypotheses about how a change will affect a given algorithm. Moreover, it provides a fine-grained understanding of performance along different axes. For example, we could conclude that does increase exploration (higher return on ) but may over-explore (lower return on ). It would be difficult to disentangle these distinct effects in more complex environments.

Discussion

We have developed, to the best of our knowledge, the first suite of diagnostic environments for reward and imitation learning algorithms. We find that by isolating particular algorithmic capabilities, diagnostic tasks can provide a more nuanced picture of individual algorithms' strengths and weaknesses than testing on more complex benchmarks. Our results confirm that reward and imitation learning algorithm performance is highly sensitive to implementation details. Furthermore, we have demonstrated the fragility of behavioral cloning, and obtained qualitative insights into the performance of preference-based reward learning. Finally, we have illustrated in a case study how DERAIL can support rapid prototyping of algorithmic refinements.

In designing the task suite, we have leveraged our personal experience as well as past work documenting design flaws and implementation bugs [@ziebart:2010:thesis; @kostrikov2018discriminator]. We expect to refine and extend the suite in response to user feedback, and we encourage other researchers to develop complementary tasks. Our environments are open-source and available at https://github.com/HumanCompatibleAI/seals.

Acknowledgements {#acknowledgements .unnumbered}

We would like to thank Rohin Shah and Andrew Critch for feedback during the initial stages of this project, and Scott Emmons, Cody Wild, Lawrence Chan, Daniel Filan and Michael Dennis for feedback on earlier drafts of the paper.

Full specification of tasks {#sec:appendix:tasks-specification}

Experimental setup {#sec:appendix:experimental-setup}

Algorithms

The exact code for running the experiments and generating the plots can be found at https://github.com/HumanCompatibleAI/derail.

Imitation learning and IRL algorithms are trained using rollouts from an optimal policy. The number of expert timesteps provided equals the number of timesteps each algorithm runs for. For DRLHP, trajectories are compared using the ground-truth reward, and the trajectory queries are generated from the policy being learned jointly with the reward.

We used open source implementations of these algorithms, as listed in Table 1. We did not perform hyperparameter tuning, and relied on default values for most hyperparameters.

| Source Code | Algorithms |
| --- | --- |
| @stable-baselines:2018 | |
| @fu-inverse-rl:2018 | |
| @imitation:2020 | |
| @evaluating-rewards:2020 | |

Table 1: Sources of algorithm implementations (some of which were slightly adapted).

Evaluation

We run each algorithm with 15 different seeds and 500,000 timesteps. To evaluate a policy, we compute the exact expected episode return in discrete state environments. In other environments, we compute the average return over 1000 episodes. The score in a task is the mean return of the learned policy, normalized such that a policy taking random actions gets a score of 0.0 and the expert gets a score of 1.0. For , poor policies can reach values smaller than -3.0; to keep scores in a similar range to other tasks, we truncate negative values at -1.0.
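For reference, the normalization can be written as the simple rescaling below, following the description above (the function name is ours).

```python
def normalized_score(mean_return, random_return, expert_return, floor=-1.0):
    """Rescale returns so a random policy scores 0.0 and the expert scores 1.0,
    truncating very poor policies at `floor` as described above."""
    score = (mean_return - random_return) / (expert_return - random_return)
    return max(score, floor)

# Example: a policy halfway between random and expert scores 0.5.
assert normalized_score(5.0, random_return=0.0, expert_return=10.0) == 0.5
```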

Complete experimental results {#sec:appendix:experimental-results}

We provide results and analysis grouped by individual tasks (section 9.1) and algorithms (section 9.2). The results are presented using boxplots, such as Figure 4. The $y$-axis shows the return of the learned policy, while the $x$-axis lists the different algorithms or tasks. Each point corresponds to a different seed. The means across seeds are shown as horizontal lines inside the boxes, the boxes span bootstrapped 95% confidence intervals of the means, and the whiskers show the full range of returns. Each box is assigned a different color purely to aid visual distinction; the colors have no semantic meaning.

Tasks {#sec:appendix:task-results}

RiskyPath: performs poorly as expected, while performs well. The other algorithms evaluated consider state-action pairs individually rather than whole trajectories, and so avoid the problem of risky behavior.


Branching: algorithms that learn from expert demonstrations tend to perform well, since they require limited exploration. By contrast, can struggle to perform enough exploration to consistently find the goal and learn the correct reward. Note that needs to find the goal multiple times in order to update the reward significantly.

InitShift: algorithms based on expert demonstrations fail, since the expert trajectories do not include the learner's initial state. By contrast, the task is trivial for , which can compare trajectories generated at training time. Moreover, algorithms that learn state-action rewards from demonstrations perform worse than random: the expert trajectories contain only action $1$, so the learned rewards tend to assign positive weight to action $1$, whereas the optimal action under the learner's initial state distribution is action $0$. , and learn state-only reward functions and perform closer to the random policy.

NoisyObs: BC achieves near-optimal performance, demonstrating that supervised learning can be more robust and sample-efficient in the presence of noise than other LfH algorithms. We also see that performs poorly relative to and , which underscores the importance of the subtle differences between these implementations.

Parabola: most algorithms perform well, except for , and .

LargestSum: most algorithms fail to achieve expert performance, while does match it, suggesting that scaling algorithms like and to high-dimensional tasks may be a fruitful direction for future work. Methods using state-only reward functions, and , perform poorly since the reward for this task depends on the actions taken.

ProcGoal: achieves expert performance, while most other algorithms obtain a reasonable but lower score, with high variance across seeds. Although this task requires generalization, the discrete states and actions may make it easier for to generalize here than in or , where it performs poorly. One interesting result is that performs better than , whereas performs better than in other tasks.

Sort: most algorithms achieve reasonable (but sub-expert) performance. Intriguingly, achieves higher returns than most algorithms, with low variance. Learning a good policy in this task is challenging: even PPO did not reach expert performance in all seeds. fails to obtain any reward. We also find that performs better than , while performs better than . A state-only reward may be easier to learn because it has fewer parameters and the ground-truth reward is indeed state-only, but state-action rewards can also incentivize the right policy by assigning higher reward to the correct action, making planning easier.

Algorithms {#sec:appendix:algo-results}

PPO: serves as the RL baseline. We would expect most reward and imitation learning algorithms to obtain lower returns, since they must learn a policy without knowing the reward. Most seeds achieve close to expert performance.

BC: exhibits bimodal performance, either attaining near-expert return ($1.0$, normalized) in an environment or performing close to random ($0.0$). Returns are similar across seeds. Behavioral cloning attains relatively low returns in and , which have continuous observation spaces requiring generalization and sequential decision making.


Tabular IRL: executed only on tasks with discrete state and action spaces and a fixed horizon. As expected, obtains a low return in . For InitShift, the expert demonstrations provide no information for choosing among the subset of states accessible during learning, so the algorithm obtains a score that depends on the randomly initialized reward.

[^1]: Work partially conducted during an internship at UC Berkeley.