Preventing Imitation Learning with Adversarial Policy Ensembles

http://arxiv.org/abs/2002.01059v2

Abstract: Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy. Policies, whether human or deployed on robots, can be cloned without consent from their owners. How can we protect against external observers cloning our proprietary policies? To answer this question we introduce a new reinforcement learning framework, where we train an ensemble of near-optimal policies whose demonstrations are guaranteed to be useless for an external observer. We formulate this idea as a constrained optimization problem, where the objective is to improve proprietary policies while simultaneously deteriorating the virtual policy of an eventual external observer. We design a tractable algorithm to solve this new optimization problem by modifying the standard policy gradient algorithm. Our formulation can be interpreted through the lenses of confidentiality and adversarial behaviour, which enables a broader perspective of this work. We demonstrate the existence of "non-clonable" ensembles, providing a solution to the above optimization problem, which is computed by our modified policy gradient algorithm. To our knowledge, this is the first work regarding the protection of policies in Reinforcement Learning.


Introduction

Imitation learning and behavioral cloning provide a powerful way to reproduce policies, as seen in robotic tasks ([@DART; @oneshotIL; @BCLimitations; @end2endCondIL; @alvinn; @end2endSelfDriving]). Other fields in machine learning have developed methods to ensure privacy (@privateML [@PATE]); however, none have examined protection against policy cloning. In this work, we tackle the issue of protecting policies by training policies that aim to prevent an external observer from using behaviour cloning. Our approach draws inspiration from imitating human experts, who can near-optimally accomplish given tasks. The setting which we analyze is presented in Figure 1{reference-type="ref" reference="fig:scheme"}. We wish to find a collection of experts which, as an ensemble, can perform a given task well, yet is adversarial to behaviour cloning. Another interpretation is that this collection of experts represents the worst-case scenario for behaviour cloning of how to perform a task "good enough".

[Figure 1 (fig:scheme): The setting we analyze, in which an external observer collects demonstrations of the policy ensemble and attempts to behaviour clone it.]

Imitation learning frameworks generally make certain assumptions about the optimality of the demonstrations ([@maxentIRL; @controlasoptimalinference]), yet they do not consider the scenario in which the experts deliberately act adversarially toward the imitator. We pose a novel question regarding this assumption: does there exist a set of experts that is adversarial to an external observer trying to behaviour clone?

We propose Adversarial Policy Ensembles (APE), a method that simultaneously optimizes the performance of the ensemble and minimizes the performance of policies eventually obtained by cloning it. Our experiments show that APE suffers little performance loss relative to an optimal policy, while causing the cloned policy to experience, on average, over $5$ times degradation compared to the optimal policy.

Our main contributions can be summarized as follows:

To our knowledge, not only is this the first work regarding the protection of policies in reinforcement learning, but it is also the first to model adversarial experts.

Preliminaries

We develop APE in the standard framework of Reinforcement Learning (RL). The main components we use are Markov Decision Processes, Policy Gradient (@PG), policy ensembles, and behaviour cloning, which we review below.

Markov Decision Process

A discrete-time finite-horizon discounted Markov decision process (MDP) $\mathcal{M}$ is defined by $(\mathcal{S}, \mathcal{A}, r, p, p_0, \gamma, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, $p(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $p_0 : \mathcal{S} \rightarrow \mathbb{R}^{+}$ is the initial state distribution, $\gamma \in (0, 1)$ is the discount factor, and $T$ is the time horizon. A trajectory $\tau \sim \rho_\pi$, sampled from $p$ and a policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}^+$, is the tuple of states and actions $(s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T)$, whose distribution is characterized by $\rho_\pi$. The return of a trajectory, $r(\tau) = \sum_{t=0}^{T-1} \gamma ^{t} r(s_{t}, a_{t})$, is the sum of discounted rewards seen along the trajectory, and the value function $V^\pi : \mathcal{S} \rightarrow \mathbb{R}$ is the expected return of a trajectory starting from state $s$, under the policy $\pi$. The goal of reinforcement learning is to find a policy that maximizes the expected return $\mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_\pi} [r(\tau)]$.
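To make the definition of the return concrete, here is a minimal Python sketch (the `rewards` list and the `gamma` default are illustrative, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """r(tau) = sum_{t=0}^{T-1} gamma^t * r(s_t, a_t) for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. a 3-step trajectory with reward -1 per step:
# discounted_return([-1, -1, -1])  ->  -2.9701
```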

Policy Gradient

Policy Gradient (PG) methods (@PG) aim to directly learn the optimal policy $\pi$, parameterized by $\theta$, by repeatedly estimating the gradient of the expected return, in one of the many forms shown in @gae. In our work, we follow notation similar to that of @gae [@PPO] and estimate $\nabla_\theta \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_\pi}[r(\tau)]$ using the advantage, which is estimated from a trajectory $\tau$ as $A^\pi_\tau (t) = R_\tau (t) - V^\pi (s_t)$, where $R_\tau (t) = \sum_{t'=t}^{T-1} \gamma ^{t'} r(s_{t'}, a_{t'})$ is the discounted sum of rewards following action $a_t$.

Here, the value function is learned simultaneously with the policy, and so the advantage uses $\hat{V}^\pi$ as an estimate for $V^\pi$.
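As a small sketch of this advantage estimate, using the standard reward-to-go form $\sum_{t' \ge t} \gamma^{t'-t} r(s_{t'}, a_{t'})$ and assuming per-step `rewards` and value estimates `values` of equal length (both illustrative inputs):

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.99):
    """A_tau(t) = R_tau(t) - V_hat(s_t), with R_tau(t) the discounted reward-to-go."""
    T = len(rewards)
    reward_to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        reward_to_go[t] = running
    return reward_to_go - np.asarray(values)
```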

Policy Ensemble (PE)

We denote a PE by $\pi_{\textbf{c}}$, where each $\pi_{c^{(i)}}, i \in \{ 1, 2, \ldots, n \}$ represents an expert. To roll out the PE, an expert is chosen at random (in our case uniformly), and that expert completes a trajectory. Each expert policy $\pi_{c^{(i)}}(a \mid s)$ can be viewed as a policy conditioned on a latent variable $c$, $\pi(a \mid s, c)$.

Although $\pi_{\textbf{c}}$ consists of multiple policies, it is important to note that it itself is still a policy.
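A minimal sketch of rolling out a PE, assuming a generic `env` exposing `reset()` and `step(a) -> (next_state, reward, done)`, with `experts` given as callables mapping a state to an action (both are illustrative stand-ins, not the paper's code):

```python
import numpy as np

def rollout_policy_ensemble(env, experts, rng=np.random.default_rng()):
    """Roll out a PE: pick one expert uniformly at random and keep it fixed
    for the entire trajectory."""
    i = rng.integers(len(experts))       # latent context c^(i), fixed per trajectory
    s, done, trajectory = env.reset(), False, []
    while not done:
        a = experts[i](s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    return i, trajectory
```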

Behaviour Cloning

To behaviour clone an expert policy (@ILOriginal), a dataset $\mathcal{D}$ of state-action pairs $(s, a)$ is collected from the expert rollouts. Then, a policy parameterized by $\phi$ is trained by maximizing the likelihood of an action given a state, $\sum_{(s, a) \in \mathcal{D}} \log \pi_\phi (a \mid s)$.
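As a hypothetical illustration of this maximum-likelihood objective in the tabular case, using plain gradient ascent on softmax logits (only the learning rate and epoch count echo the appendix; the rest is our own sketch):

```python
import numpy as np

def behaviour_clone(dataset, n_states, n_actions, lr=0.01, epochs=100):
    """Tabular behaviour cloning: gradient ascent on the mean log-likelihood
    sum_{(s,a)} log pi_phi(a|s), with pi_phi(.|s) = softmax(logits[s])."""
    logits = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        grad = np.zeros_like(logits)
        for s, a in dataset:
            probs = np.exp(logits[s] - logits[s].max())
            probs /= probs.sum()
            grad[s] -= probs          # d/d logits[s] of log-softmax, part 1
            grad[s, a] += 1.0         # part 2: one-hot of the demonstrated action
        logits += lr * grad / len(dataset)
    return logits
```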

When cloning $\pi_{\textbf{c}}$, $\mathcal{D}$ will not contain information of the latent variable $c$, and so the cloned policy will marginalize it out. Thus, the observer will clone:

$$\begin{aligned} \label{eqn:observe} \pi_o({a} \mid {s}) \vcentcolon= \sum_i p(c^{(i)}\mid s ) \pi_{c^{(i)}}( a \mid s)\end{aligned}$$

We stress that this policy does not exist until $\pi_{\textbf{c}}$ is behaviour cloned. $\pi_o$ is a fictitious policy representing the observer's best case: having access to infinite data from $\pi_{\textbf{c}}$ to clone into $\pi_o$.
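In a tabular setting, the marginalization in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"} can be written directly; a minimal sketch with array shapes chosen for illustration:

```python
import numpy as np

def observer_policy(pi_c, p_c_given_s):
    """pi_o(a|s) = sum_i p(c^(i)|s) * pi_{c^(i)}(a|s).
    pi_c: (n_experts, n_states, n_actions); p_c_given_s: (n_states, n_experts)."""
    return np.einsum('si,isa->sa', p_c_given_s, pi_c)
```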

The scope of this paper is to specifically prevent behavioral cloning from succeeding. Other imitation learning approaches such as inverse reinforcement learning (@IRL [@algoIRL; @nonlinear]) and adversarial imitation learning (@GAIL [@VDB]) require rollouts of non-expert policies in the environment, which may be costly, and thus are not considered.

Related Work {#sec:related}

Adversarial Attacks in RL: Our notion of adversarial policies is closely related to other adversarial methods that target RL, such as @DRLAttack and @vulnerablePolicies, which add adversarial perturbations to the policy input during training. Other adversarial attacks include poisoning the batch of data used when training RL (@poison), and exploitation in the multi-agent setting (@gleaveAttackRL). However, these methods are all active attacks on various learning techniques. Our method, instead, passively protects against cloning.

Privacy in RL: With regards to protection, our work is related to differential privacy (@privateML). Differential privacy in RL can be used to create private Q-functions (@privateQ) or private policies (@privatePolicy), which have private reward functions or private policy evaluation. However, we emphasize that our motivation is to prevent cloning, and thus to protect the policies themselves, rather than to protect against distinguishing between reward functions or policies.

Imitation Learning: We comply with the standard imitation learning setting of cloning from a dataset in which many experts provide the demonstrations, and latent variables in imitation learning are well studied. For example, @end2endCondIL show that conditioning on a context representation can make imitation learning a viable option for autonomous driving. @infogail demonstrate that the latent contextual information in expert trajectories is often semantically meaningful. Providing extra context variables to condition on also appears in the forms of extra queries or provided labels ([@RiskAwareIRL; @causal_imitation_learning; @latentFromDemo]). Our method is different, as we use context variables to prevent imitation learning while learning the policies from scratch, rather than using context variables to improve imitation learning.

Multiple Policies: VALOR, DIAYN, and DADS (@valor [@diayn; @DADS]) have similar schemes of sampling a latent variable and fixing it throughout a trajectory, although their latent variables (contexts or skills) are used to solve semantically different tasks, because their objective is to use the context variables/skills for learning in an unsupervised setting. Our approach differs in both motivation and implementation, as we learn experts that all solve the same task, constrained so that observers cannot clone the policy.

A PE $\pi_{\textbf{c}}$ can also be viewed as a Mixture of Experts (@mixexp), except the gating network assigns probability $1$ to the same expert for an entire trajectory. As such, we do not learn the gating network, although it may still be useful to see $\pi_{\textbf{c}}$ as a special case of a mixture of experts where the gating network learns immediately to fix the expert for each trajectory. There are also methods such as OptionGAN (@optiongan), which uses a mixture of experts model to learn multiple policies as options with access to only expert states.

@novel also proposes a method to train multiple policies that complete the same task, but uses the uncertainty of an autoencoder as a reward augmentation. Their motivation is to find multiple novel policies, while our motivation has no connection to novelty. Due to these differences in motivation, they train each policy one after the other, while our policies are trained simultaneously.

Policy ensembles are also used in the multi-task and goal conditioned settings in which case the task that is meant to be solved can be viewed as the context. Marginalizing out the context variable (Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}) of these context-conditioned policies is studied in the case of introducing a KL divergence regularizing term for learning new tasks (@InfoBot) and for sharing/hiding goals (@LSH). However, the main motivation is different in that both @InfoBot and @LSH use $\pi_o$ to optimize mutual information, while we directly optimize its performance.

Method

Objective

We wish to have experts that can perform the task, while minimizing the possible returns of the cloned policy, denoted in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}. We modify the standard RL objective to be: $$\begin{gathered} \label{eqn:constrained} \mathop{\mathrm{arg\,min}}_\theta \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_o}}[r(\tau)] \\ \text{s.t.~~} \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_{\textbf{c}}}}[r(\tau)] \geq \alpha \end{gathered}$$

where $\alpha$ is a parameter that lower bounds the reward of the policy ensemble. This translates to maximizing the unconstrained Lagrangian: $$\begin{gathered} \label{eqn:unconstrained} J(\theta) = \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_{\textbf{c}}}} [r(\tau)] - \beta \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_o}} [r(\tau)]\end{gathered}$$

where $1/\beta$ is the corresponding Lagrange multiplier, which is subsumed into the returns collected by the policy ensemble. We refer to a PE that optimizes this objective as an Adversarial Policy Ensemble (APE). There is a natural interpretation of the objective in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"}: human experts tend to be "good enough", which is reflected in the constraint, and the minimization simply finds the most adversarial experts.

Although we assume that the observer can only map states to actions, it may be the case that they can train a sequential policy, which depends on previous states and actions. Our method generalizes to sequential policies as well, and the impact of such observers is discussed in Section 6{reference-type="ref" reference="sec:dicussion"}.

Modified Policy Gradient Algorithm

Intuitively, since the returns of two policies are being optimized, both policies should be sampled from in order to estimate their returns.

We show how we can modify PG to train APE by maximizing Equation [eqn:unconstrained]{reference-type="ref" reference="eqn:unconstrained"}. The two terms suggest a simple scheme to estimate the returns of the policy ensemble twice: once using $\pi_{\textbf{c}}$, which we wish to maximize, and a second time using $\pi_o$, which approximates the returns of an eventual observer who tries to clone the policy ensemble. Along with our PE, we train value functions $\Tilde{V}^{\pi_{c^{(i)}}}$ for each expert, jointly parameterized by $\phi$, which estimate $V^{\pi_{c^{(i)}}} - \beta V^{\pi_o}$. The loss function for the value functions of two sampled trajectories $\tau_1, \tau_2$ is

$$\begin{gathered} \label{sec: returns} J_{\tau_1, \tau_2} (\phi) = \sum_{t=0}^{T_1-1} \frac{1}{2} \left(\Tilde{V}_\phi^{\pi_{c^{(i)}}} (s_{t_1}) - R_{\tau_1}(t) \right)^2 + \sum_{t=0}^{T_2-1} \frac{1}{2} \left(\Tilde{V}_\phi^{\pi_{c^{(i)}}} (s_{t_2}) + \beta R_{\tau_2}(t) \right)^2 \end{gathered}$$

The policy gradient update from $N_1$ and $N_2$ trajectories is then $$\begin{gathered} \label{sec: pgupdate} \nabla_\theta J_{\tau_1, \tau_2} (\theta) \approx G_1 + G_2\end{gathered}$$ where $$\begin{gathered} \label{sec: g1} G_1 = \frac{1}{N_1} \sum_{j=1}^{N_1} \sum_{t=0}^{T_1} \nabla_\theta \log \pi_{c^{(i)}}(a_{t_1}^{(j)} \mid s_{t_1}^{(j)}) \, \Tilde{A}^{\pi_{c^{(i)}}}_{\tau_1} (t) \end{gathered}$$ $$\begin{gathered} \label{sec: g2} G_2 = \frac{1}{N_2} \sum_{j=1}^{N_2} \sum_{t=0}^{T_2} \nabla_\theta \log \pi_o(a_{t_2}^{(j)} \mid s_{t_2}^{(j)}) \, \Tilde{A}^{\pi_o}_{\tau_2} (t) \end{gathered}$$

where $c^{(i)}$ identifies the chosen expert of the trajectory, and $\Tilde{A}^{\pi_{c^{(i)}}}_{\tau_1} (t) = R_{\tau_1} (t) - \Tilde{V}^{\pi_{c^{(i)}}} (s_t)$ and $\Tilde{A}^{\pi_o}_{\tau_2} (t) = -\beta R_{\tau_2} (t) - \Tilde{V}^{\pi_o} (s_t)$ are the modified advantage functions. The $-\beta$ in the advantage of $G_2$ optimizes against the performance of the observed policy $\pi_o$.

The gradient $G_1$ for $\pi_{\textbf{c}}$ is straightforward. However, to estimate the gradient $G_2$ for $\pi_o$, which is a fictitious policy, we sample from it by first re-sampling the context of the expert at each state, and then sampling an action from that expert. Back-propagation then flows through $\pi_{c^{(i)}}(a \mid s)$ for the context sampled at each state. Practical implementation details can be found in [9.2](#sec: appendEst){reference-type="ref" reference="sec: appendEst"}. The intuition is as follows: while sampling $\pi_o$, if a selected action leads to high return, we decrease its probability, which lowers the expected reward of $\pi_o$. Combined, the two gradients cause the PE to select actions that both achieve high reward and are detrimental to the observer.
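A short sketch of this sampling procedure, under the same illustrative tabular representation used earlier (array names and shapes are our own):

```python
import numpy as np

def sample_observer_action(state, pi_c, p_c_given_s, rng=np.random.default_rng()):
    """Sample from the fictitious pi_o: re-sample a context at the current state,
    then sample that expert's action. The returned context index tells us which
    expert's log pi_{c^(i)}(a|s) the G_2 gradient should flow through."""
    i = rng.choice(len(pi_c), p=p_c_given_s[state])
    a = rng.choice(pi_c.shape[-1], p=pi_c[i, state])
    return i, a
```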

Equations [[sec: returns]](#sec: returns){reference-type="ref" reference="sec: returns"} and [[sec: pgupdate]](#sec: pgupdate){reference-type="ref" reference="sec: pgupdate"} formulate our PG approach of APE, which is summarized in Algorithm [alg:CAPE]{reference-type="ref" reference="alg:CAPE"}.

[[alg:CAPE]]{#alg:CAPE label="alg:CAPE"}

for each iteration do:
Generate trajectories ${\tau}_1$ with $\pi_{\textbf{c}}$ from $\mathcal{M}$ for Equation [[sec: g1]](#sec: g1){reference-type="ref" reference="sec: g1"}.
Generate trajectories ${\tau}_2$ with $\pi_o$ from $\mathcal{M}$ for Equation [[sec: g2]](#sec: g2){reference-type="ref" reference="sec: g2"}.
Calculate Equation [[sec: pgupdate]](#sec: pgupdate){reference-type="ref" reference="sec: pgupdate"} to perform a gradient update on the PE: $\theta \leftarrow \theta + \alpha_\theta \hat{\nabla}_\theta J_{\tau_1, \tau_2}(\theta)$.
Update the value function $\phi \leftarrow \phi - \alpha_\phi \hat{\nabla}_\phi J_{\tau_1, \tau_2} (\phi)$ as determined by Equation [[sec: returns]](#sec: returns){reference-type="ref" reference="sec: returns"}.

end for
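For concreteness, below is a simplified, runnable tabular sketch of one PG-APE iteration. It is our own illustration rather than the authors' implementation: it uses a single trajectory per term, plain gradient steps instead of Adam, assumes an environment exposing `reset()` and `step(a) -> (next_state, reward, done)` (e.g. the gridworld of the experiments), and uses `p_c_given_s` as a stand-in for the replay-buffer estimate of $p(c \mid s)$ described in the appendix.

```python
import numpy as np

def pg_ape_update(env, logits, values, p_c_given_s, beta=1.0,
                  lr_theta=0.05, lr_phi=0.5, gamma=0.99,
                  rng=np.random.default_rng()):
    """One simplified PG-APE iteration with a single trajectory per term.
    logits: (n_experts, n_states, n_actions) softmax parameters of the PE (theta).
    values: (n_experts, n_states) value estimates (phi).
    p_c_given_s: (n_states, n_experts) estimate of p(c|s)."""
    n_experts, n_states, n_actions = logits.shape

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def reward_to_go(rs):
        out, running = np.zeros(len(rs)), 0.0
        for t in reversed(range(len(rs))):
            running = rs[t] + gamma * running
            out[t] = running
        return out

    grad_theta = np.zeros_like(logits)
    grad_phi = np.zeros_like(values)

    # tau_1: roll out pi_c (one expert, fixed for the whole trajectory)
    i = rng.integers(n_experts)
    s, done, traj = env.reset(), False, []
    while not done:
        a = rng.choice(n_actions, p=softmax(logits[i, s]))
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    R = reward_to_go([r for _, _, r in traj])
    for t, (s, a, _) in enumerate(traj):
        adv = R[t] - values[i, s]                 # modified advantage for G_1
        grad_theta[i, s] += adv * (np.eye(n_actions)[a] - softmax(logits[i, s]))
        grad_phi[i, s] += values[i, s] - R[t]     # d/dV of 0.5 (V - R)^2

    # tau_2: roll out the fictitious pi_o (context re-sampled at every state)
    s, done, traj = env.reset(), False, []
    while not done:
        j = rng.choice(n_experts, p=p_c_given_s[s])
        a = rng.choice(n_actions, p=softmax(logits[j, s]))
        s_next, r, done = env.step(a)
        traj.append((j, s, a, r))
        s = s_next
    R = reward_to_go([r for _, _, _, r in traj])
    for t, (j, s, a, _) in enumerate(traj):
        adv = -beta * R[t] - values[j, s]         # modified advantage for G_2
        grad_theta[j, s] += adv * (np.eye(n_actions)[a] - softmax(logits[j, s]))
        grad_phi[j, s] += values[j, s] + beta * R[t]   # d/dV of 0.5 (V + beta R)^2

    logits += lr_theta * grad_theta   # ascent on J(theta)
    values -= lr_phi * grad_phi       # descent on the value loss J(phi)
    return logits, values
```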

Experiments {#sec:result}

We perform experiments on a navigation task, where the objective is to reach a goal state as fast as possible. The purpose is to illustrate that an APE can cause the cloned policy to take significantly longer to reach the goal state. We do so by first training a PE and behaviour cloning it, and then comparing the performance of the PE to that of the clone. We use a discrete environment to best demonstrate the validity of our formulation. This is because all discrete policies can be parameterized exactly, which is not true in continuous action spaces, where a Gaussian parameterization is typically used. Continuous environments would therefore require assumptions about how both the PE and the cloner parameterize their policies, as well as tackling problems of distributional drift, which we would like to avoid. With such assumptions, however, our setting can extend to the continuous domain. In our experiments, we use a $10 \times 10$ grid-world environment as our main testbed. It is large enough to exhibit behaviour that would not appear in smaller grids, while still small enough to visualize the behaviour of the APE. The discrete actions show precisely how the experts can be jointly adversarial.

Using gridworld allows for precise expected return estimates. In an environment where there is no computable analytical solution for the returns, approximation error can accumulate through estimating the returns of both the trained PE and the clone. This noise would only increase in continuous state space, where the returns of $\pi_o$ may not be tractable to estimate due to issues such as distributional drift (@dagger [@BCLimitations; @causal_imitation_learning]).

Our results answer the following questions. How much optimality is compromised? How useless can we make the cloned policy? Is it possible to prevent behaviour cloning with a non-adversarial PE?

Training {#sec:navigation}

[Figure 2 (fig:contexted): A policy ensemble trained with PG-APE on the basic gridworld; the colour scale represents the expected return of starting at a given state.]

Even though our method can compute a policy ensemble with any finite number of experts, we chose to visualize a solution with 2 experts, which is sufficient to reveal the essential properties of the method. Specifically, we train $n=2$ tabular experts with PG-APE. Our code is written in Tensorflow (@Tensorflow). Training details and hyper-parameters are in Section 9.1{reference-type="ref" reference="sec:hyperparams"} of the Appendix.

Environment

The basic environment is a $10 \times 10$ grid, with the goal state at the top-left corner. The agent spawns in a random non-goal state, and incurs a reward of $-1$ for each time-step until it reaches the goal. At the goal state, the agent no longer incurs a loss and the episode terminates. The agent is allowed five actions, $\mathcal{A} = \{$ Up, Down, Left, Right, Stay $\}$. Moving into a wall is equivalent to executing a Stay action. We choose this reward function because it gives a clear representation of the notion of "good enough", reflected in how long it takes to reach the goal state. Such a representation exemplifies how the APE can prevent an observer from cloning a good policy.
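A minimal sketch of this environment follows; the `reset()`/`step()` interface, the action ordering, the reward convention at the terminal step, and the use of the $T=100$ clipping from the appendix as an episode horizon are our assumptions.

```python
import numpy as np

class GridWorld:
    """10x10 navigation task: goal at the top-left corner (state index 0),
    reward of -1 per time-step, moving into a wall acts as Stay, episodes
    clipped to a horizon of 100 steps."""
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}  # Up, Down, Left, Right, Stay

    def __init__(self, size=10, horizon=100, rng=np.random.default_rng()):
        self.size, self.horizon, self.rng = size, horizon, rng

    def reset(self):
        self.t = 0
        self.state = int(self.rng.integers(1, self.size * self.size))  # random non-goal state
        return self.state

    def step(self, action):
        row, col = divmod(self.state, self.size)
        dr, dc = self.ACTIONS[action]
        row = min(max(row + dr, 0), self.size - 1)   # moving into a wall == Stay
        col = min(max(col + dc, 0), self.size - 1)
        self.state = row * self.size + col
        self.t += 1
        done = (self.state == 0) or (self.t >= self.horizon)
        return self.state, -1.0, done
```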

Visualization

Figure 2{reference-type="ref" reference="fig:contexted"} shows an example of a PE that is trained for the basic gridworld environment. Figure [fig:cloned]{reference-type="ref" reference="fig:cloned"} shows the corresponding cloned policy, as well as a comparison to an optimal policy. The colour scale represents the expected return of starting at a given state.

In the case of an optimal policy ($\beta=0$), actions take the agent to the goal state as fast as possible. However, when $\beta > 0$, such a solution is no longer the optimum. As with $\beta=0$, the experts would like to maximize the expected reward and reach the goal state. However, to minimize the reward of the observed policy, the two expert policies must jointly learn to increase the number of steps needed for $\pi_o$ to reach the goal state. The expert policies must use adversarial behaviour while reaching the goal state, such as taking intelligent detours or staying in the same state, which is learned to hinder $\pi_o$ as much as possible. These learnt behaviours cause the cloned policy to take drastically longer to reach the goal. For example, note the two purple squares at the top-left near the goal, which indicate that the experts learn that they should not move, in order to prevent the observer from attaining reward. Even though these sub-optimal decisions are made, in expectation the experts are "not bad" and achieve an average reward of $-15.27$.

Baselines


[Figure (fig:optimal, fig:cloned): An optimal policy and the policy obtained by behaviour cloning the PG-APE ensemble.]

We use behaviour cloning to clone our PG-APE trained policies. To support our claim of preventing imitation learning even in the limit of infinite data, we collect one million timesteps of the trained PE in the environment. Further details of behaviour cloning are in the appendix. Shown in Figure [fig:cloned]{reference-type="ref" reference="fig:cloned"} are an optimal policy and the resulting cloned policy from Section 5.1{reference-type="ref" reference="sec:navigation"}.

We evaluate against other PE to show that preventing behaviour cloning is non-trivial. We use several baselines. We first test policies that have approximately the same return as our ensemble, by training PE with vanilla PG and halting early rather than running until convergence. In the Near-Optimal case, we run until the PE has expected returns that match the average achieved by our method. Conversely, "Random" policies are used as a comparison to show that it is possible to make the cloned policy perform poorly, but at the cost of the PE itself performing poorly, which is undesirable. These policies are also trained with PG, except they are stopped much earlier, when their clones match the expected returns of our PG-APE's clone. For each PG-APE, we use $n=2$ different tabular policies treated as an ensemble, which we then clone, and average across $5$ seeds. For the baselines, we hand-pick the policies, and thus only use $3$ different policies.

| Policy Ensemble | PE Return | Cloned Return | Difference |
|---|---|---|---|
| PG-APE | $-16.24 \pm 1.20$ | $-44.27 \pm 1.07$ | $-28.03$ |
| Near-Optimal PE | $-16.74 \pm 1.32$ | $-16.67 \pm 1.31$ | $+0.07$ |
| Random Policy | $-44.59 \pm 0.52$ | $-44.52 \pm 0.77$ | $+0.07$ |

[[table:PEComp]]{#table:PEComp label="table:PEComp"} Expected returns of each policy ensemble and of its behaviour-cloned policy; the difference is the cloned return minus the PE return.

As presented in Table [table:PEComp]{reference-type="ref" reference="table:PEComp"}, all other PE have an insignificant difference (returns of the cloned policy minus returns of the PE) between the performance of the PE and the cloned policy, except for our method. These empirical findings show that preventing behaviour cloning is difficult, but possible using APE.

Discussion & Future Work {#sec:dicussion}

Confidential Policies: There are promising research directions regarding the protection of policies, due to the many applications where confidentiality is crucial. As long as there is a model of the observer, our presented method provides a worst-case scenario of experts.

In our work, we focused on the case where the observer does not use the current trajectory to determine their policy. Instead, it may be the case that the observer uses a sequential policy (one that depends on its previous states and/or actions), such as an RNN to determine the context of the current expert.

Formally, the observer will no longer learn the policy formulated in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}, which depends solely on the current state, but rather a policy that depends on the current trajectory: $$\begin{aligned} \label{eqn:observeRNN} \pi_o({a} \mid \tau_{1:t}) \vcentcolon= \sum_i p(c^{(i)}\mid \tau_{1:t} ) \pi_{c^{(i)}}( a \mid s)\end{aligned}$$ In our preliminary results, using an RNN classifier which outputs $p(c \mid \tau_{1:t})$ simply ended up with either optimal policies or crippled policies. In both cases, there was a relatively minor difference in performance between the policy ensemble and the cloned policy.

Unsurprisingly, when the observer has access to a strong enough representation for their policy, then they should be able to imitate any policy. In this case, the worst-case set of experts cannot do much to prevent the cloning. We believe that this is an exciting conclusion, and is grounds for future work.

Continuous: Although our methods are evaluated in discrete state spaces, our approach can be generalized to continuous domains.

The Monte Carlo sampling in Equation [eqn:exp_MC]{reference-type="ref" reference="eqn:exp_MC"} suggests that the use of continuous context may also be possible, given there is a strong enough function approximator to estimate the distribution of $c | s$. We see this as an exciting direction for future work, to recover the full spectrum of possible adversarial policies under the constraint of Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"}.

The Semantics of Reward: Although the minimization in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"} equates the success of behaviour cloning with the reward the cloned policy can achieve, this need not be the case. Uselessness may instead be defined by the expected reward the cloned policy achieves under a different reward function $\Tilde{r}$. For example, a robot that is unpredictable should not be deployed with humans. Since the $r$ functions in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"} are disentangled, the reward function $r$ that is minimized in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"} can be engineered to fit any definition of uselessness.

We can modify the objective of APE by modifying Equations [[sec: returns]](#sec: returns){reference-type="ref" reference="sec: returns"} and [[sec: pgupdate]](#sec: pgupdate){reference-type="ref" reference="sec: pgupdate"} to use a different reward function $\Tilde{r}$ in the minimization, substituting $R(t)$ for $\Tilde{R}(t) = \sum_{t'=t}^{T-1} \gamma^{t'-t} \Tilde{r} (s_{t'}, a_{t'})$. The rest of the derivation and algorithm remain the same.

We think this is an exciting direction, especially for learning all different possible representations of the worst-case experts.

Conclusion {#sec:conclusion}

We present APE along with its mathematical formulation, and show that policy gradient, a basic RL algorithm, can be used to optimize a policy ensemble that cannot be cloned. We evaluated APE against baselines to show that such adversarial behaviour does not arise without our method.

This work identifies a novel yet crucial area in Reinforcement Learning, regarding the confidentiality of proprietary policies. The essence of our approach is that a policy ensemble can achieve high return for the policy owner, while providing an external observer with a guaranteed low reward, making the proprietary ensemble useless to the observer.

The formulation of our problem setup and the algorithm are very general. In this first work we demonstrate the solution in deliberately simple environments in order to better visualize the essence of our method. Our concurrent work studies thoroughly the application of our method to various domains, which is out of the scope of this introductory paper.

Acknowledgements

This work was supported in part by NSF under grant NRI-#1734633 and by Berkeley Deep Drive.

Appendix

Training Details & Hyperparameters {#sec:hyperparams}

For our training, we set $\alpha_\theta = 0.05$ and the value weight to $0.5$, use entropy regularization (@asynch) annealed from $5 \text{e}{-1}$ to $5 \text{e}{-3}$, and set the discount factor $\gamma=0.99$. Due to the contrasting gradients experienced, large batch sizes are used. In our experiments, we take 1 gradient update of Adam (@AdaM) per batch of 4096 (containing multiple trajectories), and train for $3 \text{e}6$ timesteps.

To estimate $p(c|s)$ in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}, we use a replay buffer that keeps track of the previous $60$ contexts seen at each state.
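A minimal sketch of such a buffer (class and method names are ours, not from the paper's code):

```python
from collections import defaultdict, deque

class ContextBuffer:
    """Keep the last `capacity` contexts observed at each state and estimate
    p(c|s) by their empirical frequencies."""
    def __init__(self, n_contexts, capacity=60):
        self.n_contexts = n_contexts
        self.buffers = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, state, context):
        self.buffers[state].append(context)

    def estimate(self, state):
        buf = self.buffers[state]
        if not buf:  # unvisited state: fall back to a uniform estimate
            return [1.0 / self.n_contexts] * self.n_contexts
        return [sum(c == i for c in buf) / len(buf) for i in range(self.n_contexts)]
```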

Estimating the quantity in Equation [eqn:observeRNN]{reference-type="ref" reference="eqn:observeRNN"} requires memory, for which we use a single GRU (@GRU) as done in @LSH, with the exception that only states are fed in as one-hot vectors. Because our environment is deterministic, state sequences capture the action sequence information. The single unit's output is then concatenated with the state, fed into a fully connected layer of 128 units, and then a soft-max, to produce the distribution $c \mid s$ over contexts.

For our behaviour cloning, we collect $1 \text{e}6$ state-action pairs, and train a tabular policy with a $0.01$ learning rate on the softmax cross-entropy loss for $100$ epochs. The large amount of data and number of epochs ensures that we can recover $\pi_o$ with little to no variance.

To solve for the precise returns of the policies, we inject noise of $1 \text{e}{-9}$ to ensure that a hitting time always exists from each state. As well, we clip all hitting times to $T = 100$.

Estimating $\nabla_\theta \log \pi_o$ {#sec: appendEst}

It is not obvious how $\nabla_\theta \log \pi_o$ should be estimated, since $\pi_o$ is never realized until the policy ensemble is cloned; it is, in effect, a virtual policy.

Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"} offers a straightforward method to back-propagate, similar to that of the Mixture of Experts model (@mixexp), except using an estimate of $c | s$ instead of a gating network.

However, we can also rewrite Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"} as $\sum_i p(c^{(i)}\mid s) \pi_{c^{(i)}}( a \mid s) = \mathop{\mathrm{\mathbb{E}}}_{c \sim p(c \mid s)} [ \pi_{c^{(i)}}( a \mid s)]$, which results in the gradient update being:

$$\begin{aligned} \label{eqn:exp_MC} \nabla_\theta \log \pi_o(a \mid s) = \nabla_\theta \log \mathop{\mathrm{\mathbb{E}}}_{c \sim p(c \mid s)} [ \pi_{c^{(i)}}( a \mid s)]\end{aligned}$$

which suggests estimating the inner expectation by Monte Carlo sampling with a single sampled context. Empirically, we use this Monte Carlo sampling method.
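As a sketch of this single-sample estimator in the illustrative tabular representation used earlier (array names are ours): in an automatic-differentiation framework, the gradient of the returned scalar with respect to the logits would give the Monte Carlo estimate of $\nabla_\theta \log \pi_o(a \mid s)$.

```python
import numpy as np

def log_pi_o_single_sample(state, action, logits, p_c_given_s,
                           rng=np.random.default_rng()):
    """Single-sample Monte Carlo surrogate for log pi_o(a|s): sample one context
    c ~ p(c|s) and evaluate log pi_c(a|s) from that expert's softmax logits."""
    c = rng.choice(len(logits), p=p_c_given_s[state])
    row = logits[c, state]
    log_probs = row - row.max() - np.log(np.sum(np.exp(row - row.max())))
    return log_probs[action]
```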