-
source
arxiv
-
source_type
latex
-
converted_with
pandoc
-
paper_version
1807.00196v1
-
title
Modeling Friends and Foes
-
authors
["Pedro A. Ortega","Shane Legg"]
-
date_published
2018-06-30 16:07:43+00:00
-
data_last_modified
2018-06-30 16:07:43+00:00
-
abstract
How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a friend, a foe, or anything in between, remains a poorly understood yet desirable ability for safe and robust agents. This paper proposes a definition of these environmental "attitudes" based on a characterization of the environment's ability to react to the agent's private strategy. We define an objective function for a one-shot game that allows deriving the environment's probability distribution under friendly and adversarial assumptions alongside the agent's optimal strategy. Furthermore, we present an algorithm to compute these equilibrium strategies, and show experimentally that both friendly and adversarial environments possess non-trivial optimal strategies.
-
author_comment
13 pages, 9 figures
-
journal_ref
null
-
doi
null
-
primary_category
cs.AI
-
categories
["cs.AI"]
-
citation_level
0
-
alignment_text
pos
-
confidence_score
1.0
-
main_tex_filename
./paper.tex
-
bibliography_bbl
\begin{thebibliography}{34}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
\providecommand{\doi}[1]{doi: #1}\else
\providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
\bibitem[Leike et~al.(2017)Leike, Martic, Krakovna, Ortega, Everitt, Orseau,
and Legg]{Leike2017}
J.~Leike, M.~Martic, V.~Krakovna, P.~A. Ortega, T.~Everitt, L.~Orseau, and
S.~Legg.
\newblock {AI} safety gridworlds.
\newblock \emph{arXiv:1711.09883}, 2017.
\bibitem[Amodei et~al.(2016)Amodei, Olah, Steinhardt, Christiano, Schulman, and
Man{\'e}]{Amodei2016}
D.~Amodei, C.~Olah, J.~Steinhardt, P.~Christiano, J.~Schulman, and D.~Man{\'e}.
\newblock {Concrete Problems in {AI} Safety}.
\newblock \emph{arXiv preprint arXiv:1606.06565}, 2016.
\bibitem[Russell(1997)]{Russell1997}
S.~Russell.
\newblock Rationality and intelligence.
\newblock \emph{Artificial Intelligence}, 94\penalty0 (1--2):\penalty0 57--77,
1997.
\bibitem[Ortega and Braun(2013)]{OrtegaBraun2013}
P.~A. Ortega and D.~A. Braun.
\newblock {Thermodynamics as a theory of decision-making with
information-processing costs}.
\newblock \emph{Proceedings of the Royal Society A}, 469, 2013.
\bibitem[Osborne and Rubinstein(1994)]{OsborneRubinstein1994}
M.~J. Osborne and A.~Rubinstein.
\newblock \emph{{A course in game theory}}.
\newblock MIT Press, first edition, 1994.
\bibitem[Von~Neumann and Morgenstern(1947)]{VonNeumann1947}
J.~Von~Neumann and O.~Morgenstern.
\newblock \emph{Theory of Games and Economic Behavior}.
\newblock Princeton University Press, 2nd rev. edition, 1947.
\bibitem[Sutton and Barto(1998)]{SuttonBarto1998}
R.~S. Sutton and A.~G. Barto.
\newblock \emph{Reinforcement learning: An introduction}, volume~1.
\newblock MIT press Cambridge, 1998.
\bibitem[Bubeck and Cesa-Bianchi(2012)]{BubeckCesaBianchi2012}
S.~Bubeck and N.~Cesa-Bianchi.
\newblock Regret analysis of stochastic and nonstochastic multi-armed bandit
problems.
\newblock \emph{Foundations and Trends in Machine Learning}, 5\penalty0
(1):\penalty0 1--122, 2012.
\bibitem[Auer et~al.(2002)Auer, Cesa-Bianchi, Freund, and
Schapire]{AuerEtAl2002}
P.~Auer, N.~Cesa-Bianchi, Y.~Freund, and R.~E. Schapire.
\newblock {The nonstochastic multiarmed bandit problem}.
\newblock \emph{SIAM journal on computing}, 32:\penalty0 48--77, 2002.
\bibitem[Tishby and Polani(2011)]{TishbyPolani2011}
N.~Tishby and D.~Polani.
\newblock {Information theory of decisions and actions}.
\newblock In \emph{{Perception-action cycle}}, pages 601--636. Springer New
York, 2011.
\bibitem[Banerjee et~al.(2005)Banerjee, Merugu, Dhillon, and
Ghosh]{Banerjee2005}
A.~Banerjee, S.~Merugu, I.~S. Dhillon, and J.~Ghosh.
\newblock Clustering with {B}regman divergences.
\newblock \emph{Journal of Machine Learning Research}, 6\penalty0
(Oct):\penalty0 1705--1749, 2005.
\bibitem[McKelvey and Palfrey(1995)]{McKelveyPalfrey1995}
R.~McKelvey and T.~Palfrey.
\newblock {Quantal Response Equilibria for Normal Form Games}.
\newblock \emph{Games and Economic Behavior}, 10:\penalty0 6--38, 1995.
\bibitem[Mescheder et~al.(2017)Mescheder, Nowozin, and Geiger]{Mescheder2017}
Lars Mescheder, Sebastian Nowozin, and Andreas Geiger.
\newblock The numerics of {GANs}.
\newblock In \emph{Advances in Neural Information Processing Systems}, pages
1823--1833, 2017.
\bibitem[Balduzzi et~al.(2018)Balduzzi, Racaniere, Martens, Foerster, Tuyls,
  and Graepel]{Balduzzi2018}
D.~Balduzzi, S.~Racaniere, J.~Martens, J.~Foerster, K.~Tuyls, and T.~Graepel.
\newblock {The Mechanics of n-Player Differentiable Games}.
\newblock In \emph{{Proceedings of the International Conference on Machine
Learning (ICML)}}, 2018.
\bibitem[Ortega et~al.(2015)Ortega, Kim, and Lee]{OrtegaKimLee2015}
P.~A. Ortega, K.-E. Kim, and D.~D. Lee.
\newblock {Reactive bandits with attitude}.
\newblock In \emph{{Artificial Intelligence and Statistics}}, 2015.
\bibitem[Tishby and Slonim(2001)]{TishbySlonim2001}
N.~Tishby and N.~Slonim.
\newblock {Data clustering by Markovian relaxation and the information
bottleneck method}.
\newblock In \emph{{Advances in neural information processing systems}}, pages
640--646, 2001.
\bibitem[Chechik et~al.(2005)Chechik, Globerson, Tishby, and
Weiss]{Chechik2005}
G.~Chechik, A.~Globerson, N.~Tishby, and Y.~Weiss.
\newblock {Information bottleneck for Gaussian variables}.
\newblock \emph{Journal of Machine Learning Research}, 6\penalty0
(Jan):\penalty0 165--188, 2005.
\bibitem[Genewein et~al.(2015)Genewein, Leibfried, Grau-Moya, and
Braun]{GeneweinLeibfriedGrauMoyaBraun2015}
T.~Genewein, F.~Leibfried, J.~Grau-Moya, and D.~A. Braun.
\newblock {Bounded Rationality, Abstraction, and Hierarchical Decision-Making:
An Information-Theoretic Optimality Principle}.
\newblock \emph{Frontiers in Robotics and AI}, 2:\penalty0 27, 2015.
\bibitem[van~den Broek et~al.(2010)van~den Broek, Wiegerinck, and
Kappen]{Broek2010}
B.~van~den Broek, W.~Wiegerinck, and B.~Kappen.
\newblock Risk sensitive path integral control.
\newblock In \emph{Proceedings of the Twenty-Sixth Conference on Uncertainty in
Artificial Intelligence}, pages 615--622. AUAI Press, 2010.
\bibitem[Ortega and Braun(2011)]{OrtegaBraun2011}
P.~A. Ortega and D.~A. Braun.
\newblock {Information, utility and bounded rationality}.
\newblock In \emph{{International Conference on Artificial General
Intelligence}}. Springer Berlin Heidelberg, 2011.
\bibitem[Wolpert et~al.(2012)Wolpert, Harr{\'e}, Olbrich, Bertschinger, and
Jost]{WolpertHarreOlbrichBertschingerJost2012}
D.~H. Wolpert, M.~Harr{\'e}, E.~Olbrich, N.~Bertschinger, and J.~Jost.
\newblock {Hysteresis effects of changing the parameters of noncooperative
games}.
\newblock \emph{Physical Review E}, 85, 2012.
\bibitem[Bubeck and Slivkins(2012)]{BubeckSlivkins2012}
S.~Bubeck and A.~Slivkins.
\newblock {The best of both worlds: stochastic and adversarial bandits}.
\newblock In \emph{{Proceedings of the International Conference on
  Computational Learning Theory (COLT)}}, 2012.
\bibitem[Seldin and Slivkins(2014)]{SeldinSilvkins2014}
Y.~Seldin and A.~Slivkins.
\newblock {One practical algorithm for both stochastic and adversarial
  bandits}.
\newblock In \emph{{31st International Conference on Machine Learning}}, 2014.
\bibitem[Auer and Chao-Kai(2016)]{AuerChaoKai2016}
P.~Auer and C.~Chao-Kai.
\newblock {An algorithm with nearly optimal pseudo-regret for both stochastic
and adversarial bandits}.
\newblock In \emph{{29th Annual Conference on Learning Theory}}, 2016.
\bibitem[Littman(2001)]{Littman2001}
M.~L. Littman.
\newblock {Friend-or-Foe Q-Learning in General-Sum Games}.
\newblock In \emph{{Proceedings of the International Conference on Machine
Learning (ICML)}}, 2001.
\bibitem[Powers and Shoham(2005)]{Powers2005}
R.~Powers and Y.~Shoham.
\newblock New criteria and a new algorithm for learning in multi-agent systems.
\newblock In \emph{Advances in neural information processing systems}, pages
1089--1096, 2005.
\bibitem[Greenwald and Hall(2003)]{Greenwald2003}
A.~Greenwald and K.~Hall.
\newblock {Correlated Q-Learning}.
\newblock In \emph{{Proceedings of the 22nd Conference on Artificial
Intelligence}}, pages 242--249, 2003.
\bibitem[Crandall and Goodrich(2011)]{Crandall2011}
J.~W. Crandall and M.~A. Goodrich.
\newblock {Learning to compete, coordinate, and cooperate in repeated games
using reinforcement learning}.
\newblock \emph{Machine Learning}, 82\penalty0 (3):\penalty0 281--314, 2011.
\bibitem[Hernandez-Leal and Kaisers(2017)]{Hernandez2017}
P.~Hernandez-Leal and M.~Kaisers.
\newblock Learning against sequential opponents in repeated stochastic games.
\newblock In \emph{The 3rd Multi-disciplinary Conference on Reinforcement
Learning and Decision Making, Ann Arbor}, 2017.
\bibitem[Thompson(1933)]{Thompson1933}
W.~R. Thompson.
\newblock {On the likelihood that one unknown probability exceeds another in
view of the evidence of two samples}.
\newblock \emph{Biometrika}, 25:\penalty0 285--294, 1933.
\bibitem[Chapelle and Li(2011)]{ChappelleLi2011}
O.~Chapelle and L.~Li.
\newblock {An empirical evaluation of Thompson Sampling}.
\newblock In \emph{{Advances in neural information processing systems}}, 2011.
\bibitem[Ling et~al.(2018)Ling, Fang, and Kolter]{Ling2014}
C.~K. Ling, F.~Fang, and J.~Z. Kolter.
\newblock What game are we playing? end-to-end learning in normal and extensive
form games.
\newblock \emph{arXiv preprint arXiv:1805.02777}, 2018.
\bibitem[Szegedy et~al.(2013)Szegedy, Zaremba, Sutskever, Bruna, Erhan,
Goodfellow, and
Fergus]{SzegedyZarembaSutskeverBrunaErhanGoodfellowFergus2013}
C.~Szegedy, W.~Zaremba, I.~Sutskever, J.~Bruna, D.~Erhan, I.~Goodfellow, and
R.~Fergus.
\newblock {Intriguing properties of neural networks}.
\newblock \emph{arXiv preprint arXiv:1312.6199}, 2013.
\bibitem[Goodfellow et~al.(2014)Goodfellow, Shlens, and
Szegedy]{GoodfellowShlensSzegedy2014}
I.~J. Goodfellow, J.~Shlens, and C.~Szegedy.
\newblock {Explaining and harnessing adversarial examples}.
\newblock \emph{arXiv preprint arXiv:1412.6572}, 2014.
\end{thebibliography}
-
bibliography_bib
-
arxiv_citations
{"1711.09883":true,"1606.06565":true,"1805.02777":true,"1312.6199":true,"1412.6572":true}
-
alignment_newsletter
{"source":"alignment-newsletter","source_type":"google-sheets","converted_with":"python","venue":"arXiv","newsletter_category":"Handling groups of agents","highlight":false,"newsletter_number":"AN #14","newsletter_url":"https://mailchi.mp/4a4bc2286f53/alignment-newsletter-14","summarizer":"Rohin","summary":"Multiagent scenarios are typically modeled using game theory. However, it is hard to capture the intuitive notions of \"adversarial\", \"neutral\" and \"friendly\" agents using standard game theory terminology. The authors propose that we model the agent and environment as having some prior mixed strategy, and then allow them to \"react\" by changing the strategies to get a posterior strategy, but with a term in the objective function for the change (as measured by the KL divergence). The sign of the environment's KL divergence term determines whether it is friendly or adversarial, and the magnitude determines the magnitude of friendliness or adversarialness. They show that there are always equilibria, and give an algorithm to compute them. They then show some experiments demonstrating that the notions of \"friendly\" and \"adversarial\" they develop actually do lead to behavior that we would intuitively call friendly or adversarial.\n\nSome notes to understand the paper: while normally we think of multiagent games as consisting of a set of agents, in this paper there is an agent that acts, and an environment in which it acts (which can contain other agents). The objective function is neither minimized nor maximized -- the sign of the environment's KL divergence changes whether the stationary points are maxima or minima (which is why it can model both friendly and adversarial environments). There is only one utility function, the agent's utility function -- the environment is only modeled as responding to the agent, rather than having its own utility function.","opinion":"This is an interesting formalization of friendly and adversarial behavior. It feels somewhat weird to model the environment as having a prior strategy that it can then update. This has the implication that a \"somewhat friendly\" environment is unable to change its strategy to help the agent, even though it would \"want\" to, whereas when I think of a \"somewhat friendly\" environment, I think of a group of agents that share some of your goals but not all of them, so a limited amount of cooperation is possible. These feel quite different.","prerequisites":"nan","read_more":"nan","paper_version":"1807.00196v1","arxiv_id":"1807.00196","title":"Modeling Friends and Foes","authors":["Pedro A. Ortega","Shane Legg"],"date_published":"2018-06-30 16:07:43+00:00","data_last_modified":"2018-06-30 16:07:43+00:00","url":"http://arxiv.org/abs/1807.00196v1","abstract":"How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a friend, a foe, or anything in between, remains a poorly understood yet desirable ability for safe and robust agents. This paper proposes a definition of these environmental \"attitudes\" based on an characterization of the environment's ability to react to the agent's private strategy. We define an objective function for a one-shot game that allows deriving the environment's probability distribution under friendly and adversarial assumptions alongside the agent's optimal strategy. 
Furthermore, we present an algorithm to compute these equilibrium strategies, and show experimentally that both friendly and adversarial environments possess non-trivial optimal strategies.","author_comment":"13 pages, 9 figures","journal_ref":"None","doi":"None","primary_category":"cs.AI","categories":"['cs.AI']","individual_summary":"Title: Modeling Friends and Foes\nAuthors: Pedro A. Ortega, Shane Legg\nPaper abstract: How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a friend, a foe, or anything in between, remains a poorly understood yet desirable ability for safe and robust agents. This paper proposes a definition of these environmental \"attitudes\" based on an characterization of the environment's ability to react to the agent's private strategy. We define an objective function for a one-shot game that allows deriving the environment's probability distribution under friendly and adversarial assumptions alongside the agent's optimal strategy. Furthermore, we present an algorithm to compute these equilibrium strategies, and show experimentally that both friendly and adversarial environments possess non-trivial optimal strategies.\nSummary: Multiagent scenarios are typically modeled using game theory. However, it is hard to capture the intuitive notions of \"adversarial\", \"neutral\" and \"friendly\" agents using standard game theory terminology. The authors propose that we model the agent and environment as having some prior mixed strategy, and then allow them to \"react\" by changing the strategies to get a posterior strategy, but with a term in the objective function for the change (as measured by the KL divergence). The sign of the environment's KL divergence term determines whether it is friendly or adversarial, and the magnitude determines the magnitude of friendliness or adversarialness. They show that there are always equilibria, and give an algorithm to compute them. They then show some experiments demonstrating that the notions of \"friendly\" and \"adversarial\" they develop actually do lead to behavior that we would intuitively call friendly or adversarial.\n\nSome notes to understand the paper: while normally we think of multiagent games as consisting of a set of agents, in this paper there is an agent that acts, and an environment in which it acts (which can contain other agents). The objective function is neither minimized nor maximized -- the sign of the environment's KL divergence changes whether the stationary points are maxima or minima (which is why it can model both friendly and adversarial environments). There is only one utility function, the agent's utility function -- the environment is only modeled as responding to the agent, rather than having its own utility function.\nMy opinion: This is an interesting formalization of friendly and adversarial behavior. It feels somewhat weird to model the environment as having a prior strategy that it can then update. This has the implication that a \"somewhat friendly\" environment is unable to change its strategy to help the agent, even though it would \"want\" to, whereas when I think of a \"somewhat friendly\" environment, I think of a group of agents that share some of your goals but not all of them, so a limited amount of cooperation is possible. 
These feel quite different.","paper_text":"","text":"Highlights\n**[One-Shot Imitation from Watching Videos](http://bair.berkeley.edu/blog/2018/06/28/daml/)** *(Tianhe Yu and Chelsea Finn)*: Can we get a robot to learn a task by watching a human do it? This is very different from standard imitation learning. First, we want to do it with a single demonstration, and second, we want to do it by *watching a human* -- that is, we're learning from a video of a human, not a trajectory where the robot actions are given to us. Well, first consider how we could do this if we have demonstrations from a teleoperated robot. In this case, we do actually have demonstrations in the form of trajectories, so normal imitation learning techniques (behavioral cloning in this case) work fine. We can then take this loss function and use it with [MAML](http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/) to learn from a large dataset of tasks and demonstrations how to perform a new task given a single demonstration. But this still requires the demonstration to be collected by teleoperating the robot. What if we want to learn from a video of a human demonstrating? They propose learning a *loss function* that given the human video provides a loss from which gradients can be calculated to update the policy. Note that at training time there are still teleoperation demonstrations, so the hard task of learning how to perform tasks is done then. At test time, the loss function inferred from the human video is primarily used to identify which objects to manipulate.\n**My opinion:** This is cool, it actually works on a real robot, and it deals with the issue that a human and a robot have different action spaces.\n**Prerequisities:** Some form of meta-learning (ideally [MAML](http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/)).\n**[Capture the Flag: the emergence of complex cooperative agents](https://deepmind.com/blog/capture-the-flag/)** *(Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning et al)*: DeepMind has trained FTW (For The Win) agents that can play Quake III Arena Capture The Flag from raw pixels, given *only* the signal of whether they win or not. They identify three key ideas that enable this -- population based training (instead of self play), learning an internal reward function, and operating at two timescales (enabling better use of memory). Their ablation studies show that all of these are necessary, and in particular it even outperforms population based training with manual reward shaping. The trained agents can cooperate and compete with a wide range of agents (thanks to the population based training), including humans.\nBut why are these three techniques so useful? This isn't as clear, but I can speculate. Population based training works well because the agents are trained against a diversity of collaborators and opponents, which can fix the issue of instability that afflicts self-play. Operating at two timescales gives the agent a better inductive bias. They say that it enables the agent to use memory more effectively, but my story is that it lets it do something more hierarchical, where the slow RNN makes \"plans\", while the fast RNN executes on those plans. Learning an internal reward function flummoxed me for a while, it really seemed like that should not outperform manual reward shaping, but then I found out that the internal reward function is computed from the game points screen, not from the full trajectory. 
This gives it a really strong inductive bias (since the points screen provides really good features for defining reward functions) that allows it to quickly learn an internal reward function that's more effective than manual reward shaping. It's still somewhat surprising, since it's still learning this reward function from the pixels of the points screen (I assume), but more believable.\n**My opinion:** This is quite impressive, since they are learning from the binary win-loss reward signal. I'm surprised that the agents generalized well enough to play alongside humans -- I would have expected that to cause a substantial distributional shift preventing good generalization. They only had 30 agents in their population, so it seems unlikely a priori that this would induce a distribution that included humans. Perhaps Quake III is simple enough strategically that there aren't very many viable strategies, and most strategies are robust to having slightly worse allies? That doesn't seem right though.\nDeepMind did a *lot* of different things to analyze what the agents learned and how they are different from humans -- check out the [paper](https://arxiv.org/pdf/1807.01281.pdf) for details. For example, they showed that the agents are much better at tagging (shooting) at short ranges, while humans are much better at long ranges.\n \n\nTechnical AI alignment\n \n\nTechnical agendas and prioritization\n[An introduction to worst-case AI safety](http://s-risks.org/an-introduction-to-worst-case-ai-safety/) *(Tobias Baumann)*: Argues that people with suffering-focused ethics should focus on \"worst-case AI safety\", which aims to find technical solutions to risks of AIs creating vast amounts of suffering (which would be much worse than extinction).\n**My opinion:** If you have strongly suffering-focused ethics (unlike me), this seems mostly right. The post claims that suffering-focused AI safety should be more tractable than AI alignment, because it focuses on a subset of risks and only tries to minimize them. However, it's not necessarily the case that focusing on a simpler problem makes it easier to solve. It feels easier to me to figure out how to align an AI system to humans, or how to enable human control of an AI system, than to figure out all the ways in which vast suffering could happen, and solve each one individually. You can make an analogy to mathematical proofs and algorithms -- often, you want to try to prove a *stronger* statement than the one you are looking at, because when you use induction or recursion, you can rely on a stronger inductive hypothesis.\n \n\nLearning human intent\n**[One-Shot Imitation from Watching Videos](http://bair.berkeley.edu/blog/2018/06/28/daml/)** *(Tianhe Yu and Chelsea Finn)*: Summarized in the highlights!\n[Learning Montezuma’s Revenge from a Single Demonstration](https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/) *(Tim Salimans et al)*: Montezuma's Revenge is widely considered to be one of the hardest Atari games to learn, because the reward is so sparse -- it takes many actions to reach the first positive reward, and if you're using random exploration, it will take exponentially many actions (in N, the number of actions till the first reward) to find any reward. A human demonstration should make the exploration problem much easier. In particular, we can start just before the end of the demonstration, and train the RL agent to get as much score as the demonstration. 
Once it learns that, we can start it at slightly earlier in the demonstration, and do it again. Repeating this, we eventually get an agent that can perform the whole demonstration from start to finish, and it takes time linear in the length of the demonstration. Note that the agent must be able to generalize a little bit to states \"around\" the human demonstration -- when it takes random actions it will eventually reach a state that is similar to a state it saw earlier, but not exactly the same, and it needs to generalize properly. It turns out that this works for Montezuma's Revenge, but not for other Atari games like Gravitar and Pitfall.\n**My opinion:** Here, the task definition continues to be the reward function, and the human demonstration is used to help the agent effectively optimize the reward function. Such agents are still vulnerable to misspecified reward functions -- in fact, the agent discovers a bug in the emulator that wouldn't have happened if it was trying to imitate the human. I would still expect the agent to be more human-like than one trained with standard RL, since it only learns the environment near the human policy.\n[Atari Grand Challenge](http://atarigrandchallenge.com/about) *(Vitaly Kurin)*: This is a website crowdsourcing human demonstrations for Atari games, which means that the dataset will be very noisy, with demonstrations from humans of vastly different skill levels. Perhaps this would be a good dataset to evaluate algorithms that aim to learn from human data?\n[Beyond Winning and Losing: Modeling Human Motivations and Behaviors Using Inverse Reinforcement Learning](http://arxiv.org/abs/1807.00366) *(Baoxiang Wang et al)*: How could you perform IRL without access to a simulator, or a model of the dynamics of the game, or the full human policy (only a set of demonstrations)? In this setting, as long as you have a large dataset of diverse human behavior, you can use Q-learning on the demonstrations to estimate separate Q-function for each feature, and then for a given set of demonstrations you can infer the reward for that set of demonstrations using a linear program that attempts to make all of the human actions optimal given the reward function. They define (manually) five features for World of Warcraft Avatar History (WoWAH) that correspond to different motivations and kinds of human behavior (hence the title of the paper) and infer the weights for those rewards. It isn't really an evaluation because there's no ground truth.\n \n\nPreventing bad behavior\n[Overcoming Clinginess in Impact Measures](https://www.lesswrong.com/posts/DvmhXysefEyEvXuXS/overcoming-clinginess-in-impact-measures) *(TurnTrout)*: In their [previous post](https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting), TurnTrout proposed a whitelisting approach, that required the AI not to cause side effects not on the whitelist. One criticism was that it made the AI *clingy*, that is, the AI would also prevent any other agents in the world from causing non-whitelisted effects. In this post, they present a solution to the clinginess problem. As long as the AI knows all of the other agents in the environment, and their policies, the AI can be penalized for the *difference* of effects between its behavior, and what the human(s) would have done. There's analysis in a few different circumstances, where it's tricky to get the counterfactuals exactly right. 
However, this sort of impact measure means that while the AI is punished for causing side effects itself, it *can* manipulate humans to perform those side effects on its behalf with no penalty. This appears to be a tradeoff in the impact measure framework -- either the AI will be clingy, where it prevents humans from causing prohibited side effects, or it could cause the side effects through manipulation of humans.\n**My opinion:** With any impact measure approach, I'm worried that there is no learning of what humans care about. As a result I expect that there will be issues that won't be handled properly (similarly to how we don't expect to be able to write down a human utility function). In the previous post, this manifested as a concern for generalization ability, which I'm still worried about. I think the tradeoff identified in this post is actually a manifestation of this worry -- clinginess happens when your AI overestimates what sorts of side effects humans don't want to happen in general, while manipulation of humans happens when your AI underestimates what side effects humans don't want to happen (though with the restriction that only humans can perform these side effects).\n**Prerequisities:** [Worrying about the Vase: Whitelisting](https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting)\n \n\nGame theory\n[Modeling Friends and Foes](http://arxiv.org/abs/1807.00196) *(Pedro A. Ortega et al)*: Multiagent scenarios are typically modeled using game theory. However, it is hard to capture the intuitive notions of \"adversarial\", \"neutral\" and \"friendly\" agents using standard game theory terminology. The authors propose that we model the agent and environment as having some prior mixed strategy, and then allow them to \"react\" by changing the strategies to get a posterior strategy, but with a term in the objective function for the change (as measured by the KL divergence). The sign of the environment's KL divergence term determines whether it is friendly or adversarial, and the magnitude determines the magnitude of friendliness or adversarialness. They show that there are always equilibria, and give an algorithm to compute them. They then show some experiments demonstrating that the notions of \"friendly\" and \"adversarial\" they develop actually do lead to behavior that we would intuitively call friendly or adversarial.\nSome notes to understand the paper: while normally we think of multiagent games as consisting of a set of agents, in this paper there is an agent that acts, and an environment in which it acts (which can contain other agents). The objective function is neither minimized nor maximized -- the sign of the environment's KL divergence changes whether the stationary points are maxima or minima (which is why it can model both friendly and adversarial environments). There is only one utility function, the agent's utility function -- the environment is only modeled as responding to the agent, rather than having its own utility function.\n**My opinion:** This is an interesting formalization of friendly and adversarial behavior. It feels somewhat weird to model the environment as having a prior strategy that it can then update. 
This has the implication that a \"somewhat friendly\" environment is unable to change its strategy to help the agent, even though it would \"want\" to, whereas when I think of a \"somewhat friendly\" environment, I think of a group of agents that share some of your goals but not all of them, so a limited amount of cooperation is possible. These feel quite different.\n \n\nInterpretability\n[This looks like that: deep learning for interpretable image recognition](http://arxiv.org/abs/1806.10574) *(Chaofan Chen, Oscar Li et al)*\n \n\nVerification\n[Towards Mixed Optimization for Reinforcement Learning with Program Synthesis](http://arxiv.org/abs/1807.00403) *(Surya Bhupatiraju, Kumar Krishna Agrawal et al)*: This paper proposes a framework in which policies are represented in two different ways -- as neural nets (the usual way) and as programs. To go from neural nets to programs, you use *program synthesis* (as done by [VIPER](http://arxiv.org/abs/1805.08328) and [PIRL](https://arxiv.org/abs/1804.02477), both summarized in previous newsletters). To go from programs to neural nets, you use *distillation* (basically use the program to train the neural net with supervised training). Given these transformations, you can then work with the policy in either space. For example, you could optimize the policy in both spaces, using standard gradient descent in neural-net-space, and *program repair* in program-space. Having a program representation can be helpful in other ways too, as it makes the policy more interpretable, and more amenable to formal verification of safety properties.\n**My opinion:** It is pretty nice to have a program representation. This paper doesn't delve into specifics (besides a motivating example worked out by hand), but I'm excited to see an actual instantiation of this framework in the future!\n \n\nNear-term concerns\n \n\nAdversarial examples\n[Adversarial Reprogramming of Neural Networks](https://arxiv.org/abs/1806.11146) *(Gamaleldin F. Elsayed et al)*\n \n\nAI strategy and policy\n[Shaping economic incentives for collaborative AGI](https://www.lesswrong.com/posts/FkZCM4DMprtEp568s/shaping-economic-incentives-for-collaborative-agi) *(Kaj Sotala)*: This post considers how to encourage a culture of cooperation among AI researchers. Then, when researchers try to create AGI, this culture of cooperation may make it more likely that AGI is developed collaboratively, instead of with race dynamics, making it more likely to be safe. It specifically poses the question of what external economic or policy incentives could encourage such cooperation.\n**My opinion:** I am optimistic about developing AGI collaboratively, especially through AI researchers cooperating. I'm not sure whether external incentives from government are the right way to achieve this -- it seems likely that such regulation would be aimed at the wrong problems if it originated from government and not from AI researchers themselves. I'm more optimistic about some AI researchers developing guidelines and incentive structures themselves, that researchers buy into voluntarily, that maybe later get codified into law by governments, or adopted by companies for their AI research.\n[An Overview of National AI Strategies](https://medium.com/politics-ai/an-overview-of-national-ai-strategies-2a70ec6edfd) *(Tim Dutton)*: A short reference on the AI policies released by various countries.\n**My opinion:** Reading through this, it seems that countries are taking quite different approaches towards AI. 
I don't know what to make of this -- are they acting close to optimally given their geopolitical situation (which must then vary a lot by country), or does no one know what's going on and as a result all of the strategies are somewhat randomly chosen? (Here by \"randomly chosen\" I mean that the strategies that one group of analysts would select with is only weakly correlated with the strategies another group would select.) It could also be that the approaches are not actually that different.\n[Joint Artificial Intelligence Center Created Under DoD CIO](https://breakingdefense.com/2018/06/joint-artificial-intelligence-center-created-under-dod-cio/) *(Sydney J. Freedberg Jr.)*\n \n\nAI capabilities\n \n\nReinforcement learning\n**[Capture the Flag: the emergence of complex cooperative agents](https://deepmind.com/blog/capture-the-flag/)** *(Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning et al)*: Summarized in the highlights!\n[Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization](http://arxiv.org/abs/1807.01672) *(Alexandre Laterre et al)*\n[Procedural Level Generation Improves Generality of Deep Reinforcement Learning](http://arxiv.org/abs/1806.10729) *(Niels Justesen et al)* |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| [If you got forwarded this email, you can sign up here!](http://eepurl.com/dqMSZj \"If you got forwarded this email, you can sign up here!\") |\n\n |\n\n |\n| \n\n| | | | | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n| | | | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n| | | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n| | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n\n| | | |\n| --- | --- | --- |\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n |\n\n\n\n\n| | | |\n| --- | --- | --- |\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n |\n\n\n\n\n| | | |\n| --- | --- | --- |\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n |\n\n\n |\n\n |\n\n |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| *Copyright © 2018 Rohin Shah, All rights reserved.*\n\n\n"}
abstract: |
How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a friend, a foe, or anything in between, remains a poorly understood yet desirable ability for safe and robust agents. This paper proposes a definition of these environmental "attitudes" based on a characterization of the environment's ability to react to the agent's private strategy. We define an objective function for a one-shot game that allows deriving the environment's probability distribution under friendly and adversarial assumptions alongside the agent's optimal strategy. Furthermore, we present an algorithm to compute these equilibrium strategies, and show experimentally that both friendly and adversarial environments possess non-trivial optimal strategies.
author:
- |
  Pedro A. Ortega
  DeepMind
  pedroortega@google.com
- |
  Shane Legg
  DeepMind
  legg@google.com
bibliography:
- bibliography.bib
title: Modeling Friends and Foes
Keywords: AI safety; friendly and adversarial; game theory; bounded rationality.
Introduction
How can agents detect friendly and adversarial behavior from raw data? Discovering whether an environment (or a part of it) is a friend or a foe is a poorly understood yet desirable skill for safe and robust agents. Possessing this skill is important in a number of situations, including:
-
Multi-agent systems: Some environments, especially in multi-agent systems, might have incentives to either help or hinder the agent [@Leike2017]. For example, an agent playing football must anticipate both the creative moves of its team members and of its opponents. Thus, learning to discern between friends and foes might not only help the agent avoid danger, but also open up the possibility of solving tasks through collaboration that it could not solve alone.
-
Model uncertainty: An agent can choose to impute "adversarial" or "friendly" qualities to an environment that it does not know well. For instance, an agent that is trained in a simulator could compensate for the inaccuracies by assuming that the real environment differs from the simulated one---but in an adversarial way, so as to devise countermeasures ahead of time [@Amodei2016]. Similarly, inaccuracies might also originate in the agent itself---for instance, due to bounded rationality [@Russell1997; @OrtegaBraun2013].
Typically, these situations involve a knowledge limitation that the agent addresses by responding with a risk-sensitive policy.
The contributions of this paper are threefold. First, we offer a broad definition of friendly and adversarial behavior. Furthermore, by varying a single real-valued parameter, one can select from a continuous range of behaviors that smoothly interpolate between fully adversarial and fully friendly. Second, we derive the agent's (and environment's) optimal strategy under friendly or adversarial assumptions. To do so, we treat the agent-environment interaction as a one-shot game with information-constraints, and characterize the optimal strategies at equilibrium. Finally, we provide an algorithm to find the equilibrium strategies of the agent and the environment. We also demonstrate empirically that the resulting strategies display non-trivial behavior that varies qualitatively with the information-constraints.
Motivation & Intuition {#sec:rationale}
We begin with an intuitive example using multi-armed bandits. This helps motivate our mathematical formalization in the next section.
Game theory is the classical economic paradigm for analyzing the interaction between agents [@OsborneRubinstein1994]. Within game theory, the term adversary is justified by the fact that in zero-sum games the equilibrium strategies are maximin strategies, that is, strategies that maximize the expected payoff under the assumption that the adversary will minimize it. When the game is not zero-sum, however, interpreting an agent's behavior as adversarial is far less obvious, as there is no coupling between payoffs, and the strong guarantees provided by the minimax theorem are unavailable [@VonNeumann1947]. The notions of "indifferent" and "friendly" are similarly troublesome to capture using the standard game-theoretic language.
[Figure 1: Average rewards of the agent under strategies I and II for bandits A--D.]{#fig:bernoulli-mods width="80%"}
Intuitively though, terms such as adversarial, indifferent, and friendly have simple meanings. To illustrate, consider two different strategies (I & II) used on four different two-armed bandits (named A--D), where each strategy gets to play 1000 rounds. In each round, a bandit secretly chooses which one (and only one) of the two arms will deliver the reward. Importantly, each bandit chooses the location of the arm using a different probabilistic rule. Then the agent pulls an arm, receiving the reward if it guesses correctly and nothing otherwise.
The two different strategies are:
I
: The agent plays all 1000 rounds using a uniformly random strategy.
II
: The agent deterministically pulls each arm exactly 500 times using some fixed rule.
Note that each strategy pulls each arm approximately 50% of the time. Now, the agent's average rewards for all four bandits are shown in Figure 1. Based on these results, we make the following observations:
1.
: Sensitivity to strategy. The two sets of average rewards for bandit A are statistically indistinguishable, that is, they stay the same regardless of the agent's strategy. This corresponds to the stochastic bandit type in the literature [@SuttonBarto1998; @BubeckCesaBianchi2012]. In contrast, bandits B--D yielded different average rewards for the two strategies. Although each arm was pulled approximately 500 times, it appears as if the reward distributions were a function of the strategy.
2.
: Adversarial/friendly exploitation of strategy. The average rewards do not always add up to one, as one would expect if the rewards were truly independent of the strategy. Compared to the uniform strategy, the deterministic strategy led to either an increase (bandit B) or a decrease of the empirical rewards (bandits C and D). We can interpret the behavior of bandits C & D as an adversarial exploitation of the predictability of the agent's strategy---much like an exploitation of a deterministic strategy in a rock-paper-scissors game. Analogously, bandit B appears to be friendly towards the agent, tending to place the rewards favorably.
3.
: Strength of exploitation. Notice how the rewards of both adversarial bandits (C & D) when using strategy II differ in how strongly they deviate from the baseline set by strategy I. This difference suggests that bandit D is better at reacting to the agent's strategy than bandit C---and therefore also more adversarial. A bandit that can freely choose any placement of rewards is known as a non-stochastic bandit [@AuerEtAl2002; @BubeckCesaBianchi2012].
4.
: Cooperating/hedging. The nature of the bandit qualitatively affects the agent's optimal strategy. A friendly bandit (B) invites the agent to cooperate through the use of predictable policies; whereas adversarial bandits (C & D) pressure the agent to hedge through randomization.
Simply put, bandits B--D appear to react to the agent's private strategy in order to manipulate the payoffs. Abstractly, we can picture this as follows. First, a reactive environment can be thought of as possessing privileged information about the agent's private strategy, acquired e.g. through past experience or through "spying". Then, the amount of information determines the extent to which the environment is willing to deviate from a baseline, indifferent strategy. Second, the adversarial or friendly nature of the environment is reflected by the strategy it chooses: an adversarial (resp. friendly) environment will select the reaction that minimizes (resp. maximizes) the agent's payoff. The diagram in Figure 2 illustrates this idea in a Rock-Paper-Scissors game.
[Figure 2: The environment pictured as an information channel that reacts to the agent's private strategy, illustrated on Rock-Paper-Scissors.]{#fig:channel width="50%"}
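To make this picture concrete before formalizing it, the following toy simulation (a sketch of the intuition only, not the model defined in the next section; the tilt parameter and function names are illustrative) lets a two-armed bandit "spy" on the agent's mixed strategy and shift its reward placement toward or against the agent's likely pull:

```python
import numpy as np

rng = np.random.default_rng(0)

def average_reward(agent_probs, tilt, rounds=100_000):
    # The bandit starts from a uniform placement and reacts to the agent's
    # private strategy: tilt > 0 shifts rewards toward the agent's preferred
    # arm (friendly), tilt < 0 shifts them away (adversarial), tilt = 0 is
    # indifferent and ignores the strategy altogether.
    placement = np.array([0.5, 0.5]) + tilt * (np.asarray(agent_probs) - 0.5)
    placement = np.clip(placement, 0.0, 1.0)
    placement /= placement.sum()
    pulls = rng.choice(2, size=rounds, p=agent_probs)      # agent's arm choices
    rewards_at = rng.choice(2, size=rounds, p=placement)   # bandit's reward placements
    return np.mean(pulls == rewards_at)

for name, tilt in [("indifferent", 0.0), ("friendly", 0.8), ("adversarial", -0.8)]:
    print(name,
          "uniform:", round(average_reward([0.5, 0.5], tilt), 2),
          "predictable:", round(average_reward([1.0, 0.0], tilt), 2))
```

A uniformly random agent earns about 0.5 against every bandit, whereas a predictable (pure) strategy is rewarded by the friendly bandit and punished by the adversarial one, mirroring observations 1--4 above.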
We make one final observation.
5.
: Agent/environment symmetry. Let us turn the tables on the agent: how should we play if we were the bandit? A moment of reflection reveals that the analysis is symmetrical. An agent that does not attempt to maximize the payoff, or cannot do so due to limited reasoning power, will pick its strategy in a way that is indifferent to our placement of the reward. In contrast, a more effective agent will react to our choice, seemingly anticipating it. Furthermore, the agent will appear friendly if our goal is to maximize the payoff and adversarial if our goal is to minimize it.
This symmetry implies that the choices of the agent and the environment are coupled to each other, suggesting a solution principle for determining a strategy profile akin to a Nash equilibrium [@OsborneRubinstein1994]. The next section will provide a concrete formalization.
Characterizing Friends and Foes {#sec:formalization}
In this section we formalize the picture sketched out in the preceding section. We first state an objective function for a game that couples the agent's interests with those of the environment and limits both players' ability to react to each other. We then derive expressions for their optimal strategies (i.e. the best-response functions). Based on these, we then give an existence result and an indifference result for the ensuing equilibrium strategies.
Objective function
Let $\mathcal{X}$ and $\mathcal{Z}$ denote the sets of actions (i.e. pure strategies) of the agent and the environment respectively; and let $Q$ and $P$ denote prior and posterior strategies respectively. We represent the interaction between the agent and the environment as a one-shot game in which both players, starting from prior strategies $Q(X)\in\Delta(\mathcal{X})$ and $Q(Z)\in\Delta(\mathcal{Z})$, choose (mixed) posterior strategies $P(X)\in\Delta(\mathcal{X})$ and $P(Z)\in\Delta(\mathcal{Z})$ respectively. The goal of the agent is to maximize the payoffs given by a function that maps choices $(x,z)\in\mathcal{X}\times\mathcal{Z}$ into utilities $U(x,z)\in\mathbb{R}$.
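Concretely, such a finite one-shot game is fully specified by a payoff matrix and two prior probability vectors; a minimal NumPy sketch (the variable names are ours, not the paper's):

```python
import numpy as np

# A finite one-shot game: |X| agent actions (rows), |Z| environment actions (columns).
U = np.array([[1.0, 0.0],        # U[x, z]: the agent's utility for the joint choice (x, z)
              [0.0, 1.0]])
Q_x = np.array([0.5, 0.5])       # agent's prior strategy Q(X)
Q_z = np.array([0.4, 0.6])       # environment's prior strategy Q(Z)

expected_utility = Q_x @ U @ Q_z   # E[U(X, Z)] when both players keep their priors
```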
We model the two players' sensitivity to each other's strategy as coupled deviations from indifferent prior strategies, whereby each player attempts to extremize the expected utility, possibly pulling in opposite directions. Formally, consider the objective function
$$\begin{aligned}
J &= \mathbb{E}[U(X,Z)] && - \frac{1}{\alpha}D_\mathrm{KL}(P(X)\,\|\,Q(X)) && - \frac{1}{\beta}D_\mathrm{KL}(P(Z)\,\|\,Q(Z)) \\
  &= \sum_{x,z}P(x)P(z)U(x,z) && - \frac{1}{\alpha}\sum_{x}P(x)\log\frac{P(x)}{Q(x)} && - \frac{1}{\beta}\sum_{z}P(z)\log\frac{P(z)}{Q(z)} \\
  &= \text{coupled expected payoffs} && - \text{agent deviation cost} && - \text{env.\ deviation cost}
\end{aligned}\label{eq:objective}$$ where $\alpha,\beta\in\mathbb{R}$, known as the inverse temperature parameters, determine the reaction abilities of the agent and the environment respectively.
This objective function ([eq:objective]) is obtained by coupling two free energy functionals (one for each player) which model decision-making with information-constraints (see e.g. [@TishbyPolani2011; @OrtegaBraun2013]). We will discuss the interpretation of this choice further in Section 6.1. Other constraints are possible, e.g. any deviation quantified as a Bregman divergence [@Banerjee2005].
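For concreteness, the objective ([eq:objective]) can be evaluated directly; a small sketch assuming nonzero $\alpha,\beta$ and full-support distributions (the function name is illustrative):

```python
import numpy as np

def objective(P_x, P_z, Q_x, Q_z, U, alpha, beta):
    # J = coupled expected payoff minus the two KL deviation costs,
    # each weighted by the corresponding inverse temperature.
    expected_payoff = P_x @ U @ P_z
    kl_x = np.sum(P_x * np.log(P_x / Q_x))   # D_KL(P(X) || Q(X))
    kl_z = np.sum(P_z * np.log(P_z / Q_z))   # D_KL(P(Z) || Q(Z))
    return expected_payoff - kl_x / alpha - kl_z / beta
```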
Friendly and Adversarial Environments
Both the sign and the magnitude of the inverse temperatures control the players' reactions as follows.
1.
: The sign of $\beta$ determines the extremum operation: if $\beta$ is positive, then $J$ is concave for fixed $P(X)$ and the environment maximizes the objective w.r.t. $P(Z)$; analogously, a negative value of $\beta$ yields a convex objective $J$ that is minimized w.r.t. $P(Z)$.
2.
: The magnitude of $\beta$ determines the strength of the deviation: when $|\beta|\approx0$, the environment can only pick strategies $P(Z)$ that are within a small neighborhood of the center $Q(Z)$, whereas $|\beta|\gg0$ yields a richer set of choices for $P(Z)$.
The parameter $\alpha$ plays an analogous role, although in this exposition we will focus on $\alpha\geq0$ and interpret it as a parameter that controls the agent's ability to react. In particular, setting $\alpha=0$ fixes $P(X)$ to $Q(X)$, which is useful for deriving the posterior environment $P(Z)$ for a given, fixed agent strategy.
From the above, it is easy to see that friendly and adversarial environments are modeled through the appropriate choice of $\beta$. For $\alpha>0$ and $\beta>0$, we obtain a friendly environment that helps the agent in maximizing the objective, i.e. $$\max_{P(X)}\max_{P(Z)}{J[P(X),P(Z)]}.$$ In contrast, for $\alpha>0$ and $\beta<0$, we get an adversarial environment that minimizes the objective: $$\max_{P(X)}\min_{P(Z)}{J[P(X),P(Z)]}=\min_{P(Z)}\max_{P(X)}{J[P(X),P(Z)]}.$$ In particular, the equality after exchanging the order of the minimization and maximization can be shown to hold using the minimax theorem: $J$ is a continuous and convex-concave function of $P(X)$ and $P(Z)$, which in turn live in compact and convex sets $\Delta(\mathcal{X})$ and $\Delta(\mathcal{Z})$ respectively. The resulting strategies then locate a saddle point of $J$.
Existence and Characterization of Equilibria
To find the equilibrium strategies for ([eq:objective]), we calculate the best-response function for each player, i.e. the optimal strategy for a given strategy of the other player, in both the friendly and adversarial cases. Proofs of the claims can be found in Appendix 7.
Proposition 1.
The best-response functions $f_{X}$ and $f_{Z}$ for the agent and the environment are given by the Gibbs distributions $$\begin{aligned}
P(X) & =f_{X}[P(Z)]: & P(x) & =\frac{1}{N_{X}}Q(x)\exp\bigl(\alpha U(x)\bigr), & U(x) & :=\sum_{z}P(z)U(x,z);\label{eq:opt-player}\\
P(Z) & =f_{Z}[P(X)]: & P(z) & =\frac{1}{N_{Z}}Q(z)\exp\bigl(\beta U(z)\bigr), & U(z) & :=\sum_{x}P(x)U(x,z)\label{eq:opt-env}\end{aligned}$$ respectively, where $N_{X}$ and $N_{Z}$ are normalizing constants. $\square$
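A direct NumPy transcription of these Gibbs best responses reads as follows (a sketch; the max-subtraction is a standard numerical safeguard, not part of the statement):

```python
import numpy as np

def best_response_x(P_z, Q_x, U, alpha):
    # Agent's best response f_X[P(Z)]: Gibbs distribution over actions x.
    logits = np.log(Q_x) + alpha * (U @ P_z)       # alpha * U(x), U(x) = E_z[U(x, z)]
    w = np.exp(logits - logits.max())              # subtract max for numerical stability
    return w / w.sum()

def best_response_z(P_x, Q_z, U, beta):
    # Environment's best response f_Z[P(X)]: Gibbs distribution over actions z.
    logits = np.log(Q_z) + beta * (U.T @ P_x)      # beta * U(z), U(z) = E_x[U(x, z)]
    w = np.exp(logits - logits.max())
    return w / w.sum()
```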
Given the above best-response functions $f_{X}$ and $f_{Z}$, we define an equilibrium strategy profile of the objective function ([eq:objective]) as a fixed-point of the combined best-response function defined as a mapping $f:\Delta(\mathcal{X})\times\Delta(\mathcal{Z})\rightarrow\Delta(\mathcal{X})\times\Delta(\mathcal{Z})$ that concatenates the two best-response functions, i.e. $$f[P(X),P(Z)]:=(f_{X}[P(Z)],f_{Z}[P(X)]).\label{eq:comb-best-resp}$$ That is, the equilibrium strategy profile [^1], in analogy with the Nash equilibrium, is a mixed-strategy profile that lies at the intersection of both best-response curves. With this definition, the following existence result follows immediately.
Proposition 2.
There always exists an equilibrium strategy profile. $\square$
Finally, the following result characterizes the equilibrium strategy profile in terms of an indifference principle (later illustrated in Figure 3). In particular, the result shows that both players strive towards playing strategies that equalize each other's net payoffs, defined as $$J_{X}(x) := \alpha\sum_{z}P(z)U(x,z) - \log\frac{P(x)}{Q(x)}
\quad \text{and} \quad
J_{Z}(z) := \beta\sum_{x}P(x)U(x,z) - \log\frac{P(z)}{Q(z)}.
\label{eq:net-payoffs}$$
Proposition 3.
In equilibrium, the net payoffs are such that for all $x,x'\in\mathcal{X}$ and all $z,z'\in\mathcal{Z}$ in the support of $P(X)$ and $P(Z)$ respectively, $$J_{X}(x)=J_{X}(x')\quad\text{and}\quad J_{Z}(z)=J_{Z}(z'). \square$$
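The indifference principle can be checked numerically: at an (approximate) equilibrium, the two net-payoff vectors from ([eq:net-payoffs]) should be constant over the respective supports. A sketch (names are illustrative):

```python
import numpy as np

def net_payoffs(P_x, P_z, Q_x, Q_z, U, alpha, beta):
    # J_X(x) and J_Z(z) as defined in Eq. (eq:net-payoffs).
    J_x = alpha * (U @ P_z) - np.log(P_x / Q_x)
    J_z = beta * (U.T @ P_x) - np.log(P_z / Q_z)
    return J_x, J_z

# At an equilibrium strategy profile, the spreads np.ptp(J_x) and np.ptp(J_z)
# should be close to zero on the supported actions.
```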
Computing Equilibria {#sec:learning}
Now we derive an algorithm for computing the equilibrium strategies of the agent and the environment. It is well known that standard gradient descent with competing losses is difficult [@Mescheder2017; @Balduzzi2018], and indeed a straightforward gradient-ascent/descent method on the objective ([eq:objective]) turns out to be numerically brittle, especially for values of $\alpha$ and $\beta$ near zero. Rather, we let the strategies follow a smoothed dynamics on the exponential manifold until reaching convergence. Equation ([eq:opt-player]) shows that the log-probabilities of the best-response strategies are: $$\begin{aligned}
\log P(x) & \overset{+}{=}\log Q(x)+\alpha\sum_{z}P(z)U(x,z) \\
\log P(z) & \overset{+}{=}\log Q(z)+\beta\sum_{x}P(x)U(x,z)\end{aligned}$$ where $\overset{+}{=}$ denotes equality up to a constant. This suggests the following iterative algorithm. Starting from $L_{0}(x)=\log Q(x)$ and $L_{0}(z)=\log Q(z)$, one iterates the following four equations for time steps $t=0,1,2,\ldots$ $$\begin{aligned}
L_{t+1}(x) & =(1-\eta_{t})\cdot L_{t}(x)+\eta_{t}\cdot\biggl(\log Q(x)+\alpha\sum_{z}P_{t}(z)U(x,z)\biggr) \\
P_{t+1}(x) & =\frac{\exp L_{t+1}(x)}{\sum_{\tilde{x}}\exp L_{t+1}(\tilde{x})} \\
L_{t+1}(z) & =(1-\eta_{t})\cdot L_{t}(z)+\eta_{t}\cdot\biggl(\log Q(z)+\beta\sum_{x}P_{t+1}(x)U(x,z)\biggr) \\
P_{t+1}(z) & =\frac{\exp L_{t+1}(z)}{\sum_{\tilde{z}}\exp L_{t+1}(\tilde{z})}\end{aligned}$$ Here, the learning rate $\eta_{t}>0$ can be chosen constant but sufficiently small to achieve a good approximation; alternatively, one can use an annealing schedule that conforms to the Robbins-Monro conditions $\sum_{t}\eta_{t}\rightarrow\infty$ and $\sum_{t}\eta_{t}^{2}<\infty$. Figure 3 shows four example simulations of the learning dynamics.
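The iteration above translates directly into a few lines of NumPy; the following sketch implements it with a constant learning rate (the convergence settings and the example game at the bottom are illustrative, not taken from the paper):

```python
import numpy as np

def equilibrium(Q_x, Q_z, U, alpha, beta, eta=0.05, iters=5000):
    # Smoothed dynamics in log-space: relax the (unnormalized) log-strategies
    # toward the Gibbs best responses until the strategies stop changing.
    L_x, L_z = np.log(Q_x), np.log(Q_z)
    P_x, P_z = Q_x.copy(), Q_z.copy()
    for _ in range(iters):
        L_x = (1 - eta) * L_x + eta * (np.log(Q_x) + alpha * (U @ P_z))
        P_x = np.exp(L_x - L_x.max()); P_x /= P_x.sum()
        L_z = (1 - eta) * L_z + eta * (np.log(Q_z) + beta * (U.T @ P_x))
        P_z = np.exp(L_z - L_z.max()); P_z /= P_z.sum()
    return P_x, P_z

# Example: identity payoffs with a reactive, adversarial environment.
P_x, P_z = equilibrium(np.array([0.5, 0.5]), np.array([0.4, 0.6]),
                       np.eye(2), alpha=5.0, beta=-5.0)
print(P_x, P_z)
```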
[Figure 3: Four example simulations of the learning dynamics.]{#fig:learning-dynamics width="100%"}
Experiments {#sec:experiments}
Bernoulli Bandit
We now return to the two-armed bandits discussed in the introduction (Figure 1) and explain how these were modeled.
Assuming that the agent plays the rows (i.e. arms) and the bandit/environment the columns (i.e. reward placement), the utility matrix was chosen as $$U = I_{2\times2} = \left[\begin{matrix}
1 & 0 \\
0 & 1
\end{matrix}\right].$$ This reflects the fact that there is always one and only one reward.
We did not want the agent to play an equilibrium strategy; rather, we investigated each bandit's reaction to a uniform strategy and two pure strategies, that is, $Q(X)=[0.5, 0.5]^{T}$, $Q(X)=[1, 0]^{T}$, and $Q(X)=[0, 1]^{T}$ respectively. Then, choosing $\alpha=0$ implies that the agent's posterior strategy stays fixed, i.e. $P(X)=Q(X)$.
For the bandits, we fixed a common prior strategy $Q(Z)=[0.4,0.6]^{T}$, that is, slightly biased toward the second arm. Appropriate inverse temperatures were then chosen to model the indifferent, friendly, adversarial, and very adversarial bandits:
| Bandit | Type                   | $\beta$ |
|--------|------------------------|---------|
| A      | Indifferent/Stochastic | $0$     |
| B      | Friendly               | $1$     |
| C      | Adversarial            | $-1$    |
| D      | Very Adversarial       | $-2$    |
We then computed the equilibrium strategies for each combination, which in this case (due to $\alpha=0$) reduces to computing the bandits' best-response functions for each one of the three agent strategies. Once the posterior distributions $P(Z)$ were calculated, we simulated each one and collected the empirical rewards which are summarized in Figure 1.
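Because $\alpha=0$ reduces the computation to a single Gibbs best response per bandit, the whole experiment fits in a short script; the sketch below reports expected (rather than simulated) rewards under the stated priors and temperatures:

```python
import numpy as np

def bandit_posterior(Q_x, Q_z, U, beta):
    # With alpha = 0 the agent is frozen at Q(X), so the equilibrium reduces
    # to the bandit's Gibbs best response against that fixed strategy.
    logits = np.log(Q_z) + beta * (U.T @ Q_x)
    w = np.exp(logits - logits.max())
    return w / w.sum()

U = np.eye(2)                          # reward iff the agent guesses the rewarded arm
Q_z = np.array([0.4, 0.6])             # common bandit prior, biased toward arm 2
strategies = {"uniform": np.array([0.5, 0.5]),
              "pure arm 1": np.array([1.0, 0.0]),
              "pure arm 2": np.array([0.0, 1.0])}
betas = {"A (indifferent)": 0.0, "B (friendly)": 1.0,
         "C (adversarial)": -1.0, "D (very adversarial)": -2.0}

for bandit, beta in betas.items():
    for name, Q_x in strategies.items():
        P_z = bandit_posterior(Q_x, Q_z, U, beta)
        print(f"{bandit:22s} {name:11s} expected reward: {Q_x @ U @ P_z:.3f}")
```

Against the uniform strategy every bandit yields an expected reward of 0.5, while the pure strategies are rewarded by bandit B and increasingly punished by C and D, qualitatively consistent with Figure 1.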
Gaussian Bandit, and Dependence on Variance
In a stochastic bandit, the optimal strategy is to deterministically pick the arm with the largest expected payoff. However, in adversarial and friendly bandits, the optimal strategy can depend on the higher-order moments of the reward distribution, as was shown previously in [@OrtegaKimLee2015]. The aim of this experiment is to reproduce these results, showing the dependence of the optimal strategy on both the mean and variance of the payoff distribution.
Setup.
To do so, we considered a four-armed bandit with payoffs that are distributed according to (truncated and discretized) Gaussian distributions. To investigate the interplay between the mean and the variance, we chose four Gaussians with:
-
increasing means, where the means $\mu$ are uniformly spaced between -0.2 and 0.2;
-
and decreasing variances, where the standard deviations $\sigma$ are uniformly spaced between 1 and 2.
Thus, the arm with the largest mean is the most precise. Clearly, if the bandit is stochastic, arms are ranked according to their mean payoffs irrespective of their variances.
We then performed a sweep through the bandit's inverse temperature $\beta$, starting from an adversarial bandit ($\beta=-3$) and ending with a friendly one ($\beta=+3$), computing the equilibrium strategies for both players along the way. Throughout the sweep, the agent's inverse temperature was kept constant at $\alpha=30$, modeling a highly rational agent.
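The paper does not spell out here how the Gaussian rewards are encoded as a one-shot game, so the following sketch adopts one plausible reading: the environment's pure strategies are joint assignments of discretized reward levels to the four arms, with the product of the truncated per-arm Gaussians as its prior $Q(Z)$. The grid size and truncation range are arbitrary choices, and the sketch reuses `solve_equilibrium` from the first sketch above.

```python
import itertools
import numpy as np

# Per-arm truncated, discretized Gaussians (assumed encoding, see above).
means = np.linspace(-0.2, 0.2, 4)         # increasing means
stds = np.linspace(2.0, 1.0, 4)           # decreasing standard deviations
levels = np.linspace(-1.0, 1.0, 7)        # discretized reward levels (truncated support)

def discretized_gaussian(mu, sigma):
    p = np.exp(-0.5 * ((levels - mu) / sigma) ** 2)
    return p / p.sum()

arm_dists = [discretized_gaussian(m, s) for m, s in zip(means, stds)]

# Environment's pure strategies: joint reward assignments z = (r_1, ..., r_4).
assignments = list(itertools.product(range(len(levels)), repeat=4))
Qz = np.array([np.prod([arm_dists[a][idx[a]] for a in range(4)]) for idx in assignments])
U = np.array([[levels[idx[a]] for idx in assignments] for a in range(4)])  # U[x, z] = reward of arm x

# Sweep the environment's inverse temperature, keeping the agent at alpha = 30.
for beta in np.linspace(-3, 3, 13):
    Px, _ = solve_equilibrium(U, alpha=30.0, beta=beta, Qz=Qz)
    print(f"beta={beta:+.1f}  P(X)={np.round(Px, 2)}")
```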
[Figure 4: equilibrium strategies across the sweep of the bandit's inverse temperature $\beta$.]
Results.
The results are shown in Figure 4. As expected, for values close to $\beta\approx0$ the agent's optimal strategy consists of (mostly) pulling the arm with the largest expected reward. In contrast, adversarial bandits ($\beta<0$) attempt to diminish the payoffs of the agent's preferred arms, thus forcing it to adopt a mixed strategy. In the friendly case ($\beta>0$), the agent's strategy becomes deterministic. Interestingly though, the optimum switches to arms with higher variance as $\beta$ increases (see e.g. Figure 4c). This is because Gaussians with larger variance are "cheaper to shift"; in other words, the rate of growth of the KL-divergence per unit of translation is lower. Thus, arms that were suboptimal for $\beta=0$ can become optimal for larger $\beta$ values if their variances are large. We believe that these "phase transitions" are related to the ones previously observed under information constraints [@TishbySlonim2001; @Chechik2005; @GeneweinLeibfriedGrauMoyaBraun2015].
Linear Classifier
The purpose of our last experiment is to illustrate the non-trivial interactions that may arise between a classifier and a reactive data source, be it friendly or adversarial. Specifically, we designed a simple 2-D linear classification example in which the agent chooses the parameters of the classifier and the environment picks the binary data labels.
Method.
We considered hard classifiers of the form $y=\sigma(\mathbf{w}^{T}\mathbf{x}-\mathbf{b})$ where $\mathbf{x}$ and $y$ are the input and the class label, $\mathbf{w}$ and $\mathbf{b}$ are the weight and the bias vectors, and $\sigma$ is a hard sigmoid defined by $\sigma(u)=1$ if $u\geq0$ and $\sigma(u)=-1$ otherwise. To simplify our analysis, we chose a set of 25 data points (i.e. the inputs) placed on a $5 \times 5$ grid spread uniformly in $[-1,1]^{2}$. Furthermore, we discretized the parameter space so that $\mathbf{w}$ and $\mathbf{b}$ each have 25 settings that are uniform in $[-1,1]^{2}$ (that is, on the same grid as the input locations), yielding a total of $25^{2}=625$ possible parameter combinations $[\mathbf{w},\mathbf{b}]\in\Theta$.
The agent's task consisted in choosing a strategy to set those parameters. However, unlike a typical classification task, here the agent could choose a distribution $P(X = [\mathbf{w},\mathbf{b}])$ over $\Theta$ if deemed necessary. Similarly, the environment picked the data labels indirectly by choosing the parameters of an optimal classifier. Specifically, the environment could pick a distribution $P(Z = [\mathbf{w},\mathbf{b}])$ over $\Theta$, which in turn induced (stochastic) labels on the data set.
The utility and the prior distributions were chosen as follows. The utility function $U:\Theta\times\Theta\rightarrow\mathbb{N}$ mapped each classifier-label pair $(x,z)$ into the number of correctly classified data points. The prior distribution of the agent $Q(X)$ was uniform over $\Theta$, reflecting initial ignorance. For the environment we chose a prior with a strong bias toward label assignments that are compatible with $z^\ast \in \Theta$, where $z^{\ast}=(\mathbf{w}^{\ast},\mathbf{b^{\ast}})$, $\mathbf{w}^{\ast}=[-1,-0.5]^{T}$, and $\mathbf{b}^{\ast}=[-0.5,0.5]^{T}$. Specifically, for each label assignment $z \in \Theta$, its prior probability $Q(z)$ was proportional to $U(z,z^{\ast})$, the number of data points that a model based on $z$ would correctly classify when the true labels are given by $z^\ast$ instead. Figure 5 shows the stochastic labels obtained by marginalizing over the prior strategies of the agent and the environment respectively. Notice that although each individual classifier is linear, their mixture is not.
[Figure 5: stochastic labels induced by the prior strategies of the agent and the environment.]
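To make the setup concrete, here is a minimal sketch of how the utility matrix and the priors can be constructed. The reading of the bias vector as an offset point, i.e. the rule $\operatorname{sign}(\mathbf{w}\cdot(\mathbf{x}-\mathbf{b}))$, is our own assumption, as are all variable names.

```python
import itertools
import numpy as np

# 5 x 5 input grid in [-1, 1]^2.
grid = np.linspace(-1.0, 1.0, 5)
points = np.array(list(itertools.product(grid, grid)))          # 25 inputs

# Parameter grid: w and b each take 25 values on the same 5 x 5 grid,
# giving 625 combinations theta = (w, b).
Theta = [(np.array(w), np.array(b))
         for w in itertools.product(grid, grid)
         for b in itertools.product(grid, grid)]

def labels(theta):
    """Hard classification of all 25 points under parameters theta.
    NOTE: we read the bias vector as an offset point, y = sign(w.(x - b));
    this reading is an assumption, not stated explicitly in the paper."""
    w, b = theta
    return np.where((points - b) @ w >= 0, 1, -1)

all_labels = np.array([labels(th) for th in Theta])              # 625 x 25

# U(x, z): number of points on which classifier x agrees with label assignment z.
U = (all_labels[:, None, :] == all_labels[None, :, :]).sum(axis=2)

# Priors: uniform for the agent; for the environment, Q(z) proportional to U(z, z*).
Qx = np.full(len(Theta), 1.0 / len(Theta))
z_star = labels((np.array([-1.0, -0.5]), np.array([-0.5, 0.5])))
Qz = (all_labels == z_star).sum(axis=1).astype(float)
Qz /= Qz.sum()    # entries with zero mass would need special handling in a log-space solver
```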
Results.
Starting from the above prior, we then calculated various friendly and adversarial deviations. First, we improved the agent's strategy by computing its best response with inverse temperature $\alpha=30$ while keeping the environment's strategy fixed ($\beta=0$). As a result, we obtained a crisper (i.e. less stochastic) classifier that reduced the mistakes for the data points that are more distant from the decision boundary. We will use these posterior strategies as the priors for our subsequent tests.
Next we generated two adversarial modifications of the environment ($\beta=-0.1$ and $\beta=-1$) while keeping the agent's strategy fixed, simulating an attack on a pre-trained classifier. The case $\beta=-0.1$ shows that a slightly adversarial environment attempts to increase the classification error by "shifting" the members of the second class (white) towards the agent's decision boundary. However, a very adversarial environment (case $\beta=-1$) will simply flip the labels, nearly maximizing the expected classification error.
If instead we pair a reactive agent ($\alpha=10$) with the previous adversarial environment ($\beta=-1$), we see that the agent significantly improves its performance by slightly randomizing its classifier, thereby thwarting the environment's attempt to fully flip the labels. Finally, we paired a reactive agent ($\alpha=10$) with a friendly environment ($\beta=1$). As a result, both players cooperated by significantly aligning and sharpening their choices, with the agent picking a crisp decision boundary that nearly matched all the labels chosen by the environment.
Discussion and Conclusions
Information Channel
The objective function \eqref{eq:objective} can be tied more closely to the idea of an information channel that allows the agent and the environment to anticipate each other's strategy (Section 2).
Let $D$ be a random variable that encapsulates the totality of the information that informs the decisions of the agent and the environment. Identify the posteriors $P(X)$ and $P(Z)$ with the mixed strategies that the two players adopt after learning about $D$, that is, $P(X) = Q(X|D)$ and $P(Z)
= Q(Z|D)$ respectively. This corresponds to the graphical model:
[Figure: graphical model in which the background information $D$ informs both $X$ and $Z$.]
Assuming $D \sim Q(D)$ and taking the expectation over $D$ of the objective function \eqref{eq:objective}, we obtain $$\mathbb{E}_{D}[J]
= \mathbb{E}_{D,X,Z} \bigl[ U(X, Z) \bigr]
- \frac{1}{\alpha} I(X; D)
- \frac{1}{\beta} I(Z; D),$$ where $I(X; D)$ and $I(Z; D)$ are mutual information terms. These terms quantify the capacity of the information channel between the background information $D$ and the strategies.
With this connection, we can draw two conclusions. First, the baseline strategies $Q(X)$ and $Q(Z)$ are the strategies that result when the two players do not observe the background information, since $$Q(x) = \sum_d Q(d) Q(x|d)
\qquad\text{and}\qquad
Q(z) = \sum_d Q(d) Q(z|d),$$ that is, the players play one of their strategies according to their base rates, effectively averaging over them. Second, the objective function \eqref{eq:objective} controls, via the inverse temperature parameters, the amount of information about $D$ that the agent and the environment use to choose their strategies.
Relation to previous work
This work builds on a number of previous ideas. The characterization of friendly and adversarial behavior from Section 2 is a direct adaptation of the Gaussian case introduced in [@OrtegaKimLee2015] to discrete strategy sets. Therein, the authors presented a model of multi-armed bandits for the special case of Gaussian-distributed rewards in which the bandit can react to the strategy of the agent in a friendly or adversarial way. They furthermore used this model to derive the agent's optimal policy and a Thompson sampling algorithm to infer the environment's inverse temperature parameter from experience. Our work can be thought of as an adaptation of their model to normal-form games.
In turn, the formalization of friendly and adversarial behavior through information-constraints was suggested in [@Broek2010] (in the context of risk-sensitivity) and in [@OrtegaBraun2011; @OrtegaBraun2013] (in sequential decision-making). Information-constraints have also been used in game theory: in particular, Quantal Response Equilibria feature reactive strategies that are bounded rational [@McKelveyPalfrey1995; @WolpertHarreOlbrichBertschingerJost2012], which relates to our case when the inverse temperatures are positive.
Our work was also inspired by existing work in the bandit literature; in particular, by bandit algorithms that achieve near-optimal performance in both stochastic and adversarial bandits [@BubeckSlivkins2012; @SeldinSilvkins2014; @AuerChaoKai2016]. Notice, however, that these algorithms neither include the friendly/cooperative case nor distinguish between different degrees of attitude.
In multiagent learning there is work that considers how to act in different scenarios with varying, discrete degrees of adversarial behavior (see for instance [@Littman2001; @Powers2005; @Greenwald2003; @Crandall2011]). In particular, a Bayesian approach has been tested to learn against a given class of opponents in stochastic games [@Hernandez2017].
Learning the attitude of an environment
In this work we have centered our attention on formalizing friendly and adversarial behavior by characterizing the statistics of such an environment in a one-shot game given three components:

a) the strategy of the agent;

b) the prior strategy of the environment;

c) and an inverse temperature parameter.
This model can be used within a learning algorithm to detect the environment's friendly or adversarial attitude from past interactions akin to [@Littman2001]. However, since this sacrifices the one-shot setup, additional assumptions (e.g. of stationarity) need to be made to accommodate our definition.
For instance, in a Bayesian setup, the model can be used as a likelihood function $P(z|\beta,\theta,\pi)$ of the inverse temperature $\beta$ and the parameters $\theta$ of the environment's prior distribution $P(z|\theta)$ given the parameters $\pi$ of the (private) strategy $P(x|\pi)$ used by the agent. If combined with a suitable prior over $\beta$ and $\theta$, one can e.g. use Thompson sampling [@Thompson1933; @ChappelleLi2011] to implement an adaptive strategy that in round $t+1$ plays the best response $P(x|\pi_{t+1})$ for simulated parameters $\beta'_{t+1}$ and $\theta'_{t+1}$ drawn from the posterior $P(\beta,\theta|\pi_{1:t},z_{1:t})$. This is the method adopted in [@OrtegaKimLee2015].
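A minimal sketch of this idea is given below, assuming a discretized posterior over $\beta$ only (the environment's prior is held fixed) and reusing `solve_equilibrium` from the first sketch; the grid, the true value of $\beta$, and all names are illustrative choices of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_response(Px, Qz, U, beta):
    """Environment's best response: P(z) proportional to Q(z) exp(beta * sum_x P(x) U(x, z))."""
    logits = np.log(Qz) + beta * (U.T @ Px)
    p = np.exp(logits - logits.max())
    return p / p.sum()

U = np.eye(2)                                   # two-armed bandit from the experiments
Qx = np.array([0.5, 0.5]); Qz = np.array([0.4, 0.6])
beta_grid = np.linspace(-2.0, 2.0, 41)          # discretized hypotheses for beta
log_post = np.zeros_like(beta_grid)             # uniform prior over beta
true_beta = -1.0                                # unknown to the agent

for t in range(200):
    # Thompson sampling: draw beta' from the current posterior ...
    post = np.exp(log_post - log_post.max()); post /= post.sum()
    beta_sample = rng.choice(beta_grid, p=post)
    # ... and play the equilibrium strategy against that hypothesis.
    Px, _ = solve_equilibrium(U, alpha=5.0, beta=beta_sample, Qx=Qx, Qz=Qz)
    # The real environment reacts to the agent's (private) strategy and emits z.
    z = rng.choice(2, p=env_response(Px, Qz, U, true_beta))
    # Bayesian update: likelihood of z under each hypothesized beta.
    log_post += np.log([env_response(Px, Qz, U, b)[z] for b in beta_grid])

post = np.exp(log_post - log_post.max()); post /= post.sum()
print("posterior mean of beta:", float(beta_grid @ post))
```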
Alternatively, another way of detecting whether the environment is reactive is by estimating the mutual information $I(\pi;z|x)$ between the agent's strategy parameter $\pi$ and the environment's action $z$ given the player's action. This is because, for a non-reactive environment, the agent's action $x$ forms a Markov blanket for the environment's response $z$ and hence $I(\pi;z|x)=0$; whereas if $I(\pi;z|x)>0$, then it must be that the environment can "spy" on the agent's private policy.
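As a rough illustration of this test, the following sketch computes a plug-in estimate of $I(\pi;z|x)$ from logged interaction tuples, assuming the strategy parameter $\pi$ takes finitely many values; the data-generation stub is a placeholder of our own.

```python
from collections import Counter
import numpy as np

def conditional_mutual_information(samples):
    """Plug-in estimate of I(pi; z | x) from discrete (pi, x, z) tuples."""
    n = len(samples)
    c_x = Counter(x for _, x, _ in samples)
    c_pix = Counter((pi, x) for pi, x, _ in samples)
    c_xz = Counter((x, z) for _, x, z in samples)
    c_pixz = Counter(samples)
    cmi = 0.0
    for (pi, x, z), c in c_pixz.items():
        # p(pi, z | x) / (p(pi | x) * p(z | x)) expressed through counts
        ratio = c * c_x[x] / (c_pix[(pi, x)] * c_xz[(x, z)])
        cmi += (c / n) * np.log(ratio)
    return cmi

# Example: a non-reactive environment (z depends on x only) gives an estimate near 0.
rng = np.random.default_rng(1)
data = []
for _ in range(5000):
    pi = int(rng.integers(2))                    # agent's (private) strategy parameter
    x = int(rng.integers(2)) if pi == 0 else int(rng.random() < 0.8)
    z = (1 - x) if rng.random() < 0.9 else x     # environment reacts to x only
    data.append((pi, x, z))
print(conditional_mutual_information(data))
```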
Final thoughts
We have presented an information-theoretic definition of behavioral attitudes such as friendly, indifferent, and adversarial and shown how to derive optimal strategies for these attitudes. These results can serve as a general conceptual basis for the design of specialized detection mechanisms and more robust strategies.
Two extensions are worth pointing out. The first is the extension of the model to extensive-form games to represent sequential interactions. This will require formulating novel backward induction procedures involving subgame equilibria [@OsborneRubinstein1994], perhaps similar to [@Ling2014]. The second is the analysis of state-of-the-art machine learning techniques such as deep neural networks: e.g. whether randomizing the weights protects from adversarial examples [@SzegedyZarembaSutskeverBrunaErhanGoodfellowFergus2013; @GoodfellowShlensSzegedy2014]; and whether friendly examples exist and can be exploited.
Importantly, we have shown the existence of a continuous range of environments that, if not finely discriminated by the agent, will lead to strictly suboptimal strategies, even in the friendly case.
Acknowledgments
We thank Marc Lanctot, Bernardo Pires, Laurent Orseau, Victoria Krakovna, Jan Leike, Neil Rabinowitz, and David Balduzzi for comments on an earlier manuscript.
Proofs
Proof of Proposition 1.
The best-response functions are obtained by optimizing the Lagrangian $$L:=J-\lambda_{X}\Bigl(\sum_{x}P(x)-1\Bigr)-\lambda_{Z}\Bigl(\sum_{z}P(z)-1\Bigr)$$ where $\lambda_{X}$ and $\lambda_{Z}$ are the Lagrange multipliers for the equality constraints enforcing the normalization of $P(X)$ and $P(Z)$ respectively. For $P(Z)$, we fix $P(X)$ and equate the derivatives to zero:
$$\frac{\partial L}{\partial P(z)}=\sum_{x}P(x)U(x,z)-\frac{1}{\beta}\Bigl(\log\frac{P(z)}{Q(z)}+1\Bigr)+\lambda_{Z}\overset{!}{=}0.$$ Solving for $P(z)$ yields $$P(z)=Q(z)\exp\Bigl\{\beta\sum_{x}P(x)U(x,z)+\beta\lambda_{Z}-1\Bigr\}
\label{eq:unnorm-solution}$$
Since $\sum_{z}P(z)=1$, it must be that the Lagrange multiplier $\lambda_{Z}$ is equal to $$\lambda_{Z}=-\frac{1}{\beta}\log\sum_{z}Q(z)\exp\Bigl\{\beta\sum_{x}P(x)U(x,z)-1\Bigr\},$$ which, when substituted back into \eqref{eq:unnorm-solution}, gives the best-response function of the claim. The argument for $P(X)$ proceeds analogously. $\blacksquare$
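As a quick sanity check (not part of the paper), the closed-form best response can be compared against a brute-force maximization of the agent's objective for a small example; the numbers below are arbitrary.

```python
import numpy as np

# Check: for fixed P(Z), the best response P(x) ∝ Q(x) exp(alpha * sum_z P(z) U(x, z))
# maximizes  sum_x P(x) sum_z P(z) U(x, z) - (1/alpha) KL(P(X) || Q(X)).
alpha = 2.0
U = np.array([[1.0, 0.0], [0.2, 0.9]])
Qx = np.array([0.3, 0.7]); Pz = np.array([0.6, 0.4])

def objective(p0):
    Px = np.array([p0, 1 - p0])
    return Px @ U @ Pz - (Px @ np.log(Px / Qx)) / alpha

grid = np.linspace(1e-6, 1 - 1e-6, 10001)
numerical = grid[np.argmax([objective(p) for p in grid])]

closed_form = Qx * np.exp(alpha * (U @ Pz))
closed_form /= closed_form.sum()
print(numerical, closed_form[0])   # the two values agree to grid precision
```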
Proof of Proposition 2.
The combined best-response function is a continuous map from a compact set into itself. It follows therefore from Brouwer's fixed-point theorem that it has a fixed point. $\blacksquare$
Proof of Proposition 3.
We first multiply \eqref{eq:objective} by $\alpha$ and then see that, for any fixed $P(Z)$, the agent's best response $P(X)$ is also the maximizer (that is, irrespective of the sign of $\alpha$) of the objective function $$\sum_{x}P(x)\Bigl\{\alpha\sum_{z}P(z)U(x,z)-\log\frac{P(x)}{Q(x)}\Bigr\}=\sum_{x}P(x)J_{X}(x)\label{eq:indiff-1}$$ which ignores the terms that do not depend on $P(X)$. $J_{X}$ is a continuous and monotonically decreasing function in $P(x)$. Assume by contradiction that there are actions $x_{1},x_{2}$ such that $J_{X}(x_{1})<J_{X}(x_{2})$. Then one can always improve the objective function by transferring a sufficiently small amount of probability mass from action $x_{1}$ to action $x_{2}$. However, this contradicts the assumption that $P(X)$ is a best response and thus, $J_{X}(x_{1})=J_{X}(x_{2})$. The argument for $J_{Z}$ proceeds analogously. $\blacksquare$
[^1]: This definition is closely related to the Quantal-Response-Equilibrium [@McKelveyPalfrey1995].