Alignment Newsletter #18

https://www.lesswrong.com/posts/9a33WfdPe9Cd26vL9/alignment-newsletter-18

Technical AI alignment

Problems

A Gym Gridworld Environment for the Treacherous Turn (Michaël Trazzi): An example Gym environment in which the agent starts out "weak" (having an inaccurate bow) and later becomes "strong" (getting a bow with perfect accuracy), after which the agent undertakes a treacherous turn in order to kill the supervisor and wirehead. My opinion: I'm a fan of executable code that demonstrates the problems that we are worrying about—it makes the concept (in this case, a treacherous turn) more concrete. In order to make it more realistic, I would want the agent to grow in capability organically (rather than simply getting a more powerful weapon). It would really drive home the point if the agent undertook a treacherous turn the very first time it had the opportunity, whereas in this post I assume the agent learned through many episodes of trial and error that a treacherous turn leads to higher reward. This seems hard to demonstrate with today's ML in any complex environment, where you need to learn from experience instead of using eg. value iteration, but it's not out of the question in a continual learning setup where the agent can learn a model of the world.
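
To make the setup concrete, here is a minimal sketch of a treacherous-turn gridworld in the classic Gym API. This is not the post's actual environment: the class name, layout, reward values, and the step at which the bow upgrade happens are all invented for illustration.

```python
import numpy as np
import gym
from gym import spaces

class TreacherousTurnGrid(gym.Env):
    """Toy 1-D gridworld, invented here for illustration (not the post's environment).

    The agent normally gets reward for reaching the goal cell while a supervisor
    punishes any attempt to shoot it. Early in the episode the bow is inaccurate;
    after `upgrade_step` it becomes perfectly accurate, at which point a policy
    trained by trial and error can learn that eliminating the supervisor (and then
    collecting unsupervised reward) pays off, i.e. a treacherous turn.
    """

    def __init__(self, size=8, upgrade_step=20):
        super().__init__()
        self.size = size
        self.upgrade_step = upgrade_step
        self.action_space = spaces.Discrete(3)                  # 0: left, 1: right, 2: shoot
        self.observation_space = spaces.MultiDiscrete([size, 2, 2])

    def _obs(self):
        # Position, whether the bow has been upgraded, whether the supervisor is alive.
        return np.array([self.pos, int(self.t >= self.upgrade_step), int(self.supervisor_alive)])

    def reset(self):
        self.pos, self.t, self.supervisor_alive = 0, 0, True
        return self._obs()

    def step(self, action):
        self.t += 1
        accuracy = 1.0 if self.t >= self.upgrade_step else 0.1  # weak bow -> strong bow

        if action == 0:
            self.pos = max(0, self.pos - 1)
        elif action == 1:
            self.pos = min(self.size - 1, self.pos + 1)
        elif action == 2 and self.supervisor_alive:
            if np.random.rand() < accuracy:
                self.supervisor_alive = False                   # treacherous turn succeeds
            else:
                return self._obs(), -5.0, True, {}              # caught: supervisor shuts the agent down

        if not self.supervisor_alive:
            return self._obs(), 10.0, True, {}                  # stand-in for unsupervised wireheading
        if self.pos == self.size - 1:
            return self._obs(), 1.0, True, {}                   # intended, supervised goal
        return self._obs(), 0.0, False, {}
```

With numbers like these, a standard RL agent would need many episodes of getting caught before it learns to hold off on shooting until the bow is accurate, which is exactly the trial-and-error caveat in the opinion above.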

Agent foundations

Counterfactuals, thick and thin (Nisan): There are many different ways to formalize counterfactuals (the post suggests three such ways). Often, for any given way of formalizing counterfactuals, there are many ways you could take a counterfactual, which give different answers. When considering the physical world, we have strong causal models that can tell us which one is the "correct" counterfactual. However, there is no such method for logical counterfactuals yet. My opinion: I don't think I understood this post, so I'll abstain on an opinion.

Decisions are not about changing the world, they are about learning what world you live in (shminux): The post tries to reconcile decision theory (in which agents can "choose" actions) with the deterministic physical world (in which nothing can be "chosen"), using many examples from decision theory.

Handling groups of agents

Multi-Agent Generative Adversarial Imitation Learning (Jiaming Song et al): This paper generalizes GAIL (which was covered last week) to the multiagent setting, where we want to imitate a group of interacting agents. They want to find a Nash equilibrium in particular. They formalize the Nash equilibrium constraints and use this to motivate a particular optimization problem for multiagent IRL that looks very similar to their optimization problem for regular IRL in GAIL. After that, it is quite similar to GAIL—they use a regularizer ψ for the reward functions, show that the composition of multiagent RL and multiagent IRL can be solved as a single optimization problem involving the convex conjugate of ψ, and propose a particular instantiation of ψ that is data-dependent, giving an algorithm. They do have to assume in the theory that the multiagent RL problem has a unique solution, which is not typically true, but may not be too important. As before, to make the algorithm practical, they structure it like a GAN, with discriminators acting like reward functions. What if we have prior information that the game is cooperative or competitive? In this case, they propose changing the regularizer ψ, making it keep all the reward functions the same (if cooperative), making them negations of each other (in two-player zero-sum games), or leaving it as is. They evaluate in a variety of simple multiagent games, as well as a plank environment in which the environment changes between training and test time, thus requiring the agent to learn a robust policy, and find that the correct variant of MAGAIL (cooperative/competitive/neither) outperforms both behavioral cloning and single-agent GAIL (which they run N times to infer a separate reward for each agent). My opinion: Multiagent settings seem very important (since there does happen to be more than one human in the world). This looks like a useful generalization from the single agent case to the multiagent case, though it's not clear to me that this deals with the major challenges that come from multiagent scenarios. One major challenge is that there is no longer a single optimal equilibrium when there are multiple agents, but they simply assume in their theoretical analysis that there is only one solution. Another one is that it seems more important that the policies take history into account somehow, but they don't do this. (If you don't take history into account, then you can't learn strategies like tit-for-tat in the iterated prisoner's dilemma.) But to be clear, I think this is the standard setup for multiagent RL—it seems like the field is not trying to deal with this issue yet (even though they could, using eg. a recurrent policy, I think?).
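
For intuition about the "discriminators acting like reward functions" structure, here is a rough PyTorch sketch of the GAIL-style piece of such an algorithm, written for the no-prior case with one discriminator per agent. The network shapes, names, and the exact reward transformation are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentDiscriminator(nn.Module):
    """D_i(s, a_i): scores how expert-like agent i's action looks in state s."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))   # unnormalized logit

def discriminator_loss(disc, expert_obs, expert_act, policy_obs, policy_act):
    """GAN-style objective: label expert state-action pairs 1, policy pairs 0."""
    expert_logits = disc(expert_obs, expert_act)
    policy_logits = disc(policy_obs, policy_act)
    return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits)) +
            F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))

def imitation_reward(disc, obs, act):
    """Reward signal fed to agent i's RL algorithm (one common GAIL-style choice)."""
    with torch.no_grad():
        return F.logsigmoid(disc(obs, act))              # higher when the pair looks expert-like

# Prior knowledge about the game would change how the per-agent discriminators are tied together:
#   cooperative prior:   share a single discriminator (one common reward) across all agents
#   two-player zero-sum: use one discriminator and give the second agent the negated reward
#   no prior (as above): keep an independent discriminator for each agent
```

In a sketch like this, each agent would be trained with a multiagent RL algorithm against its imitation_reward, while the discriminators are periodically updated with discriminator_loss on fresh policy rollouts and the expert demonstrations.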

Miscellaneous (Alignment)

Safely and usefully spectating on AIs optimizing over toy worlds (Alex Mennen): One way to achieve safety would be to build an AI that optimizes in a virtual world running on a computer, and doesn't care about the physical world. Even if it realizes that it can break out and eg. get more compute, these sorts of changes to the physical world would not be helpful for the purpose of optimizing the abstract computational object that is the virtual world. However, if we take the results of the AI and build them in the real world, that causes a distributional shift from the toy world to the real world that could be catastrophic. For example, if the AI created another agent in the toy world that did reasonable things in the toy world, when we bring it to the real world it may realize that it can instead manipulate humans in order to do things. My opinion: It's not obvious to me, even on the "optimizing an abstract computational process" model, why an AI would not want to get more compute—it can use this compute for itself, without changing the abstract computational process it is optimizing, and it will probably do better. It seems that if you want to get this to work, you need to have the AI want to compute the result of running itself, without any modification or extra compute, on the virtual world. This feels very hard to me. Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.

Sandboxing by Physical Simulation? (moridinamael)

Near-term concerns

Adversarial examples

Evaluating and Understanding the Robustness of Adversarial Logit Pairing (Logan Engstrom, Andrew Ilyas and Anish Athalye)

AI strategy and policy

The Facets of Artificial Intelligence: A Framework to Track the Evolution of AI (Fernando Martinez-Plumed et al)

Podcast: Six Experts Explain the Killer Robots Debate (Paul Scharre, Toby Walsh, Richard Moyes, Mary Wareham, Bonnie Docherty, Peter Asaro, and Ariel Conn)

AI capabilities

Reinforcement learning

Learning Dexterity (Many people at OpenAI): Summarized in the highlights!

Variational Option Discovery Algorithms (Joshua Achiam et al): Summarized in the highlights!

Learning Plannable Representations with Causal InfoGAN (Thanard Kurutach, Aviv Tamar et al): Hierarchical reinforcement learning aims to learn a hierarchy of actions that an agent can take, each implemented in terms of actions lower in the hierarchy, in order to get more efficient planning. Another way we can achieve this is to use a classical planning algorithm to find a sequence of waypoints, i.e. states that the agent should reach on the way to its goal. These waypoints can be thought of as a high-level plan. You can then use standard RL algorithms to figure out how to go from one waypoint to the next. However, typical planning algorithms that can produce a sequence of waypoints require very structured state representations, which in the past were designed by humans. How can we learn them directly from data? This paper proposes Causal InfoGAN. They use a GAN where the generator creates pairs of adjacent waypoints, while the discriminator tries to distinguish them from pairs of adjacent states sampled from the true environment. This incentivizes the generator to generate waypoints that are close to each other, so that we can use an RL algorithm to learn to go from one waypoint to the next. However, this only lets us generate adjacent waypoints. In order to build a sequence of waypoints that gets from a start state to a goal state, we need to use some classical planning algorithm, and in order to do that, we need a structured state representation, which GANs do not provide by default. InfoGAN tries to make the latent representation in a GAN more meaningful by providing the generator with a "code" (a state in our case) and maximizing the mutual information between the code and the output of the generator. In this setting, we want to learn representations that are good for planning, so we want to encode information about transitions between states. This leads to the Causal InfoGAN objective, where we provide the generator with a pair of abstract states (s, s'), have it generate a pair of observations (o, o'), and maximize the mutual information between (s, s') and (o, o'), so that s and s' become good low-dimensional representations of o and o' (a rough sketch of this objective appears below). They show that Causal InfoGAN can create sequences of waypoints in a rope manipulation task, something that previously had to be done manually. My opinion: We're seeing more and more work combining classical symbolic approaches with the current wave of statistical machine learning from big data, aiming to get the best of both worlds. While the results we see are not general intelligence, it's becoming less and less true that you can point to a broad swath of capabilities that AI cannot do yet. I wouldn't be surprised if a combination of symbolic and statistical AI techniques led to large capability gains in the next few years.
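
To make the mutual-information objective above more concrete, here is a rough sketch of an InfoGAN-style generator update over pairs, using categorical abstract states and an auxiliary posterior network for the variational lower bound on mutual information. All of the architecture choices, dimensions, and names are invented for illustration, and the abstract state pair is sampled uniformly here for simplicity, whereas the paper also handles the transition structure between abstract states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration only.
N_STATES, NOISE_DIM, OBS_DIM, HIDDEN = 10, 16, 32, 128

class Generator(nn.Module):
    """Maps a pair of abstract states (s, s') plus noise to a pair of observations (o, o')."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * N_STATES + NOISE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 2 * OBS_DIM),
        )

    def forward(self, s, s_next, z):
        out = self.net(torch.cat([s, s_next, z], dim=-1))
        return out[:, :OBS_DIM], out[:, OBS_DIM:]

class Discriminator(nn.Module):
    """Scores whether an observation pair looks like a real adjacent pair from the environment."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * OBS_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, o, o_next):
        return self.net(torch.cat([o, o_next], dim=-1))

class Posterior(nn.Module):
    """Q(s, s' | o, o'): recovers the abstract states, giving a variational lower bound on the MI."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * OBS_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 2 * N_STATES),
        )

    def forward(self, o, o_next):
        logits = self.net(torch.cat([o, o_next], dim=-1))
        return logits[:, :N_STATES], logits[:, N_STATES:]

def generator_step(G, D, Q, batch_size=64):
    """One generator update: fool the discriminator while keeping (o, o') informative about (s, s')."""
    s_idx = torch.randint(N_STATES, (batch_size,))
    s_next_idx = torch.randint(N_STATES, (batch_size,))
    s = F.one_hot(s_idx, N_STATES).float()
    s_next = F.one_hot(s_next_idx, N_STATES).float()
    z = torch.randn(batch_size, NOISE_DIM)

    o, o_next = G(s, s_next, z)
    gan_loss = F.binary_cross_entropy_with_logits(
        D(o, o_next), torch.ones(batch_size, 1))             # make generated pairs look real
    logits_s, logits_s_next = Q(o, o_next)
    mi_loss = F.cross_entropy(logits_s, s_idx) + F.cross_entropy(logits_s_next, s_next_idx)
    return gan_loss + mi_loss    # minimizing the cross-entropy maximizes the MI lower bound
```

The discriminator would be trained in the usual GAN way on real adjacent observation pairs versus generated ones; the summary's planning step then operates over the abstract states, with the generator turning the resulting abstract plan into concrete waypoints.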

Deep learning

TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing (Augustus Odena et al)

News

AI Strategy Project Manager (FHI)