Progress on Causal Influence Diagrams

https://www.lesswrong.com/posts/Cd7Hw492RqooYgQAS/progress-on-causal-influence-diagrams

What are causal influence diagrams?

A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we’re developing a formal theory of incentives based on causal influence diagrams (CIDs).

Here is an example of a CID for a one-step Markov decision process (MDP). The random variable S1 represents the state at time 1, A1 represents the agent’s action, S2 the state at time 2, and R2 the agent’s reward. The action A1 is modeled with a decision node (square) and the reward R2 is modeled as a utility node (diamond), while the states are normal chance nodes (rounded edges). Causal links specify that S1 and A1 influence S2, and that S2 determines R2. The information link S1 → A1 specifies that the agent knows the initial state S1 when choosing its action A1.

In general, random variables can be chosen to represent agent decision points, objectives, and other relevant aspects of the environment. In short, a CID specifies:
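To make the graphical structure concrete, here is a minimal sketch that encodes the one-step MDP CID above as a directed graph with networkx. The `node_type` and `link_type` attributes are ad-hoc annotations for this sketch only, not part of any CID library (the PyCID library discussed in the Software section below provides proper CID objects).

```python
import networkx as nx

# Build the one-step MDP CID as a DAG.
cid = nx.DiGraph()
cid.add_nodes_from(["S1", "S2"], node_type="chance")  # states (chance nodes)
cid.add_node("A1", node_type="decision")              # action (decision node)
cid.add_node("R2", node_type="utility")               # reward (utility node)

# Causal links: S1 and A1 influence S2, and S2 determines R2.
cid.add_edges_from([("S1", "S2"), ("A1", "S2"), ("S2", "R2")], link_type="causal")

# Information link: the agent observes S1 when choosing A1.
cid.add_edge("S1", "A1", link_type="information")

# A valid CID must be acyclic.
assert nx.is_directed_acyclic_graph(cid)
```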

Incentive Concepts

Having a unified language for objectives and training setups enables us to develop generally applicable concepts and results. We define four such concepts in Agent Incentives: A Causal Perspective (AAAI-21):

User Interventions and Interruption

Let us next turn to some recent applications of these concepts. In How RL Agents Behave when their Actions are Modified (AAAI-21), we study how different RL algorithms react to user interventions such as interruptions and overridden actions. For example, Saunders et al. developed a method for safe exploration where a user overrides dangerous actions. Alternatively, agents might get interrupted if analysis of their "thoughts" (or internal activations) suggests they are planning something dangerous. How do such interventions affect the incentives of various RL algorithms?

First, we formalize action modification by extending MDPs with a parameter PA that describes how actions are modified. We then model such modified-action MDPs with a CID. Here we model the agent’s policy Π as the decision rather than the actions Ai, since the latter are not under the full control of the agent but can also be influenced by the action modification PA (as represented by the arrows PA → Ai and Π → Ai). The agent might know the interruption scheme PA from interruptions during training, so we include an information link PA → Π. We analyze different prototypical RL algorithms in terms of the causal assumptions they make about the environment:
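For readers who find an explicit edge list helpful, here is a rough networkx sketch of the modified-action MDP CID described above, for a two-step horizon. The node names and the exact reward and transition edges are assumptions made for illustration and may differ from the diagram in the paper.

```python
import networkx as nx

mod_mdp_cid = nx.DiGraph()

# The decision is the policy itself, not the individual actions.
mod_mdp_cid.add_node("Pi", node_type="decision")
# PA: the action-modification parameter (e.g. the interruption scheme).
mod_mdp_cid.add_node("PA", node_type="chance")
# States and (modifiable) actions for two time steps; rewards are utilities.
mod_mdp_cid.add_nodes_from(["S1", "A1", "S2", "A2"], node_type="chance")
mod_mdp_cid.add_nodes_from(["R1", "R2"], node_type="utility")

mod_mdp_cid.add_edges_from(
    [
        ("Pi", "A1"), ("PA", "A1"), ("S1", "A1"),  # each action depends on the policy,
        ("Pi", "A2"), ("PA", "A2"), ("S2", "A2"),  # the modification, and the state
        ("S1", "S2"), ("A1", "S2"),                # transition dynamics
        ("S1", "R1"), ("S2", "R2"),                # rewards determined by states
        ("PA", "Pi"),                              # information link PA -> Pi
    ]
)

assert nx.is_directed_acyclic_graph(mod_mdp_cid)
```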

Reward Tampering

Another AI safety problem that we have studied with CIDs is reward tampering. Reward tampering can take several different forms, including the agent:

Multi-Agent CIDs

Many interesting incentive problems arise when multiple agents interact, each trying to optimize their own reward while simultaneously influencing each other’s payoff. In Equilibrium Refinements in Multi-Agent Influence Diagrams (AAMAS-21), we begin to lay some foundations for understanding multi-agent situations with multi-agent CIDs (MACIDs). First, we relate MACIDs to extensive-form games (EFGs), currently the most popular graphical representation of games. While EFGs sometimes offer more natural representations of games, they have some significant drawbacks compared to MACIDs. In particular, EFGs can be exponentially larger, don’t represent conditional independencies, and lack random variables to apply incentive analysis to.

As an example, consider a game where a store (Agent 1) decides (D1) whether to charge full (F) or half (H) price for a product depending on their current stock levels (X), and a customer (Agent 2) decides (D2) whether to buy it (B) or pass (P) depending on the price and how much they want it (Y). The store tries to maximize their profit U1, which is greater if the customer buys at a high price. If they are overstocked and the customer doesn’t buy, then they have to pay extra rent. The customer is always happy to buy at half price, and sometimes at full price (depending on how much they want the product). The EFG representation of this game is quite large, and uses information sets (represented with dotted arcs) to represent the facts that the store doesn’t know how much the customer wants the product, and that the customer doesn’t know the store’s current stock levels.

In contrast, the MACID representation is significantly smaller and clearer. Rather than relying on information sets, the MACID uses information links (dotted edges) to represent the limited information available to each player. Another aspect made clearer by the MACID is that, for any fixed customer decision, the store’s payoff is independent of how much the customer wanted the product (there’s no edge Y → U1). Similarly, for any fixed product price, the customer’s payoff is independent of the store’s stock levels (no edge X → U2). In the EFG, these independencies could only be inferred by looking carefully at the payoffs.

One benefit of MACIDs explicitly representing these conditional independencies is that more parts of the game can be identified as independently solvable. For example, in the MACID, the customer’s decision can be identified as an independently solvable component; we call such components MACID subgames. Solving this subgame for any value of D1 reveals that the customer always buys when they really want the product, regardless of whether there is a discount. This knowledge makes it simpler to next compute the optimal strategy for the store. In contrast, in the EFG the information sets prevent any proper subgames from being identified. Therefore, solving games using a MACID representation is often faster than using an EFG representation.

Finally, we relate various forms of equilibrium concepts between MACIDs and EFGs. The most famous type of equilibrium is the Nash equilibrium, which occurs when no player can unilaterally improve their payoff. An important refinement of the Nash equilibrium is the subgame perfect equilibrium, which rules out non-credible threats by requiring that a Nash equilibrium is played in every subgame. An example of a non-credible threat in the store-customer game would be the customer "threatening" the store that they will only buy at a discount. The threat is non-credible, since the best move for the customer is to buy the product even at full price if they really want it. Interestingly, only the MACID version of subgame perfectness is able to rule such threats out, because only in the MACID is the customer’s choice recognized as a proper subgame.

Ultimately, we aim to use MACIDs to analyze incentives in multi-agent settings. With the above observations, we have put ourselves in a position to develop a theory of multi-agent incentives that is properly connected to the broader game theory literature.
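To make the subgame argument concrete, here is a toy numeric instantiation of the customer’s decision problem. The payoff numbers are invented for illustration (the post describes the payoffs only qualitatively), and the backward-induction step is written in plain Python rather than with any particular game-theory library.

```python
# Customer payoffs U2, indexed by (price, wants_product, decision).
# Numbers are made up, respecting only the qualitative description above:
# always happy to buy at half price, happy at full price only if they
# really want the product, and payoff 0 for passing.
CUSTOMER_PAYOFF = {
    ("F", True, "B"): 1,   ("F", True, "P"): 0,
    ("F", False, "B"): -1, ("F", False, "P"): 0,
    ("H", True, "B"): 2,   ("H", True, "P"): 0,
    ("H", False, "B"): 1,  ("H", False, "P"): 0,
}

# Solve the customer subgame by backward induction: for every fixed price D1
# and desire Y, pick the decision maximising U2. Note that X (the stock level)
# never appears -- exactly the conditional independence the MACID makes explicit.
best_response = {
    (price, wants): max(["B", "P"], key=lambda d: CUSTOMER_PAYOFF[(price, wants, d)])
    for price in ["F", "H"]
    for wants in [True, False]
}

print(best_response)
# {('F', True): 'B', ('F', False): 'P', ('H', True): 'B', ('H', False): 'B'}
# The customer always buys when they really want the product, with or without
# a discount, so "I'll only buy at a discount" is a non-credible threat.
```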

Software

To help us with our research on CIDs and incentives, we’ve developed a Python library called PyCID, which offers:
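As a rough indication of what this looks like in practice, here is a minimal sketch of constructing the one-step MDP CID from the first section with PyCID. The constructor signature assumed here (an edge list plus `decisions` and `utilities` keyword arguments) should be checked against the library’s documentation, which may have changed.

```python
import pycid

# The one-step MDP CID: S1 -> A1 is an information link (an edge into a
# decision node), S1 and A1 feed into S2, and S2 determines the reward R2.
cid = pycid.CID(
    [("S1", "A1"), ("S1", "S2"), ("A1", "S2"), ("S2", "R2")],
    decisions=["A1"],
    utilities=["R2"],
)

cid.draw()  # decision, utility and chance nodes are drawn with distinct shapes
```

From such an object one can then attach conditional probability distributions and query incentive concepts like those discussed above; see the PyCID documentation and example notebooks for the current API.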

Looking ahead

Ultimately, we hope to contribute to a more careful understanding of how design, training, and interaction shape an agent’s behavior. We hope that a precise and broadly applicable language based on CIDs will enable clearer reasoning and communication on these issues, and facilitate a cumulative understanding of how to think about and design powerful AI systems. From this perspective, we find it encouraging that several other research groups have adopted CIDs to:

List of recent papers:

Comment

https://www.lesswrong.com/posts/Cd7Hw492RqooYgQAS/progress-on-causal-influence-diagrams?commentId=T8fdJtrh5bkHifGgC

Pretty interesting. Since you are interested in policies that operate along some paths only, you might find these of interest:

https://pubmed.ncbi.nlm.nih.gov/31565035/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6330047/

We have some recent stuff on generalizing MDPs to have a causal model inside every state (‘path dependent structural equation models’, to appear in UAI this year).

Comment

https://www.lesswrong.com/posts/Cd7Hw492RqooYgQAS/progress-on-causal-influence-diagrams?commentId=ftyfpvJncu49thefq

Thanks Ilya for those links! In particular, the second one looks quite relevant to something we’ve been working on in a rather different context (that’s the benefit of speaking the same language!). We would also be curious to see a draft of the MDP generalization once you have something ready to share!

Comment

https://auai.org/uai2021/pdf/uai2021.89.preliminary.pdf (this really is preliminary, e.g. they have not yet uploaded a newer version that incorporates peer review suggestions).

Can’t do stuff in the second paper without worrying about stuff in the first (unless your model is very simple).

Comment

https://www.lesswrong.com/posts/Cd7Hw492RqooYgQAS/progress-on-causal-influence-diagrams?commentId=BFz4NGSmDfPPnCQCf

Planned summary for the Alignment Newsletter:

Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the wrong incentives. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are <@graphical models@>(@Understanding Agent Incentives with Causal Influence Diagrams@) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:

  1. We can analyze what happens when you intervene on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.
  2. We can <@avoid reward tampering@>(@Designing agent incentives to avoid reward tampering@) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its current reward function.
  3. A multiagent version allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.

Comment

https://www.lesswrong.com/posts/Cd7Hw492RqooYgQAS/progress-on-causal-influence-diagrams?commentId=YZbKZjTiAJp3cgjtS

IIUC, in a multi-agent influence model, every subgame perfect equilibrium is also a subgame perfect equilibrium in the corresponding extensive form game, but the converse is false in general. Do you know whether at least one subgame perfect equilibrium exists for any MAIM? I couldn’t find it in the paper.

Comment

https://www.lesswrong.com/posts/Cd7Hw492RqooYgQAS/progress-on-causal-influence-diagrams?commentId=ZDDTMts87xAMyEpQM

Hi Vanessa, thanks for your question! Sorry for taking a while to reply.

The answer is yes if we allow for mixed policies (i.e., where an agent can correlate all of their decision rules for different decisions with a shared random bit), but no if we restrict agents to behavioural policies (i.e., where the decision rules for each of an agent’s decisions are independent because they can’t access a shared random bit). This is analogous to the difference between mixed and behavioural strategies in extensive form games, where (in general) a subgame perfect equilibrium (SPE) is only guaranteed to exist in mixed strategies (for finite games, by Nash’s theorem). Note that if all agents in the MAIM have perfect recall (they remember their previous decisions and the information that they knew at previous decisions), then an SPE in behavioural policies is guaranteed to exist. In fact, Koller and Milch showed that only a weaker criterion of "sufficient recall" is needed (https://www.semanticscholar.org/paper/Ignorable-Information-in-Multi-Agent-Scenarios-Milch-Koller/5ea036bad72176389cf23545a881636deadc4946).

In a forthcoming journal paper, we expand significantly on the theoretical underpinnings and advantages of MAIMs, and we will provide more results there.