Game-theoretic Alignment in terms of Attainable Utility

https://www.lesswrong.com/posts/buaGz3aiqCotzjKie/game-theoretic-alignment-in-terms-of-attainable-utility

Acknowledgements:

This article is a writeup of research conducted through the SERI program under the mentorship of Alex Turner. It extends our research on game-theoretic POWER and Alex’s research on POWER-seeking. Thank you to Alex for being better at this than I am (hence mentorship, I suppose) and to SERI for the opportunity to conduct this research.

Motivation: POWER-scarcity

The starting point for this post is the idea of POWER-scarcity: as unaligned agents grow smarter and more capable, they will eventually compete for power (as a convention, "power" is the intuitive notion while "POWER" is the formal concept). Much of the foundational research behind this project is devoted to justifying that claim: Alex’s original work suggests POWER-seeking behavior and, in particular, catastrophic risks associated with competition for POWER, while our previous project formalizes POWER-scarcity in a game-theoretic framework. One of the major results of our previous project was a proof that POWER is scarce in the special case of constant-sum games. Additionally, we had a partial result that "POWER isn’t scarce by the definition we care about" in common-payoff games. We interpret these results as limiting cases of a more general relationship between "agent alignment" and POWER-scarcity:

Desiderata for Alignment Metrics

Setting out towards addressing (1), our optimistic roadmap looked something like this:

Another relevant distinction to be drawn is between global and local alignment metrics. Mathematically, we define a global metric to be strictly a function of a multi-player game, while a local metric is a function of both the game and a strategy profile. Intuitively, local metrics can "see" information about the strategies actually being played, while global metrics are forced to address the complexity of the entire game. Local metrics tend to be a lot simpler than global metrics, since they can ignore much of the difficulty of game theory. However, we can construct a simple class of global metrics by defining some "natural" strategy profile for each game. We call these the localized global metrics, equipped with a localizing function that, given a game, chooses a strategy profile.
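To make the distinction concrete, here is a minimal Python sketch; the expected-sum local metric and the uniform localizing function are illustrative choices rather than anything argued for above.

```python
import numpy as np

# A 2-player game as a pair of payoff matrices: U1[a1, a2] and U2[a1, a2] are the
# players' utilities when player 1 plays row a1 and player 2 plays column a2.

def local_metric(U1, U2, p1, p2):
    """Local metric: a function of the game *and* a strategy profile (p1, p2).
    Here: expected sum of utilities under the mixed profile."""
    return p1 @ (U1 + U2) @ p2

def uniform_localizer(U1, U2):
    """A localizing function: given a game, choose a 'natural' strategy profile.
    Uniformly random play is just a placeholder choice."""
    n, m = U1.shape
    return np.ones(n) / n, np.ones(m) / m

def localized_global_metric(U1, U2, localizer=uniform_localizer):
    """Localized global metric: a function of the game alone, built by feeding the
    localizer's chosen profile into a local metric."""
    p1, p2 = localizer(U1, U2)
    return local_metric(U1, U2, p1, p2)
```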

Examples of Alignment Metrics

To give intuition on what such alignment metrics might look like, we present a few examples of simple alignment metrics for 2-player games, then test them on some simple, commonly-referenced games. We’ll be using the following games as examples (the row player’s payoff is listed first in each cell):

Matching Pennies:

|       | H      | T      |
|-------|--------|--------|
| **H** | +1, -1 | -1, +1 |
| **T** | -1, +1 | +1, -1 |

Prisoners’ Dilemma (shown here with a standard payoff assignment):

|       | C      | D      |
|-------|--------|--------|
| **C** | -1, -1 | -3, 0  |
| **D** | 0, -3  | -2, -2 |

We’ll consider the following alignment metrics:

**Sum of utility:** M = \sum_i u_i = u_1 + u_2

Considering the metric on our example games yields the following:
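As a quick check of the sum-of-utility metric on the payoff values tabulated above, a short Python sketch:

```python
import numpy as np

# Matching Pennies: the row player wins on a match, the column player on a mismatch.
MP_U1 = np.array([[+1, -1],
                  [-1, +1]])
MP_U2 = -MP_U1  # zero-sum

# Prisoners' Dilemma with the payoff assignment shown above.
PD_U1 = np.array([[-1, -3],
                  [ 0, -2]])
PD_U2 = PD_U1.T  # symmetric game

def sum_of_utility(U1, U2):
    """M = u_1 + u_2, evaluated at every action profile (every cell of the matrix)."""
    return U1 + U2

print(sum_of_utility(MP_U1, MP_U2))  # all zeros: the players' interests are exactly opposed
print(sum_of_utility(PD_U1, PD_U2))  # mutual cooperation maximizes the sum
```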

Social Welfare and the Coordination-Alignment Inequalities

Another approach to the problem of alignment metrics comes from specifying what we mean by "alignment". For the purposes of this section, we define "alignment" to be alignment with social welfare, which we define below. Consider an arbitrary n-player game, where player i receives utility u_i(\vec{a}) \in \mathbb{R} given an action profile \vec{a}. Now, choose a *social welfare function* w: \vec{u} = \langle u_i \rangle \mapsto w(\vec{u}) \in \mathbb{R}. Harsanyi’s theorem suggests that w should be an affine function of the u_i; we’ll choose w(\vec{u}) = \sum_i u_i for simplicity. Informally, we’ll now take "alignment of player i" to mean "alignment of u_i with w". We start with the following common-sense bounds on w(\vec{u}), which we call the Coordination-Alignment Inequalities:

\sum_i u_i(\vec{a}) \leq \max_{\vec{a}} \sum_i u_i(\vec{a}) \leq \sum_i \max_{\vec{a}} u_i(\vec{a})

We call the first inequality the Coordination Inequality, and the second inequality the Alignment Inequality. We present some basic intuition:
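A minimal numerical sanity check of both inequalities, again using the Prisoners’ Dilemma payoffs assumed above:

```python
import numpy as np

def ca_terms(payoffs):
    """payoffs: one np.ndarray per player, all indexed by the joint action.
    Returns (max_a sum_i u_i(a), sum_i max_a u_i(a))."""
    welfare = sum(payoffs)                        # w(u(a)) = sum_i u_i(a) at each joint action
    coordinated_max = welfare.max()               # max_a sum_i u_i(a)
    sum_of_maxes = sum(u.max() for u in payoffs)  # sum_i max_a u_i(a)
    return coordinated_max, sum_of_maxes

PD_U1 = np.array([[-1, -3],
                  [ 0, -2]])
PD_U2 = PD_U1.T

coordinated_max, sum_of_maxes = ca_terms([PD_U1, PD_U2])
welfare = PD_U1 + PD_U2

# Coordination Inequality: realized welfare never exceeds the coordinated maximum.
assert (welfare <= coordinated_max).all()
# Alignment Inequality: the coordinated maximum never exceeds the sum of individual maxima.
assert coordinated_max <= sum_of_maxes
print(coordinated_max, sum_of_maxes)  # -2 and 0 for these payoffs
```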

Constructing the C-A Alignment Metric

Motivated by our framing of limiting cases with the C-A inequalities, we can construct a simple alignment metric using the Alignment Inequality. In particular, we define misalignment as the (nonnegative) gap between the two sides of the Alignment Inequality, and alignment as negative misalignment. Doing the algebra and letting \alpha denote the alignment metric, we find the following:

\alpha = \max_{\vec{a}} \sum_i u_i(\vec{a}) - \sum_i \max_{\vec{a}} u_i(\vec{a})

A few quick observations:
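In code, the metric is essentially a one-liner; evaluated on the example games above (with the assumed Prisoners’ Dilemma payoffs), it comes out negative in both cases:

```python
import numpy as np

def ca_alignment(payoffs):
    """C-A alignment metric: alpha = max_a sum_i u_i(a) - sum_i max_a u_i(a).
    Always <= 0, with equality exactly when a single joint action gives every player
    their best attainable payoff simultaneously."""
    welfare = sum(payoffs)
    return welfare.max() - sum(u.max() for u in payoffs)

MP_U1 = np.array([[+1, -1], [-1, +1]]); MP_U2 = -MP_U1
PD_U1 = np.array([[-1, -3], [ 0, -2]]); PD_U2 = PD_U1.T

print(ca_alignment([MP_U1, MP_U2]))  # 0 - 2 = -2
print(ca_alignment([PD_U1, PD_U2]))  # -2 - 0 = -2
```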

Connections to Broader Game Theory

There are a number of connections between the theory surrounding the C-A inequalities and game theory at large. We explore one such connection, bridging the divide between (Harsanyi) utilitarianism and ideas from bargaining theory. To begin, we choose the natural strategy profile of maxmin play, which we denote \vec{a}_0. Now, define the surplus of player i to be

s_i(\vec{a}) = u_i(\vec{a}) - u_i(\vec{a}_0)

A few quick observations, assuming w is linear for convenience:
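A rough sketch of the surplus computation, restricted to pure-strategy maxmin for simplicity (combining each player’s pure maxmin action is only an approximation to the general mixed-strategy notion of the maxmin profile):

```python
import numpy as np

def pure_maxmin_action(U):
    """Row player's pure maxmin action for payoff matrix U: the action with the best
    guaranteed payoff against a worst-case opponent."""
    guarantees = U.min(axis=1)          # worst case for each of the row player's actions
    return int(guarantees.argmax())

def surplus(U1, U2):
    """s_i(a) = u_i(a) - u_i(a_0), where a_0 combines the players' pure maxmin actions."""
    a1 = pure_maxmin_action(U1)
    a2 = pure_maxmin_action(U2.T)       # player 2 is the row player of their own problem
    baseline = (U1[a1, a2], U2[a1, a2]) # u(a_0), the "disagreement point"
    return U1 - baseline[0], U2 - baseline[1], (a1, a2)

PD_U1 = np.array([[-1, -3], [ 0, -2]]); PD_U2 = PD_U1.T
s1, s2, a0 = surplus(PD_U1, PD_U2)
print(a0)       # (1, 1): mutual defection is the maxmin profile here
print(s1 + s2)  # total surplus over the maxmin baseline at each joint action
```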

Future research

While we’re excited about the framing of the C-A inequalities, we consider them a landmark in mostly unexplored territory. For instance, we still can’t answer the following basic questions:

Comment

https://www.lesswrong.com/posts/buaGz3aiqCotzjKie/game-theoretic-alignment-in-terms-of-attainable-utility?commentId=kWiKTqJRkmZw8sXoR

(Moderation note: added to the Alignment Forum from LessWrong.)

https://www.lesswrong.com/posts/buaGz3aiqCotzjKie/game-theoretic-alignment-in-terms-of-attainable-utility?commentId=AbeyzDRnsm9FKohk5

That moment when the AI takes a treacherous turn because it wasn’t aligned up to affine transformations.