Coherence arguments do not entail goal-directed behavior

https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior

All behavior can be rationalized as EU maximization

Suppose we have access to the entire policy of an agent, that is, given any universe-history, we know what action the agent will take. Can we tell whether the agent is an EU maximizer? Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, and 0 otherwise. Here I’m assuming that U is defined over histories that are composed of states/observations and actions. The actual policy gets 1 utility at every timestep; any other policy gets less than this, so the given policy perfectly maximizes this utility function. This construction has been given before, e.g. at the bottom of page 6 of this paper. (I think I’ve seen it before too, but I can’t remember where.)

But wouldn’t this suggest that the VNM theorem has no content? Well, we assumed that we were looking at the *policy* of the agent, which led to a universe-history deterministically. We didn’t have access to any probabilities. Given a particular action, we knew exactly what the next state would be. Most of the axioms of the VNM theorem make reference to lotteries and probabilities; if the world is deterministic, then the axioms simply say that the agent must have transitive preferences over outcomes. Given that we can only observe the agent choose one history over another, we can trivially construct a transitive preference ordering by saying that the chosen history is higher in the preference ordering than the one that was not chosen. This is essentially the construction we gave above.

What then is the purpose of the VNM theorem? It tells you how to behave if you have probabilistic beliefs about the world, as well as a complete and consistent preference ordering over outcomes. This turns out to be not very interesting when "outcomes" refers to universe-histories. It can be more interesting when "outcomes" refers to world states instead (that is, snapshots of what the world looks like at a particular time), but utility functions over states/snapshots can’t capture everything we’re interested in, and there’s no reason to take as an assumption that an AI system will have a utility function over states/snapshots.
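
For concreteness, here is a minimal Python sketch of the construction. The post itself contains no code, so the action names, the history encoding, and the example policy below are illustrative assumptions, not anything specified in the original argument.

```python
# A minimal sketch: given an arbitrary policy, build a utility function that
# the policy maximizes. All names here are illustrative assumptions.

import random

def make_rationalizing_utility(policy):
    """Return U(h, a) = 1 if the policy would take action a after history h, else 0."""
    def U(history, action):
        return 1 if policy(history) == action else 0
    return U

# Example: a policy that picks an arbitrary action, seeded by the history so
# that it is a deterministic function of the history (as in the post's setting).
ACTIONS = ["up", "down", "left", "right"]

def arbitrary_policy(history):
    rng = random.Random(hash(tuple(history)))  # deterministic given the history
    return rng.choice(ACTIONS)

U = make_rationalizing_utility(arbitrary_policy)

history = ["s0", "up", "s1"]
chosen = arbitrary_policy(history)
# The chosen action gets utility 1; every other action gets 0, so this policy
# "perfectly maximizes" U at every timestep, whatever the policy actually does.
assert U(history, chosen) == 1
assert all(U(history, a) == 0 for a in ACTIONS if a != chosen)
```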

There are no coherence arguments that say you must have goal-directed behavior

Not all behavior can be thought of as goal-directed (primarily because I allowed the category to be defined by fuzzy intuitions rather than something more formal). Consider the following examples:

- A robot that constantly twitches
- An agent that always chooses the action whose name starts with the letter "A"
- An agent that acts according to a randomly chosen policy

None of these seem intuitively goal-directed, yet by the construction above each of them can be modeled as an EU maximizer.
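
As a sketch of how unstructured such policies can be, here are the three examples written out as simple functions. The action names and the history representation are assumptions made purely for illustration.

```python
# Illustrative sketch of the three example policies; names are invented.

import random

ACTIONS = ["twitch_left", "twitch_right", "advance", "abort", "zig"]

def twitching_robot(history):
    # Alternates between two motor commands regardless of what it observes.
    return "twitch_left" if len(history) % 2 == 0 else "twitch_right"

def a_following_agent(history):
    # Always picks the first available action whose name starts with "a".
    return next(a for a in ACTIONS if a.lower().startswith("a"))

def random_policy_agent(history):
    # Picks an action uniformly at random, ignoring the history entirely.
    return random.choice(ACTIONS)

# Each of these can still be "rationalized" as maximizing some utility function
# (per the construction in the first section), yet none of them intuitively
# pursues a goal.
```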

There are no coherence arguments that say you must have preferences

This section is another way to view the argument in the previous section, with "goal-directed behavior" now operationalized as "preferences"; it is not saying anything new. Above, I said that the VNM theorem assumes both that you use probabilities and that you have a preference ordering over outcomes. There are lots of good reasons to assume that a good reasoner will use probability theory. However, there’s not much reason to assume that there is a preference ordering over outcomes. The twitching robot, "A"-following agent, and random policy agent from the last section all seem like they don’t have preferences (in the English sense, not the math sense).

Perhaps you could define a preference ordering by saying "if I gave the agent lots of time to think, how would it choose between these two histories?" However, you could apply this definition to anything, including, e.g., a thermostat or a rock. You might argue that a thermostat or rock can’t "choose" between two histories; but then it’s unclear how to define how an AI "chooses" between two histories without that definition also applying to thermostats and rocks. Of course, you could always define a preference ordering based on the AI’s observed behavior, but then you’re back in the setting of the first section, where all observed behavior can be modeled as maximizing an expected utility function, and so saying "the AI is an expected utility maximizer" is vacuous.
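
To illustrate why a behavior-derived "preference ordering" is so undemanding, here is a small sketch (my own illustrative framing, with invented names): the same recipe assigns preferences to anything that maps situations to outputs, including a thermostat.

```python
# Sketch of a "revealed preference" ordering built purely from observed
# behavior. The same recipe applies to a thermostat, which is the point:
# it does not pick out goal-directed systems.

def revealed_preference(chooser, option_a, option_b):
    """Declare whichever option the system outputs to be 'preferred'."""
    chosen = chooser(option_a, option_b)
    other = option_b if chosen == option_a else option_a
    return chosen, other

def thermostat(option_a, option_b):
    # "Chooses" whichever temperature is closer to its set point.
    set_point = 20.0
    return min(option_a, option_b, key=lambda temp: abs(temp - set_point))

preferred, dispreferred = revealed_preference(thermostat, 18.0, 25.0)
# The thermostat now has a "preference" for 18.0 over 25.0, even though calling
# it an agent with preferences, in the English sense, seems wrong.
```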

Convergent instrumental subgoals are about goal-directed behavior

One of the classic reasons to worry about expected utility maximizers is the presence of convergent instrumental subgoals, detailed in Omohundro’s paper The Basic AI Drives. The paper itself is clearly talking about goal-directed AI systems: "To say that a system of any design is an 'artificial intelligence', we mean that it has goals which it tries to accomplish by acting in the world."

It then argues (among other things) that such AI systems will want to "be rational" and so will distill their goals into utility functions, which they then maximize. And once they have utility functions, they will protect them from modification. Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer, along with the other convergent instrumental subgoals. The coherence arguments all imply that AIs will be EU maximizers for some (possibly degenerate) utility function; they don’t prove that the AI must be goal-directed.

Goodhart’s Law is about goal-directed behavior

A common argument for worrying about AI risk is that we know that a superintelligent AI system will look to us like an EU maximizer, and if it maximizes a utility function that is even slightly wrong, we could get catastrophic outcomes. By now you probably know my first response: any behavior can be modeled as an EU maximizer, and so this argument proves too much, suggesting that any behavior causes catastrophic outcomes. But let’s set that aside for now.

The second part of the claim comes from arguments like Value is Fragile and Goodhart’s Law. However, if we consider utility functions that assign value 1 to some histories and 0 to others, then if you accidentally assign a history where I needlessly stub my toe a 1 instead of a 0, that’s a slightly wrong utility function, but it isn’t going to lead to catastrophic outcomes. The worry about utility functions that are slightly wrong holds water when the utility functions are wrong about some high-level concept, like whether humans care about their experiences reflecting reality. This is a very rarefied, particular distribution of utility functions, all of which lead to goal-directed or agentic behavior. As a result, I think the argument is better stated as "if you have a slightly incorrect goal, you can get catastrophic outcomes". And there aren’t any coherence arguments that say that agents must have goals.
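
To make the "slightly wrong" point concrete, here is a toy sketch with invented histories and values (not from the original post): flipping a single low-stakes history's label barely changes what the maximizer does.

```python
# Toy illustration (invented histories): a 0/1 utility function over histories,
# and a "slightly wrong" copy that mistakenly assigns 1 to a toe-stubbing history.

histories = ["flourishing_future", "stubbed_toe_future", "paperclip_future"]

true_utility = {
    "flourishing_future": 1,
    "stubbed_toe_future": 0,
    "paperclip_future": 0,
}

slightly_wrong = dict(true_utility)
slightly_wrong["stubbed_toe_future"] = 1  # the "slight" error

best_true = max(histories, key=true_utility.get)
best_wrong = max(histories, key=slightly_wrong.get)

# Both utility functions are maximized by non-catastrophic histories; the error
# merely adds a second acceptable optimum. The catastrophic versions of this
# argument need the error to be about a high-level concept, not one label.
print(best_true, best_wrong)
```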

Wireheading is about explicit reward maximization

There are a few papers that talk about the problems that arise with a very powerful system with a reward function or utility function, most notably wireheading. The argument that AIXI will seize control of its reward channel falls into this category. In these cases, the AI system is typically considering making a change to the system by which it evaluates the goodness of actions, and the goodness of the change is evaluated by the system after the change. Daniel Dewey argues in Learning What to Value that if the change is instead evaluated by the system before the change, then these problems go away.

I think of these as problems with reward maximization, because typically when you phrase the problem as maximizing reward, you are maximizing the sum of rewards obtained in all timesteps, no matter how those rewards are obtained (i.e. even if you self-modify to make the reward maximal). It doesn’t seem like AI systems have to be built this way (though admittedly I do not know how to build AI systems that reliably avoid these problems).
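
A minimal sketch of the distinction, in my own illustrative framing rather than code from Dewey's paper: whether a proposed change to the evaluation system is scored by the current evaluator or by the post-change evaluator.

```python
# Illustrative sketch: an agent considering replacing its utility function with
# one that is trivially maximized (a stand-in for wireheading). Names invented.

def current_utility(outcome):
    return 1.0 if outcome == "task_done" else 0.0

def wireheaded_utility(outcome):
    return 1.0  # every outcome looks maximally good after the modification

outcome_if_modified = "sit_idle"  # the agent stops doing the task

# Evaluated by the system AFTER the change: the modification looks great.
score_after_change = wireheaded_utility(outcome_if_modified)   # 1.0

# Evaluated by the system BEFORE the change (Dewey-style): it looks bad.
score_before_change = current_utility(outcome_if_modified)     # 0.0

# The wireheading-style failure shows up only under the first kind of
# evaluation; reward maximization as usually phrased corresponds to that kind.
```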

Summary

In this post I’ve argued that many of the problems we typically associate with expected utility maximizers are actually problems with goal-directed agents or with explicit reward maximization. Coherence arguments only entail that a superintelligent AI system will look like an expected utility maximizer, but this is actually a vacuous constraint, and there are many potential utility functions for which the resulting AI system is neither goal-directed nor explicit-reward-maximizing. This suggests that we could try to build AI systems of this type, in order to sidestep the problems that we have identified so far.

Comments

https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior?commentId=7aDHHPsDX4ZbdYmiS

This post was very surprising to me; it’s definitely an update about how to think about modelling intelligences, and opens up the space of possible theories a fair bit. That said, I feel like I’ve not seen it reviewed by someone who has broadly the position this post responds to, so I’d really like it to be reviewed by someone else.

https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior?commentId=bByTjTDvMGrCrrfYD

This post directly addresses what I think is the biggest conceptual hole in our current understanding of AGI: what type of goals will it have, and why? I think it’s been important in pushing people away from unhelpful EU-maximisation framings, and towards more nuanced and useful ways of thinking about goals.