In the post introducing mesa optimization, the authors defined an optimizer as
a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer. However, there are a number of issues with this definition, as some have already pointed out.

First, I think that by this definition humans are clearly not mesa optimizers. Most of the optimization we do is implicit. Yet humans are supposed to be the prototypical examples of mesa optimizers, which appears to be a contradiction.

Second, the definition excludes perfectly legitimate examples of inner alignment failures. To see why, consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since "go to the nearest key" is a good proxy for getting the reward, the neural network, given the board state, simply returns the action that moves the agent closer to the nearest key. Is the feedforward neural network optimizing anything here? Hardly; it’s just applying a heuristic. Note that you don’t need anything like an internal A* search to find keys in a maze: in many environments, following a wall until a key is within sight, and then performing a very shallow search (which doesn’t have to be explicit), works fairly well. (A minimal sketch of such a heuristic appears at the end of this post.)

As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here. In particular, malign generalization happens when you train a system on objective function X, but at deployment it actually ends up doing Y, where Y is so bad that we’d prefer the system to fail completely. To me at least, this seems like a far more intuitive and less theory-laden way of framing inner alignment failures.

This reframing allows us to keep the old terminology, namely that we are concerned with capability robustness without alignment robustness, while dropping all unnecessary references to mesa optimization. Mesa optimizers could still form a natural class of things that are prone to malign generalization. But if even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real-world examples of such inner alignment failures?
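To make the "just applying a heuristic" point concrete, here is a minimal sketch of the kind of policy described above. This is illustrative Python, not the trained network itself: the gridworld representation and function names are invented, and a real trained network would implement something like this implicitly in its weights rather than as readable code.

```python
# Hypothetical sketch: a key-seeking policy with no internal search over plans.
# The board representation here is invented for illustration.

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def nearest_key_action(agent_pos, key_positions):
    """Greedily pick the single action that moves closest to the nearest key.

    There is no search over plans or futures and no explicitly represented
    objective function inside the policy; it is a fixed one-step heuristic.
    """
    if not key_positions:
        return "up"  # arbitrary default when no keys remain
    nearest = min(key_positions, key=lambda k: manhattan(k, agent_pos))

    def dist_after(action):
        dr, dc = ACTIONS[action]
        return manhattan(nearest, (agent_pos[0] + dr, agent_pos[1] + dc))

    return min(ACTIONS, key=dist_after)

# Example: agent at (2, 2), keys at (0, 2) and (4, 4) -> the policy outputs "up".
print(nearest_key_action((2, 2), [(0, 2), (4, 4)]))
```

A wall-following variant with a shallow lookahead would only be slightly longer. The point is that competent key-seeking behavior doesn’t require an internally represented objective or an internal search.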
I think this is one of the major remaining open questions wrt inner alignment. Personally, I think there is a meaningful sense in which all the models I’m most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I’m definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or transparency tools). Also:
Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar’s usage is slightly different.
I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)
At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.
Comment
> I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment.

Even taking that as an assumption, it seems like if we accept that "mesa optimizer" doesn’t work as a description of humans, then mesa optimization can’t be the right view, and we should retreat to malign generalization while trying to figure out a better view.
Comment
We’re probably in agreement, but I’m not sure what exactly you mean by "retreat to malign generalisation".
For me, mesa-optimisation’s primary claim isn’t the claim (call it Optimisers) that agents are well-described as optimisers; that one I’m happy to drop. It is the claim (call it Mesa≠Base) that, whatever the right way to describe agents is, in general their intrinsic goals are distinct from the reward.
That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what "retreating to malign generalisation" could mean, as malign generalisation itself makes no reference to goals.
One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.
It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.
Comment
From my perspective, there are three levels:
Most general: The inner agent could malignly generalize in some arbitrary bad way.
Middle: The inner agent malignly generalizes in such a way that it makes sense to call it goal-directed, and the mesa-goal (= intentional-stance-goal) is different from the base-goal.
Most specific: The inner agent encodes an explicit search algorithm, an explicit world model, and an explicit utility function.

I worry about the middle case. It seems like, upon reading the mesa optimizers paper, most people start to worry about the last case. I would like people to worry about the middle case instead, and test their proposed solutions against that. (Well, ideally they’d test against the most general case, but if it doesn’t work against that, which it probably won’t, that isn’t necessarily a deal breaker.) I feel better about people accidentally worrying about the most general case, rather than people accidentally worrying about the most specific case.
Comment
I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, "Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap", which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?
Comment
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.
Comment
I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note, I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
Comment
I’m confused/unconvinced. Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"? Can you give some examples of humans being highly dangerous without having done this kind of explicit optimization?
Can you give some realistic examples/scenarios of "malign generalization" that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.
Comment
Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so that is the most important part, even if heuristics are doing most of the immediate work in making each decision. And I’m not sure what you mean by "explicit objective function". I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function, and instead only counting a symbolically represented function as "explicit"? If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, could you explain more about what you mean?
I take your point that some models can look like an optimizer at first glance but, on closer inspection, aren’t really optimizers after all. But this doesn’t answer my question: "Can you give some realistic examples/scenarios of "malign generalization" that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."
ETA: If you don’t have a realistic example in mind, and just think that we shouldn’t currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that’s a good thing to point out too. (I had already upvoted your post based on that.)
Comment
Sure, but some of the optimization we do is explicit, like if someone is trying to get out of debt. Are you saying there’s an important safety-relevant distinction between "system that sometimes does explicit optimization but also does other stuff" versus "system that does explicit optimization exclusively"? And/or that "mesa-optimizer" only refers to the latter (in the definition you quote)? I was assuming not… Or maybe we should say that the human mind has "subagents", some of which are mesa-optimizers...?
Comment
Hmm, let me try again to explain what I’m trying to get at.
Let’s say you’re a prison architect, building prisons for three prisoners:
Prisoner Alice is 100% maximally obsessed with breaking out of prison. She eats and sleeps and works and socializes precisely to the extent that these are instrumentally useful for breaking out of prison. She thinks hard and creatively about breakout strategies during every free moment.
Prisoner Carol does not treat breaking out as an explicit goal, although if she happened upon an open door, she would absolutely walk out. She spends her days reading books from the prison library, socializing, working, etc.
Prisoner Bob is somewhere between the two. For 23 hours a day, he has various pursuits, like Carol. But for a total of 1 hour a day, he works on the project of breaking out of prison. (For a total of ~30 minutes scattered through the day, he does opportunistic information-gathering—he watches guard behavior, he explores the prison grounds, etc. Then, for the 30 minutes before he goes to sleep each night, he sits quietly in his cell, and thinks hard and creatively about how to break out.)
Here’s my question: "Is Bob more like Alice or more like Carol?"
From a psychological perspective, Bob is similar to Carol, while Alice is kinda nuts, even cartoonish. Bob and Carol are doing the same range of activities for the same reasons for 23 hours a day (= 96% of the time). They both make friends, they both exercise, they both read books, they both complain about the weather. Alice has a simple utility function, whereas Bob and Carol are actual plausible humans.
From a prison safety and security perspective, Bob and Alice are very similar to each other, and wildly different from Carol. Alice and Bob are Very High Breakout Risk, Carol is Low Breakout Risk. If you build the prison with 5 layers of security, each of which is 90% reliable, Carol will be kept locked up with 99.999% certainty, whereas either Alice or Bob will engineer a situation where all 5 layers break at once. Alice will come up with good breakout plans sooner than Bob, but not exponentially sooner; it’s a difference of degree not kind, unlike Bob versus Carol.
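To spell out the arithmetic behind those figures, under the assumption (mine, not stated above) that for a passive prisoner like Carol the five layers fail independently, each with failure probability 0.1:

$$P(\text{Carol breaks out}) = 0.1^5 = 10^{-5}, \qquad P(\text{Carol stays locked up}) = 1 - 10^{-5} = 99.999\%.$$

Alice and Bob, by contrast, actively search for a scenario in which the layer failures are correlated, so the independence assumption behind the 99.999% figure doesn’t apply to them at all.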
The moral I’m trying to convey is that, when we talk about mesa-optimization, the important question is "Can we correctly explain any non-infinitesimal subset of the system’s behavior as explicit optimization for a misaligned goal?", not "Can we correctly explain 100% of the system’s behavior as explicit optimization for a misaligned goal?"
Comment
The argument for risk doesn’t depend on the definition of mesa optimization. I would state the argument for risk as "the AI system’s capabilities might generalize without its objective generalizing", where the objective is defined via the intentional stance. Certainly this can be true without the AI system being 100% a mesa optimizer as defined in the paper. I thought this post was suggesting that we should widen the term "mesa optimizer" so that it includes those kinds of systems (the current definition doesn’t), so I don’t think you and Matthew actually disagree. It’s important to get this right, because solutions often do depend on the definition. Under the current definition, you might try to solve the problem by developing interpretability techniques that can find the mesa objective in the weights of the neural net, so that you can make sure it is what you want. However, I don’t think this would work for other systems that are still risky, such as Bob in your example.
Planned summary for the Alignment newsletter:
Here’s a related post that came up on Alignment Forum a few months back: Does Agent-like Behavior Imply Agent-like Architecture?