In the post introducing mesa optimization, the authors defined an optimizer as
a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer. However, there are a number of issues with this definition, as some have already pointed out.

First, I think that by this definition humans are clearly not mesa optimizers. Most of the optimization we do is implicit. Yet humans are supposed to be the prototypical examples of mesa optimizers, which appears to be a contradiction.

Second, the definition excludes perfectly legitimate examples of inner alignment failures. To see why, consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since "go to the nearest key" is a good proxy for getting the reward, the neural network, given the board state, simply returns the action that moves the agent closer to the nearest key. Is the feedforward neural network optimizing anything here? Hardly; it’s just applying a heuristic. Note that you don’t need anything like an internal A* search to find keys in a maze: in many environments, following a wall until a key is within sight, and then performing a very shallow search (which doesn’t have to be explicit), works fairly well. (A minimal sketch of such a heuristic appears at the end of this post.)

As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here. In particular, malign generalization happens when you train a system on objective function X, but at deployment it actually ends up doing Y, where Y is so bad that we’d prefer the system to fail completely. To me at least, this seems like a far more intuitive and less theory-laden way of framing inner alignment failures.

This reframing allows us to keep the old terminology, namely that we are concerned with capability robustness without alignment robustness, while dropping all unnecessary references to mesa optimization. Mesa optimizers could still form a natural class of things that are prone to malign generalization. But if even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real-world examples of such inner alignment failures?
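To make the "just applying a heuristic" point concrete, here is a minimal sketch of the kind of policy described above. This is illustrative Python, not the trained network itself: the gridworld representation and function names are invented, and a real trained network would implement something like this implicitly in its weights rather than as readable code.

```python
# Hypothetical sketch: a key-seeking policy with no internal search over plans.
# The board representation here is invented for illustration.

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def nearest_key_action(agent_pos, key_positions):
    """Greedily pick the single action that moves closest to the nearest key.

    There is no search over plans or futures and no explicitly represented
    objective function inside the policy; it is a fixed one-step heuristic.
    """
    if not key_positions:
        return "up"  # arbitrary default when no keys remain
    nearest = min(key_positions, key=lambda k: manhattan(k, agent_pos))

    def dist_after(action):
        dr, dc = ACTIONS[action]
        return manhattan(nearest, (agent_pos[0] + dr, agent_pos[1] + dc))

    return min(ACTIONS, key=dist_after)

# Example: agent at (2, 2), keys at (0, 2) and (4, 4) -> the policy outputs "up".
print(nearest_key_action((2, 2), [(0, 2), (4, 4)]))
```

A wall-following variant with a shallow lookahead would only be slightly longer. The point is that competent key-seeking behavior doesn’t require an internally represented objective or an internal search.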
I think this is one of the major remaining open questions wrt inner alignment. Personally, I think there is a meaningful sense in which all the models I’m most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I’m definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or transparency tools). Also:
Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar’s usage is slightly different.
I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)
At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.
Comment
> I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment.

Even taking that as an assumption, it seems like if we accept that "mesa optimizer" doesn’t work as a description of humans, then mesa optimization can’t be the right view, and we should retreat to malign generalization while trying to figure out a better view.
Comment
We’re probably in agreement, but I’m not sure what exactly you mean by "retreat to malign generalisation".
For me, mesa-optimisation’s primary claim isn’t the claim (call it Optimisers) that agents are well-described as optimisers; that one I’m happy to drop. It is the claim (call it Mesa≠Base) that, whatever the right way to describe agents is, in general their intrinsic goals are distinct from the reward.
That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what "retreating to malign generalisation" could mean, as malign generalisation itself makes no reference to goals.
One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.
It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.
Comment
From my perspective, there are three levels:
Most general: The inner agent could malignly generalize in some arbitrary bad way.
Middle: The inner agent malignly generalizes in such a way that it makes sense to call it goal-directed, and the mesa-goal (= intentional-stance-goal) is different from the base-goal.
Most specific: The inner agent encodes an explicit search algorithm, an explicit world model, and an explicit utility function.

I worry about the middle case. It seems like, upon reading the mesa optimizers paper, most people start to worry about the last case. I would like people to worry about the middle case instead, and test their proposed solutions against that. (Well, ideally they’d test against the most general case, but if it doesn’t work against that, which it probably won’t, that isn’t necessarily a deal breaker.) I feel better about people accidentally worrying about the most general case, rather than people accidentally worrying about the most specific case.
Comment
I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, "Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap", which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?
Comment
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.
Comment
I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note, I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
Comment
I’m confused/unconvinced. Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"? Can you give some examples of humans being highly dangerous without having done this kind of explicit optimization?
Can you give some realistic examples/scenarios of "malign generalization" that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.
Comment
Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so that is the most important part, even if heuristics are doing most of the immediate work in making each decision. And I’m not sure what you mean by "explicit objective function". I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function, and instead only counting a symbolically represented function as "explicit"? If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, could you explain more about what you mean?
I take your point that some models can look like an optimizer at first glance but, on closer inspection, aren’t really optimizers after all. But this doesn’t answer my question: "Can you give some realistic examples/scenarios of "malign generalization" that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."
ETA: If you don’t have a realistic example in mind, and just think that we shouldn’t currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that’s a good thing to point out too. (I had already upvoted your post based on that.)
Comment
Sure, but some of the optimization we do is explicit, like if someone is trying to get out of debt. Are you saying there’s an important safety-relevant distinction between "system that sometimes does explicit optimization but also does other stuff" versus "system that does explicit optimization exclusively"? And/or that "mesa-optimizer" only refers to the latter (in the definition you quote)? I was assuming not… Or maybe we should say that the human mind has "subagents", some of which are mesa-optimizers...?
Comment
Hmm, let me try again to explain what I’m trying to get at.
Let’s say you’re a prison architect, building prisons for three prisoners:
Prisoner Alice is 100% maximally obsessed with breaking out of prison. She eats and sleeps and works and socializes precisely to the extent that these are instrumentally useful for breaking out of prison. She thinks hard and creatively about breakout strategies during every free moment.
Prisoner Carol does not treat breaking out as an explicit goal, although if she happened upon an open door, she would absolutely walk out. She spends her days reading books from the prison library, socializing, working, etc.
Prisoner Bob is somewhere between the two. For 23 hours a day, he has various pursuits, like Carol. But for a total of 1 hour a day, he works on the project of breaking out of prison. (For a total of ~30 minutes scattered through the day, he does opportunistic information-gathering—he watches guard behavior, he explores the prison grounds, etc. Then, for the 30 minutes before he goes to sleep each night, he sits quietly in his cell, and thinks hard and creatively about how to break out.)
Here’s my question: "Is Bob more like Alice or more like Carol?"
From a psychological perspective, Bob is similar to Carol, while Alice is kinda nuts, even cartoonish. Bob and Carol are doing the same range of activities for the same reasons for 23 hours a day (= 96% of the time). They both make friends, they both exercise, they both read books, they both complain about the weather. Alice has a simple utility function, whereas Bob and Carol are actual plausible humans.
From a prison safety and security perspective, Bob and Alice are very similar to each other, and wildly different from Carol. Alice and Bob are Very High Breakout Risk, Carol is Low Breakout Risk. If you build the prison with 5 layers of security, each of which is 90% reliable, Carol will be kept locked up with 99.999% certainty, whereas either Alice or Bob will engineer a situation where all 5 layers break at once. Alice will come up with good breakout plans sooner than Bob, but not exponentially sooner; it’s a difference of degree not kind, unlike Bob versus Carol.
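To spell out the arithmetic behind those figures, under the assumption (mine, not stated above) that for a passive prisoner like Carol the five layers fail independently, each with failure probability 0.1:

$$P(\text{Carol breaks out}) = 0.1^5 = 10^{-5}, \qquad P(\text{Carol stays locked up}) = 1 - 10^{-5} = 99.999\%.$$

Alice and Bob, by contrast, actively search for a scenario in which the layer failures are correlated, so the independence assumption behind the 99.999% figure doesn’t apply to them at all.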
The moral I’m trying to convey is that, when we talk about mesa-optimization, the important question is "Can we correctly explain any non-infinitesimal subset of the system’s behavior as explicit optimization for a misaligned goal?", not "Can we correctly explain 100% of the system’s behavior as explicit optimization for a misaligned goal?"
Comment
The argument for risk doesn’t depend on the definition of mesa optimization. I would state the argument for risk as "the AI system’s capabilities might generalize without its objective generalizing", where the objective is defined via the intentional stance. Certainly this can be true without the AI system being 100% a mesa optimizer as defined in the paper. I thought this post was suggesting that we should widen the term "mesa optimizer" so that it includes those kinds of systems (the current definition doesn’t), so I don’t think you and Matthew actually disagree. It’s important to get this right, because solutions often do depend on the definition. Under the current definition, you might try to solve the problem by developing interpretability techniques that can find the mesa objective in the weights of the neural net, so that you can make sure it is what you want. However, I don’t think this would work for other systems that are still risky, such as Bob in your example.
Planned summary for the Alignment newsletter:
Here’s a related post that came up on Alignment Forum a few months back: Does Agent-like Behavior Imply Agent-like Architecture?