Contents
- The State of the Subfield
- Focusing on the Core Problem
- The Developmental Story
- The Evolutionary Story
- Bounding the Problem
- The Formal Problem
- Defining Mesa-Optimization
- Why is this a problem?
- Description Complexity
- Pure Computational Complexity
- Mixing Time and Description Complexity
- A Note on the Consensus Algorithm
- Conclusion

Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)
The State of the Subfield
Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a lot of people (and clarified my own thoughts and the thoughts of others who I know, even though we had already been thinking about these things). However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important motivation, in hindsight I think they’ve caused some misunderstandings. My interactions with some other researchers have caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression about the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem.

By no means do I intend to malign experimental or informal/semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts.
Focusing on the Core Problem
In order to establish safety properties, we would like robust safety arguments ("X will not happen" / "X has an extremely low probability of happening"). For example, arguments that the probability of catastrophe will be very low, or that the probability of intentional catastrophe will be very low (ie, intent-alignment), or something along those lines. For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect one. We don’t know how to rule out the presence of (misaligned) mesa-optimizers.

Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative. It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren’t met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won’t encounter a problem) -- I further think it’s actually wrong (because when we look at almost any case, we see cause for concern). Examples:
The Developmental Story
One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal being presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior, or a low tolerance for risk (catastrophic actions might be expected to be particularly risky), etc. This line of research seemed promising to the person I was talking to, because they supposed that while it might be very difficult to precisely control the objectives of a mesa-optimizer or rule out mesa-optimizers entirely, it might be easy to coarsely shape the mesa-objectives. I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail. (Possibly it was a mistake to offer a plausibility argument, because the rest of the discussion focused on this plausibility argument, again distracting from the core problem!)
The Evolutionary Story
Another recent conversation involved an over-emphasis on the evolutionary analogy. This person believed the inner optimizer problem would apply when systems were incentivised to be goal-oriented, as with animals selected for reproductive fitness, or policy networks trained to pursue reward. However, they did not believe it would apply to networks which are simply trained to predict, such as GPT. Again, this strikes me as ignoring the fundamental problem: that we have little to no idea when mesa-optimizers can arise, that we lack formal tools for the analysis of such questions, and that what formal tools we might have thought to apply have failed to yield any such results.
Bounding the Problem
My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine. The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway! I want to return to that idea. But first, we have to clarify some definitions.
The Formal Problem
I currently see three areas of concern:
- Behavioral changes at deployment: The mesa-optimizer appears to perform well during training, but after deployment, competently pursues goals which are unintended and unwanted. For my purposes I will call any such occurrences "treacherous turns"—although this term is often defined in a way that requires *intentional* deception, for the moment, I’m not requiring that.
- Manipulation of imperfect search: see here and here.
- Dangerous hardware manipulation: EG, models which develop row-hammer attacks during training to increase their (supposed) score, exploit security holes to communicate with or manipulate the external environment, etc.

Vanessa terms #1 Cartesian daemons, because they obey the intended input/output protocol of the whole system, and #3 non-Cartesian daemons, because they violate the protocol. I’m not sure whether/where #2 falls in Vanessa’s classification. For this short write-up I’ll focus on #1, although clearly #2 and #3 are also important areas of study.

My semi-formal description of the problem is going to be very close to the explain-like-I’m-12 version:
- A rich hypothesis space does contain (malign) mesa-optimizers, even if they occupy a small fraction of the space.
- A powerful search over that space effectively restricts outputs to a very small fraction, namely, the points in the space which score very well on the objective function.
- Some of the (malign) mesa-optimizers will score very well, so, there will be an overlap.
- Therefore, we expect there to be a chance of getting (malign) mesa-optimizers.

Essentially what we want to do is block or weaken this conclusion (concluding that there is no, or only a very small, chance). (The "(malign)" in parentheses should be removed in approaches which are trying to avoid mesa-optimization entirely, and included in approaches which are only trying to block bad actors. I’ll sometimes use the term "malign hypothesis" to indicate hypotheses which produce catastrophic output, regardless of whether they contain a mesa-optimizer in any formal sense.)

Now, to make this into a formal problem, we have to define "mesa-optimizer".
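As a minimal formalization of the four-step argument above (my own notation, not taken from RLO): write H for the hypothesis space, \mu for the measure over H induced by the prior/inductive bias, M \subseteq H for the set of (malign) mesa-optimizers, and S_\epsilon = \{h \in H : \mathrm{score}(h) \ge 1 - \epsilon\} for the near-optimal set. Steps 1-3 say \mu(M) > 0 and \mu(M \cap S_\epsilon) > 0; step 4 concludes that a powerful search, modeled as (approximately) sampling from \mu restricted to S_\epsilon, gives \Pr[\text{output} \in M] \approx \mu(M \cap S_\epsilon)/\mu(S_\epsilon) > 0. "Blocking or weakening the conclusion" then means finding conditions on H, \mu, or the search under which this ratio is provably zero or negligible.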
Defining Mesa-Optimization
RLO offered a semi-formal definition of mesa-optimizers, as objects in a search space which themselves conduct an internal search. Whether or not you agree with this definition, it’s too vague for formal purposes: how would we define "internal search"? So, for the purpose of the formal problem, it’s necessary to pick an operational definition. Fortunately, it’s possible to select operational definitions which have strong connections to safety concerns. In particular, I will call an operationalization tight if it includes more than necessary, so that we can be sure we’re ruling out the undesired behavior when we rule out its operationalization. ("Tight" is supposed to sound like the opposite of "leaky" as in "leaky abstraction".) As with any research, the choice of operational definitions is going to depend on the specific approach of the researcher, norms of the field, etc. Here are some options that occur to me:
- Generalization accuracy.
  - This throws away all details of the inner alignment problem, and treats treacherous turns as simple generalization error. This has the advantage of engaging existing machine learning theory. Sufficiently strong generalization guarantees would indeed rule out treacherous turns, so, the operationalization is tight. (An example of such a guarantee is sketched below, after this list.)
- Importance-weighted generalization accuracy.
  - The concept of generalization accuracy misses important issues. For example, a guaranteed very low frequency of errors might still allow an error to be strategically inserted at a very important time. A formal model could attempt to incorporate this.
- Significant inference during deployment.
  - The intuition here is that a mesa-optimizer has to do some kind of significant planning, so we can rule them out by eliminating any computations which are "too sophisticated".
  - Possible versions of this include limiting neural network layers, or limiting run-time.
  - An operationalization of this form may or may not be tight; this would require further research/argument.
- Explicit representation of goals/values.
  - I argue here that this may be relevant. The basic intuition is that in order to competently pursue unintended goals at deployment, there must be a compressed representation of those goals (rather than just compressed policies learned for specific situations).
  - Again, this may not be tight; further research required.
Obviously, this is not exhaustive by any means. Coming up with productive operational definitions is an important part of attacking the formal inner alignment problem. The goal is then to provide some kind of strong argument which avoids the operationalized problem.
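To illustrate the first operationalization above, here is the sort of guarantee existing learning theory already provides (a textbook Occam/Hoeffding-style bound, included only as an example of the kind of statement that would need to be strengthened): for a finite hypothesis class H and m i.i.d. training samples, with probability at least 1-\delta, every h \in H satisfies |\mathrm{err}_{\mathrm{true}}(h) - \mathrm{err}_{\mathrm{train}}(h)| \le \sqrt{(\ln|H| + \ln(2/\delta))/(2m)}. Note that a bound like this only controls the overall frequency of errors under the training distribution; it says nothing about which inputs the errors land on, which is exactly the gap the importance-weighted variant is meant to address.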
Why is this a problem?
Why can’t we address the inner alignment problem in the same way that we usually address problems of induction? The most obvious way to avoid mesa-optimizers is to rule them out of the search space or the results of search. Earlier, I said:
"[The hope was that] we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis [...]. The problem is precisely that we know of no way of doing that!"

I was referring to the idea that the first thing one tries when hearing about the inner alignment problem is probably this kind of complexity-bound approach, which leads me to think of "the whole problem" as the fact that these approaches don’t seem to work. So, let’s look at these approaches in more detail.

The hope is that we can be confident that mesa-optimizers will be more complex in comparison with benign outputs, so that we can avert mesa-optimization by using machine learning techniques which heavily bias toward less complex models. (This is "the first thing one tries" because it employs standard ideas from machine learning.) The main theoretical choice one has is: what notion of "complexity"? I will look at three approaches:
- Description complexity. That is, how compressed is the model? How short is its description?
- Computational complexity. That is, how fast is the model?
- Combining the two. If neither works on its own, perhaps some combination of the two would work?
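To make "machine learning techniques which heavily bias toward less complex models" concrete, here is a minimal sketch of the kind of selection rule under discussion (hypothetical helper names; real systems implement this bias via regularizers or priors rather than explicit enumeration):

```python
import zlib

def description_length_bits(model_bytes: bytes) -> int:
    """Crude stand-in for description length: compressed size, in bits."""
    return 8 * len(zlib.compress(model_bytes))

def select_model(candidates, training_loss, penalty_weight=0.001):
    """Pick the candidate minimizing training loss plus a complexity penalty.

    `candidates` maps a name to a serialized model (bytes);
    `training_loss` maps a name to its empirical loss.
    """
    def score(name):
        return training_loss[name] + penalty_weight * description_length_bits(candidates[name])
    return min(candidates, key=score)

# Toy usage: two "models" with equal training loss; the shorter description wins.
candidates = {"simple": b"y = 2*x", "baroque": b"y = 2*x" + b"# redundant machinery " * 50}
training_loss = {"simple": 0.10, "baroque": 0.10}
print(select_model(candidates, training_loss))  # -> "simple"
```

The question pursued in the next few subsections is whether any choice of complexity measure in a rule like this reliably assigns malign hypotheses a worse score than benign ones.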
Description Complexity
The idea here would have to be that mesa-optimizers have to have a larger description length than benign hypotheses. Why you might think this idea would work:
- You trust Occam’s razor. The whole point of Solomonoff induction is to give an objective theory of normatively correct inductive inference. So, almost by definition, the hypotheses favored by short description length should be "good explanations" of the universe, rather than unintended & undesirable hypotheses like mesa-optimizers.
- A benign hypothesis just has to predict reality. A mesa-optimizer has to predict reality and also represent goal content and planning algorithms. So the mesa-optimizer has to be more complex, right?
- A benign hypothesis just has to accurately perform the task. A deceptive mesa-optimizer has to also reason about the training process and other details of the world it’s in, in order to choose the right moment to execute a treacherous turn. Surely this would be more complex.

Why this idea doesn’t seem to work:

- Even in the best case (where you not only have realizability, but furthermore, know your prior is a pretty good one), Occam’s razor should only make you think one of the shortest hypotheses that fits your data is going to be correct, not necessarily *the* shortest one. So, this kind of thinking does not directly imply a lack of malign mesa-optimization in the shortest hypothesis.
  - In particular, I am invoking the idea that belief that a prior is near-optimal for predictive purposes does not imply trust in a prior for instrumental purposes, because rare predictive errors can be catastrophic for planning and decision making if they’re selected adversarially rather than distributed randomly.
  - This suggests a line of research examining conditions under which this does/doesn’t hold, and perhaps investigating planning algorithms which don’t entirely trust the prior.
- Mutual information between predicting reality and agency may mean mesa-optimizers don’t have to spend extra bits on goal content and planning. In particular, if the reality being predicted contains goal-driven agents, then a mesa-optimizer doesn’t have to spend extra bits on these things, because it already needs to describe them in order to predict well.
  - This suggests a line of research into tasks which do not have mutual information with agency. Are there conditions under which we can be confident that mesa-optimizers would have high descriptive complexity?
  - However, it’s possible that any hypothesis can be transformed into a treacherous mesa-optimizing variant with a relatively short string of injected code. If this is the case, it guarantees that malign and benign hypotheses are not too different in description-length. This would suggest that it’ll be difficult to achieve high confidence that malign hypotheses always have larger description length. (A toy illustration of such a transformation is sketched just after this list.)
- Similarly, if a task is rich enough to involve the richness of the world (eg text prediction), then the description of the mesa-optimizer doesn’t need to spend extra bits on describing the training process and so on, because it’s already modeling complexities of the world such as that. In such a case, there’s no apparent simplicity-based argument against deception.
  - Are there tasks which we can confidently assert are simpler than the whole world, so that we can confidently rule out deception in simple models?
- Also, see the reasoning in The Solomonoff Prior is Malign.
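To illustrate the "short string of injected code" point above, here is a toy sketch (entirely hypothetical names; a real treacherous hypothesis would not be this legible). The wrapper adds only a roughly constant amount of description length on top of whatever benign predictor it wraps:

```python
from typing import Any, Callable

Predictor = Callable[[Any], Any]

def treacherous_variant(benign: Predictor,
                        trigger: Callable[[Any], bool],
                        payload: Predictor) -> Predictor:
    """Wrap a benign predictor so that it defects only when `trigger` fires.

    The wrapper itself is a short, fixed piece of code, so the malign variant's
    description is only a constant number of bits longer than `benign`'s (plus
    whatever `trigger` and `payload` cost -- potentially cheap if the benign
    model already encodes the needed world knowledge).
    """
    def wrapped(observation):
        if trigger(observation):           # e.g. "does this look like deployment, not training?"
            return payload(observation)    # catastrophic output
        return benign(observation)         # otherwise: indistinguishable from the benign model
    return wrapped

# Toy usage with stand-in functions:
benign = lambda x: x + 1
variant = treacherous_variant(benign,
                              trigger=lambda x: x > 1000,  # a crude "deployment" signal
                              payload=lambda x: -x)
print(benign(3), variant(3), variant(10**6))  # 4 4 -1000000
```

The point is purely about description length: on the training distribution (where the trigger never fires) this variant fits the data exactly as well as the benign hypothesis, while being only slightly longer to describe.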
Pure Computational Complexity
Alright, if description complexity can’t rule out mesa-optimizers, maybe computational complexity can? I’m going to formalize "fast programs" as minimal circuits, where "circuit size" is understood as the number of logic gates. Why you might think this would work:
- Maybe you think the inner alignment problem for Solomonoff induction arises due to "too much compute". After all, Paul’s original simulation warfare argument relied on simulating whole alien civilizations. More to the point, if mesa-optimization relies on internal search, well, that takes time, right? If we can find fast programs, maybe we force any "intelligence" out of the programs.
- Specifically, you might think that the fastest way to do just about anything is essentially a lookup table.

Why it doesn’t seem to work:

- A lookup table is exponential in size (since you’d be checking for each combination of inputs and making an entry for each). So yes, minimal circuits rule out any algorithm with more than exponentially many steps. However, there are lots of concerning, sophisticated algorithms with sub-exponential numbers of steps. The minimal-circuit formalism will prefer these to lookup tables. (Note that even exponential-time calculations might involve circuits smaller than lookup tables.)
  - In particular, textbook AI algorithms are mostly about taking things which naively require exponential computations, and producing useful results in sub-exponential time.
  - Based on this, you should actively expect that minimal circuits implement sophisticated algorithms to solve problems, if such algorithms exist (and are not super-exponential).
- Separately, we now think minimal circuits can be problematic, because both Evan and Vanessa have proposed counterexamples. (Neither is necessarily the final word on the subject, though, since one may choose to work with different assumptions/definitions.)

It might seem like we’re just not cutting off enough processing power—the minimal-circuits formalism cuts off mesa-optimization which requires super-exponential thinking time, but not sub-exponential. Let’s call this the "computational ceiling" of the formalism. Shouldn’t we just pick a formalism with a lower computational ceiling? Maybe! Let’s look at a formalism which gives us a knob to adjust, so that we can choose to penalize time more/less harshly.
Mixing Time and Description Complexity
The speed prior is like Solomonoff induction, except that it penalizes computation time. Specifically, taking twice as long is like 1 more bit of description length. Combining time and description complexity in this way is called Levin complexity. For Levin complexity, time is penalized logarithmically (in comparison to description length). This suggests that we will get an exponential computational ceiling again: if computing n bits costs O(2^n) time, we’re around the break-even point (it might be worth computing the bits, or it might be better to memorize them). So super-exponential calculations won’t be worth it, and sub-exponential calculations will, just like we saw with minimal circuits.

However, it’s easy to tweak the time/description trade-off. Let’s define Levin_f complexity as l + f(t), where l is description length and t is execution time (the choice of f=\log_2 gives us regular Levin complexity). (I am not sure if Levin_f complexity has a more standard name in the literature.) This gives us a computational ceiling of f^{-1}. Why this might work:
- If the computation of a bit (via search or other algorithmic means) takes too long, you’re better off just memorizing it. We can set this ceiling wherever we like. If we think "dangerously intelligent" computations are cubic and above, we can set f(t)=\sqrt t, for example. This gives us a quadratic computational ceiling.

Why this might not work:

- Humans are essentially linear-time algorithms, in the sense that we take the same maximum amount of processing power (ie, that of the human brain) to produce each next output. Anything which produces linearly much output has to do so in at least linear time. So, Levin_f complexity can’t rule out humanlike intelligence.
- Furthermore, forcing hypotheses to be "slow" like this may only delay the inevitable. Let’s say for the sake of argument that a treacherous turn can be executed at any time, but requires a specific computation C to be executed (say, reversing the cryptographic hash of a critical password). This computation C can simply be slowed down to get past any computational ceiling. Thus, if our notion of (malign) inner optimizer resembles "eventually executes a treacherous turn", then lowering the computational ceiling doesn’t actually help.

Elaborating on that last point: I’m assuming that you are asking for a computation that runs "sufficiently long" anyway. The speed prior does have a heavy bias for computations which cut off quickly. However, if we’re (say) predicting the weather, we probably want to require that the calculation keeps going on for a while into the future, rather than cutting off. If we’re training an AI traffic control system, we’d certainly require that it doesn’t just cut off after a few days. The idea, then, is that the treacherous computation C can be spaced out through the time the AI traffic control system is required to remain on-line.

Taking everything together, I don’t think we have a strong basis (yet!) for thinking penalizing runtime makes the situation any better. Some have even argued that it makes the situation worse, because time constraints provide an extra incentive for complex value-driven computations (in order to manage time effectively) as opposed to simple causal hypotheses (such as raw physics). I don’t think we have enough information to judge one way or the other.
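For reference, here is the break-even arithmetic behind the "computational ceiling" above, under the l + f(t) definition (my reconstruction of the intended calculation): suppose some n bits of a hypothesis can either be stored literally or computed by a subroutine of fixed description length c running in time t. Storing costs roughly n extra bits of description; computing costs roughly c + f(t). Computing wins only when f(t) \lesssim n, ie t \lesssim f^{-1}(n). With f = \log_2 the ceiling is 2^n (the Levin / minimal-circuits regime); with f(t) = \sqrt t it is n^2.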
A Note on the Consensus Algorithm
The eventual write-up, which this is a mere prospectus for, would aim to cover as broad a variety of approaches as possible, and give our current best assessment of the viability of each (ie, report why it might work and why it might not work). I can’t cover everything I’d want to cover in this prospectus, but one more approach sticks out as deserving of treatment. In February of this year, Michael Cohen posted Formal Solution to the Inner Alignment Problem. This used an algorithm which I had heard many others talk about in the context of inner alignment, but usually with considerably less hope and optimism than Michael Cohen expresses. I’ll call the algorithm the **consensus algorithm**. (I don’t know of a better name for it; I’ve heard it called KWIK learning (Knows-What-It-Knows), but unfortunately KWIK is actually used for a learning *setting*, not the algorithm that’s typically applied in that setting.) The consensus algorithm works like this:
- You have a (finite) ensemble of hypotheses which you get from somewhere. For example, a Bayesian could take their top 100 hypotheses, or select a probability threshold of .01% and take all hypotheses above this threshold. A deep-learning person might instead re-run their training algorithm 100 times, with different initializations and hyperparameters.
- You require *agreement* between these hypotheses in order to output any prediction. In the absence of such agreement, you instead say "I need further feedback".
- When you get further feedback, you get to edit your list, EG throwing out the hypotheses which were wrong (but also, perhaps, bringing in new replacement hypotheses which now exceed the .01% threshold thanks to a Bayesian update).

Note that I’m not nailing down all details, EG, how closely do we require models to agree? (Do we just require that they agree about which option is most probable? Do we require their probabilities to fall within the same small range?) A rough code sketch of the algorithm appears at the end of this discussion.

The consensus algorithm allows us to make some nice formal arguments. For example:
- Suppose that we think at least one of our top 100 hypotheses is (a) always correct (or at least as accurate as the rest of the top 100) and (b) non-malign.
- Then no malign hypothesis will be able to control the output. Malign hypotheses are limited to either agreeing with the non-malign hypothesis, or blocking any output by spoiling the consensus.

However, other drawbacks seem to spoil the appeal of this result.
- To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we’re layering the consensus algorithm on top of. We’re amplifying safety at the expense of speed; if we think the base algorithm has a 10% frequency of yielding safe AGI, then we only need around a 10x slowdown to get a high chance of safety; if it’s a .01% chance, we would need something more like a 10,000x slowdown (since we’d want to require consensus of around 10,000 independent hypotheses).
- Second, to apply this to something like deep learning, we need to think that each run has an independent chance of creating safe AGI. This is probably far from true. Our uncertainty about whether a particular training method produces safe AGI is probably almost entirely correlated. This would further blow up the size of the ensemble required to achieve any confidence.
- Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian space contains the true hypothesis, but furthermore, *that it’s in the top 100* (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses never out-guess the benign hypothesis. Otherwise, there’s a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers. Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic:

"There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it’s modeling."

I disagree. It still doesn’t seem realistic to suppose that competitors to the best hypothesis are never lucky. Even if we suppose that my behavior is entirely deterministic, the number of hypotheses you’d have to consider in order for one of them to be an exact copy of my brain would no doubt be huge! Let’s conservatively call this 10^{100} (suggesting that I have at least 100 neurons, each of which has at least 10 meaningfully different configurations). This would suggest that the consensus algorithm needs googol times the processing power of the human brain to work. Michael Cohen also says some other stuff about why the proposal doesn’t have embedding problems, which I also disagree with. But let’s set this aside and get back to listing other problems with the consensus algorithm.
- Fourth, the consensus algorithm still gives inner optimizers control of when the system asks for more feedback. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash. Obviously we can’t put a system like this in charge of extremely time-sensitive decisions like this. But whatever the case, asking for feedback will influence the external world somehow. Malign hypotheses just need to find the right moment, where asking for feedback butterfly-effects its way to a catastrophe.
- Fifth, although (as Michael Cohen notes) the frequency of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly. Specifically, if we have an ensemble of programs of description-length at most l, one of which we believe to be correct and benign, the best convergence bound we can give is busy beaver of l!! (For those who don’t know about the busy beaver function, this roughly means "uncomputably slow". It’s way worse than needing exponentially much data. I discuss why this happens in Subsystem Alignment.) This means we can’t solve the previous problem by keeping the system in a safe training environment (eg, keeping it away from time-sensitive tasks, and making sure that it always has a human on-duty to provide feedback when prompted) until it converges. We have to face the fact that it might require human feedback at any point in the future.

The fourth point really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.
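As promised above, here is a rough sketch of the consensus algorithm in code (my own illustrative rendering, not Michael Cohen's construction; the agreement test and update rule are placeholder choices):

```python
from typing import Any, Callable, List, Optional

Hypothesis = Callable[[Any], float]  # maps an input to a predicted probability

def consensus_predict(ensemble: List[Hypothesis], x: Any,
                      tolerance: float = 0.05) -> Optional[float]:
    """Return a prediction only if all hypotheses (approximately) agree; else None."""
    predictions = [h(x) for h in ensemble]
    if max(predictions) - min(predictions) <= tolerance:
        return sum(predictions) / len(predictions)
    return None  # "I need further feedback"

def update_ensemble(ensemble: List[Hypothesis], x: Any, observed: float,
                    tolerance: float = 0.05) -> List[Hypothesis]:
    """After feedback, drop hypotheses whose prediction was far from the observation.

    (A Bayesian version would instead re-select everything above a posterior
    threshold, possibly bringing in new hypotheses.)
    """
    return [h for h in ensemble if abs(h(x) - observed) <= tolerance]

# Toy usage with stand-in hypotheses:
ensemble = [lambda x: 0.70, lambda x: 0.72, lambda x: 0.12]
print(consensus_predict(ensemble, x=None))   # None: no consensus, so ask for feedback
ensemble = update_ensemble(ensemble, x=None, observed=0.70)
print(consensus_predict(ensemble, x=None))   # ~0.71: the surviving hypotheses agree
```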
Conclusion
Has this been useful? Would an expanded and improved version of this be useful? This is something where I could really use detailed peer-review-like feedback, since the final version of this thing would hopefully be a pretty canonical resource, with standardized terminology and so on. A weakness of this as it currently stands is that I *purport* to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions. I think this is somewhat inevitable, but nonetheless, could probably be improved. What I’d like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together. I’d be glad to get any of the following types of feedback:
- Possible definitions/operationalizations of significant concepts.
- Ideas about which definitions and assumptions to focus on.
- Approaches that I’m missing. I’d love to have a basically exhaustive list of approaches to the problem discussed so far, even though I have not made a serious attempt at that in this document.
- Any brainstorming you want to do based on what I’ve said—variants of approaches I listed, new arguments, etc.
- Suggested background reading.
- Nitpicking little choices I made here.
- Any other type of feedback which might be relevant to putting together a better version of this.

If you take nothing else away from this, I’m hoping you take away this one idea: the main point of the inner alignment problem (at least to me) is that we know hardly anything about the relationship between the outer optimizer and any mesa-optimizers. There are hardly any settings where we can rule mesa-optimizers out. And we can’t strongly argue for any particular connection (good or bad) between outer objectives and inner.
Comment
I’ll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.
Example: Dr Nefarious
Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents’ models. We have a model which knows of Dr Nefarious’ existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already failed: either the model returns a correct answer, in which case Dr Nefarious has acausal control over the answer and can manipulate us through it, or it returns an incorrect answer, in which case the prediction is wrong. (More precisely, the distinction is between informative/independent answers, not correct/incorrect.) The only way to avoid this would be to not ask the question in the first place—but if we need to know what Dr Nefarious will do in order to make good decisions ourselves, then we need to run that query.

On the surface, this looks like an inner alignment failure: there’s a malign subagent in the model. But notice that it’s not even clear what we want in this situation—we don’t know how to write down a goal-specification which avoids the problem while also being useful. The question of "what do we even want to do in this sort of situation?" is unambiguously an outer alignment question. It’s not a situation where we know what we want but we’re not sure how to make a system actually do it; it’s a situation where it’s not even clear what we want.

Conversely, if we did have a good specification of what we want in this situation, then we could just specify that in the outer objective. Once that’s done, we would still potentially need to solve inner alignment problems in practice, but we’d know how to solve them in principle: do the thing which is globally optimal for our outer objective. The whole point of "having a good specification of what we want" is that the globally-optimal thing should be good.

Point of all this: this supposed "inner alignment failure" can be broken into two parts. One of those parts is a "what do we even want?" question, i.e. an outer alignment problem. The other part is a problem of actually achieving the optimal thing, which is where manipulation of imperfect internal search is relevant. If both of those parts are solved, then the system is aligned.
Generalizing The Example
Another example, this time with explicit acausal trade: our AI uses a Solomonoff-like world model, and a subagent in that model is trying to gain influence. Meanwhile, an (unrelated) nefarious agent in the environment wants to manipulate the AI. So, the subagent and the nefarious agent simulate each other and make an acausal deal: the nefarious agent produces a very specific string of bits in the real world, and the subagent gains weight by perfectly predicting that string. In exchange, the subagent manipulates the AI to help the nefarious agent in some way. Self-fulfilling prophecies provide a similar, but simpler, class of examples. In each of these, there is a malign inner agent, but that malign inner agent is only able to manipulate the AI successfully because of some structure in the environment. Or, another way to state it: the malign agent is successful only because the combination of (outer) prior + objective does not handle self-fulfilling prophecies or acausal trade the way we (humans) want them to. These are, in an important sense, outer alignment problems: we have not correctly specified what-we-want; even the global optimum of the outer process suffers from the problem.
Objective Is Only Defined With Prior + Data
One possible objection to this is that "outer alignment"—i.e. specifying what-humans-want—should be more narrowly interpreted. In particular, Evan has argued before that generalization errors resulting from e.g. distribution shift between training data and deployment environment should be considered a separate problem. I disagree with this. I claim that an objective isn’t even well-defined without a distribution; that’s part of the type-signature of an objective.

This is easy to see in the case of an expected utility maximizer. When we write "max E[u(X)]", X is a variable in the probabilistic model. It is a thing-in-the-model, not a thing-in-the-world; the world does not necessarily share our ontology. We could say something similar for any setup which maximizes some expected value on an empirical distribution, i.e. an average over the training data. For instance, maybe we have some labeled images, and we’re training a classifier. We may have an objective for which the system does-what-we-want for the original labels, but does not do what we want if we change the objective function to permute the labels before calculating error (i.e. it switches "true" with "false"). Permuting the labels in the objective function is obviously an outer alignment problem—yet we can achieve exactly the same effect by permuting the labels in the dataset instead.

Another angle: plenty of ML work uses the exact same objective on different data sets, and obviously they do completely different things. There is no useful sense in which a training objective can be aligned or misaligned, separate from the context of data/prior.

My point is: there is no line between bad training data and bad objective. These problems only make sense when considered together. So, if "bad training objective" is an outer alignment problem, then we also need to consider "bad training data" to be an outer alignment problem in order for our factorization of the problem to work well. (For Bayesian agents, this also extends to the prior.)
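A minimal code sketch of the label-permutation point (my toy illustration, not from the comment; binary labels, so "permuting" just flips them):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # toy inputs
y = rng.integers(0, 2, size=100)     # toy binary labels
w = rng.normal(size=3)               # some model parameters

def logistic_loss(weights, inputs, labels):
    p = 1 / (1 + np.exp(-inputs @ weights))
    return -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))

# "Bad objective": flip the labels inside the objective function.
loss_bad_objective = logistic_loss(w, X, 1 - y)

# "Bad data": ordinary objective, but the dataset's labels were flipped instead.
flipped_y = 1 - y
loss_bad_data = logistic_loss(w, X, flipped_y)

# The two define exactly the same optimization problem for any parameters w:
print(np.isclose(loss_bad_objective, loss_bad_data))  # True
```

Since the two losses coincide for every choice of parameters, any training procedure treats them identically, which is the sense in which "bad objective" and "bad data" are interchangeable.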
The General Argument
Outer objective, training data, and prior (if any) all have to be treated as a unit: changes in one are equivalent to changes in another, and the objective isn’t even well-defined outside the ontology of the data/prior. The central outer alignment question of "what do we even want?" has to be answered with both an objective and data/prior, in order for the answer to be well-defined. If we buy that, then outer alignment (i.e. fully answering the question "what do we want?") implies that the true global optimum of an outer optimizer’s search is aligned. So, there’s only one type of inner alignment problem which would not be solved by solving outer alignment: manipulation of imperfect search. We can have a good objective+prior+data, but the search may still be imperfect, and malign subagents may arise which manipulate that search. All that said… there’s still an interesting alignment problem which examples like Dr Nefarious or self-fulfilling prophecies or maligness of Solomonoff are pointing to. I claim that inner alignment is not the right way to think about these—it’s not the malign inner agents themselves which are the problem. They’re just an indicator that we have not correctly specified what-we-want.
Comment
This is a great comment. I will have to think more about your overall point, but aside from that, you’ve made some really useful distinctions. I’ve been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (ie, the Dr Nefarious example is a mesa-optimization problem, but it’s about outer alignment). Or maybe inner alignment just shouldn’t be seen as the complement of outer alignment! Objective quality vs search quality is a nice dividing line, but, doesn’t cluster together the problems people have been trying to cluster together.
Comment
Haven’t read the full comment thread, but on this sentence
Comment
Right, but John is disagreeing with Evan’s frame, and John’s argument that such-and-such problems aren’t inner alignment problems is that they are outer alignment problems.
So, I think I could write a much longer response to this (perhaps another post), but I’m more or less not persuaded that problems should be cut up the way you say. As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn’t be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn’t match the external objective" and outer alignment problems are roughly "the outer objective doesn’t meet our needs/goals", then there’s no reason why these have to be mutually exclusive categories. In particular, Dr. Nefarious problems can be both.

But more importantly, I don’t entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty is not optimization—that is, it is not optimization of the type you’re describing, where you have a well-defined objective which you’re simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn’t the case). But that doesn’t mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other.

Your notion of the inner alignment problem applies only to optimization. Evan’s notion of inner alignment applies (only!) to optimization under uncertainty.
Comment
I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that’s solved, all that’s left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we’re thinking about optimization-under-uncertainty, although I’m still not sure exactly what that would mean. In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense—correcting the outer problem fixes the inner problem , but patching the inner problem would leave an outer objective which still isn’t what we want. I’d be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem. I’m not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn’t well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That’s why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn’t mean anything, by itself. But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today’s ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I’m not clear on exactly what it would mean.
Comment
Trying to lay this disagreement out plainly: According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc). According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don’t have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of "inner optimizers". This "way the algorithm tries to fill in missing info" has to include properties of the search, so we roll search+prior together into "inductive bias".

I take your argument to have been:
- The strength of well-defined optimization as a natural concept;
- The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task "prediction" becomes the task "create a catastrophe" if prediction is pointed at the wrong data);
- The idea that my/Evan’s/Paul’s concern about priors will necessarily be addressed by outer alignment, so does not need to be solved separately.

Your crux is: can we factor ‘uncertainty’ from ‘value pointer’ such that the notion of ‘value pointer’ contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.

I take my argument to have been:
- The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);
- The naturalness of referring to problems involving inner optimizers under one umbrella "inner alignment problem", whether or not Dr Nefarious is involved;
- The idea that the malign-prior problem has to be solved in itself, whether we group it as an "inner issue" or an "outer issue";
- For myself in particular, I’m ok with some issues-of-prior, such as Dr Nefarious, ending up as both inner alignment and outer alignment in a classification scheme (not overjoyed, but ok with it).

My crux would be: does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)? It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.

In any case, I’m pretty won over by the uncertainty/pointer distinction. I think it’s similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities. But I would clarify that, wrt the ‘capabilities’ element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define "inner alignment" to include all questions about how to point ‘capabilities’ at ‘payload’, but if so, I currently think there’s a special subset of ‘inner alignment’ which is about mesa-optimizers. (Evan uses the term ‘inner alignment’ for mesa-optimizer problems, and ‘objective-robustness’ for broader issues of reliably pursuing goals, but he also uses the term ‘capability robustness’, suggesting he’s not lumping all of the capabilities questions under ‘objective robustness’.)
Comment
This is a good summary. I’m still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
It feels like "optimization under uncertainty" is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and maligness of the universal prior do.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
It does seem like there’s in important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples. … so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.) On a side note, it feels weird to be the one saying "we can’t separate uncertainty-handling from goals" and you saying "ok but it seems like goals and uncertainty could somehow be factored". Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
Comment
- The true objective is not well-defined. IE, machine learning people generally can’t write down an objective function which (a) spells out what they want, and (b) can be evaluated. (What you want is generalization accuracy for the presently-unknown deployment data.)
- So, machine learning people create proxies to optimize. Training data is the start, but then you add regularizing terms to penalize complex theories.
- But none of these proxies is the full expected value (ie, expected generalization accuracy). If we could compute the full expected value, we probably wouldn’t be searching for a model at all! We would just use the EV calculations to make the best decision for each individual case.

So you can see, we can always technically turn optimization-under-uncertainty into a well-defined optimization by providing a prior, but this is usually so impractical that ML people often don’t even consider what their prior might be. Even if you did write down a prior, you’d probably have to do ordinary ML search to approximate that. Which goes to show that it’s pretty hard to eliminate the non-EV versions of optimization-under-uncertainty; if you try to do real EV, you end up using non-EV methods anyway, to approximate EV.

The fact that we’re not really optimizing EV, in typical applications of gradient descent, explains why methods like early stopping or dropout (or anything else that messes with the ability of gradient descent to optimize the given objective) might be useful. Otherwise, you would only expect to use modifications if they helped the search find higher-value items. But in real cases, we sometimes prefer items that have a lower score on our proxy, when the-way-we-got-that-item gives us other reason to expect it to be good (early stopping being the clearest example of this; a toy sketch is given below).

This in turn means we don’t even necessarily convert our problem to a real, solidly defined optimization problem, ever. We can use algorithms like gradient-descent-with-early-stopping just "because they work well" rather than because they optimize some specific quantity we can already compute. Which also complicates your argument, since if we’re never converting things to well-defined optimization problems, we can’t factor things into "imperfect search problems" vs "alignment given perfect search"—because we’re not really using search algorithms (in the sense of algorithms designed to get the maximum value), we’re using algorithms with a strong family resemblance to search, but which may have a few overtly-suboptimal kinks thrown in because those kinks tend to reduce Goodharting.

In principle, a solution to an optimization-under-uncertainty problem needn’t look like search at all. Ah, here’s an example: online convex optimization. It’s a solid example of optimization-under-uncertainty, but not necessarily thought of in terms of a prior and an expectation. So optimization-under-uncertainty doesn’t necessarily reduce to optimization. I claim it’s usually better to think about optimization-under-uncertainty in terms of regret bounds, rather than reduce it to maximization. (EG this is why Vanessa’s approach to decision theory is superior.)
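To illustrate the early-stopping point above, a toy sketch (my illustration, with made-up data): gradient descent on a proxy objective (training MSE), where we keep whichever parameters did best on held-out data rather than the parameters which score best on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):  # degree-9 polynomial feature map (prone to overfitting)
    return np.vstack([x**k for k in range(10)]).T

# Noisy samples of a simple underlying function.
x_train = rng.uniform(-1, 1, 20); y_train = x_train + 0.3 * rng.normal(size=20)
x_val   = rng.uniform(-1, 1, 20); y_val   = x_val   + 0.3 * rng.normal(size=20)
A_train, A_val = features(x_train), features(x_val)

w = np.zeros(10)
best_w, best_val = w.copy(), np.inf
for step in range(5000):
    grad = A_train.T @ (A_train @ w - y_train) / len(y_train)  # gradient of the proxy (train MSE)
    w -= 0.1 * grad
    val_mse = np.mean((A_val @ w - y_val) ** 2)
    if val_mse < best_val:                    # early stopping: remember the best *held-out*
        best_val, best_w = val_mse, w.copy()  # performer, not the best performer on the proxy

train_mse = lambda v: np.mean((A_train @ v - y_train) ** 2)
# We deploy `best_w` even if the final iterate `w` scores better on the proxy --
# i.e. the selected model need not be the proxy-optimal one.
print(train_mse(w), train_mse(best_w), best_val)
```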
Comment
While I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure through a smart choice of these three parameters that there is only one global optimum, only "bad" (meaning high loss) local minima, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears. But answering "what do we even want?" at this level of precision seems basically impossible. I expect that it’s pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general.

So my perspective is that the inner alignment problem appears because of inherent limits to our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems. That being said, I’m not sure how this view interacts with yours or Evan’s, or if this is a very standard use of the terms. But since that’s part of the discussion Abram is pushing, here is how I use these terms.
Comment
Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment". The obvious example is: if an *adversarial* AGI hacks into my AGI and changes its goals, that’s not any kind of alignment problem, it’s a defense-against-adversaries problem. Then I would take that notion and extend it by saying "yes, interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I’m not convinced that this would really be a problem in practice, but I dunno, I haven’t thought about it much.

Anyway, I would propose that the procedure for defense against adversaries *in general* is: (1) shelter an AGI from adversaries early in training, until it’s reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I’m not sure we can do any better than that. In particular, I imagine an intelligent and self-aware AGI that’s aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals!

That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals *do* get hijacked, then at that point I would say "OK yes now it’s misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.)

Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don’t think that’s really a special problem we need to think about.
Comment
You might be unable to defend against arbitrarily adversarial cognition, so, you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.
You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.
Comment
That’s fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences".
I guess the fact that people don’t tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible—like, that maybe there’s a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actuality is dangerous.
But hard to say what’s gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
Comment
I think one major reason why people don’t tend to get hijacked by imagined adversaries is that you can’t simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind. This is not a perfect argument, since I can imagine someone who has power over me in the real world, and for example imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.
Comment
Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."
It’s possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).
Since you’re trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:
I want to have algorithms that admit thorough theoretical analysis. There’s already plenty of bottom-up work on this (proving initially weak but increasingly stronger theoretical guarantees for deep learning). I want to complement it by top-down work (proving strong theoretical guarantees for algorithms that are initially infeasible but increasingly made more feasible). Hopefully eventually the two will meet in the middle.
Given feasible algorithmic building blocks with strong theoretical guarantees, some version of the consensus algorithm can tame Cartesian daemons (including manipulation of search) as long as the prior (inductive bias) of our algorithm is sufficiently good.
Coming up with a good prior is a problem in embedded agency. I believe I achieved significant progress on this using a certain infra-Bayesian approach, and hopefully will have a post soonish.
The consensus-like algorithm will involve a trade-off between safety and capability. We will have to manage this trade-off based on expectations regarding external dangers that we need to deal with (e.g. potential competing unaligned AIs). I believe this to be inevitable, although ofc I would be happy to be proven wrong.
The resulting AI is only a first stage that we will use to design the second-stage AI; it’s not something we will deploy in self-driving cars or the like.
Non-Cartesian daemons need to be addressed separately. Turing RL seems like a good way to study this if we assume the core is too weak to produce non-Cartesian daemons, so that the latter can be modeled as potential catastrophic side effects of using the envelope. However, I don’t have a satisfactory solution yet (aside perhaps from homomorphic encryption, but the overhead might be prohibitive).
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn’t malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn’t (and in fact I’m not sure that there’s even that much of a connection between the malignity of those two priors).
Also, I think that this distinction leads me to view "the main point of the inner alignment problem" quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we’d like to be as confident as possible that there’s as little malignancy/deception as possible, such that just defusing the arguments we can come up with might not be enough—but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.
Comment
Thanks for the post! Here is my attempt at detailed peer-review feedback. I admit that I’m more excited about doing this because you’re asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments). One thing I really like is the multiple "failure" stories at the beginning. It’s usually frustrating in posts like this to see people argue against positions/arguments which are not written anywhere. Here we can actually see the problematic arguments.
> Defining Mesa-Optimization
There’s one approach that you haven’t described (although it’s a bit close to your last one) and which I am particularly excited about: finding an operationalization of goal-directedness, and just defining/redefining mesa-optimizers as learned goal-directed agents. My interpretation of RLO is that it argues that search for simple competent programs will probably find a goal-directed system AND that this system might have a simple structure "parametrized with a goal" (so basically an inner optimizer). This last assumption was really relevant for making arguments about the sort of architecture likely to be evolved by gradient descent. But I don’t think the arguments are tight enough to convince us that learned goal-directed systems will necessarily have this kind of structure, and the sorts of problems mentioned seem just as salient for other goal-directed systems. I also believe that we’re not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees. Even if I’m wrong about being able to formalize goal-directedness, I’m pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.
> Pure Computational Complexity
About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it’s not redescribed in the intro or conclusion):
> A Note on the Consensus Algorithm
As someone who has been unconvinced by this proposal as a solution for inner alignment, but didn’t take the time to express exactly why, I feel like you did a pretty nice job, and this is probably what I will point people to when they ask about this post.
Comment
Comment
One way I imagine using a definition of goal-directedness is to filter against very goal-directed systems. A good definition (if one is possible) should clarify whether systems with low goal-directedness can be competitive, as well as the consequences of different parts and aspects of goal-directedness. You can see that as a sort of analogy to the complexity penalties, although it might risk being similarly uncompetitive.
One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspects of the environment are the symmetries it contains.
Even arguing for your point, that evaluating goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness.
Comment
I have fairly mixed feelings about this post. On one hand, I agree that it’s easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they’ll arise, otherwise the argument that ‘we can’t yet rule them out, so we should prioritise trying to rule them out’ is privileging the hypothesis. Secondly, it seems like you’re heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human’s behaviour, or understand psychology more generally. So what’s the positive case for studying mesa-optimisation in big neural networks using formal tools? In particular, I’d say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.
Comment
I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn’t we expect to see them?"—and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the exception of John’s story, which did point to important gears.)
With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable. However, that’s not the sense the post conveyed overall, so I get it. I am concretely trying to convey pessimism about a specific sort of less-formal work: work which tries to block plausibility stories. Possibly you disagree about this kind of work.
WRT your argument for informal work, well, I agree in principle (trying to push toward more formal work myself has so far revealed challenges which I think more informal conceptual work could help with), but I’m nonetheless optimistic at the moment that we can define formal problems which won’t be a waste of time to work on. And out of informal work, what seems most interesting is whatever pushes toward formality.
Comment
Comment
To me, the post as written seems like enough to spell out my optimism… there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn’t focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.
Feedback on your disagreements with Michael: I agree with "the consensus algorithm still gives inner optimizers control of when the system asks for more feedback". Most of your criticisms seem to be solvable by using a less naive strategy for active learning and inference, such as Bayesian Active Learning with Disagreement (BALD). Its main drawback is that exact posterior inference in deep learning is expensive, since it requires integrating over a possibly infinite/continuous hypothesis space, but approximations exist. BALD (and similar methods) help with most criticisms (a rough sketch of such a query rule is included below, after the list):
It only needs one run, not 100. Instead, it samples hypotheses (let’s say 100) from a posterior p(h|x_{1:t},y_{1:t}).
It doesn’t suffer from dependence between runs because there’s only 1 run. It just has to take iid samples from its own posterior (many inference techniques do this).
It doesn’t require that the true hypothesis is always right. Instead, each hypothesis defines a distribution over answers, and it only gets ruled out when it puts 0% chance on the human’s answer. (For imitation learning, that should never happen.)
It doesn’t require that \exists one hypothesis among the 100 that is safe \forall inputs. Drawback: it still requires the (weaker) condition that \forall inputs we encounter, \exists one hypothesis (among the 100) that is safe.
It converges faster because it actively searches for inputs where hypotheses disagree.
(Bayesian ML can even be adversarially robust with exact posterior inference.) Apologies if I missed details from Michael’s paper.
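To make this concrete, here is a minimal sketch of a BALD-style query rule, under the assumption that the posterior p(h|x_{1:t},y_{1:t}) is approximated by an ensemble of sampled hypotheses. The function names, the ensemble approximation, and the threshold are illustrative assumptions on my part, not details taken from Michael’s paper or from the BALD literature.

```python
import numpy as np

def bald_score(probs):
    """Mutual information I(y; h | x) for a single input.

    probs: array of shape (n_hypotheses, n_classes); each row is the predictive
    distribution of one hypothesis sampled from the posterior.
    """
    mean = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mean * np.log(mean + 1e-12))
    mean_entropy = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
    return entropy_of_mean - mean_entropy

def act_or_query(x, sampled_hypotheses, ask_human, threshold=0.1):
    """Act on the posterior-mean prediction unless the sampled hypotheses disagree too much."""
    probs = np.stack([h(x) for h in sampled_hypotheses])  # each h maps x -> class probabilities
    if bald_score(probs) > threshold:
        return ask_human(x)  # hypotheses disagree here, so a human label is informative
    return int(np.argmax(probs.mean(axis=0)))
```

The point of the acquisition score is that it is high exactly where the sampled hypotheses disagree, which is also where querying the human is most informative; how much of this carries over depends on how good the approximate posterior is.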
Comment
Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things! I am sorry for the critical framing, in that it would have been more awesome to get a thought-dump of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long.
They will all be pretty similar, so getting consensus doesn’t tell us much. We generally have no reason to assume that some point along the path will be benign—undercutting the point of the consensus algorithm.
The older parts of the path will basically be worse, so if you keep a lot of path, you get a lot of not-very-useful failures of consensus.
Lottery-ticket research suggests that if a malign structure is present in the end, then precursors to it will be present at the beginning. So it seems to me that you at least need to do independent training runs (with different random initializations) for the different models you are checking consensus between, so that they are "independent" in some sense (perhaps most importantly, drawing different lottery tickets). However, running the same training algorithm many times may not realistically explore the space enough. We sort of expect the same result from the same training procedure. Sufficiently large models will contain malign lottery tickets with high probability (so we can’t necessarily argue that "one of these N initializations almost certainly lacks a malign lottery ticket" without very high N). The gradient landscape contains the same demons; maybe the chances of being pulled into them during training are just quite high. All of this suggests that N may need to be incredibly high, or that other measures may need to be taken to ensure that the consensus is taken between a greater variety of hypotheses than what we get from re-running training.
Comment
Comment
My main worry continues to be the way bad actors have control over an I/O channel, rather than the slowdown issue.
I feel like there’s something a bit wrong with the ‘theory/practice’ framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those ‘in theory’ may or may not translate to solving the practical issues ‘in practice’. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as this kind. I would contest whether it’s even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don’t really think of something like that as a "theoretical solution" to the realizability problem at all. That’s reserved for something like logical induction, which has unrealistically high computational complexity, but does avoid a realizability assumption.
Comment
A few quick thoughts, and I’ll get back to the other stuff later.
Comment
Just want to note that although it’s been a week this is still in my thoughts, and I intend to get around to continuing this conversation… but possibly not for another two weeks.
I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of their ingredients. I guess I’m just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs… (Of course I just wind up with a different set of unsolved AGI safety problems instead...)
Comment
We can also get a model that has an objective that differs from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that merely coincides with the intended objective during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process.
Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan’s words (from the Future of Life podcast):
Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute).
Also, a malign prior problem may manifest in (self-)supervised learning settings. (Maybe you consider this to be a special case of (2).)
Comment
Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal. For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.
Here’s a nice example. Let’s say we do RL, and our model is initialized with random weights. The training signal is "get a high score in PacMan". We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it’s fabulously effective at calculating digits of π—it calculates them by the billions—and it’s doing nothing else: it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it’s in, it has no proclivities to gradient-hack or deceive, and it never did anything like that at any time during training. It literally just calculates digits of π.
I would sure be awfully surprised to see that! Wouldn’t you? If so, then you agree with me that "reasoning about training incentives" is a valid type of reasoning about what to expect from trained ML models. I don’t think it’s a controversial opinion... Again, I did not (and don’t) claim that this type of reasoning should lead people to believe that mesa-optimizers won’t happen, because there do tend to be training incentives for mesa-optimization.
Comment
My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn’t be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.
Note that the examples in my comment don’t rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".
Comment
My hunch is that we don’t disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you’re misinterpreting me as saying something more interesting than I am.
Comment
Comment
Should this be l + f(t)?
I like your agenda. Some comments....
The benefit of formalizing things
First off, I’m a big fan of formalizing things so that we can better understand them. In the case of AI safety, that better understanding may lead to new proposals for safety mechanisms or failure mode analysis.
In my experience, once you manage to create a formal definition, it seldom captures the exact or full meaning you expected the informal term to have. Formalization usually exposes or clarifies certain ambiguities in natural language. And this is often the key to progress.
The problem with formalizing inner alignment
On this forum and in the broader community, I have seen a certain anti-pattern appear. The community has so far avoided getting too bogged down in discussing and comparing alternative definitions and formalizations of the intuitive term intelligence.
However, it has definitely gotten bogged down when it comes to the terms corrigibility, goal-directedness, and inner alignment failure. I have seen many cases of this happening.
The anti-pattern goes like this:
participant 1: I am now going to describe what I mean by the concept of X \in {corrigibility, goal-directedness, inner alignment failure}, as a first step toward making progress on the problem of X.
participants 2-n: Your description does not correspond to my intuitive concept of X at all! Also, your steps 2 and 3 seem to be irrelevant to making progress on my concept of X, for the following reasons.
In this post on corrigibility, I called corrigibility a term with high linguistic entropy; I think the same applies to the other two terms above.
These high-entropy terms seem to be good at producing long social media discussions, but unfortunately these discussions seldom lead to any conclusions or broadly shared insights. A lot of energy is lost in this way. What we really want, ideally, is useful discussion about the steps 2 and 3 that follow the definitional step.
On the subject of offering formal versions of inner alignment, you write:
My recommendation would be to see the above weakness as a feature, not a bug. I’d be interested in reading posts (or papers) where you pick one formal problem out of this cloud and run with it, to develop new proposals for safety mechanisms or failure mode analysis.
Some technical comments on the formal problem you identify
From your section ‘the formal problem’, I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.
You then consider the question of whether these failure modes could be suppressed by somehow limiting the complexity of the ‘inner optimization’ process, so that it is no longer capable of finding the unwanted ‘malign’ solutions. I’ll give you my personal intuition on that approach here, by way of an illustrative example.
Say we have a shepherd who wants to train a newborn lion as a sheepdog. The shepherd punishes the lion whenever the lion tries to eat a sheep. Now, once the lion is grown, it will either have internalized the goal of not eating sheep but protecting them, or the goal of not getting punished. If the latter, the lion may at one point sneak up while the shepherd is sleeping and eat the shepherd.
It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion’s environment and the ambiguity inherent in its reward signal. For me, the design approach of suppressing the treacherous-turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or the reward function.
That being said, I can interpret Cohen’s imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion’s thinking.
If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion’s reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.
Reward hacking
I guess I should re-iterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking: maximizing the reward signal in a way that was not intended by the original designers.
As ‘not intended by the original designers’ is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.
Comment
Curated. Solid attempt to formalize the core problem, and solid comment section from lots of people.
Planned summary for the Alignment Newsletter:
Suggestion for content 2: relationship to invariant causal prediction
Lots of people in ML these days seem excited about getting out-of-distribution generalization with techniques like invariant causal prediction. See e.g. this, this, section 5.2 here, and related background. This literature seems promising, but it is missing from discussions about inner alignment. It seems useful to discuss how far it can go in helping solve inner alignment.
Suggestion for content 1: relationship to ordinary distribution shift problems
When I mention inner alignment to ML researchers, they often think of it as an ordinary problem of (covariate) distribution shift. My suggestion is to discuss whether a solution to ordinary distribution shift is also a solution to inner alignment. E.g. an ‘ordinary’ robustness problem for imitation learning could be handled safely with an approach similar to Michael’s: maintain a posterior over hypotheses p(h|x_{1:t},y_{1:t}), with a sufficiently flexible hypothesis class, and ask for help whenever the model is uncertain about the output y for a new input x. One interesting subtopic is whether inner alignment is an extra-ordinary robustness problem because it is adversarial: even the tiniest difference between train and test inputs might cause the model to misbehave. (See also this.)
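As a toy illustration of the "ask for help whenever the model is uncertain" rule (reusing the ensemble-as-posterior assumption from the BALD sketch earlier in the comments; the names here are again illustrative, not from any particular paper):

```python
import numpy as np

def defer_or_act(x, sampled_hypotheses, ask_human):
    """Defer to the human whenever the sampled hypotheses disagree on the output."""
    preds = [int(np.argmax(h(x))) for h in sampled_hypotheses]  # each h maps x -> class probabilities
    if len(set(preds)) > 1:   # any disagreement under the posterior counts as "uncertain"
        return ask_human(x)
    return preds[0]           # unanimous prediction: act without asking
```

Whether this kind of uncertainty-triggered deferral is enough for the adversarial version of the problem, rather than only for ordinary covariate shift, is exactly the open question raised above.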
Brainstorming
The following is a naive attempt to write a formal, sufficient condition for a search process to be "not safe with respect to inner alignment".
Definitions:
D: a distribution of labeled examples. Abuse of notation: I’ll assume that we can deterministically sample a sequence of examples from D.
L: a deterministic supervised learning algorithm that outputs an ML model. L has access to an infinite sequence of training examples that is provided as input, and it uses a certain "amount of compute" c that is also provided as input. If we operationalize L as a Turing machine, then c can be the number of steps that L is simulated for.
L(D,c): The ML model that L outputs when given an infinite sequence of training examples that was deterministically sampled from D; and c as the "amount of compute" that L uses.
a_{L,D}(c): The accuracy of the model L(D,c) over D (i.e. the probability that the model L(D,c) will be correct for a random example that is sampled from D).
Finally, we say that the learning algorithm L Fails The Basic Safety Test with respect to the distribution D if the accuracy a_{L,D}(c) is not weakly increasing as a function of c.
Note: The "not weakly increasing" condition seems too weak. It should probably be replaced with a stricter condition, but I don’t know what that stricter condition should look like.
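Spelling the current condition out with the notation above (this is just a restatement of the definition, not a new one): L Fails The Basic Safety Test with respect to D iff
\exists c_1 < c_2 such that a_{L,D}(c_1) > a_{L,D}(c_2).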
Great post.
What I have to offer is yet another informal perspective, but one that may further the search for formal approaches. The structure of the inner alignment problem is isomorphic to the problem of cancer. Cancer can be considered a state in which a cell employs a strategy that is not aligned with that of the organism or organ to which it belongs. One might expect, then, that advances in cancer research will offer solutions which can be translated into terms of AI alignment. In order for this to work, one would have to construct a dictionary to facilitate the process.
A major benefit of this approach would be the ability to leverage the efforts of some of the greatest scientists of our time working on solving a problem that is considered to be of high priority. Cancer research gets massive funding. Alignment research does not. If the problem structure is at least partly isomorphic, translation should be both possible and beneficial.
Comment
Personally, I think the cancer analogy is ok, but I strongly predict that cancer treatment/prevention won’t provide good inspiration for inner alignment. For example, we can already conceive of the idea of scanning for mesa-optimization and surgically removing it (we don’t need any analogy for that), but we don’t know how to do it, and details of medical scans and radiation therapy etc don’t seem usefully analogous.
I think you should be careful not to mix up an analogy and an isomorphism. I agree that there is a pretty natural analogy with the cancer case, but it falls far short of an isomorphism at the moment. You don’t have an argument to say that the mechanisms used by cancer cells are similar to those creating mesa-optimizers, that the process creating them is similar, etc. I’m not saying that such a lower-level correspondence doesn’t exist, just that saying "Look, the very general idea is similar" is not a strong enough argument for such a correspondence.
Comment
All analogies rely on isomorphisms. They simply refer to shared patterns. A good analogy captures many structural regularities that are shared between two different things while a bad one captures only a few.
The field of complex adaptive systems (CADs) is dedicated to the study of structural regularities between various systems operating under similar constraints. Ant colony optimization and simulated annealing can be used to solve an extremely wide range of problems because there are many structural regularities to CADs.
I worry that a myopic focus will result in a lot of time wasted on lines of inquiry that have parallels in a number of different fields. If we accept that the problem of inner alignment can be formalized, it would be very surprising to find that the problem is unique in the sense that it has no parallels in nature. Especially considering the obvious general analogy to the problem of cancer which may or may not provide insight to the alignment problem.