I think Evan’s Clarifying Inner Alignment Terminology is quite clever; better-optimized than it may at first appear. However, I do think there are a couple of things which don’t work as well as they could:
- What exactly does the modifier "intent" mean?
    - Based on how "intent alignment" is defined (basically, the optimal policy of its behavioral objective would be good for humans), capability robustness is exactly what it needs to combine with in order to achieve impact alignment. However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans". In this case, capability robustness is not exactly what’s needed; instead, what I’ll provisionally call inner robustness (IE, strategies for achieving the mesa-objective generalize well) would be put in its place. (The two decompositions are written out side by side in the sketch just after this list.)
    - (I find myself flipping between these two views, and thereby getting confused.)
    - Furthermore, I would argue that the second alternative (making "intent alignment" about the mesa-objective) is more true to the idea of intent alignment. Making it about the behavioral objective turns it into a fact about the actual impact of the system, since "behavioral objective" is defined by looking at what the system actually accomplishes. But then, why the divide between intent alignment and impact alignment?
- Any definition where "inner alignment" isn’t directly paired with "outer alignment" is going to be confusing for beginners.
    - In Evan’s terms, objective robustness is basically a more clever (more technically accurate and more useful) version of "the behavioral objective equals the outer objective", whereas inner alignment is "the mesa-objective equals the outer objective".
    - (It’s clear that "behavioral" is intended to imply generalization, here—the implication of objective robustness is supposed to be that the objective is stable under distributional shift. But this is obscured by the definition, which does not explicitly mention any kind of robustness/generalization.)
    - By making this distinction, Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).
    - In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional—which could be an advantage, if this assumption isn’t so good!
    - However, although I find the decomposition insightful, I dread explaining it to beginners in this way. I find that I would prefer to gloss over objective robustness and pretend that intent alignment simply factors into outer alignment and inner alignment.
    - I also find myself constantly thinking as if inner/outer alignment were a pair, intuitively!
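To see the two readings side by side, here is a rough formalization. The notation is mine, purely as a reading aid: write $\pi^*_O$ for the optimal policy of objective $O$, $O_{\mathrm{beh}}(M)$ and $O_{\mathrm{mesa}}(M)$ for a model $M$'s behavioral and mesa-objective, and read each implication as an informal "these subgoals suffice" claim rather than a theorem:

$$
\begin{aligned}
\underbrace{\pi^*_{O_{\mathrm{beh}}(M)} \text{ good for humans}}_{\text{intent alignment (behavioral reading)}} \;\wedge\; \underbrace{M \text{ pursues } O_{\mathrm{beh}}(M) \text{ well off-distribution}}_{\text{capability robustness}} &\;\Rightarrow\; \text{impact alignment} \\[4pt]
\underbrace{\pi^*_{O_{\mathrm{mesa}}(M)} \text{ good for humans}}_{\text{intent alignment (mesa reading)}} \;\wedge\; \underbrace{M \text{ pursues } O_{\mathrm{mesa}}(M) \text{ well off-distribution}}_{\text{inner robustness}} &\;\Rightarrow\; \text{impact alignment}
\end{aligned}
$$

Which robustness notion is the "missing piece" depends entirely on which objective the intent-alignment clause refers to; that is the flip described above.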
My current proposal would be the following:
- Re-define "intent alignment" to refer to the mesa-objective.
    - Now, inner alignment + outer alignment directly imply intent alignment, provided that there is a mesa-objective at all (IE, assuming that there’s an inner optimizer).
    - This fits with the intuitive picture that inner and outer are supposed to be complementary!
- If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.
    - This is exactly what we need to pair with the new "intent alignment" in order to achieve impact alignment.
    - However, this is clearly a narrower concept than capability robustness (it assumes there is a mesa-objective).
This is a complex and tricky issue, and I’m eager to get thoughts on it. Relevant reading:
- Evan’s post on the topic.
- The post which discusses Evan’s as the "objective-focused approach", contrasting it with Rohin’s "generalization-focused approach". My proposal would make the two diagrams more different from each other. I’m also interested in trying to merge the diagrams or otherwise "bridge the conceptual gap" between the two approaches.

As a reminder, here are Evan’s definitions. Nested children are subgoals; it’s supposed to be the case that if you can achieve all the children, you can achieve the parent. (A rough formalization of this subgoal structure follows the list.)
- **Impact Alignment:** An agent is impact aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
    - **Capability Robustness:** An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
    - **Intent Alignment:** An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.
        - **Outer Alignment:** An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.
        - **Objective Robustness:** An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
            - **Inner Alignment:** A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
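Spelling out the "children suffice for the parent" structure in the same informal notation as before (again mine, just as a reading aid):

$$
\begin{aligned}
\text{capability robustness} \;\wedge\; \text{intent alignment} &\;\Rightarrow\; \text{impact alignment} \\
\text{outer alignment} \;\wedge\; \text{objective robustness} &\;\Rightarrow\; \text{intent alignment} \\
\text{inner alignment} &\;\Rightarrow\; \text{objective robustness}
\end{aligned}
$$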
So we split impact alignment into intent alignment and capability robustness; we split intent alignment into outer alignment and objective robustness; and we achieve objective robustness through inner alignment. Here’s what my proposed modifications do:
- (Impact) Alignment
    - Inner Robustness: An agent is inner-robust if it performs well on its ***mesa-objective*** even in deployment/off-distribution.
    - Intent Alignment: An agent is intent aligned if the optimal policy for its ***mesa-objective*** is impact aligned with humans.
        - Outer Alignment
        - Inner Alignment
"Objective Robustness" disappears from this, because inner+outer gives intent-alignment directly now. This is a bit of a shame, as I think objective robustness is an important subgoal. But I think the idea of objective robustness fits better with the generalization-focused approach:
- Alignment
    - Outer Alignment: For this approach, outer alignment is re-defined to be *only on-training-distribution* (we could call it "on-distribution alignment" or something).
    - Robustness (see the sketch after this list for the on-distribution/off-distribution contrast)
        - Objective Robustness
            - Inner Alignment
        - Capability Robustness
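To connect this diagram to the earlier definitions, here is a rough schematic in my own notation (not Rohin’s): write $R_D(\pi, O) = \mathbb{E}_{x \sim D}[O(\pi, x)]$ for how well policy $\pi$ scores on objective $O$ under distribution $D$, with $O_{\mathrm{outer}}$ the outer/base objective and $O_{\mathrm{beh}}$ the behavioral objective. Then, loosely:

$$
\begin{aligned}
\text{on-distribution alignment:}&\quad R_{D_{\mathrm{train}}}(\pi, O_{\mathrm{outer}}) \text{ is high (and } O_{\mathrm{outer}} \text{ is good for humans)} \\
\text{capability robustness:}&\quad R_{D_{\mathrm{deploy}}}(\pi, O_{\mathrm{beh}}) \text{ stays high} \\
\text{objective robustness:}&\quad O_{\mathrm{beh}} \text{ as exhibited on } D_{\mathrm{deploy}} \text{ still matches } O_{\mathrm{outer}}
\end{aligned}
$$

The dangerous quadrant of 2-D robustness is then capability robustness without objective robustness: competence generalizes off-distribution while the pursued objective does not.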
And it’s fine for there to be multiple different subgoal hierarchies, since there may be multiple paths forward.
I agree that we need a notion of "intent" that doesn’t require a purely behavioral notion of a model’s objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don’t have a good reason to believe this.)
I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior. If we actually want to understand "intent," we have to understand what the heck intentions and goals actually are in humans and what they might look like in advanced ML systems. However, I do think this is a very good point you raise about intent alignment (that it should correspond to the model’s internal goals, objectives, intentions, etc.), and the need to be mindful of which version we’re using in a given context.
Also, I’m iffy on including the "all inputs"/optimality thing (I believe Rohin is, too)… it does have the nice property that it lets you reason without considering e.g. training setup, dataset, architecture, but we won’t actually have infinite data and optimal models in practice. So, I think it’s pretty important to model how different environments or datasets interact with the reward/objective function in producing the intentions and goals of our models.
I don’t think this is necessarily a crux between the generalization- and objective-driven approaches—if intentional behavior requires a mesa-objective, then humans can’t act "intentionally." So we obviously want a notion of intent that applies to the messier middle cases of goal representation (between a literal mesa-objective and a purely implicit behavioral objective).
Comment
For example, I don’t put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.
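To make the contrast concrete, here is a toy sketch in Python. It is entirely my own illustration, not a formalism from this discussion: the event names, transition probabilities, payoffs, and learning rate are all made up.

```python
# A toy contrast (illustration only) between two ways of organizing cognition:
#   Style 1: a centrally represented utility function, with expectations
#            recomputed on demand.
#   Style 2: "approximately coherent expectations": an expected value stored
#            per event, nudged by experience and propagated for coherence.

# Hypothetical toy world: event -> list of (successor, probability).
TRANSITIONS = {
    "start":     [("shortcut", 0.5), ("long_road", 0.5)],
    "shortcut":  [("home", 0.8), ("mud", 0.2)],
    "long_road": [("home", 1.0)],
    "home":      [],   # terminal
    "mud":       [],   # terminal
}

# Style 1: utility is stored once, for terminal outcomes only; every
# expectation is recomputed from scratch by recursing through the model.
UTILITY = {"home": 1.0, "mud": -1.0}

def expected_utility(event):
    """Recompute E[U | event] on demand from the central utility function."""
    if not TRANSITIONS[event]:
        return UTILITY.get(event, 0.0)
    return sum(p * expected_utility(nxt) for nxt, p in TRANSITIONS[event])

# Style 2: the agent stores an expected value for *every* event, and its
# "cognition" consists of (a) moving values toward experienced payoffs and
# (b) propagating values of more distant events back to nearer ones.
values = {event: 0.0 for event in TRANSITIONS}
LEARNING_RATE = 0.2

def update_from_experience(event, experienced_payoff):
    """(a) Nudge the stored expectation toward what was actually experienced."""
    values[event] += LEARNING_RATE * (experienced_payoff - values[event])

def propagate_coherence():
    """(b) Keep expectations coherent: a non-terminal event's value should
    match the probability-weighted value of its successors."""
    for event, succs in TRANSITIONS.items():
        if succs:
            target = sum(p * values[nxt] for nxt, p in succs)
            values[event] += LEARNING_RATE * (target - values[event])

if __name__ == "__main__":
    # Experience the terminal events repeatedly, propagating after each step.
    for _ in range(50):
        update_from_experience("home", 1.0)
        update_from_experience("mud", -1.0)
        propagate_coherence()
    print("central utility answer for 'start':", expected_utility("start"))
    print("coherent-expectations answer      :", round(values["start"], 3))
```

The point of the contrast is where the evaluations live: style 1 stores utility once and recomputes expectations on demand, while style 2 stores an expectation per event and spends its cycles keeping that table approximately coherent; both converge to roughly the same number for "start" in this toy world.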
Is this related to your post An Orthodox Case Against Utility Functions? It’s been on my to-read list for a while; I’ll be sure to give it a look now.
Comment
Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)
(Meta: was this meant to be a question?)
Comment
- Talk at CHAI saying something like "daemons are just distributional shift" in August 2018, I think. (I remember Scott attending it.)
- Talk at FHI in February 2020 that emphasized a risk model where capabilities generalize but objectives don’t.
- Talk at SERI conference a few months ago that explicitly argued for a focus on generalization over objectives.

Especially relevant stuff other people have done that has influenced me:

- Two guarantees (arguably this should be thought of as the origin)
- 2-D Robustness (My views were pretty set by the time Evan wrote the clarifying inner alignment terminology post; it’s possible that his version that’s closer to generalization-focused was inspired by things I said, you’d have to ask him.)
Comment
I’ve watched your talk at SERI now. One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but it seems just about as fraught as the notion of mesa-objective:

- It requires approximately the same "magic transparency tech" as we need to extract mesa-objectives.
- Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable.

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think "acceptability" might look like? (By no means do I mean to say your view is crazy; I am just looking for your explanation.)
Comment
All of that made perfect sense once I thought through it, and I tend to agree with most of it. I think my biggest disagreement with you is that (in your talk) you said you don’t expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I’m optimistic that stuff like Vanessa’s InfraBayes could help here. Granted, there’s a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)
Comment
My central complaint about existing theoretical work is that it doesn’t seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loath to do. I don’t currently see how any of the alignment community’s tools address that complaint; for example, I don’t think the InfraBayes work so far is making an interesting assumption about reality. Perhaps future work will address this though?
Comment
InfraBayes doesn’t look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?" The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to *partially* describe it. This is an assumption about the world—not all worlds can be usefully described by partial models. However, it’s a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions. If it’s a good answer, it’s at least plausible that NNs work well for related reasons. But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs.
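For readers who haven’t looked at InfraBayes: a very crude caricature of the "partial models" idea (this is only the credal-set flavor, not the actual infra-Bayesian machinery) is that a hypothesis commits to some constraints on the world and stays agnostic about everything else, represented as the set $\mathcal{C}$ of all distributions consistent with those constraints, with guarantees stated in worst-case form:

$$ \underline{\mathbb{E}}_{\mathcal{C}}[u] \;=\; \inf_{\mu \in \mathcal{C}} \mathbb{E}_{x \sim \mu}\,[u(x)] $$

The "regularity assumption about reality" is then just that such partial hypotheses exist which are both usable and non-vacuous, i.e., whose worst-case guarantees are still worth acting on.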
Comment
Answer 1
I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn’t useful. My go-to would be "a world that checks what an InfraBayesian would expect, and does the opposite". This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it’s not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I’ll try to tell a better story.
Answer 2
I might be describing logically impossible (or at least uncomputable) worlds here, but here is my story: Solomonoff Induction captures something important about the regularities we see in the universe, but it doesn’t explain NN learning (or "ordinary human learning") very well, because NNs and humans mostly use very fast models which are clearly much smaller (in time-complexity and space-complexity) than the universe. (Solomonoff induction is closer to describing human science, which does use these very simple but time/space-complex models.) So there’s this remaining question of induction: why can we do induction in practice? (IE, with NNs and with nonscientific reasoning)

InfraBayes answers this question by observing that although we can’t easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models. This didn’t need to be the case. We could be in a universe in which you need to fully model the low-level dynamics in order to predict things well at all. So, a regularity which InfraBayes takes advantage of is the fact that we see multi-scale phenomena—that simple low-level rules often give rise to simple high-level behavior as well.

I say "maybe I’m describing logically impossible worlds" here because it is hard to imagine a world where you can construct a computer but where you don’t see this kind of multi-level phenomenon. Mathematics is full of partial-model-type regularities; so, this has to be a world where mathematics isn’t relevant (or, where mathematics itself is different).

But Solomonoff induction alone doesn’t give a reason to expect this sort of regularity. So, if you imagine a world being drawn from the Solomonoff prior vs a world being drawn from a similar InfraBayes prior, I think the InfraBayes prior might actually generate worlds more like the one we find ourselves in (ie, InfraBayes contains more information about the world). (Although actually, I don’t know how to "sample from an infrabayes prior"...)
"Usefully Describe"
What I Think You Should Think
I think you should think that it’s plausible we will have learning-theoretic ideas which apply directly to objects of concern, in the sense that, under some plausible assumptions about the world, we can argue for a learning-theoretic guarantee for some system we can describe, which theoretically addresses some alignment concern. I don’t want to strongly argue that you should think this will be competitive with NNs or anything like that. Obviously I prefer worlds where that’s true, but I am not trying to argue that. Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it’ll give rise to something competitive with NNs. I’m wondering if that’s a crux for your interest. Honestly, I don’t really understand what’s going on behind this remark:
Comment
Great, I feel pretty resolved about this conversation now.