On inner and outer alignment, and their confusion

https://www.lesswrong.com/posts/fkNEHeJja2B9yRLeT/on-inner-and-outer-alignment-and-their-confusion

There have been multiple posts discussing inner and outer alignment, and it seems that people are operating with slightly different definitions of these terms. This article aims to give a clear and general explanation of inner and outer alignment and to clarify some causes of confusion and disagreement.

Definitions

I will start with what I consider to be a reasonable definition of the inner and outer alignment problems:

Outer alignment problem: how do we design a test to determine whether or not an AI does what we want given a particular input? Other possible phrasings:

Comment

https://www.lesswrong.com/posts/fkNEHeJja2B9yRLeT/on-inner-and-outer-alignment-and-their-confusion?commentId=ARNBGDeE4MqjEdG9R

I like all of this post except for the conclusion. I think this comment shows that the definition of inner alignment requires an explicit optimizer. Your broader definition of inner misalignment is equivalent to generalization error or robustness, which we already have names for.

Comment

https://www.lesswrong.com/posts/fkNEHeJja2B9yRLeT/on-inner-and-outer-alignment-and-their-confusion?commentId=SquScfZrasmmn92QC

I currently see inner alignment problems as a superset of generalisation error and robustness. Furthermore, an AI being a mesa-optimiser with a misaligned objective can also be thought of as a generalisation error, since it means we haven't tested the AI in scenarios where its mesa-objective produces different behaviour from the base objective. The conclusion is meant to emphasise the possibility of extending the concept of inner misalignment to AIs that we do not model as optimisers. I am open to the claim that this is not useful and that we should only use the term when we think of the AI as an optimiser, in which case the definition involving mesa-objectives is sufficient.

Comment

"I currently see inner alignment problems as a superset of generalisation error and robustness."

What would you include as an inner alignment problem that isn't a generalization problem or robustness problem?

Comment

I think any inner alignment problem can be thought of as a kind of generalisation error (this wouldn't have happened if we had more data), including misaligned mesa-optimisers. So yes, you are correct: in my model they are different ways of looking at the same problem (in hindsight, "superset" was the wrong word to use).

Is your opinion that inner misalignment should only be used in cases where a mesa-optimiser can be shown to exist (which is the original definition, and the one stated in the comment you linked)? I agree that would also make sense. I was starting from the assumption that "that which is not outer misalignment should be inner misalignment", but I notice that Evan mentions problems that are neither (e.g. mis-generalisations when there are no mesa-optimisers). This way of defining things only works if you commit to seeing the AI as an optimiser, which is indeed a useful framing, but not the only one. However, based on your (and Evan's) comments, I do see how having inner alignment as a subset of things-that-are-not-outer-alignment also works.
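To make the "inner misalignment as generalisation error" framing in this thread concrete, here is a minimal, hypothetical sketch: a learned proxy ("mesa") objective coincides with the base objective on every training episode, so it passes any test drawn from the training distribution, but its behaviour diverges once the two are decoupled at deployment. The toy environment and names (`make_episode`, `base_objective`, `proxy_policy`) are illustrative assumptions of mine, not anything from the post or comments.

```python
# Minimal sketch (assumptions, not from the post): a proxy objective that matches
# the base objective on the training distribution but diverges off-distribution.
import random

random.seed(0)

def make_episode(train: bool):
    """Each episode has a goal tile and a green tile.
    In training they always coincide; at deployment they can differ."""
    goal = random.randint(0, 9)
    green = goal if train else random.randint(0, 9)
    return {"goal": goal, "green": green}

def base_objective(action, episode):
    """What we actually want: go to the goal tile."""
    return 1.0 if action == episode["goal"] else 0.0

def proxy_policy(episode):
    """A heuristic that happened to fit the training data:
    go to the green tile (a perfect proxy *in training*)."""
    return episode["green"]

def average_reward(train: bool, n=10_000):
    episodes = [make_episode(train) for _ in range(n)]
    return sum(base_objective(proxy_policy(e), e) for e in episodes) / n

print("train-distribution reward:", average_reward(train=True))   # 1.0
print("deployment reward:        ", average_reward(train=False))  # ~0.1
```

Under these toy assumptions the proxy scores perfectly in training and only about 0.1 at deployment, which is the sense in which the misalignment shows up purely as a generalisation failure: no test drawn from the training distribution can distinguish the proxy from the intended objective.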