Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

https://www.lesswrong.com/posts/FrFZjkdRsmsbnQEm8/interpretability-s-alignment-solving-potential-analysis-of-7

Summary

This post explores the extent to which interpretability is relevant to the hardest, most important parts of the AI alignment problem (property #1 of High-leverage Alignment Research[1]). First, I give an overview of the four important parts of the alignment problem (following Hubinger[2]): outer alignment, inner alignment, training competitiveness and performance competitiveness (jump to section). Next, I discuss which of them is "hardest", taking the position that it is inner alignment (if you have to pick just one), and also that it's hard to find alignment proposals which simultaneously address all four parts well.

Then I move on to exploring how interpretability could impact these four parts of alignment. The primary vehicle for this exploration is imagining and analyzing seven best-case scenarios for interpretability research (jump to section). Each of these scenarios represents a possible endgame story for technical alignment, hinging on one or more potential major breakthroughs in interpretability research. The scenarios' impacts on alignment vary, but usually involve solving inner alignment to some degree, and then indirectly benefiting outer alignment and performance competitiveness; impacts on training competitiveness are more mixed.

Finally, I discuss the likelihood that interpretability research could contribute to unknown solutions to the alignment problem (jump to section). This includes examining interpretability's potential to lead to breakthroughs in our basic understanding of neural networks and AI, to deconfusion research, and to paths to solving alignment that are difficult to predict or otherwise not captured by the seven specific scenarios analyzed.

Tips for navigating this long post! If you get lost scrolling through this post on mobile, consider reading on desktop for two reasons: 1) to take advantage of LessWrong's convenient linked outline feature that appears in the sidebar, and 2) to be able to glance at the footnotes and posts that I link to just by hovering over them.

Acknowledgments

Lots of people greatly improved this post by providing insightful discussions, critical points of view, editing suggestions and encouraging words both before and during its writing. Many thanks in particular to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy, the EleutherAI Interpretability Reading Group, Justis Mills (along with LessWrong's amazing free editing service!) and Andrew McKnight for all their help. Thanks also to the AGI Safety Fundamentals Curriculum, an excellent course I learned a great deal from in the lead-up to writing this post, and for which I started this sequence as my capstone project.

What are the hardest and most important parts of AI alignment?

After several days of research and deliberation, I concluded[3] that the most important parts of alignment are well-stated in Hubinger (2020)[2]:

How does interpretability impact the important parts of alignment?

Interpretability cannot be a complete alignment solution in isolation; it must always be paired with another alignment proposal or AI design. I used to think this made interpretability somehow secondary or expendable. But the more I have read about various alignment approaches, the more I've seen that one or another is stuck on a problem that interpretability could solve. It seems likely to me that interpretability is necessary, or at least could be instrumentally very valuable, for solving alignment. For example, if you look closely at Hubinger (2020)[2], every single one of the 11 proposals relies on transparency tools in order to become viable.[4]

So even though interpretability cannot be an alignment solution in isolation, as we'll see, its advancement does have the potential to solve alignment. This is because in several different scenarios which we'll examine below, advanced interpretability has large positive impacts on some of alignment components #1-4 listed above. Usually this involves interpretability solving all or part of inner alignment for some techniques. Its benefits for outer alignment and performance competitiveness are usually indirect, in the form of addressing inner alignment problems for one or more techniques that conceptually have good outer alignment properties or performance competitiveness, respectively. It's worth noting that interpretability methods sometimes put additional strain on training competitiveness. We'll examine all of this much more closely in the Interpretability Scenarios with Alignment-Solving Potential section below.

Other potentially important aspects of alignment scarcely considered here

This post largely assumes that we need to solve prosaic AI alignment. That is, I assume that transformative AI will come from scaled-up versions of systems not vastly different from today's deep learning ML systems. Hence we mostly don't consider non-prosaic AI designs. I also don't make any attempt to address the embedded agency problem. (However, Alex Flint's The ground of optimization, referenced later on, does seem to have bearing on this problem.)

There are important AI governance and strategy problems around coordination, and important misuse risks to consider if aligned advanced AI is actually developed. Neel Nanda's list of interpretability impact theories also mentions several theories around setting norms or cultural shifts. I touch on some of these briefly in the scenarios below, but I don't attempt to cover them comprehensively. Primarily, in this sequence, I am exploring a world where technical research can drive us toward AI alignment, with the help of scaled-up funding and talent resources as indicated in the Alignment Research Activities Question[5].

Interpretability Scenarios with Alignment-Solving Potential

In attacking the Alignment Research Activities Question[5], Karnofsky (2022)[6] suggests 'visualizing the "best case"' for each alignment research track examined. In the case we're examining, that means the best case for interpretability. I think the nature of interpretability lends itself to multiple "best case" and "very good case" scenarios, perhaps more so than many other alignment research directions. So I tried to think of ambitious milestones for interpretability research that could produce game-changing outcomes for alignment.

This is not an exhaustive list. Further investigation: Additional scenarios worth exploring discusses a few more potentially important scenarios, and even more may come to light as others read and respond to this post, and as we continue to learn more about AI and alignment. There are also a few scenarios I considered but excluded from this section because I didn't find that any potential endgames for alignment followed directly from them (see Appendix 2: Other scenarios considered but lacked clear alignment-solving potential).

Some of the scenarios below may also be further developed as an answer to one of the other questions from Karnofsky (2022)[6], i.e. "What's an alignment result or product that would make sense to offer a $1 billion prize for?" The list of scenarios progresses roughly from more ambitious/aspirational to more realistic/attainable, though in many cases it is difficult to say which would be harder to attain.

Why focus on best-case scenarios? Isn’t it the worst case we should be focusing on?

It is true that AI alignment research aims to protect us from worst-case scenarios. However, Karnofsky (2022)[6] suggests, and I agree, that envisioning and analyzing the best-case scenarios of each line of research is important to help us learn: "(a) which research tracks would be most valuable if they went well", and "(b) what the largest gaps seem to be [in research] such that a new set of questions and experiments could be helpful."

Next we'll look at a few more background considerations about the scenarios, and then we'll dive into the scenarios themselves.

Background considerations relevant to all the scenarios

In each of the scenarios below, I'll discuss specific impacts we can expect from that scenario. In these impact sections, I'll discuss general impacts on the four components of alignment presented above. I also consider in more depth how each scenario impacts several specific robustness and alignment techniques. To keep the main text of this post from becoming too lengthy, I have placed this analysis in Appendix 1: Analysis of scenario impacts on specific robustness and alignment techniques, and I link to the relevant parts of that appendix throughout the scenario analysis below. The appendix is incomplete, but it may be useful if you are looking for more concrete examples to clarify any of these scenarios.

In each of the scenarios, I'll also discuss specific reasons to be optimistic or pessimistic about their possibility. But there are also reasons which apply generally to all interpretability research, including all of the scenarios considered below. In the rest of this section, I'll go over those generally-applicable considerations, rather than duplicate them in every scenario.

Reasons to think interpretability will go well with enough funding and talent

Reasons to think interpretability won’t go far enough even with lots of funding and talent

Scenario 1: Full understanding of arbitrary neural networks

What is this scenario?

This scenario is the holy grail of interpretability research: the state of interpretability is so advanced that we can fully understand any artificial neural network in a reasonably short amount of time. Neural networks are no longer opaque or mysterious. We effectively have comprehensive mind-reading abilities on any AI where we have access to both the model weights and our state-of-the-art transparency tools.

Note for the impatient skeptic: If you're finding this scenario too far-fetched, don't give up just yet! The scenarios after this one get significantly less "pie in the sky", though they're still quite ambitious. This is the most aspirational scenario for interpretability research I could think of, so I list it first. I think it's not impossible, and still useful to analyze. But if your impatience and skepticism are getting overwhelming, you are welcome to skip to Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs.

What does it mean to "fully understand" a neural network? Chris Olah provides examples of three ways we could operationalize this concept in the Open Phil 2021 RFP:

Expected impacts on alignment

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs

What is this scenario?

In this scenario, we don't necessarily achieve a full and timely understanding of everything happening inside of neural networks. But interpretability does advance to the point where it grants us two key abilities:

Expected impacts on alignment

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Scenario 3: Reliable lie detection

What is this scenario?

In this scenario, reliable lie detection is developed, such that we can tell 100% of the time when an AI is lying through natural language. One path to realizing this scenario is that, after studying many neural networks in deceptive situations[12], we discover neural activity signatures that are reliably present when an AI is lying. I'll refer to this as the "neural tell" for lying.

What might this neural tell look like? When an AI is being honest about things it knows or has seen, we'd expect to see activity in some part of its neural network corresponding to facts about the world. If an AI is lying, on the other hand, we might expect to see additional activity in some kind of "hypothetical world model" circuitry, which might otherwise be activated when an AI is exploring something new or writing fiction, but not when it's honestly discussing high-confidence beliefs that it holds. It probably won't be exactly like this, but the hope in this scenario is that there is some kind of reliably present neural activity when an AI is lying.

Here, lying is considered to be only one class of deception that AIs may be capable of. But we only need to reliably detect lies, rather than all kinds of deception, in order to unlock valuable potential impacts, as we'll see below. This scenario does, of course, depend on the AI having natural language capabilities. It also assumes that advanced AI will have coherent beliefs. This scenario is similar to the one from Neel Nanda's list titled "4. Auditing for deception".

Empirically honest AI

If we can train a model which always passes our lie detector test, then we have (empirically) honest AI. With advanced honest AI, there are a number of powerful alignment-related abilities we gain. For example, we can ask the AI what its goals or driving heuristics are, and be confident that any answers it provides are honest.
This does assume that the AI knows what its goals or heuristics are, but that doesn't seem like an unreasonable expectation for advanced AI. If the AI can't articulate its high-level goals for some reason, we can ask it questions about how it would behave in hypothetical situations to start to build up a picture of its motivations and expected behavior (credit to Owain Evans' Truthful and honest AI for this idea).

Just because our AI is honest, that doesn't mean it can't make mistakes, be negligent or have inconsistent beliefs and motivations. The AI could also provide misleading answers to questions: answers that are technically true but designed to misdirect us. To use such an AI effectively, very careful questioning might be necessary. We'd need to systematically ask it questions and follow-up questions from many different angles, checking critically for multiple interpretations of any word that could change the meanings of answers provided. This would be a tedious process, but it could probably be mostly automated. Perhaps a larger concern, though, is that an AI under lie detection still may not be forthcoming about what it knows. We'll explore this issue in the next section.

Myopia, goals and addressing the ELK in the room

Just because we have a reliable lie detector and an honest AI doesn't mean it will tell us everything it knows. It could still refuse to answer questions sometimes, perhaps even on crucial matters. How can we reliably get the information we care about out of our AI? This is the Eliciting Latent Knowledge (ELK) problem.[13] Having a reliable lie detector doesn't directly solve ELK, but it may provide the means for such a solution. This becomes easy to see using a technique such as approval-directed amplification, debate or market making (see Overview of techniques to be analyzed in Appendix 1 if you're unfamiliar with any of these techniques). I'll use debate to make this concrete.

Let's imagine we ask our lie-detector-verified honest AI what its primary objective or terminal goal is. There are a few different ways that this could go:

We can observe a few interesting things about the above debates. One is that our reliable lie detector has turned out to be a reliable myopia verification tool, because the myopia debate above looks clearly different from either way the non-myopia debate could go. This means that the present scenario subsumes Scenario 4: Reliable myopia verification. If our AI is myopic and under effective lie detection, then it will always have an incentive to answer the question honestly. So if you can train a myopic question-answering AI in this scenario, then you have a solution to ELK, and you have also realized Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (strong version), having access to the AI's precise terminal and instrumental goals.

We don't know whether possibility 1 or 2 above will turn out to be what non-myopic debaters do. If it's Possibility 1 on the terminal goal question, then we have effectively read off the AI's goal, and this scenario subsumes Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (weak version). If it's always Possibility 1 for any question, then this scenario is also a solution to ELK. But for non-myopic AI, Possibility 2 might very well happen at least some of the time. In that case we don't have an ELK solution (we can't elicit any knowledge we want from the AI), and we also don't know what the AI's terminal goal is. We still have an AI whose statements we can verify as honest or not; however, having a long-term planning AI with unknown goals around makes me very nervous and seems rather risky. I would probably recommend retraining such a model until Possibility 1 or a myopic AI is attained.
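To make the "neural tell" idea more concrete, here is a minimal sketch of how one might begin looking for such a signature with today's tools: fit a simple linear probe on a model's hidden activations over examples labeled as honest or dishonest. Everything here is illustrative; the labeled dataset, the choice of layer, and the assumption that lying is linearly decodable from activations are my assumptions for the sketch, not established results, and this is far short of the 100%-reliable detector the scenario requires.

```python
# Minimal illustrative sketch (not an established method): probe hidden activations
# for a lying "tell". Assumes we already have (hypothetical) labeled activation data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_lie_probe(activations: np.ndarray, is_lie: np.ndarray) -> LogisticRegression:
    """Fit a linear probe mapping hidden activations -> P(lying).

    activations: (n_examples, hidden_dim) activations from some chosen layer.
    is_lie:      (n_examples,) binary labels from a curated honesty dataset.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, is_lie, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # The scenario requires near-perfect reliability, robust to distribution shift
    # and to the model optimizing against the detector; held-out accuracy on a
    # fixed dataset is only a weak starting point.
    print("held-out accuracy:", probe.score(X_test, y_test))
    return probe
```

A probe like this is nowhere near the guarantee the scenario calls for, but it illustrates the general shape of the research direction.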

Expected impacts on alignment

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Scenario 4: Reliable myopia verification

What is this scenario?

A prerequisite for deceptive alignment is that an AI be non-myopic. That is, the AI must be concerned not just about the present training step and near future, but also about a long-term future post-deployment.

In deceptive alignment, an AI is optimizing for an objective which is a proxy for the base objective of its training process. In addition to the proxy objective which is its terminal goal, it also has a model of the base objective and an awareness that it is in some kind of training process. The AI realizes that if it doesn't score well on the base objective, then it will be modified. Acting rationally toward the instrumentally convergent goals of self-preservation and goal-content integrity, it pretends to be aligned to the base objective in the short term. This is because it's anticipating a future when it will be deployed, at which point it can pursue its primary goal into the far future and obtain a very high total reward.

Along with scenarios 1-3, verifying that an AI is myopic is another promising route toward resolving deceptive alignment. Myopic agents may also have other beneficial properties. For example, since they don't pursue instrumentally convergent sub-goals, they avoid many of the dangers of long-term planning agents. Approval-directed myopic agents are also easy to "course correct", since they're only interested in a short-term reward; future approval can steer them in a different direction if needed: "While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes." (comment from Rohin Shah)

There are various approaches to myopia and how to limit an agent's long-term planning.[18] Two prominent ideas are per-step myopia and per-episode myopia:[19]
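Roughly speaking, a per-step myopic agent only optimizes reward for the current step, while a per-episode myopic agent only optimizes reward within the current episode. The toy sketch below merely contrasts what each kind of agent would be optimizing; it is illustrative only and says nothing about how myopia would actually be verified from a model's internals.

```python
# Toy contrast of objectives only; not a myopia verification method.
def discounted_return(rewards, gamma=0.99):
    """Non-myopic objective: all future rewards count (discounted)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def per_episode_return(rewards, episode_len):
    """Per-episode myopia: only rewards within the current episode count."""
    return sum(rewards[:episode_len])

def per_step_return(rewards):
    """Per-step myopia: only the immediate reward counts."""
    return rewards[0]

rewards = [1.0, 1.0, 1.0, 10.0]        # a large payoff arrives after the episode ends
print(discounted_return(rewards))       # ~12.67 -> values the post-episode payoff
print(per_episode_return(rewards, 3))   # 3.0    -> ignores the post-episode payoff
print(per_step_return(rewards))         # 1.0    -> ignores everything but the current step
```

The hard part of this scenario is verifying, from a trained model's internals, that its learned objective actually has one of the latter two shapes rather than the first.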

Expected impacts on alignment

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Scenario 5: Locate the AI’s beliefs about its observations

What is this scenario?

This scenario is focused on the specific interpretability breakthroughs required to make relaxed adversarial training (see Hubinger 2019[10]) effective. The need for interpretability in this context was presented in the "The core problem: transparency" section of that post, and it was also outlined in Beth Barnes' Another list of theories of impact for interpretability. Reposting Beth's summary of this idea:

Expected impacts on alignment

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Scenario 6: Reliable detection of human modeling

What is this scenario?

Some alignment proposals, notably STEM AI, are based on the idea that there are dangers in having an AI model humans. In this scenario, interpretability has made great strides in detecting the presence of human modeling in a deep learning model. Hence, by training against this detection, we can be confident that our AI will not be modeling humans in any significant way. The strong version of this scenario allows reading specific details about the AI's model of humans; a weaker version would be a simple binary detection of the presence of human modeling. Related reading:

Expected impacts on alignment

Since this scenario primarily impacts STEM AI, much of this section includes quotes from the corresponding alignment components analysis of STEM AI in Hubinger (2020)[2]:

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Scenario 7: Identify the AI’s beliefs about training vs. deployment

What is this scenario?

In this scenario, we find a way to reliably locate and interpret an AI's beliefs about training and deployment distributions in its neural network. There are a couple of interesting things we could do with such information:

Expected impacts on alignment

Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario

Basic research potential, deconfusion and "Scenario X"

The previous section considered interpretability's potential impacts on alignment by visualizing and analyzing seven specific scenarios. But what about important scenarios that are difficult to foresee, and which would really surprise us?[22] I believe this is another area which should be considered in evaluating any research activity's potential impact on alignment. In other words, what is the potential of a research activity to contribute to relevant basic research breakthroughs, to deconfusing machine learning systems, and the like?

For example, there is so little we understand about what it even means for an AI to "know" something or have "goals". Having a clearer understanding of such things could open up many possibilities for how to leverage them to the ends of AI alignment. I expect that future research could invalidate some of the alignment techniques I analyzed throughout the scenarios and in Appendix 1, and that new techniques will be proposed that we haven't considered here. But given the broad potential impacts we can see across the current landscape of alignment proposals, there is good reason to think interpretability will be valuable to future proposals as well.

Earlier, we mentioned a broad assumption in this post that we are in a world which depends on prosaic AI alignment. In this world, interpretability seems well positioned to produce the kind of basic research that benefits AI alignment: certainly, for increasing our basic understanding of neural networks, it helps to look inside them! However, what if it turns out we're living in a non-prosaic AI world, where the important AGI or other transformative AI systems will be created using an approach very different from the deep learning neural networks of today? In this case, doing interpretability on present-day deep learning systems could be much less valuable for alignment.

Further investigation

Additional scenarios worth exploring

We have the seven scenarios analyzed above, and Appendix 2 contains a few more that I looked at but which didn't have clear alignment-solving potential. Below are some additional interpretability scenarios which I think may be as valuable as the main seven of this post, but which I didn't have time to investigate.

Potential Scenario: Identifying a ‘truthfulness direction’ in activation space

Collin Burns is working on this idea, which is summarized in Beth Barnes’ post Another list of theories of impact for interpretability. She mentions that there is no thorough writeup of the idea yet. Once there is, it would be good to evaluate whether it merits being added to the list of scenarios with alignment-solving potential considered in the present post. From my basic understanding of the idea, it may end up being a means of achieving Scenario 3: Reliable lie detection, rather than a distinct scenario in itself. But I don’t claim to have a good understanding of the idea yet.
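As a purely illustrative sketch of what "a direction in activation space" could mean (and explicitly not a description of Burns' actual method, which, as noted above, has no thorough writeup yet): one naive approach would be to collect a model's activations on statements with known truth values and take the difference of class means as a candidate "truthfulness direction". The labeled statements, the layer choice, and the assumption that truthfulness is linearly represented are all assumptions made for the sake of illustration.

```python
# Illustrative only: a naive "difference of means" direction, not Burns' method.
import numpy as np

def candidate_truth_direction(acts_true: np.ndarray, acts_false: np.ndarray) -> np.ndarray:
    """Unit vector from the mean activation on false statements to the mean on true ones.

    acts_true, acts_false: (n, hidden_dim) activations at a chosen layer for
    statements with known truth values (hypothetical data).
    """
    direction = acts_true.mean(axis=0) - acts_false.mean(axis=0)
    return direction / np.linalg.norm(direction)

def truth_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """Project a new statement's activation onto the direction; higher = more 'true-like'."""
    return float(activation @ direction)
```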

Potential Scenario: Detecting the processes of deceptive alignment

Scenario 3: Reliable lie detection discussed a possible "neural tell" for lying. What if there's a neural tell for deceptive alignment? For an AI engaged in deceptive alignment, we'd expect it to be consuming additional compute compared to an AI robustly aligned on the same objective, all else being equal. This is because deceptive alignment requires at least two processes: 1) optimizing the base objective and 2) searching for ways to defect, whereas robust alignment, and even other kinds of non-deceptive pseudo-alignment, should only require process #1.[23] So another scenario that might have alignment-solving potential is one which specifically targets reliable detection of process #2. This may end up having considerable overlap with Scenario 5: Locate the AI's beliefs about its observations, but more research is recommended.

Potential Scenario: The ability to access an AI’s model of the world

From Hubinger (2019)[10]: "[W]hat facts about the world does [a model] M know that are going into producing its output? In particular, what implicit knowledge is M using that isn't represented in its explicit output?"

This is similar to Scenario 1: Full understanding of arbitrary neural networks, but may be different enough to warrant a separate scenario analysis.

Potential Scenario: Knowledge of what factors most highly impact an AI’s decisions

From Hubinger (2019)[10]: "In particular, if there is some condition α such that M only behaves acceptably if α is true, then we want to be able to know that M's output is highly dependent on α."

This may be the same as Donald Hobson's suggestion from a comment on an earlier draft of this post: "[To include a scenario] that highlights which training data is most influential in making its decisions. (So you can put it in a secure box. Tell it about it being in the box, and some details. Not give any problem where it needs to know this. Then if its decisions are ever influenced by it thinking about the design of box, shut it down)"

These proposals might be equivalent and/or largely covered by Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs, but I'm not sure yet.
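On the "which training data most influenced this decision" framing, the closest present-day analogue I'm aware of is gradient-based influence estimation (e.g. TracIn-style heuristics). The sketch below is only a rough illustration of that general idea under strong simplifying assumptions (a single checkpoint, placeholder model, loss and data); it is not a method for detecting the kind of condition α discussed above.

```python
# Rough TracIn-style heuristic at a single checkpoint: the influence of a training
# example on a particular decision is approximated by the dot product of loss
# gradients. Model, loss function and example data are placeholders.
import torch

def grad_vector(model, loss_fn, x, y):
    """Flattened gradient of the loss on one example w.r.t. model parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])

def influence(model, loss_fn, train_example, test_example):
    """Higher values suggest this training example pushed the model toward this decision."""
    g_train = grad_vector(model, loss_fn, *train_example)
    g_test = grad_vector(model, loss_fn, *test_example)
    return torch.dot(g_train, g_test).item()
```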

Scenario paths and probabilities

The section Interpretability Scenarios with Alignment-Solving Potential above provides a fairly thorough analysis of what the seven scenarios are, their expected impacts on alignment, and reasons to be optimistic and pessimistic about each one. To more comprehensively evaluate interpretability for property #1 of High-leverage Alignment Research[1], and as a target for large investments of capital and/or talent, it would also be useful to consider the paths and intermediate steps toward realizing each of these scenarios.[24] We would also like probabilities for the likelihood of being able to achieve each scenario and its intermediate steps. It may then be possible to consider all the scenario probabilities together to form an overall probability estimate of interpretability research going well, given enough funding and talent.

I am considering doing this research for a future post in this sequence. Part of why I haven't done it yet is that, while I received a lot of great feedback on the draft of this post, I expect it may make sense to revise or update the list of scenarios based on the feedback that comes in after publication. Probability estimates are quite sensitive to the specific details of a scenario, so it makes sense to wait until both the overall list of scenarios and the parameters of each scenario within it are fairly stable.

Analyze partially realized scenarios and combinations

A lot of the scenarios above are written assuming perfection of some interpretability technique (perfect lie detection, reliable myopia verification, etc.). Is it possible to get sufficient benefits out of only partially realizing some of these scenarios? What about combinations of partial scenarios, e.g. good-but-imperfect lie detection (partial Scenario 3) combined with human modeling detection (Scenario 6)? It would be valuable to know if there are visible paths to alignment with only partial progress towards the scenarios above, as that may be more achievable than realizing 100% reliability of these interpretability techniques.[25]
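As a toy illustration of why partially realized scenarios might still combine usefully: if two imperfect detectors fail for mostly unrelated reasons, their combined miss rate can be much lower than either alone. The independence assumption below is doing all of the work, and arguing for (or measuring) it would be the real substance of such an analysis; the numbers are made up.

```python
# Toy arithmetic only: combining two imperfect detectors, *assuming* their failures
# are independent (a strong assumption) and ignoring false positives entirely.
def combined_catch_rate(p_detect_a: float, p_detect_b: float) -> float:
    """Probability that at least one detector fires, under independent misses."""
    return 1 - (1 - p_detect_a) * (1 - p_detect_b)

print(combined_catch_rate(0.90, 0.80))  # 0.98 -> a problem slips through only if both fail
```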

Analyze scenario impacts on Amplification + RL techniques

Proposals #10 and #11 from Hubinger (2020)[2] involve using a hybrid approach of amplification and RL. While Appendix 1: Analysis of scenario impacts on specific robustness and alignment techniques analyzes the impact of each scenario on many different techniques, this one wasn’t explored. But that was simply for lack of time, and it would be good to know more about how the scenarios in this post impact that approach.

Address suboptimality alignment

The seven scenarios in this post show many inner alignment issues that interpretability could address. However, one inner alignment issue that is not well addressed by this post is suboptimality alignment. (Neither is the closely related suboptimality deceptive alignment.)

I can see how some forms of suboptimality alignment are addressed in the scenarios. For example, an AI might have a misaligned terminal goal, but some errors in its world model cause it to coincidentally have aligned behavior for a period of time. In Scenario 2, we could catch this form of suboptimality alignment when we do the goal read-offs and see that its terminal goal is misaligned. But what about unpredictable forms of suboptimality alignment? What if an AI is aligned in training, but as it learns more during deployment, it has an ontological crisis and determines that the base objective isn't compatible with its new understanding of the universe?

How serious a risk is suboptimality alignment in practice, and how can that risk be mitigated? This is an important question to investigate, both for alignment in general and for better understanding the extent of interpretability's potential impacts on inner alignment.

Closing thoughts

In this post, we investigated whether interpretability has property #1 of High-leverage Alignment Research[1]. We discussed the four most important parts of AI alignment, and which of them seem to be the hardest. Then we explored interpretability's relevance to these areas by analyzing seven specific scenarios focused on major interpretability breakthroughs that could have great impacts on the four alignment components. We also looked at interpretability's potential relevance to deconfusion research and yet-unknown scenarios for solving alignment.

It seems clear that there are many ways interpretability will be valuable or even essential for AI alignment.[26] It is likely to be the best resource available for addressing inner alignment issues across a wide range of alignment techniques and proposals, some of which look quite promising from an outer alignment and performance competitiveness perspective.

However, it doesn't look like it will be easy to realize the potential of interpretability research. The most promising scenarios analyzed above tend to rely on near-perfection of interpretability techniques that we have barely begun to develop. Interpretability also faces serious potential obstacles from things like distributed representations (e.g. polysemanticity), the likely-alien ontologies of advanced AIs, and the possibility that those AIs will attempt to obfuscate their own cognition. Moreover, interpretability doesn't offer many great solutions for suboptimality alignment or training competitiveness, at least not that I could find yet.

Still, interpretability research may be one of the activities that most strongly exhibits property #1 of High-leverage Alignment Research[1]. This will become clearer if we can resolve some of the Further investigation questions above, such as developing more concrete paths to achieving the scenarios in this post and estimating the probabilities that we could achieve them. It would also help if, rather than considering interpretability just on its own terms, we could do a side-by-side comparison of interpretability with other research directions, as the Alignment Research Activities Question[5] suggests.

What’s next in this series?

Realizing any of the scenarios with alignment-solving potential covered in this post would likely require much more funding for interpretability, as well as many more researchers working in the field than there are today. For the next post in this series, I'll be exploring whether interpretability has property #2 of High-leverage Alignment Research[1]: "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers' unarticulated intuitions to too great a degree, etc.)"

Appendices

The Appendices for this post are on Google Docs at the following link: Appendices for Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios