Another list of theories of impact for interpretability

https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability

Neel’s post on this is good. I thought I’d add my own list/framing. Somewhat rough. I see several somewhat different ways in which interpretability can be useful for AI safety. These place different demands on your interpretability: how efficient it is, how precisely it lets you identify exactly what your model is thinking (as opposed to just broad properties of its cognition), and how reliable it needs to be. Roughly in decreasing order of demandingness:

How do you know if your interpretability is good?

Ultimate goals:

Comment

https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability?commentId=25CkjiyrZpA8c6L93

Nice post! I really don’t know much about frontal lobotomy patients. I’ll irresponsibly speculate anyway.

I think "figuring out the solution to tricky questions" has a lot in common with "getting something tricky done in the real world", despite the fact that one involves "internal" actions (i.e., thinking the appropriate thoughts) and the other involves "external" actions (i.e., moving the appropriate muscles). I think they both require the same package of goal-oriented planning, trial-and-error exploration via RL, and so on. (See discussion of "RL-on-thoughts" here.) By contrast, querying existing knowledge doesn’t require that—as an adult, if you see a rubber ball falling, you instinctively expect it to bounce, and I claim that the algorithm forming that expectation does not require or involve RL.

I would speculate that frontal lobotomy patients lose their ability to BOTH "figure out the solution to tricky questions" AND "get something tricky done in the real world", because the frontal lobotomy procedure screws with their RL systems. But their existing knowledge can still be queried. They’ll still expect the ball to bounce. (If there are historical cases of people getting a frontal lobotomy and then proving a new math theorem or whatever, I would be very surprised and intrigued.)

It’s hard to compare this idea to, say, a self-supervised language model, because the latter has never had any RL system in the first place. (See also here.)

If we did have an agential AI that combined RL with self-supervised learning in a brain-like way, and if that AI had already acquired the knowledge and concepts of how to make nanobots or solve alignment or whatever, then yeah, maybe "turning off the RL part" would be a (probably?) safe way to extract that knowledge, and I would think that this is maybe a bit like giving the AI a frontal lobotomy. But my concern is that this story picks up after the really dangerous part—in other words, I think the AI needs to be acting agentially and using RL during the course of figuring out how to make nanobots or solve alignment or whatever, and *that’s* when it could get out of control. That problem wouldn’t be solved by "turn off RL". Turning off RL would prevent the AI from figuring out the things we want it to figure out.