Confused why a "capabilities research is good for alignment progress" position isn’t discussed more

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment

The predominant view on LW seems to be "pure AI capabilities research is bad, because capabilities progress alone doesn’t contribute to alignment progress, and capabilities progress without alignment progress means that we’re doomed". I understand the arguments for this position, but I have what might be called the opposite position. The opposite position seems at least as intuitive to me as the standard position, and it confuses me that it’s not discussed more. (I’m not confused that people reject it; I’m confused that nobody seems to even bring it up for the purpose of rejecting it.)

The opposite position is: "In order to do alignment research, we need to understand how AGI works; and we currently don’t understand how AGI works, so we need more capabilities research so that we have a chance of figuring it out. Doing capabilities research now is good because it’s likely to be slower now than it might be in some future where we have even more computing power, neuroscience understanding, etc. than we do now. If we successfully delayed capabilities research until a later time, we might get a sudden spurt of it and wouldn’t have the time to turn our increased capabilities understanding into alignment progress. Thus by doing capabilities research now, we buy ourselves a longer time period in which it’s possible to do more effective alignment research."

Some reasons I have for holding this position:

**1)** I used to do AI strategy research. Among other things, I looked into how feasible it is for intelligence to rapidly turn superintelligent, and what kinds of pathways there are into AI disaster. But a thought that I kept having when doing any such research was "I don’t know if any of this theory is of any use, because so much depends on what the world will be like when actual AGI is developed, and what that AGI will look like in the first place. Without knowing what AGI will look like, I don’t know whether any of the assumptions I’m making about it are going to hold. If any one of them fails to hold, the whole paper might turn out to be meaningless." Eventually, I concluded that I couldn’t figure out a way to make the outputs of strategy research useful for as long as I knew as little about AGI as I did. Then I went to do something else with my life, since it seemed too early to do useful AGI strategy research (as far as I could tell).

**2)** Compare the state of AI now to how it was before the deep learning revolution. It seems obvious to me that our current understanding of DL puts us in a better position to do alignment research than we were before the DL revolution. For instance, Redwood Research is doing research on language models because they believe that this research is analogous to some long-term problems. Assume that Redwood Research’s work will actually turn out to be useful for aligning superintelligent AI. Language models are one of the results of the DL revolution, so their work couldn’t have been done before that revolution. It seems that in a counterfactual world where the DL revolution happened later and the DL era was compressed into a shorter timespan, our chances of alignment would be worse, since that world’s equivalent of Redwood Research would have less time to do their research.

**3)** As a similar consideration, language models are already "deceptive" in a sense: asked something that it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I linked people to some of that nonsense, multiple people pointed out that InstructGPT’s answers sound like those of a student who’s taking an exam and is asked to write an essay about a topic they know nothing about, but tries to fake it anyway (that is, trying to deceive the examiner). Thus, even if you are doing pure capabilities research and just want your AI system to deliver people accurate answers, it is already the case that you can see a system like InstructGPT "trying to deceive" people. If you are building a question-answering system, you want to build one that people can trust to give accurate answers rather than impressive-sounding bullshit, so as a capabilities researcher you already have an incentive to work on identifying and stopping such "deceptive" computations. So it has already happened that

Comment

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=3LqcSELj77tWdTfFL

> In order to do alignment research, we need to understand how AGI works; and we currently don’t understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out.

I totally agree with this. Alas, "understand how AGI works" is not something which most capabilities work even attempts to do. It turns out that people can advance capabilities without having much clue what’s going on inside their magic black boxes, and that’s what most capabilities work looks like at this point.

Comment

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=Epgc6xrPjT4hyPyuv

I think we are getting some information. For example, we can see that token-level attention is actually quite powerful for understanding language and also images. We have some understanding of scaling laws. I think the next step is a deeper understanding of how world modeling fits in with action generation: how much can you get with just world modeling, versus world modeling plus reward/action combined? If the transformer architecture is enough to get us there, it tells us a sort of null hypothesis for intelligence—that the structure for predicting sequences by comparing all pairs of elements of a limited sequence is general.

Not rhetorically: what kind of questions do you think would better lead to understanding how AGI works?

I think teaching a transformer with an internal thought process (predicting the next tokens over a part of the sequence that’s "showing your work") would be an interesting insight into how intelligence might work. I thought of this a little while back, but discovered it’s also a long-standing MIRI research direction into transparency. I wouldn’t be surprised if Google took it up at this point.
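The "comparing all pairs of elements" structure mentioned above can be sketched in a few lines of numpy. This is a bare-bones illustration rather than any production attention implementation: the learned query/key/value projection matrices are omitted for brevity, so the input plays all three roles.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over a sequence of token vectors.

    Every position is scored against every other position (the all-pairs
    comparison described above), and each output is a score-weighted
    mix of the inputs.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])          # (seq, seq) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                               # mix inputs by attention weight

seq = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 tokens, 2 dims each
out = self_attention(seq)
print(out.shape)  # one mixed vector per position: (3, 2)
```

The point of the sketch is that the pairwise score matrix is all the sequence structure the mechanism assumes, which is what makes the "null hypothesis for intelligence" framing tempting.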

Comment

> Not rhetorically: what kind of questions do you think would better lead to understanding how AGI works?

Suppose I’m designing an engine. I try out a new design, and it surprises me—it works much worse or much better than expected. That’s a few bits of information. That’s basically the sort of information we get from AI experiments today.

What we’d really like is to open up that surprising engine, stick thermometers all over the place, stick pressure sensors all over the place, measure friction between the parts, measure vibration, measure fluid flow and concentrations and mixing, measure heat conduction, etc. We want to be able to open that black box, see what’s going on, and figure out where that surprising performance is coming from. That would give us far more information, and far more useful information, than just "huh, that worked surprisingly well/poorly". And in particular, there’s no way in hell we’re going to understand how an engine works without opening it up like that.

The same idea carries over to AI: there’s no way in hell we’re going to understand how intelligence works without opening the black box. If we can open it up, see what’s going on, and figure out where surprises come from and why, then we get orders of magnitude more information, and more useful information. (Of course, this also means that we need to figure out what things to look at inside the black box and how—the analogues of temperatures, pressures, friction, mixing, etc. in an engine.)
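The engine analogy can be made concrete with a toy sketch: a tiny two-layer network whose forward pass logs every intermediate stage into a `probes` dict, the analogue of sticking thermometers and pressure sensors everywhere. The network shapes and probe names are invented for illustration; real interpretability tooling is far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))  # a toy 4-in, 8-hidden, 2-out net

def forward(x, probes):
    """Run the toy net while logging every internal stage into `probes`.

    Instead of seeing only the final output (the "worked well/poorly" bit),
    we can inspect each internal quantity after the fact.
    """
    h_pre = x @ W1
    probes["pre_activation"] = h_pre
    h = np.maximum(h_pre, 0.0)             # ReLU
    probes["post_activation"] = h
    out = h @ W2
    probes["output"] = out
    return out

probes = {}
y = forward(rng.normal(size=(1, 4)), probes)
# Now we can ask where the behaviour comes from, e.g. which hidden units fired:
print(int((probes["post_activation"] > 0).sum()), "of 8 hidden units active")
```

The hard part the comment points at is not the logging itself but knowing *which* internal quantities are the meaningful "temperatures and pressures" of a large model.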

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=hYRL9WbSaiuwDrJch

It seems to me that this argument only makes sense if we assume that "more capabilities research now" translates into "more gradual development of AGI". That’s the real crux for me.

If that assumption is false, then accelerating capabilities is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed. I do agree that there is important alignment-related work that we can only do in the future, when AGI is closer. I don’t agree that there is nothing useful being done right now.

On the other hand, if that assumption is true (i.e. "more capabilities research now" translates into "more gradual development of AGI"), then there’s at least a chance that more capabilities research now would be net positive. However, I don’t think the assumption is true—or at least, not to any appreciable extent. It would only be true if you thought that there was a different bottleneck to AGI besides capabilities research. You mention faster hardware, but my best guess is that we already have a massive hardware overhang—once we figure out AGI-capable algorithms, I believe we already have the hardware that would support superhuman-level AGI with quite modest amounts of money and chips. (Not everyone agrees with me.) You mention "neuroscience understanding", but I would say that insofar as neuroscience understanding helps people invent AGI-capable learning algorithms, neuroscience understanding = capabilities research! (I actually think some types of neuroscience are mainly helpful for capabilities and other types are mainly helpful for safety; see here.) I imagine there being *small* bottlenecks that would add a few months today, but would only add a few weeks in a decade, e.g. future better CUDA compilers. But I don’t see any big bottlenecks, things that add years or decades, other than AGI capabilities research itself.

Even if the assumption is significantly true, I still would be surprised if more capabilities research now would be a good trade, because (1) I do think there’s a lot of very useful alignment work we can do right now (not to mention outreach, developing pedagogy, etc.), and (2) the most valuable alignment work is work that informs differential technological development, i.e. work that tells us which AGI capabilities work should be done at all, namely R&D that moves us down a path to maximally alignable AGI. But that’s only valuable to the extent that we figure things out before the wrong kind of capabilities research has already been completed. See Section 1.7 here.

> I’m not sure how this desire works, but I don’t think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined.

I do think that would be valuable to know, and am very interested in that question myself, but I think that figuring it out is mostly a different type of research than AGI capabilities research—loosely speaking, what you’re talking about looks like "designing the right RL reward function", whereas capabilities research mostly looks like "designing a good RL algorithm"—or so I claim, for reasons here and here.

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=g4g7KCzquuuDFEpMr

There might be a good argument for capability research being good if directed at making more Tool-AI instead of Agent-AI. In general I think there should be a push to redirect all research effort from Reinforcement Learning to things that are easier to use and control, like Language Models. And especially any system where the action space is manipulating the physical world should be made taboo. If the first AGI is a robotics system trained with RL and access to the physical world, we’re **significantly** more screwed than if we just get a really really good Language Model. Convincing capabilities researchers to switch to AI safety is hard, but just convincing them to focus on Tool-AI is a lot easier.

Comment

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=XfqH3aYANuDvye4Gf

> If the first AGI is a robotics system trained with RL and access to the physical world, we’re **significantly** more screwed than if we just get a really really good Language Model.

That doesn’t seem true at all? A generally intelligent language model sounds like a manipulation machine, which sounds plenty dangerous.

Comment

A generally intelligent language model is one which outputs simulated human text that very closely resembles the text in its dataset. The dataset of internet posts and books doesn’t include very many examples of successfully manipulating teams of AI researchers, so that strategy is not assigned a high likelihood by the model, even if the model might actually be capable of executing it. A language model just outputs the continuation of the query and then stops. This would still be unsafe at ultra-high capabilities because of the risk of mesa-optimizers, but we can control a weakly superhuman language model by placing it in a box and resetting its state for every new question we ask it.

Also, detecting human manipulation is one of the things that we might believe human brains to be exceptionally good at. We didn’t evolve to solve math or physics problems, but we certainly did evolve to deceive and detect deception in other humans. I expect that an AI with uniformly increasing capabilities across some set of tasks would become able to solve deep math problems much earlier than it would be able to manipulate hostile humans guarded against it.

This all means that a weakly superhuman language model would be a great tool to have, while still not ending the world right away. In contrast, an open-ended reward maximizer that uses RL operating on the physical world is a nightmare: it would just automatically modify itself to acquire all the capabilities that the general language model would have, if it believed it needed them to maximise reward.
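The "box it and reset its state for every new question" pattern can be sketched as a thin wrapper that discards all session state between questions. `query_model` here is a hypothetical stand-in for whatever serves the underlying model, not a real API.

```python
def query_model(prompt):
    # Hypothetical placeholder; a real system would call the model here.
    return f"answer to: {prompt}"

class BoxedModel:
    """Holds per-session state only; nothing outlives the session."""
    def __init__(self):
        self.history = []

    def ask(self, question):
        self.history.append(question)
        return query_model("\n".join(self.history))

def answer(question):
    # Each call builds a brand-new BoxedModel, answers once, and discards it,
    # so nothing the model was told in one question carries over to the next.
    return BoxedModel().ask(question)

print(answer("What is 2+2?"))
print(answer("What did I just ask you?"))  # fresh state: no memory of the above
```

The containment claim rests entirely on the reset: within a session the model accumulates context, but across sessions there is no channel for state to persist.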

Comment

> A generally intelligent language model is one which outputs simulated human text that very closely resembles the text in its dataset.

What exactly makes it "general" then? What’s the difference between a general language model and a non-general language model?

Comment

In some sense, current language models are already general, given their wide breadth. The crucial part is being human-level or weakly superhuman: for instance, such a model should be able to generate a physics textbook, or generate correct science papers given only the abstract as a prompt. Novel scientific research is where I’d draw the line to define "impactful" language models.

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=f7LgydqSpk4Zn5X9g

> It seems obvious to me that our current understanding of DL puts us in a better position to do alignment research than we were before the DL revolution.

Not at all obvious. I think we barely get insight, at least so far, from DL.

More broadly, capabilities research can be strategically-relevantly different from other capabilities research. E.g., capability research that is published, or likely will be published, adds to the pile of stuff that arbitrary people can use to make AGI. Capability research that will be kept private has much less of this problem. Capability research can be more or less "about" understanding AGI in a way that leads to understanding how to align it, vs. understanding AGI in a way that leads to being able to make it (whether FAI or UFAI). For example, one could pour a bunch of research into building a giant evolution simulator with a rich environment, heuristics for skipping ahead, etc. This is capabilities research that seems to me not super likely to go anywhere, but if it does go anywhere, it seems more likely to lead to AGI that’s opaque and unalignable by strong default; and even if transparency-type stuff can be bolted on, the evolution-engineering itself doesn’t help very much with doing that.

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=HhE8jHgBPniyShkbN

I tend to value a longer timeline more than a lot of other people do. I guess I see EA and AI Safety as setting up powerful idea machines that get more powerful when they are given more time to gear up. A lot more resources have been invested into EA field-building recently, but we need time for these investments to pay off. At EA London this year, I got the sense that AI Safety movement building is only now becoming its own thing; and of course it’ll take time to iterate to get it right, then time for people to pass through the programs, then time for them to have a career. I suspect the argument that we need more capabilities to make progress might have been stronger earlier in the game, but now that we already have powerful language models, there’s a lot that we can do without needing AI to advance any further.

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=ZvA3HJ6EumB3oRWTa

My sense is that Anthropic is somewhat oriented around this idea. I’m not sure if this is their actual plan or just some guesswork I read between the lines, but I vaguely recall something like "develop capabilities that you don’t publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities, which gives you some lead time to inspect them via interpretability techniques and practice alignment at various capability scales". (I may have just made this up while trying to steelman them to myself.)

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=CePTqicSSgzdfzzqW

> It seems that in a counterfactual world where the DL revolution happened later and the DL era was compressed into a shorter timespan, our chances of alignment would be worse since that world’s equivalent of Redwood Research would have less time to do their research.

It seems to me that counterfactually changing the date of the start of the Deep Learning revolution has two impacts: it shortens or lengthens the Deep Learning era, and it accelerates or decelerates the arrival of AGI. I.e., if you could have magically gotten Deep Learning to happen earlier, we would have had a longer time in the DL era, because there would be more time where people are using the DL paradigm while there is less compute to do it with, and more time for us to learn about how Deep Learning works. But it also means there are more researcher-hours going into finding DL techniques, which overall probably speeds up AGI arrival times.

It seems like a (the?) crux here is which of these impacts predominates: how much additional safety progress do you get from marginal knowledge of AI paradigms, vs. how much additional safety progress do you get from additional years to work on the problem? Making up some numbers: would we prefer to have another 10 years to work on the problem, in which it is only in the final 2 that we get to see the paradigm in which AGI will be built? Or would we prefer to have 6 years to work on the problem, during all of which we have access to the paradigm that will build AGI?

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=ein2LqJ5fZzMXJPho

I can see the argument of capabilities vs. safety both ways. On the one hand, by working on capabilities, we may get some insights. We could figure out how much data is a factor, and what kinds of data are needed. We could figure out how long-term planning emerges, and try our hand at inserting transparency into the model. We could figure out whether the system will need separate modules for world modeling vs. reward modeling. On the other hand, if intelligence turns out to be not that hard, and all we need to do is train a giant decision transformer… then we have major problems.

I think it would be great to focus capabilities research into a narrower space, as Razied says. My hunch is that a giant language model by itself would not go foom, because it’s not really optimizing for anything other than predicting the next token. It’s not even really aware of the passage of time. I can’t imagine it having a drive to, for example, make the world output only a single word forever. I think the danger would be in trying to make it into an agent.

I also think that there must be alignment work that can be done without knowing the exact nature of the final product. For example, learning the human value function, whether it comes from a brain-like formulation or inverse RL. I am also curious whether there has been work on trying to find a "least bad" non-degenerate value function, i.e. one that doesn’t kill us, torture us, or tile the universe with junk, even if it does not want what we want perfectly. I think relevant safety work can always take the form of "suppose current technology scaled up (e.g. a decision transformer) could go foom; what should we do right now that could constrain it?" There is some risk that future advancements could be very different and work done at this stage is not directly applicable, but I imagine it would still be useful somehow. Also, my intuition is that we could always wonder what’s the next step in capabilities, until the final step, and we may not know it’s the final step.

One thing you have to admit, though: capabilities research is just plain exciting, probably on the same level as working on the Manhattan Project was. I mean, who doesn’t want to know how intelligence works?

https://www.lesswrong.com/posts/EzAt4SbtQcXtDNhHK/confused-why-a-capabilities-research-is-good-for-alignment?commentId=jLHY4wtC3A43umb2s

I think the desire works because most honest people know that if they give a good-sounding answer that is ultimately meaningless, no benefit will come of it. The asker may eventually stop asking questions, knowing the answers are always useless. It’s a matter of estimating future rewards from building relationships. Now, when a human gives advice to another human, most of the time it is also useless, but not always. Also, it tends not to be straight-up lies. Even in the useless case, people still think there is some utility in there: for example, having the person think of something novel, or giving them a chance to vent without appearing to talk to a brick wall.

To teach a GPT to do this, maybe there would have to be some reward signal. How to do it with pure language modeling, I’m not sure. Maybe you could continue to train it on examples of its own responses, each followed by the interviewer’s later response indicating whether its advice was true or not. With enough of these sessions, perhaps you could run the language model and have it try to predict the human response, and see what it thinks of its own answers, haha.
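The feedback loop speculated about above might start from data shaped something like the following sketch. The record format and field names are invented for illustration; this is just one way the "answer plus later human verdict" transcripts could be serialized into language-model training text.

```python
# Hypothetical transcripts: the model's own answers paired with the human's
# later verdict on whether the advice turned out to be useful.
sessions = [
    {"question": "Should I take the job?",
     "model_answer": "Yes, the growth opportunity outweighs the pay cut.",
     "human_verdict": "This advice turned out to be useful."},
    {"question": "Will it rain tomorrow?",
     "model_answer": "Certainly not.",
     "human_verdict": "This advice turned out to be wrong."},
]

def to_training_text(session):
    """Serialize one session so the verdict appears after the answer.

    A language model trained to predict such text must learn to anticipate
    the human's reaction to its own answers, which is the mechanism the
    comment gestures at.
    """
    return (f"Q: {session['question']}\n"
            f"A: {session['model_answer']}\n"
            f"Verdict: {session['human_verdict']}")

corpus = "\n\n".join(to_training_text(s) for s in sessions)
print(corpus)
```

Whether pure next-token training on such transcripts would actually instill the desired honesty, rather than just the ability to predict verdicts, is exactly the open question the comment raises.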