[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!)

https://www.lesswrong.com/posts/BeQcPCTAikQihhiaK/intro-to-brain-like-agi-safety-11-safety-alignment-but-they


11.1 Post summary / Table of contents

In the previous post, I talked about "the alignment problem" for brain-like AGIs. Two points are worth emphasizing: (1) the alignment problem for brain-like AGIs is currently unsolved (just like the alignment problem for any other type of AGI), and (2) solving it would be a giant leap towards AGI safety. That said, "solving AGI alignment" is not *exactly* the same as "solving AGI safety". This post is about how the two may come apart, at least in principle. As a reminder of the terminology: "AGI alignment" means the AGI is trying to do what its designers had intended for it to try to do, while "AGI safety" is the broader goal of avoiding catastrophic accidents.

11.2 Alignment without safety?

This is the case where an AGI is aligned (i.e., trying to do things that its designers had intended for it to try to do), but still causes catastrophic accidents. How? One example: maybe, as designers, we didn’t think carefully about what we had intended for the AGI to do. John Wentworth gives a hypothetical example here: humans ask the AGI for a nuclear fusion power plant design, but they neglect to ask the follow-up question of whether the same design makes it much easier to make nuclear weapons. Another example: maybe the AGI is trying to do what we had intended for it to try to do, but it screws up. For example, maybe we ask the AGI to build a new, better successor AGI that is itself well-behaved and aligned. But the AGI messes up: it makes a successor AGI with the wrong motivations, and the successor gets out of control and kills everyone.

I don’t have much to say in general about alignment-without-safety. But I guess I’m modestly optimistic that, if we solve the alignment problem, then we can muddle our way through to safety. After all, if we solve the alignment problem, then we’ll be able to build AGIs that are sincerely trying to help us, and the first thing we can use them for is to ask them for help clarifying exactly what they should be doing and how, thus hopefully avoiding failure modes like those above.[3] That said, I could be wrong, and I’m certainly happy for people to keep thinking hard about the non-alignment aspects of safety.

11.3 Safety without alignment?

Conversely, there are various ideas for how to make an AGI safe without needing to make it aligned. They all seem hard or impossible to me. But hey, perfect alignment seems hard or impossible too. I’m in favor of keeping an open mind, and of using multiple layers of protection. I’ll go through some possibilities here (this is not a comprehensive list):

11.3.1 AI Boxing

No, not *that* kind of "AI boxing"!

*(Image from "Real Steel" (2011), a movie which incidentally had (I believe) a larger budget than the sum total that humanity has ever spent on long-term-oriented technical AGI safety research. More on the funding situation in Post #15.)*

The idea here is to put an AGI in a box, with no internet access, no actuators, etc. We can unplug the AGI whenever we want. Even if the AGI has dangerous motivations, who cares? What harm could it possibly do? Oh, umm, it could send out radio signals using its RAM. So we also need a Faraday cage. Hopefully there’s nothing else we forgot!

Actually, I am quite optimistic that people could make a leakproof AGI box if they really tried. I love bringing up Appendix C of Cohen, Vellambi, Hutter (2020), which has an awesome box design, complete with air-tight seals and Faraday cages and laser interlocks and so on. Someone should totally build that. When we’re not using it for AGI experiments, we can loan it to movie studios as a prison for supervillains.

A different way to make a leakproof AGI box is using homomorphic encryption. This has the advantage of being provably leakproof (I think), but the disadvantage of dramatically increasing the amount of compute required to run the AGI algorithm.

What’s the problem with boxing? Well, we made the AGI for a reason. We want to use it to do things. For example, a setup where the boxed AGI only ever answers questions through a text-only terminal, and humans review each answer before acting on anything, could be perfectly safe.
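To make that concrete, here’s a minimal sketch of what such a workflow might look like. (This is just an illustration; `run_agi_in_airgapped_box` is a hypothetical stand-in for whatever physically isolated hardware setup is used, along the lines of the box designs above.)

```python
from typing import Optional

def run_agi_in_airgapped_box(question: str) -> str:
    """Hypothetical stand-in for querying the physically isolated AGI
    (no network, no actuators, Faraday cage, etc.). Text in, text out."""
    raise NotImplementedError("placeholder for the air-gapped hardware interface")

def boxed_query(question: str) -> Optional[str]:
    """Ask the boxed AGI a question; a human reviews the answer before release."""
    answer = run_agi_in_airgapped_box(question)
    print("AGI's proposed answer:\n", answer)
    verdict = input("Human reviewer: release this answer? [y/N] ")
    return answer if verdict.strip().lower() == "y" else None
```

The point of the sketch is just that the only channel out of the box is text that a human has explicitly signed off on.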

11.3.2 Data curation

Let’s say we fail to solve the alignment problem, so we’re not sure about the AGI’s plans and intentions, and we’re concerned about the possibility that the AGI may be trying to trick or manipulate us. One way to tackle this problem is to ensure that the AGI has no idea that we humans exist and are running it on a computer. *Then* it won’t try to trick us, right? As one example along those lines, we can make a "mathematician AGI" that knows about the universe of math, but knows nothing whatsoever about the real world. See Thoughts on Human Models for more along these lines.

I see two problems:

11.3.3 Impact limits

We humans have an intuitive notion of the "impact" of a course of action. For example, removing all the oxygen from the atmosphere is a "high-impact action", whereas making a cucumber sandwich is a "low-impact action". There’s a hope that, even if we can’t really control an AGI’s motivations, maybe we can somehow restrict the AGI to "low-impact actions", and thus avoid catastrophe.

Defining "low impact" winds up being quite tricky. See Alex Turner’s work for one approach. Rohin Shah suggests that there are three desiderata that seem to be mutually incompatible: "objectivity (no dependence on [human] values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things)". If that’s right, then clearly we need to throw out objectivity. One place we may wind up is with AGIs that try to follow human norms, for example.

From my perspective, I find these ideas intriguing, but the only way I can see them working in a brain-like AGI is to implement them via the motivation system. I imagine that the AGI would follow human norms because it wants to follow human norms. So this topic is absolutely worth keeping in mind, but for my purposes, it’s not a separate topic from alignment, but rather an idea about what motivation we should be trying to put into our aligned AGIs.
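For concreteness, here’s a rough sketch in the spirit of Alex Turner’s attainable utility preservation (my loose paraphrase of the general idea, not his exact formulation): penalize an action in proportion to how much it changes the agent’s ability to achieve a set of auxiliary goals, compared to doing nothing. The auxiliary Q-functions, the no-op baseline, and the scaling constant below are all illustrative.

```python
def penalized_reward(reward, q_aux, state, action, noop, lam=0.1):
    """Primary reward for (state, action), minus an impact penalty.

    q_aux: list of Q-functions for auxiliary goals, each q(state, action) -> float.
    noop:  the "do nothing" action, used as a baseline.
    lam:   how heavily to penalize impact (illustrative value).
    """
    # Impact = average shift in attainable auxiliary value, relative to doing nothing.
    impact = sum(abs(q(state, action) - q(state, noop)) for q in q_aux) / len(q_aux)
    return reward - lam * impact
```

Note that a penalty like this doesn’t reference human values at all, which puts it on the "objective" side of the trade-off Rohin describes.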

11.3.4 Non-agentic (or "tool") AI

There’s an appealing intuition, dating back at least to this 2012 post by Holden Karnofsky, that maybe there’s an easy solution: just make AIs that aren’t "trying" to do anything in particular, but instead are more like "tools" that we humans can use. While Holden himself changed his mind and is now a leading advocate of AGI safety research, the idea of non-agentic AI lives on. Prominent advocates of this approach include Eric Drexler (see his "Comprehensive AI Services", 2019), and people who think that large language models (e.g. GPT-3) are on the path to AGI (well, not all of those people; it’s complicated[5]).

As discussed in this reply to the 2012 post, we shouldn’t take for granted that "tool AI" would make all safety problems magically disappear. Still, I suspect that tool AI would help with safety for various reasons. I’m skeptical of "tool AI" for a quite different reason: I don’t think such systems will be powerful enough. Just like the "mathematician AGI" in Section 11.3.2 above, I think a tool AI would be a neat toy, but it wouldn’t help solve the big problem—namely, that the clock is ticking until some *other* research group comes along and makes an agentic AGI. See my discussion here for why I think that agentic AGIs will be able to come up with creative new ideas and inventions in a way that non-agentic AGIs can’t.

But also, this is a series on brain-like AGI. Brain-like AGI (as I’m using the term) is definitely agentic. So non-agentic AI is off-topic for this series, even if it were a viable option.

11.4 Conclusion

In summary: