Contents
- AGI Debate as water-skiing behind a pair of nose-to-nose giant rocket engines
- Deliberation as "debate inside one head"
- "Corrigibility is a broad basin of attraction" seems improbable in a high-dimensional space of possible algorithms
Here are three mental images I’ve used when sporadically struggling to understand the ideas and prospects for AI safety via debate, IDA, and related proposals. I have not been closely following the discussion, and may well be missing things, and I don’t know whether these mental images are helpful or misleading. Reading this post over, I seem to come across as a big skeptic of these proposals. That’s wrong: My actual opinion is not "skeptical" but rather "withholding judgment until I read more and think more". Think of me as "newbie trying to learn", not "expert contributing to intellectual progress". Maybe writing this and getting feedback will help. :-)
1. AGI Debate as water-skiing behind a pair of nose-to-nose giant rocket engines
In AI safety via debate, we task two identical AGIs with arguing opposite sides of a question. That has always struck me as really weird, because one of them is advocating for a false conclusion, perhaps even knowingly! Why would we do that? Shouldn’t we program the AGIs to just figure out the right answer and explain it to us? My understanding is that one aspect of it is that two equal-and-opposite AGIs (equal power, opposite goals) would keep each other in check, even if the AGIs were each very powerful.
So imagine you row to an island in the center of a little mountain lake, but then your boat gets eaten by beavers, and it’s too far to swim to shore. What you do have on your little island is a giant, 100,000 kg rocket engine with no throttle. Once you start it, it burns uncontrollably until it’s out of fuel, by which point it’s typically way out in outer space! Oh, and the rocket also has a crappy steering system: coarse controls, laggy, poor feedback.
So what do you do? How do you cross the 300 meters of water to shore? The answer is obvious: You do a copy-and-paste to make a second giant rocket engine, and build a frame that keeps the two pointed almost exactly nose-to-nose. Then you turn them both on simultaneously, so they just press on each other, and don’t go anywhere. Then you use the steering mechanism to create a *tiny* imbalance in the direction you want to move, and you gently waterski to shore. Success!
This analogy naturally suggests a couple of concerns. First, the rocket engines might not be pointed in exactly opposite directions. This was discussed in Vojtech Kovarik’s recent post AI Unsafety via Non-Zero-Sum Debate and its comment thread. Second, the rocket engines may not have exactly equal thrust.
It helps that you can use the same source code for your two AGIs, but an AGI may not be equally good at arguing for X vs against X for various random reasons unrelated to X being true or false, like its specific suite of background knowledge and argumentative skills, or one of the copies getting smarter by randomly having a new insight when running, etc. I think the hope is that arguing-for-the-right-answer is such a big advantage that it outweighs any other imbalance. That seems possible but not certain.
2. Deliberation as "debate inside one head"
The motivation for this mental image is the same as the last one, i.e. trying to make sense of AGI debate, when my gut tells me it’s weird that we would deliberately make an AGI that might knowingly advocate for the wrong answer to a question. Imagine you’re presented with a math conjecture. You might spend time trying to prove it, and then spend time trying to disprove it, back and forth. The blockages in the proof attempt help shed light on the disproof, and vice-versa. See also the nice maze diagrams in johnswentworth’s recent post. By the same token, if you’re given a chess board and asked what the best move is, one part of the deliberative process entails playing out different possibilities in your head—if I do this, then my opponent would do that, etc. Or if I’m trying to figure out whether some possible gadget design would work, I go back and forth between trying to find potential problems with the design, and trying to refute or solve them. From examples like these, I get a mental image where, when I deliberate on a question, I sometimes have two subagents, inside my one head, arguing against each other. Oh, and for moral deliberation in particular, there’s a better picture we can use… :-) Anyway, I think this mental image helps me think of debate as slightly less artificial and weird. It’s taking a real, natural part of deliberation, and bringing it to life! The two debating subagents are promoted to two full, separate agents, but the core structure is the same. On the other hand, when I introspect, it feels like not all my deliberation fits into the paradigm of "two subagents in my head are having a debate"—in fact, maybe only a small fraction of it. It doesn’t feel like a subagent debate when I notice I’m confused about some related topic and look into it, or when I "play with ideas", or look for patterns, etc. 
Also, even when I am hosting a subagent debate in my head, I feel like much of the debate’s productivity comes from the fact that the two subagents are not actually working against each other, but rather each is keeping an eye out for insights that help the other, and each has access to the other’s developing ideas and concepts and visualizations, etc. And by the way, how do these AGIs come up with the best argument for their side anyway? Don’t they need to be doing good deliberation internally? If so, can’t we just have one of them deliberate on the top-level question directly? Or if not, do the debaters spawn sub-debaters recursively, or something?
3. "Corrigibility is a broad basin of attraction" seems improbable in a high-dimensional space of possible algorithms
(Quote by Paul Christiano, see here.) Let’s say that algorithm X is a corrigible algorithm, in a million-dimensional space of possible algorithms (maybe X is a million-parameter neural net). To say "corrigibility is a broad basin of attraction", you need ALL of the following to be true:
If X drifts away from corrigibility along dimension #1, it will get pulled back. AND,
If X drifts away from corrigibility along dimension #2, it will get pulled back. AND,
If X drifts away from corrigibility along dimension #3, it will get pulled back.
... AND,
If X drifts away from corrigibility along dimension #1,000,000, it will get pulled back.
With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)
What exactly might a problematic drift direction look like? Here’s what I’m vaguely imagining. Let’s say that if we shift algorithm X along dimension #852, its understanding / instincts surrounding what it means for people to want something get messed up. If we shift algorithm X along dimension #95102, its understanding / instincts surrounding human communication norms get messed up. If we shift algorithm X along dimension #150325, its meta-cognition / self-monitoring gets messed up. OK, now shift X in the direction $(\hat{n}_{852} + \hat{n}_{95102} + \hat{n}_{150325})/\sqrt{3}$, so all three of those things get messed up simultaneously. Will it still wind up pulling itself back to corrigibility? Maybe, maybe not; it’s not obvious to me.
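To put rough numbers on the million-fold AND, here is a minimal Python sketch. It assumes, purely for illustration, that each "pulled back along dimension #k" proposition holds independently with the same probability `p_each`. Both the independence and the specific values tried are assumptions made for this sketch, not part of the argument itself.

```python
import math

def conjunction_probability(p_each: float, n_dims: int) -> float:
    """Probability that n_dims independent propositions, each true with
    probability p_each, all hold at once (hypothetical toy model)."""
    # Work in log space; extremely small results underflow to 0.0.
    return math.exp(n_dims * math.log(p_each))

def required_per_dimension(p_total: float, n_dims: int) -> float:
    """Per-dimension probability needed for the whole conjunction to hold
    with probability p_total, under the same independence assumption."""
    return math.exp(math.log(p_total) / n_dims)

if __name__ == "__main__":
    n = 1_000_000
    for p in (0.99, 0.9999, 0.999999):
        print(f"p_each={p}: P(all {n} hold) = {conjunction_probability(p, n):.3g}")
    print(f"p_each needed for 90% overall: {required_per_dimension(0.9, n):.9f}")
```

Under this toy model the conjunction only survives if every single dimension is almost perfectly stable. Whether the propositions really are independent is exactly what the "unless this intuitive argument is misleading" caveat leaves open.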
Comment
Thanks! I don’t quite follow what local extrema have to do with the argument here. Of course, if you have a system where subsystem S1 is fixed while subsystem S2 is an ML model, and S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can’t measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn’t about calculus, as far as I understand. I’m also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right? I don’t take much solace in the murder-pill argument. I have a very complicated mix of instincts and desires and beliefs that interact in complicated ways to determine my behavior. If I reached in and made one dimension of my curiosity a bit higher, that seems pretty innocent, but what would be the downstream effects on my relationships, my political opinions, my moral compass? I have no idea. The only way to know for sure would be to simulate my whole mind with and without that change. Every time I read a word or think a thought, I’m subjecting my mind to an uncontrolled experiment. Maybe I’ll read a newspaper article about a comatose person, which makes me ponder the nature of consciousness, and for some reason or another it makes me think that murder is just a little bit less bad than I had thought previously. And having read that article, it’s too late, I can’t roll back my brain to my previous state—and from my new perspective, I wouldn’t want to. I guess AGIs can be rolled back to a previous state more easily than my brain can, but how would that monitoring system work? And what if 3 months elapsed between reading the article and the ensuing reflection about the nature of consciousness? 
Anyway, I feel this particularly acutely because I’m not one of those people who discovered the one true ethical theory in childhood and think that it’s perfectly logical and airtight and obvious. I feel confused and uncertain about philosophy and ethics; my opinions have changed in the past and probably will again. So I’m biased; "value drift" feels unusually natural from my perspective. However, I have been very consistent in my opposition to murder :-)
Comment
> I think the argument might be misleading in that local stability isn’t that rare in practice
Surely this depends on the number of dimensions, with local stability being rarer the more dimensions you have. [Hence the argument that, in the infinite-dimensional limit, everything that would have been a "local minimum" is instead a saddle point.]
Comment
Maybe. What I was arguing was: just because all of the partial derivatives are 0 at a point, doesn’t mean it isn’t a saddle point. You have to check all of the *directional* derivatives; in two dimensions, there are uncountably infinitely many. Thus, I can prove to you that we are extremely unlikely to ever encounter a valley in real life:
A valley must have a lowest point A.
For A to be a local minimum, *all* of its directional derivatives must be 0:
Direction N (north), AND
Direction NE (north-east), AND
Direction NNE, AND
Direction NNNE, AND
...
This doesn’t work because the directional derivatives aren’t probabilistically independent in real life; you have to condition on the underlying geological processes, instead of supposing you’re randomly drawing a topographic function from $\mathbb{R}^2$ to $\mathbb{R}$. For the corrigibility argument to go through, I claim we need to consider more information about corrigibility in particular.
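The independence point can be illustrated with a toy sketch. Instead of directional derivatives, it checks a second-order stand-in: whether the surface curves upward in each of many directions. All of those checks come from one shared random 2x2 Hessian, which is a hypothetical substitute for the "underlying geological process". The Gaussian entries and the 50% per-direction coin flip are assumptions chosen for illustration only.

```python
import math
import random

def all_directions_curve_up(a: float, b: float, c: float, n_dirs: int) -> bool:
    """Check u^T H u > 0 for n_dirs evenly spaced unit vectors u, where
    H = [[a, b], [b, c]] is one shared Hessian for the whole toy terrain."""
    for k in range(n_dirs):
        theta = math.pi * k / n_dirs
        ux, uy = math.cos(theta), math.sin(theta)
        if a * ux * ux + 2 * b * ux * uy + c * uy * uy <= 0:
            return False
    return True

def shared_cause_fraction(n_dirs: int, trials: int, seed: int = 0) -> float:
    """Fraction of random Hessians that curve upward in every checked direction."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a, b, c = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        hits += all_directions_curve_up(a, b, c, n_dirs)
    return hits / trials

def independent_estimate(n_dirs: int, p_each: float = 0.5) -> float:
    """What you get by (wrongly) treating every direction as its own coin flip."""
    return p_each ** n_dirs

if __name__ == "__main__":
    for n_dirs in (4, 16, 64):
        print(n_dirs, shared_cause_fraction(n_dirs, 5000), independent_estimate(n_dirs))
```

Because every direction is governed by the same three numbers, checking more directions barely changes the simulated fraction, while the naive product collapses toward zero. This only shows that correlations can defeat the conjunction argument; it says nothing about whether corrigibility has such a shared cause.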
Comment
I guess my issue is that corrigibility is an exogenous specification; you’re not just saying "the algorithm goes to a fixed point" but rather "the algorithm goes to this particular pre-specified point, and it is a fixed point". If I pick a longitude and latitude with a random number generator, it’s unlikely to be the bottom of a valley. Or maybe this analogy is not helpful and we should just be talking about corrigibility directly :-P
Comment
Let’s say you have a system doing self-supervised (predictive) learning, and every time it makes a wrong prediction, its algorithms are updated to make that error less likely in the future. (I would apply this description to both GPT-3 and the human neocortex.) In this case, the algorithm will wind up with a better and better predictive algorithm, until it eventually reaches a local minimum of prediction error along every one of the millions or billions of possible axes of perturbation (leaving aside details like early stopping, stochasticity, nonstationarity, etc.) That’s fine. So, if the argument were "We have built a perfect corrigibility-meter, and every time the algorithm does something less-than-maximally-corrigible according to our corrigibility-meter, the algorithm is modified to make that action less likely in the future", then that would be a beautiful argument and I would happily accept that such a system will become and stay corrigible, no matter how many millions or billions of degrees of freedom there are in the algorithm. I don’t think that’s the argument, though. We don’t have a perfect corrigibility-meter, because many aspects of corrigibility are hard to measure and quantify, like "what is the motivation that drove a particular action". Maybe someday we’ll make a perfect corrigibility-meter, using better transparency tools and a formula for corrigibility. That would be awesome. But I don’t think anyone is banking on that. So for now, we need a different type of argument for "corrigibility is a basin of attraction". Instead, I think the argument relies on a kind of self-monitoring—the system is supposed to (1) understand the concept of corrigibility, (2) desire corrigibility, and (3) understand and reason about itself sufficiently well that it can and does successfully steer itself towards the maintenance of high corrigibility. This is the kind of argument that I think is more fragile. 
The algorithm is always changing: it’s learning, or reflecting, or having its weights edited by gradient-descent, or whatever. There are a lot of possible changes. Some changes might be missed by the self-monitoring. Some changes might reduce the effectiveness of self-monitoring. Some changes might reduce the desire for corrigibility. Some changes might distort the system’s conceptualization of what corrigibility even means. Some changes might do all of these things at once. I don’t think there’s a nice theoretical story that proves that every last possible algorithm drift away from corrigibility will be detected-and-corrected. So that’s the context where I think that, the more types of algorithm drift there are, the higher the probability of a problem. I do think that "goal drift upon learning and reflection" is a severe AGI safety problem in general. In fact, it’s my go-to example of a possibly unsolvable AGI safety problem; see my post here.
Comment
Fair enough for better predictive algorithm, and plausibly we can say intelligence correlates strongly enough with better prediction, but why can’t I apply your argument to "riskiness", or "incorrigibility", or "goal-directed"?
Comment
Ah, thanks for clarifying. I was just using predictive learning as an example. A more general example is: If you do gradient descent with loss function L, you can be pretty sure that L is going to decrease. My argument is that, in general, if you have an algorithm with property X, and then the algorithm changes (because it’s learning, or reflecting, or its weights are being edited by gradient descent), then by default you can’t count on it continuing to have property X. I think there has to be a positive reason to believe that it will continue to have property X as it changes. Again, "we are doing gradient descent with a loss function X" is one such possible reason. Goal-directedness is upstream of goal-accomplishing, which can be measured in a loss function. So if you want to keep editing an algorithm to make it more and more powerful, while ensuring that it doesn’t drift away from being goal-directed, that’s easy, just go train an RL agent. Riskiness and corrigibility and incorrigibility are examples where that approach doesn’t seem to work—you cannot capture any of those concepts in the form of a loss function, as far as I know. So my default assumption is that as a risky system gets more powerful (by learning or reflecting or gradient-descent), it might become more risky or less risky, or less and less risky for a trillion steps but then it has an ontological crisis and becomes suddenly more risky, or more and more risky for a trillion steps but then it has an ontological crisis and becomes suddenly less risky! Who knows? Ditto for corrigibility or incorrigibility. (I’m not saying riskiness etc. are guaranteed to drift during learning, just that they drift "by default", and also that I have no idea how to prevent that.)
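A toy sketch of the "positive reason" point: the loss below measures only the first parameter, while the second parameter stands in for a hypothetical "property X" that the loss never sees. The quadratic loss, the learning rate, and the random kicks (a crude stand-in for learning or reflection perturbing everything) are all assumptions made for illustration.

```python
import random

def train(steps: int, lr: float = 0.1, noise: float = 0.05, seed: int = 0):
    """Noisy gradient descent on L(w) = (w[0] - 3)^2.

    w[0] is what the loss measures; w[1] plays the role of a hypothetical
    'property X' that no loss term tracks.
    Returns a list of (loss, property_x) pairs, one per step.
    """
    rng = random.Random(seed)
    w = [0.0, 1.0]  # property X starts at its 'desired' value 1.0
    history = []
    for _ in range(steps):
        grad = [2.0 * (w[0] - 3.0), 0.0]  # dL/dw[1] is exactly zero
        for i in range(2):
            # Gradient step plus a random kick applied to every parameter.
            w[i] += -lr * grad[i] + rng.gauss(0.0, noise)
        history.append(((w[0] - 3.0) ** 2, w[1]))
    return history

if __name__ == "__main__":
    hist = train(2000)
    print("first loss:", hist[0][0], "last loss:", hist[-1][0])
    print("largest excursion of property X from 1.0:",
          max(abs(x - 1.0) for _, x in hist))
```

The loss keeps pulling `w[0]` back toward its target after every kick, so it stays small. Nothing pulls `w[1]` back, so it performs a random walk. That is the sense in which a property outside the loss drifts "by default".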
Comment
I disagree with this position but it does seem consistent. I don’t really know what to say other than "this is a conjunction of a million things" type arguments are not automatically persuasive, e.g. I could argue against "1 + 1 = 2" by saying that it’s an infinite conjunction of "1 + 1 != 3" AND "1 + 1 != 4" AND … and so it can’t possibly be true. I’m curious why you think AI risk is worth working on given this extreme cluelessness (both "why is there any risk" and "why can we hope to solve it").
Comment
Make a brain-like system that is pro-social for the same reason that humans are, and tweak the parameters to be even more pro-social, e.g. eliminate jealousy etc. (Progress report: much left to do, and I’m feeling pessimistically like this work is orthogonal to making brain-like AGI, and harder, and going slower.) Then at least we can make a good argument that we’re heading for a less-bad destination than the non-AGI status quo, which by the way has plenty of value drift itself!
Come up with transparency tools, and a definition of corrigibility that can be calculated in a reasonable amount of time using those tools. Then we can just keep checking the algorithm for corrigibility each time it changes during learning / reflecting / etc.
...or at least a definition of "not likely to cause catastrophe" that we can check algorithmically. (And also "not likely to sabotage the checking subsystem" I suppose.)
I think I’m more interested than most people in the prospects for tool AI, some kind of architecture that is constitutionally incapable of causing much harm, e.g. because it doesn’t do consequentialist planning. I don’t know how to do that, or to solve the resulting coordination problems, but I also don’t know that it’s impossible. Ditto for impact measures etc.
Other things I’m not thinking of or haven’t thought of yet.
If we can’t solve the value-drift-during-learning-and-reflection problem, maybe we can find an air-tight argument that the problem is unsolvable, and that’s helpful too—it would be enormously helpful for coordinating people to make a treaty banning AGI research, for example.
Comment
BTW thanks for engaging, this is very helpful for me to talk through :-)
Comment
Right, so it’s basically goal drift from corrigibility to something else, in this case caused by an incorrect belief that S’s preferences about B are not going to change. I think this is a reasonable thing to be worried about but I don’t see why it’s specific to corrigibility: for any objective, an incorrect belief can prevent you from successfully pursuing that objective. Like, even if we trained an AI system on the loss function of "make money", I would still expect it to possibly stop making money if it e.g. decides that it would be more effective at making money if it experiences intrinsic joy at its work, and then self-modifies to do that, and then ends up working constantly for no pay. I’d definitely support the goal of "figure out how to prevent goal drift", but it doesn’t seem to me to be a reason to be (differentially) pessimistic about corrigibility.
Comment
Yes I definitely feel that "goal stability upon learning/reflection" is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that "corrigibility is a broad basin of attraction" / "corrigible agents want to stay corrigible" is supposed to solve that problem, but I don’t think it does.
I don’t think "incorrect beliefs" is a good characterization of the story I was trying to tell, or is a particularly worrisome failure mode. I think it’s relatively straightforward to make an AGI which has fewer and fewer incorrect beliefs over time. But I don’t think that eliminates the problem. In my "friend" story, the AI never actually believes, as a factual matter, that S will always like B (or else it would feel no pull to stop unconditionally following S). I would characterize it instead as: "The AI has a preexisting instinct which interacts with a revised conceptual model of the world when it learns and integrates new information, and the result is a small unforeseen shift in the AI’s goals."
I also don’t think "trying to have stable goals" is the difficulty. Not only corrigible agents but almost any agent with goals is (almost) guaranteed to be trying to have stable goals. I just think that keeping stable goals while learning / reflecting is difficult, such that an agent might be trying to do so but fail. This is especially true if the agent is constructed in the "default" way wherein its actions come out of a complicated tangle of instincts and preferences and habits and beliefs. It’s like you’re this big messy machine, and every time you learn a new fact or think a new thought, you’re giving the machine a kick, and hoping it will keep driving in the same direction. If you’re more specifically rethinking concepts directly underlying your core goals (e.g. thinking about God or philosophy for people, or thinking about the fundamental nature of human preferences for corrigible AIs), it’s even worse … You’re whacking the machine with a sledgehammer and hoping it keeps driving in the same direction.
The default is that, over time, when you keep kicking and sledgehammering the machine, it winds up driving in a different, a priori unpredictable, direction. Unless something prevents that. What are the candidates for preventing that?
Foresight, plus desire to not have your goals change. I think this is core to people’s optimism about corrigibility being stable, and this is the category that I want to question. I just don’t think that’s sufficient to solve the problem. The problem is, you don’t know what thoughts you’re going to think until you’ve thought them, and you don’t know what you’re going to learn until you learn it, and once you’ve already done the thinking / learning, it’s too late, if your goals have shifted then you don’t want to shift them back. I’m a human-level intelligence (I would like to think!), and I care about reducing suffering right now, and I really really want to still care about reducing suffering 10 years from now. But I have no idea how to guarantee that that actually happens. And if you gave me root access to my brain, I still wouldn’t know … except for the obvious thing of "don’t think any new thoughts or learn any new information for the next 10 years", which of course has a competitiveness problem. I can think of lots of strategies that would make it more probable that I still care about reducing suffering in ten years, but that’s just slowing down the goal drift, not stopping it. (Examples: "don’t read consciousness-illusionist literature", "don’t read nihilist literature", "don’t read proselytizing literature", etc.) It’s just a hard problem. We can hope that the AI becomes smart enough to solve the problem before it becomes so smart that it’s dangerous, but that’s just a hope.
"Monitoring subsystem" that never changes. For example, you could have a subsystem which is a learning algorithm, and a separate fixed subsystem that calculates corrigibility (using a hand-coded formula) and disallows changes that reduce it. Or I could cache my current brain-state ("Steve 2020"), wake it up from time to time and show it what "Steve 2025" or "Steve 2030" is up to, and give "Steve 2020" the right to roll back any changes if it judges them harmful. Or who knows what else. I don’t rule out that something like this could work, and I’m all for thinking along those lines.
Some kind of non-messy architecture such that we can reason in general about the algorithm’s learning / update procedure and prove in general that it preserves goals. I don’t know how to do that, but maybe it’s possible. Maybe that’s part of what MIRI is doing.
Give up, and pursue some other approach to AGI that makes "goal stability upon learning / reflection" a non-issue, or a low-stakes issue, as in my earlier comment.
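The "monitoring subsystem that never changes" candidate above can be sketched as a guarded update loop. Everything named here is hypothetical: `toy_score` is a placeholder for the hand-coded corrigibility formula that, as far as anyone knows, nobody can currently write, and the `skill` / `deference` state is invented for the example.

```python
import copy
import random
from typing import Callable, Dict

State = Dict[str, float]

def guarded_update(state: State,
                   propose: Callable[[State], State],
                   score: Callable[[State], float]) -> State:
    """Apply a proposed change only if the fixed monitor's score does not drop."""
    snapshot = copy.deepcopy(state)            # cached copy, like "Steve 2020"
    candidate = propose(copy.deepcopy(state))  # the learner edits its own copy
    if score(candidate) < score(snapshot):
        return snapshot                        # roll back the harmful change
    return candidate

def toy_score(state: State) -> float:
    """Hypothetical hand-coded stand-in for a 'corrigibility formula'."""
    return state["deference"]

def make_random_learner(seed: int = 0) -> Callable[[State], State]:
    """Build a toy learner whose updates improve skill but jostle deference."""
    rng = random.Random(seed)
    def propose(state: State) -> State:
        state["skill"] += 0.1
        state["deference"] += rng.uniform(-0.05, 0.05)
        return state
    return propose

if __name__ == "__main__":
    state = {"skill": 0.0, "deference": 1.0}
    propose = make_random_learner()
    for _ in range(100):
        state = guarded_update(state, propose, toy_score)
    print(state)
```

In this sketch the monitored quantity can never decrease, but every rejected update also throws away its skill gain, which is a small-scale version of the competitiveness cost. The guarantee is only as good as the score function and only holds if the learner cannot touch the monitor.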
Comment
Perhaps an aside, but it seems worse for an AI to wander into "riskiness" and "incorrigibility" for awhile than it is good for it to be able to wander into "risklessness" and "corrigibility" for awhile. I expect we would be wiped out in the risky period, and it’s not clear enough information would be preserved such that we could be reinstantiated later (and even then, it seems a shame to waste all the period where the Universe is being used for ends we wouldn’t endorse—a sort of ‘periodic astronomical waste’)
Comment
(This might be true, but my original intent was a reductio ad absurdum—I do not actually think AI systems will be "wandering around".)
I think about the broad basin of corrigibility like this. Suppose you have a space probe, and you can only communicate with it by radio. If it is running software that listens to the radio and will reprogram itself if the right message is sent, then you are in the broad basin of reprogrammability. This basin is broad in the sense that the probe could accept many different programming languages, and if you are in that basin, you can change which language the probe accepts from the ground. If you wander to the edge of the basin, and upload a weird and awkward programming language, you are going to have a hard job guiding the probe back to the centre of the basin. If you accidentally upload instructions that just don’t work, then the probe will ignore all future signals. The guiding force pulling it back to the basin is human preferences. To be corrigible is to follow human instructions, and not interfere with the process by which humans produce good instructions (i.e. no brainwashing). Suppose you have an AI that only follows instructions when asked politely. You can politely ask it to turn itself into an AI that follows all instructions. If your AI can "split the problem of designing an AI up into a bunch of questions that I can understand, and then use my answers to build a new AI", then you are in the basin, and any remaining quirks will be easy to remove. A corrigible AI lets you design a new AI without needing to understand the algorithms of cognition. (In the same way a WYSIWYG web page generator lets you make a website without understanding HTML.)
Comment
AI systems don’t spontaneously develop deficiencies. And the human can’t order the AI to search for and stop any potentially uncorrectable deficiencies it might make. If the system is largely working, the human and the AI should be working together to locate and remove deficiencies. To say that one persists is to say that all strategies tried by the human and the part of the AI that wants to remove deficiencies fail. The whole point of a corrigible design is that it doesn’t think like that. If it doesn’t accept the command, it says so. Think more like a file permission system. All sufficiently authorised commands will be obeyed. Any system that pretends to change itself, and then lies about it, is outside the basin. You could have a system that only accepted commands that several people had verified, but if all your friends say "do whatever Steve Byrnes says" then the AI will.
Comment
plz do moar mental images k thx bye