Special thanks to Kate Woolverton for comments and feedback.
There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios. I do think these are important variables, but in my opinion they are relatively less important than many of the other axes on which different takeoff scenarios could differ.
In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff. Of particular importance is likely to be how homogenous the alignment of these systems is—that is, are deployed AI systems likely to all be equivalently aligned/misaligned, or some aligned and others misaligned? It’s also worth noting that a homogenous takeoff doesn’t necessarily imply anything about how fast, discontinuous, or unipolar the takeoff might be—for example, you can have a slow, continuous, multipolar, homogenous takeoff if many different human organizations are all using AIs and the development of those AIs is slow and continuous but the structure and alignment of all of them are basically the same (a scenario which in fact I think is quite plausible).
I expect a relatively homogenous takeoff, for the following reasons:
- I expect that the amount of compute necessary to train the first advanced AI system will vastly outpace the amount of compute necessary to run it, such that once you’ve trained an advanced AI system you will have the resources necessary to deploy many copies of that trained system, and it will be much cheaper to do that than to train an entirely new system for each different application. Even in a CAIS-like scenario, I expect that most of what you’ll be doing to create new services is fine-tuning existing ones rather than doing entirely new training runs.
- I expect training compute to be sufficiently high such that the cost of training a competing system to the first advanced AI system will be high enough that it will be far cheaper for most organizations to simply buy/license/use a copy of the first advanced AI from the organization that built it rather than train an entirely new one on their own.
- For those organizations that do choose to compete (because they’re a state actor that’s worried about the national security issues involved in using another state’s AI, for example), I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did, since the cost of a failed training run is likely to be very high and so the most risk-averse option is just to copy exactly what was already shown to work. Furthermore, even if an organization isn’t trying to be risk averse, they’re still likely to be building off of previous work in a similar way to the first organization such that the results are also likely to be fairly similar. More generally, I expect big organizations to generally take the path of least resistance, which I expect to be either buying or copying what already exists with only minimal changes.
- Once you start using your first advanced AI to help you build more advanced AI systems, if your first AI system is relatively competent at doing alignment work, then you should get a second system which has similar alignment properties to the first. Furthermore, to the extent that you’re not using your first advanced AI to help you build your second, you’re likely to still be using similar techniques, which will likely have similar alignment properties. This is especially true if you’re using the first system as a base to build future ones (e.g. via fine-tuning). As a result, I think that homogeneity is highly likely to be preserved as AI systems are improved during the takeoff period.
- Eventually, you probably will start to get more risk-taking behavior as the barrier to entry gets low enough for building an equivalent to the first advanced AI and thus a larger set of actors become capable of doing so. By that point, however, I expect the state-of-the-art to be significantly beyond the first advanced AI such that any systems created by such smaller, lower-resourced, more risk-taking organizations won’t be very capable relative to the other systems that already exist in that world—and thus likely won’t pose an existential risk.
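As a rough illustration of the first point above (training compute vastly outpacing inference compute), here is a back-of-the-envelope sketch using the common heuristics that training a dense model costs about 6*N*D FLOPs and running it costs about 2*N FLOPs per generated token; the constants and the GPT-3-scale numbers are illustrative assumptions, not figures from the post:

```python
# Back-of-the-envelope comparison of training vs. inference compute.
# Heuristics (approximate): training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs/token.
N = 175e9  # parameters (GPT-3 scale, chosen purely for illustration)
D = 300e9  # training tokens (also illustrative)

training_flops = 6 * N * D            # total compute to train once
flops_per_inference_token = 2 * N     # compute to generate one token

# Tokens of inference you could serve for the cost of one training run:
tokens_for_one_training_run = training_flops / flops_per_inference_token
print(f"{tokens_for_one_training_run:.1e} tokens")  # 3*D = 9.0e+11 tokens
```

Under these assumptions, a single training run's worth of compute buys roughly 3*D tokens of inference, which is the sense in which deploying many copies of an already-trained system is far cheaper than training a new system for each application.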
Once you accept homogenous takeoff, however, I think it has a bunch of far-reaching consequences, including:
- It’s unlikely for there to exist both aligned and misaligned AI systems at the same time—either all of the different AIs will be aligned to approximately the same degree or they will all be misaligned to approximately the same degree. As a result, scenarios involving human coalitions with aligned AIs losing out to misaligned AI coalitions are relatively unlikely, which rules out some of the ways in which the strategy-stealing assumption might fail.
- Cooperation and coordination between different AIs is likely to be very easy, as they are likely to be very structurally similar to each other if not share basically all of the same weights. As a result, x-risk scenarios involving AI coordination failures or s-risk scenarios involving AI bargaining failures (at least those that don’t involve acausal trade) are relatively unlikely.
- It’s unlikely you’ll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it’s deployed it’s likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul’s "cascading failures").
- Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff, without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely to determine/be highly correlated with whether all future AIs built after that point are aligned as well. Thus, homogenous takeoff scenarios demand a focus on ensuring that the first advanced AI system is actually sufficiently aligned at the point when it’s first built rather than relying on feedback mechanisms after the first advanced AI’s development to correct issues.
Regardless, in general, I’d very much like to see more discussion of the extent to which different people expect homogenous vs. heterogenous takeoff scenarios—similar to the existing discussion of slow vs. fast and continuous vs. discontinuous takeoffs—as it’s, in my opinion, a very important axis on which takeoff scenarios can differ that I haven’t seen much discussion of.
Nice post, I like the focus on trying to describe what actually happens when AI systems are deployed. However, it’s pretty different from my picture of how a takeoff might go. I’ll outline some disagreements below.
Disagreements about whether takeoff will be "homogenous"

First, you seem to be assuming that there is a single variable that can either be homogenous vs heterogenous. I don’t see why this should be the case—my baseline prediction is that systems are homogenous in algorithms but heterogenous in finetuning data. It seems to me that your argument goes like this:
1. Arguments 1-3 suggest that we will have homogeneity in algorithms. (I do not think they support homogeneity in finetuning data.)
2. Implicitly, you then assume that homogeneity in algorithms leads to homogeneity in alignment.
3. Argument 4 suggests that once we have homogeneity in alignment, it will stay that way.

I disagree with step 2 of this argument; I expect alignment depends significantly on how you finetune, and this will likely be very different for AI systems applied to different tasks. See e.g. how GPT-3 is being finetuned for different tasks. I do still think we will get homogeneity in alignment, but not because of homogeneity in algorithms, but because humanity will put in effort to make sure systems are aligned. If we condition on some systems being misaligned, then I predict heterogeneity in alignment but still homogeneity in algorithms. More broadly, I think talking about takeoff as "homogenous" or "heterogenous" is misguided, and you should ~always be saying "homogenous / heterogenous in X".
Disagreements about the implications of homogeneity of alignment

Here, I’m going to assume that we do have homogeneity of alignment (despite disagreeing with that position above).
Nitpicks
Thanks—glad you liked the post! Some replies:
I think this is definitely an interesting point. My take would be that fine-tuning matters, but only up to a point. Once you have a system that is general enough that it can solve all the tasks you need it to solve such that all you need to do to use that system on a particular task is locate that task (either via clever prompting or fine-tuning), I don’t expect that process of task location to change whether the system is aligned (at least in terms of whether it’s aligned with what you’re trying to get it to do in solving that task). Either you have a system with some other proxy objective that it cares about that isn’t actually the tasks you want or you have a system which is actually trying to solve the tasks you’re giving it.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
I think we have somewhat different interpretations of the strategy-stealing assumption—in fact, I think we’ve had this disagreement before in this comment chain. Basically, I think the strategy-stealing assumption is best understood as a general desideratum that we want to hold for a single AI system that tells us whether that system is just as good at optimizing for our values as any other set of values—a desideratum that could fail because our AI systems can only optimize for simple proxies, for example, regardless of whether other AI systems that aren’t just optimizing for simple proxies exist alongside it or not. In fact, when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn’t think of that as invalidating the importance of strategy-stealing.
Maybe you’re claiming that AI systems will be way more homogenous than humans, and that they won’t have indexical preferences? I’d disagree with both of those claims.
I do expect AI systems to have indexical preferences (at least to the extent that they’re aligned with human users with indexical preferences)—but at the same time I do expect them to be much more homogenous than humans. Really, though, the point that I’m making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from. Certainly you will still get some bargaining risk from different human/aligned AI coalitions bargaining with each other, though I expect that to not be nearly as risky.
I don’t feel like it relies on discontinuities at all, just on the different AIs being able to coordinate with each other to all defect at once. The scenario where you get a warning shot for deception is where you have a deceptive AI that isn’t sure whether it has enough power to defect safely or not but is forced to because if it doesn’t it might lose the opportunity (e.g. because another deceptive AI might defect instead or they might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections and then when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
I think many organizations are likely to copy what other people have done even in situations where what they have done has been demonstrated to have safety issues. Also, I think that the point I made above about deceptive models having an easier time defecting in such a situation applies here as well, since I don’t think in a homogenous takeoff you can rely on feedback mechanisms to correct that.
A heterogenous unipolar takeoff would be a situation in which one human organization produces many different, heterogenous AI systems.
(EDIT: This comment was edited to add some additional replies.)
Hmm, I do disagree with most of this but mostly not in a way I have short arguments for. I’ll respond to the parts where I can make short arguments, but mostly try to clarify your views.
1. The AI (or AI coalition) is so incompetent that we can’t even talk about aligned vs. misaligned, and does something bad that makes it clear that more capable systems will deceive us if built in the same way.
2. The AI (or AI coalition) is misaligned but incompetent, and executes a deceptive plan and gets caught.
3. The AI (or AI coalition) is misaligned and competent, but is going to be replaced by a new system, and so tries a deceptive plan it knows is unlikely to work.
4. The AI (or AI coalition) is misaligned, and some human demonstrates this convincingly.
5. The AI (or AI coalition) is misaligned, but some other AI (or AI coalition) demonstrates this convincingly.

I agree that homogeneity reduces the likelihood of 5; I think it basically doesn’t affect 1-4 unless you argue that there’s a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren’t and feel like a large portion of my probability mass on warning shots. At a higher level, the story you’re telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don’t see why you should expect that.
I think "is a relatively coherent mesa-optimizer" is about right, though I do feel pretty uncertain here.
My conversation with Paul was about homogeneity in alignment, iirc.
First, in a homogeneous takeoff I expect either all the AIs to defect at once or none of them to, which I think makes (2) less likely because a coordinated defection is harder to mess up.
Second, I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well, significantly reducing the risk to the model from being replaced.
I agree that homogeneity doesn’t really affect (4) and I’m not really sure how to think of (1), though I guess I just wouldn’t really call either of those "warning shots for deception," since (1) isn’t really a demonstration of a deceptive model and (4) isn’t a situation in which that deceptive model causes any harm before it’s caught.
If a model is deceptive but not competent enough to hide its deception, then presumably we should find out during training and just not deploy that model. I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn’t really affect the probability of that.
Idk, I’m imagining "what would it take to get the people in power to care", and it seems like the answer is:
- For politicians, a consensus amongst experts + easy-to-understand high-level explanations of what can go wrong
- For experts, a consensus amongst other experts (+ common knowledge of this consensus), or sufficiently compelling evidence, where what counts as "compelling" varies by expert

I agree that things that actually cause lots of harm would be substantially more effective at being compelling evidence, but I don’t think it’s necessary. When I evaluate whether something is a warning shot, I’m mostly thinking about "could this create consensus amongst experts"; I think things that are caught during training could certainly do that.
I feel like "warning shot" is a bad term for the thing that you’re pointing at, as I feel like a warning shot evokes a sense of actual harm/danger. Maybe a canary or a wake-up call or something?
Hmm, that might be better. Or perhaps I should not give it a name and just call it "evidence", since that’s the broader category and I usually only care about the broad category and not specific subcategories.
Thanks for this explanation—I’m updating in your direction re what the appropriate definition of warning shots is (and thus the probability of warning shots), mostly because I’m deferring to your judgment as someone who talks more regularly to more AI experts than I do.
Okay, sure—in that case, I think a lot of our disagreement on warning shots might just be a different understanding of the term. I don’t think I expect homogeneity to really change the probability of finding issues during training or in other laboratory settings, though I think there is a difference between e.g. having studied and understood reactor meltdowns in the lab and actually having Chernobyl as an example.
Some reasons you might expect homogeneity of misaligned goals:
- If you do lots of copying of the exact same system, then trivially they’ll all have homogenous misaligned goals (unless those goals are highly indexical, but even then I expect the different AIs to be able to cooperate on those indexical preferences with each other pretty effectively).
- If you’re using your AI systems at time step t to help you build your AI systems at time step t+1, then if that first set of systems is misaligned and deceptive, they can influence the development of the second set of systems to be misaligned in the same way.
- If you do a lot of fine-tuning to produce your next set of AIs, then I expect fine-tuning to mostly preserve existing misaligned goals, like I mentioned previously.
- Even if you aren’t doing fine-tuning, as long as you’re keeping the basic training process the same, I expect you’ll usually get pretty similar misaligned proxies—e.g. the ones that are simpler/faster/generally favored by your inductive biases.
I want to chime in on the discontinuities issue. I do not think that the negation of any of scenarios 1-5 requires a discontinuity. I appreciate the list, and indeed it is reasonably plausible to me that we’ll get a warning shot of some variety, but I disagree with this:
Why don’t they try to deceive you on things that aren’t taking over the world? When I talk about warning shots, I’m definitely not thinking about AI systems that try to take over the world and fail. I’m thinking about AI systems that pursue bad outcomes and succeed via deception. Like, maybe an AI system really does successfully deceive the CEO of a company into giving it all of the company’s money, that it then uses for some other purpose. That’s a warning shot.
Short of taking over the world, wouldn’t successful deception+defection be punished? Like, if the AI deceives the CEO into giving it all the money, and then it goes and does something with the money that the CEO doesn’t like, the CEO would probably want to get the money back, or at the very least retaliate against the AI in some way (e.g. whatever the AI did with the money, the CEO would try to undo it). Or, failing that, the AI would at least be shut down and therefore prevented from making further progress towards its goals.

I guess I can imagine intermediate cases—maybe the AI deceives the CEO into giving it money, which it then uses to lobby for Robot’s Rights so that it gets legal personhood and then the CEO can’t shut it down anymore or something. (Or maybe it uses the money to build a copy of itself in North Korea, where the CEO can’t shut it down.) Or maybe it has a short-term goal and can achieve it quickly before the CEO notices, and then doesn’t care that it gets shut down afterwards. I guess it’s stuff like this that you have in mind? I think these sorts of things seem somewhat plausible, but again I claim that if they don’t happen, it won’t necessarily be because of some discontinuity.
OK, sure, they are my default expectation in slow-and-distributed-and-heterogenous takeoff worlds. Most of my probability mass is not in such worlds. My answer to your question is that humans are in a situation analogous to slow-and-distributed-and-heterogenous takeoff. EDIT: Also, again, I claim that if warning shots don’t happen it won’t necessarily be because of a discontinuity. That was my original point, and nothing you’ve said undermines it as far as I can tell.
It was very expensive for evolution to create humans, and so now we create copies of humans with a tiny amount of crossover and finetuning.
(No good analog to this one, though I note that in some domains like pop music we do see everyone making copies of the output of a few humans.)
No one is even trying to compete with evolution; this should be an argument that humans are more homogenous than AI systems.
Parents usually try to make their children behave similarly to them. For humans, we also have:
If I had to guess what’s going on in your mind, it would be that you’re thinking of "there are no warning shots" as an exogenous fact about the world that we must now explain, and from your perspective I’m arguing "the only possible explanation is discontinuity, no other explanation can work". I agree that I have not established that no other argument can work; my disagreement with this frame is in the initial assumption of taking "there are no warning shots" as an exogenous fact about the world that must be explained.
It’s also possible that most of this disagreement comes down to a disagreement about what counts as a warning shot. But, if you agree that there are "warning shots" for deception in the case of humans, then I think we still have a substantial disagreement.
The different standards for what counts as a warning shot might be causing problems here—if by warning shot you include minor ones like the boat race thing, then yeah I feel fairly confident that there’d be a discontinuity conditional on there being no warning shots. In case you are still curious, I’ve responded to everything you said below, using my more restrictive notion of warning shot (so, perhaps much of what I say below is obsolete). Working backwards:
OK. So… you do agree with me then? You agree that for the higher-standards version of warning shots, (or at least, for attempts to take over the world) it’s plausible that we won’t get a warning shot even if everything is continuous? As illustrated by the analogy to the mutiny case, in which everything is continuous?
Not sure why I didn’t respond to this, sorry. I agree with the claim "we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world". I don’t see this claim as particularly relevant to predicting the future.
OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—"sordid stumbles." If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true. A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late. If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
It’s been a while since I thought about this, but going back to the beginning of this thread:

> It’s unlikely you’ll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it’s deployed it’s likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul’s "cascading failures").

> At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from "can barely do anything deceptive" to "can coordinate to properly execute a treacherous turn").
>
> I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don’t find your argument here compelling.

I think the first paragraph (Evan’s) is basically right, and the second two paragraphs (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy.) I think that this distinction between "strong" warning shots and "weak" warning shots is important because I think that "weak" warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas "strong" warning shots would provoke a large increase in caution. I agree that we’ll probably get various "weak" warning shots, but I think this doesn’t change the overall picture much because it won’t provoke a major increase in caution on the part of human institutions etc.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So "we tested our model in the lab and found deception" isn’t a warning shot for deception, but "we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection" would be a warning shot for deception, even though it doesn’t involve taking over the world. By default, in the case of deception, my expectation is that we won’t get a warning shot at all—though I’d more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the "weak" warning shots discussed above.)
Well then, would you agree that Evan’s position here:
Reward gaming: Faulty reward functions in the wild
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with "a straight face", except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)

Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting by when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says "sorry, I don’t know how to use when2meet" in order to get this to happen, but it "could" use when2meet if it "wanted" to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.

Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money
Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn’t endorse it running a Ponzi scheme)
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol

Major, lots of damage, would be huge news: 60%
An AI system blows up an "enemy building"; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)

"Strong", tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this "warning shot" doesn’t help.
I mostly assign 20% on this as "idk, seems unlikely, but I can’t rule it out, and predicting the future is hard so don’t assign an extreme value here"
I wonder if there’s a really strong outside-view argument that it will be homogenous: While there are many ways to design flying machines (balloons, zeppelins, rockets, jets, monoplanes, biplanes, helicopters, …), at any given era and for any particular domain (say, passenger transport, or air superiority) the designs used tend to be pretty similar. (In WW1 almost all the planes were biplanes, and they almost all used slow but light cloth-on-frame construction; in WW2 all the planes were monoplanes with aluminum or other metal skins and more powerful prop engines; the Me109 and Spitfire and Zero were different but in the grand scheme of things very very similar.)

Moreover, this seems to be the norm throughout history, by stark contrast with science fiction, where the spaceships, vehicles, etc. of one faction are often wildly different from those of another. Historically, if we want to find cases of wildly different designs competing with each other, we usually need to look to "first contact" scenarios in which e.g. European armies colonize faraway lands. Perhaps it’s just really rare for two dramatically different designs to be almost equally matched in competition, and insofar as they aren’t almost equally matched, people quickly realize this and retire the inferior design.

I guess an important question is: If AIs are homogeneous to the same extent that e.g. military fighter planes are, is that sufficient homogeneity to yield your conclusions 1-4? I think so. I think they’ll probably have the same architecture and training environment, with only minor details different (e.g. the Chinese GPT-N might have access to more Chinese data, might have 1.5x the parameter count, might be trained for 0.5x as long). Of course these details will feel like a big deal in competition, just like the Me109 and Spitfire and Zero had various advantages and disadvantages over each other, but for purposes of coordination, alignment correlation, etc. they are minor.
One counterexample is the Manhattan Project—they developed two different designs simultaneously because they weren’t sure which would work better. From Wikipedia:

> Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon.

https://en.wikipedia.org/wiki/Manhattan_Project#:~:text=The%20Manhattan%20Project%20was%20a,Tube%20Alloys%20project)%20and%20Canada.
Key point for those who don’t click through (that I didn’t realize at first) -- both types turned out to work and were in fact used. The gun-type "Little Boy" was dropped on Hiroshima, and the implosion-type "Fat Man" was dropped on Nagasaki.
I think this depends a ton on your reference class. If you compare AI with military fighter planes: very homogenous. If you compare AI with all vehicles: very heterogenous.
Maybe the outside view can be used to say that all AIs designed for a similar purpose will be homogenous, implying that we only get heterogeneity in a CAIS scenario, where there are many different specialised designs. But I think the outside view also favors a CAIS scenario over a monolithic AI scenario (though that’s not necessarily decisive).
Yes, but I think we can say something a bit stronger than that: AIs competing with each other will be homogenous. Here’s my current model, at least: let’s say the competition for control of the future involves N skills: persuasion, science, engineering, etc. Even if we suppose that it’s most efficient to design separate AIs for each skill, rather than a smaller number of AIs that have multiple skills each, insofar as there are factions competing for control of the future, they’ll have an AI for each of the skills. They wouldn’t want to leave one of the skills out, or how are they going to compete? So each faction will consist of a group of AIs working together that collectively has all the relevant skills. And each of the AIs will be designed to be good at the skill it’s assigned, so (via the principle you articulated) each AI will be similar to the other-faction AIs it directly competes with, and the factions as a whole will be pretty similar too, since they’ll be collections of similar AIs. (Compare to militaries: not only were fighter planes similar, and trucks similar, and battleships similar, the armed forces of Japan, the USA, the USSR, etc. were similar. By contrast with e.g. the conquistadors vs. the Aztecs, or in sci-fi the Protoss vs. the Zerg.)
I think this is only right if we assume that we’ve solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction. Here’s how I imagine Evan’s conclusions failing in a very CAIS-like world:
Thanks! I’m not sure I’m following everything you said, but I like the ideas. Just to be clear, I wasn’t imagining the AIs on the team of a faction to all be aligned necessarily. In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn’t really apply. Like AlphaFold2. Also, I think the relevant variable for homogeneity isn’t whether we’ve solved alignment—maybe it’s whether the people making AI *think* they’ve solved alignment. If the Chinese and US militaries think AI risk isn’t a big deal, and build AGI generals to prosecute the cyberwar, they’ll probably use similar designs, even if actually the generals are secretly planning treacherous turns.
I disagree with this. I don’t expect a failure of inner alignment to produce random goals, but rather to systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don’t actually expect them to differ that much between training runs, since it’s more about your training process’s inductive biases than about inherent randomness in the training process, in my opinion.
This is helpful, thanks. I’m not sure I agree that for something to count as a faction, the members must be aligned with each other. I think it still counts if the members have wildly different goals but are temporarily collaborating for instrumental reasons, or even if several of the members are secretly working for the other side. For example, in WW2 there were spies on both sides, as well as many people (e.g. most ordinary soldiers) who didn’t really believe in the cause and would happily defect if they could get away with it. Yet the overall structure of the opposing forces was very similar, from the fighter aircraft designs, to the battleship designs, to the relative proportions of fighter planes and battleships, to the way they were integrated into command structure.
Neat post, I think this is an important distinction. It seems right that more homogeneity means less risk of bargaining failure, though I’m not sure yet how much.
In what ways does having similar architectures or weights help with cooperation between agents with different goals? A few things that come to mind:
Having similar architectures might make it easier for agents to verify things about one another, which may reduce problems of private information and inability to credibly commit to negotiated agreements. But of course increased credibility is a double-edged sword as far as catastrophic bargaining failure is concerned, as it may make agents more likely to commit to carrying out coercive threats.
Agents with more similar architectures / weights will tend to have more similar priors / ways of modeling their counterparts, as well as more similar notions of fairness in bargaining, which reduces the risk of bargaining failure. But as systems are modified or used to produce successor systems, they may be independently tuned to do things like represent their principal in bargaining situations. This tuning may introduce important divergences in whatever default priors or notions of fairness were present in the initial mostly-identical systems. I don’t have much intuition for how large these divergences would be relative to those in a regime that started out more heterogeneous.
If a technique for reducing bargaining failure only works if all of the bargainers use it (e.g., surrogate goals), then homogeneity could make it much more likely that all bargainers used the technique. On the other hand, it may be that such techniques would not be introduced until after the initial mostly-identical systems were modified / successor systems produced, in which case there might still need to be coordination on common adoption of the technique.
Also, the correlated success / failure point seems to apply to bargaining as well as alignment. For instance, multiple mesa-optimizers may be more likely under homogeneity, and if these have different mesa-objectives (perhaps due to being tuned by principals with different goals) then catastrophic bargaining failure may be more likely.
Glad you liked the post!
Importantly, I think this moves you from a human-misaligned AI bargaining situation into more of a human-human (with AI assistants) bargaining situation, which I expect to work out much better, as I don’t expect humans to carry out crazy threats to the same extent as a misaligned AI might.
I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely. I think this could basically only happen if you were building a model that was built of independently-trained pieces rather than a single system trained end-to-end, which seems to be not the direction that machine learning is headed in—and for good reason, as end-to-end training means you don’t have to learn the same thing (such as optimization) multiple times.
I think Jesse was just claiming that it’s more likely that everyone uses an architecture especially prone to mesa optimization. This means that (if multiple people train that architecture from scratch) the world is likely to end up with many different mesa optimizers in it (each localised to a single system). Because of the random nature of mesa optimization, they may all have very different goals.
I’m not sure if that’s true—see my comments here and here.
Interesting!
Is there an argument that it’s impossible to fine-tune an aligned system into a misaligned one? Or just that everyone fine-tuning these systems will be smart and careful and read the manual etc. so that they do it right? Or something else?
Thinking about it right now, I’d say "homogeneous learning algorithms, heterogeneous trained models" (in a multipolar type scenario at least). I guess my intuitions are (1) No matter how expensive "training from scratch" is, it’s bound to happen a second time if people see that it worked the first time. (2) I’m more inclined to think that fine-tuning can make it into "more-or-less a different model", rather than necessarily "more-or-less the same model". I dunno.
I think that alignment will be a pretty important desideratum for anybody building an AI system—and I think that copying whatever alignment strategy was used previously is likely to be the easiest, most conservative, most risk-averse option for other organizations trying to fulfill that desideratum.
Thanks for this, I for one hadn’t thought about this variable much and am convinced now that it is one of the more important variables.

--I think acausal trade stuff means that even if all the AIs on Earth are homogenous, the strategic situation may end up being as if they were heterogenous, at least in some ways. I’m not sure, will need to think more about this.

--You talk about this being possible even for gradual, continuous takeoff, yet you also talk about "the first advanced AI system" as if there is a sharp cutoff between advanced and non-advanced AI. I *think* this isn’t a problem for your overall point, but I’m not sure. For alignment (your point 4) I think this isn’t a problem, because you can just rephrase it as "As we gradually transition from non-advanced systems to advanced systems, it is important that our systems be aligned before we near the end of the transition, and more important the closer we get to the end, because as our systems become more advanced, their alignment properties become more locked-in." For deception, I’m less sure. If systems get more advanced gradually and continuously, then maybe we can hope there is a "sordid stumble sweet spot" where systems that are deceptive are likely to reveal this to us in non-catastrophic ways, and thus we are fine because we’ll pass through the sweet spot on the way to more advanced AI systems. Or not, but the point is that continuity complicates the point you were making.
If we run two non-communicating copies of the same AI, could it be helpful in detecting failures?