Contents
- Plan:
- Illustrative Analogy
- Exciting Graph
- Analysis
- Part 1: Extra brute force can make the problem a lot easier
- Part 2: Evolution produces complex mysterious efficient designs by default, even when simple inefficient designs work just fine for human purposes.
- Part 3: What’s bogus and what’s not
- Part 4: Example: Data-efficiency
- Conclusion
- Appendix

[Epistemic status: Strong opinions lightly held, this time with a cool graph.]

I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable.

In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is *almost zero evidence* that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes.

In slogan form: ***If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does.***

The case of birds & planes illustrates this point nicely. Moreover, it is also a precedent for several other short-timelines talking points, such as the human-brain-human-lifetime (HBHL) anchor.
*1909 French military plane, the Antoinette VII.* By Deep silence (Mikaël Restoux) - Own work (Bourget museum, in France), CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1615429
Illustrative Analogy
**AI timelines, from our current perspective**

**Shorty:** Human brains are giant neural nets. This is reason to think we can make human-level AGI (or at least AI with strategically relevant skills, like politics and science) by making giant neural nets.

**Longs:** Whoa whoa, there are loads of important differences between brains and artificial neural nets: [what follows is a direct quote from the objection a friend raised when reading an early draft of this post!]
- During training, deep neural nets use some variant of backpropagation. My understanding is that the brain does something else, closer to Hebbian learning. (Though I vaguely remember at least one paper claiming that maybe the brain does something that’s similar to backprop after all.)
- It’s at least possible that the wiring diagram of neurons plus weights is too coarse-grained to accurately model the brain’s computation, but it’s all there is in deep neural nets. If we need to pay attention to glial cells, intracellular processes, different neurotransmitters etc., it’s not clear how to integrate this into the deep learning paradigm.
- My impression is that several biological observations on the brain don’t have a plausible analog in deep neural nets: growing new neurons (though unclear how important it is for an adult brain), "repurposing" in response to brain damage, …

**Shorty:** The key variables seem to be size and training time. Current neural nets are tiny; the biggest one is only one-thousandth the size of the human brain. But they are rapidly getting bigger. Once we have enough compute to train neural nets as big as the human brain for as long as a human lifetime (HBHL), it should in principle be possible for us to build HLAGI. No doubt there will be lots of details to work out, of course. But that shouldn’t take more than a few years.

**Longs:** Bah! I don’t think we know what the key variables are. For example, biological brains seem to be able to learn faster, with less data, than artificial neural nets. And we don’t know why. Besides, "there will be lots of details to work out" is a huge understatement. It took evolution billions of generations of billions of individuals to produce humans. What makes you think we’ll be able to do it quickly? It’s plausible that actually we’ll have to do it the way evolution did it, i.e. meta-learn, i.e. evolve a large population of HBHLs, over many generations. (Or, similarly, train a neural net with a big batch size and a horizon length of a lifetime.) And even if you think we’ll be able to do it substantially quicker than evolution did, it’s pretty presumptuous to think we could do it quickly enough that the HBHL milestone is relevant for forecasting.

**Flying machine timelines, from the perspective of the late 1800s**

**Shorty:** Birds are winged creatures that paddle through the air. This is reason to think we can make winged machines that paddle through the air.

**Longs:** Whoa whoa, there are loads of important differences between birds and flying machines:
- Birds paddle the air by flapping, whereas current machine designs use propellers and fixed wings.
- It’s at least possible that the anatomical diagram of bones, muscles, and wing surfaces is too coarse-grained to accurately model how a bird flies, but that’s all there is to current machine designs (replacing bones with struts and muscles with motors, that is). If we need to pay attention to the percolation of air through and between feathers, micro-eddies in the air sensed by the bird and instinctively responded to, etc. it’s not clear how to integrate this into the mechanical paradigm.
- My impression is that several biological observations of birds don’t have a plausible analog in machines: Growing new feathers and flesh (though unclear how important this is for adult birds), "repurposing" in response to damage …

**Shorty:** The key variables seem to be engine-power and engine weight. Current motors are not strong & light enough, but they are rapidly getting better. Once the power-to-weight ratio of our motors surpasses the power-to-weight ratio of bird muscles, it should be in principle possible for us to build a flying machine. No doubt there will be lots of details to work out, of course. But that shouldn’t take more than a few years.

**Longs:** Bah! I don’t think we know what the key variables are. For example, birds seem to be able to soar long distances without flapping their wings at all, and we still haven’t figured out how they do it. Another example: We still don’t know how birds manage to steer through the air without crashing (flight stability & control). Besides, "there will be lots of details to work out" is a huge understatement. It took evolution billions of generations of billions of individuals to produce birds. What makes you think we’ll be able to do it quickly? It’s plausible that actually we’ll have to do it the way evolution did it, i.e. meta-design, i.e. evolve a large population of flying machines, tweaking our blueprints each generation of crashed machines to grope towards better designs. And even if you think we’ll be able to do it substantially quicker than evolution did, it’s pretty presumptuous to think we could do it quickly enough that the date our engines achieve power/weight parity with bird muscle is relevant for forecasting.
Exciting Graph
This data shows that Shorty was entirely correct about forecasting heavier-than-air flight. (For details about the data, see appendix.) Whether Shorty will also be correct about forecasting TAI remains to be seen. In some sense, Shorty has already made two successful predictions: I started writing this argument before having any of this data; I just had an intuition that power-to-weight is the key variable for flight and that therefore we probably got flying machines shortly after having comparable power-to-weight as bird muscle. Halfway through the first draft, I googled and confirmed that yes, the Wright Flyer’s motor was close to bird muscle in power-to-weight. Then, while writing the second draft, I hired an RA, Amogh Nanjajjar, to collect more data and build this graph. As expected, there was a trend of power-to-weight improving over time, with flight happening right around the time bird-muscle parity was reached. I had previously heard from a friend, who read a book about the invention of flight, that the Wright brothers were the first because they (a) studied birds and learned some insights from them, and (b) did a bunch of trial and error, rapid iteration, etc. (e.g. in wind tunnels). The story I heard was all about the importance of insight and experimentation—but this graph seems to show that the key constraint was engine power-to-weight. Insight and experimentation were important for determining who invented flight, but not for determining which decade flight was invented in.
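The check described above (compute each engine's power-to-weight ratio, then find when the bird-muscle benchmark is first reached) can be sketched in a few lines. This is only an illustration of the arithmetic, not the procedure actually used to build the graph. The function names, the tuple layout, and above all the numbers are assumptions made up for this sketch; they are placeholders and are NOT taken from the post's spreadsheet.

```python
from typing import Iterable, Optional, Tuple

# (year, power in watts, mass in kilograms); this layout is an assumption of the sketch.
Engine = Tuple[int, float, float]


def power_to_weight(power_w: float, mass_kg: float) -> float:
    """Return the power-to-weight ratio in watts per kilogram."""
    if mass_kg <= 0:
        raise ValueError("mass_kg must be positive")
    return power_w / mass_kg


def first_parity_year(engines: Iterable[Engine], benchmark_w_per_kg: float) -> Optional[int]:
    """Return the earliest year in which any engine meets or exceeds the benchmark.

    Returns None if no engine in the data reaches the benchmark.
    """
    years = [
        year
        for year, power_w, mass_kg in engines
        if power_to_weight(power_w, mass_kg) >= benchmark_w_per_kg
    ]
    return min(years) if years else None


# Made-up placeholder values, for illustration only (not real engines, not real bird data).
PLACEHOLDER_ENGINES = [
    (1800, 1000.0, 500.0),  # 2 W/kg
    (1850, 3000.0, 300.0),  # 10 W/kg
    (1900, 9000.0, 60.0),   # 150 W/kg
]
PLACEHOLDER_BENCHMARK = 100.0  # hypothetical stand-in for bird-muscle W/kg

if __name__ == "__main__":
    print(first_parity_year(PLACEHOLDER_ENGINES, PLACEHOLDER_BENCHMARK))
```

Shorty's forecast amounts to the claim that the year returned by a calculation like this, run on real data, lands close to the date of the first successful flight.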
Analysis
Part 1: Extra brute force can make the problem a lot easier
One way in which compute can substitute for insights/algorithms/architectures/ideas is that you can use compute to search for them. But there is a different and arguably more important way in which compute can substitute for insights/etc.: Scaling up the key variables, so that the problem becomes easier, so that fewer insights/etc. are needed. For example, with flight, the problem becomes easier the more power/weight ratio your motors have. Even if the Wright brothers didn’t exist and nobody else had their insights, eventually we would have achieved powered flight anyway, because when our engines are 100x more powerful for the same weight, we can use extremely simple, inefficient designs. (For example, imagine a u-shaped craft with a low center of gravity and helicopter-style rotors on each tip. Add a third, smaller propeller on a turret somewhere for steering. EDIT: Oops, lol, I’m actually wrong about this. Keeping center of gravity low doesn’t help. Welp, this is embarrassing.) With neural nets, we have plenty of evidence now that bigger = better, with theory to back it up. Suppose the problem of making human-level AGI with HBHL levels of compute is really difficult. OK, 10x the parameter count and 10x the training time and try again. Still too hard? Repeat. Note that I’m not saying that if you take a particular design that doesn’t work, and make it bigger, it’ll start working. (If you took Da Vinci’s flying machine and made the engine 100x more powerful, it would not work). Rather, I’m saying that the problem of finding a design that works gets qualitatively easier the more parameters and training time you have to work with. Finally, remember that human-level AGI is not the only kind of TAI. Sufficiently powerful R&D tools would work, as would sufficiently powerful persuasion tools, as might something that is agenty and inferior to humans in some ways but vastly superior in others.
Part 2: Evolution produces complex mysterious efficient designs by default, even when simple inefficient designs work just fine for human purposes.
Suppose that actually all we have to do to get TAI is something fairly simple and obvious, but with a neural net 10x the size of my (actual) brain and trained for 10x longer. In this world, does the human brain look any different than it does in the actual world? No. Here is a nonexhaustive list of reasons why evolution would evolve human brains to look like they do, with all their complexity and mysteriousness and efficiency, even if the same capability levels could be reached with 10x more neurons and a very simple architecture. Feel free to skip ahead if you think this is obvious.
- In general, evolved creatures are complex and mysterious to us, even when simple and human-comprehensible architectures work fine. Take birds, for example: As mentioned before, all the way up to the Wright brothers there were a lot of very basic things about birds that were still not understood. From this article: "They watched buzzards glide from horizon to horizon without moving their wings, and guessed they must be sucking some mysterious essence of upness from the air. Few seemed to realize that air moves up and down as well as horizontally." I don’t know much about ornithology but I’d be willing to bet that there were lots of important things discovered about birds *after* airplanes already existed, and that there are *still* at least a few remaining mysteries about how birds fly. (Spot check: Yep, the history of ornithopters page says "...the development of comprehensive aerodynamic theory for flapping remains an outstanding problem...") And of course evolved creatures are often more efficient in various ways than their still-useful engineered counterparts.
- Making the brain 10x bigger would be enormously costly to fitness, because it would cost 10x more energy and restrict mobility (not to mention the difficulties of getting through the birth canal!). Much better to come up with clever modules, instincts, optimizations, etc. that achieve the same capabilities in a smaller brain.
- Evolution is heavily constrained on training data, perhaps even more than on brain size. It can’t just evolve the organism to have 10x more training data, because longer-lived organisms have more opportunities to be eaten or suffer accidents, especially in their 10x-longer childhoods. Far better to hard-code some behaviors as instincts.
- Evolution gets clever optimizations and modules and such "for free" in some sense. Since it is evolving millions of individuals for millions of generations anyway, it’s not a big deal for it to perform massive search and gradient descent through architecture-space.
- Completely blank slate brains (i.e. extremely simple architecture, no instincts or finely tuned priors) would be unfit even if they were highly capable because they wouldn’t be aligned to evolution’s values (i.e. reproduction). Perhaps most of the complexity in the human brain (the instincts, inbuilt priors, and even most of the modules) isn’t for capabilities at all, but rather for alignment.
Part 3: What’s bogus and what’s not
The general pattern of argument I think is bogus is:
The brain has property X, which seems to be important to how it functions. We don’t know how to make AIs with property X. It took evolution a long time to make brains have property X. This is reason to think TAI is not near.

As argued above, if TAI *is* near, there should still be *many* X which are important to how the brain functions, which we don’t know how to reproduce in AI, and which it took evolution a long time to produce. So rattling off a bunch of X’s is basically zero evidence against TAI being near.

Put differently, here are two objections any particular argument of this type needs to overcome:
- TAI does not actually require X (analogous to how airplanes didn’t require anywhere near the energy-efficiency of birds, nor the ability to soar, nor the ability to flap their wings, nor the ability to take off from unimproved surfaces… the list goes on).
- We’ll figure out how to get property X in AIs soon after we have the other key properties (size and training time), because (a) we can do search, like evolution did but much more efficiently, (b) we can increase the other key variables to make our design/search problem easier, and (c) we can use human ingenuity & biological inspiration. Historically there is plenty of precedent for these three factors being strong enough; see e.g. the case of powered flight.

This reveals how the arguments could be reformulated to become non-bogus! They need to argue (a) that X is probably necessary for TAI, and (b) that X isn’t something that we’ll figure out fairly quickly once the key variables of size and training time are surpassed. In some cases there are decent arguments to be made for both (a) and (b). I think data-efficiency is one of them, so I’ll use that as my example below.
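One way to make the "basically zero evidence" claim concrete is the odds form of Bayes’ rule. This is a sketch added for illustration, not a derivation the post itself gives. The symbols are assumptions introduced here: $N$ is "TAI is near", $F$ is "TAI is far", and $E$ is the observation "the brain has some important property X that we don’t know how to reproduce and that evolution took a long time to produce".

```latex
\frac{P(N \mid E)}{P(F \mid E)}
  \;=\;
\frac{P(E \mid N)}{P(E \mid F)}
  \cdot
\frac{P(N)}{P(F)}
```

If, as Part 2 argues, we would observe such X’s almost as surely in a world where TAI is near as in a world where it is far, then $P(E \mid N) \approx P(E \mid F)$, the likelihood ratio is close to 1, and the posterior odds are nearly equal to the prior odds. An argument that also establishes (a) and (b) is making a different, stronger observation, one that really is less likely if TAI is near, which is why the reformulated version can move the odds.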
Part 4: Example: Data-efficiency
Let’s work through the example of data-efficiency. A bad version of this argument would be:
Humans are much more data-efficient learners than current AI systems. Data-efficiency is very important; any human who learned as inefficiently as current AI would basically be mentally disabled. This is reason to think TAI is not near.

The rebuttal to this bad argument is:

If birds were as energy-inefficient as planes, they’d be disabled too, and would probably die quickly. Yet planes work fine. (See Table 1 from this AI Impacts page.) Even if TAI is near, there are going to be lots of X’s that are important for the brain, that we don’t know how to make in AI yet, but that are either unnecessary for TAI or not too difficult to get once we have the other key variables. So even if TAI is near, I should expect to hear people going around pointing out various X’s and claiming that this is reason to think TAI is far away. You haven’t done anything to convince me that this isn’t what’s happening with X = data-efficiency.

However, I do think the argument can be reformulated and expanded to become good. Here’s a sketch, inspired by Ajeya Cotra’s argument here.

We probably can’t get TAI without figuring out how to make AIs that are as data-efficient as humans. It’s true that there are some useful tasks for which there is plenty of data (like call center work, or driving trucks), but AIs that can do these tasks won’t be transformative. Transformative AI will be doing things like managing corporations, leading armies, designing new chips, and writing AI theory publications. Insofar as AI learns more slowly than humans, by the time it accumulates enough experience doing one of these tasks, (a) the world would have changed enough that its skills would be obsolete, and/or (b) it would have made a lot of expensive mistakes in the meantime.

Moreover, we probably won’t figure out how to make AIs that are as data-efficient as humans for a long time, decades at least. This is because (1) we’ve been trying to figure this out for decades and haven’t succeeded, and (2) having a few orders of magnitude more compute won’t help much.

Now, to justify point #2: Neural nets actually do get more data-efficient as they get bigger, but we can plot the trend and see that they will still be less data-efficient than humans when they are a few orders of magnitude bigger. So making them bigger won’t be enough; we’ll need new architectures/algorithms/etc. As for using compute to search for architectures/etc., that might work, but given how long evolution took, we should think it’s unlikely that we could do this with only a few orders of magnitude of searching; probably we’d need to do many generations of large population size. (We could also think of this search process as analogous to typical deep learning training runs, in which case we should expect it’ll take many gradient updates with large batch size.) Anyhow, there’s no reason to think that data-efficient learning is something you need to be human-brain-sized to do. If we can’t make our tiny AIs learn efficiently after several decades of trying, we shouldn’t be able to make big AIs learn efficiently after just one more decade of trying.

I think this is a good argument. Do I buy it? Not yet. For one thing, I haven’t verified whether the claims it makes are true; I just made them up as plausible claims which would be persuasive to me if true. For another, some of the claims actually seem false to me. Finally, I suspect that in 1895 someone could have made a similarly plausible argument about energy efficiency, and another similarly plausible argument about flight control, and both arguments would have been wrong: Energy efficiency turned out to be insufficiently necessary, and flight control turned out to be insufficiently difficult!
Conclusion
*What I am not saying:* I am not saying that the case of birds and planes is strong evidence that TAI will happen once we hit the HBHL milestone. I do think it is evidence, but it is weak evidence. (For my all-things-considered view of how many orders of magnitude of compute it’ll take to get TAI, see future posts, or ask me.) I would like to see a more thorough investigation of cases in which humans attempt to design something that has an obvious biological analogue. It would be interesting to see if the case of flight was typical. Flight being typical would be strong evidence for short timelines, I think.

*What I am saying:* I am saying that many common anti-short-timelines arguments are bogus. They need to do much more than just appeal to the complexity/mysteriousness/efficiency of the brain; they need to argue that some property X is both necessary for TAI and not about to be figured out for AI anytime soon, not even after the HBHL milestone is passed by several orders of magnitude.

*Why this matters:* In my opinion the biggest source of uncertainty about AI timelines has to do with how much "special sauce" is necessary for making transformative AI. As jylin04 puts it,
A first and frequently debated crux is whether we can get to TAI from end-to-end training of models specified by relatively few bits of information at initialization, such as neural networks initialized with random weights. OpenAI in particular seems to take the affirmative view[^3], while people in academia, especially those with more of a neuroscience / cognitive science background, seem to think instead that we’ll have to hard-code in lots of inductive biases from neuroscience to get to AGI [^4].

In my words: Evolution clearly put lots of special sauce into humans, and took millions of generations of millions of individuals to do so. How much special sauce will we need to get TAI?

Shorty is one end of a spectrum of disagreement on this question. Shorty thinks the amount of special sauce required is small enough that we’ll "work out the details" within a few years of having the key variables (size and training time). At the other end of the spectrum would be someone who thought that the amount of special sauce required is similar to the amount found in the brain. Longs is in the middle. Longs thinks the amount of special sauce required is large enough that the HBHL milestone isn’t particularly relevant to timelines; we’ll either have to brute-force search for the special sauce like evolution did, or have some brilliant new insights, or mimic the brain, etc.

This post rebutted common arguments against Shorty’s position. It also presented weak evidence in favor of Shorty’s position: the precedent of birds and planes. In future posts I’ll say more about what I think the probability distribution over amount-of-special-sauce-needed should be and why.

Acknowledgements: Thanks to my RA, Amogh Nanjajjar, for compiling the data and building the graph. Thanks to Kaj Sotala, Max Daniel, Lukas Gloor, and Carl Shulman for comments on drafts.
Appendix
Some footnotes:
- I didn’t say anything about why we might think size and training time are the key variables, or even what "key variables" means. Hopefully I’ll get a chance in the comments or in subsequent posts.
- I deliberately left vague what "training time" means and what "size" means. Thus, I’m not committing myself to any particular way of calculating the HBHL milestone yet. I’m open to being convinced that the HBHL milestone is farther in the future than it might seem.
- Persuasion tools, even very powerful ones, wouldn’t be TAI by the standard definition. However they would constitute a potential-AI-induced-point-of-no-return, so they still count for timelines purposes.
- This "How much special sauce is needed?" variable is very similar to Ajeya Cotra’s variable "how much compute would lead to TAI given 2020’s algorithms."

Some bookkeeping details about the data:
- This dataset is not complete. Amogh did a reasonably thorough search for engines throughout the period (with a focus on stuff before 1910) but was unable to find power or weight stats for many of the engines we heard about. Nevertheless I am reasonably confident that this dataset is representative; if an engine was significantly better than the others of its time, probably this would have been mentioned and Amogh would have flagged it as a potential outlier.
- Many of the points for steam engine power/weight should really be bumped up slightly. This is because most of the data we had was for the weight of the entire locomotive of a steam-powered train, rather than just the steam engine part. I don’t know what fraction of a locomotive is non-steam-engine but 50% seems like a reasonable guess. I don’t think this changes the overall picture much; in particular, the two highest red dots do not need to be bumped up at all (I checked).
- The birds bar is the power/weight ratio for the muscles of a particular species of bird, as reported by this source. Amogh has done a bit of searching and doesn’t think muscle power/weight is significantly different for other species of bird. Seems plausible to me; even if the average bird has muscles that are twice (or half) as powerful-per-kilogram, the overall graph would look basically the same.
- I attempted to find estimates of human muscle power-to-weight ratio; it gets smaller the more tired the muscles get, but at peak performance for fit individuals it seems to be about an order of magnitude less than bird muscle. (This chart lists power-to-weight ratio for human cyclists, which according to this are probably about half muscle, so look at the left-hand column and double it.) Interestingly, this means that the engines of the first flying machines were possibly the first engines to be substantially better than human flapping/pedaling as a source of flying-machine power.
- EDIT Gaaah I forgot to include a link to the data! Here’s the spreadsheet.
Related to one aspect of this: my post Building brain-inspired AGI is infinitely easier than understanding the brain
Comment
Ah! If I had read that before, I had forgotten about it, sorry. This is indeed highly relevant. Strong-upvoted to signal boost.
Flying machines are one example but can we choose other examples which would teach the opposite lesson?

Nuclear Fusion Power Generation

Longs: The only way we know sustained nuclear fusion can be achieved is in stars. If we are confined to things less big than the sun then sustaining nuclear fusion to produce power will be difficult and there are many unknown unknowns.

Shorty: The key parameters are temperature and pressure and then controlling the plasma. A Tokamak design should be sufficient to achieve this; if we lose control it just means we need stronger / better magnets.
Comment
The appeal-to-nature’s-constants argument doesn’t work great in this context because the sun actually produces fairly low power per unit volume. Nuclear fusion on Earth requires vastly higher power density to be practical. That said, I think it is correct that temperature and pressure are the key factors. I just don’t think the factors map on to the natural equivalents, as much as onto some physical equations that give us the Q factor. In the context of the article, controlling the plasma is an appeal to complexity; if it turns out to be a rate limiter even after temperature and pressure suffice, then it would be evidence against the argument, but if it turns out not to matter that much, it would be evidence for.
Comment
Controlling the plasma is an appeal to complexity, but it isn’t an appeal to the complexity of the natural design. The natural design is super simple in this case. So it’s not analogous to the types of arguments I think are bogus.
Comment
OK, but doesn’t this hurt the point in the post? Shorty’s claim that the key variables for AI ‘seem to be size and training time’ and not other measures of complexity seems no stronger (and actually much weaker) than the analogous claim that the key variables for fusion seem to be temperature and pressure, and not other measures of complexity like plasma control. If the point of the post is only to argue against one specific framing for introducing appeals to complexity, rather than advocate for the simpler models, it seems to lose most of its predictive power for AI, since most of those appeals to complexity can be easily rephrased.
Comment
Thanks for these questions and arguments, they’ve given me something to think about. Here’s my current take: The point of this post was to argue against a common type of argument I heard. I agree that some of these appeals can be rephrased to become non-bogus, and indeed I sketched an account of how they need to rephrase in order to become non-bogus: They need to argue that a.) X is probably necessary for TAI, and b.) X probably won’t arrive shortly after the other variables are achieved. I think most of the arguments I am calling bogus cannot be rephrased in this way to achieve a and b, or if they can, I haven’t seen it done yet. The secondary point of this post was to provide evidence for the HBHL milestone, basically "Hey, the case of flight seems analogous in a bunch of ways to the case of AI, and if AI goes the way flight went, it’ll happen around the HBHL milestone." This point is much weaker for the obvious reason that flight is just one case-study and we can think of others (like maybe fusion?) that yield the opposite lessons. I think flight is more analogous to AI than fusion, but I’m not sure. Thus, to people who already assigned non-negligible weight to the HBHL and who didn’t put much stock in the bogus arguments, my post is just preaching to the choir and provides no further evidence. My post should only cause a big update in people who either bought the bogus arguments, or who assigned such a low probability to the HBHL milestone that a single historical case study is enough to make them feel like their probability was too low.
Comment
I expect that fusion progress is in fact predominantly determined by temperature and pressure (and factors like that that go into the Q factor), and expect that issues with control won’t seem very relevant to long-run timelines in retrospect. It’s true that we’ve had temperature and pressure equal to the sun for a while, but it’s also true that low-yield fusion is pretty easy. The missing piece to that cannot simply be control, since even a perfectly controlled ounce of a replica sun is not going to produce much energy. Rather, we just have a higher bar to cross before we get yield. In fusion, you can use temperature and pressure to trade off against control issues. This is most clearly illustrated in hydrogen bombs. In fact, there is little in-principle reason you couldn’t use hydrogen bombs to heat water to power a turbine, even if it’s not the most politically or economically sensible design.

OK, then in that case I feel like the case of fusion is totally not a counterexample-precedent to Shorty’s methodology, because the Sun is just not at all analogous to what we are trying to do with fusion power generation. I’m surprised and intrigued to hear that control isn’t a big deal. I assume you know more about fusion than me so I’m deferring to you.
Comment
I am by no means an expert on fusion power, I’ve just been loosely following the field after the recent bunch of fusion startups, a significant fraction of which seem to have come about precisely because HTS magnets significantly shifted the field strength you can achieve at practical sizes. Control and instabilities are absolutely a real practical concern, as are a bunch of other things like neutron damage; my expectation is only that they are second-order difficulties in the long run, much like wing shape was a second-order difficulty for flight. My framing is largely shaped by this MIT talk (here’s another, here’s their startup).
Comment
Thanks again for the detailed reply; I feel like I’m coming to understand you (and fusion!) much better. You may indeed be hoping the OP is something it’s not. That said, I think I have more to say in agreement with your strong position:
Comment
Thanks, I think I pretty much understand your framing now. I think the only thing I really disagree with is that ""can use compute to automate search for special sauce" is pretty self-explanatory." I think this heavily depends on what sort of variable you expect the special sauce to be. Eg. for useful, self-replicating nanoscale robots, my hypothetical atomic manufacturing technology would enable rapid automated iteration, but it’s unclear how you could use that to automatically search for a solution in practice. It’s an enabler for research, moreso than a substitute. Personally I’m not sure how I’d justify that claim for AI without importing a whole bunch of background knowledge of the generality of optimization procedures! IIUC this is mostly outside the scope of what your article was about, and we don’t disagree on the meat of the matter, so I’m happy to leave this here.
Comment
I think I agree that it’s not clear compute can be used to search for special sauce in general, but in the case of AI it seems pretty clear to me: AIs themselves run in computers, and the capabilities we are interested in (some of them, at least) can be detected on AIs in simulations (no need for e.g. robotic bodies) and so we can do trial-and-error on our AI designs in proportion to how much compute we have. More compute, more trial-and-error. (Except it’s more efficient than mere trial-and-error, we have access to all sorts of learning and meta-learning and architecture search algorithms, not to mention human insight). If you had enough compute, you could just simulate the entire history of life evolving on an earth-sized planet for a billion years, in a very detailed and realistic physics environment!
Comment
Eventually the conclusion holds trivially, sure, but that takes us very far from the HBHL anchor. Most evolutionary algorithms we run today are very constrained in what programs they can generate, and are run over small models for a small number of iteration steps. A more general search would be exponentially slower, and even more disconnected from current ML. If you expect that sort of research to be pulling a lot of weight, you probably shouldn’t expect the result to look like large connectionist models trained on lots of data, and you lose most of the argument for anchoring to HBHL. A more standard framing is that ‘we can do trial-and-error on our AI designs’, but there we’re again in a regime where scale is an enabler for research, more so than a substitute for it. Architecture search will still fine-tune and validate these ideas, but is less likely to drive them directly in a significant way.
Comment
Ajeya estimates (and I agree with her) how much compute it would take to recapitulate evolution, i.e. simulate the entire history of life on earth evolving for a billion years etc. The number she gets is 10^41 FLOP give or take a few OOMs. That’s 17 OOMs away from where we are now. So if you take 10^41 as an upper bound, and divide up the probability evenly across the OOMs… Of course it probably shouldn’t be a *hard* upper bound, so instead of dividing up 100 percentage points you should divide up 95 or 90 or whatever your credence is that TAI could be achieved for 10^41 or less compute. But that wouldn’t change the result much, which is that a naive, flat-across-orders-of-magnitude-up-until-the-upper-bound-is-reached distribution would assign substantially higher probability to Shorty’s position than Ajeya does.

I’m still not following the argument. I agree that you won’t be able to use your HBHL compute to do search over HBHL-sized brains+childhoods, because if you only have HBHL compute, you can only do one HBHL-sized brain+childhood. But that doesn’t undermine my point, which is that as you get more compute, you can use it to do search. So e.g. when you have 3 OOMs more compute than the HBHL milestone, you can do automated search over 1000 HBHL-sized brains+childhoods. (Also I suppose even when you only have HBHL compute you could do search over architectures and childhoods that are a little bit smaller and hope that the lessons generalize.)

I think part of what might be going on here is that since Shorty’s position isn’t "TAI will happen as soon as we hit HBHL" but rather "TAI will happen shortly after we hit HBHL" there’s room for an OOM or three of extra compute beyond the HBHL to be used. (Compute costs decrease fairly quickly, and investment can increase much faster, and probably will when TAI is nigh.)

I agree that we can’t use compute to search for special sauce if we only have exactly HBHL compute (setting aside the parenthetical in the previous paragraph, which suggests that we can).
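To make the naive flat-across-OOMs arithmetic concrete, here is a minimal sketch. The 17-OOM span, the 90 to 100 percent credence range, and the "3 OOMs gives 1000 runs" figure come from the comment above; the function names are made up for illustration, and this is only a first-pass prior, not Ajeya’s actual model.

```python
def flat_prior_per_oom(total_credence: float, n_ooms: int) -> float:
    """Probability mass assigned to each order of magnitude under a flat prior."""
    return total_credence / n_ooms


def mass_within(k_ooms: int, total_credence: float, n_ooms: int) -> float:
    """Mass on 'TAI needs at most k more OOMs of compute than we have now'."""
    return min(k_ooms, n_ooms) * flat_prior_per_oom(total_credence, n_ooms)


def parallel_runs(extra_ooms: int) -> int:
    """How many HBHL-sized brain+childhood runs fit in 10**extra_ooms times HBHL compute."""
    return 10 ** extra_ooms


if __name__ == "__main__":
    for credence in (1.0, 0.95, 0.90):
        per_oom = flat_prior_per_oom(credence, 17)
        print(f"credence {credence:.2f}: {per_oom:.3f} per OOM, "
              f"{mass_within(3, credence, 17):.3f} within 3 OOMs of today")
    print("runs available at HBHL + 3 OOMs:", parallel_runs(3))
```

Under these assumptions each OOM gets roughly 5 to 6 percentage points, so the first few OOMs past the HBHL milestone already carry a noticeable share of the mass.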
Comment
Well I understand now where you get the 17, but I don’t understand why you want to spread it uniformly across the orders of magnitude. Shouldn’t you put all the probability mass for the brute-force evolution approach on some gaussian around where we’d expect that to land, and only have probability elsewhere to account for competing hypotheses? Like I think it’s fair to say the probability of a ground-up evolutionary approach only using 10-100 agents is way closer to zero than to 4%.
Comment
Sorry I didn’t see this until now!

I agree that for the brute-force evolution approach, we should have a gaussian around where we’d expect that to land. My "Let’s just do evenly across all the OOMs between now and evolution" is only a reasonable first-pass approach to what our all-things-considered distribution should be like, including evolution but also various other strategies. (Even better would be having a taxonomy of the various strategies and a gaussian for each; this is sorta what Ajeya does. The problem is that insofar as you don’t trust your taxonomy to be exhaustive, the resulting distribution is untrustworthy as well.) I think it’s reasonable to extend the probability mass down to where we are now, because we are currently at the HBHL milestone pretty much, which seems like a pretty relevant milestone to say the least.
Comment
Good point! I’d love to see a more thorough investigation into cases like this. This is the best comment so far IMO; strong-upvoted. My immediate reply would be: Shorty here is just wrong about what the key parameters are; as Longs points out, size seems pretty important, because it means you don’t have to worry about control. Trying to make a fusion reactor much smaller than a star seems to me to be analogous to trying to make a flying machine with engines much weaker than bird muscle, or an AI with neural nets much smaller than human brains. Yeah, maybe it’s possible in principle, but in practice we should expect it to be very difficult. But I’m not sure, I’d want to think about this more.
Comment
Update: Actually, I think I analyzed that wrong. Shorty did mention "controlling the plasma" as a key variable; in that case, I agree that Shorty got the key variables correct. Shorty’s methodology is to plot a graph with the key variables and say "We’ll achieve it when our variables reach roughly the same level as they are in nature’s equivalent." But how do we measure level of control? How can we say that we’ve reached the same level of control over the plasma as the Sun has? This bit seems implausible. So I think a steelman Shorty would either say that it’s unknown whether we’ve reached the key variables yet (because we don’t know how good tokamaks are at controlling plasma) or that control isn’t a key variable (because it can be compensated for by other things, like temperature and pressure.) (Though in this case if Shorty went that second route, they’d probably just be wrong? Compare to the case of flight, where the problem of controlling the craft really does become a lot easier when you have access to more powerful&light engines. I don’t know much about fusion designs but I suspect that cranking up temperature and pressure doesn’t, in fact, make controlling the reaction easier. Am I wrong?)
Comment
Probably nowadays what Shorty missed was the difficulty in dealing with the energetic neutrons being created and associated radiation. Then associated maintenance costs etc and therefore price-competitiveness. I chose nuclear fusion purely because it was the most salient example of project-that-always-misses-its-deadlines. (I did my university placement year in nuclear fusion research but still don’t feel like I properly understand it! I’m pretty sure you’re right though about temperature, pressure and control.) In theory a steelman Shorty could have thought of all of these things but in practice it’s hard to think of everything. I find myself in the weird position of agreeing with you but arguing in the opposite direction. For a random large project X, which is more likely to be true:
- Project X took longer than expert estimates because of failure to account for Y
- Project X was delivered approximately on time

In general I suspect that it is the former (1). In that case the burden of evidence is on Shorty to show why project X is outside of the reference class of typical-large-projects and maybe in some subclass where accurate predictions of timelines are more achievable. Maybe what is required is to justify TAI as being in the subclass
- projects-that-are-mainly-determined-by-a-single-limiting-factor, or
- projects-whose-key-variables-are-reliably-identifiable-in-advance

I think this is essentially the argument the OP is making in Analysis Part 1?
I notice in the above I’ve probably gone beyond the original argument—the OP was arguing specifically against using the fact that natural systems have such properties to say that they’re required. I’m talking about something more general—systems generally have more complexity than we realize. I think this is importantly different. It may be the case that Longs’ argument about brains having such properties is based on an intuition from the broader argument. I think that the OP is essentially correct in saying that adding examples from the human brain into the argument does little to make such an argument stronger (Analysis part 2).
(1) Although there is also the question of how much later counts as a failure of prediction. I guess Shorty is arguing for TAI in the next 20 years, Longs is arguing 50-100 years?
Comment
I still prefer my analysis above: Fusion is not a case of Shorty being wrong, because a steelman Shorty wouldn’t have predicted that we’d get fusion soon. Why? Because we don’t have the key variables. Why? Because controlling the plasma is one of the key variables, and the sun has near-perfect control, whereas we are trying to substitute with various designs which may or may not work. Shorty is actually arguing for TAI much sooner than 20 years from now; if TAI comes around the HBHL milestone then it could happen any day now, it’s just a matter of spending a billion dollars on compute and then iterating a few times to work out the details, wright-brothers style. Of course we shouldn’t think Shorty is probably correct here; the truth is probably somewhere in between. (Unless we do more historical analyses and find that the case of flight is truly representative of the reference class AI fits in, in which case ho boy singularity here we come) And yeah, the main purpose of the OP was to argue that certain anti-short-timelines arguments are bogus; this issue of whether timelines are actually short or long is secondary and the case of flight is just one case study, of limited evidential import. I do take your point that maybe Longs’ argument was drawing on intuitions of the sort you are sketching out. In other words, maybe there’s a steelman of the arguments I think are bogus, such that they become non-bogus. I already agree this is true in at least one way (see Part 3). I like your point about large projects—insofar as we think of AI in that reference class, it seems like our timelines should be "Take whatever the experts say and then double it." But if we had done this for flight we would have been disastrously wrong. I definitely want to think, talk, and hear more about these issues… I’d like to have a model of what sorts of technologies are like fusion and what sort are like flight, and why. I like your suggestions:
Extremely minor nitpick: the low center of gravity wouldn’t stabilize the craft. Helicopters are unstable regardless of where the rotors are relative to the center of gravity, due to the pendulum rocket fallacy.
Comment
I came here to say this :) If you do the stabilisation with the rotors in the usual helicopter way, you basically have a Chinook (though you don’t need the extra steering propeller because you can control the rotors well enough).
Comment
A Chinook was basically what I was envisioning… what does a Chinook do that my U-shaped proposal wouldn’t do? How does stabilization with rotors work? EDIT: Ok, so helicopters use some sort of weighted balls attached to their rotors, and maybe some flexibility in the rotors also… I still don’t fully understand how it works but it seems like there are probably explainer videos somewhere.
Comment
Yeah, the mechanics of helicopter rotors is pretty complex and a bit counter-intuitive, Smarter Every Day has a series on it
Damn! I feel foolish, should have looked this up first. Thanks! EDIT: OK, so simple design try #2: What about a quadcopter (with counter-rotating propellers of course to cancel out torque) but where the propellers are angled away from the center of mass instead of just pointing straight down—that way if the craft starts tilting in some direction, it will have an imbalance of forces such that more of the upward component comes from the side that is tilting down, and less from the side that is tilting up, and so the former side will rise and the latter side will fall, and it’ll be not-tilted again. This was the other idea I had, but I wrote the U-shaped thing because it took fewer words to explain. … is this wrong too? EDIT: Now I’m worried this is wrong too for the same reason… damn… I guess I’m still just very confused about the pendulum rocket fallacy and why it’s a fallacy. I should go read more.)
I know you weren’t endorsing this claim as definitely true, but FYI my take is that other families of learning algorithms besides deep neural networks are in fact as data-efficient as humans, particularly those related to probabilistic programming and analysis-by-synthesis, see examples here.
Curated. This post laid out some important arguments pretty clearly.
Planned summary for the Alignment Newsletter:
Comment
Sounds good to me! I suggest you replace "we don’t know how to make wings that flap" with "we don’t even know how birds stay up for so long without flapping their wings," because IMO it’s a more compelling example. But it’s not a big deal either way. As an aside, I’d be interested to hear your views given this shared framing. Since your timelines are much longer than mine, and similar to Ajeya’s, my guess is that you’d say TAI requires data-efficiency and that said data-efficiency will be really hard to get, even once we are routinely training AIs the size of the human brain for longer than a human lifetime. In other words, I’d guess that you would make some argument like the one I sketched in Part 3. Am I right? If so, I’d love to hear a more fleshed-out version of that argument from someone who endorses it—I suppose there’s what Ajeya has in her report...
Comment
Sorry, what in this post contradicts anything in Ajeya’s report? I agree with your headline conclusion of
Comment
OK, so here is a fuller response: First of all, yeah, as far as I can tell you and I agree on everything in the OP. Like I said, this disagreement is an aside.

Now that you mention it / I think about it more, there’s another strong point to add to the argument I sketched in part 3: Insofar as our NN’s aren’t data-efficient, it’ll take more compute to train them, and so even if TAI need not be data-efficient, *short-timelines-TAI* must be. (Because in the short term, we don’t have much more compute. I’m embarrassed I didn’t notice this earlier and include it in the argument.) That helps the argument a lot; it means that all the argument has to do is establish that we aren’t going to get more data-efficient NN’s anytime soon.

And yeah, I agree the scaling laws are a great source of evidence about this. I had them in mind when I wrote the argument in part 3. I guess I’m just not as convinced as you (?) that (a) when we are routinely training NN’s with 10e15 params, it’ll take roughly 10e15 data points to get to a useful level of performance, and (b) average horizon length for the data points will need to be more than short.

Some reasons I currently doubt (a):
- A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques.
- The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally. It could be that at 10e15 params and 10e15 data points, performance is actually much higher than merely useful; maybe only 10e13 params and 10e13 data points would be the first to cross the usefulness threshold. (Counterpoint: Extrapolating GPT performance trends on text prediction suggests it wouldn’t be human-level at text prediction until about 10e15 params and 10e15 data points, according to data I got from Lanrian. Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10e15/10e15 would be the far-right edge of the graph).

Some reasons I currently doubt (b):
- I’ve been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.
- I think that humans have a tiny horizon length—our brains are constantly updating, right? I guess it’s hard to make the comparison, given how it’s an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that’s all you need.
- Having a small average horizon length doesn’t preclude also training lots on long-horizon tasks. It just means that *on average* your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.

I’m very uncertain about all of this and would love to hear your thoughts, which is why I asked. :)
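The last point about average horizon length is just a weighted mean, which a short sketch can illustrate. The specific counts below (a trillion one-second data points plus a million day-long ones) are hypothetical numbers chosen for illustration, not figures from the discussion, and the function name is made up.

```python
def average_horizon_seconds(batches):
    """Weighted mean horizon length.

    `batches` is a list of (number_of_data_points, horizon_in_seconds) pairs.
    """
    total_points = sum(count for count, _ in batches)
    total_seconds = sum(count * horizon for count, horizon in batches)
    return total_seconds / total_points


# Hypothetical training mix (assumed numbers, for illustration only):
# mostly one-second "predict the next input" data points,
# plus a million day-long planning episodes.
SECONDS_PER_DAY = 86_400.0
training_mix = [
    (10**12, 1.0),
    (10**6, SECONDS_PER_DAY),
]

if __name__ == "__main__":
    print(f"average horizon: {average_horizon_seconds(training_mix):.4f} s")
```

With these assumed numbers the average stays close to one second even though the mix contains a million day-long episodes.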
Comment
Thanks for the detailed reply!

> Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.

I didn’t quite follow this part. Do you think I’m not reasoning from the thing I believe is the bottleneck?
Comment
Estimated algorithmic efficiency in the report is low because researchers are not currently optimizing for "efficiency on a transformative task", whereas researchers probably are optimizing for "efficiency of GPT-3 style systems", suggesting faster improvements in algorithmic efficiency for GPT-3 than estimated in the report.
90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report. (Note that 2 OOMs in 10 years seems significantly different from "we can get several OOMs more data-efficient training than the GPT’s had using various already-developed tricks and techniques". I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)
I don’t particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren’t optimized for that.
I expect that Google search beats GPT-3 on that dataset. I don’t really know what you mean when you say that this task is "hard". Sure, humans don’t do it very well. We also don’t do arithmetic very well, while calculators do.
Comment
Thanks again. My general impression is that we disagree less than it first appeared, and that our disagreements are currently bottoming out in different intuitions rather than obvious cruxes we can drill down on. Plus I’m getting tired. ;) So, I say we call it a day. To be continued later, perhaps in person, perhaps in future comment chains on future posts! For the sake of completeness, to answer your questions though:
- L(D), the N → ∞ limit of L(N, D)
  - meaning: the peak data efficiency possible with this model class
- L(N), the D → ∞ limit of L(N, D)
  - meaning: the scaling of loss with parameters when not data-constrained but still using early stopping
If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit. Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold L_{AGI}). Ajeya’s approach essentially assumes that we’ll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude. I’m not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits L_{AGI}. See also my post here.
Comment
Huh, thanks, now I’m more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:

1. In my discussion with Rohin I said: Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren’t directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data. Do you agree or disagree? My guess is that you’d disagree, since you say:

> If the heuristic gives you an answer that seems very high, that doesn’t mean the model is "not as data efficient as you expected." Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.

which I take to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don’t think that… OK, yeah, I’m just very confused here, please help!)

2. You say "This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source)." Well, isn’t it both? You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.

3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when we hit AGI we won’t be on the 45-degree line but rather will be constrained by model size or by data and so will be hugging one of the other two lines)
Comment
- A law that assumes a sufficiently large dataset (i.e. an effectively infinite dataset), and tells you how to manage the tradeoff between steps S and params N.
- The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size D vs params N.

I said the D vs N law was "not a heuristic for managing compute" because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting. However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train. Comparing the two laws yields the "breakdown" or "kink point."
Comment
I’ve read your linked post thrice now, it’s excellent, any remaining confusions are my fault. I didn’t confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with: " The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance. " was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you’d disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin. I’m glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Comment
Actually, I think I spoke too soon about the visualization… I don’t think your image of L(D) and L(N) is quite right. Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it’s easy to see indifference curves of the loss. https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis. That is, in this regime, N doesn’t matter and loss is effectively a function of D alone.
This is L(D).
It looks like the color changes you see if you move horizontally through the upper left region.
Likewise, in the lower right region, D doesn’t matter and loss depends on N alone.
This is L(N).
It looks like the color changes you see if you move vertically through the lower right region.
To restate my earlier claims… If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting e.g. (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12). This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.

On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph. When I said that it’s intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach. And that’s going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.

Asking "what could we do with an N=1e15 model?" (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region … or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.

In Ajeya’s work, this question means "let’s assume we’re using an N=1e15 model, and then let’s assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let’s figure out how big D has to be to get there." So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as "the performance which you could only reach with N=1e15 params".

What feels weird to me—which you touched on above—is the way this lets the scaling relations "backseat drive" the definition of sufficient quality for AGI.
Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it… we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
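The "waste of N" claim above can be checked numerically with a minimal sketch of the combined L(N, D) form from Kaplan et al. The exponents and constants below are my recollection of the paper’s fitted values and should be treated as assumptions; the formula also omits any irreducible loss, and N=1e15 is far outside the range the fit was made on, so only the qualitative shape is meaningful.

```python
# Assumed constants, recalled from the Kaplan et al. fits (treat as approximate).
ALPHA_N = 0.076   # exponent for parameter count
ALPHA_D = 0.095   # exponent for dataset size
N_C = 8.8e13      # parameter-count scale (non-embedding params)
D_C = 5.4e13      # dataset-size scale (tokens)


def loss(n_params: float, n_tokens: float) -> float:
    """Combined loss L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D


def loss_data_limited(n_tokens: float) -> float:
    """L(D): the N -> infinity limit of L(N, D)."""
    return (D_C / n_tokens) ** ALPHA_D


def loss_param_limited(n_params: float) -> float:
    """L(N): the D -> infinity limit of L(N, D)."""
    return (N_C / n_params) ** ALPHA_N


if __name__ == "__main__":
    for n, d in [(1e12, 1e12), (1e15, 1e12), (1e15, 1e15)]:
        print(f"N={n:.0e}, D={d:.0e}: L = {loss(n, d):.3f}")
    print(f"L(D) at D=1e12: {loss_data_limited(1e12):.3f}")
```

Under these assumed constants, adding three OOMs of parameters at fixed D=1e12 buys only a small drop in loss and lands almost exactly on the L(D) limit, while scaling D along with N gives a much larger drop.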
Comment
OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!

Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10e15 params or so, it must be doing some fairly important tasks which *can’t be done* (at least not as well) for much less than 10e15 params. Like, *maybe* a 10e13 param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t *that* big a deal, such that we can be fairly confident that these tasks require an NN of 10e13 or more params.

The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to "within a few OOMs of 10e15." Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10e15 params is a task which requires 10e15 data points also. Because otherwise we could reduce parameter count below 10e15 and keep the same performance. So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.

But I am now intensely curious as to how many "data points" the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10e15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.

Is the second possibility plausible? I guess so. There are 10^9 seconds in a human lifetime, so if you are processing a million data points a second… Huh, that seems a bit much. What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that’s why you need more data to do better—but that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing—"only" a billion data points in a lifetime, but really high-quality ones and good mechanisms for focusing on the right stuff to update on of all your sensory data coming in?

And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (Well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)

What do you think of these three possibilities?
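The back-of-the-envelope numbers in this comment can be reproduced with a small sketch. It takes the comment’s own round figures as given (10^9 seconds per lifetime, a target of 10^15 data points, roughly a hundred data points per second of experience); the function names are made up for illustration.

```python
LIFETIME_SECONDS = 1e9  # round figure used in the comment above


def required_rate(target_points: float, seconds: float = LIFETIME_SECONDS) -> float:
    """Data points per second needed to reach `target_points` in one lifetime."""
    return target_points / seconds


def lifetime_points(points_per_second: float, seconds: float = LIFETIME_SECONDS) -> float:
    """Total data points accumulated over a lifetime at a given rate."""
    return points_per_second * seconds


if __name__ == "__main__":
    print(f"rate needed for 1e15 points: {required_rate(1e15):.0e} per second")
    print(f"points at 1 per second:      {lifetime_points(1):.0e}")
    print(f"points at 100 per second:    {lifetime_points(100):.0e}")
```

This is only the arithmetic; what counts as a "data point" for a brain is the open question the comment is asking about.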
I don’t think this step makes sense:
Update: According to this the human brain actually is getting ~10^7 bits of data every second, although the highest level conscious awareness is only processing ~50. So insofar as we go with the "tokens" definition, it does seem that the human brain is processing plenty of tokens for its parameter count -- 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the "single pass through the network" definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250 − 1000 times per second, which means 10^11.5 or so… actually this more or less checks out I’d say. Assuming it’s the max rate that matters and not the average rate (the average rate is about once per second). Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience, just as GPT-3 can "few-shot learn" totally new tasks, and also "fine-tune" on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!) but really what’s going on is just transfer learning from its vast pre-training experience.
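For anyone who wants to check the orders of magnitude in this update, here is a rough sketch of the arithmetic. It takes the figures quoted in the thread at face value (10^9 seconds per lifetime, ~10^7 bits per second, a maximum firing rate of 250 to 1000 per second); the variable and function names are just illustrative.

```python
import math

# Figures quoted in the thread; treat them as rough inputs, not measurements.
SECONDS_PER_LIFETIME = 1e9
BITS_PER_SECOND = 1e7
MAX_FIRING_RATES_HZ = (250, 1000)


def log10_lifetime_total(rate_per_second, seconds=SECONDS_PER_LIFETIME):
    """Order of magnitude of (rate * lifetime), as a base-10 exponent."""
    return math.log10(rate_per_second * seconds)


# "Tokens" definition: one data point per bit of sensory input.
tokens_oom = log10_lifetime_total(BITS_PER_SECOND)

# "Single pass through the network" definition: one data point per firing
# at the maximum rate, giving a low and a high estimate.
passes_oom = [log10_lifetime_total(r) for r in MAX_FIRING_RATES_HZ]

print(f"tokens per lifetime: ~10^{tokens_oom:.1f}")
print(f"passes per lifetime: ~10^{passes_oom[0]:.1f} to ~10^{passes_oom[1]:.1f}")
```

Under these inputs the "tokens" count lands at 10^16, and the "passes" count falls between roughly 10^11.4 and 10^12, which brackets the "10^11.5 or so" figure.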
What do you mean by horizon length here?
I intended to mean something similar to what Ajeya meant in her report:

> I’ll define the "effective horizon length" of an ML problem as the amount of data it takes (on average) to tell whether a perturbation to the model improves performance or worsens performance. If we believe that the number of "samples" required to train a model of size P is given by KP, then the number of subjective seconds that would be required should be given by HKP, where H is the effective horizon length expressed in units of "subjective seconds per sample."

To be clear, I’m still a bit confused about the concept of horizon length. I’m not sure it’s a good idea to think about things this way. But it seems reasonable enough for now.
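As a minimal sketch of the HKP relationship in that definition: the function names and the example values of H, K, and P below are hypothetical placeholders chosen only to show how the quantity scales, not numbers from the report.

```python
def samples_needed(k, p):
    """Samples required to train a model of size p, assuming samples = K * P."""
    return k * p


def subjective_seconds(h, k, p):
    """Total subjective seconds of training: H * K * P.

    h is the effective horizon length in subjective seconds per sample.
    """
    return h * samples_needed(k, p)


# Hypothetical illustration: same model size and K, two horizon lengths.
P = 1e15          # parameter count (placeholder)
K = 1.0           # samples per parameter (placeholder)
short_horizon = subjective_seconds(h=1.0, k=K, p=P)
long_horizon = subjective_seconds(h=1000.0, k=K, p=P)

print(f"short horizon: {short_horizon:.1e} subjective seconds")
print(f"long horizon:  {long_horizon:.1e} subjective seconds")
```

The point of the sketch is only that the cost is linear in H: with P and K held fixed, a 1000x longer horizon means 1000x more subjective seconds.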
I’ve been working on a draft blog post kinda related to that. If you’re interested, I can DM you a link; it could use a second pair of eyes.
Sure!
Nothing in this post directly contradicts anything in Ajeya’s report. The conflict, insofar as there is any, is in that Part 3 I mentioned, where I sketch an argument for long timelines based on data-efficiency. That argument sketch was inspired by what Ajeya said; it’s what my model of her (and of you) would say. Indeed it’s what you are saying now (e.g. you are saying the scaling laws tell us how data-efficient our models will be once they are bigger, and it’s still not data-efficient enough to be transformative, according to you.) I think. So, the only conflict is external to this post I guess: I think this is a decent argument but I’m not yet fully convinced, whereas (I think) you and Ajeya think it or something like it is a more convincing argument. I intend to sleep on it and get back to you tomorrow with a more considered response.
Great post!
I would drop the "brute-force" here (evolution is not a random/naive search).
Re the footnote:
I don’t see how they are similar.
Thanks! Fair enough re: brute force; I guess my problem is that I don’t have a good catchy term for the level of search evolution does. It’s better than pure random search, but a lot worse than human-intelligent search. I think "how much compute would lead to TAI given 2020’s algorithms" is sort of an operationalization of "how much special sauce is needed." There are three ways to get special sauce: Brute-force search, awesome new insights, or copy it from the brain. "given 2020’s algorithms" rules out two of the three. It’s like operationalizing "distance to Edinburgh" as "time it would take to get to Edinburgh by helicopter."
My understanding is that the 2020 algorithms in Ajeya Cotra’s draft report refer to algorithms that train a neural network on a given architecture (rather than algorithms that search for a good neural architecture etc.). So the only "special sauce" that can be found by such algorithms is one that corresponds to special weights of a network (rather than special architectures etc.).
Huh, that’s not how I interpreted it. I should reread the report. Thanks for raising this issue.
"automated search"?
I tentatively agree? Given what people I respect were saying about how AIs are less data-efficient than humans, I certainly ended up quite surprised by EfficientZero. But those people haven’t reacted much to it, don’t seem to be freaking out, etc. so I guess I misunderstood their view and incorrectly thought it would be surprised by EfficientZero. But now I’m just confused as to what their view is, because it sure seems like EfficientZero is comparably data-efficient to humans, despite lacking pre-training and despite being much smaller…
Have you had a chance to ask these people if they’re surprised, and why not if not?
For my part, I kinda updated towards "Well, actually data efficiency isn’t quite exactly what I care about, and EfficientZero is gaming / Goodharting that metric in a way that dissociates it from the thing that I care about". See here. Yeah, I know it totally sounds like special pleading / moving the goalposts. Oh well. For example, I consider "running through plans one-timestep-at-a-time" to be a kind of brute force way to make plans and understand consequences, and I’m skeptical of that kind of thing scaling to "real world intelligence and common sense". By contrast, the brain can do flexible planning at multiple levels of an abstraction hierarchy, that it can build and change in real time, like how "I’m gonna go to the store and buy cucumbers" is actually millions of motor actions. EfficientZero still retains that brute-force aspect, seems to me. It just rejiggers things so that the brute-force aspect doesn’t count as "data inefficiency".
Thanks, this is helpful! I think for timelines though… EfficientZero can play an Atari game for 2 subjective hours and get human-level ability at it. That’s, like, 1000 little 7-second clips of gameplay—maybe 1000 ‘lives,’ or 1000 data points. Make a list of all the "transformative tasks" and "dangerous tasks" etc. and then go down the list and ask: Can we collect 1000 data points for this task? How many subjective hours is each data point? Remember, humans have at most about 10,000 hours on any particular task. So even if it takes 20,000 hours for each data point, that’s only 20M subjective hours total… which is only 7 OOMs more than EfficientZero had in training. EfficientZero costs 1 day on $10K worth of hardware. Imagine it is 2030, hardware is 2 OOMs cheaper, and people are spending $10B on hardware and running it for 100 days. That’s 10 OOMs more compute to work with. So, we could run EfficientZero for 7 OOMs longer, and thereby get our 1000 data points of experience, each 20,000 hours long. And if EfficientZero could beat humans in data-efficiency for Atari, why wouldn’t it also beat humans for data-efficiency at this transformative task / dangerous task? Especially because we only used 7 of our 10 available OOMs, so we can also make it 1000x larger if we want to. And this argument only has to work for at least one transformative / dangerous task, not all of them. This is a crude sketchy argument of course, but you see what I’m getting at? ETA: I’m attacking the view that by 2030ish we’ll have AIs that can do all the short-horizon tasks, but long-horizon tasks will only come around 2040 or 2050 because it takes a lot more compute to train on them because each data point requires a lot more subjective time.
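Here is a rough sketch of the OOM bookkeeping in this argument, using only the figures stated in the comment (2 subjective hours for EfficientZero, 1000 data points of 20,000 hours each, $10K for 1 day today versus $10B for 100 days on hardware that is 2 OOMs cheaper). The variable names are illustrative, and the whole thing inherits the crudeness of the original estimate.

```python
import math

# Experience needed, relative to EfficientZero's Atari training.
ez_subjective_hours = 2
data_points = 1000
hours_per_data_point = 20_000
total_hours = data_points * hours_per_data_point
experience_ooms = math.log10(total_hours / ez_subjective_hours)

# Compute available in the imagined 2030 scenario, relative to today.
hardware_price_drop_ooms = 2                 # hardware 2 OOMs cheaper
spend_ooms = math.log10(10e9 / 10e3)         # $10B instead of $10K
duration_ooms = math.log10(100 / 1)          # 100 days instead of 1 day
compute_ooms = hardware_price_drop_ooms + spend_ooms + duration_ooms

# Whatever is left over could go toward a larger model.
spare_ooms = compute_ooms - experience_ooms

print(f"total subjective hours needed: {total_hours:,}")
print(f"extra experience: {experience_ooms:.1f} OOMs")
print(f"extra compute:    {compute_ooms:.1f} OOMs")
print(f"left over:        {spare_ooms:.1f} OOMs (~{10 ** spare_ooms:,.0f}x)")
```

With these inputs, 1000 data points at 20,000 hours each come to 2×10^7 subjective hours, which is 7 OOMs above 2 hours, and the compute budget grows by 10 OOMs, leaving about 3 OOMs for scaling up the model.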
Let’s say we want our EfficientZero-7 to output good alignmentforum blog posts. We have plenty of training data, in terms of the finished product, but we don’t have training data in terms of the "figuring out what to write" part. That part happens in the person’s head. (Suppose the test data is a post containing Insight X. If we’re training a network to output that post, the network updates can lead to *the ability to figure out *Insight X, or can lead the network to *already know *Insight X. Evidence from GPT-3 suggests that the latter is what would actually happen, IMO.) So then maybe you’ll say: Someone will get the AGI safety researcher to write an alignmentforum blog post while wearing a Kernel Flux brain-scanner helmet, and make EfficientZero-7 build a model from that. But I’m skeptical that the brain-scan data would sufficiently constrain the model so that it would learn how to "figure things out". Brain scans are too low-resolution, too noisy, and/or too incomplete. I think they would miss pretty much all the important aspects of "figuring things out". I think if we had a sufficiently good operationalization of "figuring things out" to train EfficientZero-7, we could just use that to build a "figuring things out" AGI directly instead. That’s my guess anyway. Then maybe your response would be: Writing alignmentforum blog posts is a bad example. Instead let’s build silicon-eating nanobots. We can run a slow expensive molecular-dynamics simulation running on a supercomputer, and we can have EfficientZero-7 query it, watch it, build its own "mental model" of what happens in a molecular simulation, and recapitulate that model on cheaper faster GPUs. And we can put in some kind of score that’s maximized when you query the model with the precursors to a silicon-eating nanobot. I can get behind that kind of story; indeed, I would not be surprised to see papers along those general lines popping up on arxiv tomorrow, or indeed years ago. 
But I would describe that kind of thing as "pivotal acts that require only narrow AI". I’m not an expert on pivotal acts, and I’m open-minded to the possibility that there are "pivotal acts that require only narrow AI". And I’m also open-minded to the possibility that we can’t do those acts today, because they require too much querying of expensive-to-query stuff like molecular simulation code or humans or real-world actuation, and that future narrow-AI advances like EfficientZero-7 will solve that problem. I guess I’m modestly skeptical, but I suppose there are unknown unknowns (to me), and certainly I haven’t spent enough time thinking about pivotal acts to have any confidence.
I wasn’t imagining this being a good thing that helps save the world; I was imagining it being a world-ending thing that someone does anyway because they don’t realize how dangerous it is. I totally agree that the two examples you gave probably wouldn’t work. How about this though:
--Our task will be: Be a chatbot. Talk to users over the course of several months to get them to give you high marks in a user satisfaction survey.
--Pre-train the model on logs of human-to-human chat conversations so you have a reasonable starting point for making predictions about how conversations go.
--Then run the EfficientZero algorithm, but with a massively larger parameter count, and talking to hundreds of thousands (millions?) of humans for several years. It would be a very expensive, laggy chatbot (but the user wouldn’t care since they aren’t paying for it and even with lag the text comes in about as fast as a human would reply).
Seems to me this would "work" in the sense that we’d all die within a few years of this happening, on the default trajectory.
In a similar conversation about non-main-actor paths to dangerous AI I came up with this as an example of a path I can imagine being plausible and dangerous:
A plausible-to-me worst case scenario would be something like: A phone-scam organization employs someone to build them an online-learning reinforcement learning agent (using an open-source language model as a language-understanding component) that functions as a scam-helper. It takes in the live transcription of the ongoing conversation between a scammer and a victim, and gives the scammer suggestions for what to say next to persuade the victim to send money. So long as it was even a bit helpful sometimes according to the team of scammers using it, more resources would be given to it and it would continue to collect useful data.
I think this scenario contains a number of dangerous aspects:
- being illegal and secret, not subject to ethical or safety guidance or regulation
- deliberately being designed to open-endedly self-improve
- bringing in incremental resources as it trains to continue to prove its worth (thus not needing a huge initial investment of training cost)
- being agentive and directed at the specific goal of manipulating and deceiving humans
I don’t think we need 10 more years of progress in algorithms and compute for this story to be technologically feasible. A crude version of this is possibly already in use, and we wouldn’t know.
Not yet! I didn’t want to bother them. I have been closely following (and asking questions in) all the LW discussions about EfficientZero, but they haven’t shown up. Maybe I should just message them directly… I should also go reread Ajeya’s report because the view is explained there IIRC.
I like the bird-plane analogy. I kind of had the same idea, but for a slightly different reason: just as man-made flying machines can be superior to birds in a lot of aspects, man-made AI will most likely be superior to a human mind in a similar way. Regarding your specific points: they may be valid; however, we do not know which point in time we are at, whether we are talking about flying or AI. Probably a lot of similar arguments could have been made by Leonardo da Vinci when he was designing his flying machine; most likely he understood a lot more about birds and the way they fly than any of his contemporaries or predecessors; yet, he had no chance to succeed for at least three additional centuries. So are we in the era of the Wright Brothers of A.I., or are we still only at da Vinci’s? I personally think the former is more likely, but I believe the probability of the second one is a lot greater than zero.
Sorry for not making this clear—I agree the probability distribution should be stretched out. I think Longs’ argument is bogus, in the sense of being basically zero evidence for its conclusion as currently stated—but the conclusion may still be right, because there are more fleshed-out arguments one could make that are much better. For example, as you point out, I didn’t really investigate the issue of whether or not Shorty properly identified the key variables in the case of TAI. I think a really good way to critique Shorty is to argue that those aren’t the key variables, or at least that they probably aren’t. As it happens, I do think those are probably the key variables, but I haven’t argued for that yet, and I am still rather uncertain. (I think Longs’ argument that those aren’t the key variables is bad though. It’s too easy to point to things we currently don’t understand; see e.g. how many things we didn’t understand about birds or flight in 1900! Better would be to have an alternative theory of what the key variables are, or a more direct rebuttal of Shorty’s theory of key variables by showing that it makes some incorrect prediction or something.)
I think this is a good point, but I’d flag that the analogy might give the impression that intelligence is easier than it is—while animals have evolved flight multiple times by different paths (birds, insects, pterosaurs, bats) implying flight may be relatively easy, only one species has evolved intelligence.
Hmmm, this is a good point—but here’s a counter that just now occurred to me: Let’s disambiguate "intelligence" into a bunch of different things. Reasoning, imitation, memory, data-efficient learning, … the list goes on. Maybe the complete bundle has only evolved once, in humans, but almost every piece of the bundle has evolved separately many times. In particular, the number 1 thing people point to as a candidate X for "X is necessary for TAI and we don’t know how to make AIs with X yet and it’s going to be really hard to figure it out soon" is data-efficient learning. But data-efficient learning *has *evolved separately many times; AlphaStar may need thousands of years of Starcraft to learn how to play, but dolphins can learn new games in minutes. And these are games with human trainers, who are obviously way out of distribution as far as dolphins’ ancestral environment is concerned. The number 2 thing I hear people point to is "reasoning" and maybe "causal reasoning" in particular. I venture to guess that this has evolved a bunch of times too, based on how various animals can solve clever puzzles to get pieces of food. (See also: https://www.lesswrong.com/posts/GMqZ2ofMnxwhoa7fD/the-octopus-the-dolphin-and-us-a-great-filter-tale )
Someone who actually knows something about taxonomic phylogeny of neural traits would need to say for sure, but the fact that many species share neural traits doesn’t necessarily mean those traits evolved many times independently as flight did. They could have inherited the traits from a common ancestor. I have no idea if anyone has any clue whether "data efficient learning" falls into the "came from a single common ancestor" category or the "evolved independently in many disconnected trees" category. It is not a trait that leaves fossil evidence.
I think all the things we identify as "intelligence" (including data-efficient learning) are things that the neocortex does, working in close conjunction with the thalamus (which might as well be a 7th layer of the neocortex), hippocampus (temporarily stores memories before gradually transferring them back to the neocortex because the neocortex needs a lot of repetition to learn), basal ganglia (certain calculations related to reinforcement learning including the value function calculation I think), and part of the cerebellum (you can have human-level intelligence without a cerebellum, but it does help speed things up dramatically, I think mainly by memoizing neocortex calculations). Anyway, it’s not 100% proven, but my read of the evidence is that the neocortex in mammals is a close cousin of the pallium in lizards and birds and dinosaurs, and the neocortex & bird/lizard pallium do the same calculations using the same neuronal circuits descended from the same ancestor which also did those calculations. The neurons are arranged differently in space in the neocortex vs pallium, but that doesn’t matter, the network is what matters. Some early version of the pallium dates back at least as far as lampreys, if memory serves, and I would not be remotely surprised if the lamprey proto-pallium (whatever it’s called) did data-efficient learning, albeit learning relatively simple things like 1D time-series data or 3D environments. (That doesn’t sound like it has much in common with human intelligence and causal reasoning and rocket science but I think it really does...long story...) Paul Cisek wrote this paper which I found pretty thought-provoking. He’s now diving much deeper into that and writing a book, but says he won’t be done for a few years. I don’t know anything about octopuses by the way. That could be independent.
Fair enough—maybe data efficient learning evolved way back with the dinosaurs or something. Still though… I find it more plausible that it’s just not that much harder than flight (and possibly even easier).
Yeah, that’s fair—it’s certainly possible that the things that make intelligence relatively hard for evolution may not apply to human engineers. OTOH, if intelligence is a bundle of different modules that all coexist in humans and that different animals have evolved in various proportions, that seems to point away from the blank slate/"all you need is scaling" direction.
Thanks for writing this, the power-to-weight statistics are quite interesting. I have another, longer reply with my own take (edit: comments about the graph, that is) in the works, but while writing it, I started to wonder about a tangential question:
UPDATE: I just reread Ajeya’s report and actually her version of the human lifetime anchor is shifted +3 OOMs because she’s trying to account for how humans have priors, special sauce, etc. in them given by evolution. So… I’m pretty perplexed. Even after shifting the anchor +3 OOMs to account for special sauce etc. she still assigns only 5% weight to it! Note that if you just did the naive thing, which is to look at the 41-OOM cost of recapitulating evolution as a loose upper bound, and take (say) 85% of your credence and divide it evenly between all the orders of magnitude less than that but more than where we are now… you’d get something like 5% per OOM, which would come out to 25% or so for the human lifetime anchor!
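A small sketch of the "naive thing" described above. Loud caveat: the comment only states the inputs 85% and the outputs "something like 5% per OOM" and "25% or so"; the span of 17 OOMs between current compute and the evolution bound, and the 5-OOM width attributed to the human lifetime anchor, are back-solved assumptions for illustration, not figures from the report.

```python
def credence_per_oom(total_credence, n_ooms):
    """Spread a block of credence evenly across n_ooms orders of magnitude."""
    return total_credence / n_ooms


TOTAL_CREDENCE = 0.85   # stated in the comment
SPAN_OOMS = 17          # ASSUMPTION: back-solved from "5% per OOM"
ANCHOR_WIDTH_OOMS = 5   # ASSUMPTION: back-solved from "25% or so"

per_oom = credence_per_oom(TOTAL_CREDENCE, SPAN_OOMS)
anchor_weight = per_oom * ANCHOR_WIDTH_OOMS

print(f"credence per OOM: {per_oom:.0%}")
print(f"weight on a {ANCHOR_WIDTH_OOMS}-OOM-wide anchor: {anchor_weight:.0%}")
```

Under those assumed widths, the uniform prior puts about five times as much weight on the human lifetime anchor as the 5% mentioned above.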
Thanks, and I look forward to seeing your reply! I’m partly responding to things people have said in conversation with me. For example, the thing Longs says is a direct quote from one of my friends commenting on an early draft! I’ve been hearing things like this pretty often from a bunch of different people. I’m also partly responding to Ajeya Cotra’s epic timelines report. It’s IMO the best piece of work on the topic there is, and it’s also the thing that bigshot AI safety people (like OpenPhil, Paul, Rohin, etc.) seem to take most seriously. I think it’s right about most things but one major disagreement I have with it is that it seems to put too much probability mass on "Lots of special sauce needed" hypotheses. Shorty’s position—the "not very much special sauce" position—applied to AI seems to be that we should anchor on the Human Lifetime anchor. If you think there’s probably a little special sauce but that it can be compensated for via e.g. longer training times and bigger NNs, then that’s something like the Short Horizon NN hypothesis. I consider Genome Anchor, Medium and Long-Horizon NN Anchor, and of course Evolution Anchor to be "lots of special sauce needed" views. In particular, all of these views involve, according to Ajeya, "Learning to Learn:" I’ll quote her in full:
What’s your best estimate for the amount of time it will take us to get to TAI?
That was an exciting graph! However, the labeling would be more consistent if it were steam engines, piston engines, and turbine engines OR stationary, ship/barge, train, automobile, and aircraft (I assume you mean airplanes and helicopters and you excluded rockets).
Yeah, I guess it should have been steam engines, automobile engines, and aircraft engines. (The steam engines were partly for trains, partly stationary, partly for other things iirc).
The wrong analogies to flight don’t help much if a) you don’t know what you’re looking for and would need +80 OOM to "search" for a solution like evolution did (which you will never have) b) you have no idea what intelligence is about (hint: it is NOT just about optimization, see (a)). If TAI were near I would expect Q) more work in the field of AGI and way more AGI architectures, even with evolutionary / DL / latest claptrap hype of ML T) more companies betting on AGI U) a lot of strange ASI/AGI theories V) a lot of work on RSI W) autonomous robots roaming the streets (AGI/TAI has to be autonomous so this is a prerequisite). All of this is not happening and won’t happen for at least 20 years, so TAI is not "near".