So there’s this thing where GPT-3 is able to do addition, it has the internal model to do addition, but it takes a little poking and prodding to actually get it to do addition. "Few-shot learning", as the paper calls it. Rather than prompting the model with

    Q: What is 48 + 76? A:

… instead prompt it with

    Q: What is 48 + 76? A: 124
    Q: What is 34 + 53? A: 87
    Q: What is 29 + 86? A:

The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can "figure out what we’re asking for".

This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not to build a system which can do the thing we want, the hard part is to specify the thing we want in such a way that the system actually does it.

The GPT family of models are trained to mimic human writing. So the prototypical "alignment problem" on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want.

Viewed through that frame, "few-shot learning" just designs a prompt by listing some examples of what we want—e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better?

Indeed, people are already noticing clever ways to get better results out of GPT-3 - e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so.
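For concreteness, here is a minimal sketch of what "few-shot prompt design" amounts to in code. The `complete()` call at the end is a hypothetical stand-in for whatever text-completion interface the model exposes, not OpenAI's actual client API:

```python
def build_few_shot_prompt(examples, query):
    """Turn (question, answer) examples plus a new question into a prompt.

    The bet is that human writing which began with this prompt would
    naturally continue with the correct answer to the final question.
    """
    lines = [f"Q: {q} A: {a}" for q, a in examples]
    lines.append(f"Q: {query} A:")
    return "\n".join(lines)


examples = [("What is 48 + 76?", "124"), ("What is 34 + 53?", "87")]
prompt = build_few_shot_prompt(examples, "What is 29 + 86?")
print(prompt)

# The prompt would then be handed to the model's completion endpoint,
# e.g. (hypothetical call, not a real library function):
# answer = complete(prompt, max_tokens=4)
```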
Capabilities vs Alignment as Bottleneck to Value
I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want.

Important point: this is worded to be agnostic to the details of the GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem—e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it. In other words, there’s a clear distinction between alignment and capabilities:
- alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want
- capabilities are mainly about GPT’s model, and ask about how well GPT-generated writing matches realistic human writing

Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?

In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited "working memory" and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt—e.g. writing ad copy or editing humans’ writing.

In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have much of a trajectory at all yet; designing-writing-prompts-such-that-writing-which-starts-with-the-prompt-contains-the-thing-you-want isn’t exactly a hot research area. There’s probably low-hanging fruit there for now, and it’s largely unclear how hard the problem will be going forward.

Two predictions on this front:
- With this version of GPT and especially with whatever comes next, we’ll start to see a lot more effort going into prompt design (or the equivalent alignment problem for future systems).
- As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do.

Reasoning for the first prediction: GPT-3 is right on the borderline of making alignment economically valuable—i.e. it’s at the point where there’s plausibly some immediate value to be had by figuring out better ways to write prompts. That means there’s finally going to be economic pressure for alignment—there’s going to be ways to make money by coming up with better alignment tricks. That won’t necessarily mean economic pressure for generalizable or robust alignment tricks, though—most of the economy runs on ad-hoc barely-good-enough tricks most of the time, and early alignment tricks will likely be the same. In the longer run, focus will shift toward more robust alignment, as the low-hanging problems are solved and the remaining problems have most of their value in the long tail.

Reasoning for the second prediction: how do I write a prompt such that human writing which began with that prompt would contain a workable explanation of a cheap fusion power generator? In practice, writing which claims to contain such a thing is generally crackpottery. I could take a different angle, maybe write some section-headers with names of particular technologies (e.g. electric motor, radio antenna, water pump, …) and descriptions of how they work, then write a header for "fusion generator" and let the model fill in the description. Something like that could plausibly work. Or it could generate scifi technobabble, because that’s what would be most likely to show up in such a piece of writing today. It all depends on which is "more likely" to appear in human writing.

Point is: GPT is trained to mimic human writing; getting it to write things which humans cannot currently write is likely to be hard, even if it has the requisite capabilities.
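As a rough illustration of the section-header prompt idea mentioned above (purely a sketch of the prompt shape; the filled-in descriptions and headers are illustrative, and the model would be asked to continue after the final header):

```
## Electric motor
Converts electrical energy into rotational motion via the force on a
current-carrying conductor in a magnetic field...

## Water pump
Moves water by mechanical action, typically an impeller driven by a motor...

## Fusion generator
```

Whether the continuation comes out as workable engineering or as technobabble depends entirely on which is more likely to follow such headers in human writing.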
I wonder how long we’ll be in the "prompt programming" regime. As Nick Cammarata put it:
Comment
I think that going forward there’ll be a spectrum of interfaces to natural language models. At one end you’ll have fine-tuning, and at the other you’ll have prompts. The advantage of fine-tuning is that you can actually apply an optimizer to the task! The advantage of prompts is anyone can use them. In the middle of the spectrum, two things I expect are domain-specific tunings and intermediary models. By ‘intermediary models’ I mean NLP models fine-tuned to take a human prompt over a specific area and return a more useful prompt for another model, or a set of activations or biases that prime the other model for further prompting. The ‘specific area’ could be as general as ‘less flights of fancy please’.
The problem with directly manipulating the hidden layers is reusability. If we directly manipulate the hidden layers, then we have to redo that whenever a newer, shinier model comes out, since the hidden layers will presumably be different. On the other hand, a prompt is designed so that human writing which starts with that prompt will likely contain the thing we want—a property mostly independent of the internal structure of the model, so presumably the prompt can be reused.

I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the "corresponding" internal notion. Assuming that the first model has learned a real pattern which is actually present in the environment, we should expect that "better" models will also have some structure corresponding to that pattern—otherwise they’d lose predictive power on at least the cases where that pattern applies. Ideally, this would all happen in such a way that the second model can be more accurate, and that increased accuracy would be used.

In the shorter term, I agree OpenAI will probably come up with some tricks over the next year or so.
Comment
Can’t you just run the model in a generative mode associated with that internal notion, then feed that output as a set of observations into your new model and see what lights up in its mind? This should work as long as both models predict the same input modality. I could see this working pretty well for matching up concepts between the latent spaces of different VAEs. Doing this might be a bit less obvious in the case of autoregressive models, but certainly not impossible.
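A minimal sketch of that idea for two VAEs, assuming both share an input modality and expose `decode`/`encode` methods (the method and attribute names are placeholders for illustration, not any particular library's API): sweep the latent unit that encodes the concept in the old model, decode, re-encode with the new model, and see which of its latent units track the sweep.

```python
import torch

def match_concept(old_vae, new_vae, concept_dim, n_samples=512, sweep=3.0):
    """Find which latent unit in new_vae tracks old_vae's `concept_dim`.

    Assumes (for illustration) that both models expose decode(z) and
    encode(x) -> mean latents, and that old_vae has a `latent_dim` attribute.
    """
    z = torch.randn(n_samples, old_vae.latent_dim)
    values = torch.linspace(-sweep, sweep, n_samples)
    z[:, concept_dim] = values                 # sweep the concept dimension
    with torch.no_grad():
        samples = old_vae.decode(z)            # "generative mode" for the concept
        new_latents = new_vae.encode(samples)  # what lights up in the new model?
    # Correlate each new latent unit with the swept concept value.
    centered = new_latents - new_latents.mean(dim=0)
    cov = (centered * (values - values.mean()).unsqueeze(1)).mean(dim=0)
    corr = cov / (new_latents.std(dim=0) * values.std() + 1e-8)
    return corr.abs().argmax().item()          # best-matching unit in the new model
```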
Comment
This works if both (a) both models are neural nets, and (b) the "concept" cleanly corresponds to one particular neuron. You could maybe loosen (b) a bit, but the bottom line is that the nets have to represent the concept in a particular way—they can’t just e.g. run low-level physics simulations in order to make predictions. It would probably allow for some cool applications, but it wouldn’t be a viable long-term path for alignment with human values.
Comment
I think you can loosen (b) quite a bit if you task a separate model with "delineating" the concept in the new network. The procedure does effectively give you access to infinite data, so the boundary for the old concept in the new model can be as complicated as your compute budget allows. Up to and including identifying high level concepts in low level physics simulations.
Comment
We currently have no criteria by which to judge the performance of such a separate model. What do we train it to do, exactly? We could make up some ad-hoc criterion, but that suffers from the usual problem of ad-hoc criteria: we won’t have a reliable way to know in advance whether it will or will not work on any particular problem or in any particular case.
Comment
The way I was envisioning it is that if you had some easily identifiable concept in one model, e.g. a latent dimension/feature that corresponds to the log odds of something being in a picture, you would train the new classifier to match the behaviour of that feature when given data from the original generative model. Theoretically any loss function will do as long as the optimum corresponds to the situation where your "classifier" behaves exactly like the original feature in the old model when both of them are looking at the same data.
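A rough sketch of that training setup (the interfaces below—`concept_feature`, `representation`, `representation_dim`, `data_sampler`—are placeholders for illustration, not a real API): fit a small probe so that its output on the new model's internals matches the old model's feature on the same inputs.

```python
import torch
import torch.nn as nn

def train_concept_probe(old_model, new_model, data_sampler, steps=1000):
    """Train a probe so probe(new_model's internals) reproduces the old
    model's concept feature on the same inputs (placeholder interfaces)."""
    rep_dim = new_model.representation_dim
    probe = nn.Sequential(nn.Linear(rep_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        x = data_sampler()                        # e.g. samples drawn from the old generative model
        with torch.no_grad():
            target = old_model.concept_feature(x) # the feature we want to relocate
            rep = new_model.representation(x)     # the new model's internals
        loss = nn.functional.mse_loss(probe(rep).squeeze(-1), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```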
In practice though, we’re compute bound and nothing is perfect and so you need to answer other questions to determine the objective. Most of them will be related to why you need to be able to point at the original concept of interest in the first place. The acceptability of misclassifying any given input or world-state as being or not being an example of the category of interest is going to depend heavily on things like the cost of false positives/negatives and exactly which situations get misclassified by the model.
The thing about it working or not working is a good point though, and how to know that we’ve successfully mapped a concept would require a degree of testing, and possibly human judgement. You could do this by looking for situations where the new and old concepts don’t line up, and seeing what inputs/world states those correspond to, possibly interpreted through the old model with more human understandable concepts.
I will admit upon further reflection that the process I’m describing is hacky, but I’m relatively confident that the general idea would be a good approach to cross-model ontology identification.
Have we ever figured out a way to interface with what something has learned that doesn’t involve language prompts? I’m serious. What other options are you trying to hint at? I think manipulating hidden layers is a terrible approach, but I won’t expound on that here.
Comment
Any sort of probabilistic model offers the usual interpretations of the probabilities as an interface. For instance, I can train an LDA topic model, look at the words in the learned topics, pick a topic I’m interested in, then look at that topic’s weighting in each document in order to find relevant documents. More generally, I can train any clustering model, pick a cluster I’m interested in, then look for more things in that cluster. Or if I train a causal model, I can often interpret the learned parameters as estimates of physical interactions in the world. In each case, I’m effectively using the interpretation of the model’s built-in probabilities as an interface. This is arguably the main advantage of probabilistic models over non-probabilistic models: they come with a fairly reliable, well-understood built-in interface.
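Something like the following toy sketch with scikit-learn (the corpus and variable names are just for illustration): train a topic model, inspect the topics, then rank documents by the weight of the topic of interest.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the rocket engine burned liquid fuel",
    "the senate passed the budget bill",
    "fuel pumps and turbines in the engine",
    "voters and the budget dominated the debate",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)            # per-document topic weights

# Inspect the words defining each topic, pick the one we care about...
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:]]
    print(f"topic {k}: {top}")

# ...then rank documents by that topic's weight to find relevant ones.
topic_of_interest = 0
ranked = doc_topics[:, topic_of_interest].argsort()[::-1]
print("most relevant docs:", ranked)
```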
Comment
My problem is that this doesn’t seem to scale. I like the idea of visual search, but I also realize you’re essentially bit-rate limited in what you can communicate. For example, I’d about give up if I had to write my reply to you using a topic model. Other places in this thread mention semi-supervised learning. I do agree with the idea of taking a prompt and auto-generating the relevant large prompt that is currently being written by hand.
Comment
Thanks for the link! I’ll partially accept the variations example. That seems to qualify as "show me what you learned". But I’m not sure if that counts as an interface simply because of the lack of interactivity/programmability.
Planned summary for the Alignment Newsletter:

> Currently, many people are trying to figure out how to prompt GPT-3 into doing what they want—in other words, how to align GPT-3 with their desires. GPT-3 may be capable of the task, but that doesn’t mean it will do it (potential example). This suggests that alignment will soon be a bottleneck on our ability to get value from large language models.
>
> Certainly GPT-3 isn’t perfectly capable yet. The author thinks that in the immediate future the major bottleneck will still be its capability, but we have a clear story for how to improve its capabilities: just scale up the model and data even more. Alignment on the other hand is much harder: we don’t know how to <@translate@>(@Alignment as Translation@) the tasks we want into a format that will cause GPT-3 to "try" to accomplish that task.
>
> As a result, in the future we might expect a lot more work to go into prompt design (or whatever becomes the next way to direct language models at specific tasks). In addition, once GPT is better than humans (at least in some domains), alignment in those domains will be particularly difficult, as it is unclear how you would get a system trained to mimic humans <@to do better than humans@>(@The easy goal inference problem is still hard@).

Planned opinion:
Comment
LGTM
I think it’s important to take a step back and notice how AI risk-related arguments are shifting.
In the sequences, a key argument (probably the key argument) for AI risk was the complexity of human value, and how it would be highly anthropomorphic for us to believe that our evolved morality was embedded in the fabric of the universe in a way that any intelligent system would naturally discover. An intelligent system could just as easily maximize paperclips, the argument went.
No one seems to have noticed that GPT actually does a lot to invalidate the original complexity-of-value-means-FAI-is-super-difficult argument.
You write:
We’ve gone from "the alignment problem is about complexity of value" to "the alignment problem is about programming by example" (also known as "supervised learning", or Machine Learning 101).
There’s actually a long history of systems which combine observing-lots-of-data-about-the-world (GPT-3’s training procedure, "unsupervised learning") with programming-by-example ("supervised learning"). The term for this is "semi-supervised learning". When I search for it on Google Scholar, I get almost 100K results. ("Transfer learning" is a related literature.)
The fact that GPT-3’s API only does text completion is, in my view, basically just an API detail that we shouldn’t particularly expect to be true of GPT-4 or GPT-5. There’s no reason why OpenAI couldn’t offer an API which takes in a list of (x, y) pairs and then, given some x, predicts y. I expect if they chose to do this as a dedicated engineering effort, getting into the guts of the system as needed, and collected a lot of user feedback on whether the predicted y was correct for many different problems, they could exceed the performance gains you can currently get by manipulating the prompt.
I’m wary of a world where "the alignment problem" becomes just a way to refer to "whatever the difference is between our current system and the ideal system". (If I trained a supervised learning system to classify word vectors based on whether they’re things that humans like or dislike, and the result didn’t work very well, I can easily imagine some rationalists telling me this represented a failure to "solve the alignment problem"—even if the bottleneck was mainly in the word vectors themselves, as evidenced e.g. by large performance improvements on switching to higher-dimensional word vectors.) I’m reminded of a classic bad argument.
If it’s hard to make a human-mimicking system do things which humans cannot do, why should we expect the capabilities of GPT-style models to cross beyond what humans can do in the first place?
My steelman of what you’re saying:
Over the course of GPT’s training procedure, it incidentally acquires superhuman knowledge, but then that superhuman knowledge gets masked as it sees more data and learns which specific bits of its superhuman knowledge humans are actually ignorant of (and even after catastrophic forgetting, some bits of superhuman knowledge remain at the end of the training run). If that’s the case, it seems like we could mitigate the problem by restricting GPT’s training to textbooks full of knowledge we’re very certain in (or fine-tuning GPT on such textbooks after the main training run, or simply increasing their weight in the loss function). Or replace every phrase like "we don’t yet know X" in GPT’s training data with "X is a topic for a more advanced textbook", so GPT never ends up learning what humans are actually ignorant about.
Or simply use a prompt which starts with the letterhead for a university press release: "Top MIT scientists have made an important discovery related to X today..." Or a prompt which looks like the beginning of a Nature article. Or even: "Google has recently released a super advanced new AI system which is aligned with human values; given X, it says Y." (Boom! I solved the alignment problem! We thought about uploading a human, but uploading an FAI turned out to work much better.)
(Sorry if this comment came across as grumpy. I’m very frustrated that so much upvoting/downvoting on LW seems to be based on what advances AI doom as a bottom line. It’s not because I think superhuman AI is automatically gonna be safe. It’s because I’d rather we did not get distracted by a notion of "the alignment problem" which OpenAI could likely solve with a few months of dedicated work on their API.)
Comment
My main response to this needs a post of its own, so I’m not going to argue it in detail here, but I’ll give a summary and then address some tangential points. Summary: the sense in which human values are "complex" is not about predictive power. A low-level physical model of some humans has everything there is to know about human values embedded within it; it has all the predictive power which can be had by a good model of human values. The hard part is pointing to the thing we consider "human values" embedded within that model. In large part, that’s hard because it’s not just a matter of predictive power. Looking at it as semi-supervised learning: it’s not actually hard for an unsupervised learner to end up with some notion of human values embedded in its world-model, but finding the embedding is hard, and it’s hard in a way which cannot-even-in-principle be solved by supervised learning (because that would reduce it to predictive power). On to tangential points...
Comment
This still sounds like a shift in arguments to me. From what I remember, the MIRI-sphere take on uploads is (was?): "if uploads come before AGI, that’s probably a good thing, as long as it’s a sufficiently high-fidelity upload of a benevolent individual, and the technology is not misused prior to that person being uploaded". (Can probably dig up some sources if you don’t believe me.)
I still don’t buy it. Your argument proves too much—how is it that transfer learning works? Seems that pointing to relevant knowledge embedded in an ML model isn’t super hard in practice.
Is there a fundamental difference? You say: ‘The hard part is pointing to the thing we consider "human values" embedded within that model.’ What is it about pointing to the thing we consider "human values" which makes it fundamentally different from pointing to the thing we consider a dog?
The main possible reason I can think of is because a dog is in some sense a more natural category than human values. That there are a bunch of different things which are kind of like human values, but not quite, and one has to sort through a large number of them in order to pinpoint the right one ("will the REAL human values please stand up?") (I’m not sure I buy this, but it’s the only way I see for your argument to make sense.)
As an example of something which is not a natural category, consider a sandwich. Or, to an even greater degree: "Tasty ice cream flavors" is not a natural category, because everyone has their own ice cream preferences.
Disagree, you could also align it with corrigibility.
A big part of this post is about how people are trying to shoehorn programming by example into text completion, wasn’t it? What do you think a good interface would be?
I think perhaps you’re confusing the collection of ML methods that conventionally fall under the umbrella of "supervised learning", and the philosophical task of predicting (x, y) pairs. As an example, from a philosophical perspective, I could automate away most software development if I could train a supervised learning system where x=the README of a Github project and y=the code for that Github project. But most of the ML methods that come to the mind of an average ML person when I say "supervised learning" are not remotely up to that task.
(BTW note that such a collection of README/code pairs comes pretty close to pinpointing the notion of "do what I mean". Which could be a very useful building block—remember, a premise of the classic AI safety arguments is that "do what I mean" doesn’t exist. Also note that quality code on Github is a heck of a lot more interpretable than the weights in a neural net—and restricting the training set to quality code seems pretty easy to do.)
Glad to hear it. I get so demoralized commenting on LW because it so often feels like a waste of time in retrospect.
Comment
If it’s read moral philosophy, it should have some notion of what the words "human values" mean.
In any case, I still don’t understand what you’re trying to get at. Suppose I pretrain a neural net to differentiate lots of non-marsupial animals. It doesn’t know what a koala looks like, but it has some lower-level "components" which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.
This is actually a tougher scenario than what you’re describing (GPT will have seen human values yet the pretrained net hasn’t seen koalas in my hypothetical), but it’s a boring application of transfer learning.
Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.
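For reference, the kind of transfer learning gestured at here is routine in current ML practice; a minimal sketch using torchvision's pretrained ResNet (the marsupial dataset and class count are hypothetical):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_MARSUPIAL_CLASSES = 5  # hypothetical: koala, kangaroo, wombat, ...

# Start from a network pretrained on lots of non-marsupial images.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the lower-level "components" (edges, textures, fur, ears, ...).
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer and train only it on the (small) marsupial dataset.
model.fc = nn.Linear(model.fc.in_features, NUM_MARSUPIAL_CLASSES)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...followed by a standard training loop over marsupial images and labels.
```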
Comment
Consider how Bayesian updates on a low-level physics model would behave on whatever task you’re considering. What would go wrong?
Next, imagine a more realistic system (e.g. current ML systems) failing in an analogous way. What would that look like?
What’s preventing ML systems from failing in that way already? The answer is probably "they don’t have enough compute to get higher predictive power from a less abstract model"—which means that, if things keep scaling up, sooner or later that failure will happen.
Comment
You say: "we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute". I think this depends on specific details of how the system is engineered.
Comment
- if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
- if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working. In one sense, the goal of all this abstract theorizing is to identify what those other criteria need to be in order to reliably end up using the "right" abstractions in the way we want. We could probably make up some ad-hoc criteria which work at least sometimes, but then as architectures and hardware advance over time we have no idea when those criteria will fail.
Comment
Comment
How would that fix any of the problems we’ve been talking about?
Comment
This is essentially the "tasty ice cream flavors" problem, am I right? Trying to check if we’re on the same page. If so: John Wentworth said
Comment
No, this is not the "tasty ice cream flavors" problem. The problem there is that the concept is inherently relative to a person. That problem could apply to "human values", but that’s a separate issue from what dxu is talking about. The problem is that "what a committee of famous moral philosophers would endorse saying/doing", or human written text containing the phrase "human values", is a proxy for human values, not a direct pointer to the actual concept. And if a system is trained to predict what the committee says, or what the text says, then it will learn the proxy, but that does not imply that it directly uses the concept.
Comment
Well, the moral judgements of a high-fidelity upload of a benevolent human are also a proxy for human values—an inferior proxy, actually. Seems to me you’re letting the perfect be the enemy of the good.
Comment
It doesn’t matter how high-fidelity the upload is or how benevolent the human is, I’m not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. "Don’t let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security. The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism. Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually have. They’d probably be better than many of the worst-case scenarios, but they still wouldn’t be a best or even good scenario. Humans just don’t have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.
Comment
Here are some of the people who have the power to set off nukes right now:
Donald Trump
Vladimir Putin
Kim Jong-un
Both parties in this conflict
And this conflict
Tell that to the Norwegian commandos who successfully sabotaged Hitler’s nuclear weapons program.
"A good plan violently executed now is better than a perfect plan executed at some indefinite time in the future."—George Patton
Just because it’s in your nature (and my nature, and the nature of many people who read this site) to be a cautious nerd, does not mean that the cautious nerd orientation is always the best orientation to have.
In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome. It’s a classic motte-and-bailey:
"It’s very hard to build an AGI which isn’t a paperclipper!"
"Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI..."
"Yeah but we gotta be super perfectionistic because there is so much at stake!"
Your final "humans will misuse AI" worry may be justified, but I think naive deployment of this worry is likely to be counterproductive. Suppose there are two types of people, "cautious" and "incautious". Suppose that the "humans will misuse AI" worry discourages cautious people from developing AGI, but not incautious people. So now we’re in a world where the first AGI is most likely controlled by incautious people, making the "humans will misuse AI" worry even more severe.
If you’re willing to grant the premise of the technical alignment problem being solved, shooting oneself in the foot would appear to be much less of a worry, because you can simply tell your FAI "please don’t let me shoot myself in the foot too badly", and it will prevent you from doing that.
Comment
Comment
- a major retreat from the "default outcome is doom" thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that is 99.9% likely to be safe, which is very much incompatible with "default outcome is doom")
- unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn’t good enough for you)

You’ve picked a position vaguely in between the motte and the bailey and said "the motte and the bailey are both equivalent to this position!" That doesn’t look at all true to me.
Simple is not the same as obvious. Even if someone at some point tried to think of every obvious solution and justifiably discarded them all, there are probably many "obvious" solutions they didn’t think of.
Nothing ever gets counted as evidence against this claim. Simple proposals get rejected on the basis that everyone knows simple proposals won’t work. A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety. Maybe there are good arguments for that, but the problem is that if you’re not careful, your view of reality is gonna get distorted. Which means community wisdom on claims such as "simple solutions never work" is likely to be systematically wrong. "Everyone knows X", without a good written defense of X, or a good answer to "what would change the community’s mind about X", is fertile ground for information cascades etc. And this is on top of standard ideological homophily problems (the AI safety community is a very self-selected subset of the broader AI research world).
Comment
For what it’s worth, my perception of this thread is the opposite of yours: it seems to me John Wentworth’s arguments have been clear, consistent, and easy to follow, whereas you (John Maxwell) have been making very little effort to address his position, instead choosing to repeatedly strawman said position (and also repeatedly attempting to lump in what Wentworth has been saying with what you think other people have said in the past, thereby implicitly asking him to defend whatever you think those other people’s positions were). Whether you’ve been doing this out of a lack of desire to properly engage, an inability to comprehend the argument itself, or some other odd obstacle is in some sense irrelevant to the object-level fact of what has been happening during this conversation. You’ve made your frustration with "AI safety people" more than clear over the course of this conversation (and I did advise you not to engage further if that was the case!), but I submit that in this particular case (at least), the entirety of your frustration can be traced back to your own lack of willingness to put forth interpretive labor. To be clear: I am making this comment in this tone (which I am well aware is unkind) because there are multiple aspects of your behavior in this thread that I find not only logically rude, but ordinarily rude as well. I more or less summarized these aspects in the first paragraph of my comment, but there’s one particularly onerous aspect I want to highlight: over the course of this discussion, you’ve made multiple references to other uninvolved people (either with whom you agree or disagree), without making any effort at all to lay out what those people said or why it’s relevant to the current discussion. There are two examples of this from your latest comment alone:
Comment
I appreciate the defense and agree with a fair bit of this. That said, I’ve actually found the general lack of interpretive labor somewhat helpful in this instance—it’s forcing me to carefully and explicitly explain a lot of things I normally don’t, and John Maxwell has correctly pointed out a lot of seeming-inconsistencies in those explanations. At the very least, it’s helping make a lot of my models more explicit and legible. It’s mentally unpleasant, but a worthwhile exercise to go through.
Comment
I think I want John to feel able to have this kind of conversation when it feels fruitful to him, and not feel obligated to do so otherwise. I expect this is the case, but just wanted to make it common knowledge.
I agree that most of my concern has moved to inner (and, in particular, deceptive) alignment. I still don’t quite see how to get enough outer alignment to trust an AI with the future lightcone, but I am much less worried about it.
To put it another way: what semi-supervised learning and transfer learning have in common is that you find a learning problem you have a lot of data for, such that training a learner for that problem will incidentally cause it to develop generally useful computational structures (often people say "features" but I’m trying to take more of an open-ended philosophical view). Then you re-use those computational structures in a supervised learning context to solve a problem you don’t have a lot of data for. From an AI safety perspective, there are a couple of obvious ways this could fail:

- Training a learner for the problem with lots of data might cause it to develop the wrong computational structures. (Example: GPT-3 learns a meaning of the word "love" which is subtly incorrect.)
- While attempting to re-use the computational structures, you end up pinpointing the wrong one, even though the right one exists. (Example: computational structures for both "Effective Altruism" and "maximize # grandchildren" have been learned correctly, but your provided x/y pairs which are supposed to indicate human values don’t allow for differentiating between the two, and your system arbitrarily chooses "maximize # grandchildren" when what you really wanted was "Effective Altruism".)

I don’t think this post makes a good argument that we should expect the second problem to be more difficult in general. Note that, for example, it’s not too hard to have your system try to figure out where the "Effective Altruism" and "maximize # grandchildren" theories of how (x, y) arose differ, and query you on those specific data points ("active learning" has 62,000 results on Google Scholar).

Incidentally, I’m most worried about non-obvious failure modes; I expect obvious failure modes to get a lot of attention. (As an example of a non-obvious thing that could go wrong, imagine a hypothetical super-advanced AI that queries you on some super enticing scenario where you become global dictator, in order to figure out if the (x, y) pairs it’s trying to predict correspond to a person who outwardly behaves in an altruistic way, but is secretly an egoist who will succumb to temptation if the temptation is sufficiently strong. In my opinion the key problem is to catalogue all the non-obvious ways in which things could fail like this.)
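A toy sketch of that query-where-the-theories-disagree idea (pure illustration; `hypothesis_a` and `hypothesis_b` are stand-ins for two candidate models of the same (x, y) data, not any particular library's active-learning API):

```python
import numpy as np

def pick_queries(hypothesis_a, hypothesis_b, unlabeled_xs, n_queries=5):
    """Return the unlabeled inputs where the two candidate hypotheses
    disagree most; those are the most informative points to ask about."""
    preds_a = np.array([hypothesis_a(x) for x in unlabeled_xs])
    preds_b = np.array([hypothesis_b(x) for x in unlabeled_xs])
    disagreement = np.abs(preds_a - preds_b)
    top = np.argsort(disagreement)[::-1][:n_queries]
    return [unlabeled_xs[i] for i in top]

# Toy usage with two hypothetical scoring functions:
xs = list(range(100))
queries = pick_queries(lambda x: x % 7, lambda x: x % 11, xs)
print("ask the human about:", queries)
```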
Comment
This is almost, but not quite, the division of failure-modes which I see as relevant. If my other response doesn’t clarify sufficiently, let me know and I’ll write more of a response here.
Comment
I’m not claiming GPT-3 understands human values, I’m saying it’s easy to extrapolate from GPT-3 to a future GPT-N system which basically does.
Comment
I’m confused by what you’re saying. The argument for the fragility of value never relied on AI being unable to understand human values. Are you claiming it does? If not, what are you claiming?
Comment
From Superintelligence:
Comment
Comment
Sorry, just to make sure I’m not wasting my time here (feeling grumpy)… You said earlier that "The argument for the fragility of value never relied on AI being unable to understand human values." I gave you a quote from Superintelligence which talked about AI being unable to understand human values. Are you gonna, like, concede the point or something? Because if you’re just throwing out arguments for the AI doom bottom line without worrying too much about whether they’re correct, I’d rather you throw them at someone else!
Anyway, I read Gwern’s article a while ago and I thought it was pretty bad. If I recall correctly, Gwern confuses various different notions, for example, he seemed to think that if you replace enough bits of handcrafted software with bits trained using machine learning, an agent will spontaneously emerge. The steelman seems to be something like "there will be competitive pressures to misuse a tool in agentlike ways". I agree this is a risk, and I hope OpenAI keeps future versions of GPT to themselves.
It’s looking more plausible that very capable Tool AIs:

- Are possible
- Are easier to build than Agent AIs
- Will be able to solve the value-loading problem

(IIRC nothing in Gwern’s article addresses any of these 3 points?)
Comment
On "conceding the point":
META: I can understand that you’re frustrated about this topic, especially if it seems to you that the "MIRI-sphere" (as you called it in a different comment) is persistently refusing to acknowledge something that appears obvious to you. Obviously, I don’t agree with that characterization, but in general I don’t want to engage in a discussion that one side is finding increasingly unpleasant, especially since that often causes the discussion to rapidly deteriorate in quality after a few replies. As such, I want to explicitly and openly relieve you of any social obligation you may have felt to reply to this comment. If you feel that your time would be better spent elsewhere, please do!
Comment
If you can solve the prediction task, you can probably use the solution to create a reward function for your reinforcement learner.
The best angle of attack here I think, is synthesising knowledge from multiple domains. I was able to get GPT-3 to write and then translate a Japanese poem about a (fictional) ancient language model into Chinese, Hungarian, and Swahili and annotate all of its translations with stylistic notes and historical references. I don’t think any humans have the knowledge required to do that, but unsurprisingly GPT-3 does, and performed better when I used the premise of multiple humans collaborating. It’s said that getting different university departments to collaborate tends to be very productive wrt new papers being published. The only bottleneck is whether its dataset includes scientific publications and the extent to which it can draw upon memorised knowledge (parameter count).
Comment
Awesome example!
As a research tool, I suspect that GPT will be most impactful for probing under-explored interdisciplinary idea spaces. Interdisciplinary research matches its skillset far better than more depth-based research as the nature of its design is more suited for interpolation as opposed to extrapolation.
GPT could help circumvent the increasing problem of knowledge specialization. Back in the time of Newton, it was possible for a studious genius to be at the forefront of every field. This is why the Renaissance was the golden age of, well, Renaissance men. But by the early twentieth century, even the greatest academic minds were only truly cutting-edge in one field with the possible exception of Von Neumann. Now in 2020, there probably isn’t a single person who has complete mastery over even a single subject. There’s just not enough time to learn it all. While Galois could invent group theory in his teens, now even the most promising of intellectuals don’t start making major contributions until their early 30s.
I envision the final form of GPT as the ultimate polymath. What it lacks in explicit reasoning, it makes up for in breadth of knowledge and raw pattern-matching ability. I envision future prompts would combine research papers from disparate areas to see if their intersection could create something of interest. This is a long way off, of course. But the prospect is enticing.
Curated. Simple, crucially important point, I’m really glad you wrote it up.
There are infinitely many distributions from which GPT’s training data could have been sampled [EDIT: including ones that would be catastrophic if they were the distribution our AGI learns], so it’s worth mentioning an additional challenge on this route: making the future AGI-level-GPT learn the "human writing distribution" that we have in mind.
(Not related to the overall point of your paper) I’m not so sure that GPT-3 "has the internal model to do addition," depending on what you mean by that — nostalgebraist doesn’t seem to think so in this post, and a priori this seems like a surprising thing for a feedforward neural network to do.
Comment
I’m pretty sure it can’t do long addition—I played around with that specifically—but it does single- or double-digit addition well enough that it at least has some idea of what we’re gesturing at.