Contents
- How AlphaGo Zero works
- Iterated capability amplification
- The significance

AlphaGo Zero is an impressive demonstration of AI capabilities. It also happens to be a nice proof-of-concept of a promising alignment strategy.
How AlphaGo Zero works
AlphaGo Zero learns two functions (which take as input the current board):

- A prior over moves p, which is trained to predict what AlphaGo will eventually decide to do.
- A value function v, which is trained to predict which player will win (if AlphaGo plays both sides).

Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks its moves by using 1600 steps of Monte Carlo tree search (MCTS), using p and v to guide the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target.
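A minimal Python sketch of this loop may make the structure clearer. All of the helper names here (initial_board, game_over, mcts_search, sample_move, play_move, game_winner, cross_entropy, and the network interface) are illustrative assumptions, not the actual AlphaGo Zero implementation:

```python
# Illustrative sketch only: every helper below is a hypothetical stand-in.

def self_play_game(net, num_simulations=1600):
    """Play one self-play game, recording MCTS visit distributions as targets."""
    records = []
    board = initial_board()
    while not game_over(board):
        # MCTS runs `num_simulations` simulated continuations, using the
        # network's prior p to guide exploration and its value v to evaluate
        # positions, and returns an improved move distribution pi.
        pi = mcts_search(board, net, num_simulations)
        records.append((board, pi))
        board = play_move(board, sample_move(pi))
    z = game_winner(board)  # +1 or -1; in the real system z is signed from the
                            # perspective of the player to move at each position
    return [(b, pi, z) for (b, pi) in records]

def train_step(net, examples):
    """Supervised learning: p is trained toward the MCTS output pi,
    and v toward the actual game outcome z."""
    for board, pi, z in examples:
        p, v = net(board)
        loss = cross_entropy(p, pi) + (v - z) ** 2
        net.update(loss)
```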
Iterated capability amplification
In the simplest form of iterated capability amplification, we train one function:
- A "weak" policy A, which is trained to predict what the agent will eventually decide to do in a given situation. Just like AlphaGo doesn’t use the prior p directly to pick moves, we don’t use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target. In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.) Outside of Go, A might be a question-answering system, which can be applied several times in order to first break a question down into pieces and then separately answer each component. Or it might be a policy that updates a cognitive workspace, which can be applied many times in order to "think longer" about an issue.
The significance
Reinforcement learners take a reward function and optimize it; unfortunately, it’s not clear where to get a reward function that faithfully tracks what we care about. That’s a key source of safety concerns.

By contrast, AlphaGo Zero takes a policy-improvement operator (like MCTS) and converges towards a fixed point of that operator. If we can find a way to improve a policy while preserving its alignment, then we can apply the same algorithm in order to get very powerful but aligned strategies.

Using MCTS to achieve a simple goal in the real world wouldn’t preserve alignment, so it doesn’t fit the bill. But "think longer" might. As long as we start with a policy that is close enough to being aligned — a policy that "wants" to be aligned, in some sense — allowing it to think longer may make it both smarter and more aligned.

I think designing alignment-preserving policy amplification is a tractable problem today, which can be studied either in the context of existing ML or human coordination. So I think it’s an exciting direction in AI alignment. A candidate solution could be incorporated directly into the AlphaGo Zero architecture, so we can already get empirical feedback on what works. If by good fortune powerful AI systems look like AlphaGo Zero, then that might get us much of the way to an aligned AI.

*This was originally posted here on 19th October 2017.*

Tomorrow’s AI Alignment Forum sequences will continue with a pair of posts, ‘What is narrow value learning’ by Rohin Shah and ‘Ambitious vs. narrow value learning’ by Paul Christiano, from the sequence on Value Learning. The next post in this sequence will be ‘Directions for AI Alignment’ by Paul Christiano on Thursday.
Comment
I think Techniques for optimizing worst-case performance may be what you’re looking for.
Comment
Thank you. I see how the directions proposed there (adversarial training, verification, transparency) can be useful for creating aligned systems. But if we use a Distill step that can be trusted to be safe via one or more of those approaches, I find it implausible that Amplification would yield systems that are competitive relative to the most powerful ones created by other actors around the same time (i.e. actors that create AI systems without any safety-motivated restrictions on the model space and search algorithm).
Comment
Paul’s position in that post was:
I think this is meant to include the difficulty of making them competitive with unaligned ML, since that has been his stated goal. If you can argue that we should be even more pessimistic than this, I’m sure a lot of people would find that interesting.
Comment
In this 2017 post about Amplification (linked from OP) Paul wrote: "I think there is a very good chance, perhaps as high as 50%, that this basic strategy can eventually be used to train benign state-of-the-art model-free RL agents." The post you linked to is more recent, so either the quote in your comment reflects an update or Paul has other insights/estimates about safe Distill steps. BTW, I think Amplification might currently be the most promising approach for creating aligned and powerful systems; what I argue is that in order to save the world it will probably need to be complemented with governance solutions.
Comment
How uncompetitive do you think aligned IDA agents will be relative to unaligned agents, and what kinds of governance solutions do you think that would call for? Also, I should have made this clearer last time, but I’d be interested to hear more about why you think Distill probably can’t be made both safe and competitive, regardless of whether you’re more or less optimistic than Paul.
Comment
Generally, I don’t see why we should expect that the most capable systems that can be created with supervised learning (e.g. by using RL to search over an arbitrary space of NN architectures) would perform similarly to the most capable systems that can be created, at around the same time, using some restricted supervised learning that humans must trust to be safe. My prior is that the former is very likely to outperform by a lot, and I’m not aware of strong evidence pointing one way or another. So for example, I expect that an aligned IDA agent will be outperformed by an agent that was created by that same IDA framework when replacing the most capable safe supervised learning in the Distill steps with the most capable unrestricted supervised learning available at around the same time.
Comment
This seems similar to my view, which is that if you try to optimize for just one thing (efficiency) you’re probably going to end up with more of that thing than if you try to optimize for two things at the same time (efficiency and safety) or if you try to optimize for that thing under a heavy constraint (i.e., safety).
But there are people (like Paul) who seem to be more optimistic than this based on more detailed inside-view intuitions, which makes me wonder if I should defer to them. If the answer is no, there’s also the question of how do we make policy makers take this problem seriously (i.e., that safe AI probably won’t be as efficient as unsafe AI) given the existence of more optimistic AI safety researchers, so that they’d be willing to undertake costly preparations for governance solutions ahead of time. By the time we get conclusive evidence one way or another, it may be too late to make such preparations.
Comment
I’m not sure what you’d consider "extremely" optimistic, but I gathered some quantitative estimates of AI risk here, and they all seem overly optimistic to me. Did you see that?
I agree with this motivation to do early work, but in a world where we do need drastic policy responses, I think it’s pretty likely that the early work won’t actually produce conclusive enough results to show that. For example, if a safety approach fails to make much progress, there’s not really a good way to tell if it’s because safe and competitive AI really is just too hard (and therefore we need a drastic policy response), or because the approach is wrong, or the people working on it aren’t smart enough, or they’re trying to do the work too early. People who are inclined to be optimistic will probably remain so until it’s too late.
Comment
> but I gathered some quantitative estimates of AI risk here, and they all seem overly optimistic to me. Did you see that?

I only now read that thread. I think it is extremely worthwhile to gather such estimates.

I think all three estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on "no governance interventions"). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic. Maybe we should instead gather researchers’ credences for predictions like: "If there will be no governance interventions, competitive aligned AIs will exist 10 years from now."

I suspect that gathering such estimates from publicly available information might expose us to a selection bias, because very pessimistic estimates might be outside the Overton window (even for the EA/AIS crowd). For example, if Robert Wiblin had concluded that an AI existential catastrophe is 50% likely, I’m not sure that the 80,000 Hours website (which targets a large and motivationally diverse audience) would have published that estimate.
Comment
Upvoted for giving this number, but what does it mean exactly? You expect "50% fine" through all kinds of x-risk, assuming no coordination from now until the end of the universe? Or just assuming no coordination until AGI? Is it just AI risk instead of all x-risk, or just risk from narrow AI alignment? If "AI risk", are you including risks from AI exacerbating human safety problems, or AI differentially accelerating dangerous technologies? Is it 50% probability that humanity survives (which might be "fine" to some people) or 50% that we end up with a nearly optimal universe? Do you have a document that gives all of your quantitative risk estimates with clear explanations of what they mean?
(Sorry to put you on the spot here when I haven’t produced anything like that myself, but I just want to convey how confusing all this is.)
MCTS works as amplification because you can evaluate future board positions to get a convergent estimate of how well you’re doing—and then eventually someone actually wins the game, which keeps p from departing reality entirely. Importantly, the single thing you’re learning can play the role of the environment, too, by picking the opponents’ moves.
In trying to train A to predict human actions given access to A, you’re almost doing something similar. You have a prediction that’s also supposed to be a prediction of the environment (the human), so you can use it for both sides of a tree search. But A isn’t actually searching through an interesting tree—it’s searching for cycles of length 1 in its own model of the environment, with no particular guarantee that any cycles of length 1 exist or are a good idea. "Tree search" in this context (I think) means spraying out a bunch of outputs and hoping at least one falls into a fixed point upon iteration.
EDIT: Big oops, I didn’t actually understand what was being talked about here.
Comment
I agree there is a real sense in which AGZ is "better-grounded" (and more likely to be stable) than iterated amplification in general. (This was some of the motivation for the experiments here.)
Comment
Oh, I’ve just realized that the "tree" was always intended to be something like task decomposition. Sorry about that—that makes the analogy a lot tighter.
Isn’t A also grounded in reality by eventually being given no A to consult with?
Comment
This is true when getting training data, but I think it’s a difference between A (or HCH) and AlphaGo Zero when doing simulation / amplification. Someone wins a simulated game of Go even if both players are making bad moves (or even random moves), which gives you a signal that A doesn’t have access to.
I don’t suppose you could explain how it uses P and V? Does it use P to decide which path to go down and V to avoid fully playing it out?
> In the simplest form of iterated capability amplification, we train one function:
>
> A "weak" policy A, which is trained to predict what the agent will eventually decide to do in a given situation.
>
> Just like AlphaGo doesn’t use the prior p directly to pick moves, we don’t use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target.

This is totally wild speculation, but the thought occurred to me whether the human brain might be doing something like this with identities and social roles:

> A lot of (but not all) people get a strong hit of this when they go back to visit their family. If you move away and then make new friends and sort of become a new person (!), you might at first think this is just who you are now. But then you visit your parents… and suddenly you feel and act a lot like you did before you moved away. You might even try to hold onto this "new you" with them… and they might respond to what they see as strange behavior by trying to nudge you into acting "normal": ignoring surprising things you say, changing the topic to something familiar, starting an old fight, etc.
>
> [...]
>
> For instance, the stereotypical story of the worried nagging wife confronting the emotionally distant husband as he comes home really late from work… is actually a pretty good caricature of a script that lots of couples play out, as long as you know to ignore the gender and class assumptions embedded in it.
>
> But it’s hard to sort this out without just enacting our scripts. The version of you that would be thinking about it is your character, which (in this framework) can accurately understand its own role only if it has enough slack to become genre-savvy within the web; otherwise it just keeps playing out its role. In the husband/wife script mentioned above, there’s a tendency for the "wife" to get excited when "she" learns about the relationship script, because it looks to "her" like it suggests how to save the relationship — which is "her" enacting "her" role. This often aggravates the fears of the "husband", causing "him" to pull away and act dismissive of the script’s relevance (which is "his" role), driving "her" to insist that they just need to talk about this… which is the same pattern they were in before. They try to become genre-savvy, but there (usually) just isn’t enough slack between them, so the effort merely changes the topic while they play out their usual scene.

If you squint, you could kind of interpret this dynamic as a result of the human brain trying to predict what it expects itself to do next, using that prediction to guide the search for next actions, and then ending up with next actions that have a strong structural resemblance to its previous ones. (Though I can also think of maybe better-fitting models of this; still, it seemed worth throwing out.)
How do you know MCTS doesn’t preserve alignment?
Comment
As I understand it, MCTS is used to maximize a given computable utility function, and so it is not alignment-preserving, in the general sense that sufficiently strong optimization of an imperfect utility function does not preserve alignment.