Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview, he discusses results from a 2018 paper describing conditions under which reinforcement learning algorithms will spontaneously give rise to separate, full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper:
Initial Observation
At some point, researchers in Botvinick’s group at DeepMind noticed that when they trained an RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started "with just one learning algorithm, and then another learning algorithm kind of… emerges, out of, like out of thin air":
"What happens… it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm."
Other versions of this basic architecture—e.g., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed "meta-RL." So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1) have memory, 2) have its weights trained by an RL algorithm, and 3) be trained on a sequence of similar input data.
From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms:
"...it’s something that just happens. In a sense, you can’t avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can’t stop it."
Search for Biological Analogue
This system reminded some of the neuroscientists in Botvinick’s group of features observed in brains. For example, like RNNs, the human prefrontal cortex (PFC) is highly recurrent, and the RL and RNN memory systems in their meta-RL model reminded them of "synaptic memory" and "activity-based memory." They decided to look for evidence of meta-RL occurring in brains, since finding a neural analogue of the technique would provide some evidence they were on the right track, i.e. that the technique might scale to solving highly complex tasks.
They think they found one. In short, they think that part of the dopamine system (DA) is a full-fledged reinforcement learning algorithm, which trains/gives rise to another full-fledged, free-standing reinforcement learning algorithm in PFC, in basically the same way (and for the same reason) the RL-trained RNNs spawned separate learning algorithms in their experiments.
As I understand it, their story goes as follows:
The PFC, along with the bits of basal ganglia and thalamic nuclei it connects to, forms an RNN. Its inputs are sensory percepts, and information about past actions and rewards. Its outputs are actions, and estimates of state value.
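To make the architecture concrete, here is a minimal sketch of the analogous ML setup as I understand it (illustrative PyTorch-style code of my own, not taken from Wang et al.; all names and sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaRLAgent(nn.Module):
    """Recurrent actor-critic: observation + previous action + previous reward in,
    action logits + state-value estimate out. Illustrative only."""
    def __init__(self, obs_dim, n_actions, hidden=48):
        super().__init__()
        # Input = observation, one-hot previous action, previous scalar reward.
        self.core = nn.LSTMCell(obs_dim + n_actions + 1, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)
        self.n_actions = n_actions

    def forward(self, obs, prev_action, prev_reward, state):
        prev_a = F.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, prev_a, prev_reward.unsqueeze(-1)], dim=-1)
        h, c = self.core(x, state)  # "activity-based memory" lives in (h, c)
        return self.policy_head(h), self.value_head(h), (h, c)
```

The outer ("slow") RL algorithm only ever adjusts the weights, across many tasks; within a single task, any fast adaptation has to be carried by the recurrent state (h, c), and that hidden-state dynamics is what gets identified as the second, emergent learning algorithm.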
DA[1] is an RL algorithm that feeds reward prediction error to PFC. Historically, people assumed the purpose of sending this prediction error was to update PFC’s synaptic weights. Wang et al. agree that this happens, but argue that the principal purpose of sending prediction error is to cause the creation of "a second RL algorithm, implemented entirely in the prefrontal network’s activation dynamics." That is, they think DA mostly stores its model in synaptic memory, while PFC mostly stores it in activity-based memory (i.e. directly in the dopamine distributions).[2]
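For reference, the "reward prediction error" here is the standard temporal-difference error from textbook RL (my gloss, not a quote from the paper):

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```

where r_t is the reward just received, γ is the discount factor, and V is the current value estimate; a positive δ_t means things went better than expected.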
What’s the case for this story? They cite a variety of neuroscience findings as evidence for parts of this hypothesis, many of which involve doing horrible things to monkeys, and some of which they simulate using their meta-RL model to demonstrate that it gives similar results. These points stood out most to me:
Does RL occur in the PFC?
Some scientists implanted neuroimaging devices in the PFCs of monkeys, then sat the monkeys in front of two screens with changing images, and rewarded them with juice when they stared at whichever screen was displaying a particular image. The probabilities of each image leading to juice-delivery periodically changed, causing the monkeys to update their policies. Neurons in their PFCs appeared to exhibit RL-like computation—that is, to use information about the monkey’s past choices (and associated rewards) to calculate the expected value of actions, objects and states.
Wang et al. simulated this task using their meta-RL system. They trained an RNN on the changing-images task using RL; when run, it apparently demonstrated performance similar to the monkeys’, and when they inspected it they found units that similarly seemed to encode EV estimates based on prior experience, continually adjust the action policy, etc.
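Here is a toy version of the task structure as I understand it (my own sketch with made-up parameters, not the authors’ code). Each training episode is a fresh bandit whose payoff probabilities occasionally reverse, so the only way to keep earning reward is to keep re-estimating which option is currently good:

```python
import random

class SwitchingBandit:
    """Two-option task whose payoff probabilities occasionally swap.
    Illustrative stand-in for the changing-images task."""
    def __init__(self, p_good=0.8, p_bad=0.2, switch_prob=0.05):
        self.probs = [p_good, p_bad]
        self.switch_prob = switch_prob

    def step(self, action):
        if random.random() < self.switch_prob:  # contingencies reverse
            self.probs.reverse()
        return 1.0 if random.random() < self.probs[action] else 0.0

# The "slow" RL algorithm updates the agent's weights across many such
# bandits; within any one bandit, adaptation must come from the RNN state.
```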
Interestingly, the system continued to improve its performance even once its weights were fixed, which they take to imply that the learning which led to improved performance could only have occurred within the activation patterns of the recurrent network.[3]
Can the two RL algorithms diverge?
When humans perform two-armed bandit tasks where payoff probabilities oscillate between stable and volatile, they increase their learning rate during volatile periods, and decrease it during stable periods. Wang et al. ran their meta-RL system on the same task, and it varied its learning rate in ways that mimicked human performance. This learning again occurred after weights were fixed, and notably, between the end of training and the end of the task, the learning rates of the two algorithms had diverged dramatically.
Implications
The account detailed by Botvinick and Wang et al. strikes me as a relatively clear example of mesa-optimization, and I interpret it as tentative evidence that the attractor toward mesa-optimization is strong. [Edit: Note that some commenters, like Rohin Shah and Evan Hubinger, disagree].
These researchers did not set out to train RNNs in such a way that they would turn into reinforcement learners. It just happened. And the researchers seem to think this phenomenon will occur spontaneously whenever "a very general set of conditions" is met, like the system having memory, being trained via RL, and receiving a related sequence of inputs. Meta-RL, in their view, is just "an emergent effect that results when the three premises are concurrently satisfied… these conditions, when they co-occur, are sufficient to produce a form of ‘meta-learning’, whereby one learning algorithm gives rise to a second, more efficient learning algorithm."
So on the whole I felt alarmed reading this. That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures. Maybe this is just because of some non-scalable hack, like that the systems involved aren’t very powerful optimizers.[5] But I wouldn’t be surprised if coming to better understand the biological mechanisms involved led to safety-relevant insights.
Thanks to Rafe Kennedy for helpful comments and feedback.
-
The authors hypothesize that DA is a model-free RL algorithm, and that the spinoff (mesa?) RL algorithm it creates within PFC is model-based, since that’s what happens in their ML model. But they don’t cite biological evidence for this. ↩︎
-
Depending on what portion of memories are encoded in this way, it may make sense for cryonics standby teams to attempt to reduce the supraphysiological intracellular release of dopamine that occurs after cardiac arrest, e.g. by administering D1-receptor antagonists. Otherwise entropy increases in PFC dopamine distributions may result in information loss. ↩︎
-
They demonstrated this phenomenon (continued learning after weights were fixed) in a variety of other contexts, too. For example, they cite an experiment in which manipulating DA activity was shown to directly manipulate monkeys’ reward estimations, independent of actual reward—i.e., when their DA activity was blocked/stimulated while they pressed a lever, they exhibited reduced/increased preference for that lever, even if pressing it did/didn’t give them food. They trained their meta-RL system to simulate this, again observed performance similar to the monkeys’, and again noticed that it continued learning even after the weights were fixed. ↩︎
-
The authors seem unsure whether meta-RL also occurs in other brain regions, since for it to occur you need A) inputs carrying information about recent actions/rewards, and B) network dynamics (like recurrence) that support continual activation. Maybe only PFC has this confluence of features. Personally, I doubt it; I would bet that meta-RL (and other sorts of mesa-optimization) occur in a wide variety of brain systems, but it would take more time than I want to allocate here to justify that intuition. ↩︎
-
Although note that neuroscientists do commonly describe the PFC as disproportionately responsible for the sort of human behavior one might reasonably wish to describe as "optimization." For example, the neuroscience textbook recommended on lukeprog’s textbook recommendation post describes PFC as "often assumed to be involved in those characteristics that distinguish us from other animals, such as self-awareness and the capacity for complex planning and problem solving." ↩︎
What is all of humanity if not a walking catastrophic inner alignment failure? We were optimized for one thing: inclusive genetic fitness. And only a tiny fraction of humanity could correctly define what that is!
Comment
I mean, it could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don’t regularly experience catastrophic inner alignment failures internally. In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
Comment
Comment
That’s a really interesting point, and I hadn’t considered it. Thanks!
What would inner alignment failures even look like? Overdosing on meth sure makes the dopamine system happy. Perhaps human values reside in the prefrontal cortex, and all of humanity is a catastrophic alignment failure of the dopamine system (except a small minority of drug addicts) on top of being a catastrophic alignment failure of natural selection.
Isn’t evolution a better analogy for deep learning anyway? All natural selection does is gradient descent (hill climbing technically), with no capacity for lookahead. And we’ve known this one for 150 years!
Comment
(EDIT: I’m already seeing downvotes of the post, it was originally at 58 AF karma. This wasn’t my intention: I think this is a failure of the community as a whole, not of the author.)

Okay, this has gotten enough karma and has been curated and has influenced another post, so I suppose I should engage, especially since I’m not planning to put this in the Alignment Newsletter. (A lot copied over from this comment of mine)

This is extremely basic RL theory. The linked paper studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn’t know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do. Basic POMDP theory will tell you that when you have partial observability your policy needs to depend on history, i.e. it needs to learn.

However, because bandit problems have been studied in the AI literature, and "learning algorithms" have been proposed to solve bandit problems, this very normal fact of a policy depending on observation history is now trotted out as "learning algorithms spontaneously emerge". I don’t understand why this was surprising to the original researchers; it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction. Perhaps it’s because it’s primarily a neuroscience paper, and they weren’t very familiar with AI.

More broadly, I don’t understand what people are talking about when they speak of the "likelihood" of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance that a policy trained by RL will "learn" without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL—presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must "learn" as it is being executed.

Gwern notes here that we’ve seen this elsewhere. This is because it’s exactly what you’d expect, just that in the other cases we call conditioning on observations "adaptation" rather than "learning".
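(To make the "policy depends on history" point concrete, here is a toy sketch added for illustration, not from the comment: a completely fixed policy for a bandit, with no weight updates at all, whose behavior nonetheless "improves" within an episode simply because it conditions on its observation history.)

```python
def fixed_bandit_policy(history, n_arms, explore_steps=10):
    """A fixed function of observation history: pull each arm a few times,
    then commit to the empirically best arm. Nothing in this function ever
    changes, yet from the outside it looks like learning. Illustrative only."""
    if len(history) < explore_steps:
        return len(history) % n_arms  # round-robin exploration
    totals = [0.0] * n_arms
    counts = [1e-9] * n_arms
    for arm, reward in history:  # history = list of (arm, reward)
        totals[arm] += reward
        counts[arm] += 1
    means = [t / c for t, c in zip(totals, counts)]
    return max(range(n_arms), key=lambda a: means[a])  # exploit best arm so far
```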
Meta: I’m disappointed that I had to be the one to point this out. (Though to be fair, Gwern clearly understands this point.) There’s clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn’t been said. When I saw this post first come up, my immediate reaction was "oh I’m sure this is a typical LW example of a case where the optimal policy is interpreted as learning, I’m not even going to bother clicking on the link". Do we really have so few people who understand machine learning, that of the many, many views this post must have had, not one person could figure this out? It’s really no surprise that ML researchers ignore us if this is the level of ML understanding we as a community have.

EDIT: I should give credit to Nevan for pointing out that this paper is not much evidence in favor of the hypothesis that the neural network weights encode some search algorithm (before I wrote this comment).
Comment
It seems worth distinguishing two cases: ‘Task location’, where they know what to do in a wide range of environments, and all they’re learning is which environment they’re in. The multi-armed bandit is definitely in this case; GPT-3 seems like it’s mostly doing this.
‘Task learning’, where they are running some sort of online learning process that gives them ‘new capabilities’ as they encounter new bits of the world.

The two blur into each other; you can imagine training a model to deal with a range of situations, and yet it also performs well on situations not seen in training (that are interpolations between situations it has seen, or where the old abstractions apply correctly, and thus aren’t "entirely new" situations). Just like some people argue that anything we know how to do isn’t "artificial intelligence", you might get into a situation where anything we know how to do is task ‘location’ instead of task ‘learning.’

But to the extent that our safety guarantees rely on the lack of capability in an AI system, any ability for the AI system to do learning instead of location means that it may gain capabilities we didn’t expect it to have. That said, merely restricting it to ‘location’ may not help us very much, because if we misunderstand the abstractions that govern the system’s generalizability, we may underestimate what capabilities it will or won’t have.
Comment
> I note that this doesn’t feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low! Even among ML experts, many of them haven’t touched much RL, because they’re focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?

I’m perfectly happy with good explanations that don’t assume background knowledge. The flaw I am pointing to has nothing to do with explanations. It is that despite this evidence being a clear consequence of basic RL theory, for some reason readers are treating it as important evidence. Clearly I should update negatively on things-AF-considers-important. At a more gears level, presumably I should update towards some combination of:
AF readers don’t know RL.
AF readers upvote anything that’s cheering for their team.
AF readers automatically believe anything written in a post without checking it.

Any of these would be a pretty damning critique of the forum. And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
Comment
I guess I should explain why I upvoted this post despite agreeing with you that it’s not new evidence in favor of mesa-optimization. I actually had a conversation about this post with Adam Shimi prior to you commenting on it where I explained to him that I thought that not only was none of it new but also that it wasn’t evidence about the internal structure of models and therefore wasn’t really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not comment my thoughts on it. Some reasons why I did that:
I generally upvote most attempts on LW/AF to engage with the academic literature—I think that LW/AF would generally benefit from engaging with academia more and I like to do what I can to encourage that when I see it.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past. In fact, in "Risks from Learned Optimization" itself, we talk about both a) why we chose to be agnostic about whether current systems exhibit mesa-optimization due to the difficulty of determining whether a system is actually implementing search or not (link) and b) examples of current work that we thought did seem to come closest to being evidence of mesa-optimization such as RL^2 (and I think RL^2 is a better example than the work linked here) (link).
Comment
(Flagging that I curated the post, but was mostly relying on Ben and Habryka’s judgment, in part since I didn’t see much disagreement. Since this discussion I’ve become more agnostic about how important this post is) One thing this comment makes me want is more nuanced reacts that people have affordance to communicate how they feel about a post, in a way that’s easier to aggregate. Though I also notice that with this particular post it’s a bit unclear what react would be appropriate, since it sounds like it’s not "disagree" so much as "this post seems confused" or something.
Comment
Comment
Unfortunately, I also only have so much time, and I don’t generally think that repeating myself regularly in AF/LW comments is a super great use of it.
Comment
Very fair.
The solution is clear: someone needs to create an Evan bot that will comment on every post of the AF related to mesa-optimization, by providing the right pointers to the paper.
Fair enough, those are sensible reasons. I don’t like the fact that the incentive gradient points away from making intellectual progress, but it’s not an obvious choice.
Comment
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal. That said, I feel confused by a number of your arguments, so I’m working on a reply. Before I post it, I’d be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds. I currently understand you to be making four main claims:
The system is just doing the totally normal thing "conditioning on observations," rather than something it makes sense to describe as "giving rise to a separate learning algorithm."
It is probably not the case that in this system, "learning is implemented in neural activation changes rather than neural weight changes."
The system does not encode a search algorithm, so it provides "~zero evidence" about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
The above facts should be obvious to people familiar with ML.

Does this summary feel like it reasonably characterizes your objection?
Comment
I think my response to Vaniver better illustrates my concerns, but let me take a stab at making a simple list of claims.
Comment
I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just "weren’t very familiar with AI." Looking at the author list, and at their publications (1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. While most of the eight co-authors are neuroscientists by training, three have CS degrees (one of whom is Demis Hassabis), and all but one have co-authored previous ML papers. It’s hard for me to imagine their surprise was due simply to them lacking basic knowledge about RL?

And this OpenAI paper (whose authors I think you would describe as familiar with ML), which the summary of Wang et al. on the DeepMind website describes as "closely related work," and which appears to me to describe a very similar setup, describes their result in similar terms:

> We structure the agent as a recurrent neural network, which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations. Furthermore, its internal state is preserved across episodes, so that it has the capacity to perform learning in its own hidden activations. The learned agent thus also acts as the learning algorithm, and can adapt to the task at hand when deployed.

The OpenAI authors also seem to me to think they can gather evidence about the structure of the algorithm simply by looking at its behavior. Given a similar series of experiments (mostly bandit tasks, but also a maze solver), they conclude:

> the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights… the procedure the recurrent network implements is itself a full-fledged reinforcement learning algorithm, which negotiates the exploration-exploitation tradeoff and improves the agent’s policy based on reward outcomes… this learned RL procedure can differ starkly from the algorithm used to train the network’s weights.

They then run an experiment designed specifically to distinguish whether meta-RL was giving rise to a model-free system, or "a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning," and suggest the evidence implies the latter. This sounds like a description of search to me—do you think I’m confused?

I get the impression from your comments that you think it’s naive to describe this result as "learning algorithms spontaneously emerge." You describe the lack of LW/AF pushback against that description as "a community-wide failure," and mention updating as a result toward thinking AF members "automatically believe anything written in a post without checking it." But my impression is that OpenAI describes their similar result in basically the same way. Do you think my impression is wrong? Or e.g. that their description is also misleading?
I’ve been feeling very confused lately about how people talk about "search," and have started joking that I’m a search panpsychist. Lots of interesting phenomena look like piles of thermostats when viewed from the wrong angle, and I worry the conventional lens is deceptively narrow. That said, when I condition on (what I understand to be) the conventional understanding, it’s difficult for me to imagine how e.g. the maze-solver described in the OpenAI paper reliably and quickly locates the exit to new mazes, without doing something reasonably describable as searching for them.

And it seems to me that Wang et al. should be taken as evidence that "learning algorithms producing other search-performing learning algorithms" is convergently useful/likely to be a common feature of future systems, even if you don’t think that’s what happened in their paper, assuming you assign some credence to their hypothesis that this is what’s going on in PFC, and to the hypothesis that search occurs in PFC. If the primary difference between the DeepMind and OpenAI meta-RL architecture and the PFC/DA architecture is scale, then I think there’s reasonable reason to suspect that something much like mesa-optimization will emerge in future meta-RL systems, even if it hasn’t yet. That is, I interpret this result as evidence for the hypothesis that highly competent general-ish learners might tend to exhibit this feature, since (among other reasons) it increased my credence that it is already exhibited by the only existing member of that reference class.

Upthread, Evan mentions agreeing that this result is "not new evidence in favor of mesa-optimization." But he also mentions that Risks from Learned Optimization references these two papers, describing them as "the closest to producing mesa-optimizers of any existing machine learning research." I feel confused about how to reconcile these two claims. I didn’t realize these papers were mentioned in Risks from Learned Optimization, but if I had, I think I would have been even more inclined to post this/try to ensure people knew about the results, since my (perhaps naive, perhaps not understanding ways this is disanalogous) prior is that the closest existing example to this problem might provide evidence about its nature or likelihood.
Comment
I imagine this was not your intention, but I’m a little worried that this comment will have an undesirable chilling effect. I think it’s good for people to share when members of DeepMind / OpenAI say something that sounds a lot like "we found evidence of mesaoptimization". I also think you’re right that we should be doing a lot better on pushing back against such claims. I hope LW/AF gets better at being as skeptical of AI researchers assertions that support risk as they are of those that undermine risk. But I also hope that when those researchers claim something surprising and (to us) plausibly risky is going on, we continue to hear about it.
Comment
What would be a good resource to level up on RL theory? Is the Sutton and Barto good enough, or do you have something else in mind?
Comment
Hmm, I don’t know unfortunately. I learned basic MDP theory from an undergrad course, and the rest through osmosis by being an AI PhD student at Berkeley. I haven’t read Sutton and Barto, but I would assume that would be good enough (you’d probably know more than me about tabular RL).
Comment
If you don’t have a resource, then do you have a list of pointers to what people should learn? For example the policy gradient theorem and the REINFORCE trick. It will probably not be exhaustive, I’m just trying to make your call to learn more RL theory more actionable to people here.
Comment
I don’t think the takeaway here should be "read these books / watch these lectures / understand these concepts and you’ll be fine". My claim is more like, if you want to interact with some community, you should have whatever background knowledge that community expects. Even if I just made a list of concepts, I’d expect that list to be out of date reasonably quickly (a few years), for a field like deep RL. I think this is pretty important if you want to do any of:
Convince researchers in the field that their work would be risky if scaled up
Learn from evidence presented in papers from the field (this post)
Forecast questions relevant to the field, for questions that don’t have obvious base rates (e.g. AGI timelines)

If you don’t have the background knowledge, you can rely on someone else who has such background knowledge. Notably, this is not important if you want to "build basic theory" or something like that, which doesn’t require interaction with the AI community. (Though it might be important for guiding your search for basic theory, I’m not sure.) Also, I forgot to mention this before: normally for deep RL I’d recommend Spinning Up in Deep RL, though in this case that’s too focused on deep RL and not enough on RL basics.
EDIT: An analogy: if someone asked a handyman for a list of resources on how to fix common house problems, it’s not clear that the handyman would have remembered to give the advice "turn clockwise to tighten, and counterclockwise to loosen", because it’s so ingrained. Similarly, I think if I had tried to give a list prior to seeing this post, I would not have thought to give the advice "think about what the optimal policy is, and then expect your RL algorithms to find similar policies".
Comment
It’s the other way around, right?
Comment
Lol yes fixed
The handyman might not give basic advice, but if he didn’t have any advice, I would assume that he doesn’t want to help. I’m really confused by your answers. You have a long comment criticizing the lack of basic RL knowledge of the AF community, and when I ask you for pointers, you say that you don’t want to give any, and that people should just learn the background knowledge. So should every member of the AF stop what they’re doing right now to spend 5 years doing a PhD in RL before being able to post here? If the goal of your comment was to push people to learn things you think they should know, pointing towards some stuff (not an exhaustive list) is the bare minimum for that to be effective. If you don’t, I can’t see many people investing the time to learn enough RL so that by osmosis they can understand a point you’re making.
Comment
I also don’t buy that pointing out a problem is only effective if you have a concrete solution in mind. MIRI argues that it is a problem that we don’t know how to align powerful AI systems, but doesn’t seem to have any concrete solutions. Do you think this disqualifies MIRI from talking about AI risk and asking people to work on solving it?
Comment
Reinforcement Learning by Sutton & Barto (my book review)
Nice book for learning the basics. Best textbook I’ve read for RL, but that’s not saying much.
Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
*AI: A Modern Approach 3e* by Russell & Norvig (my book review)
Engaging and clear, but most of the book wasn’t about RL. Outdated, but 4e is out now and maybe it’s better.
*Markov Decision Processes* by Puterman
Thorough, theoretical, very old, and very boring. Formal and dry. It was written decades ago, so obviously no mention of Deep RL.
*Neuro-Dynamic Programming* by Tsitsiklis
When I was a wee second-year grad student, I was independently recommended this book by several senior researchers. Apparently it’s a classic. It’s very dry and was written in 1996. Pass.
OpenAI’s several-page web tutorial *Spinning Up with Deep RL* is somehow the most useful beginning RL material I’ve seen, outside of actually taking a class. Kinda sad.

So when I ask my brain things like "how do I know about bandits?", the result isn’t "because I read it in {textbook #23}", but rather "because I worked on different tree search variants my first summer of grad school" or "because I took a class". I think most of my RL knowledge has come from:
My own theoretical RL research
the fastest way for me to figure out a chunk of relevant MDP theory is often just to derive it myself
Watercooler chats with other grad students

Sorry to say that I don’t have clear pointers to good material.
Comment
Thanks for the in-depth answer! I do share your opinion on the Sutton and Barto, which is the only book I read from your list (except a bit of the Russell and Norvig, but not the RL chapter). Notably, I took a lot of time to study the action-value methods, only to realise later that a lot of recent work focuses instead on policy-gradient methods (even if actor critics do use action-values). From your answer and Rohin’s, I gather that we lack a good resource in Deep RL, at least of the kind useful for AI Safety researchers. It makes me even more curious of the kind of knowledge that would be treated in such a resource.
> Here’s an obvious next step for people: google for resources on RL, ask others for recommendations on RL, try out some of the resources and see which one works best for you, and then choose one resource and dive deep into it, potentially repeat until you understand new RL papers by reading.

Agreed. Which is exactly why I asked you for recommendations. I don’t think you’re the only one someone interested in RL should ask for recommendations (I already asked other people, and knew some resources before all this), but as one of the (apparently few) members of the AF with the relevant skills in RL, it seemed that you might offer good advice on the topic. About self-learning, I’m pretty sure people around here are good on this count. But knowing **how** to self-learn doesn’t mean knowing **what** to self-learn. Hence the pointers.
Comment
Comment
This is an aside, but I remain really confused by the claim that RL algorithms will tend to find policies close to the optimal one. Is inductive bias not a thing for RL?
Comment
It’s a thing, and is one of the caveats I mentioned. For tabular RL, algorithms can find optimal policies in the limit of infinite exploration, but without infinite exploration how close you get to the optimal policy will depend on the environment (including reward function). For deep RL, even with infinite exploration you don’t get the guarantee, since the optimization problem is nonconvex, and the optimal policy may not be expressible by your neural net. So it again depends heavily on the environment.

I think the proper version of the claim is more like "if a paper reports results with RL, the policy they find is probably good, as otherwise they wouldn’t have published it". In practice RL algorithms often fail and need to be heavily tuned to do well, and researchers have to pull out lots of tricks to get them to work. But regardless, I claim the first-order approximation to what an RL algorithm will do is "the optimal policy". You can then figure out reasons for deviation, e.g. "this reward is super sparse, so the algorithm won’t get learning signal, so it’ll have effectively random behavior".

If someone expected RL algorithms to fail on this bandit task, and then updated because they succeeded, I’d find that reasonable (though I’d find it pretty surprising that they’d expect a failure on bandits—it’s a relatively simple task where you can get tons of data).
It might well be that 1) people who already know RL shouldn’t be much surprised by this result and 2) people who don’t know much RL are justified in updating on this info (towards mesa-optimizers arising more easily).

This would be the case if RL intuition correctly implies that proto-mesa-optimizers (like the one in the paper) arise naturally, and that intuition wasn’t widely shared outside of RL. Not sure if this is actually the way things are, but it seems plausible to me.
Comment
This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms], but in fact it is not.
We would like the community to be such that this is pointed out quickly, the author edits the post accordingly, and the post does not get super high reception
Instead, the post has high karma, is curated, this wasn’t pointed out until you said it, and the post has not been edited.
If part of the failure is that the post is well-received, why wouldn’t you want people to downvote it now that you pointed it out? I also think the average LW user shouldn’t be expected to understand enough RL to see this, so the system should detect this kind of failure for them. (Which it has done now that you’ve written your comment.) For those people, the proper reaction seems to be to remove their upvote and perhaps downvote.

Separately, I think you can explain part of the failure by laziness rather than a lack of understanding of RL. You could read/skim this post and not quite understand what the setting actually is (even though it’s mentioned at the end of the second chapter). Just like I don’t think the average LW user should be expected to understand enough ML to realize that the main point is misleading, I also don’t think they should be expected to read the post carefully enough before upvoting it, especially not if it’s curated or high karma (because that should be a quality assurance, and at that point it seems fine to upvote purely to signal-boost the point). I realize your critique was of the AF, not of LW, so I’m not sure how much I’m really disagreeing with you here. But since Evan Hubinger understood the point and upvoted the post anyway, it’s unclear how much you can conclude. (EDIT after rohin’s answer: actually, I agree this is most likely not a typical case.)
Comment
Comment
Comment
It’s trivially correct to update downward on the de-facto importance of promotion (by however much), but this seems like a bad thing. Naively, I would like people to make sure they understand the point at
the curation step
the promotion-to-AF step
*maybe* at the upvote step if you’re a professional AI safety researcher

And if the conclusion is that the post is meaningful despite possibly being misinterpreted, I would naively want the person in charge to PM the author and ask to put in a clarification before the post is curated/promoted. I say ‘naively’ because I don’t know anything about how hard it would be to achieve this and I could be genuinely wrong about this being a reasonable thing to want.
Thanks for this. It seems important. Learning still happening after weights are frozen? That’s crazy. I think it’s a big deal because it is evidence for mesa-optimization being likely and hard to avoid.

It also seems like evidence for the Scaling Hypothesis. One major way the scaling hypothesis could be false is if there are further insights needed to get transformative AI, e.g. a new algorithm or architecture. A simple neural network spontaneously learning to do its own, more efficient form of learning? This seems like a data point in favor of the idea that our current architectures and algorithms are fine, and will eventually (if they are big enough) grope their way towards more efficient internal structures on their own.

EDIT: Now I’m less sure of all the above, thanks to Rohin’s comment below. I guess this is a case of "Evidence to the people who didn’t already understand the theory well enough to make the prediction," which maybe included me? Though I think I would have made the prediction too had I been asked…
Comment
Sure. We see that elsewhere too, like Dactyl. And of course, GPT-3.
Comment
Huh, thanks.
Gwern, I’m curious whether you would guess that something like mesa-optimization, broadly construed, is happening in GPT-3?
Two separate size parameters. The size of the search space, and the size the traversal algorithm needs to be to span the same gaps brains did.
> That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures.

What would it look like if they did?
Comment
The thing I meant by "catastrophic" is "leading to the death of the organism." I’m suspicious that mesa-optimization is common in humans, although I don’t feel confident of that. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just everyday "personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures (although I can also imagine them being caused by more mundane processes). But I think these things don’t kill people very often? People do sometimes choose to die because of beliefs. And anorexia sometimes kills people, which currently feels to me like the most straightforward candidate example I’ve considered.

Things could be a lot worse. For example, it could be the case that mind-architectures that give rise to mesa-optimization simply aren’t viable—that it always kills them. Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don’t think you see these things, and I’m interested in figuring out how evolution prevented them.
Comment
Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology. Perhaps something similar happens with humans.
The claim that came to my mind is that the conscious mind is the mesa-optimizer here, the original outer optimizer being a riderless elephant.
Why do you single out anorexia? Do you mean people starving themselves to death? My understanding is that is very rare. Anorexics have a high death rate and some of that is long-term damage from starvation. They also (abruptly) kill themselves at a high rate, comparable to schizophrenics, but why single that out? There’s a theory that they have practice with internal conflict, which does seem relevant, but I think that’s just a theory, not clear cut at all.
Comment
Yeah, I wrote that confusingly, sorry; edited to clarify. I just meant that of the limited set of candidate examples I’d considered, (my model, which may well be wrong) of anorexia feels most straightforwardly like an example of something capable of causing catastrophic within-brain inner alignment failure. That is, it currently feels natural to me to model anorexia as being caused by an optimizer for thinness arising in brains, which can sometimes gain sufficient power that people begin to optimize for that goal at the expense of essentially all other goals. But I don’t feel confident in this model.
Comment
I’m objecting to the claim that it fits your criterion of "catastrophic." Maybe it’s such a clear example, with such a clear goal, that we should sacrifice the criterion of catastrophic, but you keep using that word.
Comment
Ah, I see. The high death rate was what made it seem often-catastrophic to me. Is your objection that the high death rate doesn’t reflect something that might reasonably be described as "optimizing for one goal at the expense of all others"? E.g., because many of the deaths are suicides, in which case persistence may have been net negative from the perspective of the rest of their goals too? Or because deaths often result from people calibratedly taking risky but non-insane actions, who just happened to get unlucky with heart muscle integrity or whatever?
Comment
I asked you if you were talking about starving to death and you didn’t answer. Does your abstract claim correspond to a concrete claim, or do you just observe that anorexics seem to have a goal and assume that everything must flow from that and the details don’t matter? That’s a perfectly reasonable claim, but it’s a weak claim so I’d like to know if that’s what you mean. Abrupt suicides by anorexics are just as mysterious as suicides by schizophrenics and don’t seem to flow from the apparent goal of thinness. Suicide is a good example of something, but I don’t think it’s useful to attach it to anorexia rather than schizophrenia or bipolar. Long-term health damage would be a reasonable claim, which I tried to concede in my original comment. I’m not sure I agree with it. I could pose a lot of complaints about it, but I wouldn’t. If it’s clear that it is the claim, then I think it’s clearly a weak claim and that’s OK. (As for the objection you propose, I would rather say: lots of people take badly calibrated risks without being labeled insane.)
Comment
The scenario I had in mind was one where death occurs as a result of damage caused by low food consumption, rather than by suicide.
Comment
I agree, in the case of evolution/humans. In the text above, I meant to highlight what seemed to me like a relative lack of catastrophic within-mind inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like a reasonable way to operationalize "catastrophic" in these cases, but I can imagine other reasonable ways.
Comment
I think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward", perhaps also including "manipulating the DA system in a clearly misaligned way". One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I’d expect that a failure mode leading to systematically low DA rewards would usually be corrected gradually, as the DA punishes those patterns. However, this is not really clear. The misaligned PFC might e.g. put itself in a local maximum, where it creates DA punishment for giving into temptation. (For example, an ascetic getting social reinforcement from a group of ascetics might be in such a situation.)
Comment
> I think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward"

Thanks—I feel pretty convinced that this operationalization makes more sense than the one I proposed.
Comment
Comment
It seems possible to me. A common strategy in religious groups is to steer for a wide barrier between them and particular temptations. This could be seen as a strategy for avoiding DA signals which would de-select for the behaviors encouraged by the religious group: no rewards are coming in for alternate behavior, so the best the DA can do is reinforce the types of reward which the PFC has restricted itself to. This can be supplemented with modest rewards for desired behaviors, which force the DA to reinforce the inner optimizer’s desired behaviors. Although is easier in a community which supports the behaviors, it’s entirely possible to do this to oneself in relative isolation, as well.
Comment
Good point, I wasn’t thinking of social effects changing the incentive landscape.
Kaj, the point I understand you to be making is: "The inner RL algorithm in this scenario seems likely to be reliably aligned with the outer RL algorithm, since the former was selected specifically on the basis of it being good at accomplishing the latter’s objective, and since if the former deviates from pursuing that objective it will receive less reward from the outer alg, leading it to reconfigure itself to be more aligned. And since the two algorithms operate on similar time scales, we should expect any such misalignment to be noticed/corrected quickly." Does this seem like a reasonable paraphrase? It doesn’t feel obvious to me that the outer layer will be able to reliably steer the inner layer in this sense, especially as the system becomes more powerful. For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things could become decoupled.
Comment
That seems like a reasonable paraphrase, at least if you include the qualification that the "quickly" is relative to the amount of structure that the inner layer has accumulated, so might not actually happen quickly enough to be useful in all cases.
Sure, e.g. lots of exotic sexual fetishes look like that to me. Hmm, though actually that example makes me rethink the argument that you just paraphrased, given that those generally emerge early in an individual’s life and then generally don’t get "corrected".
Funnily enough, I wrote a blog post distilling what I learned from reproducing experiments of that 2018 Nature paper, adding some animations and diagrams. I especially look at the two-step task, the Harlow task (the one with monkeys looking at a screen), and also try to explain some brain things (e.g. how DA interacts with the PFN) at the end.
The slot-based NN paper is "Meta-Learning with Memory-Augmented Neural Networks", Santoro et al 2016 (Arxiv).
I don’t think that paper is an example of mesa optimization, because the policy could be implementing a very simple heuristic to solve the task, similar to: pick the image that led to the highest reward in the last 10 timesteps with 90% probability; pick an image at random with 10% probability.

So the policy doesn’t have to have any properties of a mesa optimizer, like considering possible actions and evaluating them with a utility function, etc.
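(For concreteness, here is a toy sketch of the kind of heuristic being described; this is an illustration, not the commenter’s code, and the window size and exploration rate are made up:)

```python
import random

def simple_heuristic_policy(history, n_arms, window=10, eps=0.1):
    """Pick the arm with the highest total reward over the last `window`
    steps with probability 1 - eps, otherwise pick at random.
    Illustrative only."""
    if random.random() < eps or not history:
        return random.randrange(n_arms)
    totals = [0.0] * n_arms
    for arm, reward in history[-window:]:  # history = list of (arm, reward)
        totals[arm] += reward
    return max(range(n_arms), key=lambda a: totals[a])
```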
Whenever an RL agent is trained in a partially observed environment, the agent has to take actions to learn about parts of its environment that it hasn’t observed yet or that may have changed. The difference with this paper is that the observations it gets from the environment happen to be the reward the agent received in the previous timestep. However, as far as the policy is concerned, the reward it gets as input is just another component of the state. So the fact that the policy gets the previous reward as input doesn’t make it stand out compared to any other partially observed environment.
Comment
The argument that these and other meta-RL researchers usually make is that (as indicated by the various neurons which fluctuate, and I think based on some other parts of their experiments which I would have to reread to list) what these RNNs are learning is not just a simple play-the-winner heuristic (which is suboptimal, and your suggestion would require only 1 neuron to track the winning arm) but amortized Bayesian inference where the internal dynamics are learning the sufficient statistics of the Bayes-optimal solution to the POMDP (where you’re unsure which of a large family of MDPs you’re in): "Meta-learning of Sequential Strategies", Ortega et al 2019; "Reinforcement Learning, Fast and Slow", Botvinick et al 2019; "Meta-learners’ learning dynamics are unlike learners’", Rabinowitz 2019; "Bayesian Reinforcement Learning: A Survey", Ghavamzadeh et al 2016, are some of the papers that come to mind. Then you can have a fairly simple decision rule using that as the input (eg Figure 4 of Ortega on a coin-flipping example, which is a setup near & dear to my heart).
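(A toy sketch of what "tracking the sufficient statistics of the Bayes-optimal solution" means in the simplest Beta-Bernoulli bandit case; this is an illustration, not code from the cited papers. `env_pull` is a stand-in for the environment, and Thompson sampling stands in for whatever decision rule sits on top of the posterior:)

```python
import random

def thompson_bandit(env_pull, n_arms, n_steps):
    """Beta-Bernoulli bandit: the posterior over each arm is fully summarized
    by (successes, failures); the claim is that an RNN which solves such tasks
    well tracks something equivalent in its hidden state. Illustrative only."""
    succ = [1] * n_arms  # Beta(1, 1) priors
    fail = [1] * n_arms
    for _ in range(n_steps):
        samples = [random.betavariate(succ[a], fail[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = env_pull(arm)  # assumed to return 0 or 1
        succ[arm] += reward
        fail[arm] += 1 - reward
    return succ, fail
```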
To reuse a quote from my backstop essay: as Duff 2002 puts it,
I made some remarks going partly off of your comment into a post: https://www.alignmentforum.org/posts/WmBukJkEFM72Xr397/mesa-search-vs-mesa-control
It looks like humans actually suffer from mesa-optimisation: when our mind finds a hack to get more dopamine via some sort of "illegal" reward center stimulation: pornography, drugs etc.
Comment
What you’re describing is humans being mesa-optimizers inside the natural selection algorithm. The phenomenon this post talks about is one level deeper.
Comment
Gah, thanks! Fixed.
I dunno, I didn’t really like the meta-RL paper. Maybe it has merits I’m not seeing. But I didn’t find the main analogy helpful. I also don’t think "mesa-optimizer" is a good description of the brain at this level. (i.e., not the level involving evolution). I prefer "steered optimizer" for what it’s worth. :-)
Comment
As I understand it, your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?

One of the aspects of Wang et al.’s paper that most interested me was that the inner layer in their meta-RL model kept learning even once reward signal from the outer layer had ceased. It seems reasonable to me to hypothesize that in fact what’s going on between PFC and DA is something closer to "subcortex-supervised learning," where the PFC’s input signals are quite regularly "labeled" by a DA-supervisor. But it doesn’t feel intuitively obvious to me that the portion of PFC input which might be labeled in this way is high—e.g., I feel confused about what portion of the concepts currently active in my working memory while writing this paragraph might be labeled by DA—nor that it much restricts the space of possible algorithms that might arise in PFC.
Comment
The temporal difference learning algorithm is an efficient way to do reinforcement learning. And probably something like it happens in the human brain. If you are playing a game like chess, it may take a long time to get enough examples of wins and losses, for training an algorithm to predict good moves. Say you play 128 games, that’s only 7 bits of information, which is nothing. You have no way of knowing which moves in a game were good and which were bad. You have to assume all moves made during a losing game were bad. Which throws out a lot of information.
Temporal difference learning can learn "capturing pieces is good" and start optimizing for that instead. This implies that "inner alignment failure" is a constant fact of life. There are probably players that get quite far in chess doing nothing more than optimizing for piece capture.
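(A toy sketch of the mechanism being described, illustrative only, with a made-up linear value function over features such as material balance: the TD(0) update assigns credit to whatever intermediate features predict later reward, without waiting for full game outcomes.)

```python
def td0_update(w, features, next_features, reward, alpha=0.01, gamma=0.99):
    """One TD(0) step on a linear value function V(s) = w . phi(s).
    If phi(s) includes a 'material balance' feature, its weight grows whenever
    capturing pieces tends to precede reward. Illustrative only."""
    v = sum(wi * fi for wi, fi in zip(w, features))
    v_next = sum(wi * fi for wi, fi in zip(w, next_features))
    delta = reward + gamma * v_next - v  # TD error / reward prediction error
    return [wi + alpha * delta * fi for wi, fi in zip(w, features)]
```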
I used to have anxiety about the many worlds hypothesis. It just seems kind of terrifying, constantly splitting into hell-worlds and the implications of quantum immortality. But it didn’t take long for it to stop bothering me and to even suppress thoughts about it. After all such thoughts don’t lead to a reward and cause problems and an RL brain should punish them.
But that’s kind of terrifying itself isn’t it? I underwent a drastic change to my utility function. And even the emergence of anti-rational heuristics for suppressing thoughts. Which a rational bayesian should never do (at least not for these reasons.)
Anyway gwern has a whole essay on multi-level optimization algorithms like this, that I haven’t seen linked yet: https://www.gwern.net/Backstop
Comment
That gwern essay was helpful, and I didn’t know about it; thanks.
Curated. [Edit: no longer particularly endorsed in light of Rohin’s comment, although I also have not yet really vetted Rohin’s comment either and currently am agnostic on how important this post is] When I first started following LessWrong, I thought the sequences made a good theoretical case for the difficulties of AI Alignment. In the past few years we’ve seen more concrete, empirical examples of how AI progress can take shape and how that might be alarming. We’ve also seen more concrete simple examples of AI failure in the form of specification gaming and whatnot. I haven’t been following all of this in depth and don’t know how novel the claims here are [fake edit: gwern notes in the comments that similar phenomena have been observed elsewhere]. But, this seemed noteworthy for getting into the empirical observation of some of the more complex concerns about inner alignment. I’m interested in seeing more discussion of these results, what they mean and how people think about them.