EDIT: I now think this post is somewhat confusing and would recommend starting with my more recent post "Exploring safe exploration."
Balancing exploration and exploitation is a classic problem in reinforcement learning. Historically, with approaches such as deep Q-learning, exploration was done explicitly via a rule such as ε-greedy exploration or Boltzmann exploration. With more modern approaches, however, especially policy gradient approaches like PPO that aren’t amenable to something like Boltzmann exploration, exploration is instead entirely learned, encouraged by some sort of extra term in the loss that implicitly rewards exploratory behavior. This is usually an entropy term, though other more advanced approaches have also been proposed, such as random network distillation, in which the agent learns to explore states for which it would have a hard time predicting the output of a random neural network. That approach was able to set a state of the art on Montezuma’s Revenge, an Atari environment that is notoriously difficult because of how much exploration it requires.
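To make the contrast concrete, here is a minimal sketch of the two explicit exploration rules mentioned above, plus the entropy bonus that policy-gradient methods typically add to the loss instead. The Q-values and action spaces are hypothetical illustrations, not drawn from any particular environment:

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, take a uniformly random action;
    otherwise take the greedy (highest-Q) action. The exploratory
    branch ignores the learned Q-values entirely."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q/T),
    so exploration is biased toward actions the agent already
    values; lower temperature means greedier behavior."""
    weights = [math.exp(q / temperature) for q in q_values]
    r = random.random() * sum(weights)
    for action, w in enumerate(weights):
        r -= w
        if r <= 0:
            return action
    return len(q_values) - 1

def entropy_bonus(action_probs):
    """Policy entropy, added (scaled by a coefficient) to a
    policy-gradient loss so that the pressure to explore is folded
    into the learned policy itself rather than a fixed rule."""
    return -sum(p * math.log(p) for p in action_probs if p > 0)
```

Note how Boltzmann exploration's randomness is already shaped by the learned Q-values, whereas ε-greedy's random branch ignores them; the entropy bonus goes one step further by making exploration part of what the policy learns.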
This move to learned exploration has a very interesting and important consequence, however, which is that the safe exploration problem for learned exploration becomes very different. Making ε-greedy exploration safe is in some sense quite easy, since the way it explores is totally random. If you assume that the policy without exploration is safe, then for ε-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question. With learned exploration, however, this becomes much more complicated—there’s no longer a nice "if the non-exploratory policy is safe" assumption that can be used to cleanly subdivide the overall problem of off-distribution safety, since it’s just a single, learned policy doing both exploration and exploitation.
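The "safe on average" argument can be made concrete with a toy per-step risk calculation. The probabilities below are made-up illustrations, not measurements:

```python
def epsilon_greedy_risk(epsilon, p_unsafe_random, p_unsafe_policy=0.0):
    """Per-step probability that an epsilon-greedy agent acts unsafely.

    With probability epsilon it acts uniformly at random, which is
    unsafe with probability p_unsafe_random (a property of the
    environment alone); otherwise it follows the base policy, which is
    unsafe with probability p_unsafe_policy (zero under the
    'non-exploratory policy is safe' assumption)."""
    return epsilon * p_unsafe_random + (1 - epsilon) * p_unsafe_policy

# Under the safe-policy assumption, per-step risk is bounded by epsilon
# times the environment's average risk under random actions:
risk = epsilon_greedy_risk(epsilon=0.1, p_unsafe_random=0.02)
```

Nothing analogous is available for learned exploration: there is no fixed ε-branch whose safety reduces to a property of the environment alone, because the same learned policy generates both the exploratory and the exploitative actions.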
First, though, an aside: why is learned exploration so much better? I think the answer lies primarily in the following observation: for most problems, exploration is an instrumental goal, not a terminal one, which means that to do exploration "right" you have to do it in a way that is cognizant of the objective you’re trying to optimize for. Boltzmann exploration is better than ε-greedy exploration because its exploration is guided by its exploitation, but it’s still essentially just adding random jitter to your policy. Fundamentally, exploration is about the value of information: proper exploration requires dynamically balancing the value of information against the value of exploitation. Ideally, in this view, exploration should arise naturally as an instrumental goal of pursuing the given reward function—an agent should instrumentally want to get updated in a way that makes it better at pursuing its current objective.
Except there’s a really serious, major problem with that reasoning: instrumental exploration only cares about the value of information for helping the model achieve the goal it has learned so far, not for helping it fix that goal to be more aligned with the actual goal.[1] Consider, for instance, my maze example. Instrumental exploration will help the model better explore the larger maze, but it won’t help it figure out that its objective of finding the green arrow is misaligned—that is, it won’t, for example, lead to the model trying both the green arrow and the end of the maze to see which one is right. Furthermore, because instrumental exploration actively helps the model explore the larger maze better, it improves the model’s capability generalization without also helping its objective generalization, leading to precisely the most worrying case in the maze example. If we think about this problem from a 2D robustness perspective, we can see that what’s happening is that instrumental exploration gives us capability exploration but not objective exploration.
Now, how does this relate to corrigibility? To answer that question, I want to split corrigibility into three different subtypes:
- Indifference corrigibility: An agent is indifference corrigible if it is indifferent to modifications made to its goal.
- Exploration corrigibility: An agent is exploration corrigible if it actively searches out information to help you correct its goal.
- Cooperation corrigibility: An agent is cooperation corrigible if it optimizes under uncertainty over what goal you might want it to have.
Previously, I grouped the latter two together as act-based corrigibility, though recently I’ve been moving towards thinking that act-based corrigibility isn’t as well-defined as I previously thought it was. However, I think the concept of objective exploration lets us disentangle act-based corrigibility. Specifically, I think exploration corrigibility is just indifference corrigibility plus objective exploration, and cooperation corrigibility is just exploration corrigibility plus corrigible alignment.[2] That is, if a model is indifferent to having its objective changed and actively optimizes for the value of information in terms of helping you change its current objective, that gives you exploration corrigibility, and if its objective is also a "pointer" to what you want, then you get cooperation corrigibility. Furthermore, I think this helps solve a lot of the problems I previously had with corrigible alignment, as indifference corrigibility and exploration corrigibility together can help you prevent crystallization of deceptive alignment.
Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment. Safety Gym, for example, has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. One particularly interesting recent work in this domain was Leike et al.’s "Learning human objectives by evaluating hypothetical behaviours," which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration. What’s nice about that lens, in my opinion, is that it both makes clear how and why such work is valuable while also demonstrating how much other work there is to be done here. What about objective exploration—how do we do it properly? And do we need measures to put a damper on objective exploration as well? And what about cooperation corrigibility—is the "right" way to put a damper on exploration through constraints or through uncertainty? All of these are questions that I think deserve answers.
1. For a mesa-optimizer, this is saying that the mesa-optimizer will only explore to help its current mesa-objective, not to help it fix any misalignment between its mesa-objective and the base objective. ↩︎
2. Note that this still leaves the question of what exactly indifference corrigibility is unanswered. I think the correct answer to that is myopia, which I’ll try to say more about in a future post—for this post, though, I just want to focus on the other two types. ↩︎
(This entire comment is setting aside embedded agency concerns, except for mesa optimization)

You seem to be equivocating between two notions of exploration. Consider an agent that is trained via RL to do well on a distribution of environments, p(E). Then there are two kinds of exploration:
- Across-episode exploration: Exploration across training trajectories, where the RL algorithm collects trajectories going to various different parts of the state space in the environment, in order to figure out where the reward is.
- Within-episode exploration: Exploration within a single trajectory, where you try to identify which particular E has been sampled, so that you can tailor your trajectory to that E.

In across-episode exploration, the exploration is being done by some human-designed algorithm. (I would claim this of RND, Boltzmann exploration, ε-greedy and entropy bonuses.) I agree that these work because you want to tailor your exploration based on the value of information, but the agent isn’t evaluating the value of information and deciding where to explore; the human-designed algorithm is doing that. So mesa optimization is not going to affect this. In within-episode exploration, the exploration is being done directly by the policy, and so it is reasonable to talk about how a mesa optimizer would do such exploration. With that in mind, some thoughts:
I completely agree with the distinction between across-episode vs. within-episode exploration, and I agree I should have been clearer about that. Mostly I want to talk about across-episode exploration here, though when I was writing this post I was mostly motivated by the online learning case where the distinction is in fact somewhat blurred, since in an online learning setting you do in fact need the deployment policy to balance between within-episode exploration and across-episode exploration.
Agreed. My point is that "If you assume that the policy without exploration is safe, then for ε-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question." That is, even though it seems like it’s hard for ε-greedy exploration to be safe, it’s actually quite easy for it to be safe on average—you just need to be in a safe environment. That’s not true for learned exploration, though.
Yeah, I agree that was confusing—I’ll rephrase it. The point I was trying to make was that across-episode exploration should arise naturally, since an agent with a fixed objective should want to be modified to better pursue that objective, but not want to be modified to pursue a different objective.
Agreed that there’s a similarity there—that’s the motivation for calling it "cooperative." But I’m not trying to advocate for that agenda here—I’m just trying to better classify the different types of corrigibility and understand how they work. In fact, I think it’s quite plausible that you could get away without cooperative corrigibility, though I don’t really want to take a stand on that right now.
If your definition of "safe exploration" is "not making accidental mistakes" then I agree that what I’m pointing at doesn’t fall under that heading. What I’m trying to point at is that I think there are other problems that we need to figure out regarding how models explore than just the "not making accidental mistakes" problem, though I have no strong feelings about whether or not to call those other problems "safe exploration" problems.
Agreed, though I don’t think that’s the end of the story. In particular, I don’t think it’s at all obvious what an agent that cares about the value of information that its actions produce relative to some objective distribution will look like, how you could get such an agent, or how you could verify when you had such an agent. And, even if you could do those things, it still seems pretty unclear to me what the right distribution over objectives should be and how you should learn it.
Well, what does "better exploration" mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective? I think it tends to be "better within-episode exploration relative to the base objective," which I would call putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective, not the base objective.
Sure, but as you note getting the right uncertainty could be quite difficult, so for practical purposes my question is still unanswered.