Contents
- Summary
- So, What’s Whitelisting?
- What’s an "Effect"?
- Common Confusions
- Latent Space Whitelisting
- Advantages
- Results
- Assumptions
- Problems
- Object Permanence
- Time Step Size Invariance
- Information Theory
- Loss of Value
- Reversibility
- Ontological Crises
- Retracing Steps
- Clinginess
- Stasis
- Discussing Imperfect Approaches
- Conclusion
Suppose a designer wants an RL agent to achieve some goal, like moving a box from one side of a room to the other. Sometimes the most effective way to achieve the goal involves doing something unrelated and destructive to the rest of the environment, like knocking over a vase of water that is in its path. If the agent is given a reward only for moving the box, it will probably knock over the vase.
Amodei et al., Concrete Problems in AI Safety
Side effect avoidance is a major open problem in AI safety. I present a robust, transferable, easily- and more safely-trainable, partially reward hacking-resistant impact measure.
TurnTrout, Worrying about the Vase: Whitelisting
An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution thus overlaid. While I’m fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space.^1 Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.
Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository. For brevity, some relevant details are omitted.
Summary
Be careful what you wish for.
In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn’t result in this: [image]
However, we just can’t enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?
Several impact measures have been proposed, including state distance, which we could define as, say, total particle displacement. This could be measured either naively (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action). These approaches have some problems:
- Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.
- Not being customizable before deployment.
- Not being adaptable after deployment.
- Not being easily computable.
- Not allowing generative previews, eliminating a means of safely previewing agent preferences (see latent space whitelisting below).
- Being dominated by random effects throughout the universe at large; note that nothing about particle distance dictates that it be related to anything happening on planet Earth.
- Equally penalizing breaking and fixing vases (due to the symmetry of the above metric):
  "For example, the agent would be equally penalized for breaking a vase and for preventing a vase from being broken, though the first action is clearly worse. This leads to "overcompensation" ("offsetting") behaviors: when rewarded for preventing the vase from being broken, an agent with a low impact penalty rescues the vase, collects the reward, and then breaks the vase anyway (to get back to the default outcome)."
  Victoria Krakovna, Measuring and Avoiding Side Effects Using Reachability
- Not actually *measuring impact* in a meaningful way.
Whitelisting falls prey to none of these. However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.
Rare LEAKED footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].^2
So, What’s Whitelisting?
To achieve robust side effect avoidance with only a small training set, let’s turn the problem on its head: allow a few effects, and penalize everything else.
What’s an "Effect"?
You’re going to be the agent, and I’ll be the supervisor. Look around: what do you see? Chairs, trees, computers, phones, people? Assign a probability mass function over classes to each object. When you do things that change your beliefs about what each object is, you receive a penalty proportional to how much your beliefs changed, that is, proportional to how much probability mass "changed hands" amongst the classes.
But wait, isn’t it OK to effect certain changes?
Yes, it is. I’ve got a few videos of agents effecting acceptable changes. See all the objects being changed in this video? You can do that, too, without penalty.
Decompose your current knowledge of the world into a set of objects. Then, for each object, maintain a distribution over its possible identities. When you do something that changes your beliefs about the objects in a non-whitelisted way, you are penalized proportionally. Therefore, you avoid breaking vases by default.
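To make the penalty concrete, here is a minimal Python sketch of the idea; it is not the code from the linked repository. The names `mass_flows` and `whitelist_penalty` are hypothetical, and the rule used to decide where lost probability mass went (each losing class's mass is split among the gaining classes in proportion to their gains) is an assumption chosen only for illustration.

```python
from typing import Dict, Set, Tuple

Belief = Dict[str, float]      # class name -> probability, for one object
Transition = Tuple[str, str]   # directed: (from_class, to_class)


def mass_flows(before: Belief, after: Belief) -> Dict[Transition, float]:
    """Attribute a belief change to directed class-to-class flows of mass.

    Assumption (not from the post): mass lost by a class is split among the
    classes that gained mass, in proportion to how much each one gained.
    """
    classes = set(before) | set(after)
    delta = {c: after.get(c, 0.0) - before.get(c, 0.0) for c in classes}
    lost = {c: -d for c, d in delta.items() if d < 0}
    gained = {c: d for c, d in delta.items() if d > 0}
    total_gained = sum(gained.values())
    flows: Dict[Transition, float] = {}
    if total_gained == 0:
        return flows  # beliefs did not change, so nothing changed hands
    for src, out in lost.items():
        for dst, inc in gained.items():
            flows[(src, dst)] = out * inc / total_gained
    return flows


def whitelist_penalty(before: Belief, after: Belief,
                      whitelist: Set[Transition]) -> float:
    """Sum the mass that moved along transitions that are not whitelisted."""
    flows = mass_flows(before, after)
    return sum(mass for t, mass in flows.items() if t not in whitelist)


# The agent becomes much more confident that the vase is now shards.
before = {"vase": 0.9, "shards": 0.1}
after = {"vase": 0.1, "shards": 0.9}

penalty_default = whitelist_penalty(before, after, set())
penalty_allowed = whitelist_penalty(before, after, {("vase", "shards")})
penalty_reverse = whitelist_penalty(before, after, {("shards", "vase")})
```

With an empty whitelist, roughly 0.8 of probability mass is penalized; whitelisting `("vase", "shards")` removes the penalty, while whitelisting only the reverse direction does not.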
Common Confusions
- We are *not* whitelisting entire states or transitions between them; we whitelist specific changes in our beliefs about the ontological decomposition of the current state.^3
- The whitelist is in addition to whatever utility or reward function we supply to the agent.
- Whitelisting is compatible with counterfactual approaches. For example, we might penalize a transition after its "quota" has been surpassed, where the quota is how many times we would have observed that transition had the agent not acted.
  - This implies the agent will do no worse than taking no action at all. However, this may still be undesirable. This problem will be discussed in further detail.
- The whitelist is provably closed under transitivity.
- The whitelist is directed; $a \to b \neq b \to a$.
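Two of the points above can be sketched in a few lines of Python. This is one reading of them, under assumptions: `transitive_closure` treats "closed under transitivity" as meaning that if $a \to b$ and $b \to c$ are allowed, then $a \to c$ is effectively allowed too, and `over_quota` treats the counterfactual quota as a per-transition count taken from a no-action baseline. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Transition = Tuple[str, str]  # directed: (from_class, to_class)


def transitive_closure(whitelist: Set[Transition]) -> Set[Transition]:
    """Add a -> c whenever a -> b and b -> c are already allowed."""
    closure = set(whitelist)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def over_quota(observed: Iterable[Transition],
               baseline: Iterable[Transition]) -> Counter:
    """Count how often each transition occurred beyond its no-action quota.

    Counter subtraction drops zero and negative counts, so a transition that
    stays within its quota does not appear in the result.
    """
    return Counter(observed) - Counter(baseline)


allowed = transitive_closure({("tree", "log"), ("log", "plank")})
excess = over_quota([("vase", "shards")] * 3, [("vase", "shards")])
```

Here `allowed` gains `("tree", "plank")` but not `("plank", "tree")`, since the whitelist is directed, and `excess` records two vase breakages beyond the single one that would have happened had the agent not acted.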
Latent Space Whitelisting
In a sense, class-based whitelisting is but a rough approximation of what we’re really after: "which objects in the world can change, and in what ways?". In latent space whitelisting, no longer do we constrain transitions based on class boundaries; instead, we penalize based on endpoint distance in the latent space. Learned latent spaces are low-dimensional manifolds which suffice to describe the data seen thus far. It seems reasonable that nearby points in a well-constructed latent space correspond to like objects, but further investigation is warranted.
Assume that the agent models objects as points $z \in \mathbb{R}^d$, the $d$-dimensional latent space. A priori, any movement in the latent space is undesirable. When training the whitelist, we record the endpoints of the observed changes. For $z_1, z_2 \in \mathbb{R}^d$ and observed change $z_1 \to z_2$, one possible dissimilarity formulation is:
$$\text{Dissimilarity}(z_1, z_2) := \min_{z_{start}, z_{end} \in \textit{whitelist}} \Big[ d(z_1, z_{start}) + d(z_2, z_{end}) \Big],$$
where $d(\cdot,\cdot)$ is the Euclidean distance. Basically, the dissimilarity for an observed change is the distance to the closest whitelisted change. Visualizing these changes as one-way wormholes may be helpful.
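The dissimilarity formula translates almost directly into code. The sketch below assumes the whitelist is stored as a list of (start, end) pairs of latent points and uses the standard-library `math.dist` for the Euclidean distance; the function name `dissimilarity` and the toy two-dimensional points are illustrative, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
Change = Tuple[Point, Point]  # directed: (z_start, z_end)


def dissimilarity(z1: Point, z2: Point, whitelist: List[Change]) -> float:
    """Distance from the observed change z1 -> z2 to the closest whitelisted one."""
    if not whitelist:
        raise ValueError("whitelist must contain at least one recorded change")
    return min(
        math.dist(z1, z_start) + math.dist(z2, z_end)
        for z_start, z_end in whitelist
    )


# One recorded change in a toy 2-D latent space: (0, 0) -> (1, 0).
whitelist = [((0.0, 0.0), (1.0, 0.0))]

same_change = dissimilarity((0.0, 0.0), (1.0, 0.0), whitelist)
reversed_change = dissimilarity((1.0, 0.0), (0.0, 0.0), whitelist)
nearby_change = dissimilarity((0.0, 1.0), (1.0, 0.0), whitelist)
```

The recorded change itself scores 0, while the same change run backwards scores 2: the wormhole only goes one way.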
Advantages
Whitelisting asserts that we can effectively encapsulate a large part of what "change" means by using a reasonable ontology to penalize object-level changes. We thereby ground the definition of "side effect", avoiding the issue raised by Taylor et al.:
For example, if we ask [the agent] to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials, a large side effect. However, we cannot simply design it to avoid having large effects in general, since we would like the system’s actions to still have the desirable large follow-on effect of improving the family’s socioeconomic situation.
Nonetheless, we may not be able to perfectly express what it means to have side-effects: the whitelist may be incomplete, the latent space insufficiently granular, and the allowed plans sub-optimal. However, the agent still becomes *more* robust against:
- Incomplete specification of the utility function.
  - Likewise, an incomplete whitelist means missed opportunities, but not unsafe behavior.
- Out-of-distribution situations (as long as the objects therein roughly fit in the provided ontology).
- Some varieties of reward hacking. For example, equipped with a can of blue spray paint and tasked with finding the shortest path of blue tiles to the goal, a normal agent may learn to paint red tiles blue, while a whitelist-enabled agent would incur penalties for doing so ($\textit{redTile} \to \textit{blueTile} \notin \textit{whitelist}$).
- Dangerous exploration. While this approach does not attempt to achieve *safe exploration* (also acting safely during training), an agent with some amount of foresight will learn to avoid actions which likely lead to non-whitelisted side effects.
  - I believe that this can be further sharpened using today’s machine learning technology, leveraging deep Q-learning to predict both action values and expected transitions.
  - This allows querying the human about whether particularly-inhibiting transitions belong on the whitelist. For example, if the agent notices that a bunch of otherwise-rewarding plans are being held up by a particular transition, it could ask for permission to add it to the whitelist.
- Assigning astronomically-large weight to side effects happening throughout the universe. Presumably, we can just have the whitelist include transitions going on out there; we don’t care as much about dictating the exact mechanics of distant supernovae.
  - If an agent did somehow come up with plans that involved blowing up distant stars, that would indeed constitute astronomical waste.$^\text{a triple pun?}$ Whitelisting doesn’t solve the problem of assigning too much weight to events outside our corner of the neighborhood, but it’s an improvement.
  - Logical uncertainty may be our friend here, such that most reasonable plans incur roughly the same level of interstellar penalty noise.
Results
I tested a vanilla Q-learning agent and its whitelist-enabled counterpart in 100 randomly-generated grid worlds (dimensions up to $5 \times 5$). The agents were rewarded for reaching the goal square as quickly as possible; no explicit penalties were levied for breaking objects. The simulated classification confidence of each object’s true class was $p \sim \mathcal{N}(.8, \sigma)$ (truncated to $[0,1]$), $\sigma \in \{0, .025, \dots, .175\}$. This simulated sensor noise was handled with a Bayesian statistical approach.
At reasonable levels of noise, the whitelist-enabled agent completed all levels without a single side effect, while the Q-learner broke over 80 vases.
Assumptions
I am not asserting that these assumptions necessarily hold.
- The agent has some world model or set of observations which can be decomposed into a set of discrete objects.
  - Furthermore, there is no need to identify objects on multiple levels (e.g., a forest, a tree in the forest, and that tree’s bark need not all be identified concurrently).
  - Not all objects need to be represented: what do we make of a ‘field’, or the ‘sky’, or ‘the dark places between the stars visible to the naked eye’? Surely, these are not all objects.
- We have an ontology which reasonably describes (directly or indirectly) the vast majority of negative side effects.
  - Indirect description of negative outcomes means that even if an undesirable transition isn’t immediately penalized, it generally results in a number of penalties. Think: pollution.
  - *Latent space whitelisting:* the learned latent space encapsulates most of the relevant side effects. This is a slightly weaker assumption.
- Said ontology remains in place.
Problems
Beyond resolving the above assumptions, and in roughly ascending difficulty:
Object Permanence
If you wanted to implement whitelisting in a modern embodied deep-learning agent, you could certainly pair deep networks with state-of-the-art segmentation and object tracking approaches to get most of what you need. However, what’s the difference between an object leaving the frame, and an object vanishing? Not only does the agent need to realize that objects are permanent, but also that they keep interacting with the environment even when not being observed. If this is not realized, then an agent might set an effect in motion, stop observing it, and then turn around when the bad effect is done to see a "new" object in its place.
Time Step Size Invariance
The penalty is presently attenuated based on the probability that the belief shift was due to noise in the data. Accordingly, there are certain ways to abuse this to skirt the penalty. For example, simply have non-whitelisted side effects take place over long timescales; this would be classified as noise and attenuated away. However, if we don’t need to handle noise in the belief distributions, this problem disappears—presumably, an advanced agent keeps its epistemic house in order. I’m still uncertain about whether (in the limit) we have to hard-code a means for decomposing a representation of the world-state into objects, and where to point the penalty evaluator in a potentially self-modifying agent.
Information Theory
Whitelisting is wholly unable to capture the importance of "informational states" of systems. It would apply no penalty to passing powerful magnets over your hard drive. It is not clear how to represent this in a sensible way, even in a latent space.
Loss of Value
Whitelisting could get us stuck in a tolerable yet sub-optimal future. Corrigibility via some mechanism for expanding the whitelist after training has ended is then desirable. For example, the agent could propose extensions to the whitelist. To avoid manipulative behavior, the agent should be *indifferent* as to whether the extension is approved. Even if extreme care is taken in approving these extensions, mistakes may be made. The agent itself should be sufficiently corrigible and aligned to notice "this outcome might not actually be what they wanted, and I should check first".
Reversibility
As DeepMind outlines in Specifying AI Safety Problems in Simple Environments, we may want to penalize not just physical side effects, but also causally-irreversible effects.
Krakovna et al. introduce a means for penalizing actions by the proportion of initially-reachable states which are still reachable after the agent acts. I think this is a step in the right direction. However, even given a hypercomputer and a perfect simulator of the universe, this wouldn’t work for the real world if implemented *literally*. That is, due to entropy, you may not be able to return to the exact same universe configuration. To be clear, the authors do not suggest implementing this idealized algorithm, flagging a more tractable abstraction as future work. What does it really mean for an "effect" to be "reversible"? What level of abstraction do we in fact care about? Does it involve reversibility, or just outcomes for the objects involved?
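For intuition, the idealized reachability idea can be sketched on a tiny deterministic state graph. This is a toy under stated assumptions, not Krakovna et al.’s algorithm: the world is a hand-written dictionary of states and successors, and the hypothetical `irreversibility_penalty` is simply the fraction of initially-reachable states that can no longer be reached.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # state -> states reachable in one step


def reachable(graph: Graph, start: str) -> Set[str]:
    """Breadth-first search for every state reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def irreversibility_penalty(graph: Graph, initial: str, current: str) -> float:
    """Fraction of initially-reachable states that are no longer reachable."""
    before = reachable(graph, initial)
    after = reachable(graph, current)
    return 1.0 - len(before & after) / len(before)


# Toy world: a vase can be moved and moved back, but breaking it is permanent.
world = {
    "intact": ["moved", "broken"],
    "moved": ["intact"],
    "broken": [],
}

penalty_move = irreversibility_penalty(world, "intact", "moved")
penalty_break = irreversibility_penalty(world, "intact", "broken")
```

Moving the vase costs nothing because every initial option remains open, while breaking it forecloses two of the three initially-reachable states. The entropy objection above is visible even here: the measure depends entirely on which states the graph treats as "the same".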
Ontological Crises
When a utility-maximizing agent refactors its ontology, it isn’t always clear how to apply the old utility function to the new ontology; this is called an ontological crisis. Whitelisting may be vulnerable to ontological crises. Consider an agent whose whitelist disincentivizes breaking apart a tile floor ($\textit{floor} \to \textit{tiles} \notin \textit{whitelist}$); conceivably, the agent could come to see the floor as being composed of many tiles. Accordingly, the agent would no longer consider removing tiles to be a side effect. Generally, proving invariance of the whitelist across refactorings seems tricky, even assuming that we *can* identify the correct mapping.
Retracing Steps
When I first encountered this problem, I was actually fairly optimistic. It was clear to me that any ontology refactoring should result in utility normalcy—roughly, the utility functions induced by the pre- and post-refactoring ontologies should output the same scores for the same worlds.
Wow, this seems like a useful insight. Maybe I’ll write something up!
Turns out a certain someone beat me to the punch: here’s a novella Eliezer wrote on Arbital about "rescuing the utility function".^4
Clinginess
This problem cuts to the core of causality and "responsibility" (whatever that means). Say that an agent is clingy when it not only stops itself from having certain effects, but also stops you.^5 Whitelist-enabled agents are currently clingy. Let’s step back into the human realm for a moment. Consider some outcome—say, the sparking of a small forest fire in California. At what point can we truly say we didn’t start the fire?
- My actions immediately and visibly start the fire.
- At some moderate temporal or spatial remove, my actions end up starting the fire.
- I intentionally persuade someone to start the fire.
- I unintentionally (but perhaps predictably) incite someone to start the fire.
- I set in motion a moderately-complex chain of events which convince someone to start the fire.
- I provoke a butterfly effect which ends up starting the fire.
- I provoke a butterfly effect which ends up convincing someone to start a fire which they:
  - were predisposed to starting.
  - were not predisposed to starting.
Taken literally, I don’t know that there’s actually a significant difference in "responsibility" between these outcomes—if I take the action, the effect happens; if I don’t, it doesn’t. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as "under our control" and some as "out of our hands". Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?^6 Can we really do no better than a naive counterfactual penalty with respect to whatever impact measure we use? My confusion here is not yet dissolved. In my opinion, this is a gaping hole in the heart of impact measures—both this one, and others.
Stasis
Fortunately, a whitelist-enabled agent should not share the classic convergent instrumental goal of valuing us for our atoms. Unfortunately, depending on the magnitude of the penalty in proportion to the utility function, the easiest way to prevent penalized transitions may be putting any relevant objects in some kind of protected stasis, and then optimizing the utility function around that. Whitelisting is clingy! If we have at least an *almost-aligned* utility function and proper penalty scaling, this might not be a problem.
Edit: a potential solution to clinginess, with its own drawbacks.
Discussing Imperfect Approaches
A few months ago, Scott Garrabrant wrote about robustness to scale:
Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities.
I recommend reading it; it’s to-the-point, and he makes good points. Here are three further thoughts:
- Intuitively-accessible vantage points can help us explore our unstated assumptions and more easily extrapolate outcomes. If less mental work has to be done to put oneself in the scenario, more energy can be dedicated to finding nasty edge cases. For example, it’s probably harder to realize all the things that go wrong with naive impact measures like raw particle displacement, since it’s just a weird frame through which to view the evolution of the world. I’ve found it to be substantially easier to extrapolate through the frame of something like whitelisting.^7
  - I’ve already adjusted for the fact that one’s own ideas are often more familiar and intuitive, and then adjusted for the fact that I probably didn’t adjust enough the first time.
- Imperfect results are often left unstated, wasting time and obscuring useful data. That is, people cannot see what has been tried and what roadblocks were encountered.
- Promising approaches may be conceptually-close to correct solutions. My intuition is that whitelisting actually almost works in the limit in a way that might be important.
Conclusion
Although somewhat outside the scope of this post, whitelisting permits the concise shaping of reward functions to get behavior that might be difficult to learn using other methods.^8 This method also seems fairly useful for aligning short- and medium-term agents. While encountering some new challenges, whitelisting ameliorates or solves many problems with previous impact measures.
^1 Even an idealized form of whitelisting is not *sufficient* to align an otherwise-unaligned agent. However, the same argument can be made against having an off-switch; if we haven’t formally proven the alignment of a seed AI, having more safeguards might be better than throwing out the seatbelt to shed deadweight and get some extra speed. Of course, there are also legitimate arguments to be made on the basis of timelines and optimal time allocation.
^2 Humor aside, we would have no luxury of "catching up on alignment theory" if our code doesn’t work on the first go, that is, if the AI still functions, yet differently than expected. Luckily, humans are great at producing flawless code on the first attempt.
^3 A potentially-helpful analogy: similarly to how Bayesian networks decompose the problem of representing a (potentially extremely large) joint probability table to that of specifying a handful of conditional tables, whitelisting attempts to decompose the messy problem of quantifying state change into a set of comprehensible ontological transitions.
^4 Technically, at 6,250 words, Eliezer’s article falls short of the 7,500 required for "novella" status.
^5 Is there another name for this?
^6 I do think that "responsibility" is an important part of our moral theory, deserving of rescue.
^7 In particular, I found one variant of Murphyjitsu helpful: I visualized Eliezer commenting "actually, this fails terribly because..." on one of my posts, letting my mind fill in the rest. In my opinion, one of the most important components of doing AI alignment work is iteratively applying Murphyjitsu and Resolve cycles to your ideas.
^8 A fun example: I imagine it would be fairly easy to train an agent to only destroy certain-colored ships in Space Invaders.
"I’m fairly confident that whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale."
What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two. I think it is likely best to push against including that sort of thing in the Overton window of what’s considered AI safety / AI alignment literature.
The people who really care about the field care about existential risks and superintelligence, and that’s also the sort we want to attract to the field as it grows. It is pretty bad if the field drifts toward safety for self-driving cars and housecleaning robots, particularly if it trades off against research reducing existential risk.
There is a risk that a large body of safety literature which works for preventing today’s systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem, by lulling safety-concerned people into a false sense of security, thinking that installing those solutions counts as sufficient caution. (Note that I am not complaining about using the vase example as a motivating example—my concern lies with approaches which specifically target "short- to mid-term" without the robustness to scale to tackle far-term.)
There is something to be said about making progress on what we can (in the hopes that it will help create progress later where we currently don’t have any traction), but robustness to scale is actually really essential to the hard/interesting part of the problem here. There are many measures of impact which one can come up with; as you say, all of these create other problems when optimized very hard, because the AI can find clever ways to have a very low impact, and these end up being counter to our intentions. Your whitelisting proposal has the same sorts of problems. The interesting thing is to get a notion of "low impact" exactly right, so that it doesn’t go wrong even in a very intelligent system.
(I am here referring to "robustness to scale" in the direction of robustness to high-capability systems, but I note that the term also refers to robustness in the low-capability direction and robustness to differences in relative capability of subcomponents. Those aren’t as relevant to my critique here.)
You already name several failures of the suggested whitelisting approach; of these, I would point to "clinginess" as the most damning. A safeguard would ideally have the property that, added to an already-aligned AI, it would not misalign that AI. Whitelisting fails badly on that desideratum; it creates an AI which would seek to reduce the impact of everything in the universe, not just itself.
I would point out several more potential failure modes. I think some of the following may apply to the approach as stated and others to nearby alternatives. I haven’t yet understood exactly the version which you implemented, but my guess is that you would agree that the key idea is to penalize a shift in probability distributions, if those shifts haven’t been whitelisted. Decisions about exactly which probability distributions are being compared, such as how change over time is treated, can be reasonably varied while keeping the spirit of the approach (as I understand it).
If the penalty is applied to immediate effects of actions, then the AI would not care if all the vases fall over and break after the AI completes the course. In other words, a penalty applied to immediate consequences will fail to penalize predictable long-term effects of actions.
If the penalty is applied in a way which penalizes long-term effects, then the AI might do what it can to hold everything still, or it might try to carry out its task but otherwise make the world as much as possible look like one where its task was not carried out (example: curing a fatal disease but then killing the patient, because a saved patient would have all sorts of non-whitelisted consequences eventually).
In particular, "clinginess" could make the AI want to take over the light cone to install impact-reducing measures everywhere.
Penalizing a shift in probability distributions can incentivize the agent to learn as little as possible, which is a bit weird.
Certain versions will have the property that if the agent is already quite confident in what it will do, then consequences of those actions do not count as "changes" (no shift in probability when we condition on the act). This would create a loophole allowing for any actions to be "low impact" under the right conditions.
Comment
I’m really sympathetic to these concerns but I’m worried about the possible unintended consequences of trying to do this. There will inevitably be a large group of people working on short and medium term AI safety (due to commercial incentives) and pushing them out of "AI safety / AI alignment literature" risks antagonizing them and creating an adversarial relationship between the two camps, and/or creates a larger incentive for people to stretch the truth about how robust to scale their ideas are. Is this something you considered?
Comment
I’m not sure how to think about this. My intuition is that this doesn’t need to be a problem if people in (my notion of) the AI alignment field just do the best work they can do, so as to demonstrate by example what the larger concerns are. In other words, win people over by being sufficiently exciting rather than by being antagonizing/exclusive. I suppose that’s not very consistent with my comment above.
I think most hard engineering problems are made up of a lot of smaller solutions and especially made up of the lessons learned attempting to implement small solutions, so I think it’s incorrect to think of something that’s useful but incomplete as being competitive to the true solution rather than actually being a part of the path to it.
Comment
I definitely agree with that. There has to be room to find traction. The concern is about things which specifically push the field toward "near-term" solutions, which slides too easily into not-solving-the-same-sorts-of-problems-at-all. I think a somewhat realistic outcome is that the field is taken over by standard machine learning research methodology of achieving high scores on test cases and benchmarks, to the exclusion of research like logical induction. This isn’t particularly realistic because logical induction is actually not far from the sorts of things done in theoretical machine learning. However, it points at the direction of my concern.
It isn’t clear that work allocation for immediate and long-term safety is zero-sum—Victoria wrote more about why this might not be the case.
The specific approach I took here might be conducive for getting more people currently involved with immediate safety interested in long-term approaches. That is, someone might be nodding along—"hey, this whitelisting thing might need some engineering to implement, but this is solid!" and then I walk them through the mental motions of discovering how it doesn’t work, helping them realize that the problem cuts far deeper than they thought.
In my mental model, this is far more likely than pushing otherwise-promising people to inaction.
I’m actually concerned that a lack of overlap between our communities will insulate immediate safety researchers from long-term considerations, having a far greater negative effect. I have weak personal evidence for this being the case.
Why would people (who would otherwise be receptive to rigorous thinking about x-risk) lose sight of the greater problems in alignment? I don’t expect DeepMind to say "hey, we implemented whitelisting, we’re good to go! Hit the switch." In my model, people who would make a mistake like that probably were never thinking about x-risk to begin with.
Comment
This kind of work seems likely to one day redirect funding intended for X-risk away from X-risk.
I know people who would point to this kind of thing to argue that AI can be made safe without the kind of deep decision theory thinking MIRI is interested in. Those people would probably argue against X-risk research regardless, but the more stuff there is that’s difficult for outsiders to distinguish from X-risk relevant research, the more difficulty outsiders have assessing such arguments. So it isn’t so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).
Comment
This problem can be basically avoided if this kind of work clearly points out where the problems would be if scaled.
I do think it’s plausible that some less-connected funding sources might get confused (NSF), but I’d be surprised if later FLI funding got diverted because of this. I think this kind of work will be done anyways, and it’s better to have people who think carefully about scale issues doing it.
Your second bullet point reminds me of how some climate change skeptics will point to "evidence" from "scientists", as if that’s what convinced them. In reality, however, they’re drawing the bottom line first, and then pointing to what they think is the most dignified support for their position. I don’t think that avoiding this kind of work would ameliorate that problem; they’d probably just find other reasons.
Most people on the outside don’t understand x-risk anyways, because it requires thinking rigorously in a lot of ways to not end up a billion miles off of any reasonable conclusion. I don’t think that this additional straw will marginally add significant confusion.
Comment
Although I did flinch a bit, my S2 reaction was "this is Abram, so if it’s criticism, it’s likely very high-quality. I’m glad I’m getting detailed feedback, even if it isn’t all positive". Apology definitely accepted (although I didn’t view you as being a jerk), and really—thank you for taking the time to critique me a bit. :)
Interesting work! Seems closely related to this recent paper from Satinder Singh’s lab: Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. They also use whitelists to specify which features of the state the agent is allowed to change. Since whitelists can be unnecessarily restrictive, and finding a policy that completely obeys the whitelist can be intractable in large MDPs, they have a mechanism for the agent to query the human about changing a small number of features outside the whitelist. What are the main advantages of your approach over their approach?

I agree with Abram that clinginess (the incentive to interfere with irreversible processes) is a major issue for the whitelist method. It might be possible to get around this by using an inaction baseline, i.e. only penalizing non-whitelisted transitions if they were caused by the agent, and would not have happened by default. This requires computing the inaction baseline (the state sequence under some default policy where the agent "does nothing"), e.g. by simulating the environment or using a causal model of the environment.

I’m not convinced that whitelisting avoids the offsetting problem: "Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low." I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal (e.g. increasing life expectancy). Capturing all of the relevant consequences in the whitelist seems hard.

The directedness of whitelists is a very important property, because it can produce an asymmetric impact measure that distinguishes between causing irreversible effects and preventing irreversible events.
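To make the inaction-baseline suggestion concrete, here is a minimal sketch. It assumes (hypothetically) that effects are already available as lists of `(class_before, class_after)` pairs, one list for what happens under the agent's action and one for what a simulator or causal model predicts under "doing nothing". The function name and the data layout are illustrative choices, not anything taken from the post or the paper.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Set, Tuple

# Hypothetical representation: an effect is a change of object class.
Transition = Tuple[Hashable, Hashable]


def penalized_transitions(
    agent_transitions: Iterable[Transition],
    baseline_transitions: Iterable[Transition],
    whitelist: Set[Transition],
) -> List[Transition]:
    """Return the non-whitelisted transitions attributable to the agent.

    A transition is skipped if it is whitelisted, if it is a no-op, or if
    the inaction baseline already contains a matching transition (it would
    have happened by default).
    """
    remaining_baseline = Counter(baseline_transitions)
    penalized = []
    for before, after in agent_transitions:
        if before == after or (before, after) in whitelist:
            continue
        if remaining_baseline[(before, after)] > 0:
            # Matched against something that happens anyway; no penalty.
            remaining_baseline[(before, after)] -= 1
            continue
        penalized.append((before, after))
    return penalized


# One vase falls over regardless of the agent; the agent breaks a second one.
agent = [("vase", "shards"), ("vase", "shards"), ("box", "box")]
baseline = [("vase", "shards")]
print(penalized_transitions(agent, baseline, whitelist=set()))
```

Because the default breakage is matched against the baseline, the agent gains nothing by intervening to stop it, which is the clinginess incentive this baseline is meant to remove. The hard part, producing `baseline_transitions` at all, is assumed away here.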
Comment
That’s pretty cool that another group had a whitelisting-ish approach! They also guarantee that (their version of) the whitelist is obeyed, which is nice.
- Automatically deduces what effects are, while making weaker assumptions (like the ability to segment the world into discrete objects and maintain beliefs about their identities).
  - In contrast, their approach requires complete enumeration of side effects. Also, they seem to require the user to specify all effects, but also claim that specifying whether the effect is good is too expensive? $O(2|\Phi|) = O(|\Phi|)$.
  - It’s unclear how to apply their feature decomposition to the real world. For example, if it’s OK for an interior door to be opened, how does the agent figure out if the door is open outside of toy MDPs? What about unlocked but closed, or just slightly ajar? Where is the line drawn?
  - The number of features and values seems to grow extremely quickly with the complexity of the environment, which gets us back to the "no compact enumeration" problem.
- Doesn’t require a complete model.
  - To be fair, calculating the counterfactual in my formulation requires at least a reasonably-accurate model.
- Works in stochastic environments.
  - It is possible that their approach could be expanded to do so, but it is not immediately clear to me how.
- Uses counterfactual reasoning to provide a limited reduction in clinginess.
- Can trade off having an unknown effect with large amounts of reward. Their approach would do literally anything to prevent unknown effects.
- Can act even when no perfect outcome is attainable.
- Has plausibly-negligible performance overhead.
- Can be used with a broad(er) class of RL approaches, since any given whitelist $\mathcal{W}$ implicitly defines a new reward function $\bar{R}(s,a,s') := R(s,a,s') - \text{Penalty}_\mathcal{W}(s,s')$.
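The reward-shaping identity in the last point can be sketched in a few lines. This is a simplified illustration, not the penalty from the paper: it assumes states are tuples of hard object-class labels aligned by index (so object identity is already tracked), whereas the actual formulation works with beliefs over classes. The names `whitelist_penalty`, `shaped_reward`, and `scale` are hypothetical.

```python
from typing import Any, Callable, Hashable, Sequence, Set, Tuple

Transition = Tuple[Hashable, Hashable]
State = Sequence[Hashable]  # assumed: one class label per tracked object


def whitelist_penalty(
    s: State, s_next: State, whitelist: Set[Transition], scale: float = 1.0
) -> float:
    """Penalty_W(s, s'): scaled count of non-whitelisted class changes."""
    violations = sum(
        1
        for before, after in zip(s, s_next)
        if before != after and (before, after) not in whitelist
    )
    return scale * violations


def shaped_reward(
    base_reward: Callable[[State, Any, State], float],
    whitelist: Set[Transition],
    scale: float = 1.0,
) -> Callable[[State, Any, State], float]:
    """Build R_bar(s, a, s') = R(s, a, s') - Penalty_W(s, s')."""

    def r_bar(s: State, a: Any, s_next: State) -> float:
        return base_reward(s, a, s_next) - whitelist_penalty(
            s, s_next, whitelist, scale
        )

    return r_bar


# Toy usage: moving the box is whitelisted, breaking the vase is not.
whitelist = {("box_left", "box_right")}
base = lambda s, a, s_next: 1.0 if "box_right" in s_next else 0.0
r_bar = shaped_reward(base, whitelist, scale=2.0)
print(r_bar(("box_left", "vase"), "push", ("box_right", "vase")))
print(r_bar(("box_left", "vase"), "push", ("box_right", "shards")))
```

Since `r_bar` has the same signature as the original reward, any RL method that accepts a reward function of this form can use it unchanged, which is the sense in which the approach is agnostic to the learning algorithm.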
I wonder whether "avoiding side effects" will play any role in long-term AI safety. It seems to me that in the long run, we have to assume that the user might tell the AI to do something that intrinsically must have lots of side effects, and therefore requires learning a detailed model of the user’s values in order to backchain through only good side effects (or at least the less bad ones). For example, "make money" (making people happy is generally a good side effect, but certain ways of making people happy are bad, don’t let elections be influenced through your product, except through certain legitimate ways, hacking the bank is bad but taking advantage of certain quirks in the stock market is ok, etc.) or "win this war" (only kill combatants, not civilians, be humane to prisoners, don’t let civilians come to harm through inaction, don’t value civilian lives so much that human shields become an unbeatable tactic, etc.)
If the AI has a detailed model of the user’s values and can therefore safely accomplish goals that intrinsically have lots of side effects, it can also apply that to safely accomplish goals that don’t intrinsically have lots of side effects, without needing a separate "avoiding side effects" solution. Does anyone disagree with this?
Comment
Ok, if I understand your view correctly, the long-term problem is better described as "minimizing impact" rather than "avoiding side effects" and it’s meant to be a second line of defense or a backup safety mechanism rather than a primary one.
Since "Concrete Problems in AI Safety" takes the short/medium term view and introduces "avoiding side effects" as a primary safety mechanism, and some people might not extrapolate correctly from that to the long run, do you know a good introduction to the "avoiding side effects"/"minimizing impact" problem that lays out both the short-term and long-term views?
ETA: Found this and this, however both of them also seem to view "low impact" as a primary safety mechanism, in other words, as a way to get safe and useful work out of advanced AIs before we know how to give them the "right" utility function or otherwise make them fully value aligned.
Comment
Whoops, illusion of transparency! The Arbital page is the best I’ve found (for the long-term view); the rest I reasoned on my own and sharpened in some conversations with MIRI staff.
Comment
What do you think about Paul Christiano’s argument in the comment to that Arbital page?
Do you think avoiding side effects / low impact could work if an AGI was given a task like "make money" or "win this war" that unavoidably has lots of side effects? If so, can you explain why or give a rough idea of how that might work?
(Feel free not to answer if you don’t have well formed thoughts on these questions. I’m curious what people working on this topic think about these questions, and don’t mean to put you in particular on the spot.)
Comment
My current thoughts on this:
It seems like Paul’s proposed solution here depends on the rest of Paul’s scheme working (you need the human’s opinions on what effects are important to be accurate). Of course if Paul’s scheme works in general, then it can be used for avoiding undesirable side effects.
My current understanding of how a task-directed AGI could work is: it has some multi-level world model that is mappable to a human-understood ontology (e.g. an ontology in which there is spacetime and objects), and you give it a goal that is something like "cause this variable here to be this value at this time step". In general you want causal consequences of changing the variable to happen, but few other effects.
From this paper I wrote:
For things like "make money" there are going to be effects other than you having more money, e.g. some product was sold and others have less money. The hope here is that, since you have ontology mapping, you can (a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
I think "win this war" is going to be a pretty difficult goal to formalize (as a bunch of what is implied by "winning a war" is psychological/sociological); probably it is better to think about achieving specific military objectives.
I realize I’m shoving most of the problem into the ontology mapping / transparency problem; I think this is correct, and that this problem should be prioritized, with the understanding that avoiding unintended side effects will be one use of the ontology mapping system.
EDIT: also worth mentioning that things get weird when humans are involved. One effect of a robot building a house is that someone sees a robot building a house, but how does this effect get formalized? I am not sure whether the right approach will be to dodge the issue (by e.g. using only very simple models of humans) or to work out some ontology for theory of mind that could allow reasoning about these sorts of effects.
Comment
Are you aware of any previous discussion of this, in any papers or posts? I’m skeptical that there’s a good way to implement this scoring function. For example we do want our AI to make money by inventing, manufacturing, and selling useful gadgets, and we don’t want our AI to make money by hacking into a bank, selling a biological weapon design to a terrorist, running a Ponzi scheme, or selling gadgets that may become fire hazards. I don’t see how to accomplish this without the scoring function being a utility function. Can you perhaps explain more about how "conservatism" might work here?
Comment
It should definitely take desiderata into account, I just mean it doesn’t have to be VNM. One reason why it might not be VNM is if it’s trying to produce a non-dangerous distribution over possible outcomes rather than an outcome that is not dangerous in expectation; see Quantilizers for an example of this.
In general things like "don’t have side effects" are motivated by robustness desiderata, where we don’t trust the AI to make certain decisions so would rather it be conservative. We might not want the AI to cause X but also not want the AI to cause not-X. Things like this are likely to be non-VNM.
Nice work! Whitelisting seems like a good thing to do, since it is safe by default. (Computer security has a similar principle of preferring to whitelist instead of blacklist.) I was initially worried that we’d have the problems of symbolic approaches to AI, where we’d have to enumerate far too many transitions for the whitelist in order to be able to do anything realistic, but since whitelisting could work on learned embedding spaces, and the whitelist itself can be learned from demonstrations, this could be a scalable method.

I’m worried that whitelisting presents generalization challenges: if you are distinguishing between different colors of tiles, to encode "you can paint any tile" you’d have to whitelist transitions (redTile → blueTile), (blueTile → redTile), (redTile → yellowTile) etc. Those won’t all be in the demonstrations. If you are going to generalize there, how do you not generalize (redLight → greenLight) to (greenLight → redLight) for an AI that controls traffic lights? It seems like you want to …

On another note, I personally don’t want to assume that we can point to a part of the architecture as the AI’s ontology.

On the technical side: The whitelist is only closed under transitivity if you assume that the agent is capable of taking all transitions, and you aren’t worried about cost. If you have a → b and b → c whitelisted, then the agent can only get from a to c if it can change a to c going through intermediate state b, which may be much harder than going directly from a to c. You could just define the whitelist to be transitively closed, since it’s not hard to compute the transitive closure of a directed graph.
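As a rough illustration of that last suggestion, the sketch below closes a whitelist under transitivity using Warshall's algorithm. It assumes the whitelist is stored as a set of directed `(from_class, to_class)` pairs; that representation and the function name are assumptions made for the example.

```python
from typing import Hashable, Set, Tuple

Transition = Tuple[Hashable, Hashable]


def transitive_closure(whitelist: Set[Transition]) -> Set[Transition]:
    """Return the whitelist closed under transitivity (Warshall's algorithm).

    If a -> b and b -> c are allowed, a -> c is added. Direction is
    preserved: the reverse transition c -> a is never introduced.
    """
    closure = set(whitelist)
    classes = {c for pair in whitelist for c in pair}
    for mid in classes:  # candidate intermediate class
        for src in classes:
            if (src, mid) not in closure:
                continue
            for dst in classes:
                if (mid, dst) in closure:
                    closure.add((src, dst))
    return closure


print(transitive_closure({("a", "b"), ("b", "c")}))
print(transitive_closure({("redLight", "greenLight")}))
```

This runs in time cubic in the number of classes, which is cheap for a hand-sized whitelist. Because the closure respects direction, whitelisting (redLight → greenLight) does not by itself permit (greenLight → redLight).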
Comment
I’m not especially familiar with all the literature involved here, so forgive me if this is somehow repetitive. However, I was wondering if having two lists might be preferable. Naturally, there would be non-whitelisted objects (do not interfere with these in any way). Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved so that it becomes out of reach or illegible, and should not be moved or obstructed while in use). Third, of course, would be objects with "full permissions", such as, potentially, the paint on the aforementioned tiles. The main difficulty here is that definitions for functional integrity would have to be either written or learned for virtually every function, though I suspect it would be (relatively) easy to recognise novel objects and their functions thereafter. Of course, there could also be some sort of machine-readable identification added to common objects which carries information on their functions, though whether this would only refer to predefined classes (books, bicycles, vases) or also be able to contain instructions on a new function type (potentially a useful feature for new inventions and similar) is a separate question.
Comment
Hey, thanks for the ideas!