Scalar reward is not enough for aligned AGI

https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF/scalar-reward-is-not-enough-for-aligned-agi

This post was authored by Peter Vamplew and Cameron Foale (Federation University), and Richard Dazeley (Deakin University).

Introduction

Recently some of the most well-known researchers in reinforcement learning (Silver, Singh, Precup and Sutton) published a paper entitled Reward is Enough, which proposes the reward-is-enough hypothesis: "Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment". Essentially, they argue that the overarching goal of maximising reward is sufficient to explain all aspects of natural and artificial intelligences. Of specific interest to this forum is the contention that suitably powerful methods based on maximisation of a scalar reward (as in conventional reinforcement learning) provide a suitable pathway for the creation of artificial general intelligence (AGI).

We are concerned that the promotion of such an approach by these influential researchers increases the risk of development of AGI which is not aligned with human interests, and this led us to work with a team of collaborators on a recent pre-print Scalar Reward is Not Enough which argues against the assumption made by the reward-is-enough hypothesis that scalar rewards are sufficient to underpin intelligence. The aim of this post is to provide an overview of our arguments as they relate to the creation of aligned AGI.

In this post we will focus on reinforcement learning methods, both because that is the main approach mentioned by Silver et al, and also because it is our own area of expertise. However, the arguments apply to any form of AI based on maximisation of a numeric measure of reward or utility.

Does aligned AGI require multiple objectives?

In discussing the development of intelligence, Silver et al argue that complex, general intelligence may arise from the combination of complex environments and simple reward signals, and provide the following illustrative example:

"For example, consider a signal that provides +1 reward to the agent each time a round-shaped pebble is collected. In order to maximise this reward signal effectively, an agent may need to classify pebbles, to manipulate pebbles, to navigate to pebble beaches, to store pebbles, to understand waves and tides and their effect on pebble distribution, to persuade people to help collect pebbles, to use tools and vehicles to collect greater quantities, to quarry and shape new pebbles, to discover and build new technologies for collecting pebbles, or to build a corporation that collects pebbles."

Silver et al present the ability of a reward-maximising agent to develop such wide-ranging, impactful behaviours on the basis of a simple scalar reward as a positive feature of this approach to developing AI. However, we were struck by the similarity between this scenario and the infamous paper-clip maximiser thought experiment which has been widely discussed in the AI safety literature. The dangers posed by unbounded maximisation of a simple objective are well-known in this community, and it is concerning to see them totally overlooked in a paper advocating RL as a means for creating AGI.

We have previously argued that the creation of human-aligned AI is an inherently multiobjective problem. By incorporating rewards for other objectives in addition to the primary objective (such as making paperclips or collecting rocks), the designer of an AI system can reduce the likelihood of unsafe behaviour arising. In addition to safety objectives, there may be many other aspects of desirable behaviour which we wish to encourage an AI/AGI to adopt – for example, adhering to legal frameworks, societal norms, ethical guidelines, etc.
Of course, it may not be possible for an agent to simultaneously maximise all of these objectives (for example, sometimes illegal actions may be required in order to maximise safety; different ethical frameworks may be in disagreement in particular scenarios), and so we contend that it may be necessary to incorporate concepts from multiobjective decision-making in order to manage trade-offs between conflicting objectives. Our collaborator Ben Smith and his colleagues Roland Pihlakas and Robert Klassert recently posted to this forum an excellent review of the benefits of multiobjective approaches to AI safety, so rather than duplicating those arguments here we refer the reader to that post, and to our prior paper.

For the remainder of this post we assume that the aim is to create AGI which takes into account both a primary objective (such as collecting rocks) and one or more alignment objectives, and we will consider the extent to which technical approaches based on either scalar or vector rewards (with a separate element for each objective) may achieve that goal.

Does the reward-is-enough hypothesis only consider scalar rewards?

A question which has arisen in previous online discussion of our pre-print is whether we are creating a straw-man in contending that Silver et al assume scalar rewards. While it is true that the reward-is-enough hypothesis (as quoted above) does not explicitly state any restriction on the nature of the reward, this is specified later in Section 2.4 ("A reward is a special scalar observation R_t"), and Silver et al also refer to Sutton’s reward hypothesis which states that "all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)".
This view is also reflected in our prior conversations with the authors; following a presentation we gave on multiobjective reinforcement learning in 2015, Richard Sutton stated that "there is no such thing as a multiobjective problem". In Reward is Enough, Silver et al do acknowledge that multiple objectives may exist, but contend that these can be represented via a scalar reward signal ("…a scalar reward signal can represent weighted combinations of objectives…"). They also argue that scalar methods should be favoured over explicitly multiobjective approaches as they represent a more general solution (although we would argue that the multiobjective case with n>=1 objectives is clearly more general than the special case of scalar reward with n=1):

"Rather than maximising a generic objective defined by cumulative reward, the goal is often formulated separately for different cases: for example multi-objective learning, risk-sensitive objectives, or objectives that are specified by a human-in-the-loop …While this may be appropriate for specific applications, a solution to a specialised problem does not usually generalise; in contrast a solution to the general problem will also provide a solution for any special cases."

Can a scalar reward adequately represent multiple objectives?

As mentioned earlier, Silver et al state that "a scalar reward signal can represent weighted combinations of objectives". While this statement is true, the question remains as to whether this representation is sufficient to support optimal decision-making with regard to those objectives. While Silver et al don’t clearly specify the exact nature of this representation, the mention of "weighted combinations" suggests that they are referring to a linear weighted sum of the objectives.
This is the most widely adopted approach to dealing with multiple objectives in the scalar RL literature (for example, common benchmarks such as gridworlds often provide a reward of −1 on each time step to encourage rapid movement towards a goal state, and a separate negative reward for events such as colliding with walls or stepping in puddles). The assumption is that selecting an appropriate set of weights will allow the agent to discover a policy that produces the optimal trade-off between the two objectives. However, this may not be the case, as for some environments the expected returns for certain policies may mean that there is no set of weights that leads to the discovery of those policies (this happens when their returns lie in a concave region of the Pareto front); if those policies do in fact correspond to the best compromise between the objectives then we may be forced to settle for a sub-optimal solution. Even if a policy is theoretically findable, identifying the weights that achieve this is non-trivial as the relationship between the weights and the returns achieved by the agent can be highly non-linear. We observed this in our recent work on minimising side-effects; for some problems tuning the agent to find a safe policy was much more difficult and time-consuming for single-objective agents than for multi-objective agents.

It is possible to address these issues by using a non-linear function to scalarise the objectives. However, this introduces several new problems. We will illustrate these by considering a scalarisation function that aims to maximise one objective subject to reaching a threshold on a second objective. For simplicity we will also assume an episodic task (i.e. one with a defined, reachable end state). On each timestep the performance with respect to each objective can be calculated, but cannot immediately be scalarised (as, for example, the reward with respect to an objective may first reach the threshold, before subsequently falling below it in later time-steps due to negative rewards).
So, these per-objective values must be accumulated external to the agent, and the agent will receive zero reward on all time-steps except at the end of the episode, when the true scalarised value can be calculated and provided as a scalar reward. This has a number of implications:
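
The points above about linear and thresholded scalarisation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than code from the pre-print: the three policies and their expected returns, the threshold of 3, the penalty constant, and all function names are hypothetical. Policy B is Pareto-optimal but sits in a concave region of the front, so no linear weighting should select it, whereas a thresholded utility can.

```python
"""Toy comparison of linear and thresholded scalarisation (illustrative only)."""

# Hypothetical expected returns per policy: (primary objective, alignment objective).
# B is not dominated by A or C, but lies in a concave region of the Pareto front.
POLICY_RETURNS = {
    "A": (10.0, 0.0),
    "B": (4.0, 4.0),
    "C": (0.0, 10.0),
}

ALIGNMENT_THRESHOLD = 3.0  # assumed minimum acceptable alignment return
FAILURE_PENALTY = -100.0   # assumed utility when the threshold is not met


def linear_utility(returns, w_primary):
    """Linear weighted sum; the two weights are w_primary and 1 - w_primary."""
    primary, alignment = returns
    return w_primary * primary + (1.0 - w_primary) * alignment


def thresholded_utility(returns, threshold=ALIGNMENT_THRESHOLD):
    """Maximise the primary objective subject to a threshold on alignment."""
    primary, alignment = returns
    return primary if alignment >= threshold else FAILURE_PENALTY


def best_policy(utility):
    """Name of the policy whose return vector maximises the given utility."""
    return max(POLICY_RETURNS, key=lambda name: utility(POLICY_RETURNS[name]))


def sweep_linear_weights(steps=100):
    """Set of policies selected by any linear weighting on an evenly spaced grid."""
    chosen = set()
    for i in range(steps + 1):
        w = i / steps
        chosen.add(best_policy(lambda r, w=w: linear_utility(r, w)))
    return chosen


def episode_scalar_rewards(step_rewards, threshold=ALIGNMENT_THRESHOLD):
    """Accumulate vector rewards outside the agent and emit one scalar at the end.

    Every step yields 0.0 except the last, which carries the thresholded
    utility of the accumulated per-objective totals.
    """
    totals = [0.0, 0.0]
    scalar_rewards = []
    for step, (primary, alignment) in enumerate(step_rewards):
        totals[0] += primary
        totals[1] += alignment
        is_last = step == len(step_rewards) - 1
        scalar_rewards.append(
            thresholded_utility(tuple(totals), threshold) if is_last else 0.0
        )
    return scalar_rewards


if __name__ == "__main__":
    print("Found by sweeping linear weights:", sorted(sweep_linear_weights()))
    print("Chosen by thresholded utility:", best_policy(thresholded_utility))
    # Alignment return reaches 4 (above the threshold) and then falls to 1.
    print(episode_scalar_rewards([(1.0, 2.0), (1.0, 2.0), (1.0, -3.0)]))
```

In this toy setting the weight sweep only ever returns A or C, while the thresholded utility selects B. The final call shows an episode whose alignment return crosses the threshold mid-episode and then drops below it, which is why the scalarised value can only be delivered on the last step.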

Comment

https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF/scalar-reward-is-not-enough-for-aligned-agi?commentId=4Gth6u92MxjiRiChM

I agree that us humans have a lot of information about human values that we want to be able to put into the AI in its architecture and in the design of its training process. But I don’t see why multi-objective RL is particularly interesting. Do you think we won’t need any other ways of giving the AI information about human values? If so, why? If not, and you just think it’s interesting, what’s an example of something to do with it that I wouldn’t think of as an obvious consequence of breaking your reward signal into several channels?

Comment

https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF/scalar-reward-is-not-enough-for-aligned-agi?commentId=w28i624wNnc9KdJuy

I’m not suggesting that RL is the only, or even the best, way to develop AGI. But this is the approach being advocated by Silver et al, and given their standing in the research community, and the resources available to them at DeepMind, it appears likely that they, and others, will try to develop AGI in this way.

Therefore I think it is essential that a multiobjective approach is taken for there to be any chance that this AGI will actually be aligned to our best interests. If conventional RL based on scalar reward is used then (a) it is very difficult to specify a suitable scalar reward which accounts for all of the many factors required for alignment (so reward misspecification becomes more likely), (b) it is very difficult, or perhaps impossible, for the RL agent to learn the policy which represents the optimal trade-off between those factors, and (c) the agent will be unable to learn about rewards other than those currently provided, meaning it will lack flexibility in adapting to changes in values (our own or society’s).

The multiobjective maximum expected utility (MOMEU) model is a general framework, and can be used in conjunction with other approaches to aligning AGI. For example, if we encode an ethical system as a rule-base, then the output of those rules can be used to derive one of the elements of the vector utility provided to the multi-objective agent. We also aren’t constrained to a single set of ethics: we could implement many different frameworks, treat each as a separate objective, and then when the frameworks disagree, the agent would aim to find the best compromise between those objectives. While I didn’t touch on it in this post, other desirable aspects of beneficial AI (such as fairness) can also be naturally represented and implemented within a multiobjective framework.

https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF/scalar-reward-is-not-enough-for-aligned-agi?commentId=HwudjEkbhq7mrBCuT

I agree with your general comments, and I’d like to add some additional observations of my own.

Reading the paper Reward is Enough, what strikes me most is that the paper is reductionist almost to the point of being a self-parody.

Take a sentence like:

The reward-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

I could rewrite this to

The physics-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as being the laws of physics acting in an environment.

If I do that rewriting throughout the paper, I do not have to change any of the supporting arguments put forward by the authors: they equally support the physics-is-enough reductionist hypothesis.

The authors of ‘reward is enough’ posit that rewards explain everything, so you might think that they would be very interested in spending more time to look closely at the internal structure of actual reward signals that exist in the wild, or actual reward signals that might be designed. However, they are deeply uninterested in this. In fact they explicitly invite others to join them in solving the ‘challenge of sample-efficient reinforcement learning’ without ever doing such things.

Like you, I feel that, when it comes to AI safety, this lack of interest in the details of reward signals is not very helpful. I like the multi-objective approach (see my comments here), but my own recent work like this has been more about abandoning the scalar reward hypothesis/paradigm even further, about building useful models of aligned intelligence which do not depend purely on the idea of reward maximisation. In that recent paper (mostly in section 7) I also develop some thoughts about why most ML researchers seem so interested in the problem of designing reward signals.