Contents

- Formalism
- The value functions
- Equivalence for finite horizons
- (In)Equivalence for infinite horizons
- A utility counterexample
- Does it make a difference in practice?
- Appendix: Proofs

A reward function is defined over past sequences of actions and observations. When the agent chooses an action and gets an observation, it receives a reward that is a function of that observation and all previous observations and actions. A utility function is defined over states of the world. You can take actions to increase or decrease the probability of certain states, thus increasing expected utility, but you don’t actually "receive" any utility. Are these different objects, or are they actually the same thing? This would be good to know, as most of the background knowledge of MIRI and similar AI safety groups is about utility functions, while reward functions are prevalent in reinforcement learning. The summary of this post is:
- For finite horizons, reward and utility functions are equivalent.
- For infinite horizons, every bounded discounted reward function is equivalent to a bounded utility function. But not all bounded utility functions have a corresponding reward function, and even when they do, that reward function may not be bounded.
Formalism
Let \newcommand{\A}{\mathcal{A}}\A be the set of actions an agent can take, and \mathcal{O} the set of observations. Assume both sets are finite. Let \newcommand{\H}{\mathcal{H}}\H be the set of histories (sequences of observations and actions) of an agent.
Let \newcommand{\W}{\mathcal{W}}\W be the (possibly infinite) set of worlds. Note that a world includes the agent's full history of actions and observations (since the agent is part of the world). Therefore the worlds are stratified by histories: for any h\in\H, there is a subset \W_h\subset\W consisting of all worlds with history h.
Then a reward function R is a function from histories to real numbers, while a utility function U is a function from worlds to real numbers:
\begin{array}{lr}R:&\H\to\mathbb{R},\\ U:&\W\to\mathbb{R}.\end{array}

Reward and utility functions are bounded if their image is a bounded subset of \mathbb{R}; without loss of generality, this means there exists an l>0 such that R(h)\in[-l,l] for all h\in\H (respectively U(w)\in[-l,l] for all w\in\W).
A policy \pi for an agent is a map from histories to a probability distribution over actions; so \pi: \H\to\Delta\A, for \Delta\A the space of probability distributions over actions.
Even for a utility-based agent, these are the only policies available. This is because all the information (apart from the prior) that the agent gets from the outside world is given by its observations.
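To make the formalism concrete, here is a minimal Python sketch of these objects. The type aliases and the example reward and policy are my own illustration, not anything from the post; in particular, a "world" is crudely identified with a full history, even though in the post a world contains at least that much information.

```python
from typing import Callable, Dict, Tuple

Action = str                                        # an element of the finite set A
Observation = str                                   # an element of the finite set O
History = Tuple[Tuple[Action, Observation], ...]    # an element of H
World = Tuple[Tuple[Action, Observation], ...]      # crude stand-in: in the post a world determines (at least) the full history

Reward = Callable[[History], float]                 # R : H -> reals
Utility = Callable[[World], float]                  # U : W -> reals
Policy = Callable[[History], Dict[Action, float]]   # pi : H -> Delta(A)

# Illustrative reward: 1 whenever the most recent observation is "heads".
def example_reward(h: History) -> float:
    return 1.0 if h and h[-1][1] == "heads" else 0.0

# Illustrative policy: always choose action "a".
def always_a(h: History) -> Dict[Action, float]:
    return {"a": 1.0, "b": 0.0}
```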
The value functions
Let P be the probability estimator used by the agent. We’ll assume for the moment that it is logically omniscient and Bayesian. This P incorporates a prior; applying P unconditionally to worlds or histories gives the prior probabilities P(w) and P(h). Given a history h\in\H and a policy \pi, we can compute the conditional probability of a world or a history; designate these by P^\pi(h'|h) and P^\pi(w|h).

Given these probabilities, we can then compute expected values. For example, the expected utility given \pi and h is:

\mathbb{E}^\pi(U|h)=\int_{w\in\W}U(w)P^\pi(w|h).

We’ll also designate this quantity by the expected value function V(U,\pi,h). If U is bounded in [-l,l], then so is this value function (though the converse need not be true).

For rewards, let \H_i be the set of histories that have i actions and observations. We’ll say that h\in\H_i has length i. Let h_{::i} be the first i actions and observations of the history h.

If the agent knows it will only make n\geq 0 observations and actions, then the future reward has an expected value function. For a policy \pi and a history h_m\in\H_m of length m < n, this is:

V(R,\pi,h_m)=\sum_{h\in\H_n}P^\pi(h|h_m)\sum_{i=m+1}^n R(h_{::i}).

If the agent expects it could make arbitrarily many observations, then given a discount factor 0<\gamma<1, there is the expected discounted future reward:

V(R,\pi,h_m,\gamma)=\lim_{n\to\infty}\sum_{h\in\H_n}P^\pi(h|h_m)\sum_{i=m+1}^n \gamma^{i-(m+1)} R(h_{::i}).

Equivalence for finite horizons

Assume the agent knows it will only make n observations and actions. The sets \{\W_h \mid h\in\H_n\} form a partition of the possible worlds in \W: every possible world is in exactly one of those subsets. Then assume that R is given, and define, for w_h\in\W_h:

U_R(w_h)=\sum_{i=1}^n R(h_{::i}).

Then:
- Theorem 1: On history h, V(U_R,\pi,h) and V(R,\pi,h) differ by a constant that is a function of h only, not of \pi. Consequently, a U_R-maximiser and an R-maximiser will choose the same policies.

All the proofs are given in the appendix at the end of the post.

Now, conversely, assume U is given, but R is not. If h\in\H_n, then V(U,\pi,h) is independent of \pi, since the agent will never take any more actions. Fix any policy \pi'; for a history h of length m>1, define the reward R_U by:

R_U(h)= V(U,\pi',h)-V(U,\pi',h_{::m-1}).

For m=1, define R_U by:

R_U(h)=V(U,\pi',h).
- Theorem 2: On history h, V(U,\pi,h) and V(R_U,\pi,h) differ by a constant that is a function of h and \pi' only, not of \pi. Consequently, an R_U-maximiser and a U-maximiser will choose the same policies.

Both of these constructions are non-unique; for example, U_R could vary across the different worlds of \W_h, as long as its expectation on that set is the same. And R_U could have extra reward added at one step, as long as it is subtracted at a later step, or could use a different \pi'.
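Here is a minimal Python sketch of the two finite-horizon constructions, reusing the type aliases from the earlier sketch. The helper `completions(h, policy)` is a hypothetical stand-in for the agent's model: it should return a dictionary mapping each full length-n history h' to P^\pi(h'|h). None of these names come from the post.

```python
from typing import Callable, Dict

# Placeholder for the agent's (logically omniscient) model:
# completions(h, policy) -> {h_full: P^pi(h_full | h)} over length-n histories.
Completions = Callable[[History, Policy], Dict[History, float]]

def U_R(R: Reward, h_full: History) -> float:
    """Finite-horizon utility from a reward function:
    U_R(w_h) = sum_{i=1..n} R(h_{::i}) for any world whose full history is h_full."""
    return sum(R(h_full[:i]) for i in range(1, len(h_full) + 1))

def value_of_U(U: Utility, policy: Policy, h: History, completions: Completions) -> float:
    """V(U, pi, h): expected utility over full histories h' extending h, weighted by P^pi(h'|h).
    Worlds are identified with full histories, as in the earlier toy sketch."""
    return sum(p * U(h_full) for h_full, p in completions(h, policy).items())

def R_U(U: Utility, pi_prime: Policy, h: History, completions: Completions) -> float:
    """Finite-horizon reward from a utility function:
    R_U(h) = V(U, pi', h) - V(U, pi', h_{::m-1}) for length m > 1, and V(U, pi', h) for m = 1."""
    v_now = value_of_U(U, pi_prime, h, completions)
    if len(h) == 1:
        return v_now
    return v_now - value_of_U(U, pi_prime, h[:-1], completions)
```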
(In)Equivalence for infinite horizons
If the agent expects it could make arbitrarily many observations, then we can still define a good U_R from a given R. Let \H_\infty be the set of possible infinite histories; then the sets \{\W_h \mid h\in\H_\infty\} form a partition of \W. Then for any fixed 0<\gamma<1, define, for w_h\in\W_h:

U_R(w_h)=\lim_{j\to\infty}\sum_{i=1}^j \gamma^{i-1}R(h_{::i}),

and we will get:
- Theorem 3: If R is bounded, then U_R is well-defined and bounded, and on history h, V(U_R,\pi,h) and V(R,\pi,h,\gamma) are related by a positive affine transformation that is a function of h and \gamma only, not of \pi. Consequently, a U_R-maximiser and a \gamma-discounted R-maximiser will choose the same policies.

The converse is more tricky. Fix any policy \pi' as before, and, as before, for a history h of length m>1, define the reward R_U by:

R_U(h)= \gamma^{-m}\left(V(U,\pi',h)-V(U,\pi',h_{::m-1})\right).

For m=1, define R_U by:

R_U(h)=V(U,\pi',h).

To make this work, we need to put some conditions on U. The condition we need is that, eventually, the future actions (hence the future policy) don’t matter much. We say that U asymptotically ignores the future if there exists a function f with \lim_{m\to\infty} f(m)=0 such that, for any h_m\in\H_m and any policies \pi and \pi',

|V(U,\pi,h_m)-V(U,\pi',h_m)| < f(m).

Then, as before, we can show that:
- Theorem 4: If U asymptotically ignores the future, then on history h, V(U,\pi,h) and V(R_U,\pi,h,\gamma) are related by a positive affine transformation that is a function of h, \pi', and \gamma only, not of \pi. Consequently, a \gamma-discounted R_U-maximiser and a U-maximiser will choose the same policies. **Even if U is bounded, R_U need not be.**
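A minimal Python sketch of the infinite-horizon versions, reusing the helpers from the sketches above. Truncating the limit at a finite `depth` is my own approximation, not something from the post.

```python
def U_R_discounted(R: Reward, h_inf: History, gamma: float, depth: int) -> float:
    """U_R(w_h) = lim_j sum_{i=1..j} gamma^(i-1) R(h_{::i}),
    approximated here by truncating the history after `depth` steps."""
    return sum(gamma ** (i - 1) * R(h_inf[:i])
               for i in range(1, min(depth, len(h_inf)) + 1))

def R_U_discounted(U: Utility, pi_prime: Policy, h: History,
                   gamma: float, completions: Completions) -> float:
    """R_U(h) = gamma^(-m) * (V(U, pi', h) - V(U, pi', h_{::m-1})) for length m > 1,
    and V(U, pi', h) for m = 1.  The gamma^(-m) factor is what can make R_U
    unbounded even when U itself is bounded."""
    m = len(h)
    v_now = value_of_U(U, pi_prime, h, completions)
    if m == 1:
        return v_now
    return (gamma ** -m) * (v_now - value_of_U(U, pi_prime, h[:-1], completions))
```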
A utility counterexample
So, what kind of utility function cannot be made into a reward function in the above way? Well, assume there are two actions, a and b, and that U is 1 if the agent only ever chooses a, and 0 if it ever chooses b. Let \pi' be the policy that always chooses b (all that’s needed, in fact, is that it eventually chooses b with probability 1). Then V(U,\pi',h) is always zero, as is V(U,\pi',h_{::m-1}). Thus R_U(h) is also always zero. And this despite the fact that there exists a policy that gets utility 1: namely the "always choose a" policy.
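Under the same toy setup as the earlier sketches, this counterexample utility and policy look like the following; the comments just restate the argument above.

```python
from typing import Dict

def U_only_a(w: World) -> float:
    """1 if the agent chooses a at every step of the history, 0 otherwise."""
    return 1.0 if all(action == "a" for action, _obs in w) else 0.0

def always_b(h: History) -> Dict[Action, float]:
    return {"a": 0.0, "b": 1.0}

# With pi' = always_b, every completion of every history contains a b,
# so V(U_only_a, always_b, h) = 0 for all h.  Hence
# R_U(h) = V(U, pi', h) - V(U, pi', h_{::m-1}) = 0 - 0 = 0 everywhere,
# even though the "always choose a" policy achieves utility 1.
assert U_only_a((("a", "x"), ("a", "y"))) == 1.0
assert U_only_a((("a", "x"), ("b", "y"))) == 0.0
```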
Does it make a difference in practice?
To establish the equivalences above, the value functions V need to be calculated exactly, meaning that the probability estimator P needs to be perfect. Thus the equivalence is only established for logically omniscient agents. In practice, utility functions are most useful when we know the ideal outcome states, and the challenge is to design an agent that reaches them. Reward functions are most useful when we know the best local moves for the agent to make, but not necessarily the best outcomes.
Appendix: Proofs
This gives the proofs of the four theorems above. They proceed by expressing the relationship between the two relevant value functions.

Theorem 1: If h'\in\H_n, then U_R is constant on \W_{h'}, so we can talk about U_R(\W_{h'}) (which is \sum_{i=1}^n R(h'_{::i})). Then if h is of length m:

\begin{array}{rl}V(U_R,\pi,h)=&\int_{w\in\W}U_R(w)P^\pi(w|h) \\ =&\sum_{h'\in\H_n}\int_{w\in\W_{h'}}U_R(w)P^\pi(w|h) \\ =&\sum_{h'\in\H_n} U_R(\W_{h'})P^\pi(\W_{h'}|h)\\ =&\sum_{h'\in\H_n} U_R(\W_{h'})P^\pi(h'|h)\\ =&\sum_{h'\in\H_n} P^\pi(h'|h) \sum_{i=1}^n R(h'_{::i})\\ =&\sum_{h'\in\H_n} P^\pi(h'|h) \sum_{i=m+1}^n R(h'_{::i}) + \sum_{h'\in\H_n} P^\pi(h'|h) \sum_{i=1}^m R(h'_{::i})\\ =& V(R, \pi, h) + \sum_{i=1}^m R(h_{::i}), \end{array}

since P^\pi(h'|h) being non-zero means that the initial m actions and observations of h' are the same as those of h, and P^\pi(\W_{h'}|h) is the same as P^\pi(h'|h): the probability that we are in a world with history h' is the same as the probability of observing h'. Because \sum_{i=1}^m R(h_{::i}) is a function purely of the past, a V(U_R,\pi,h)-maximiser will behave the same way as a V(R,\pi,h)-maximiser.

Theorem 2: If h is a history of length m>0, then:

\begin{array}{rl}V(R_U,\pi,h)=&\sum_{h'\in\H_n}P^\pi(h'|h)\sum_{i=m+1}^n R_U(h'_{::i}) \\ =&\sum_{h'\in\H_n}P^\pi(h'|h)\sum_{i=m+1}^n \left(V(U,\pi',h'_{::i})-V(U,\pi',h'_{::i-1})\right) \\ =&\sum_{h'\in\H_n}P^\pi(h'|h) \left(V(U,\pi',h'_{::n})-V(U,\pi',h'_{::m})\right)\\ =&\sum_{h'\in\H_n}P^\pi(h'|h) V(U,\pi,h') - \sum_{h'\in\H_n}P^\pi(h'|h) V(U,\pi',h'_{::m}) \\ =&V(U,\pi,h) - V(U,\pi',h), \end{array}

because if h'\in \H_n and P^\pi(h'|h)\neq 0, then h'_{::n}=h', h'_{::m}=h, and V(U,\pi',h')=V(U,\pi,h').

Theorem 3: If h is of length m, then:

\begin{array}{rl}V(U_R,\pi,h)=&\int_{w\in\W}U_R(w)P^\pi(w|h) \\ =&\int_{h'\in\H_\infty}\int_{w\in\W_{h'}}U_R(w)P^\pi(w|h) \\ =&\int_{h'\in\H_\infty} U_R(\W_{h'})P^\pi(\W_{h'}|h)\\ =&\int_{h'\in\H_\infty} U_R(\W_{h'})P^\pi(h'|h)\\ =&\int_{h'\in\H_\infty} P^\pi(h'|h) \lim_{j\to\infty}\sum_{i=1}^j \gamma^{i-1} R(h'_{::i})\\ =&\int_{h'\in\H_\infty} P^\pi(h'|h) \lim_{j\to\infty}\sum_{i=m+1}^j \gamma^{i-1} R(h'_{::i}) + \int_{h'\in\H_\infty} P^\pi(h'|h) \sum_{i=1}^m \gamma^{i-1} R(h'_{::i})\\ =&\lim_{j\to\infty}\sum_{h''\in\H_j} P^\pi(h''|h) \sum_{i=m+1}^j \gamma^{i-1} R(h''_{::i}) + \int_{h'\in\H_\infty} P^\pi(h'|h) \sum_{i=1}^m \gamma^{i-1} R(h'_{::i})\\ =& \gamma^m V(R, \pi, h, \gamma) + \sum_{i=1}^m \gamma^{i-1} R(h_{::i}), \end{array}

similarly to the proof of Theorem 1. Here, we have used the fact that the actions and observations beyond the j-th are irrelevant to \sum_{i=m+1}^j \gamma^{i-1} R(h'_{::i}), in order to amalgamate all h'\in\H_\infty that have the same first j actions and observations, and then interchange the limit with the finite sum over all the histories in \H_j.

Theorem 4:

\begin{array}{rl}V(R_U,\pi,h,\gamma)=&\lim_{n\to\infty}\sum_{h'\in\H_n}P^\pi(h'|h)\sum_{i=m+1}^n \gamma^{i-(m+1)} R_U(h'_{::i}) \\ =&\lim_{n\to\infty}\sum_{h'\in\H_n}P^\pi(h'|h)\sum_{i=m+1}^n \gamma^{i-(m+1)} \gamma^{-i} \left(V(U,\pi',h'_{::i})- V(U,\pi',h'_{::i-1})\right) \\ =&\gamma^{-(m+1)}\lim_{n\to\infty}\sum_{h'\in\H_n}P^\pi(h'|h) \left(V(U,\pi',h'_{::n})-V(U,\pi',h'_{::m})\right). \end{array}

Because U asymptotically ignores the future, the term \sum_{h'\in\H_n}P^\pi(h'|h) V(U,\pi',h'_{::n}) can be rewritten as \sum_{h'\in\H_n}P^\pi(h'|h) V(U,\pi,h'_{::n})=V(U,\pi,h), a rewriting that introduces an error of norm at most f(n). Since f(n) tends to zero in the limit,

V(R_U,\pi,h,\gamma)=\gamma^{-(m+1)} \left(V(U,\pi,h)-V(U,\pi',h)\right).
Now we’ll show that this value function need not be bounded. Imagine that the agent has two actions, a and b. The utility U is a weighted sum over the times at which the agent chooses a. For h'\in\H_\infty, let I(a,h',i) be a Boolean that is 1 if the agent chose a on history h' at time i, and 0 otherwise. Let 0<\beta<1, and define the utility, for w_{h'}\in\W_{h'}:

U(w_{h'})=\sum_{i=1}^\infty \beta^{i} I(a,h',i).

Now let \pi be the policy of always choosing a, and \pi' the policy of always choosing b. Then for any history h_m\in\H_m,

V(U,\pi,h_m)-V(U,\pi',h_m) = \sum_{i=m+1}^\infty \beta^{i} = \beta^{m+1}\frac{1}{1-\beta}> \beta^{m+1}.

This means that if \gamma < \beta, the value V(R_U,\pi,h_m,\gamma)=\gamma^{-(m+1)}\left(V(U,\pi,h_m)-V(U,\pi',h_m)\right) > (\beta/\gamma)^{m+1} will increase without limit as the history gets longer, even though U itself is bounded (by 1/(1-\beta)). If we replace \beta^i with 1/i^2, we keep the fact that U is bounded, and, since 1/i^2 will eventually be greater than \beta^i for all 0<\beta<1, we can see that V(R_U,\pi,h,\gamma) will increase without limit for any value of \gamma.
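As a quick numeric check of the closed form just derived (my own illustration, in the same Python sketch style as above):

```python
def v_RU(m: int, beta: float, gamma: float) -> float:
    """Closed form from above: V(R_U, pi, h_m, gamma)
    = gamma^-(m+1) * beta^(m+1) / (1 - beta) = (beta/gamma)^(m+1) / (1 - beta)."""
    return (beta / gamma) ** (m + 1) / (1 - beta)

# With gamma < beta, the ratio beta/gamma exceeds 1, so the value grows without
# limit in m, even though U itself is bounded:
for m in (1, 10, 100):
    print(m, v_RU(m, beta=0.9, gamma=0.5))
```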
Comment

I’m trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (i.e. killing everyone and replacing them with different people who are happier, and we think this is bad). I think for the equivalence to work in this case, the utility function U also needs to depend on your current state; if it’s the same for all states, then the agent would always prefer to move from w1 to w2 and erase its memory of the past when maximizing the utility function, whereas it would act correctly with the reward function.
Comment
Say w2a is the world where the agent starts in w2, and w2b is the world that results after the agent moves from w1 to w2. Without considering the agent’s memory to be part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent’s memory of past events, so it seems that leaving the agent’s memory of the past out of the utility function requires U(w2a) = U(w2b).
Comment
U could depend on the entire history of states (rather than on the agent’s memory of that history).
Comment
Ah, misunderstood that, thanks.