Utility versus Reward function: partial equivalence

https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence

Formalism

Let \mathcal{A} be the set of actions an agent can take, and \mathcal{O} the set of observations. Assume both sets are finite. Let \mathcal{H} be the set of histories (sequences of observations and actions) of an agent. Let \mathcal{W} be the (possibly infinite) set of worlds. Note that a world includes the full observation history of the agent (since the agent is part of the world). Therefore the worlds are stratified by histories: for any h\in\mathcal{H}, there is a subset \mathcal{W}_h\subset\mathcal{W} consisting of all worlds with history h. Then a reward function R is a function from histories to real numbers, while a utility function U is a function from worlds to real numbers:
\begin{array}{lr}R:&\mathcal{H}\to\mathbb{R},\\U:&\mathcal{W}\to\mathbb{R}.\end{array}

Rewards and utility functions are bounded if their image is in a bounded subset of \mathbb{R}; without loss of generality, this means there exists an l>0 such that the image of R (or U) is contained in [-l,l] for all h\in\mathcal{H} (or w\in\mathcal{W}).

A policy \pi for an agent is a map from histories to a probability distribution over actions; so \pi: \mathcal{H}\to\Delta\mathcal{A}, for \Delta\mathcal{A} the space of probability distributions over actions. Even for a utility-based agent, these are the only policies available. This is because all the information (apart from the prior) that the agent gets from the outside world is given by its observations.
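
To make the formalism concrete, here is a minimal Python sketch of the objects involved; the type names and the example reward and policy below are illustrative choices of mine, not anything defined in the post.

```python
# A minimal sketch of the formalism (all names here are illustrative).
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Action = str
Observation = str
History = Tuple[Tuple[Action, Observation], ...]      # an element of the set of histories

@dataclass(frozen=True)
class World:
    history: History          # worlds are stratified by the agent's full history
    hidden_facts: Tuple = ()  # everything in the world beyond what the agent observes

RewardFunction = Callable[[History], float]           # R : histories -> reals
UtilityFunction = Callable[[World], float]            # U : worlds -> reals
Policy = Callable[[History], Dict[Action, float]]     # pi : histories -> distributions over actions

def reward_last_action_a(h: History) -> float:
    """Example reward: 1 whenever the most recent action was 'a' (bounded, with l = 1)."""
    return 1.0 if h and h[-1][0] == "a" else 0.0

def uniform_policy(h: History) -> Dict[Action, float]:
    """Example policy: ignore the history and pick each action with probability 1/2."""
    return {"a": 0.5, "b": 0.5}
```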

The value functions

Let P be the probability estimator used by the agent. We’ll assume for the moment that it’s logically omniscient and Bayesian. This P incorporates a prior, by, for example, applying it unconditionally to worlds or histories: P(w) and P(h). Given a history h\in\mathcal{H} and a policy \pi, we can compute the conditional probability of a world or a history; designate these by P^\pi(h'|h) and P^\pi(w|h).

Given these probabilities, we can then compute expected values. For example, the expected utility given \pi and h is:

\mathbb{E}^\pi(U|h)=\int_{w\in\mathcal{W}}U(w)P^\pi(w|h).

We’ll also designate this quantity by the expected value function V(U,\pi,h). If U is bounded in [-l,l], then so is this value function (though the converse need not be true).

For rewards, let \mathcal{H}_i be the set of histories that have i actions and observations; we’ll say that h\in\mathcal{H}_i has length i. Let h_{::i} be the first i actions and observations of the history h. If the agent knows it will only make n\geq 0 observations and actions, then the future reward has an expected value function. For \pi and h_m\in\mathcal{H}_m, with h_m a history of length m<n, this is:

V(R,\pi,h_m)=\sum_{h\in\mathcal{H}_n}P^\pi(h|h_m)\sum_{i=m+1}^n R(h_{::i}).

If the agent expects it could make arbitrarily many observations, then given a discount factor 0<\gamma<1, there is the expected discounted future reward:

V(R,\pi,h_m,\gamma)=\lim_{n\to\infty}\sum_{h\in\mathcal{H}_n}P^\pi(h|h_m)\sum_{i=m+1}^n \gamma^{i-(m+1)} R(h_{::i}).

Equivalence for finite horizons

Assume the agent knows it will only make n observations and actions. The sets \{\mathcal{W}_h \mid h\in\mathcal{H}_n\} form a partition of the possible worlds in \mathcal{W}: every possible world is in exactly one of those \mathcal{W}_h subsets. Then assume that R is given, and define, for w_h\in\mathcal{W}_h:

U_R(w_h)=\sum_{i=1}^n R(h_{::i}).

Then:

Theorem 1: For any history h of length m\leq n, V(U_R,\pi,h)=V(R,\pi,h)+\sum_{i=1}^m R(h_{::i}). Since the second term depends only on the past, a V(U_R,\pi,h) maximiser behaves exactly like a V(R,\pi,h) maximiser.

Conversely, given a utility function U and a fixed reference policy \pi', define the reward of a history h of length i>0 as:

R_U(h)=V(U,\pi',h)-V(U,\pi',h_{::i-1}).

Theorem 2: V(R_U,\pi,h)=V(U,\pi,h)-V(U,\pi',h). Since V(U,\pi',h) depends only on the past history h and not on the agent’s future policy, a V(R_U,\pi,h) maximiser behaves exactly like a V(U,\pi,h) maximiser.
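
As a purely illustrative sketch of these definitions, the following Python computes V(R,\pi,h_m) for a finite horizon by brute-force enumeration and builds the induced utility U_R; the obs_model argument is a hypothetical stand-in for the conditional probabilities supplied by P, and none of the names come from the post itself.

```python
# Brute-force finite-horizon value of a reward function, and the induced U_R.
from typing import Callable, Dict, Tuple

Action = str
Observation = str
History = Tuple[Tuple[Action, Observation], ...]

def V_R(R: Callable[[History], float],
        pi: Callable[[History], Dict[Action, float]],
        obs_model: Callable[[History, Action], Dict[Observation, float]],
        h_m: History,
        n: int) -> float:
    """V(R, pi, h_m): sum over length-n continuations h of P^pi(h | h_m) * sum_{i=m+1}^n R(h_{::i})."""
    if len(h_m) == n:
        return 0.0
    return sum(pa * po * (R(h_m + ((a, o),)) + V_R(R, pi, obs_model, h_m + ((a, o),), n))
               for a, pa in pi(h_m).items()
               for o, po in obs_model(h_m, a).items())

def make_U_R(R: Callable[[History], float], n: int) -> Callable[[History], float]:
    """U_R(w_h) = sum_{i=1}^n R(h_{::i}), for a world whose full length-n history is h."""
    def U_R(full_history: History) -> float:
        assert len(full_history) == n
        return sum(R(full_history[:i]) for i in range(1, n + 1))
    return U_R
```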

(In)Equivalence for infinite horizons

If the agent expects it could make arbitrarily many observations, then we can still define a good U_R from a given R. Let \mathcal{H}_\infty be the set of possible infinite histories; then the sets \{\mathcal{W}_h \mid h\in\mathcal{H}_\infty\} form a partition of \mathcal{W}. Then for any fixed \gamma define:

U_R(w_h)=\lim_{j\to\infty}\sum_{i=1}^j \gamma^{i-1}R(h_{::i}),

and we will get:

Theorem 3: For any history h of length m, V(U_R,\pi,h)=\gamma^m V(R,\pi,h,\gamma)+\sum_{i=1}^m \gamma^{i-1} R(h_{::i}). As before, the second term depends only on the past, so a V(U_R,\pi,h) maximiser behaves exactly like a V(R,\pi,h,\gamma) maximiser.

The converse direction needs an extra condition. Given a utility function U and a fixed reference policy \pi', define the reward of a history h of length i>0 as:

R_U(h)=\gamma^{-i}\left(V(U,\pi',h)-V(U,\pi',h_{::i-1})\right).

Say that U asymptotically ignores the future if there is a function f with \lim_{n\to\infty}f(n)=0 such that |V(U,\pi_1,h_n)-V(U,\pi_2,h_n)|\leq f(n) for any history h_n of length n and any two policies \pi_1,\pi_2.

Theorem 4: If U asymptotically ignores the future, then V(R_U,\pi,h,\gamma)=\gamma^{-(m+1)}\left(V(U,\pi,h)-V(U,\pi',h)\right), so a V(R_U,\pi,h,\gamma) maximiser behaves exactly like a V(U,\pi,h) maximiser. Note, however, that R_U (and hence this value function) need not be bounded even when U is; see the appendix.
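
For intuition on the limit, here is a hedged sketch of evaluating the discounted U_R numerically: because R is bounded by l, the tail of the sum after step j is at most l\gamma^j/(1-\gamma), so truncating once that bound falls below a tolerance gives the limit to within that tolerance. The function and parameter names are mine.

```python
# Truncated evaluation of U_R(w_h) = sum_{i>=1} gamma^(i-1) * R(h_{::i}).
import itertools
from typing import Callable, Iterator, Tuple

Step = Tuple[str, str]   # an (action, observation) pair

def discounted_U_R(R: Callable[[tuple], float],
                   infinite_history: Iterator[Step],
                   gamma: float,
                   bound_l: float,
                   tol: float = 1e-9) -> float:
    """Sum gamma^(i-1) * R(prefix of length i), stopping once the tail bound drops below tol."""
    prefix: tuple = ()
    total, discount = 0.0, 1.0          # discount = gamma^(i-1) for the current term
    for step in infinite_history:
        prefix += (step,)
        total += discount * R(prefix)
        discount *= gamma
        if bound_l * discount / (1.0 - gamma) < tol:   # remaining tail is provably below tol
            break
    return total

# Example: constant reward 1 along an endless history with gamma = 0.9
# gives (up to the tolerance) the geometric sum 1 / (1 - 0.9) = 10.
endless = itertools.cycle([("a", "x"), ("b", "y")])
print(discounted_U_R(lambda h: 1.0, endless, gamma=0.9, bound_l=1.0))
```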

A utility counterexample

So, what kind of utility function cannot be made into a reward function in the above way? Well, assume there are two actions, a and b, and that U is 1 if the agent only ever chooses a, and 0 if it ever chooses b. Let \pi' be the policy that always chooses b (all that’s needed, in fact, is that it eventually chooses b with probability 1). Then V(U,\pi',h) is always zero, as is V(U,\pi',h_{::m-1}). Thus R_U(h) is also always zero, despite the fact that there exists a policy that gets utility 1: namely the "always choose a" policy.
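
Here is a small numeric illustration of this counterexample; the toy setup (deterministic policies, no observations, helper names) is my own. Since \pi' chooses b at the very next step, V(U,\pi',h) is 0 for every finite history h, so the derived reward R_U vanishes everywhere, while the always-a policy still achieves utility 1.

```python
# The derived reward R_U is identically zero for the "only ever choose a" utility.
from itertools import product

def U_of_run(prefix, continuation_all_a):
    """U = 1 iff the agent only ever chooses a (over the prefix and the infinite continuation)."""
    only_a_so_far = all(x == "a" for x in prefix)
    return 1.0 if (only_a_so_far and continuation_all_a) else 0.0

def V(prefix, follow_always_a):
    """Expected utility of the (deterministic) continuation policy, so no averaging is needed."""
    return U_of_run(prefix, continuation_all_a=follow_always_a)

def R_U(prefix):
    """R_U(h) = V(U, pi', h) - V(U, pi', h_{::m-1}) with pi' = 'always choose b'."""
    return V(prefix, follow_always_a=False) - V(prefix[:-1], follow_always_a=False)

for length in range(1, 6):
    for prefix in product("ab", repeat=length):
        assert R_U(prefix) == 0.0       # the derived reward never signals anything

print("R_U vanishes on every history, yet V(U, always-a, ()) =", V((), True))
```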

Does it make a difference in practice?

In order to reach the equivalences above, the value functions V need to be exactly calculated, meaning that the probability estimator P needs to be perfect. Thus the equivalence is only established for logically omniscient agents. In practice, utility functions are most useful when we know the ideal outcome states, and now the challenge is to design an agent that gets to them. Reward functions are most useful when we know the best local moves for the agent to make, but not necessarily the best outcomes.

Appendix: Proofs

This gives the proofs of the four theorems above. They will proceed by expressing the relationship between the two relevant value functions.

Theorem 1: If h'\in\mathcal{H}_n, then U_R is constant on \mathcal{W}_{h'}, so we can talk about U_R(\mathcal{W}_{h'}) (which is \sum_{i=1}^n R(h'_{::i})). Then if h is of length m:

\begin{array}{rl}
V(U_R,\pi,h)=&\int_{w\in\mathcal{W}}U_R(w)P^\pi(w|h) \\
=&\sum_{h'\in\mathcal{H}_n}\int_{w\in\mathcal{W}_{h'}}U_R(w)P^\pi(w|h) \\
=&\sum_{h'\in\mathcal{H}_n} U_R(\mathcal{W}_{h'})P^\pi(\mathcal{W}_{h'}|h)\\
=&\sum_{h'\in\mathcal{H}_n} U_R(\mathcal{W}_{h'})P^\pi(h'|h)\\
=&\sum_{h'\in\mathcal{H}_n} P^\pi(h'|h) \sum_{i=1}^n R(h'_{::i})\\
=&\sum_{h'\in\mathcal{H}_n} P^\pi(h'|h) \sum_{i=m+1}^n R(h'_{::i}) + \sum_{h'\in\mathcal{H}_n} P^\pi(h'|h) \sum_{i=1}^m R(h'_{::i})\\
=& V(R,\pi,h) + \sum_{i=1}^m R(h_{::i}),
\end{array}

since P^\pi(h'|h) being non-zero means that the initial m actions and observations of h' are the same as those of h, and P^\pi(\mathcal{W}_{h'}|h) is the same as P^\pi(h'|h): the probability that we are in a world with history h' is the same as the probability of observing h'. Because \sum_{i=1}^m R(h_{::i}) is a function purely of the past, this means that a V(U_R,\pi,h) maximiser will behave the same way as a V(R,\pi,h) maximiser.

Theorem 2: If h is a history of length m>0, then:

\begin{array}{rl}
V(R_U,\pi,h)=&\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h)\sum_{i=m+1}^n R_U(h'_{::i}) \\
=&\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h)\sum_{i=m+1}^n \left(V(U,\pi',h'_{::i})-V(U,\pi',h'_{::i-1})\right) \\
=&\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h) \left(V(U,\pi',h'_{::n})-V(U,\pi',h'_{::m})\right)\\
=&\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h) V(U,\pi,h') - \sum_{h'\in\mathcal{H}_n}P^\pi(h'|h) V(U,\pi',h'_{::m}) \\
=&V(U,\pi,h) - V(U,\pi',h),
\end{array}

because if h'\in\mathcal{H}_n and P^\pi(h'|h)\neq 0, then h'_{::n}=h' and h'_{::m}=h, while V(U,\pi',h')=V(U,\pi,h') since h' is a full-length history whose expected utility no longer depends on the policy.

Theorem 3: If h is of length m, then

\begin{array}{rl}
V(U_R,\pi,h)=&\int_{w\in\mathcal{W}}U_R(w)P^\pi(w|h) \\
=&\int_{h'\in\mathcal{H}_\infty}\int_{w\in\mathcal{W}_{h'}}U_R(w)P^\pi(w|h) \\
=&\int_{h'\in\mathcal{H}_\infty} U_R(\mathcal{W}_{h'})P^\pi(\mathcal{W}_{h'}|h)\\
=&\int_{h'\in\mathcal{H}_\infty} U_R(\mathcal{W}_{h'})P^\pi(h'|h)\\
=&\int_{h'\in\mathcal{H}_\infty} P^\pi(h'|h) \lim_{j\to\infty}\sum_{i=1}^j \gamma^{i-1} R(h'_{::i})\\
=&\int_{h'\in\mathcal{H}_\infty} P^\pi(h'|h) \lim_{j\to\infty}\sum_{i=m+1}^j \gamma^{i-1} R(h'_{::i}) + \int_{h'\in\mathcal{H}_\infty} P^\pi(h'|h) \sum_{i=1}^m \gamma^{i-1} R(h'_{::i})\\
=&\lim_{j\to\infty}\sum_{h''\in\mathcal{H}_j} P^\pi(h''|h) \sum_{i=m+1}^j \gamma^{i-1} R(h''_{::i}) + \int_{h'\in\mathcal{H}_\infty} P^\pi(h'|h) \sum_{i=1}^m \gamma^{i-1} R(h'_{::i})\\
=& \gamma^m V(R,\pi,h,\gamma) + \sum_{i=1}^m \gamma^{i-1} R(h_{::i}),
\end{array}

similarly to the proof of Theorem 1. Here, we have used the fact that the actions and observations beyond the j-th are irrelevant to \sum_{i=m+1}^j \gamma^{i-1} R(h'_{::i}), in order to amalgamate all h'\in\mathcal{H}_\infty that have the same first j actions and observations, and then interchange the limit with the finite sum over all the histories in \mathcal{H}_j.

Theorem 4:

\begin{array}{rl}
V(R_U,\pi,h,\gamma)=&\lim_{n\to\infty}\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h)\sum_{i=m+1}^n \gamma^{i-(m+1)} R_U(h'_{::i}) \\
=&\lim_{n\to\infty}\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h)\sum_{i=m+1}^n \gamma^{i-(m+1)} \gamma^{-i} \left(V(U,\pi',h'_{::i})- V(U,\pi',h'_{::i-1})\right) \\
=&\gamma^{-(m+1)}\lim_{n\to\infty}\sum_{h'\in\mathcal{H}_n}P^\pi(h'|h) \left(V(U,\pi',h'_{::n})-V(U,\pi',h'_{::m})\right).
\end{array}

Because U asymptotically ignores the future, the term \sum_{h'\in\mathcal{H}_n}P^\pi(h'|h) V(U,\pi',h'_{::n}) can be rewritten as \sum_{h'\in\mathcal{H}_n}P^\pi(h'|h) V(U,\pi,h'_{::n})=V(U,\pi,h), a re-writing that introduces an error of norm at most f(n). Since f(n) tends to zero in the limit,

V(R_U,\pi,h,\gamma)=\gamma^{-(m+1)} \left(V(U,\pi,h)-V(U,\pi',h)\right).
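
As a sanity check on the Theorem 2 identity (my own addition, not part of the post), the following sketch builds a small random finite-horizon environment, derives R_U from a random bounded U via a reference policy \pi', and verifies numerically that V(R_U,\pi,h)=V(U,\pi,h)-V(U,\pi',h).

```python
# Numeric spot-check of Theorem 2 in a tiny assumed environment (observations omitted,
# so a "world" is just a length-n action sequence). All names are illustrative.
import random
from itertools import product

ACTIONS, n = ("a", "b"), 3
rng = random.Random(1)

# A random bounded utility over full-length histories.
U = {h: rng.uniform(-1.0, 1.0) for h in product(ACTIONS, repeat=n)}

def policy_pi(h):        # the agent's policy pi (arbitrary, varies with history length)
    p = 0.3 + 0.4 * (len(h) % 2)
    return {"a": p, "b": 1.0 - p}

def policy_ref(h):       # the reference policy pi'
    return {"a": 0.5, "b": 0.5}

def V_U(policy, h):
    """V(U, policy, h): expected utility of completing h by following the policy."""
    if len(h) == n:
        return U[h]
    return sum(pa * V_U(policy, h + (a,)) for a, pa in policy(h).items())

def R_U(h):
    """R_U(h) = V(U, pi', h) - V(U, pi', h_{::len(h)-1})."""
    return V_U(policy_ref, h) - V_U(policy_ref, h[:-1])

def V_RU(h):
    """V(R_U, pi, h): expected sum of future R_U rewards under pi."""
    if len(h) == n:
        return 0.0
    return sum(pa * (R_U(h + (a,)) + V_RU(h + (a,))) for a, pa in policy_pi(h).items())

h = ("a",)               # an arbitrary history of length m = 1
lhs = V_RU(h)
rhs = V_U(policy_pi, h) - V_U(policy_ref, h)
assert abs(lhs - rhs) < 1e-9
print("Theorem 2 identity holds here:", lhs, "==", rhs)
```
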
Now we’ll show that this value function need not be bounded. Imagine that the agent has two actions, a and b, and that the utility U is a weighted sum over the times at which the agent chooses a. For h'\in\mathcal{H}_\infty, let I(a,h',i) be a Boolean that is 1 if the agent chose a on history h' at time i, and 0 otherwise. Let 0<\beta<1, and define the utility, for w_{h'}\in\mathcal{W}_{h'}:

U(w_{h'})=\sum_{i=1}^\infty \beta^{i} I(a,h',i).

Now let \pi be the policy of always choosing a, and \pi' the policy of always choosing b. Then for any history h_m\in\mathcal{H}_m,

V(U,\pi,h_m)-V(U,\pi',h_m) = \sum_{i=m+1}^\infty \beta^{i} = \frac{\beta^{m+1}}{1-\beta} > \beta^{m+1}.

This means that if \gamma<\beta, then V(R_U,\pi,h_m,\gamma)=\gamma^{-(m+1)}\left(V(U,\pi,h_m)-V(U,\pi',h_m)\right)>(\beta/\gamma)^{m+1} will increase without limit as h_m gets longer, even though U itself is bounded (by 1/(1-\beta)). If we replace \beta^i with 1/i^2, we keep the fact that U is bounded, and, since 1/i^2 will eventually be greater than \beta^i for all 0<\beta<1, we can see that V(R_U,\pi,h_m,\gamma) will increase without limit for any value of \gamma.
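
Finally, a short arithmetic check of the unboundedness claim, with illustrative values \beta=0.9 and \gamma=0.5: the exact value (\beta/\gamma)^{m+1}/(1-\beta) grows without limit in m, even though U itself is bounded by 1/(1-\beta)=10.

```python
# V(R_U, pi, h_m, gamma) = gamma^-(m+1) * sum_{i>m} beta^i = (beta/gamma)^(m+1) / (1 - beta),
# which blows up with m whenever beta > gamma, while U stays bounded by 1/(1-beta).
beta, gamma = 0.9, 0.5          # example values with beta > gamma, both in (0, 1)

def v_RU(m: int) -> float:
    return (beta / gamma) ** (m + 1) / (1.0 - beta)

for m in (1, 5, 10, 20, 40):
    print(f"m = {m:2d}:  V(R_U, pi, h_m, gamma) = {v_RU(m):.3e}")
print("bound on U:", 1.0 / (1.0 - beta))
```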

Comment

https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence?commentId=dAxkvohviWFNoZoXM

I’m trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (i.e. killing everyone and replacing them with different people who are happier, and we think this is bad). I think for the equivalence to work in this case, the utility function U also needs to depend on your current state: if it’s the same for all states, then the agent would always prefer to move from w1 to w2 and erase its memory of the past when maximizing the utility function, whereas it would act correctly with the reward function.

Comment

https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence?commentId=heC665t2hykQEXDr4

"erase its memory"

That only works if the agent is motivated by something like "maximise your belief in what the expected value of U is", rather than "maximise the expected value of U". If you’ve got that problem, then the agent is unsalvageable: it could just edit its memory to make itself believe U is maximised.

Comment

Say w2a is the world where the agent starts in w2, and w2b is the world that results after the agent moves from w1 to w2. Without considering the agent’s memory as part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent’s memory of past events, so it seems that leaving the agent’s memory of the past out of the utility function requires U(w2a) = U(w2b).

Comment

U could depend on the entire history of states (rather than on the agent’s memory of that history).

Comment

Ah, misunderstood that, thanks.