Contents
- Contrasting formalisms
- States, data, and actions
- States and actions in "Pitfalls" formalism
- Uninfluenceable similarities
Contrasting formalisms
Here I’ll contrast the approach we’re using in Pitfalls of Learning a Reward Online (summarised here), with that used by Tom Everitt and Marcus Hutter in the conceptually similar Reward Tampering Problems and Solutions in Reinforcement Learning. In the following, histories h_i are sequences of actions a and observations o; thus h_i=a_1o_1a_2o_2\ldots a_io_i. The agent’s policy is given by \pi, and the environment by \mu.
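To fix conventions (this is just the standard history-based setup, written in my notation rather than either paper’s): the policy and the environment jointly determine a distribution over complete histories,

P^\pi_\mu(h_n) = \prod_{j=1}^{n} \pi(a_j \mid h_{j-1})\, \mu(o_j \mid h_{j-1}a_j),

with h_0 the empty history.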
Then the causal graph for the "Pitfalls" approach is, in plate notation (which basically means that, for every value of j from 1 to n, the graph inside the rectangle is true):
The R is the set of reward functions (mapping "complete" histories h_n of length n to real numbers), the \rho tells you which reward function is correct, conditional on complete histories, and r is the final reward.
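So, informally (my gloss on the "Pitfalls" setup, assuming finitely many candidate reward functions), the reward the agent is ultimately judged on is the \rho-expected reward of the complete history:

E[r \mid h_n] = \sum_{R} \rho(R \mid h_n)\, R(h_n).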
In order to move to the reward tampering formalism, we’ll have to generalise the R and \rho, just a bit. We’ll allow R to take partial histories - h_j shorter than h_n - and return a reward. Similarly, we’ll generalise \rho to a conditional distribution on R, conditional on all histories h_j, not just on complete histories.
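Presumably (this is my reconstruction of what the generalised graph computes), each partial history can then be assigned a reward in the same way as before, using the generalised objects:

E[r_j \mid h_j] = \sum_{R} \rho_j(R \mid h_j)\, R(h_j).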
This leads to the following graph:
This graph is now general enough to include the reward tampering formalism.
States, data, and actions
In the reward tampering formalism, "observations" (o_j) decompose into two pieces: states (S_j) and data (D_j). The idea is that the data informs you about the reward function, while the states get fed into the reward function to produce the actual reward.
So we can model this as this causal graph (adapted from graph 10b, page 22; this is a slight generalisation, as I haven’t assumed Markovian conditions):
Inside the rectangle, the histories split into data (D_{1:j}), states (S_{1:j}), and actions (a_{1:j}). The reward function is defined by the data only, while the reward comes from this reward function and from the states only—actions don’t directly affect these (though they can indirectly affect them by deciding what states and data come up, of course). Note that in the reward tampering paper, the authors don’t distinguish explicitly between R_j and r_j, but they seem to do so implicitly.
Finally, \Theta_*^R is the "user’s reward function", which the agent is estimating via D_{1:j}; this connects to the data only.
Almost all of the conditional probability distributions at the nodes are "natural" ones that are easy to understand. For example, there are arrows into r_j (the reward) from R_j (the reward function) and S_{1:j} (the states history); the "conditional distribution" of r_j is just "apply R_j to S_{1:j}". The environment, action, and history naturally provide the next observations (state and data).
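As a minimal sketch of those node computations (hypothetical Python; the function names and representation choices are mine, not either paper’s):

```python
import random

def sample_reward_fn(data_history, reward_fn_posterior):
    """The R_j node: sample a reward function from P(R | D_{1:j}).
    reward_fn_posterior maps a data history to (reward_fn, probability)
    pairs -- this is the one "designed" ingredient."""
    fns, probs = zip(*reward_fn_posterior(data_history))
    return random.choices(fns, weights=probs, k=1)[0]

def reward(reward_fn, state_history):
    """The r_j node: a "natural" conditional -- just apply R_j to S_{1:j}."""
    return reward_fn(state_history)

def step(env, policy, history):
    """The observation nodes: environment, policy and history naturally
    provide the next action, state and data."""
    action = policy(history)
    state, data = env(history, action)
    return action, state, data
```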
Two arrows encode more complicated relations: the arrow from \Theta_*^R to D_j, and the one from D_{1:j} to R. The two are related: the data D_j is supposed to tell us about the user’s true reward function, and this information informs the choice of R.
But the fact that the nodes and the probability distribution have been "designed" this way doesn’t affect the agent. It has a fixed process P_{\mathbf{rt}}(R \mid D_{1:j}) for estimating R from D_{1:j} (P_{\mathbf{rt}} stands for the probability function for the reward tampering formalism). It has access to a_j, D_j, and S_j (and their histories) as well as its own policy, but has no direct access to \mu or \Theta_*^R.
In fact, from the agent’s perspective, \Theta_*^R is essentially part of \mu, the environment, though a part that influences only the D_j.
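For concreteness, one hypothetical shape such a fixed process P_{\mathbf{rt}}(R \mid D_{1:j}) could take is a Bayesian update over a finite family of candidate reward functions, treating the data as noisy evidence about \Theta_*^R. This is only an illustration of "fixed process"; it is not the specific construction used in the reward tampering paper.

```python
def fixed_reward_posterior(data_history, candidates, likelihood, prior):
    """A fixed process P_rt(R | D_{1:j}): a Bayesian update over a finite
    list of candidate reward functions.  likelihood(d, R) is the assumed
    probability of observing data point d if R were the user's true reward."""
    weights = {}
    for R in candidates:
        w = prior[R]
        for d in data_history:
            w *= likelihood(d, R)
        weights[R] = w
    total = sum(weights.values())
    return {R: w / total for R, w in weights.items()}
```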
States and actions in "Pitfalls" formalism
Now, can we put this into the "Pitfalls" formalism? It seems we can, like so:
All conditional probability distributions in this graph are natural.
This graph looks very similar to the "reward tampering" one, with the exception of \rho_j and \Theta_*^R, pointing at R_j and D_j respectively.
In fact, \rho_j plays the role of P_{\mathbf{rt}}(R \mid D_{1:j}), in that, for P_{\mathbf{lp}} the probability distribution of the learning process formalism,
P_{\mathbf{lp}}(R\mid D_{1:j}, \rho_j) = P_{\mathbf{rt}}(R \mid D_{1:j}).
Note that P_{\mathbf{lp}} in that expression is natural and simple, while P_{\mathbf{rt}} is complex; essentially P_{\mathbf{rt}} carries the same information as \rho_j.
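A toy way to see where the complexity lives (hypothetical code, not from either paper): in the learning-process version, \rho is an explicit argument and the conditional distribution is just function application, while in the reward tampering version the same mapping is baked into the definition of P_{\mathbf{rt}}.

```python
def p_lp(data_history, rho):
    """Learning-process version: rho is passed in explicitly, so this
    conditional is trivial -- it just applies rho to the data history."""
    return rho(data_history)

def make_p_rt(rho):
    """Reward-tampering version: the same rho is hard-wired into P_rt,
    which therefore carries all of the contingent structure itself."""
    def p_rt(data_history):
        return rho(data_history)
    return p_rt

# For any rho and data history D: p_lp(D, rho) == make_p_rt(rho)(D),
# i.e. the two formalisms assign the same distribution over reward functions.
```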
The environment \mu_{\mathbf{lp}} of the learning process plays the same role as the combined \mu_{\mathbf{rt}} and \Theta_*^R from the reward tampering formalism.
So the isomorphism between the two approaches is, informally speaking:
- On reward functions conditional on histories, P_{\mathbf{rt}} \leftrightarrow \rho.
- \mu_{\mathbf{lp}} \leftrightarrow (\mu_{\mathbf{rt}}, \Theta_*^R).
Uninfluenceable similarities
If we make the processes uninfluenceable (a concept that exists for both formalisms), the causal graphs look even more similar:
Here the pair (\mu_{\mathbf{lp}},\eta), for the learning process, plays exactly the same role as the pair[1] (\mu_{\mathbf{rt}},\Theta_*^R), for reward tampering: determining reward functions and observations.
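Concretely (my reconstruction, hedged, since I’m paraphrasing the "Pitfalls" definition of uninfluenceability rather than quoting it): in the uninfluenceable case, \rho is just Bayesian inference about the environment, with \eta assigning a distribution over reward functions to each environment:

\rho(R \mid h_j) = \sum_{\mu} P_{\mathbf{lp}}(\mu \mid h_j)\, \eta(R \mid \mu).

The history then only matters through what it reveals about \mu, which is why the pair (\mu_{\mathbf{lp}}, \eta) ends up playing the same role as (\mu_{\mathbf{rt}}, \Theta_*^R).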
- There is an equivalence between the pairs, but not between the individual elements; thus \mu_{\mathbf{lp}} carries more information than \mu_{\mathbf{rt}}, while \eta carries less information than \Theta_*^R. ↩︎
It would be nice to draw out this distinction in more detail. One guess:
Uninfluenceability seems similar to requiring zero **individual** treatment effect of D on R.
Riggability (from the paper) would then correspond to zero **average** treatment effect of D on R.
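In potential-outcomes terms (my gloss on this guess, writing R(d) for the reward function selected when the data realisation is d): zero individual treatment effect would mean R(d) = R(d') for every pair of data values d, d', while zero average treatment effect would only require E[R(d)] = E[R(d')], i.e. the selected reward function may depend on the data so long as its expectation does not.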
Stuart, by "P_{rt}(R \mid D_{1:j}) is complex" are you referring to their using R = R(\cdot, E[\Theta_*^R \mid D_{1:j}]) as the estimated reward function? Also, what did you think of their argument that their agents have no incentive to manipulate their beliefs, because they evaluate future trajectories based on their current beliefs about how likely they are? Does that suffice to implement eq. (1) from your motivated value selection paper?
I mean that defining P_{rt} can be done in many different ways, and hence has a lot of contingent structure. In contrast, in P_{lp}(R\mid D_{1:j},\rho), the \rho is a complex distribution on R, conditional on D_{1:j}; hence P_{lp} itself is trivial and just encodes "apply \rho to R and D_{1:j} in the obvious way".