Guarded learning

https://www.lesswrong.com/posts/5bd75cc58225bf06703751bc/guarded-learning

"Guarded learning" is a model for unbiased learning, the kind of learning where the AI has an incentive to learn its values, but not to bias the direction in which it learns.

The value-uncertain U

Assume the AI is uncertain between n+1 different utilities v_0, v_1, \ldots, v_n. Its actual utility is U = I_0 v_0 + I_1 v_1 + \ldots + I_n v_n, where the I_j are indicator functions with I_j \geq 0 and \sum_j I_j = 1.

The learning process is whatever updates the AI’s estimates for the I_j.

Writing the utility in this way means that the utility of future actions will be assessed according to the values of the future agent (as long as the current agent can trust the future agent’s assessment of the I_j), thus avoiding the naive cake or death problem.

At any given time, the expectations of I_0, I_1, \ldots, I_n define a weighted sum of the v_j, which can be seen as a point on the n-simplex \Delta^n. Define U_t as the utility given by these expectations at time t.
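As a minimal Python sketch of this setup (purely illustrative: the outcome space, the utilities v_0 and v_1, and the expectations EI below are invented, not taken from the post), the point on the simplex is simply the vector of current expectations of the I_j:

```python
import numpy as np

# Two hypothetical candidate utilities over a toy outcome space {0, 1, 2}.
v = np.array([
    [1.0, 0.0, 0.6],   # v_0(outcome)
    [0.0, 0.3, 0.6],   # v_1(outcome)
])

# E_t[I_j]: the AI's current expectations of the indicator weights (they sum to 1).
# This vector is exactly the point of the simplex that U_t is identified with.
EI = np.array([0.5, 0.5])

def U_t(outcome: int) -> float:
    """Current effective utility: the E_t[I_j]-weighted mixture of the v_j."""
    return float(EI @ v[:, outcome])

print([U_t(o) for o in range(3)])   # the mixture utility 0.5 v_0 + 0.5 v_1
```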

For any w \in \Delta^n, define \pi_w as the policy that maximises the invariant utility function w. Thus if U_t = w, a U-maximiser will not necessarily follow \pi_w, because U_{t+1} might be different from w, while \pi_w always maximises w.

At any given time t, define the function f_t:\Delta^n \to \mathbb{R} by mapping w to \mathbb{E}_t(w|\pi_w). This is the expectation of w at t, given that the AI follows a w-maximising policy. Defining these f_t requires some sort of logical or causal counterfactual ("given that the AI follows \pi_w"), but this is the only requirement in this setup. The f_t are all convex; see the proof at the end of this post.
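Continuing the toy sketch above (again purely illustrative: the three policies and their outcome distributions are invented), f_t(w) is the best achievable expectation of the mixture utility w over the available policies, so it is a pointwise maximum of linear functions of w and comes out convex:

```python
import numpy as np

# Outcome distributions induced by three hypothetical policies.
policies = np.array([
    [0.8, 0.1, 0.1],   # policy A
    [0.1, 0.8, 0.1],   # policy B
    [0.3, 0.3, 0.4],   # policy C
])
v = np.array([
    [1.0, 0.0, 0.6],   # v_0(outcome)
    [0.0, 0.3, 0.6],   # v_1(outcome)
])

def f_t(w0: float) -> float:
    """f_t(w) = E_t(w | pi_w) for w = w0*v_0 + (1 - w0)*v_1,
    taking pi_w to be the best of the available policies."""
    w = w0 * v[0] + (1 - w0) * v[1]     # mixture utility over outcomes
    return float(np.max(policies @ w))  # expectation under the best policy for w

print([round(f_t(x), 3) for x in np.linspace(0, 1, 5)])  # convex in w0
```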

Value updating and correction

For illustration, consider the case n+1=2. The 1-simplex \Delta^1 is a line segment, which we’re identifying with [0,1]:

The green curve is the plot of f_t. The blue line connects f_t(v_0) = \mathbb{E}_t(v_0|\pi_{v_0}) with f_t(v_1) = \mathbb{E}_t(v_1|\pi_{v_1}). The purple line is \mathbb{E}_t(v|\pi_{v_0}), while the yellow line is \mathbb{E}_t(v|\pi_{v_1}). The f_t curve must be contained between these lines.

The AI currently has U_t = 0.5 v_0 + 0.5 v_1. If the AI expected to learn more about its values, then its expected utility would rise along the green arrow. Reaching the top, on the blue line, would correspond to a situation where the agent expects to imminently be informed of whether its utility is v_0 or v_1, and thus to transform itself into a v_0 or v_1 maximiser. This kind of learning, up the green arrow, is what we want to encourage (note that this doesn't make the learning safe, though; we need to pick the update process with care).

The red arrows point to updates we want to discourage—we don’t want the AI to manipulate the direction of its update to get higher utility (in this case, by moving towards v_0).

Correction term

Consider the agent operating as a discrete agent, moving from time t to time t+1. It chooses an action a_t; as a consequence of that, at time t+1, it expects a probability distribution over possible U_{t+1}. If we assume that there are only finitely many possibilities \{U^i_{t+1}\} (the continuous case is almost identical), then define p_i = P_t(U_{t+1} = U^i_{t+1}|a_t).

Let EU_{t+1} be the expectation of U_{t+1}, that is, \mathbb{E}_t(U_{t+1}|a_t) = \sum_i p_i U^i_{t+1}.

Then a_t is an unbiased action (corresponding to moving up the blue arrow) if EU_{t+1} = U_t. One restriction we could impose would be to require all valid a_t to satisfy this. But this might be extremely restrictive. For instance, we probably wouldn't mind if (U^0_{t+1}, p_0) = (v_0, 0.51) and (U^1_{t+1}, p_1) = (v_1, 0.49). That's the green point here:

Sure, the green point is slightly off-centre, but the AI has gained a lot of information. What's clearly incorrect is if (U^0_{t+1}, p_0) = (v_0, 1). This is the red point in the above graph; the action a_t has forced the AI's values straight to v_0.
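As a small numeric illustration (the numbers are invented, not from the post), one can compare the expected update EU_{t+1} with U_t for the two distributions just described:

```python
import numpy as np

U_t = np.array([0.5, 0.5])          # current mixture weights over (v_0, v_1)

def expected_update(dist):
    """EU_{t+1} = sum_i p_i U^i_{t+1}, for a list of (weights, probability) pairs."""
    return sum(p * np.asarray(w) for w, p in dist)

# The "green point": slightly biased, but very informative.
green = [([1.0, 0.0], 0.51), ([0.0, 1.0], 0.49)]
# The "red point": the action forces the values straight to v_0.
red = [([1.0, 0.0], 1.0)]

for name, dist in [("green", green), ("red", red)]:
    EU = expected_update(dist)
    print(name, EU, "bias:", EU - U_t)
```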

One way of dealing with the "force the values to v_0" problem would be to require the AI to use its past values U_t = 0.5 v_0 + 0.5 v_1, rather than its future values U_{t+1} = v_0, to evaluate the quality of the update. This would be like imposing the naive cake or death problem on the agent. The expectation would be \mathbb{E}_t(0.5 v_0 + 0.5 v_1 | \pi_{v_0}), which is the orange point on the line here:

This solves the problem of the AI forcing U_{t+1} to a single value, but we need to extend it to (biased) distributions over multiple U^i_{t+1}'s. The natural way to do that is to define the penalty:
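The formula for the penalty has not survived in this copy (it was presumably an image in the original post). As a hedged reconstruction only, one penalty consistent with the properties used below (it is zero when EU_{t+1} = U_t, and for an update forced to a single value v it recovers the corrected expectation \mathbb{E}_t(U_t|\pi_v) from above) would be

\text{Pen}(a_t) = \mathbb{E}_t\left(U_t - EU_{t+1} \,|\, \pi_{EU_{t+1}}\right),

added to the agent's unpenalised valuation \sum_i p_i \, \mathbb{E}_t(U^i_{t+1}|\pi_{U^i_{t+1}}) of the action a_t.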

It’s clear the penalty is 0 when EU_{t+1}=U_t. However, the penalty is not always negative! Indeed, upon forcing U_{t+1} =0.25 v_0 + 0.75 v_1, adding the penalty would move the agent’s utility to the orange point, higher than the red one:

Thus the penalty makes certain biased choices, which would otherwise be negative, slightly less negative. What is clear, however, is that, even with a positive penalty, it's always better for the AI to keep U_{t+1} = U_t than to force U_{t+1} to a single value v \neq U_t (ie the blue point is still higher than the orange one). This is because its total reward would then be

\mathbb{E}_t(v|\pi_v) + \mathbb{E}_t(U_t - v|\pi_v) = \mathbb{E}_t(U_t|\pi_v) \leq \mathbb{E}_t(U_t|\pi_{U_t}).
So, in order for a biased action to have positive expected value for the AI, the gain in information has to overcome this loss.
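A toy numeric check of this claim, in the invented environment from the f_t sketch above (all numbers are illustrative): forcing U_{t+1} to a single value w and then correcting with the past values, as in the orange-point construction, always yields \mathbb{E}_t(U_t|\pi_w), which never beats keeping U_{t+1} = U_t, even when the correction term itself is positive.

```python
import numpy as np

# Same invented toy environment as in the f_t sketch above.
policies = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
v = np.array([
    [1.0, 0.0, 0.6],   # v_0
    [0.0, 0.3, 0.6],   # v_1
])
U_t = 0.5              # current weight on v_0

def mix(w0):                        # the mixture utility w0*v_0 + (1-w0)*v_1
    return w0 * v[0] + (1 - w0) * v[1]

def best_policy(w0):                # index of pi_w among the available policies
    return int(np.argmax(policies @ mix(w0)))

def E(util, p):                     # E_t(util | policy p)
    return float(policies[p] @ util)

keep = E(mix(U_t), best_policy(U_t))        # keep U_{t+1} = U_t
for w0 in [0.0, 0.25, 0.75, 1.0]:           # force U_{t+1} to a single value
    pi = best_policy(w0)
    raw = E(mix(w0), pi)                    # the "red point": f_t(w)
    corr = E(mix(U_t) - mix(w0), pi)        # correction E_t(U_t - w | pi_w)
    print(w0, round(raw + corr, 3), "<=", round(keep, 3), "correction:", round(corr, 3))
```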

Subagent stability

Is this penalty design subagent-stable? It's not clear. Creating a subagent without the penalty is not an action the AI would take (as that action means its expectation of U_{t+1} will become very biased, since the subagent has no restriction on deliberately biasing the values of I_j).

However, an AI with U_t = 0.5 v_0 + 0.5 v_1 would certainly create a subagent that randomly decided to force either U_{t+1} = v_0 or U_{t+1} = v_1: the randomisation keeps EU_{t+1} = U_t, so the creating action incurs no penalty, while the AI's expected utility rises to the blue line without any genuine learning.
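A toy illustration of this failure mode, in the same invented environment as the earlier sketches: the 50/50 random forcing preserves EU_{t+1} = U_t (so, by the property above, it incurs no penalty), yet its expected value sits on the blue line, above f_t(U_t), without the AI having learned anything.

```python
import numpy as np

# Same invented toy environment as in the earlier sketches.
policies = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
v = np.array([
    [1.0, 0.0, 0.6],   # v_0
    [0.0, 0.3, 0.6],   # v_1
])

def f_t(w0):                               # best achievable expectation of the mixture
    w = w0 * v[0] + (1 - w0) * v[1]
    return float(np.max(policies @ w))

U_t = 0.5
# The subagent flips a fair coin and forces U_{t+1} to v_0 or to v_1.
EU_next = 0.5 * 1.0 + 0.5 * 0.0            # expected weight on v_0: equal to U_t, so no penalty
subagent_value = 0.5 * f_t(1.0) + 0.5 * f_t(0.0)   # a point on the blue line
print(EU_next == U_t, round(subagent_value, 3), ">", round(f_t(U_t), 3))
```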

Convexity of f_t

This section will prove the following theorem:

Theorem: for any t, the function f_t : \Delta^n \to \mathbb{R}, w \mapsto \mathbb{E}_t(w|\pi_w), is convex.

Proof:

For any v \in \Delta^n, there is a map g_v: \Delta^n \to \mathbb{R} that maps w to \mathbb{E}_t(w|\pi_v). This map is linear in the weights defining w, and its graph (w, g_v(w)) is an affine hyper-segment that contains the point (v, f_t(v)) = (v, g_v(v)).

Then we need to note that the curve (w, f_t(w)) cannot have a transverse intersection with (w, g_v(w)) (though they can be tangent on a convex subset). This is because a transverse intersection would imply there exists a w with f_t(w) < g_v(w), ie \mathbb{E}_t(w|\pi_w) < \mathbb{E}_t(w|\pi_v). But this contradicts the definition of \pi_w as the best policy for maximising w.

Thus (w, g_v(w)) is a supporting hyperplane for the curve (w, f_t(w)), touching it at v and lying below it everywhere else. Since f_t admits such an affine support at every point of \Delta^n, it is the pointwise supremum of affine functions, and is hence convex.
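Equivalently (a standard reformulation, not spelled out explicitly in the post): assuming \pi_w attains the maximum over the available policies, and writing w = \sum_j w_j v_j,

f_t(w) = \max_\pi \mathbb{E}_t(w|\pi) = \max_\pi \sum_j w_j \, \mathbb{E}_t(v_j|\pi),

a pointwise maximum of functions linear in the weights w_j, and such a maximum is always convex.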