A putative new idea for AI control; index here.
During a workshop with MIRI at the FHI, I defined indifference via reward signals, saying something along the lines of "we can do it with proper utilities, but it's more complicated". I then never got round to defining indifference in terms of utilities.
I’ll do that now in this note.
Consider an AI that we want to (potentially) transition from utility $u$ to utility $v$. Let $Press$ be the event that we press the button to change the AI's utility; let $u\to v$ be the event that the change actually goes through (typically we'd want $P(u\to v \mid Press) = 1-\epsilon$ for some small $\epsilon$).
Let $I_{Press}$ and $I_{u\to v}$ be the indicator functions for those events. Then we can define the AI's utility as:
$$(1-I_{Press}I_{u\to v})\,u + I_{Press}I_{u\to v}\,(v + C).$$
Here, $C$ denotes the compensatory rewards: $C=\mathbb{E}(u \mid Press, \neg(u\to v)) - \mathbb{E}(v \mid Press, u\to v)$.
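
To make the construction concrete, here is a minimal Python sketch of the combination above over a toy finite outcome space. It is not part of the proposal itself: the names (`Outcome`, `conditional_expectation`, `compensatory_reward`, `combined_utility`) are mine, and the compensatory term is computed naively as a difference of conditional expectations over a fixed toy distribution.

```python
# A minimal sketch, not part of the original proposal: the indicator-based
# utility combination over a toy finite outcome space.
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    press: bool   # I_Press: the button is pressed
    change: bool  # I_{u->v}: the utility change goes through
    u: float      # value of utility u in this outcome
    v: float      # value of utility v in this outcome
    prob: float   # probability of this outcome


def conditional_expectation(outcomes, value, condition):
    """E(value | condition) over the toy distribution."""
    mass = sum(o.prob for o in outcomes if condition(o))
    return sum(o.prob * value(o) for o in outcomes if condition(o)) / mass


def compensatory_reward(outcomes):
    """C = E(u | Press, not(u->v)) - E(v | Press, u->v)."""
    e_u = conditional_expectation(
        outcomes, lambda o: o.u, lambda o: o.press and not o.change)
    e_v = conditional_expectation(
        outcomes, lambda o: o.v, lambda o: o.press and o.change)
    return e_u - e_v


def combined_utility(o, C):
    """(1 - I_Press I_{u->v}) u + I_Press I_{u->v} (v + C)."""
    return o.v + C if (o.press and o.change) else o.u
```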
Thus the AI maximises $u$ conditional on the button not being pressed or the utility change not going through. It maximises $v$ conditional on the button being pressed and the utility change going through. The compensatory rewards are there simply to make it behave like a pure $u$-maximiser up until the moment of button pressing.
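
Continuing the sketch, a quick numerical check of that last point: conditional on $Press$, the expected combined utility works out to exactly $\mathbb{E}(u \mid Press, \neg(u\to v))$, so the press itself adds or subtracts nothing from the agent's $u$-based evaluation. The toy distribution `WORLD` below is, again, purely illustrative.

```python
# Toy distribution (illustrative numbers): conditional on Press,
# the change goes through with probability 0.9.
WORLD = [
    Outcome(press=False, change=False, u=10.0, v=0.0, prob=0.50),
    Outcome(press=True,  change=True,  u=8.0,  v=3.0, prob=0.45),
    Outcome(press=True,  change=False, u=8.0,  v=3.0, prob=0.05),
]

C = compensatory_reward(WORLD)  # = 8.0 - 3.0 = 5.0 here

# E(combined utility | Press) equals E(u | Press, not(u->v)):
# the button press carries no value of its own.
lhs = conditional_expectation(
    WORLD, lambda o: combined_utility(o, C), lambda o: o.press)
rhs = conditional_expectation(
    WORLD, lambda o: o.u, lambda o: o.press and not o.change)
assert abs(lhs - rhs) < 1e-9  # both are 8.0 in this toy world
```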