Draft papers for REALab and Decoupled Approval on tampering

https://www.lesswrong.com/posts/X23q6T4CDifHykqi4/draft-papers-for-realab-and-decoupled-approval-on-tampering

Hi everyone, we (Ramana Kumar, Jonathan Uesato, Victoria Krakovna, Tom Everitt, and Richard Ngo) have been researching tampering problems, and we’ve written up our progress in two papers. We’re sharing drafts in advance because we’d like to get feedback from everyone here. The first paper covers:

Comment

https://www.lesswrong.com/posts/X23q6T4CDifHykqi4/draft-papers-for-realab-and-decoupled-approval-on-tampering?commentId=ThHKxJKAedMNeNH4g

PSA: You can write comments on PDFs in Google Drive! There’s a button in the top right that says "Add a comment" on hover-over, then you get to click-and-drag to highlight a box in the PDF where your comment goes. I will leave a test comment on the first PDF so everyone can see that. (I literally just found this out.)

https://www.lesswrong.com/posts/X23q6T4CDifHykqi4/draft-papers-for-realab-and-decoupled-approval-on-tampering?commentId=ktLwg37vBTA4ygwCP

Very interesting. Naturalizing feedback (as opposed to directly accessing True Reward) seems like it could lead to a lot of desirable emergent behaviors, though I’m somewhat nervous about reliance on a handwritten model of what reliable feedback is.