I know the idea of setting up a "honeypot" to detect when an AI system would attempt a treacherous turn if given the opportunity has been discussed (e.g., IIRC, in *Superintelligence*). But is anyone actually working on this? Has any work on it been published?
I don't know of any serious work on it. I did write up an idea regarding honeypots a little while ago, here.