I know the idea of setting up a "honeypot" to detect when an AI system would attempt a treacherous turn if given the opportunity has been discussed (e.g., IIRC, in *Superintelligence*). But is anyone actually working on this? Has any work on it been published?
I don't know of any serious work on it. I did write up an idea regarding honeypots a little while ago, here.