[ASoT] Some thoughts about LM monologue limitations and ELK

https://www.lesswrong.com/posts/Jiy3n5KMsGGJ6NNYH/asot-some-thoughts-about-lm-monologue-limitations-and-elk

Editor’s note: I’m experimenting with a lower quality threshold for posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top. Thanks to Kyle and Laria for discussion.

One potential way we might think to interpret LMs is to have them explain their thinking as a monologue, justification, or train of thought. In particular, by putting the explanation before the answer, we might hope to encourage the model to actually use the monologue to reach its conclusion, rather than coming up with the bottom line first. However, there are a bunch of ways this could go wrong. For instance: