Bias in rationality is much worse than noise

https://www.lesswrong.com/posts/SGHsnG7ZraTPKzveo/bias-in-rationality-is-much-worse-than-noise

Crossposted at Intelligent Agent Forum.

I’ve found quite a few people dubious of my "radical skepticism" post on human preferences. Most "radical skepticism" arguments (Descartes’s Demon, various no-free-lunch theorems, Boltzmann Brains) turn out to be irrelevant in practice, one way or another. But human preferences are in a different category. For a start, it’s clear that getting them correct is important for AI alignment, so we can’t just ignore errors. But most importantly, simplicity priors/Kolmogorov complexity/Occam’s razor don’t help with learning human preferences, as illustrated most compactly by the substitution of (-p,-R) for (p,R). But this still feels like a bit of a trick. Maybe we can just assume rationality plus a bit of noise, or rationality most of the time, and get something a lot more reasonable.

Structured noise has to be explained

And, indeed, if humans were rational plus a bit of noise, things would be simple. A noisy signal has high Kolmogorov complexity, but there are ways to treat noise as being of low complexity. The problem with that approach is that explaining unstructured noise is completely different from explaining the highly structured noise that we know as bias. Take the anchoring bias, for example. In one of the classical experiments, American students were first asked whether they would buy a product at a price equal to the last two digits of their social security number, and were then asked what price they would actually be willing to pay for it. The irrelevant last two digits had a strong distorting effect on their stated willingness to pay.
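To make the difference between noise and bias concrete, here is a minimal simulation sketch. It assumes the figures of the simplified model introduced in the next section (a $25 baseline, Gaussian noise with standard deviation $5, and a 1/4 weight on the irrelevant number); the function names and sample size are hypothetical choices made only for illustration.

```python
import random
import statistics

BASE_VALUE = 25.0      # baseline valuation in dollars (assumed for illustration)
NOISE_SD = 5.0         # standard deviation of the zero-mean noise (assumed)
ANCHOR_WEIGHT = 0.25   # weight placed on the irrelevant two-digit number (assumed)


def unanchored_valuation(rng):
    """Baseline valuation plus unstructured, zero-mean Gaussian noise."""
    return BASE_VALUE + rng.gauss(0.0, NOISE_SD)


def anchored_valuation(anchor, rng):
    """Valuation pulled toward an irrelevant anchor, plus the same kind of noise."""
    shifted = (1 - ANCHOR_WEIGHT) * BASE_VALUE + ANCHOR_WEIGHT * anchor
    return shifted + rng.gauss(0.0, NOISE_SD)


def mean_valuation(sample_fn, n=20_000):
    """Average n independent draws from sample_fn."""
    return statistics.fmean(sample_fn() for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
plain = mean_valuation(lambda: unanchored_valuation(rng))
low = mean_valuation(lambda: anchored_valuation(10, rng))
high = mean_valuation(lambda: anchored_valuation(90, rng))

print(f"no anchor: {plain:.2f}")
print(f"anchor 10: {low:.2f}")
print(f"anchor 90: {high:.2f}")
```

Averaging washes out the unstructured noise, so the unanchored mean sits close to the baseline. The anchored means do not converge to the baseline however many samples are taken: they track the anchor, moving by roughly a quarter of a dollar for each unit of the irrelevant number. That systematic dependence is what a "rational plus noise" model leaves unexplained.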

Modelling anchoring behaviour

Let’s model a simplified human who, if not asked about their social security number, values a cordless keyboard at $25, plus noise η drawn from a normal distribution with mean $0 and standard deviation $5. If they are first asked about the last two digits s of their social security number, their valuation shifts to (3/4)($25) + (1/4)s + η. To explain human values, we have two theories about their rewards: