2-D Robustness

https://www.lesswrong.com/posts/2mhFMgtAjFJesaSYR/2-d-robustness

This is a short note on a framing that was developed in collaboration with Joar Skalse, Chris van Merwijk and Evan Hubinger while working on Risks from Learned Optimization, but which did not find a natural place in the report.

Mesa-optimisation is a kind of robustness problem, in the following sense: since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution. That is, we can expect the mesa-optimiser to act in a way that results in outcomes we want, and to do so competently. The place where we expect trouble is off-distribution. When the mesa-optimiser is placed in a new situation, I want to highlight two distinct failure modes; that is, two ways of producing outcomes which score poorly on the base objective:

  • Failure of capability robustness: the system becomes incompetent off-distribution, behaving in a confused or ineffective way.

  • Failure of alignment robustness: the system remains competent off-distribution, but competently pursues its mesa-objective, which there diverges from the base objective.

Unlike the 1-d picture, the 2-d picture suggests that more robustness is not always a good thing. In particular, robustness in capabilities is only good insofar as it is matched by robust alignment between the mesa-objective and the base objective. It may be the case that for some systems, we’d rather the system get totally confused in new situations than remain competent while pursuing the wrong objective.

Of course, there is a reason why we usually think of robustness as a scalar: one can define clear metrics for how well the system generalises, in terms of the difference between performance on the base objective on- and off-distribution. In contrast, 2-d robustness does not yet have an obvious way to ground its two axes in measurable quantities. Nevertheless, as an intuitive framing I find it quite compelling, and invite you to also think in these terms.
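The point about the scalar metric can be made concrete with a toy example. The sketch below (all numbers invented for illustration, and not from the post) computes the 1-d metric — the drop in base-objective performance off-distribution — for two hypothetical systems, and shows that it cannot distinguish a capability failure from an alignment failure:

```python
# Toy illustration (all numbers invented): the 1-d robustness metric is the
# drop in base-objective performance off-distribution. Two systems can share
# the same 1-d score while sitting in very different corners of the 2-d
# picture, depending on whether they stay competent at *some* objective.

def scalar_robustness(on_dist_return, off_dist_return):
    """1-d metric: base-objective performance gap between distributions."""
    return on_dist_return - off_dist_return

systems = {
    # fails off-distribution by becoming incompetent at everything
    "capability failure": {"on": 1.0, "off": 0.1, "off_own_objective": 0.1},
    # fails off-distribution while competently pursuing its mesa-objective
    "alignment failure":  {"on": 1.0, "off": 0.1, "off_own_objective": 0.9},
}

for name, s in systems.items():
    gap = scalar_robustness(s["on"], s["off"])
    print(f"{name}: 1-d gap = {gap:.1f}, "
          f"off-dist return on its own objective = {s['off_own_objective']:.1f}")
```

Both systems get the same 1-d score; only the extra column — how well each does on whatever objective it actually pursues — separates the two quadrants.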

Comment

https://www.lesswrong.com/posts/2mhFMgtAjFJesaSYR/2-d-robustness?commentId=HjPFTbd5KR3bCgBWd

One way to try to measure capability robustness separately from alignment robustness off of the training distribution of some system would be to:

  • use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour

  • train a new system to do as well on the reward function as the original system

  • measure the number of training steps needed for the new system to reach this point.

This would let you make comparisons between different systems as to which was more capability robust. Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you would measure when you’re as competent as the original agent (maybe using a discriminator?). The reason for trying this is to have a measure of competence that is less dependent on human judgement, and closer to the system’s ontology and capabilities.
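The three steps above can be sketched end-to-end in a toy tabular setting. Everything below is illustrative and not the commenter’s actual algorithm: real IRL is replaced by a crude state-visitation-frequency proxy for the inferred reward, the “original system” is a hard-coded policy on a small chain environment, and the “new system” is a tabular Q-learner whose training episodes we count until it matches the original’s return under the inferred reward:

```python
# Toy sketch of the proposed measurement (illustrative assumptions throughout):
# the "IRL step" is a crude state-visitation-frequency proxy, the original
# system is a hard-coded policy on a 5-state chain, and the new system is a
# tabular Q-learner trained on the inferred reward.
import random

N, T = 5, 10  # chain of N states, episodes of T steps

def rollout(policy, reward=None):
    """Run one episode; return (total inferred reward, state-visit counts)."""
    s, ret, visits = 0, 0.0, [0] * N
    for _ in range(T):
        s = min(max(s + policy(s), 0), N - 1)  # policy(s) is -1 or +1
        visits[s] += 1
        if reward is not None:
            ret += reward[s]
    return ret, visits

original = lambda s: +1  # "original system": competently moves right

# Step 1: infer a reward from off-distribution behaviour (crude IRL stand-in).
_, visits = rollout(original)
inferred_reward = [v / T for v in visits]
target, _ = rollout(original, inferred_reward)  # original's score under it

# Steps 2-3: train a new system on the inferred reward, counting the episodes
# needed until its greedy policy does as well as the original system.
random.seed(0)
Q = [[0.0, 0.0] for _ in range(N)]  # Q[state][action], actions {left, right}
greedy = lambda s: -1 if Q[s][0] >= Q[s][1] else +1

episodes_needed = None
for episode in range(1, 2001):
    s = 0
    for _ in range(T):
        # epsilon-greedy action selection during training
        a = random.randrange(2) if random.random() < 0.2 \
            else (0 if Q[s][0] > Q[s][1] else 1)
        s2 = min(max(s + 2 * a - 1, 0), N - 1)
        Q[s][a] += 0.5 * (inferred_reward[s2] + 0.9 * max(Q[s2]) - Q[s][a])
        s = s2
    ret, _ = rollout(greedy, inferred_reward)
    if ret >= target:
        episodes_needed = episode
        break

print("episodes for new system to match original:", episodes_needed)
```

The resulting episode count is the proposed capability-robustness measure: comparing it across original systems (on the same off-distribution data budget) would rank them by how hard their off-distribution competence is to reproduce, without a human-specified notion of competence.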