Classifying specification problems as variants of Goodhart’s Law

https://www.lesswrong.com/posts/yXPT4nr4as7JvxLQa/classifying-specification-problems-as-variants-of-goodhart-s

(Cross-posted to personal blog. Summarized in Alignment Newsletter #76. Thanks to Jan Leike and Tom Everitt for their helpful feedback on this post.)

There are a few different classifications of safety problems, including the Specification, Robustness and Assurance (SRA) taxonomy and the Goodhart’s Law taxonomy. In SRA, the specification category is about defining the purpose of the system, i.e. specifying its incentives. Since incentive problems can be seen as manifestations of Goodhart’s Law, we explore how the specification category of the SRA taxonomy maps to the Goodhart taxonomy. The mapping is an attempt to integrate different breakdowns of the safety problem space into a coherent whole. We hope that a consistent classification of current safety problems will help develop solutions that are effective for entire classes of problems, including future problems that have not yet been identified.

The SRA taxonomy defines three different types of specifications of the agent’s objective: ideal (a perfect description of the wishes of the human designer), design (the stated objective of the agent) and revealed (the objective recovered from the agent’s behavior). It then divides specification problems into design problems (e.g. side effects) that correspond to a difference between the ideal and design specifications, and emergent problems (e.g. tampering) that correspond to a difference between the design and revealed specifications.

In the Goodhart taxonomy, there is a variable U* representing the true objective, and a variable U representing the proxy for the objective (e.g. a reward function). The taxonomy identifies four types of Goodhart effects: regressional (maximizing U also selects for the difference between U and U*), extremal (maximizing U takes the agent outside the region where U and U* are correlated), causal (the agent intervenes to maximize U in a way that does not affect U*), and adversarial (the agent has a different goal W and exploits the proxy U to maximize W). We think there is a correspondence between these taxonomies: design problems are regressional and extremal Goodhart effects, while emergent problems are causal Goodhart effects. The rest of this post will explain and refine this correspondence.

The SRA taxonomy needs to be refined in order to capture the distinction between regressional and extremal Goodhart effects, and to pinpoint the source of causal Goodhart effects. To this end, we add a model specification as an intermediate point between the ideal and design specifications, and an implementation specification between the design and revealed specifications. The model specification is the best proxy within a chosen formalism (e.g. model class or specification language), i.e. the proxy that most closely approximates the ideal specification. In a reinforcement learning setting, the model specification is the reward function (defined in the given MDP/R over the given state space) that best captures the human designer’s preferences.
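To make the two design-problem cases concrete, here is a minimal toy sketch in Python (not from the original post; the numbers, variable names, and functional forms are illustrative assumptions) of regressional and extremal Goodhart effects with a scalar proxy U and true objective U*:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Regressional Goodhart (toy assumption): the proxy U is U* plus independent noise. ---
n_options = 1000
u_star = rng.normal(0.0, 1.0, n_options)   # true objective values U*
noise = rng.normal(0.0, 1.0, n_options)    # gap between proxy and truth
u = u_star + noise                         # proxy values U

chosen = np.argmax(u)                      # optimize the proxy
print("proxy value of chosen option:", u[chosen])
print("true value of chosen option: ", u_star[chosen])
print("best achievable true value:  ", u_star.max())
# The chosen option's true value typically falls well short of its proxy value:
# maximizing U also selects for the noise term, i.e. for the U - U* difference.

# --- Extremal Goodhart (toy assumption): U and U* agree on ordinary inputs but diverge at extremes. ---
x = np.linspace(-10.0, 10.0, 2001)
u_ext = x                        # proxy keeps rewarding larger x
u_star_ext = x - 0.05 * x**3     # roughly equals the proxy near 0, collapses for large |x|

chosen_ext = np.argmax(u_ext)
print("extremal case, proxy value:", u_ext[chosen_ext])
print("extremal case, true value: ", u_star_ext[chosen_ext])
# Optimizing the proxy pushes x to the edge of the range, outside the region
# where U and U* are correlated, so the true value of the chosen point is low.
```

In this sketch the first part illustrates why even a well-calibrated proxy degrades under selection pressure, and the second part why a proxy that matches the true objective on typical inputs can fail badly once the agent optimizes into unusual regions; these are the regressional and extremal design problems discussed above.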

Comments

https://www.lesswrong.com/posts/yXPT4nr4as7JvxLQa/classifying-specification-problems-as-variants-of-goodhart-s?commentId=Xd462v2ATXR9qaJZ4

I thought this post was great when it came out and still do—I think it does a really good job of connecting different frameworks for analyzing AI safety problems.

https://www.lesswrong.com/posts/yXPT4nr4as7JvxLQa/classifying-specification-problems-as-variants-of-goodhart-s?commentId=HuLgYMZzysi4Frc5X

I like the thing this post does, and I like the diagrams. I’d like to see this reviewed and voted on.