Contents
- Problem overview
- Solution overview
- Discussion
- Acknowledgements
On Friday I attended the 2020 Foresight AGI Strategy Meeting. Eventually a report will come out summarizing some of what was talked about, but for now I want to focus on what I talked about in my session on deconfusing human values. For that session I wrote up some notes summarizing what I’ve been working on and thinking about. None of it is new, but it is newly condensed in one place and in convenient list form, and it provides a decent summary of the current state of my research agenda for building beneficial superintelligent AI; a version 1 of my agenda, if you will. Thus, I hope this will be helpful in making it a bit clearer what it is I’m working on, why I’m working on it, and what direction my thinking is moving in. As always, if you’re interested in collaborating on things, whether that be discussing ideas or something more, please reach out.
Problem overview
-
I think we’re confused about what we really mean when we talk about human values.
-
This is a problem because:
-
building aligned AI likely requires a mathematically precise understanding of the structure of human values, though not necessarily the content of human values;
-
we can’t trust AI to discover that structure for us because we would need to understand it enough to verify the result, and I think we’re so confused about what human values are we couldn’t do that without high risk of error.
-
What are values?
-
We don’t have an agreed upon precise definition, but loosely it’s "stuff people care about".
-
When I talk about "values" I mean the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology.
-
Importantly, what people care about is used to make decisions, and this has had implications for existing approaches to understanding values.
-
Much research on values tries to understand the content of human values or why humans value what they value, but not what the structure of human values is such that we could use it to model arbitrary values. This research unfortunately does not appear very useful to this project.
-
The best attempts we have right now are based on the theory of preferences.
-
In this model a preference is a statement located within a (weak, partial, total, etc.)-order. Often written like A > B > C to mean A is preferred to B is preferred to C.
-
Problems:
-
Goodhart effects are robust: the preferences in formal models are measures (proxies) of what we care about, not the thing we care about itself.
-
Stated vs. revealed preferences: we generally favor revealed preferences, but this approach has some problems:
-
we can only infer preferences from the behaviors we observe, so latent preferences that never show up in behavior are missed
-
inferring preferences from observation requires making normative assumptions, and if we don’t make normative assumptions there are too many free variables
-
General vs. specific preferences: do we look for context-independent preferences ("essential" values) or context-dependent preferences?
-
generalized preferences, e.g. "I like cake better than cookies", can lead to irrational preferences (e.g. non-transitive preferences; see the sketch at the end of this section)
-
contextualized preferences, e.g. "I like cake better than cookies at this precise moment", limit our ability to reason about what someone would prefer in new situations
-
See Stuart Armstrong’s work for an attempt to address these issues so we can turn preferences into utility functions.
-
Preference based models look to me to be trying to specify human values at the wrong level of abstraction. But what would the right level of abstraction be?
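Below is a minimal sketch (my own illustration, not part of the original notes) of the non-transitivity problem mentioned above: if we pool context-dependent choices into one "general" preference relation, the result can contain a cycle. The option names and observed choices are invented for the example.

```python
# Pairwise choices observed in different contexts (invented data). Each entry
# means "in this context, the first option was chosen over the second."
observed_choices = {
    "breakfast": [("cake", "cookies")],
    "snack":     [("cookies", "fruit")],
    "dessert":   [("fruit", "cake")],
}

def generalized_relation(choices_by_context):
    """Drop the context and pool all pairwise choices into one 'general' relation."""
    relation = set()
    for pairs in choices_by_context.values():
        relation.update(pairs)
    return relation

def has_preference_cycle(relation):
    """Detect a cycle (non-transitivity) in the pooled preference relation via DFS."""
    graph = {}
    for better, worse in relation:
        graph.setdefault(better, set()).add(worse)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, ()))

    return any(visit(node, set()) for node in graph)

relation = generalized_relation(observed_choices)
print(sorted(relation))
print(has_preference_cycle(relation))  # True: cake > cookies > fruit > cake
```

Each contextual preference is perfectly coherent on its own; the irrationality only appears once the context is thrown away, which is the tension between general and specific preferences noted above.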
Solution overview
-
What follows is a summary of what I currently think moves us closer to less confusion about human values. I hope to come away thinking some of it is wrong or insufficient by the end of the discussion!
-
Assumptions:
-
Humans are embedded agents.
-
Agents have fuzzy but definable boundaries.
-
Everything in every moment causes everything in every next moment up to the limit of the speed of light, but we can find clusters of stuff that interact with themselves in ways that are "aligned" such that the stuff in a cluster makes sense to model as an agent separate from the stuff not in the cluster.
-
Basic model:
-
Humans (and other agents) cause events. We call this acting.
-
The process that leads to taking one action rather than another possible action is deciding.
-
Decisions are made by some decision generation process.
-
Values are the inputs to the decision generation process that determine its decisions and hence actions.
-
Preferences and meta-preferences are statistical regularities we can observe over the actions of an agent.
-
Important differences from preference models:
-
Preferences are causally after, not causally before, decisions, contrary to the standard preference model.
-
This is not 100% true. Preferences can be observed by self-aware agents, like humans, and influence the decision generation process.
-
So then what are values? The inputs to the decision generation process?
-
My best guess: valence
-
My best best guess: valence as modeled by minimization of prediction error (see the toy sketch at the end of this section)
-
This leaves us with new problems. Now rather than trying to infer preferences from observations of behavior, we need to understand the decision generation process and valence in humans, i.e. this is now a neuroscience problem.
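To make the basic model concrete, here is a toy sketch (my own assumption-laden illustration, not code from the notes): an agent whose decision generation process takes a valence signal, modeled as negative prediction error, as its input, and whose "preferences" only show up afterwards as statistical regularities over its actions. All option names and numbers are invented.

```python
import random
from collections import Counter

random.seed(0)

OPTIONS = ["cake", "cookies", "fruit"]

# Invented numbers: how much prediction error the agent expects each option
# to generate if taken.
expected_surprise = {"cake": 0.1, "cookies": 0.3, "fruit": 0.6}

def valence(option):
    """The assumed 'value' input: negative (noisy) expected prediction error."""
    return -(expected_surprise[option] + random.gauss(0, 0.2))

def decide():
    """Decision generation process: take the action with the highest valence."""
    return max(OPTIONS, key=valence)

# Run the agent many times. An outside observer never sees valence directly,
# only the actions; the "preference ordering" is a statistical regularity
# read off those actions after the fact.
actions = Counter(decide() for _ in range(1000))
print(actions.most_common())
# e.g. [('cake', ...), ('cookies', ...), ('fruit', ...)] -- the observer would
# summarize this as "cake > cookies > fruit", but that summary comes causally
# *after* the valence-driven decisions.
```

The point of the sketch is only to show the ordering of the pieces in the basic model: valence-like inputs feed the decision generation process, actions come out, and preferences are a compression of the action statistics rather than the thing doing the causing.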
Discussion
-
underdetermination due to noise: many models are consistent with the same data (see the sketch at the end of this section)
-
this makes it easy for us to get confused, even when we’re trying to deconfuse ourselves
-
this makes it hard to know if our model is right since we’re often in the situation of explaining rather than predicting
-
is this a descriptive or causal model?
-
both. descriptive of what we see, but trying to find the causal mechanism of what we reify as "values" at the human level in terms of "gears" at the neuron level
-
what is valence?
-
see here
-
complexities of going from neurons to human level notions of values
-
there are a lot of layers of different systems interacting on the way from neurons to values, and we don’t understand enough about almost any of them, or even know for sure what systems there are in the causal chain
-
Valence in human computer interaction research
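As a small illustration of the underdetermination point above (again my own sketch, not something from the discussion): two quite different models of an agent's values, one rationally maximizing a utility function and one anti-rationally minimizing its negation, fit the same observed choices equally well, so observation alone cannot pick between them without extra normative assumptions. The data and utilities are invented.

```python
# Two different "value" models that fit the same observed choices (invented data).

# Each pair records (chosen, rejected).
observations = [("cake", "fruit"), ("cookies", "fruit"), ("cake", "cookies")]

# Model A: a rational agent maximizing this utility function.
utility = {"cake": 3, "cookies": 2, "fruit": 1}
def model_a(chosen, rejected):
    return max((chosen, rejected), key=lambda o: utility[o])

# Model B: an "anti-rational" agent minimizing the negated utility function.
neg_utility = {option: -u for option, u in utility.items()}
def model_b(chosen, rejected):
    return min((chosen, rejected), key=lambda o: neg_utility[o])

# Both models predict every observed choice, so the data alone cannot tell us
# which pairing of (decision process, values) actually describes the agent.
for chosen, rejected in observations:
    assert model_a(chosen, rejected) == chosen
    assert model_b(chosen, rejected) == chosen
print("both models fit the observations perfectly")
```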
Acknowledgements
Thanks to Dan Elton, De Kai, Sai Joseph, and several other anonymous participants of the session for their attention, comments, questions, and insights.
Planned summary for the Alignment Newsletter: This post argues that since 1. human values are necessary for alignment, 2. we are confused about human values, and 3. we couldn’t verify it if an AI system discovered the structure of human values, we need to do research to become less confused about human values. This research agenda aims to deconfuse human values by modeling them as the input to a decision process which produces behavior and preferences. The author’s best guess is that human values are captured by valence, as modeled by minimization of prediction error.
Planned opinion:
Comment
Yep, agree with the summary. I’ll push back on your opinion a little bit here as if it were just a regular LW comment on the post.
Comment
> This is a reasonable hope but I generally think hope is dangerous when it comes to existential risks

When I say "hope", I mean "it is reasonably likely that the research we do pans out and leads to a knowably-aligned AI system", not "we will look at the AI system’s behavior, pull a risk estimate out of nowhere, and then proceed to deploy it anyway". In this sense, literally all AI risk research is based on hope, since no existing AI risk research knowably will lead to us building an aligned AI system.
Comment
Joseph Stalin’s collectivization of farms
Tokugawa Iemitsu’s closing off of Japan
Hugo Chávez’s nationalization of many industries
Comment
I almost agree, but still ended up disagreeing with a lot of your bullet points. Since reading your list was useful, I figured it would be worthwhile to just make a parallel list. ✓ for agreement, × for disagreement (• for neutral).

Problem overview
- ✓ I think we’re confused about what we really mean when we talk about human values.
- × But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.
  - × We can trust AI to discover that structure for us even though we couldn’t verify the result, because the point isn’t getting the right answer, it’s having a trustworthy process.
  - × We can’t just write down the correct structure any more than we can just write down the correct content. We’re trying to translate a vague human concept into precise instructions for an AI.
- ✓ Agree with extensional definition of values, and relevance to decision-making.
- • Research on the content of human values may be useful information about what humans consider to be human values. I think research on the structure of human values is in much the same boat—information, not the final say.
- ✓ Agree about Stuart’s work being where you’d go to write down a precise set of preferences based on human preferences, and that the problems you mention are problems.

Solution overview
- ✓ Agree with assumptions.
- • I think the basic model leaves out the fact that we’re changing levels of description.
  - × Merely causing events (at the physical level of description) is not sufficient to say we’re acting (at the agent level of description). We need some notion of "could have done something else," which is an abstraction about agents, not something fundamentally physical.
  - × Similar quibbles apply to the other parts—there is no physically special decision process; we can only find one by changing our level of description of the world to one where we posit such a structure.
  - × The point: everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a somewhat more nuanced way to place preferences and meta-preferences.
  - • The simple patch is to just say that there’s some level of description where the decision-generation process lives, and preferences live at a higher level of abstraction than that. Therefore preferences are emergent phenomena relative to the level of description the decision-generation process is on.
    - × But I think if one applies this patch, then it’s a big mistake to use loaded words like "values" to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensional definition from earlier.
- × If we recognize that we’re talking about different levels of description, then preferences are neither causally after nor causally before decisions-on-the-basic-model-level-of-abstraction. They’re regular patterns that we can use to model decisions at a slightly higher level of abstraction.
  - • How to describe self-aware agents at a low level of abstraction then? Well, time to put on our GEB hats. The low level of abstraction just has to include a computation of the model we would use on the higher level of abstraction.
- ✓ Despite all these disagreements, I think you’ve made a pretty good case that the human brain plausibly computes a single currency (valence) that it uses to rate both most decisions and most predictions.
  - × But I still don’t agree that this makes valence human values. I mean values in the sense of "the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology." So I don’t think we’re left with a neuroscience problem; I still think what we want the AI to learn is on that higher level of abstraction where preferences live.
Comment
Thanks for your detailed response. Before I dive in, I’ll just mention I added a bullet point about Goodhart because somehow when I wrote this up initially I forgot to include it.
I really like the idea that preferences are observed after the fact, because I feel like there is some truth to it for human beings. We act, and then become self-aware of our reactions and thoughts, which leads us to formulate some values. Even when we act contrary to those values, at least inside, we feel shitty.
But that doesn’t address the question of where these judgements and initial reactions come from, or how this self-awareness influences subsequent actions.
Still, this makes me want to read the rest of your research!
Comment
I specifically propose they come from valence, recognizing that we know valence is a phenomenon generated by the human brain but not exactly how it happens (yet).