I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don’t currently know of any good writeup. Major pieces in part one:
- Some semitechnical intuition-building for high-dimensional problem-spaces.
- Optimization compresses information "by default"
- Resources and "instrumental convergence" without any explicit reference to agents
- A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
- The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.
- Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc are mostly irrelevant.
- Fermi estimate: just how complex are human values?
- Coherence arguments, presented the way I think they should be done.
- Also subagents!
Note that I don’t talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment. Here’s the video for part one: Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.
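One rough way to read the "bits-of-information" framing in the list above, as a sketch under assumed notation rather than anything stated in the summary: write $N$ for the number of candidates in the search space and $M$ for the number compatible with human values. Both symbols are introduced here purely for illustration. Since one bit of information can at best halve the set of remaining candidates:

```latex
% N = size of the search space, M = number of acceptable solutions
% (both hypothetical symbols, not taken from the talk summary)
\text{bits needed} \;\ge\; \log_2 \frac{N}{M}
% Example: if only one candidate in 2^{1000} is acceptable, then
% N/M = 2^{1000}, so at least 1000 bits have to come from somewhere.
```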
Are there already plans for a transcript of this? (I could set a rev.com transcription in motion.)
No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.
Here is a link to the transcript, which includes the ability to watch along with the video: https://www.rev.com/transcript-editor/shared/QmH6Ofy5AXbQ4siBlLNcvUnMkMBj3qa4WIkQtGeoOlo4K3DvjOH3oMUJuIAUBrJiJkJbb4VU3uqWhLLwRu19f3m6gag?loadFrom=SharedLink
I’ve put in a request for a transcript.
How do transcriptions typically handle images? They’re pretty important for this talk. You could embed the images in the text as it progresses?
I second Rob’s unanswered question at 40:12: how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?
How is the "the genome is small, therefore generators of human values (that can’t be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?
Words cannot possibly express how thankful I am for you doing this!
Thanks a bunch!
I want to interrogate a little more the notion that gradient descent samples uniformly from good parameters (or rather, that its samples are dominated by the initialization distribution). Have you read various things about grokking, like Hypothesis: GD Prefers General Circuits? That argument seems to be that you might start with parameters dominated by the initialization distribution, but various sorts of regularization are going to push you to sample solutions in a nonuniform way. Do you have a take on this?
For the power-seeking-because-of-entropy example, I want to second the audience questions. If you’re getting your policy by sampling from all possible policies, the argument is great, but if you’re getting your policy by sampling from NN parameters that generate strings of 100 actions, then you just finished arguing that uniform-ish sampling over NN parameters will give simplicity-ish sampling over policies. What would a NN do if trained to play the example game? I would assume it would quickly learn to exactly alternate $ and Apple. This looks like something that seems a little less like power-seeking, and more like telling DeepDream to fill the image with dogs, except filling a string with buying three apples. I dunno, do you think it’s still like power-seeking?
I think you make a subtle error when throwing out a lot of "mere biology" genes as not generating human values. If we had different mere biology than we do, the values we develop would probably be different even if our brain-specific genes were the same! Like, I dunno, suppose you have some genes that build your thyroid. But you can’t go "ho hum, the thyroid isn’t the brain, let’s throw those genes out as uninformative," because thyroid activity impacts your mood, which impacts your expressed values. Or I bet I’d have different values if my eyes saw in UV rather than visible, or my skin had no sense of pain, or I went through adolescence in two days rather than five years. Basically I totally disagree with this notion that "if we share it with plants, an AI wouldn’t need to know it."
Actually I’m kinda not sure how relevant you think the size-of-human-preference-generators question is, since we don’t want the AI to learn human preferences in gene-format, we want the AI to learn human preferences in some (different, I think we agree) format that’s better-suited for doing things like making decisions or comparing between different humans.
Cool last section. If you can have 2 dimensions of things to be Pareto optimal over tradeoffs between, why not N dimensions? It seems like there are behaviors that are irrational even for markets (is failing to make mutually beneficial trades between individuals an example? I’m having trouble thinking of something less inward-facing) that could be "optimal" for decision-making procedures with N of 3 or 4.
I think these are both pointing to basically-the-same problem. Under Yudkowsky’s view, it’s presumably not hard to get AI to do X for all values of X, but it’s hard for most of the X which humans care about, and it’s hard for most of the things which seem like human-intuitive "natural things to do".
Huh. I thought Yudkowsky’s view was that it’s hard to get an AGI to do X for all values of X, where X is the final effect of the AGI on the world (like, what the universe looks like when the AI is done doing its thing). If X is instead an instrumental sort of thing, like getting a lot of energy and matter, then it’s not hard to get an AGI to do that.
That’s right.
So "get enough bits-of-information about human values" makes sense if you have something you can do with the bits, i.e. narrow down something. If we don’t know how to specify any final effect of an AGI at all, then we have an additional problem, which is that we don’t know how to do anything with the bits of information about which final effects we want.
I mean, yeah, we do need to be able to use the bits to narrow down a search space.
What’s the search space? Policies, or algorithms, or behaviors, or something. What’s the information? Well, basically pointing a camera at anything in the world today gives you information about human values, or reading anything off the internet. What do we do with this information to get policies we like? The bits of information aren’t the problem; the problem is that we don’t know how to narrow down policy space or algorithm space or behavior space so that it has some particular final results. Getting bits of information about human values, and being able to aim an AGI at anything, are different problems.
The Shannon formula doesn’t define what information is; it quantifies the amount of information. People occasionally point this out as being kind of philosophically funny: we know how to measure the amount of information, but we don’t really have a good definition of what information is. Talking about what information is immediately runs into the question of what the information is about, how the information relates to the thing(s) it’s about, etc. Those are basically similar to the problems one runs into when talking about e.g. an AI’s objective and whether it’s "aligned with" something in the physical world. Like, this mathematical function (the objective) is supposed to talk about something out in the world, presumably it should relate to those things in the world somehow, etc. I claim it’s basically the same problem: how do we get symbolic information/functions/math-things to reliably "point to" particular things in the world? (This is what Yudkowsky, IIUC, would call the "pointer problem".) Framed as a bits-of-information problem, the difficulty is not so much getting enough bits as getting bits which are actually "about" "human values". (Presumably that’s why my explanations seem so confusing.)
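For reference, here is the Shannon formula in its standard form for a discrete random variable $X$ with outcome probabilities $p(x)$. The notation is the textbook one, not something taken from the talk. Note that only the probabilities appear in it; nothing in the formula says what the outcomes refer to, which is the gap described above.

```latex
H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x)
% Example: a fair coin has p(\text{heads}) = p(\text{tails}) = 1/2, so
% H = -\left(\tfrac12 \log_2 \tfrac12 + \tfrac12 \log_2 \tfrac12\right) = 1 \text{ bit},
% whatever the coin flip happens to be "about".
```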
If natural abstractions are a thing, in what sense is "make this AGI have particular effect X" trying to be about human values, if X is expressed using natural abstractions?
In that case, it’s not about human values, which is one of the very nice things the natural abstraction hypothesis buys us.
Section 1 (about compression) was pretty good, I don’t think I had fully internalized this idea, despite having followed a lot of your posts.
Thinking through the "vast majority of problem-space for X fails" argument: assume we have a random text generator and we want it to produce a sorting algorithm.
- The vast majority of outputs don’t sort (or even compile).
- The vast majority of programs that "look like they work" don’t (e.g. "forgot a semicolon", "didn’t account for an already sorted list", etc.).
- Generalizing: the vast majority of programs that pass [unit tests, compiles, human says "looks good to me", simple] don’t work.
- A program could be incomprehensible and pass several unit tests, but still fail in weird edge cases (e.g. when the input number is [84, >100, a prime number > 13, etc.], it spits out gibberish).
- This is a counterargument to the alignment check of "run it in a simulation to see if it breaks out of the box", because that is just another proxy.
- Some constraints above are necessary, like being compilable, and some aren’t, like being easy to understand: some randomly generated sorting algorithms are really hard to understand. For example, one could be written in brainfuck, or contain 10,000 lines of code that are mostly redundant or happen to cancel out, and still sort correctly.
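The "passes the tests but doesn’t work" point above can be checked on a toy search space. The sketch below swaps in something much smaller than random program text: every sequence of 5 compare-and-swap steps on 4 positions (a comparator network) counts as one candidate "program". The two-case `UNIT_TESTS` suite, the sizes, and all function names are assumptions made up for illustration.

```python
from itertools import combinations, permutations, product

N_WIRES = 4
N_COMPARATORS = 5
# Every comparator (i, j) with i < j swaps positions i and j if out of order.
COMPARATORS = list(combinations(range(N_WIRES), 2))

# A hypothetical, deliberately small "unit test suite".
UNIT_TESTS = [(3, 2, 1, 0), (1, 0, 3, 2)]


def run(network, values):
    """Apply a comparator network to a sequence and return the result."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def sorts(network, inputs):
    """True if the network sorts every input in `inputs`."""
    return all(run(network, x) == sorted(x) for x in inputs)


def survey():
    """Count candidates that pass the unit tests vs. those that truly sort."""
    all_inputs = list(permutations(range(N_WIRES)))
    total = passes_tests = truly_sorts = 0
    for network in product(COMPARATORS, repeat=N_COMPARATORS):
        total += 1
        if sorts(network, UNIT_TESTS):
            passes_tests += 1
            # Only candidates that pass the proxy get the exhaustive check.
            if sorts(network, all_inputs):
                truly_sorts += 1
    return total, passes_tests, truly_sorts


if __name__ == "__main__":
    total, passes_tests, truly_sorts = survey()
    print(f"candidates: {total}")
    print(f"pass the unit tests: {passes_tests}")
    print(f"sort every input: {truly_sorts}")
```

In this toy space exhaustive checking is cheap, so the gap between "passes the proxy" and "actually works" can be measured directly. That is exactly what is unavailable for the real problem.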
To relate to the original talk, I agree that I can recognize my own values once I reflect on them, but this is different than seeing a plan about an AI that keeps my values and thinking "this looks like it works". In other words, the "human values" shouldn’t be a strict subset of the "human says it looks like it works", just like "correctly sorts" shouldn’t be a strict subset of "human says it looks like it works" due to incomprehensibility.
For programs specifically, if it’s simple and passes a relevant distribution of unit tests, we can be highly confident it in fact sorts correctly, but what’s the equivalent for "plan that maintains human values"? Let’s say John succeeds and finds what we think to be the generators of human values: would it be comprehensible enough to verify? Applying the argument again, but to John’s proposed solution: the vast majority of the plans & behaviors of [AIs trained in human environments with what we think are the simple generators of human values] may look good but not actually be good. Or the weights are incomprehensible, so we use unit tests to verify, and it could still fail. Counter-counterargument: I can imagine these generators being simple enough that we can indeed be confident they do what we want. Since it should be human-value-equivalent, it should also be human-interpretable (under reflection?). This sounds like a good idea overall, but I wouldn’t bet my life on it. It’d be nice to have necessary and sufficient conditions for this possible solution.
I think a lot of the values we care about are cultural, not just genetic. A human raised without culture isn’t even clearly going to be generally intelligent (in the way humans are), so why assume they’d share our values? Estimations of the information content of this part are discussed by Eric Baum in What is Thought?, although I do not recall the details.
I find that plausible, a priori. Mostly doesn’t affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.
Thanks a lot for posting this! A minor point about the 2nd intuition pump (100-timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.
Yup! More generally, key pieces for modeling a "resource": amounts of the resource are additive, and more resources open up more actions (operationalized by the need for a positive balance in this case). If there’s something roughly like that in the problem space, then the resource-seeking argument kicks in.
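The role of the non-negative-balance rule can be checked by counting. The sketch below assumes the four actions described above, that apples and bananas cost $1 each, and that the balance may never go below zero at any step. The prices and the exact form of the rule are assumptions, as is every name in the code. It computes how often each action appears, on average, in a sequence drawn uniformly from all valid sequences.

```python
# Assumed effect of each action on the balance. The $1 prices are an
# illustrative assumption; the thread does not state them.
ACTIONS = {"take_dollar": 1, "do_nothing": 0, "buy_apple": -1, "buy_banana": -1}


def expected_action_counts(horizon):
    """Expected uses of each action when a sequence is drawn uniformly from
    all length-`horizon` sequences whose balance never goes negative."""
    size = horizon + 2  # balances 0..horizon, plus one slot of slack

    # prefixes[t][b]: valid ways to play the first t steps and hold balance b.
    prefixes = [[0] * size for _ in range(horizon + 1)]
    prefixes[0][0] = 1
    for t in range(horizon):
        for b in range(t + 1):
            for delta in ACTIONS.values():
                if b + delta >= 0:
                    prefixes[t + 1][b + delta] += prefixes[t][b]

    # completions[t][b]: valid ways to play steps t..horizon-1 starting from b.
    completions = [[0] * size for _ in range(horizon + 1)]
    completions[horizon] = [1] * size
    for t in range(horizon - 1, -1, -1):
        for b in range(t + 1):
            completions[t][b] = sum(
                completions[t + 1][b + delta]
                for delta in ACTIONS.values()
                if b + delta >= 0
            )

    total = completions[0][0]  # number of valid full sequences
    usage = dict.fromkeys(ACTIONS, 0)
    for t in range(horizon):
        for b in range(t + 1):
            for name, delta in ACTIONS.items():
                if b + delta >= 0:
                    usage[name] += prefixes[t][b] * completions[t + 1][b + delta]
    return {name: count / total for name, count in usage.items()}


if __name__ == "__main__":
    for name, value in expected_action_counts(100).items():
        print(f"{name}: {value:.1f}")
```

No goal appears anywhere in this count. Any gap between `take_dollar` and `do_nothing` comes purely from the balance constraint, since without it all four actions would be equally frequent.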
Cheers for posting! I’ve got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it’s not easy to check back through a video): Let’s say that we have a neural network of width 100, trained on a dataset that a network of width only 30 could fit to perfect accuracy. If it compresses the data into only 30 weights, there’s a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind. I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do depends on the paths one can take to reach a solution, and that abstracting this away can give very different results from any high-dimensional optimization that we actually know how to do. Assuming that the network is parameterized by, say, float16s, maximal compression of the data would result in the output of the network being sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits, so 16 bits of info would be compressed into one weight, rather than spread among 3-4. My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which would render them incredibly difficult to reach in practice (they would also have a big problem with unknown examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on, probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bit of weights).
The resulting property would be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don’t matter much), which would not result in some subset of the parameters holding all the useful information. If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights, as these are the easiest to leave free given the structure of training? If you disagree I’d be curious to know where. I sense that Mingard et al. share your conclusion, but I don’t yet understand the claimed empirical demonstration. tl;dr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.
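The uniform-sampling half of this argument can be put as a toy count. This is a sketch under strong assumptions, not a claim about real training: take $n = 100$ float16 weights, suppose a zero-loss solution depends on only $k$ of them and ignores the rest entirely, and let $c_k$ (a hypothetical symbol) be the number of ways to set the $k$ weights that matter.

```latex
% Each ignored float16 weight can take any of 2^{16} values:
\#\{\text{zero-loss configurations using } k \text{ weights}\}
  \;\approx\; c_k \cdot 2^{16\,(n-k)}
% Comparing k = 30 against k = 40, and assuming c_{30} and c_{40} are of
% comparable size, the ratio is about 2^{16 \cdot 10} = 2^{160} in favour of k = 30.
```

That factor describes uniform sampling over zero-loss configurations only. It says nothing about which configurations are reachable by a training path, which is the objection raised above.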
Bump re/ my question about trying to make an AI do any specifiable thing at all vs. specifying some good thing to do; still curious what you think.
Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? Equivalent of a simulated human? Capabilities secret-sauce algorithm, right? I’m unsure if you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it. Assume it works as a simulated person and ignore mindcrime: how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)? Or, how do you use this to create a simulated long reflection (i.e. what humans would decide ethics to be if they thought about it for [1000] years)? You could first figure out meta-preferences and bootstrap that in for figuring out preferences. Though, I’m unsure if there is a "correct" set of meta-preferences, with my main confusion being the blank spot in my map where "enlightenment" is.