The Big Picture Of Alignment (Talk Part 1)

xdSDFQs4aC5GrdHNZ

title

authors

johnswentworth

date_published

2022-02-21T05:49

score

omega_karma

votes

Comment

id

Gv8Tk97LZkgiRX5o8
authors

johnswentworth
score

2
omega_karma

2
votes

1
date_published

2022-02-22T23:27

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=Gv8Tk97LZkgiRX5o8

No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.

Comment

id

z3yrEiRf3skPcS4sG
authors

Raemon
score

5
omega_karma

3
votes

3
date_published

2022-02-25T00:27

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=z3yrEiRf3skPcS4sG

Here is a link to the transcript, which includes the ability to watch along with the video: https://www.rev.com/transcript-editor/shared/QmH6Ofy5AXbQ4siBlLNcvUnMkMBj3qa4WIkQtGeoOlo4K3DvjOH3oMUJuIAUBrJiJkJbb4VU3uqWhLLwRu19f3m6gag?loadFrom=SharedLink

id

akkCTuFK4PsCQyLbd
authors

Raemon
score

3
omega_karma

1
votes

2
date_published

2022-02-23T01:07

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=akkCTuFK4PsCQyLbd

I’ve put in a request for a transcript.

id

nPojhdhTJwm8cnknm
authors

Logan Riggs
score

1
omega_karma

1
votes

1
date_published

2022-02-22T23:23

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=nPojhdhTJwm8cnknm

How do transcriptions typically handle images? They’re pretty important for this talk. You could embed the images in the text as it progresses?

id

HBifyTtpzBiRkjowQ
authors

Zack_M_Davis
score

12
omega_karma

5
votes

7
date_published

2022-02-21T18:30

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=HBifyTtpzBiRkjowQ

I second Rob’s unanswered question at 40:12: how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

How is the "the genome is small, therefore generators of human values (that can’t be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

Comment

id

jtA2dhJBfjxhDAFen
authors

johnswentworth
score

12
omega_karma

7
votes

5
date_published

2022-02-21T20:33

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=jtA2dhJBfjxhDAFen

how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

This question needs a whole essay (or several) on its own. If I don’t get around to leaving a longer answer in the next few days, ping me. Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

How is the "the genome is small, therefore generators of human values (that can’t be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

The key difference is between "human values" vs "generators of human values". The complexity of value thesis (as articulated on that arbital page) says that human values are not algorithmically simple, and I do agree with that. But that still allows for simple generators of human values, which (conceptually) take in lots of data from the real world and spit out values. Everything except those generators is learned from the environment. In principle, if we can figure out those relatively-simple generators, then we can feed an AI data similar to the data from which humans’ value-generators generate their values, and the AI should be able to reconstruct human values (up to within ordinary between-humans-within-similar-environments variation).

Comment

id

JDuPMbLagwoeTipaX
authors

Logan Riggs
score

3
omega_karma

3
votes

2
date_published

2022-02-22T20:58

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=JDuPMbLagwoeTipaX

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

Cultural accumulation and Google, but that’s mimicking someone who’s already figured it out. How about the person who first figured out eg crop growth? Could be scientific method, but also just random luck which then caught on.

Additionally, sometimes it’s just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many different situations. There are bits of information in both the patterns themselves and in when to apply them, though I feel confused trying to connect these ideas here. People specifically have inner simulations (ie you can imagine what it’d look like to drop a bowling ball off a building even if you’ve never seen it) built from things they have lots of experience with, which is a way of applying different patterns to new situations.

id

oz3ESfv8yfpzmnaBF
authors

Alexander
score

11
omega_karma
votes

6
date_published

2022-02-21T07:01

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=oz3ESfv8yfpzmnaBF

Words cannot possibly express how thankful I am for you doing this!

id

rjfGkhQ93C2wwMGjR
authors

Charlie Steiner
score

9
omega_karma

5
votes

3
date_published

2022-02-24T03:29

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=rjfGkhQ93C2wwMGjR

Thanks a bunch!

I want to interrogate a little more the notion that gradient descent samples uniformly (or rather, is dominated by the initialization distribution) from good parameters. Have you read various things about grokking like Hypothesis: GD Prefers General Circuits? That argument seems to be that you might start with parameters dominated by the initialization distribution, but various sorts of regularization are going to push you to sample solutions in a nonuniform way. Do you have a take on this?
For the power-seeking-because-of-entropy example, I want to second the audience questions. If you’re getting your policy by sampling from all possible policies, the argument is great, but if you’re getting your policy by sampling from NN parameters that generate strings of 100 actions, then you just finished arguing that uniform-ish sampling over NN parameters will give simplicity-ish sampling over policies. What would a NN do if trained to play the example game? I would assume it would quickly learn to exactly alternate $ and Apple. This looks like something that seems a little less like powerseeking, and more like telling DeepDream to fill the image with dogs, except filling a string with buying three apples. I dunno, do you think it’s still like powerseeking?
I think you make a subtle error when throwing out a lot of "mere biology" genes as not generating human values. If we had different mere biology than we do, the values we develop would probably be different even if our brain-specific genes were the same! Like, I dunno, suppose you have some genes that build your thyroid. But you can’t go "ho hum, the thyroid isn’t the brain, let’s throw those genes out as uninformative," because thyroid activity impacts your mood, which impacts your expressed values. Or I bet I’d have different values if my eyes saw in UV rather than visible, or my skin had no sense of pain, or I went through adolescence in two days rather than five years. Basically I totally disagree with this notion that "if we share it with plants, an AI wouldn’t need to know it."
Actually I’m kinda not sure how relevant you think the size-of-human-preference-generators question is, since we don’t want the AI to learn human preferences in gene-format, we want the AI to learn human preferences in some (different, I think we agree) format that’s better-suited for doing things like making decisions or comparing between different humans.
Cool last section. If you can have 2 dimensions of things to be Pareto optimal over tradeoffs between, why not N dimensions? It seems like there are behaviors that are irrational even for markets (is failing to make mutually beneficial trades between individuals an example? I’m having trouble thinking of something less inward-facing) that could be "optimal" for decision-making procedures with N of 3 or 4.

id

zQTaShJGTm5THG2AT
authors

TekhneMakre
score

7
omega_karma
votes

4
date_published

2022-02-23T03:06

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=zQTaShJGTm5THG2AT

The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.

This sounds like a fundamental disagreement with Yudkowsky’s view. (I think) Yudkowsky thinks the hardest part about alignment is getting an AGI to do any particular specified thing (that requires superhuman general intelligence) at all, whatever it may be, whereas by default AGI will optimize hard for something that no programmer had in mind; rather than the problem being about pointing at particular values. Do you recognize this as a disagreement, and what do you think of it? Do you think aiming-at-all is not that hard, or isn’t usefully separated from pointing at human values?

Comment

id

n9Fq3KhKkRAhooyHH
authors

johnswentworth
score

2
omega_karma
votes

1
date_published

2022-05-13T17:50

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=n9Fq3KhKkRAhooyHH

I think these are both pointing to basically-the-same problem. Under Yudkowsky’s view, it’s presumably not hard to get AI to do X for all values of X, but it’s hard for most of the X which humans care about, and it’s hard for most of the things which seem like human-intuitive "natural things to do".

Comment

id

mDqRrrCAr7eswqJ4w
authors

TekhneMakre
score

4
omega_karma
votes

2
date_published

2022-05-14T04:31

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=mDqRrrCAr7eswqJ4w

Huh. I thought Yudkowsky’s view was that it’s hard to get an AGI to do X for all values of X, where X is the final effect of the AGI on the world (like, what the universe looks like when the AI is done doing its thing). If X is instead an instrumental sort of thing, like getting a lot of energy and matter, then it’s not hard to get an AGI to do that.

Comment

id

BN3H8x6w4m7iwLJuD
authors

johnswentworth
score

2
omega_karma
votes

1
date_published

2022-05-14T04:44

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=BN3H8x6w4m7iwLJuD

That’s right.

Comment

id

2BfdpKnoAtk3paBMF
authors

TekhneMakre
score

2
omega_karma
votes

1
date_published

2022-05-14T04:54

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=2BfdpKnoAtk3paBMF

So "get enough bits-of-information about human values" makes sense if you have something you can do with the bits, i.e. narrow down something. If we don’t know how to specify any final effect of an AGI at all, then we have an additional problem, which is that we don’t know how to do anything with the bits of information about which final effects we want.

Comment

id

Dt8MJrQKdKuoScCAB
authors

johnswentworth
score

2
omega_karma
votes

1
date_published

2022-05-14T06:51

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=Dt8MJrQKdKuoScCAB

I mean, yeah, we do need to be able to use the bits to narrow down a search space.

Comment

id

oMFpdgtJn7HyhoE5G
authors

TekhneMakre
score

4
omega_karma
votes

2
date_published

2022-05-14T09:29

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=oMFpdgtJn7HyhoE5G

What’s the search space? Policies, or algorithms, or behaviors, or something. What’s the information? Well, basically pointing a camera at anything in the world today gives you information about human values, or reading anything off the internet. What do we do with this information to get policies we like? The bits of information aren’t the problem; the problem is that we don’t know how to narrow down policy space or algorithm space or behavior space so that it has some particular final results. Getting bits of information about human values, and being able to aim an AGI at anything, are different problems.

Comment

id

Ga6jxkrF4envgAHNk
authors

johnswentworth
score

2
omega_karma
votes

1
date_published

2022-05-14T17:45

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=Ga6jxkrF4envgAHNk

Getting bits of information about human values, and being able to aim an AGI at anything, are different problems.

I think these are the same problem? Like, ability-to-narrow-down-a-search-space-or-behavior-space-by-a-factor-of-two is what a bit of information is. If we can’t use the information to narrow down a search space closer to the thing-the-information-is-supposedly-about, then we don’t actually have any information about that thing.
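
A minimal sketch of the "one bit halves the search space" reading, for concreteness. The function name `bits_needed` and the space sizes are illustrative assumptions, not anything from the talk; it only shows the arithmetic of counting halvings.

```python
import math

def bits_needed(space_size: int, target_size: int) -> float:
    """Bits of selection needed to narrow `space_size` candidates to `target_size`."""
    return math.log2(space_size / target_size)

# Narrowing 1024 hypothetical candidate behaviors to a single one takes 10 halvings.
print(bits_needed(1024, 1))    # 10.0
# One bit corresponds to cutting the surviving set in half.
print(bits_needed(1024, 512))  # 1.0
```

On this reading, a measurement that cannot be used to shrink the candidate set at all contributes zero bits about the target, which is the point being argued here.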

Comment

id

GtdT7SEonGoX96bS6
authors

TekhneMakre
score

4
omega_karma
votes

2
date_published

2022-05-14T18:09

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=GtdT7SEonGoX96bS6

Like, ability-to-narrow-down-a-search-space-or-behavior-space-by-a-factor-of-two is what a bit of information is.

Information is an upper bound, not a lower bound. The capacity of a channel gives you an upper bound on how many distinct messages you can send, not a lower bound on your performance on some task using messages sent over the channel. If you have a very high info-capacity channel with someone who speaks a different language from you, you don’t have an informational problem, you have some other problem (a translation problem).

If we can’t use the information to narrow down a search space closer to the thing-the-information-is-supposedly-about, then we don’t actually have any information about that thing.

This seems to render the word "information" equivalent to "what we know how to do", which is not the technical meaning of information. Do you mean to do that? If so, why? It seems like a misframing of the problem, because what’s hard about the problem is that you don’t know how to do something, and don’t know how to gather data about how to do that thing, because you don’t have a clear space of possibilities with a shattering set of clear observable implications of those possibilities. When you don’t know how to do something and don’t have a clear space of possibilities, the sort of pieces of progress you want to make aren’t fungible with each other the way information is fungible with other information.

[ETA: Like, if the space in question is the space of which "human values" is a member, then I’m saying, our problem isn’t locating human values in that space, our problem is that none of the points in the space are things we can actually implement, because we don’t know how to give any particular values to an AGI.]

Comment

id

RTx6ubPJyKPjM9toW
authors

johnswentworth
score

2
omega_karma
votes

1
date_published

2022-05-15T01:24

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=RTx6ubPJyKPjM9toW

The Shannon formula doesn’t define what information is, it quantifies amount of information. People occasionally point this out as being kind of philosophically funny: we know how to measure amount of information, but we don’t really have a good definition of what information is. Talking about what information is immediately runs into the question of what the information is about, how the information relates to the thing(s) it’s about, etc.

Those are basically similar to the problems one runs into when talking about e.g. an AI’s objective and whether it’s "aligned with" something in the physical world. Like, this mathematical function (the objective) is supposed to talk about something out in the world, presumably it should relate to those things in the world somehow, etc. I claim it’s basically the same problem: how do we get symbolic information/functions/math-things to reliably "point to" particular things in the world? (This is what Yudkowsky, IIUC, would call the "pointer problem".)

Framed as a bits-of-information problem, the difficulty is not so much getting enough bits as getting bits which are actually "about" "human values". (Presumably that’s why my explanations seem so confusing.)

Comment

id

8DEd7MZYdmsxNhfA8
authors

TekhneMakre
score

2
omega_karma
votes

1
date_published

2022-05-15T03:32

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=8DEd7MZYdmsxNhfA8

If natural abstractions are a thing, in what sense is "make this AGI have particular effect X" trying to be about human values, if X is expressed using natural abstractions?

Comment

id

d6tgrYfcLuNvdbnBH
authors

johnswentworth
score

2
omega_karma
votes

1
date_published

2022-05-15T06:05

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=d6tgrYfcLuNvdbnBH

In that case, it’s not about human values, which is one of the very nice things the natural abstraction hypothesis buys us.

id

vFv9EA67oan7HaSo4
authors

tailcalled
score

6
omega_karma
votes

4
date_published

2022-02-21T19:08

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=vFv9EA67oan7HaSo4

Section 1 (about compression) was pretty good, I don’t think I had fully internalized this idea, despite having followed a lot of your posts.

id

8h5aTzaE2YjjiCnJi
authors

Logan Riggs
score

5
omega_karma

4
votes

3
date_published

2022-02-22T20:48

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=8h5aTzaE2YjjiCnJi

Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator that we want to run a sorting algorithm:

The vast majority don’t sort (or aren’t even compilable)
The vast majority of programs that "look like they work", don’t (eg "forgot a semicolon", "didn’t account for an already sorted list", etc)
Generalizing: the vast majority of programs that pass [Unit tests, compiles, human says "looks good to me", simple], don’t work.
Could be incomprehensible, pass several unit tests, but still fail in weird edge cases (eg. when the input number is [84, >100, a prime number > 13, etc], then it spits out gibberish)
This is a counterargument to the alignment check of "run it in a simulation to see if it breaks out of the box", because that is just another proxy.
Some constraints above are necessary, like being compilable, and some aren’t, like being easy to understand: some randomly generated sorting algorithms are really hard to understand. For example, one could be written in brainfuck, or contain 10,000 lines of code that are mostly redundant or happen to cancel out, and still sort correctly.
To relate to the original talk, I agree that I can recognize my own values once I reflect on them, but this is different than seeing a plan about an AI that keeps my values and thinking "this looks like it works". In other words, the "human values" shouldn’t be a strict subset of the "human says it looks like it works", just like "correctly sorts" shouldn’t be a strict subset of "human says it looks like it works" due to incomprehensibility.
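
The "passes several unit tests but still fails in edge cases" item above can be made concrete with a small sketch. The function `looks_like_sort` and the test suite are hypothetical, made up for illustration: a single bubble-sort pass that happens to be correct on every case in a plausible-looking suite and wrong just outside it.

```python
def looks_like_sort(xs):
    """Hypothetical 'looks like it works' sorter: performs only ONE bubble-sort pass."""
    out = list(xs)
    for i in range(len(out) - 1):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

# A small, reasonable-looking test suite that the buggy function passes.
unit_tests = [[], [1], [1, 2, 3], [2, 1], [1, 3, 2], [2, 1, 3]]
passes_all = all(looks_like_sort(t) == sorted(t) for t in unit_tests)

# An input outside the suite on which it fails: one pass is not enough.
edge_case = [3, 2, 1]
fails_edge = looks_like_sort(edge_case) != sorted(edge_case)

print(passes_all, fails_edge)  # True True
```

The proxy "passes the tests" and the target "sorts correctly" come apart as soon as the input leaves the tested distribution, which is the same shape of failure the list describes for alignment checks.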

For programs specifically, if it’s simple and passes a relevant distribution of unit tests, we can be highly confident it in fact sorts correctly, but what’s the equivalent for "plan that maintains human values"? Let’s say John succeeds and finds what we think to be the generators of human values, would it be comprehensible enough to verify it? Applying the argument again but to John’s proposed solution, the vast majority of [AIs trained in human environments with what we think are the simple generators of human values]’s plans & behaviors may look good but not actually be good. Or the weights are incomprehensible, so we use unit tests to verify and it could still fail. Counter-counterargument: I can imagine these generators being simple enough that we can indeed be confident they do what we want. Since it should be human-value-equivalent, it should also be human-interpretable (under reflection?). This sounds like a good idea overall, but I wouldn’t bet my life on it. It’d be nice to have necessary and sufficient conditions for this possible solution.

id

57MwK6gYF8GkTtti7
authors

abramdemski
score

4
omega_karma

4
votes

2
date_published

2022-02-25T21:16

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=57MwK6gYF8GkTtti7

I think a lot of the values we care about are cultural, not just genetic. A human raised without culture isn’t even clearly going to be generally intelligent (in the way humans are), so why assume they’d share our values? Estimations of the information content of this part are discussed by Eric Baum in What is Thought?, although I do not recall the details.

Comment

id

uaqxD3tZXFGxyrG6w
authors

johnswentworth
score

10
omega_karma

6
votes

4
date_published

2022-02-25T22:02

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=uaqxD3tZXFGxyrG6w

I find that plausible, a priori. Mostly doesn’t affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.

id

zvN7Cmg4MsDrhRZCx
authors

Ramana Kumar
score

4
omega_karma

3
votes

3
date_published

2022-02-21T11:27

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=zvN7Cmg4MsDrhRZCx

Thanks a lot for posting this! A minor point about the 2nd intuition pump (100-timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.
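
A brute-force sketch of the point above, under loudly stated assumptions: each fruit is assumed to cost $1 (the price is not given in this thread), the horizon is shortened from 100 to 8 steps so that all sequences can be enumerated, and the 3-apple goal is deliberately left out. The only rule kept is "do not end with a negative balance".

```python
from itertools import product

ACTIONS = ("take", "nothing", "apple", "banana")
PRICE = 1     # ASSUMED price of each fruit, in dollars
HORIZON = 8   # shortened from the talk's 100 steps so brute force is feasible

def is_valid(seq):
    """Rule from the comment: the episode must not end with a negative balance."""
    balance = sum(1 if a == "take" else -PRICE if a in ("apple", "banana") else 0
                  for a in seq)
    return balance >= 0

# Enumerate every action sequence and keep the ones that satisfy the rule.
valid = [seq for seq in product(ACTIONS, repeat=HORIZON) if is_valid(seq)]
avg_take = sum(seq.count("take") for seq in valid) / len(valid)
avg_nothing = sum(seq.count("nothing") for seq in valid) / len(valid)

print(f"valid sequences: {len(valid)} of {len(ACTIONS) ** HORIZON}")
print(f"mean 'take' per valid sequence:    {avg_take:.2f}")
print(f"mean 'nothing' per valid sequence: {avg_nothing:.2f}")
```

Without the rule, "take" and "nothing" are symmetric and appear equally often on average. With the rule and no goal at all, the surviving sequences contain more "take" than "nothing" on average, consistent with the claim that the budget constraint, not the apple goal, is what does the pruning.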

Comment

id

JAXgqw8AKPeKTovm8
authors

johnswentworth
score

5
omega_karma

5
votes

3
date_published

2022-02-21T18:05

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=JAXgqw8AKPeKTovm8

Yup! More generally, key pieces for modeling a "resource": amounts of the resource are additive, and more resources open up more actions (operationalized by the need for a positive balance in this case). If there’s something roughly like that in the problem space, then the resource-seeking argument kicks in.

id

GrmAcQwLzzGMMHpD9
authors

Hoagy
score

3
omega_karma

3
votes

2
date_published

2022-02-21T16:11

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=GrmAcQwLzzGMMHpD9

Cheers for posting! I’ve got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it’s not easy to check back through a video):

Let’s say that we have a neural network of width 100, which is trained on a dataset that could be fit to perfect accuracy by a network of width only 30. If it compresses it into only 30 weights there’s a 70-dimensional space of free parameters and we should expect a randomly selected solution to be of this kind.

I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do.

Assuming that the network is parameterized by, say, float16s, maximal compression of the data would result in the output of the network being sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits, so 16 bits of info would be compressed into one weight, rather than spread among 3-4.

My intuition is that these highly compressed arrangements would be very sensitive to perturbations, and render them incredibly difficult to reach in practice (and also have a big problem with unknown examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on, probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bit of weights).

The properties of these minima would be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don’t matter much) which would not result in some subset of the parameters having all the useful information.

If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights as these are the easiest to leave free given the structure of training?

If you disagree I’d be curious to know where. I sense that Mingard et al shares your conclusion but I don’t yet understand the claimed empirical demonstration.

tldr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.

Comment

id

vkQPHcNZrktWN6FWB
authors

johnswentworth
score

3
omega_karma

3
votes

2
date_published

2022-02-21T18:05

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=vkQPHcNZrktWN6FWB

it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do.

This is where Mingard et al come in. One of their main results is that SGD training on neural nets approximates just-randomly-sampling-an-optimal-point quite well. Turns out our methods are not actually very path-dependent in practice!

My intuition is that these highly compressed arrangements would be very sensitive to perturbations, and render them incredibly difficult to reach in practice… There is therefore a competing incentive towards minima which are easy to land on, probably flat minima surrounded by areas of relatively good performance.

There is a mismatch between your intuition and the implications of "flat minima surrounded by areas of relatively good performance". Remember, the whole point of the "highly compressed arrangements" is that we only need to lock in a few parameter values in order to get optimal behavior; once those few values are locked in, the rest of the parameters can mostly vary however they want without screwing stuff up. "Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can’t-vary-without-losing-performance.

Now, your intuition is correct in the sense that info may be spread over many parameters; the relevant "ways to vary things" may not just be "adjust one param while holding others constant". For instance, it might be more useful to look at parameter variation along local eigendirections of the Hessian. Then the claim would be something like "flat optimum = performance is flat along lots of eigendirections, therefore we can project the parameter-values onto the non-flat eigendirections and those projections are the ‘compressed info’". (Tbc, I still don’t know what the best way is to characterize this sort of thing, but eigendirections are an obvious approximation which will probably work.)
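
A toy sketch of the eigendirection picture, under heavy simplifying assumptions: the two-parameter loss below is made up for illustration (only the sum `a + b` affects it), the Hessian is estimated by finite differences, and the steep eigendirection `(1, 1)/sqrt(2)` is known analytically for this particular loss rather than computed in general.

```python
import math

def loss(a, b):
    # Toy over-parameterized model (an assumption): only the sum a + b matters.
    return (a + b - 1.0) ** 2

def hessian(f, a, b, h=1e-3):
    """Central finite-difference Hessian entries (f_aa, f_ab, f_bb) at (a, b)."""
    faa = (f(a + h, b) - 2 * f(a, b) + f(a - h, b)) / h**2
    fbb = (f(a, b + h) - 2 * f(a, b) + f(a, b - h)) / h**2
    fab = (f(a + h, b + h) - f(a + h, b - h)
           - f(a - h, b + h) + f(a - h, b - h)) / (4 * h**2)
    return faa, fab, fbb

def sym_eigenvalues(faa, fab, fbb):
    """Closed-form eigenvalues of the symmetric matrix [[faa, fab], [fab, fbb]]."""
    mean = (faa + fbb) / 2
    radius = math.hypot((faa - fbb) / 2, fab)
    return mean - radius, mean + radius

def project_on_steep(a, b):
    # For this toy loss the non-flat eigendirection is (1, 1) / sqrt(2).
    return (a + b) / math.sqrt(2)

flat, steep = sym_eigenvalues(*hessian(loss, 0.25, 0.75))
print(f"eigenvalues: flat ~ {flat:.6f}, steep ~ {steep:.6f}")

# Two very different zero-loss points share the same projection on the steep
# direction; that shared number is the 'compressed info' in this toy case.
print(project_on_steep(0.25, 0.75), project_on_steep(-3.0, 4.0))
```

Here one eigenvalue is (numerically) zero and one is not: the parameters can slide freely along `(1, -1)` without changing the loss, while everything the loss depends on is captured by the single projection onto the steep direction, even though that information is spread across both parameters.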

Comment

id

pr2t5JHdt5HMu4Bty
authors

Hoagy
score

5
omega_karma

5
votes

3
date_published

2022-02-21T19:13

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=pr2t5JHdt5HMu4Bty

Turns out our methods are not actually very path-dependent in practice!

Yeah I get that’s what Mingard et al are trying to show but the meaning of their empirical results isn’t clear to me, so I’ll try and properly read the actual paper rather than the blog post before saying any more in that direction.

"Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can’t-vary-without-losing-performance.

I get that a truly flat area is synonymous with compression, but I think being surrounded by areas of good performance is anti-correlated with compression because it indicates redundancy and less-than-maximal sensitivity. I agree that viewing it as flat eigendimensions in parameter space is the right way to think about it, but I still worry that the same concerns apply: maximal compression in this space is traded against ease of finding what would be a flat plain in many dimensions but a maximally steep ravine in all of the other directions.

I can imagine this could be investigated with some small experiments, or they may well already exist, but I can’t promise I’ll follow up. If anyone is interested, let me know.

id

AdvGPKTLQBJQMjEBv
authors

TekhneMakre
score

2
omega_karma
votes

1
date_published

2022-05-13T09:01

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=AdvGPKTLQBJQMjEBv

Bump re/ my question about trying to make an AI do any specifiable thing at all vs. specifying some good thing to do; still curious what you think.

id

kLAPEwoZR8XWxwsvh
authors

Logan Riggs
score

1
omega_karma

1
votes

1
date_published

2022-02-22T21:27

https://www.lesswrong.com/posts/xdSDFQs4aC5GrdHNZ/the-big-picture-of-alignment-talk-part-1?commentId=kLAPEwoZR8XWxwsvh

Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? Equivalent of a simulated human? Capabilities secret-sauce algorithm right? I’m unsure if you can take the body out of a person and still have the same values because I have felt senses in my body that tell me information about the world and how I relate to it. Assume it works as a simulated person and ignore mindcrime, how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)? Or, how do you use this to create a simulated long reflection? (ie what humans would decide ethics to be if they thought about it for [1000] years) You could first figure out meta-preferences and bootstrap that in for figuring out preferences. Though, I’m unsure if there is a "correct" set of meta-preferences, with my main confusion being the blank spot in my map where "enlightenment" is.