(I’m reposting this comment as a top-level post, for ease of future reference. The context here is a discussion about the different lines of arguments for the importance of AI safety.)
Here’s another argument that I’ve been pushing since the early days (apparently not very successfully since it didn’t make it to this list :) which might be called "argument from philosophical difficulty". It appears that achieving a good long term future requires getting a lot of philosophical questions right that are hard for us to answer. Given this, initially I thought there are only three ways for AI to go right in this regard (assuming everything else goes well with the AI):
We solve all the important philosophical problems ahead of time and program the solutions into the AI.
We solve metaphilosophy (i.e., understand philosophical reasoning as well as we understand mathematical reasoning) and program that into the AI so it can solve philosophical problems on its own.
We program the AI to learn philosophical reasoning from humans or use human simulations to solve philosophical problems.
Since then people have come up with a couple more scenarios (which did make me slightly more optimistic about this problem):
We all coordinate to stop technological progress some time after AI but before space colonization, and have a period of long reflection where humans, maybe with help from AIs, spend thousands or millions of years to solve philosophical problems.
We program AIs to be corrigible to their users, some users care about getting philosophy correct so the AIs help keep them safe and get their "fair share" of the universe until philosophical problems are solved eventually, enough users care about this so that we end up with a mostly good future, and lack of philosophical knowledge doesn’t cause disaster in the meantime. (My writings on "human safety problems" were in part a response to this suggestion, outlining how hard it would be to keep humans "safe" in this scenario.)
The overall argument is that, given human safety problems, realistic competitive pressures, difficulties with coordination, etc., it seems hard to end up in any of these scenarios and not have something go wrong along the way. Maybe another way to put this is, given philosophical difficulties, the target we’d have to hit with AI is even smaller than it might otherwise appear.
Whenever someone says "there are only N ways that X is possible" outside of a mathematical proof, my immediate reaction is "Oh, great, here is another argument from lack of imagination". This seems like a typical case.
Comment
I think I made it pretty clear that these are the N ways that I could come up with, plus M more that others came up with later. Plus, in a later post, I explicitly ask what else might be possible. Did you see any other language I used where I was claiming something stronger than I should have?
If not, would you agree that people trying to solve a problem over some time only to find that all the plausible approaches they could come up with seem quite difficult is useful evidence for that problem being intrinsically difficult?
It might be interesting to consider this argument from an outside view perspective. Can you give a sample of arguments that you think are comparable to this one so we can check how valid they tend to be in retrospect?
Comment
I may have misunderstood, sorry. I thought you gave it near 100% certainty that there could be only 3 ways, not the more reasonable "my knowledge of this problem is so marginal, I can’t give it a good estimate of probability, since it would be drowned in error bars".
One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that "solve philosophical problems well enough to have a pretty good future" is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?
Comment
I tried to make arguments in this direction in Beyond Astronomical Waste and Two Neglected Problems in Human-AI Safety. Did you read them and/or find them convincing? To be clear I do think there’s a significant chance that we could just get lucky and it turns out that solving philosophical problems well enough to have a pretty good future isn’t that hard. (For example maybe it turns out to be impossible or not worthwhile to influence bigger/richer universes so we don’t lose anything even if we never solve that problem.) But from the perspective of trying to minimize x-risk, it doesn’t seem like a good idea to rely on that.
That was "thousands or millions". I think it’s unlikely we’ll need billions of years. :) BTW I think I got the idea of thousands or millions of years of "the long reflection" from William MacAskill’s 80,000 Hours interview, but I’m not sure who was the first to suggest it. (I think it’s fairly likely that we’ll need at least a hundred years which doesn’t seem very different from thousands or millions from a strategic perspective. Not sure if that’s the part that you’re having an issue with.)
Comment
Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of "a pretty good future" that is fine with something like a bunch of human-descended beings living happy lives but missing out on the sort of things mentioned in Beyond Astronomical Waste, and an "optimal future" which includes those considerations). I buy this as an argument that "we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value" rather than "our efforts to get a pretty good future are doomed unless we make tons of progress on this" or something like that. "Thousands of millions" was a typo.
Comment
What about the other post I linked, Two Neglected Problems in Human-AI Safety? A lot more philosophical progress would be one way to solve those problems, and I don’t see many other options.
Planned newsletter opinion: This seems like a real problem, but I’m not sure how important it is. I am most optimistic that the last approach will "just work", where we solve alignment and there are enough overseers who care about getting these questions right that we do solve these philosophical problems. However, I’m very uncertain about this since I haven’t thought about it enough (it seems like a question about humans rather than about AI). Regardless of importance, it does seem to have almost no one working on it and could benefit from more thought.
Comment
For the last approach (corrigibility) to work, besides overseers/users who care about eventually getting philosophy right, it seems like we also need solutions to (intentional and unintentional) AI-mediated value corruption. (This part seems to be at least as much about AIs as about humans.) I don’t think I’ve seen anyone sketch out a plausible solution to these two problems yet (that doesn’t require solving hard philosophical problems like metaphilosophy). Do you agree? If not why not, and if yes why are you optimistic about that approach?
Comment
I’m super unsure about the intentional case, and agree that I want to see more work on that front, but it feels like a particular problem that can be solved with something like strategy/policy work. Put another way, intentional value corruption seems like a non-central example of problems that arise from philosophical difficulty. I agree that corrigibility + good overseers does not clearly solve it.

For the unintentional case, I think that overseers who care about getting philosophy right are going to think about value drift, because many of us are currently thinking about it. It seems like as long as the overseers make this apparent to the AI system and are sufficiently risk-averse, a corrigible AI system would take care not to corrupt their values. (The AI system might fail at this, but this doesn’t seem that likely to me, and it feels very hard to make progress on that particular point without more details on how the AI system works.)

I do think that we want to think about how to ensure that there are overseers who care about getting the questions right, who know about value drift, who will be sufficiently risk-averse, etc.
Comment
What kind of strategy/policy work do you have in mind?
Don’t we usually assume that the AI is ultimately corrigible to the user or otherwise has to cater to the user’s demands, because of competition between different AI providers? In that scenario, the end user also has to care about getting philosophy correct and being risk-averse for things to work out well, right? Or are you imagining some kind of monopoly or oligopoly situation where the AI providers all agree to be paternalistic and keep certain kinds of choices and technologies away from users? If so, how do you prevent AI tech from leaking out (ETA: or being reinvented) and enabling smaller actors to satisfy users’ risky demands? (ETA: Maybe you’re thinking of a scenario that’s more like 4 in my list?)
Another issue is that if AIs are corrigible not to end users but to overseers or their companies, that puts the overseers or companies in positions of tremendous power, which would be corrupting in its own way. I think the risk-averse thing to do would be to not put anyone in such situations, but it’s unclear how that can be accomplished (without other downsides). It seems that in general one could want to be risk-averse but not know how, so just having people be risk-averse doesn’t seem enough to ensure safety.
Yet another issue is that in a fast-moving world, a corrigible AI might need to query the overseer or user about lots of things that it’s unsure about. But it’s unclear what it’s supposed to do if such queries can themselves corrupt the overseer or user. Again, just being risk-averse doesn’t seem to be enough, and I don’t see a good solution within the corrigibility approach that doesn’t involve solving hard philosophical problems.
BTW, Alex Zhu made a similar point in Acknowledging metaphilosophical competence may be insufficient for safe self-amplification.
Comment
It doesn’t have to be an either-or thing, and we could try to attack from both angles at once.
The two approaches to this problem from the technical side that seem most promising to me are: A) solve metaphilosophy well enough that the AI can distinguish between good arguments and merely persuasive arguments and B) use my proposed hybrid approach to recover from corruption after the fact. These would fall under 2 and 3 in terms of the list in the OP.
These were meant to be arguments that approach 5 (corrigibility) is "doomed", and I gave them as a reply to your optimism about approach 5, with the implication that perhaps we should put more effort into some of the other approaches. Of course these arguments aren’t watertight, so I hope there could be some creative technical ways to get around them within approach 5 too, but your statement "I am most optimistic that the last approach will 'just work'" didn’t seem right to me and I wanted to point that out before it went into your newsletter.
Comment
I think no, because using either metaphilosophy or the hybrid approach involving idealized humans, an AI could potentially undo any corruption that happens to the user after it becomes powerful enough (i.e., by using superhuman persuasion or some other method).
Maybe come back to this after we settle the above question?
Comment
Solving metaphilosophy is itself a philosophical problem, so if we haven’t made much progress on metaphilosophy by the time we get human-level AI, AI probably won’t be able to help much with solving metaphilosophy (especially relative to accelerating technological progress).
Implementing the hybrid approach may be more of a technological problem but may still involve hard philosophical problems so it seems like a good idea to look more into it now to determine if that is the case and how feasible it looks overall (and hence how "doomed" approach 5 is, if approach 5 depends on implementing the hybrid approach at some point). Also it seems like a good idea to try to give the hybrid approach as much of a head start as possible, because any value corruption that occurs prior to corrigible AI switching to a hybrid design probably won’t get rolled back.
Maybe I should clarify that I’m not against people working on corrigibility, if they think that is especially promising or they have a comparative advantage for working on that. I mainly don’t want to see statements that are so strongly in favor of approach 5 as to discourage people from looking into the other approaches deeply enough to determine for themselves how promising those approaches are and whether they might be especially suited to working on those approaches. Does that seem reasonable to you?
Comment
Conditioned on metaphilosophy being hard to solve, AI won’t be able to help us with it.
Conditioned on us not trying to solve metaphilosophy, AI won’t be able to help us with it. The first interpretation is independent of whether or not we work on metaphilosophy, so it can’t be an argument for working on metaphilosophy. The second interpretation seems false to me, and not because I think there are many considerations that overall come out to make it false—I don’t see any arguments in favor of it. Perhaps one argument is that if we don’t try to solve metaphilosophy, then AI won’t infer that we care about it, and so won’t optimize for it. But that seems very weak, since we can just say that we do care, and that’s much stronger evidence. We can also point out that we didn’t try to solve the problem because it wasn’t the most urgent one at the time.
Comment
I thought from a previous comment that you already agree with the latter, but sure I can give an argument. It’s basically that the most obvious way of using ML to accelerate philosophical progress seems risky (compared to just having humans do philosophical work) and no one has proposed a better method, so unless this problem is solved in a better way, it looks like we’d have to either accept a faster growing gap between philosophical progress and technological progress, or incur extra risk from using ML to accelerate philosophical progress. See the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy for more details.
Aside from the above argument, I think we could end up creating AIs whose ratio between philosophical ability and technical ability is worse than human, if AI designers simply spent more resources on improving technical ability and neglected philosophical ability in comparison (e.g., because there is higher market demand for technical ability). Considering how much money is currently being invested into making technological progress vs philosophical progress in the overall economy, wouldn’t you expect something similar when it comes to AI? (I guess this is more of an argument for overall pessimism rather than for favoring one approach over another, but I still wanted to point out that I don’t agree with your relative optimism here.)
Comment
> I thought from a previous comment that you already agree with the latter

Yeah, that’s why I said "I probably agreed with this in the past". I’m not sure whether my underlying models changed or whether I didn’t notice the contradiction in my beliefs at the time.
Comment
On the scientific/technological side, you can also use scientific/engineering papers (which I’m guessing has to be at least an order of magnitude greater in volume than philosophy writing), plus you have access to ground truths in the form of experiments and real world outcomes (as well as near-ground truths like simulation results) which has no counterpart in philosophy. My main point is that it seems a lot harder for technological progress to go "off the rails" due to having access to ground truths (even if that data is sparse) so we can push it much harder with ML.
I agree this could be a reason that things turn out well even if we don’t explicitly solve metaphilosophy or do something like my hybrid approach ahead of time. The way I would put it is that humans developed philosophical abilities for some mysterious reason that we don’t understand, so we can’t rule out AI developing philosophical abilities for the same reason. It feels pretty risky to rely on this though. If by the time we get human-level AI, this turns out not to be true, what are we going to do then? And even if we end up with AIs that appear to be able to help us with philosophy, without having solved metaphilosophy how would we know whether it’s actually helping or pushing us "off the rails"?
Comment
A lot of this doesn’t seem specific to AI. Would you agree that AI accelerates the problem and makes it more urgent, but isn’t the primary source of the problem you’ve identified? How would you feel about our chances for a good future if AI didn’t exist (but we still go forward with technological development, presumably reaching space exploration eventually)? Are human safety problems an issue then? Some of the problems, like intentional value manipulation, do seem to become significantly easier.
Comment
Some philosophical problems are specific to AI though, or at least to specific alignment approaches. For example decision theory and logical uncertainty for MIRI’s approach, corrigibility and universality (small core of corrigible and universal reasoning) for Paul’s.
That sounds reasonable but I’m not totally sure what you mean by "primary source". What would you say is the primary source of the problem?
Yeah, sure. I think if AI didn’t exist we’d have a better chance that moral/philosophical progress could keep up with scientific/technological progress but I would still be quite concerned about human safety problems. I’m not sure why you ask this though. What do you think the implications of this are?
Comment
Ah, I think that all makes sense, but next time I suggest saying something like "to check my understanding" so that I don’t end up wondering what conclusions you might be leading me to. :)
Optimistic scenario 6: Technological progress in AI makes difficult philosophical problems much easier. (Lots of overlap with corrigibility). Early examples: Axelrod’s tournaments, Dennett on Conway’s Life as a tool for thinking more clearly about free will.
(This is probably a special case of corrigibility.)
Comment
This seems fairly unlikely to me except insofar as AI acts as a filter that forces us to refine our understanding. The examples you provide arguably didn’t make anything easier; they just made what was already there more apparent to more people. This won’t resolve the fundamental issues, though it may at least make more people aware of them (something, I’ll add, I hope to make more progress on, at least within the community of folks already doing this work, if not within a wider audience, because I continue to see dangerous misunderstandings or ignorance of key ideas, especially regarding epistemology, that pose a threat to successfully achieving AI alignment).
Unfortunately, many philosophical problems may not have solutions of a form that allows us to construct something that definitely is what we want, but rather only permits us to say that something is probably not what we want, due to the fundamental ungroundability of our beliefs. My suspicion is that you are right: the problem is even harder than anyone currently realizes, and the best we can hope for is to winnow away as much of what obviously doesn’t work as possible, while still being left with lots of uncertainty about whether or not we can succeed at our safety objectives.
Everyone choosing how their share of resources is used has the problem that everyone might be horrified at what someone else is doing.
A possible solution: we decide not to solve philosophical problems in an irreversible way (e.g., "tiling the universe with orgasmatronium is good"). This obviously creates astronomical opportunity costs, but it also prevents the astronomical risks of wrong solutions. Local agents solve different problems locally in different periods of time (the same way a normal human changes many philosophical systems and beliefs over the course of their life).