**Update:** I believe that the Counterfactual Prisoner’s Dilemma, which Cousin_it and I discovered independently, resolves this question.

The LessWrong Wiki defines Counterfactual Mugging as follows:
> Omega appears and says that it has just tossed a fair coin, and given that the coin came up tails, it decided to ask you to give it $100. Whatever you do in this situation, nothing else will happen differently in reality as a result. Naturally you don’t want to give up your $100. But Omega also tells you that if the coin came up heads instead of tails, it’d give you $10000, but only if you’d agree to give it $100 if the coin came up tails. Do you give Omega $100?

I expect that most people would say that you should pay, because a 50% chance of $10000 for $100 is an amazing deal according to expected value. I lean this way too, but it is harder to justify than you might think. After all, if you are being asked for $100, you know that the coin came up tails and you won’t receive the $10000. Sure, this means that if the coin had come up heads then you wouldn’t have gained the $10000, but you know the coin wasn’t heads, so you don’t lose anything. It’s important to emphasise: this doesn’t deny that, if the coin had come up heads, refusing *would* have made you miss out on $10000. Instead, it claims that this point is irrelevant, so merely repeating the point again isn’t a valid counter-argument.

You could argue that you would have pre-committed to paying if you had known about the situation ahead of time. True, but you didn’t pre-commit and you didn’t know about it ahead of time, so the burden is on you to justify why you should act as though you did. In Newcomb’s problem you want to have pre-committed, and if you act as though you were pre-committed then you will find that you actually were pre-committed. However, here it is the opposite. Upon discovering that the coin came up tails, you want to act as though you were not pre-committed to pay, and if you act that way, you will find that you actually were indeed not pre-committed.

We could even channel Yudkowsky from Newcomb’s Problem and Regret of Rationality: "Rational agents should WIN… It is precisely the notion that Nature does not care about our algorithm, which frees us up to pursue the winning Way—without attachment to any particular ritual of cognition, apart from our belief that it wins. Every rule is up for grabs, except the rule of winning… Unreasonable? I am a rationalist: what do I care about being unreasonable? I don’t have to conform to a particular ritual of cognition. I don’t have to take only box B because I believe my choice affects the box, even though Omega has already left. I can just… take only box B." You can just not pay the $100. (Vladimir Nesov makes this exact same argument here.)

Here’s another common response I’ve heard, as described by Cousin_it: "I usually just think about which decision theory we’d want to program into an AI which might get copied, its source code inspected, etc. That lets you get past the basic stuff, like Newcomb’s Problem, and move on to more interesting things. Then you can see which intuitions can be transferred back to problems involving humans." That’s actually a very good point. It’s entirely possible that solving this problem doesn’t have any relevance to building AI.
However, I want to note that:

a) it’s possible that a counterfactual mugging situation could have been set up before an AI was built;
b) understanding this could help deconfuse what a decision is—we still don’t have a solution to logical counterfactuals;
c) this is probably a good exercise for learning to cut through philosophical confusion;
d) okay, I admit it, it’s kind of cool and I’d want an answer regardless of any potential application.

Or maybe you just directly care about counterfactual selves? But why? Do you really believe that counterfactuals are in the territory and not the map? So why care about that which isn’t real? Or even if they are real, why can’t we just imagine that you are an agent that doesn’t care about counterfactual selves? If we can imagine an agent that likes being hit on the head with a hammer, why can’t we manage that?

Then there’s the philosophical uncertainty approach. Even if there’s only a 1/50 chance of your analysis being wrong, you should pay. This is great if you face the decision in real life, but not if you are trying to delve into the nature of decisions.

So given all of this, why should you pay?
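For concreteness, assuming the standard 50/50 coin and a perfectly reliable Omega, the ex-ante expected values behind the "amazing deal" intuition are:

$$\mathbb{E}[\text{pay}] = \tfrac{1}{2}(+\$10000) + \tfrac{1}{2}(-\$100) = +\$4950, \qquad \mathbb{E}[\text{refuse}] = \tfrac{1}{2}(\$0) + \tfrac{1}{2}(\$0) = \$0.$$

The whole dispute, of course, is whether this ex-ante calculation should still move you once you already know the coin came up tails.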
I’m most fond of the precommitment argument. You say:

> You could argue that you would have pre-committed to paying if you had known about the situation ahead of time. True, but you didn’t pre-commit and you didn’t know about it ahead of time, so the burden is on you to justify why you should act as though you did. In Newcomb’s problem you want to have pre-committed, and if you act as though you were pre-committed then you will find that you actually were pre-committed. However, here it is the opposite. Upon discovering that the coin came up tails, you want to act as though you were not pre-committed to pay, and if you act that way, you will find that you actually were indeed not pre-committed.

I do not think this gets at the heart of the precommitment argument. You mention cousin_it’s argument that what we care about is what decision theory we’d prefer a benevolent AI to use. You grant that this makes sense for that case, but you seem skeptical that the same reasoning applies to humans. I argue that it does.

When reasoning abstractly about decision-making, I am (in part) thinking about how I would like myself to make decisions in the future. So it makes sense for me to say to myself, "Ah, I’d want to be counterfactually mugged." I will count being-counterfactually-mugged as a point in favor of proposed ways of thinking about decisions; I will count not-being-mugged as a point against. This is not, in itself, a precommitment; this is just a heuristic about good and bad reasoning as it seems to me when thinking about it ahead of time. A generalization of this heuristic is, "Ah, it seems any case where a decision procedure would prefer to make a commitment ahead of time but would prefer to do something different in the moment is a point against that decision procedure." I will, thinking about decision-making in the abstract as things seem to me now, tend to prefer decision procedures which avoid such self-contradictions.

In other words, thinking about what constitutes good decision-making in the abstract seems a whole lot like thinking about how we would want a benevolent AI to make decisions.

You could argue that I might think such things now, and might think up all sorts of sophisticated arguments which fit that picture, but later, when Omega asks me for $100, if I re-think my decision-theoretic concepts at that time, I’ll know better. But, based on what principles would I be reconsidering? I can think of some. It seems to me now, though, that those principles are mistaken, and I should instead reason using principles which are more self-consistent—principles which, when faced with the question of whether to give Omega $100, arrive at the same answer I currently think to be right.

Of course this cannot be a general argument that I prefer to reason by principles which will arrive at conclusions consistent with my current beliefs. What I can do is consider the impact which particular ways of reasoning about decisions have on my overall expected utility (assuming I start out reasoning with some version of expected utility theory). Doing so, I will prefer UDT-like ways of reasoning when it comes to problems like counterfactual mugging.

You might argue that beliefs are for true things, so I can’t legitimately discount ways-of-thinking just because they have bad consequences. But these are ways-of-thinking-about-decisions. The point of ways-of-thinking-about-decisions is winning.
And, as I think about it now, it seems preferable to think about it in those ways which reliably achieve higher expected utility (the expectation being taken from my perspective now). Nor is this a quirk of my personal psychology, that I happen to find these arguments compelling in my current mental state, and so, when thinking about how to reason, prefer methods of reasoning which are more consistent with precommitments I would make. Rather, this seems like a fairly general fact about thinking beings who approach decision-making in a roughly expected-utility-like manner.

Perhaps you would argue, like the CDT-er sometimes does in response to Newcomb, that you cannot modify your approach to reasoning about decisions so radically. You see that, from your perspective now, it would be better if you reasoned in a way which made you accept future counterfactual muggings. You’d see, in the future, that you are making a choice inconsistent with your preferences now. But this only means that you have different preferences then and now. And anyway, the question of decision theory should be what to do given preferences, right?

You can take that perspective, but it seems you must do so regretfully—you should wish you could self-modify in that way. Furthermore, to the extent that a theory of preferences sits in the context of a theory of rational agency, it seems like preferences should be the kind of thing which tends to stay the same over time, not the sort of thing which changes like this. Basically, it seems that, assuming preferences remain fixed, beliefs about what you should do given those preferences and certain information should not change (except due to bounded rationality). That is: certainly I may think I should go to the grocery store but then change my mind when I learn it’s closed. But I should not start out thinking that I should go to the grocery store even in the hypothetical where it’s closed, and then, upon learning it’s closed, go home instead. (Except due to bounded rationality.) That’s what is happening with CDT in counterfactual mugging: it prefers that its future self should, if asked for $100, hand it over; but, when faced with the situation, it thinks it should not hand it over.

The CDTer response ("alas, I cannot change my own nature so radically") presumes that we have already figured out how to reason about decisions. I imagine that the real crux behind such a response is actually that CDT feels like the true answer, so that the non-CDT answer does not seem compelling even once it is established to have a higher expected value. The CDTer feels as if they’d have to lie to themselves to 1-box. The truth is that they could so easily modify themselves, if they thought the non-CDT answer was right! They protest that Newcomb’s problem simply punishes rationality. But this argument presumes that CDT defines rationality.

An EDT agent who asks how best to act in future situations to maximize expected value in those situations will arrive back at EDT, since expected-value-in-the-situation is the very criterion which EDT already uses. However, this is a circular way of thinking—we can make a variant of that kind of argument which justifies any decision procedure. A CDT or EDT agent who asks itself how best to act in future situations to maximize expected value as estimated by its current self will arrive at UDT.
Furthermore, that’s the criterion it seems an agent ought to use when weighing the pros and cons of a decision theory: not the expected value according to some future hypothetical, but the expected value of switching to that decision theory now. And, remember, it’s not the case that we will switch back to CDT/EDT if we reconsider which decision theory is highest-expected-utility when we are later faced with Omega asking for $100. We’d be a UDT agent at that point, and so would consider handing over the $100 to be the highest-EV action.

I expect another protest at this point—that the question of which decision theory gets us the highest expected utility by our current estimation isn’t the same as which one is true or right. To this I respond that, if we ask what highly capable agents would do ("highly intelligent"/"highly rational"), we would expect them to be counterfactually mugged—because highly capable agents would (by the assumption of their high capability) self-modify if necessary in order to behave in the ways they would have precommitted to behave. So this kind of decision theory / rationality seems like the kind you’d want to study to better understand the behavior of highly capable agents, and the kind you would want to imitate if trying to become highly capable.

This seems like an interesting enough thing to study. If there is some other thing, "the right decision theory", to study, I’m curious what that other thing is—but it does not seem likely to make me lose interest in this thing (the normative theory I currently call decision theory, in which it’s right to be counterfactually mugged).
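To make that criterion concrete, here is a minimal sketch (illustrative code; the policy names are placeholders, and the payoffs just follow the setup above) of an agent scoring candidate policies by the expected value of adopting them now, before the coin flip, rather than by their value in the tails-situation taken on its own:

```python
# Minimal sketch: score each candidate policy by the expected utility of
# switching to it *now*, i.e. before the coin is flipped and before anything
# has been observed. Payoffs follow the counterfactual mugging setup above.

def expected_value_now(pays_on_tails: bool) -> float:
    value_if_heads = 10_000 if pays_on_tails else 0  # Omega rewards predicted payers on heads
    value_if_tails = -100 if pays_on_tails else 0    # payers hand over $100 on tails
    return 0.5 * value_if_heads + 0.5 * value_if_tails

candidates = {"pay when asked": True, "refuse when asked": False}
scores = {name: expected_value_now(pays) for name, pays in candidates.items()}

print(scores)                       # {'pay when asked': 4950.0, 'refuse when asked': 0.0}
print(max(scores, key=scores.get))  # 'pay when asked' -- the policy the current self prefers
```

Evaluated this way, the paying policy wins; evaluated only from inside the tails-world, it loses $100, which is exactly the disagreement described above.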
"Basically I keep talking about how "yes you can refuse a finite number of muggings"″ - considering I’m considering the case when you are only mugged once, that sounds an awful lot like saying it’s reasonable to choose not to pay. "But if I’ve considered things ahead of time"—a key part of counterfactual mugging is that you haven’t considered things ahead of time. I think it is important to engage with this aspect or explain why this doesn’t make sense. "And further, there’s the classic argument that you should always consider what you would have committed to ahead of time"—imagine instead of $50 it was your hand being cut off to save your life in the counterfactual. It’s going to be awfully tempting to keep your hand. Why is what you would have committed to, but didn’t relevant? My goal is to understand versions that haven’t been watered down or simplified.
You feel that I’m begging the question. I guess I take thinking only about this counterfactual as the default position, as where an average person is likely to be starting from, and I was trying to see if I could find an argument strong enough to displace it. So I’ll freely admit I haven’t provided a first-principles argument for focusing just on this counterfactual.
I’m new here. May I ask what the core difference is between UDT and FDT? Also, which is better and why?
Here is my understanding. I was not really involved in the events, so take this with a grain of salt; it’s all third hand.

FDT was attempting to be an umbrella term for "MIRI-style decision theories", ie, decision theories which 1-box on Newcomb, cooperate in the twin prisoner’s dilemma, accept counterfactual muggings, grapple with logical uncertainty rather than ignoring it, and don’t require free will (ie, can be implemented as deterministic algorithms without conceptual problems that the decision theory doesn’t provide the tools to handle). The two main alternatives which FDT was trying to be an umbrella term for were UDT and TDT (timeless decision theory). However, the FDT paper leaned far toward TDT ways of describing things—specifically, giving diagrams which look like causal models, and describing the decision procedure as making an intervention on the node corresponding to the output of the decision algorithm. This was too far from how Wei Dai envisions UDT. So FDT ended up being mostly a re-branding of TDT, but with less concrete detail (so FDT is an umbrella term for a family of TDT-like decision theories, but not an umbrella large enough to encompass UDT).

I think of TDT and UDT as about equally capable, but only if TDT does anthropic reasoning. Otherwise, UDT is strictly more capable, because TDT will not pay in counterfactual mugging, since it updates on its observations. FDT cannot be directly compared, because it is simply more vague than TDT.
See here: https://www.lesswrong.com/posts/2THFt7BChfCgwYDeA/let-s-discuss-functional-decision-theory#XvXn5NXNgdPLDAabQ
I find that the "you should pay" answer is confused and self-contradictory in its reasoning. As in all the OO (Omniscient Omega) setups, you, the subject, have no freedom of choice as far as OO is concerned; you are just another deterministic automaton. So any "decision" you make to precommit to a certain action has already been predicted (or could have been predicted) by OO, including any influence exerted on your thought process by other people telling you about rationality and precommitment.

To make it clearer, anyone telling you to one-box in Newcomb’s problem in effect uses classical CDT (which advises two-boxing), because they assume that you have the freedom to make a decision in a setup where your decisions are predetermined. If that were so, two-boxing would make more sense, defying the OO infallibility assumption. So the whole reasoning advocating for one-boxing and for paying the mugger does not hold up to basic scrutiny. A self-consistent answer would be "you are a deterministic automaton; whatever you feel or think or pretend to decide is an artifact of the algorithm that runs you, so the question of whether to pay is meaningless: you either will pay or you will not, and you have no control over it."

Of course, this argument only applies to OO setups. In "reality" there are no OOs that we know of, the freedom-of-choice debate is far from resolved, and if one assumes that we are not automatons whose actions are set in stone (or in the rules of quantum mechanics), then learning to make better decisions is not a futile exercise. One example is the twin prisoner dilemma, where the recommendation to cooperate with one’s twin is self-consistent.
Newcomb’s paradox still works if Omega is not infallible, just right a substantial proportion of the time. Between the two extremes you have described, of free choice, unpredictable by Omega, and deterministic absence of choice, lies people’s real psychology. Just what is my power to sever links of a causal graph that point towards me? If I am faced with a wily salesman, how shall I be sure of making my decision to buy or not by my own values, taking into account what is informative from the salesman, but uninfluenced by his dark arts? Do I even know what my own values are? Do I have values? When QRO (Quite Reliable Omega) faces me, and I choose one box or two, how can I tell whether I really made that decision? Interactions between people are mostly Newcomb-like. People are always thinking about who the other person is and what they may be thinking, and aiming their words to produce desired results. It is neither easy nor impossible, but a difficult thing, to truly make a decision.
Again, we seem to just have foundational disagreements here. Free will is one of those philosophical topics that I lost interest in a long time ago, so I’m happy to leave it to others to debate.
It’s not philosophical in an OO setup, it’s experimentally testable, so you cannot simply ignore it.
One way of experimenting with this would be to use simulable agents (such as RL agents). We could set up the version where Omega is perfectly infallible (simulate the agent 100% accurately, including any random bits) and watch what different decision procedures do in this situation. So, we can set up OO situations in reality. If we did this, we could see agents both 1-boxing and 2-boxing. We would see 1-boxers get better outcomes. Furthermore, if we were designing agents for this task, designing them to 1-box would be a good strategy. This seems to undermine your position that OO situations are self-contradictory (since we can implement them on computers), and that the advice to 1-box is meaningless. If we try to write a decision-making algorithm based on
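For instance, here is a minimal sketch (illustrative code; the agent and function names are placeholders) of the perfectly-infallible-Omega version of Newcomb’s problem: Omega predicts a deterministic agent by running an exact copy of it, and agents designed to 1-box come out ahead:

```python
# Minimal sketch: Newcomb's problem where Omega is infallible because it
# predicts the (deterministic) agent by simulating an exact copy of it.

def play_newcomb(agent) -> int:
    prediction = agent()   # Omega's prediction: run a copy of the agent
    box_b = 1_000_000 if prediction == "one-box" else 0
    choice = agent()       # the real agent chooses without seeing box B's contents
    return box_b if choice == "one-box" else box_b + 1_000  # two-boxers also take the visible $1,000

one_boxer = lambda: "one-box"
two_boxer = lambda: "two-box"

print(play_newcomb(one_boxer))  # 1000000
print(play_newcomb(two_boxer))  # 1000
```

The setup runs without contradiction, and the agents built to 1-box simply end up richer.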
Yes, we could definitely implement that!
Ok, this helps me understand your view better. But not completely. I don’t think there is such a big difference between the agent and the agent-designer.
- The alternatives are fake (counterfactuals are subjective), but
- The problem is real,
- The agent has to make a choice,
- There are better and worse ways of reasoning about that choice—we can see that agents who reason in one way or another do better/worse,
- It helps to study better and worse ways of reasoning ahead of time (whether that’s by ML algorithms learning, or humans abstractly reasoning about decision theory).

So it seems to me that this is very much like any other sort of hypothetical problem which we can benefit from reasoning about ahead of time (e.g., "how to build bridges"). The alternatives are imaginary, but the problem is real, and we can benefit from considering how to approach it ahead of time (whether we’re human or sufficiently advanced NPC).
Sounds testable in theory, but not in practice.
The test is the fact that OOs exist in that universe.
Just ask which algorithm wins, then. At least in these kinds of situations UDT does better. The only downside is the algorithm has to check if it’s in this kind of situation; it might not be worth practicing.
If you are in this situation, you have the practical reality that paying the $100 loses you $100 and a theoretical argument that you should pay anyway. If you apply "just ask which algorithm wins" and you mean the practical reality of the situation described, then you wouldn’t choose UDT. If you instead take "just ask which algorithm wins" to mean setting up an empirical experiment, then you’d have to decide whether to consider all agents who encounter the coin flip, or only those who see tails, at which point there is no need to run the experiment. If you are instead proposing figuring out which algorithm wins according to theory, then that’s a bit of a tautology, as that’s what I’m already trying to do.
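To illustrate the second option, here is a minimal sketch (illustrative code; the setup just mirrors the payoffs above) showing that which algorithm "wins" the empirical experiment depends entirely on whether you average over all coin flips or only over the runs where the agent sees tails:

```python
# Minimal sketch: the same simulated muggings, scored two different ways.
import random

def run_once(policy_pays: bool) -> tuple[str, int]:
    """One counterfactual mugging with an Omega that predicts the agent perfectly."""
    if random.random() < 0.5:
        return "heads", (10_000 if policy_pays else 0)  # Omega pays predicted payers
    return "tails", (-100 if policy_pays else 0)        # the agent is asked for $100

def average_payoff(policy_pays: bool, only_tails: bool, trials: int = 100_000) -> float:
    runs = [run_once(policy_pays) for _ in range(trials)]
    if only_tails:
        runs = [run for run in runs if run[0] == "tails"]
    return sum(payoff for _, payoff in runs) / len(runs)

for name, pays in [("pay", True), ("refuse", False)]:
    print(name,
          round(average_payoff(pays, only_tails=False)),  # over all flips: pay ~ +4950, refuse 0
          round(average_payoff(pays, only_tails=True)))   # tails-only: pay -100, refuse 0
```

Picking either averaging convention already encodes the answer, which is the point above: running the experiment doesn’t tell you which convention was the right one to pick.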