[Intro to brain-like-AGI safety] 7. From hardcoded drives to foresighted plans: A worked example

https://www.lesswrong.com/posts/zXibERtEWpKuG5XAC/intro-to-brain-like-agi-safety-7-from-hardcoded-drives-to

7.1 Post summary / Table of contents

The previous post presented a big picture of how I think motivation works in the human brain, but it was a bit abstract. In this post, I will walk through an example. To summarize, the steps will be:

1. Over my lifetime, my cortex builds up a probabilistic generative world-model via predictive learning (Section 7.3);
2. When I bite into a slice of prinsesstårta for the first time, credit assignment flags the corresponding world-model concept as predictive of reward, sweet taste, salivation, etc. (Section 7.4);
3. From then on, reward-shaping turns thoughts that activate that concept into full-blown foresighted plans to get more prinsesstårta (Section 7.5).

7.2 Reminder from the previous post: big picture of motivation and decision-making

From the previous post, here’s my diagram of motivation in the brain:

*See previous post for details. All acronyms are brain parts—they don’t matter for this post.*

As also discussed in the previous post, we can split this up by which parts are "hardcoded" by the genome, versus learned within a lifetime—i.e., Steering Subsystem versus Learning Subsystem:

7.3 Building a probabilistic generative world-model in the cortex

The first step in our story is that, over my lifetime, my cortex (specifically, the Thought Generator in the top-left of the diagram above) has been building up a probabilistic generative model, mostly by predictive learning of sensory inputs (Post #4, Section 4.7) (a.k.a. "self-supervised learning"). Basically, we learn patterns in our sensory input, and patterns in the patterns, etc., until we have a nice predictive model of the world (and of ourselves)—a giant web of interconnected entries like "grass" and "standing up" and "slices of prinsesstårta".

Predictive learning of sensory inputs is not fundamentally dependent on supervisory signals from the Steering Subsystem. Instead, "the world" provides the ground truth about whether a prediction was correct. Contrast this with, for example, navigating the tradeoff between searching-for-food versus searching-for-a-mate: there is no "ground truth" in the environment for whether the animal is trading off optimally, except after generations of hindsight. In that case, we do need supervisory signals from the Steering Subsystem, which estimate the "correct" tradeoff using heuristics hardcoded by evolution.

You can kinda think of the is/ought divide, with the Steering Subsystem providing the "ought" ("to maximize genetic fitness, what ought the organism to do?") and predictive learning of sensory inputs providing the "is" ("what is likely to happen next, under such-and-such circumstances?"). That said, the Steering Subsystem is *indirectly* involved even in predictive learning of sensory inputs—for example, I can be motivated to go learn about a topic.

Anyway, every thought I can possibly think, and every plan I can possibly plan, can be represented as some configuration of this generative world-model data structure. The data structure is also continually getting edited, as I learn and experience new things. When you think of this world-model data structure, imagine many terabytes of inscrutable entries—imagine things like, for example, "PATTERN 847836 is defined as the following sequence: {PATTERN 278561, then PATTERN 657862, then PATTERN 128669}." Some entries have references to sensory inputs and/or motor outputs. And that giant inscrutable mess comprises my entire understanding of the world and myself.
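If you want something concrete to picture, here’s a toy sketch in Python. It’s purely illustrative: the pattern IDs, the dict layout, and the update rule are all made up for this example, not claims about how the cortex actually stores things.

```python
# Purely illustrative toy: the world-model as a web of interconnected
# pattern entries. A real one would have billions of inscrutable
# entries; these IDs and fields are made up for the example.
world_model = {
    # A pattern defined as a sequence of sub-patterns:
    "PATTERN 847836": {
        "kind": "sequence",
        "parts": ["PATTERN 278561", "PATTERN 657862", "PATTERN 128669"],
    },
    # Some entries bottom out in references to sensory inputs or motor outputs:
    "PATTERN 278561": {"kind": "sensory", "channel": "vision", "feature": 1042},
    "PATTERN 657862": {"kind": "motor", "channel": "arm", "command": 7},
    "PATTERN 128669": {"kind": "sensory", "channel": "taste", "feature": 88},
}

def predict_next(pattern_id: str, position: int):
    """Predictive learning in caricature: given where we are within a known
    sequence, predict the next sub-pattern. The world itself then provides
    the ground truth about whether the prediction was correct."""
    entry = world_model[pattern_id]
    if entry["kind"] == "sequence" and position + 1 < len(entry["parts"]):
        return entry["parts"][position + 1]
    return None

def update_on_surprise(pattern_id: str, position: int, observed: str):
    """When a prediction misses, edit the data structure: this is the
    continual editing mentioned above, vastly simplified."""
    predicted = predict_next(pattern_id, position)
    if predicted is not None and predicted != observed:
        world_model[pattern_id]["parts"][position + 1] = observed
```

Note that nothing in this sketch consults a reward signal: the "supervision" comes entirely from comparing predictions against what actually happens next, which is the is/ought point above.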

7.4 Credit assignment when I first bite into the cake

As I mentioned at the top, on a fateful day two years ago, I ate a slice of prinsesstårta, and it was really good.

Step back to a couple seconds earlier, as I was bringing the cake towards my mouth to take my first-ever bite. At that moment, I didn’t yet have any particularly strong expectation of what it would taste like, or how it would make me feel. But once it was in my mouth, mmmmmmm, oh wow, that’s good cake.

*Relevant parts of the diagram for what happened when I took my first surprisingly-delicious bite of prinsesstårta, two years ago.*

So, as I took that bite, my body had a suite of autonomic reactions—releasing certain hormones, salivating, changing my heart rate and blood pressure, etc. Why? The key is that, as described in Post #3, Section 3.2.1, all sensory inputs split, with one copy going to the Learning Subsystem and another copy going straight to the Steering Subsystem. So my Steering Subsystem tasted the cake directly, recognized the taste (via hardcoded circuitry) as very good, and responded by triggering those autonomic reactions, along with a big burst of reward. That burst, in turn, acted as a supervisory signal for credit assignment: the Thought Assessors updated themselves so that, from then on, the concepts active at that moment ("slice of prinsesstårta", "me eating it", etc.) would be flagged as predictive of reward, sweet taste, salivation, and so on.
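Here’s that credit-assignment step as a cartoon in code. Again, purely illustrative: the linear assessor, the learning rule, and the concept index are my made-up stand-ins, not claims about the actual neural circuitry.

```python
import numpy as np

# Toy "Thought Assessor": a linear map from currently-active world-model
# concepts to a predicted signal (here, just "reward"; the real brain
# runs parallel assessors for sweet taste, salivation, etc.).
n_concepts = 1000
weights = np.zeros(n_concepts)      # starts out predicting nothing
PRINSESSTARTA = 847                 # made-up index for the concept

def assess(active_concepts: np.ndarray) -> float:
    """Predicted reward, given which concepts are active and how strongly."""
    return float(weights @ active_concepts)

# The moment of the first bite: the "eating prinsesstårta" concept is
# fully active, and the Steering Subsystem reports a big actual reward.
thought = np.zeros(n_concepts)
thought[PRINSESSTARTA] = 1.0
actual_reward = 10.0

# Credit assignment: a supervised update toward the ground-truth signal.
learning_rate = 0.5
surprise = actual_reward - assess(thought)      # large, since we predicted 0
weights += learning_rate * surprise * thought

# From now on, even weakly activating the concept (e.g., in imagination)
# yields a proportional prediction: "prepare for reward, salivate, ..."
weak_imagining = np.zeros(n_concepts)
weak_imagining[PRINSESSTARTA] = 0.2
print(assess(weak_imagining))   # 1.0 (= 0.2 * 5.0), weak but nonzero
```

The last few lines foreshadow the next section: once credit assignment has happened, weakly activating the concept produces a weak (but nonzero) prediction, and strongly activating it produces a strong one.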

7.5 Planning towards goals via reward-shaping

I don’t have a particularly rigorous model for this step, but I think I can lean on intuitions a bit, in order to fill in the rest of the story.

Remember, ever since my first bite of prinsesstårta two years ago, the Thought Assessors in my brain have been inspecting each thought I think, checking whether the "myself eating prinsesstårta" concept in my world-model is "lit up" / "activated", and to the extent that it is, issuing a suggestion to prepare for rewards, salivation, goosebumps, and so on.

The diagram above suggests a series of thoughts that I think would "light up" the world-model concept more and more, as we go from top to bottom. To get the intuition here, maybe try replacing "prinsesstårta" with "super-salty cracker". Then go down the list, and try to feel how each thought would make you salivate more and more. Or better yet, replace "eating prinsesstårta" with "asking my crush out on a date", go down the list, and try to feel how each thought makes your heart rate jump up higher and higher.

Here’s another way to think about it: if you imagine the world-model being vaguely like a PGM, you can imagine that the "degree of pattern-matching" corresponds roughly to the probability assigned to the "eating prinsesstårta" node in the PGM. For example, if you’re confident in X, and X weakly implies Y, and Y weakly implies Z, and Z weakly implies "eating prinsesstårta", then "eating prinsesstårta" gets a very low but nonzero probability (say, 0.9 × 0.3 × 0.3 × 0.3 ≈ 0.02, to make up some numbers), a.k.a. weak activation, and this is akin to having a far-fetched but not completely impossible plan to eat prinsesstårta. (Don’t take this paragraph too literally, I’m just trying to summon intuitions here.)

I’m really hoping this kind of thing is intuitive. After all, I’ve seen it reinvented numerous times! For example, David Hume: "The first circumstance, that strikes my eye, is the great resemblance betwixt our impressions and ideas in every other particular, except their degree of force and vivacity." And here’s William James: "It is hardly possible to confound the liveliest image of fancy with the weakest real sensation." In both these cases, I think the authors are gesturing at the idea that imagination activates some of the same mental constructs (latent variables in the world-model) as perception does, but that imagination activates them more weakly than perception.

OK, if you’re still with me, let’s go back to my decision-making model, now with different parts highlighted:

*Relevant parts of the diagram for the process of making and executing a foresighted plan to procure prinsesstårta.*

Again, every time I think a thought, the Steering Subsystem looks at the corresponding "scorecard", and issues a corresponding reward. Recall also that the active thought / plan gets thrown out when its reward prediction error (RPE) is negative, and it gets kept and strengthened when its RPE is positive.

I’ll oversimplify for a second, and ignore everything except the value function (a.k.a. the "will lead to reward" Thought Assessor). And I’ll also assume the Steering Subsystem just defers to that proposed value, rather than overruling it (see Post #6, Section 6.4.1). In this case, each time our thoughts move down a notch on the purple arrow diagram above—from idle musing about prinsesstårta, to a hypothetical plan to get prinsesstårta, to a decision to get prinsesstårta, etc.—there’s an *immediate positive* RPE, so that the new thought gets strengthened, and gets to establish itself.
And conversely, each time we move back up the list—from decision, to hypothetical plan, to idle musing—there’s an *immediate negative* RPE, so that thought gets thrown out and we go back to whatever we were thinking before. It’s a ratchet! The system naturally pushes its way down the list, making and executing a good plan to eat cake.

So there you have it! From this kind of setup, I think we’re well on the way to explaining the full suite of behaviors associated with humans doing foresighted planning towards explicit goals—including knowing that you have the goal, making a plan, pursuing instrumental strategies as part of the plan, replacing good plans with even better plans, updating plans as the situation changes, pining in vain for unattainable goals, and so on.
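If it helps, here’s the ratchet in toy-simulation form, under the same oversimplifications. The list of thoughts and their activation levels are made up; a thought’s "value" is just how strongly it activates the goal concept, and the RPE rule above decides whether each candidate thought sticks.

```python
import random

# Candidate thoughts, with made-up numbers for how strongly each one
# activates the "eating prinsesstårta" concept (increasing down the list).
THOUGHTS = [
    ("idle musing about prinsesstårta",        0.1),
    ("hypothetical plan to get prinsesstårta", 0.3),
    ("decision to get prinsesstårta",          0.6),
    ("executing the plan: dialing the bakery", 0.9),
]

def value(activation: float) -> float:
    """Oversimplified value function: the "will lead to reward" Thought
    Assessor just reports the goal concept's activation level."""
    return activation

current = THOUGHTS[0]
for _ in range(20):
    candidate = random.choice(THOUGHTS)    # the mind wanders to a new thought
    rpe = value(candidate[1]) - value(current[1])  # reward prediction error
    if rpe >= 0:
        current = candidate   # positive RPE: the new thought is kept
    # negative RPE: the candidate is thrown out; we keep thinking `current`

print(current[0])   # almost always: "executing the plan: dialing the bakery"
```

The one-way drift is the whole point: negative-RPE thoughts can’t displace the incumbent, so random mind-wandering gets rectified into steady progress down the list.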

7.5.1 The other Thought Assessors. Or: The heroic feat of ordering a cake for next week, when you’re feeling nauseous right now

By the way, what of the other Thought Assessors? Prinsesstårta, after all, is not just associated with "will lead to rewards", but also "will lead to sweet taste", "will lead to salivation", etc. Do those play any role?

Sure! For one thing, as I bring the fork towards my mouth, on the verge of consummating my cake-eating plan, I’ll start salivating and releasing cortisol in preparation.

But what about the process of foresighted planning (calling the bakery etc.)? I think the other, non-value-function, Thought Assessors are relevant there too—at least to some extent.[1]

For example, imagine you’re feeling terribly nauseous. Of course your Steering Subsystem *knows* that you’re feeling terribly nauseous. And then suppose it sees you thinking a thought that seems to be leading towards eating. In that case, the Steering Subsystem may say: "That’s a terrible thought! Negative reward!"

OK, so you’re feeling nauseous, and you pick up the phone to place your order at the bakery. This thought gets weakly but noticeably flagged by the Thought Assessors as "likely to lead to eating". Your Steering Subsystem sees that and says "Boo, given my current nausea, that seems like a bad thought." It will feel a bit aversive. "Yuck, I’m really ordering this huge cake??" you say to yourself. Logically, you know that come next week, when you actually receive the cake, you won’t feel nauseous anymore, and you’ll be delighted to have the cake. But still, right now, you feel kinda gross and unmotivated to order it.

Do you order the cake anyway? Sure! Maybe the value function (a.k.a. the "will lead to reward" Thought Assessor) is strong enough to overrule the effects of the "will lead to eating" Thought Assessor. Or maybe you call up a different motivation: you imagine yourself as the kind of person who has good foresight and makes good sensible decisions, and who isn’t stuck in the moment. That’s a different thought in your head, which consequently activates a different set of Thought Assessors, and maybe that gets high value from the Steering Subsystem. Either way, you do in fact call the bakery to place the cake order for next week, despite feeling nauseous right now. What a heroic act!
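As a final toy sketch (made-up weights and numbers, not the actual circuitry), here’s how the Steering Subsystem might combine the scorecard entries given its current physiological state, and why both of those escape routes can work:

```python
from typing import Dict

def steering_reward(scorecard: Dict[str, float], nauseous: bool) -> float:
    """Toy Steering Subsystem: combine the Thought Assessors' scorecard
    into a reward. The context-dependence is hardcoded: thoughts flagged
    as leading to eating are penalized while nauseous. Weights made up."""
    eating_weight = -2.0 if nauseous else 1.0
    return (1.0 * scorecard.get("will lead to reward", 0.0)
            + eating_weight * scorecard.get("will lead to eating", 0.0))

# Thought A: "order the cake for next week". High long-run value, but
# weakly flagged as leading to eating, so it feels a bit aversive now.
order_cake = {"will lead to reward": 0.8, "will lead to eating": 0.2}

# Thought B: "I'm the kind of person who plans ahead". Same decision,
# reframed so that it barely activates the eating-related Assessor.
sensible_me = {"will lead to reward": 0.9, "will lead to eating": 0.05}

print(steering_reward(order_cake, nauseous=True))    # 0.4: still positive
print(steering_reward(sensible_me, nauseous=True))   # 0.8: even better
# Either way the thought survives: the value function can overrule the
# "will lead to eating" penalty, or the reframed thought sidesteps it.
```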