Value loading in the human brain: a worked example

https://www.lesswrong.com/posts/iMM6dvHzco6jBMFMX/value-loading-in-the-human-brain-a-worked-example

Background on motivation and brainstem-supervised learning

As required background, here’s my diagram of decision-making, motivation, and reinforcement learning in the brain. See A model of decision-making in the brain (the short version) for a walkthrough. All acronyms are brain parts—they don’t matter for this post. The left side involves reinforcement learning; the right side involves supervised learning (SL).

For the SL part, I think the brain has dozens-to-hundreds of nearly-identical SL algorithms, each with a different supervisory signal. Maybe there’s one SL algorithm for each autonomic action, or something like that. By the way, in the context of AGI safety, I vote for putting in thousands of SL algorithms if we can! Let’s have one for every friggin’ word in the dictionary! Or heck, millions of them! Or infinity!! Why not an SL algorithm for every point in GPT-3’s latent space? Let’s go nuts! More dakka!!

An important aspect of this, for the purposes of this post, is that I’m suggesting that some parts (the hypothalamus and brainstem) are (to a first approximation) entirely genetically-hardcoded, while other parts (the "plan proposer" and "plan assessors") are AFAICT "trained models" in ML terminology—they’re initialized from random weights (or something equivalent) at birth, and learned within a lifetime. (See discussion of "learning-from-scratch-ism" here.) Here’s an illustration: (The reward prediction error on the left comes from subtracting a trained-model output from a genetically-hardcoded algorithm output, so I left it uncolored.)
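To make that division of labor concrete, here’s a minimal toy sketch in Python. Everything in it (the sizes, the "hunger" variable, the particular reward rule) is invented for illustration; the real point is just the split between trained parts and hardcoded parts, with the reward prediction error as the difference between the two.

```python
# Toy sketch of the division of labor (all names, sizes, and the particular
# reward rule are invented for illustration):
#   - the "plan assessors" and the reward predictor are trained models
#     (near-random weights at birth, learned within a lifetime)
#   - the hypothalamus/brainstem reward function is genetically hard-coded
#   - reward prediction error (RPE) = hard-coded output minus trained output
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 16   # size of a "thought" representation (arbitrary)
N_ASSESSORS = 5   # e.g. salivation, heart rate, certain hormones, ...

# Trained models: start from (near-)random weights, get learned within a lifetime.
assessor_weights = rng.normal(0.0, 0.01, size=(N_ASSESSORS, N_FEATURES))
reward_predictor = rng.normal(0.0, 0.01, size=N_FEATURES)

def plan_assessors(thought):
    """Learned supervised models: thought -> "scorecard" of predictions."""
    return assessor_weights @ thought

def brainstem_reward(scorecard, body_state):
    """Genetically hard-coded: a fixed function of the scorecard and the
    current physiological state; not learned."""
    return body_state["hunger"] * scorecard[0] + 0.5 * scorecard[1:].sum()

def reward_prediction_error(thought, scorecard, body_state):
    """RPE = hard-coded reward minus the learned reward prediction."""
    return brainstem_reward(scorecard, body_state) - reward_predictor @ thought

thought = rng.normal(size=N_FEATURES)   # some thought/plan
scorecard = plan_assessors(thought)
print(reward_prediction_error(thought, scorecard, {"hunger": 1.0}))
```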

1. Building a probabilistic generative world-model in the cortex

The first step is that, over my lifetime, my cortex builds up a probabilistic generative model, mostly by self-supervised (a.k.a. predictive) learning. Basically, we learn patterns in our sensory input, and patterns in the patterns, etc., until we have a nice predictive model of the world (and of ourselves)—a giant web of interconnected entries like "grass" and "standing up" and "slices of prinsesstårta cake" and so on. (Note that I left predictive learning off of the diagram above. Sorry! I didn’t want it to be too busy. Anyway, predictive learning lives inside the "plan proposer".)

A plan is just a special case of a "thought", and a "thought" is some configuration of this generative world-model. Every thought I can possibly think, and every plan I can possibly plan, can be represented as some configuration of this world-model data structure. The data structure is also continually getting edited, as I learn and experience new things.

When you think of this data structure, imagine many gigabytes or terabytes of inscrutable entries like "PATTERN 8472836 is defined as the sequence PATTERN 278561 followed by PATTERN 6578362 followed by...", or whatever. Some entries have references to sensory inputs or motor outputs. And that giant inscrutable mess comprises my entire understanding of the world and myself.
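If it helps, here’s the same idea as a toy data structure in code. The entries and fields are invented, and the real thing would be learned connection weights rather than hand-written records; this is just to make the shape of that inscrutable mess concrete.

```python
# Toy rendering of the world-model data structure (entries and fields invented;
# the real thing would be learned weights, not hand-written records).
from dataclasses import dataclass, field

@dataclass
class Pattern:
    pattern_id: int
    sequence: list[int] = field(default_factory=list)      # "...defined as the sequence PATTERN x, PATTERN y, ..."
    sensory_refs: list[str] = field(default_factory=list)  # some entries reference sensory inputs
    motor_refs: list[str] = field(default_factory=list)    # ...or motor outputs

world_model: dict[int, Pattern] = {
    8472836: Pattern(8472836, sequence=[278561, 6578362]),
    278561:  Pattern(278561, sensory_refs=["visual input #12"]),
    6578362: Pattern(6578362, motor_refs=["jaw motor command #3"]),
}

# A "thought" (or plan) is then some configuration of this data structure,
# e.g. a mapping from pattern IDs to activation strengths:
thought = {8472836: 0.9, 278561: 0.4}
```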

2. Credit assignment when I first bite into the cake

As I mentioned, two years ago I ate a slice of prinsesstårta cake, and it was really good. Rewind to a couple of seconds earlier, as I was bringing the cake towards my mouth to take my first-ever bite. At that moment, I didn’t yet have any particularly strong expectation of what it would taste like, or how it would make me feel. But once it was in my mouth, mmmmmmm, yummy. So, as I took that bite, my body had a suite of autonomic reactions—releasing certain hormones, salivating, changing my heart rate and blood pressure, etc. etc. Why? The key is that, as a rule, all sensory inputs split: one copy goes to the learned world-model in my cortex, and another copy goes straight to my genetically-hardcoded hypothalamus & brainstem, which recognized the taste and triggered those reactions directly, no learning required. That hardcoded reaction then acts as a supervisory ("ground truth") signal for the corresponding "plan assessors": credit gets assigned to whatever I was just thinking (namely, the "myself eating prinsesstårta" concept), so that from then on, thoughts which pattern-match to that concept come with a prediction of those same reactions.
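Here’s a minimal sketch of that credit-assignment step, with all numbers and names invented: the assessor is a little supervised learner, and its supervisory signal is the reaction the brainstem actually triggered.

```python
# Toy credit-assignment step (all numbers invented): the assessor is a small
# supervised learner; the supervisory ("ground truth") signal is the autonomic
# reaction the brainstem actually triggered.
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES = 16
LEARNING_RATE = 0.05

salivation_assessor = rng.normal(0.0, 0.01, size=N_FEATURES)

def assessor_update(weights, thought, actual_reaction):
    """One supervised-learning step: nudge the prediction for this thought
    toward the reaction that actually happened."""
    error = actual_reaction - weights @ thought
    return weights + LEARNING_RATE * error * thought

# Before the first bite, the "bringing cake to my mouth" thought predicts ~0 salivation.
thought_cake = rng.normal(size=N_FEATURES)
print("before:", salivation_assessor @ thought_cake)

# The bite lands, the brainstem triggers salivation (ground truth = 1.0), and
# credit is assigned to the thought that was active just beforehand.
salivation_assessor = assessor_update(salivation_assessor, thought_cake, 1.0)
print("after: ", salivation_assessor @ thought_cake)
```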

3. Planning towards goals via reward-shaping

I don’t have a particularly rigorous model for this step, but I think I can lean on intuitions a bit, in order to fill in the rest of the story: Remember, ever since my first bite of prinsesstårta two years ago in Step 2, the "plan assessors" in my brain have been looking at each thought I think, pattern-matching that thought to the "myself eating prinsesstårta" world-model concept, and to the extent that it’s a match, issuing a suggestion to prepare for delightful hormones, salivation, goosebumps, and so on.

The diagram above suggests a series of thoughts that I think would "pattern-match" better and better, from top to bottom. To get the intuition here, maybe try replacing "prinsesstårta" with "super-salty cracker". Then go down the list, and try to feel how each thought would make you salivate more and more. Or better yet, replace "eating prinsesstårta" with "asking my crush out on a date", go down the list, and try to feel how each thought makes your heart rate jump up higher and higher.

Here’s another way to think about it: If you imagine the world-model being vaguely like a PGM (probabilistic graphical model), you can imagine that the "degree of pattern-matching" corresponds roughly to the probability assigned to the "eating prinsesstårta" node in the PGM. For example, if you’re confident in X, and X weakly implies Y, and Y weakly implies Z, and Z weakly implies "eating prinsesstårta", then "eating prinsesstårta" gets a very low but nonzero probability, a.k.a. weak activation, and this is kinda like having a far-fetched but not completely impossible plan to eat prinsesstårta. (Don’t take this paragraph too literally, I’m just trying to summon intuitions here.)

OK, if you’re still with me, let’s go back to my decision-making model, now with different parts highlighted: Again, every time I think a thought, the hypothalamus & brainstem look at the corresponding "scorecard", and issue a corresponding reward. Recall also (see here) that the active thought/plan gets thrown out when its reward prediction error (RPE) is negative, and it gets kept and strengthened when its RPE is positive.

Let’s oversimplify for a second, and say that the relevant prinsesstårta-related "assessments" comprise just one entry on the scorecard: "Will lead to feel-good hormones". And let’s also assume the brainstem follows the simple rule: "The higher that a plan/thought scores on the ‘Will lead to feel-good hormones’ assessment, the higher the reward I’m gonna give it". Well in that case, each time our thoughts move down the ranked list above—from idle musing about prinsesstårta, to a far-fetched plan to get prinsesstårta, to a plausible plan to get prinsesstårta, etc.—there’s an *immediate positive* RPE, so that the new thought gets strengthened, and gets to establish itself. And conversely, each time we move back up the list—from plausible plan to far-fetched plan to idle musing—there’s an *immediate negative* RPE, so that thought gets thrown out and we go back to whatever we were thinking before. It’s a ratchet! The system naturally pushes its way down the list, making and executing a good plan to eat cake.

(By the way, the plan-proposing algorithm on the top-left is NOT trying to maximize the sum of future rewards—see here, specifically the discussion of TD learning. Instead, its job is more like "maximize RPE right now".)

So there you have it!
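If pseudocode helps, here’s a toy version of that ratchet. The list of thoughts, their scores, and the one-entry scorecard are all made up; the point is only that positive-RPE proposals get kept and negative-RPE proposals get thrown out.

```python
# Toy version of the ratchet (thoughts, scores, and the one-entry scorecard are
# all invented): positive-RPE proposals get kept, negative-RPE proposals get
# thrown out, so the process drifts down the list toward the concrete plan.
import random

ranked_thoughts = [
    ("idle musing about prinsesstårta",       0.1),
    ("far-fetched plan to get prinsesstårta", 0.3),
    ("plausible plan to get prinsesstårta",   0.6),
    ("concrete plan: order the cake today",   0.9),
]

def brainstem_reward(assessment):
    # Oversimplified rule from the text: the higher the "will lead to
    # feel-good hormones" assessment, the higher the reward.
    return assessment

random.seed(0)
current_thought, current_assessment = "whatever I was thinking before", 0.0
for _ in range(20):
    proposal, assessment = random.choice(ranked_thoughts)  # plan proposer suggests a thought
    rpe = brainstem_reward(assessment) - brainstem_reward(current_assessment)
    if rpe > 0:
        # Positive RPE: the new thought gets kept and strengthened.
        current_thought, current_assessment = proposal, assessment
    # Negative (or zero) RPE: the proposal gets thrown out; stick with the old thought.

print(current_thought)  # almost surely the most concrete plan by now
```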
From this kind of setup, I think we’re well on the way to explaining the full suite of behaviors associated with humans doing foresighted planning towards explicit goals—including knowing that you have the goal, making a plan, pursuing instrumental strategies as part of the plan, replacing good plans with even better plans, updating plans as the situation changes, pining in vain for unattainable goals, and so on.

By the way, I oversimplified above by reducing the actual suite of prinsesstårta-related assessments to just "will lead to feel-good hormones". In reality, it’s more specific than that—probably some assessment related to salivating, and some other assessment related to releasing certain digestive enzymes, and various hormones, and goosebumps, and who knows what else. Why does that matter? Well, imagine you’re feeling nauseous. Of course your hypothalamus & brainstem know that you’re feeling nauseous. And meanwhile the assessment functions are telling the hypothalamus & brainstem that this plan will lead to eating food. Some hardwired circuit says: "That’s bad! Whatever thought you’re thinking, I’m gonna dock some reward points for its possibly leading to eating, in my current state of nausea." ...And indeed, I think you’ll find that you’re much less intrinsically motivated to make plans to get prinsesstårta when you’re currently feeling nauseous.

Maybe you’ll do it anyway, because thoughts can be quite complicated, and you have *other* motivations in life, and those motivations also feed into these assessment functions and get weighed by the brainstem. So maybe you’ll proceed with the plan, driven by the motivation of "trying to avoid later regret, if I miss the deadline to order prinsesstårta in time for the party next week". So you’ll order the cake anyway, despite it feeling kinda gross right now.
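Here’s the nausea point as a toy reward rule, again with invented names and weights; the only point is that the hardwired circuit weighs several assessments at once, and the current body state changes how it weighs them.

```python
# The nausea point as a toy reward rule (names and weights invented): the
# hard-wired circuit docks points for "leads to eating" while nauseous, but
# other motivations feeding into the same calculation can still win out.
def brainstem_reward(scorecard, body_state):
    reward = scorecard["will lead to eating"] + scorecard["will lead to feel-good hormones"]
    if body_state["nauseous"]:
        # "Whatever thought you're thinking, I'm gonna dock some reward points
        # for its possibly leading to eating."
        reward -= 2.0 * scorecard["will lead to eating"]
    # Other motivations also feed in and get weighed here.
    reward += scorecard["will avoid later regret"]
    return reward

order_the_cake = {
    "will lead to eating": 0.8,
    "will lead to feel-good hormones": 0.7,
    "will avoid later regret": 0.9,
}

print(brainstem_reward(order_the_cake, {"nauseous": False}))  # 2.4: very appealing
print(brainstem_reward(order_the_cake, {"nauseous": True}))   # 0.8: much less appealing, but still positive
```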

Comment

https://www.lesswrong.com/posts/iMM6dvHzco6jBMFMX/value-loading-in-the-human-brain-a-worked-example?commentId=jM4ezzC4WbLdktfAH

Really enjoying your writing. I’m interested in your conception of the hypothalamus as hard-wired and what you make of this: https://journals.physiology.org/doi/full/10.1152/ajpregu.00501.2014

I’m personally interested in the interaction of the hypothalamus with learning processes and this is one I find particularly interesting. Do you think AI agents could benefit from implementing similar homeostatic drives in their learning algorithms?

Comment

https://www.lesswrong.com/posts/iMM6dvHzco6jBMFMX/value-loading-in-the-human-brain-a-worked-example?commentId=8y9ZgqRyDw2tBmiZg

Thanks for the good question, and interesting reference! In my defense, I did put in a caveat that it’s hardwired "to a first approximation" :-D Anyway: When I think about brain plasticity, I kinda have two different mental images.

  • My first mental image is "learning algorithms". I think about e.g. modern-day ML, or the brain’s various supervised learning & predictive learning & RL algorithms and so on. In this mental image, I’m imagining rewiring rules that result in a "trained model" algorithm that does something difficult and useful, and the "trained model" is at least moderately complicated, in the sense that an astronomically large number of different trained models could have been built if only the inputs to the learning algorithm were different.

  • My second mental image is "specific situation-dependent rewiring rules". My go-to example is the self-modifying code in Linux—e.g. "if debugging is turned off, then replace the debugging-related algorithm steps with no-ops". In a biological example, imagine that the genome is trying to implement the behavior "if you keep winning fights, start being more aggressive". (That’s a real example, see here.) It would be like, there’s some specific signal (related to whether you’re winning fights), and this signal changes the strength of some neuronal connection (controlling aggression). So in this mental image, I’m imagining some control knob or whatever that gets adjusted for legible reasons, and I’m imagining a lot of species-specific idiosyncratic complicated rules.

OK, so those are my two mental images. I don’t think there’s really a sharp line between these things; some things are in a gray area between them. But plenty of things are clearly one or clearly the other. I know that neuroscientists tend to lump these two things together and call them both "plasticity" or "learning"—and of course lumping them together makes perfect sense if you’re studying the biochemistry of neurotransmitters. But in terms of algorithm-level understanding, they seem to me like different things, and I want to keep them conceptually separate. (There’s a toy code sketch of the contrast at the end of this comment.)

Anyway, I have long believed that the hypothalamus has "specific situation-dependent rewiring rules"—probably lots of them. I think the paper you cite is in that category too: it’s like the genome is trying to encode a rule: "if you repeatedly get salt-deprived over the course of your life, start erring on the side of eating extra salt". (Unless I’m misunderstanding.) I assume the brainstem does that kind of thing too.

I’m not aware of there being any "real" learning algorithms in the hypothalamus in the specific sense above. I think there may actually be some things in the brainstem that seem like "real" learning algorithms, or at least they’re in the gray area. Conditioned Taste Aversion (CTA) is maybe an example? (I don’t know where the CTA database is stored—I was figuring probably brainstem or hypothalamus. I guess it could be telencephalon but I’d be surprised.) Also I think the superior colliculus can "learn" to adjust saccade targeting and align different sensory streams and whatnot. I’m very confused about the details there (how does it learn? what’s the ground truth?) and hope to look into it someday. Of course the cerebellum is a legit learning algorithm too, but I confusingly don’t count the cerebellum as "brainstem" in my idiosyncratic classification. :-P

Anyway, I can certainly imagine "specific situation-dependent rewiring rules" being potentially useful for AI, but I can’t think of any examples off the top of my head. I’m a bit skeptical about "homeostatic drives" for AI, like in the sense of a robot that’s intrinsically motivated to recharge its battery when it’s low. After all, at some point robots will be dangerously powerful and intelligent, and I don’t want them to have any intrinsic motivation beyond "do what my supervisor wants me to do" or "act ethically" or whatever else we come up with. Then it can want to recharge its battery as a means to an end. You’re welcome to disagree with any or all of that. :-)
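Here’s the promised toy illustration of the contrast between those two mental images; both functions are entirely invented and are not meant as claims about how any real circuit is implemented.

```python
# Toy contrast between the two mental images (both functions are made up and
# not meant as claims about how any real brain circuit works).

# (1) "Learning algorithm": a generic update rule; which trained model you end
# up with depends on an astronomically large space of possible inputs.
def sgd_step(weights, inputs, target, lr=0.01):
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = target - prediction
    return [w + lr * error * x for w, x in zip(weights, inputs)]

# (2) "Specific situation-dependent rewiring rule": a legible, genome-style
# conditional that adjusts one particular knob for one particular reason,
# e.g. "if you keep winning fights, start being more aggressive".
def update_aggression_knob(aggression, recent_fight_wins):
    if recent_fight_wins >= 3:
        aggression = min(1.0, aggression + 0.1)
    return aggression
```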