Competitive safety via gradated curricula

https://www.lesswrong.com/posts/vLepnCxCWW6YTw8eW/competitive-safety-via-gradated-curricula

Epistemic status: brainstorming some speculative research directions. Not trying to thoroughly justify the claims I’m making.

One way to think about the AI safety problem: there’s a spectrum of methods, each representing a different tradeoff between safety and ease of training an AGI, and unfortunately the two are anticorrelated. In particular, consider four regimes in which the bulk of training might occur (perhaps with additional fine-tuning afterwards):

Comment

https://www.lesswrong.com/posts/vLepnCxCWW6YTw8eW/competitive-safety-via-gradated-curricula?commentId=vuBT9jMbcv9knZRYx

I don’t think that design (1) is particularly safe. If your claim that design (1) is harder to get working is true, then you get a small amount of safety from the fact that a design that isn’t doing anything is safe. It depends on what the set of questions is, but if you want it to reliably answer questions like "how do I get from here to the bank?" then it needs to have a map, and some sort of pathfinding algorithm encoded in it somehow. If it can answer "what would a good advertising slogan be for product X?" then it needs some model that includes human psychology and business, and it needs to be able to pursue long-term goals like maximising profit. This is getting into dangerous territory.

A system trained purely to imitate humans might be limited to human levels of competence, and so not too dangerous. Given that humans are more competent at some tasks than others, and that competence varies between humans, the AI might contain a competence chooser, which guesses how good an answer a human would produce, and an optimiser module that can optimise a goal at a chosen level of competence. Of course, you aren’t training for anything above top human-level competence, so whether or not the optimiser carries on working when asked for superhuman competence depends on the inductive bias. On the other hand, if humans are unusually bad at X, then superhuman performance on X could arise from training the general optimiser on A, B, C, … which humans are better at. If humans can apply 10 units of optimisation power to problems A, B, C, … and we train the AI on human answers, we might train it to apply 10 units of optimisation power to arbitrary problems. If humans can only produce 2 units of optimisation on problem X, then the AI’s 10 units on X are superhuman for that problem.

To me, this design space feels like the set of Heath Robinson contraptions that contain several lumps of enriched uranium. If you just run one, you might be lucky and have the dangerous parts avoid coming together in just the wrong way. You might even be able to find a particular design in which you can prove that the lumps of uranium never get near each other. But all the pieces needed for something to go badly wrong are there.
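To make the worry concrete, here is a minimal toy sketch of the decomposition described above: a "competence chooser" that estimates human-level performance on a problem, feeding a general "optimiser" that runs with a chosen budget. All of the names, the "units of optimisation power", and the numbers are purely illustrative of the thought experiment, not a real training setup or anyone’s actual proposal.

```python
# Toy sketch of the hypothesised decomposition: a "competence chooser" that
# estimates how well a human would answer a given problem, feeding a general
# "optimiser" run with that competence budget. Everything here is illustrative.

from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    human_competence: float  # "units of optimisation power" a human can apply


def competence_chooser(problem: Problem, trained_ceiling: float) -> float:
    """Guess how much optimisation power a human answer would reflect.

    During training the chooser only ever sees human-level targets, so its
    output is capped at the best human performance in the training data.
    """
    return min(problem.human_competence, trained_ceiling)


def optimiser(problem: Problem, budget: float) -> str:
    """Stand-in for a general optimisation module run with a given budget."""
    return f"answer to {problem.name!r} using {budget:.1f} units of optimisation"


# Training distribution: problems humans are good at (A, B, C, ...).
training_problems = [Problem("A", 10.0), Problem("B", 10.0), Problem("C", 10.0)]
trained_ceiling = max(p.human_competence for p in training_problems)  # 10 units

# Deployment: problem X, which humans are unusually bad at (2 units).
problem_x = Problem("X", 2.0)

# If the learned policy routes through the chooser, it stays human-level on X:
print(optimiser(problem_x, competence_chooser(problem_x, trained_ceiling)))

# But if the general optimiser generalises and gets called with the full
# 10-unit budget it learned on A, B, C, the result is superhuman on X.
# Which of the two happens is a question of inductive bias, not of the
# training objective.
print(optimiser(problem_x, trained_ceiling))
```

The point of the sketch is just that imitation training never disambiguates between the two call patterns at the end; nothing in the objective prevents the second one.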

Comment

https://www.lesswrong.com/posts/vLepnCxCWW6YTw8eW/competitive-safety-via-gradated-curricula?commentId=3c5WmkQ2AtqcA8WbD

I agree. I’m generally okay with the order (oracles do seem marginally safer than agents, for example, and more restrictions should generally be safer than less), but also think the marginal amount of additional safety doesn’t matter much when you consider the total absolute risk. Just to make up some numbers, I think of it like choosing between options that are 99.6%, 99.7%, 99.8%, and 99.9% likely to result in disaster. I mean of course I’ll pick the one with a 0.4% chance of success, but I’d much rather do something radically different that is orders of magnitude safer.

Comment

Yeah, so I guess opinions on this would differ depending on how likely people think existential risk from AGI is. Personally, it’s clear to me that agentic misaligned superintelligences are bad news—but I’m much less persuaded by descriptions of how long-term maximising behaviour arises in something like an oracle. The prospect of an AGI that’s much more intelligent than humans and much less agentic seems quite plausible—even, perhaps, in an RL agent.

Comment

https://www.lesswrong.com/posts/vLepnCxCWW6YTw8eW/competitive-safety-via-gradated-curricula?commentId=jcXStJQKgZNDcQbMe

(This doesn’t touch much on the post’s main points.)

> Given that I have no experience of being deaf or blind, and have not looked into it very much, my intuitions on this point are not very well-informed; so I wanted to explicitly flag it as quite speculative.

It also happened at an early age. From a perspective that revolves around information, this is a disadvantage. From a human perspective, losing a capability later seems like it can be more devastating. This suggests:

> My explanation for this: the learning problem Helen faced was much harder than what most of us face, but because her brain architecture had already been "trained" by evolution, she could make use of those implicit priors to match, and then surpass, most of her contemporaries.

that the "training" (learning) happened prior to Harvard but while alive. ("Evolution can "design" but not train," and the formulation in terms of priors seems clunky.)

Comment

https://www.lesswrong.com/posts/vLepnCxCWW6YTw8eW/competitive-safety-via-gradated-curricula?commentId=q99zarx68A2dxh6at

> The key hypothesis is that it’s not uniformly harder to train AGIs in the safer regimes—rather, it’s primarily harder to get started in those regimes. Once an AI reaches a given level of intelligence, then transitioning to a safer regime might not slow down the rate at which it gains intelligence very much—but might still decrease the optimisation pressure in favour of that AI being highly agentic and pursuing large-scale goals.

Can’t the choice of programming language (or coding platform) affect the optimization pressures? [If everyone ends up learning poorly-designed choices, it can cause a lot of weird behaviors long-run, so a safer regime would include, like, a decent programming language.] It’s like how it’s harder to get started on blockchains that aren’t as bloated as Bitcoin or Ethereum.