BASALT: A Benchmark for Learning from Human Feedback

RyH8LtgMbRAJ9Dv6R

title

authors

Rohin Shah

date_published

2021-07-08T17:40

score

omega_karma

votes

Comment

id

eyP5yBvT4W62ZKm3q
authors

Rohin Shah
score

4
omega_karma

3
votes

2
date_published

2021-07-09T07:34

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=eyP5yBvT4W62ZKm3q

Or would you rather not have this sort of marketing? I would be excited to see this competition promoted widely! (Obviously I wouldn’t want to do anything that reflected really poorly on the marketers, like blackmail, but this seems to clearly not be in that category.)

id

oBPyYTudBYMypQT3w
authors

Edouard Harris
score

4
omega_karma

3
votes

3
date_published

2021-07-11T19:45

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=oBPyYTudBYMypQT3w

Love this idea. From the linked post on the BAIR website, the idea of "prompting" a Minecraft task with e.g. a brief sequence of video frames seems especially interesting. Would you anticipate the benchmark version of this would ask participants to disclose metrics such as "amount of task-specific feedback or data used in training"? Or does this end up being too hard to quantify because you’re explicitly expecting folks to use a variety of feedback modalities to train their agents?

Comment

id

sbNxHpeMwg39mjDEX
authors

Rohin Shah
score

3
omega_karma

3
votes

2
date_published

2021-07-11T21:34

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=sbNxHpeMwg39mjDEX

Would you anticipate the benchmark version of this would ask participants to disclose metrics such as "amount of task-specific feedback or data used in training"? Probably not, just because it’s pretty niche—I expect the vast majority of papers (at least in the near future) will have only task-specific feedback, so the extra data isn’t worth the additional hassle. (The prompting approach seems like it would require a lot of compute.) Tbc, "amount of task-specific feedback" should still be inferable from research papers, where you are meant to provide enough details that others could reproduce your work. It just wouldn’t be as simple as looking up the "BASALT evaluation table" for your method of choice.

Comment

id

CE6LmYrbaXZPnbfo5
authors

Edouard Harris
score

3
omega_karma

3
votes

2
date_published

2021-07-12T17:46

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=CE6LmYrbaXZPnbfo5

That makes sense, though I’d also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general. (e.g. a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you’d probably need to define a few hundred tasks for that to be useful. It’s also possible this has already been done and I’m unaware of it.)

Comment

id

dXBrCoWMDxPHdsTnk
authors

Rohin Shah
score

3
omega_karma

3
votes

2
date_published

2021-07-13T06:47

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=dXBrCoWMDxPHdsTnk

That makes sense, though I’d also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general. Oh yeah, it totally is, and I’d be excited for that to happen. But I think that will be a single project, whereas the benchmark reporting process is meant to apply for things where there will be lots of projects that you want to compare in a reasonably apples-to-apples way, so when designing the reporting process I’m focused more on the small-scale projects that aren’t GPT-N-like. It’s also possible this has already been done and I’m unaware of it I’m pretty confident that there’s nothing like this that’s been done and publicly released.

id

xvBQfXgoZ6XvmfRdr
authors

Vanessa Kosoy
score

2
omega_karma

1
votes

1
date_published

2021-07-09T09:07

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=xvBQfXgoZ6XvmfRdr

It’s not quite as interesting as I initially thought, since they allow handcrafted reward functions and heuristics. It would be more interesting if the designers did not know the particular task in advance, and the AI would be forced to learn the task entirely from demonstrations and/or natural language description.

Comment

id

8m87vMxPgf84cFXhP
authors

Daniel Kokotajlo
score

9
omega_karma

6
votes

5
date_published

2021-07-09T11:05

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=8m87vMxPgf84cFXhP

Going from zero to "produce an AI that learns the task entirely from demonstrations and/or natural language description" is really hard for the modern AI research hive mind. You have to instead give it a shaped reward, breadcrumbs along the way that are easier, (such as allowing handcrafted heuristics and such, and allowing knowledge of a particular target task) to get the hive mind started making progress.

Comment

id

HAePkHuKrqcuE9wJj
authors

Pattern
score

4
omega_karma
votes

2
date_published

2021-07-09T17:19

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=HAePkHuKrqcuE9wJj

To clarify, by hive mind, do you mean humans?

Comment

id

8hpScGWhsPwTvRwaA
authors

Daniel Kokotajlo
score

2
omega_karma
votes

1
date_published

2021-07-09T17:58

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=8hpScGWhsPwTvRwaA

Yes. I’m being a bit whimsical here, I’m tickled by the analogy to training neural nets.

Comment

id

zrsKQJJcwXX8um2SC
authors

Pattern
score

2
omega_karma
votes

1
date_published

2021-07-09T20:17

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=zrsKQJJcwXX8um2SC

In general: a research AI that will train agents based on its understanding of described benchmarks does sound interesting. Although how it would get ahold of things like ‘human feedback’, and craft setups for that, isn’t clear.* Trying to create a setup so AIs can learn from other AIs (crafting rewards—seems unlikely. Expert demonstrations—might be doable. Whether this would be more or less useful than a human demonstration, I’m not sure. There actually might be the option to ask ‘what would _ do in this situation’ if you can actually run _.) Edited to add: *First I was imagining Skynet. Now I’m imagining a really weird license agreement: you may use this AI in exchange for [$ + billing schedule], but with the data that’s trained on (even if seemingly totally unrelated), you must also have it try working on this AI benchmark, (frequency based on usage), and publicly share the score (but not necessarily pictures of outputs—there’s the risk of user data being leaked, by means of being recreated in minecraft, and the user retains full responsibility for the security of such data, and is encouraged but not required to run it ‘offline’, i.e. not on Minecraft servers).

id

XKd6XgSGKxA7TCib6
authors

Vanessa Kosoy
score

2
omega_karma

1
votes

1
date_published

2021-07-09T21:17

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=XKd6XgSGKxA7TCib6

It’s not "from zero" though, I think that we already have ML techniques that should be applicable here.

id

dtLzzEF39HttwcuPJ
authors

Rohin Shah
score

4
omega_karma

4
votes

2
date_published

2021-07-09T18:21

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=dtLzzEF39HttwcuPJ

they allow handcrafted reward functions and heuristics We allow it, but we don’t think it will lead to good performance (unless you throw a very large amount of time at it). The AI safety community claims it is hard to specify reward functions. If we actually believe this claim, we should be able to create tasks where even if we allow people to specify reward functions, they won’t be able to do so. That’s what we’ve tried to do here. Note we do ban extraction of information from the Minecraft simulator—you have to work with pixels, so if you want to make handcrafted reward functions, you have to compute rewards from pixels somehow. (Technically you also have inventory information but that’s not that useful.) We have this rule because in a real-world deployment you wouldn’t be able to simply extract the "state" of physical reality. I am a bit more worried about allowing heuristics—it’s plausible to me that our chosen tasks are simple enough that heuristics could solve them, even though real world tasks are too complex for similar heuristics to work—but this is basically a place where we’re sticking our necks out and saying "nope, heuristics won’t suffice either" (again, unless you put a lot of effort into designing the heuristics, where it would have been faster to just build the system that, say, learns from demonstrations). It would be more interesting if the designers did not know the particular task in advance But for real-world deployment of AI systems, designers do know the task in advance! We don’t want to ban strategies that designers could use in a realistic setting.

Comment

id

Y8Mc3YtNLdnbBCBT5
authors

Vanessa Kosoy
score

4
omega_karma

3
votes

2
date_published

2021-07-09T21:11

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=Y8Mc3YtNLdnbBCBT5

The AI safety community claims it is hard to specify reward functions… But for real-world deployment of AI systems, designers do know the task in advance!

Right, but you’re also going for tasks that are relatively simple and easy. In the sense that, "MakeWaterfall" is something that I can, based on my own experience, imagine solving without any ML at all (but ofc going to that extreme would require massive work). It might be that for such tasks solutions using handcrafted rewards/heuristics would be viable, but wouldn’t scale to more complex tasks. If your task was e.g. "follow arbitrary natural language instructions" then I wouldn’t care about the "lax" rules.

Note we do ban extraction of information from the Minecraft simulator

This is certainly good, but I wonder what are the exact rules here. Suppose the designer trains a neural network to recognize trees in minecraft by getting the minecraft engine to generate lots of images of trees. The resulting network is then used as a hardcoded part of the agent architecture. Is that allowed? If not, how well can you enforce it (I imagine something of the sort can be done in subtler ways)?

Not saying that what you’re doing is not useful, just pointing out a certain way in which the benchmark might diverge from its stated aim.

Comment

id

G8m2AXvT3dif6eCeH
authors

Rohin Shah
score

4
omega_karma

3
votes

2
date_published

2021-07-10T07:16

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=G8m2AXvT3dif6eCeH

It might be that for such tasks solutions using handcrafted rewards/heuristics would be viable, but wouldn’t scale to more complex tasks. I agree that’s possible. Tbc, we did spend some time thinking about how we might use handcrafted rewards / heuristics to solve the tasks, and eliminated a couple based on this, so I think it probably won’t be true here. Suppose the designer trains a neural network to recognize trees in minecraft by getting the minecraft engine to generate lots of images of trees. The resulting network is then used as a hardcoded part of the agent architecture. Is that allowed? No. If not, how well can you enforce it (I imagine something of the sort can be done in subtler ways)? For the competition, there’s a ban on pretrained models that weren’t publicly available prior to competition start. We look at participants’ training code to ensure compliance. It is still possible to violate this rule in a way that we may not catch (e.g. maybe you use internal simulator details to do hyperparameter tuning, and then hardcode the hyperparameters in your training code), but it seems quite challenging and not worth the effort even if you are willing to cheat. For the benchmark (which is what I’m more excited about in the longer run), we’re relying on researchers to follow the rules. Science already relies on researchers honestly reporting their results—it’s pretty hard to catch cases where you just make up numbers for your experimental results. (Also in the benchmark version, people are unlikely to write a paper about how they solved the task using special-case heuristics; that would be an embarrassing paper.)

id

pjN6NmqrC9mM3F2FY
authors

ChristianKl
score

4
omega_karma

2
votes

2
date_published

2021-07-09T18:58

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=pjN6NmqrC9mM3F2FY

The AI safety community claims it is hard to specify reward functions. If we actually believe this claim, we should be able to create tasks where even if we allow people to specify reward functions, they won’t be able to do so. That’s what we’ve tried to do here. It being hard to specify reward functions for a specific task and it being hard to specify reward functions for a more general AGI seem to me like two very different problems. Additionally, developing a safe system and developing a nonsafe system are very different. Even if your reward function works 99,9% of the time it can be exploited in those cases where it fails.

Comment

id

JGQSoNvWJBQ8nfcgQ
authors

Rohin Shah
score

6
omega_karma

5
votes

3
date_published

2021-07-09T20:58

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=JGQSoNvWJBQ8nfcgQ

Okay, regardless of what the AI safety community claims, I want to make that claim. (I think a substantial chunk of the AI safety community also makes that claim but I’m not interested in defending that here.)

It being hard to specify reward functions for a specific task and it being hard to specify reward functions for a more general AGI seem to me like two very different problems. As an aside, if I thought we could build task-specific AI systems for arbitrary tasks, and only super general AI systems were dangerous, I’d be advocating really hard for sticking with task-specific AI systems and never building super general AI systems (or only building them after some really high threshold of safety was met).

Comment

id

hxER6xsKGftWw8eHX
authors

Vanessa Kosoy
score

4
omega_karma

2
votes

2
date_published

2021-07-09T21:14

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=hxER6xsKGftWw8eHX

if I thought we could build task-specific AI systems for arbitrary tasks, and only super general AI systems were dangerous, I’d be advocating really hard for sticking with task-specific AI systems and never building super general AI systems

The problem with this is that you need an AI whose task is "protect humanity from unaligned AIs", which is already very "general" in a way (i.e. requires operating on large scales of space, time and strategy). Unless you can effectively reduce this to many "narrow" tasks which is probably not impossible but also not easy.

id

Jt4kGTKRDM87qjRQx
authors

ChristianKl
score

2
omega_karma
votes

1
date_published

2021-07-09T23:06

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=Jt4kGTKRDM87qjRQx

I think it’s very easy to say "don’t do general system do task specific ones", because general ones might promise a lot of economic returns. A task like "Handle this Amazon customer query correctly" is already very general as it includes a host of different long tail issues about possible bugs that might appear (some of those unknown). If a customer faces an issue on a page that’s likely a bug, a customer service AI profits from understanding the code that produces the issue that the customer has. Given the way economic pressures work, I see it as very probably that companies will just go ahead and look at what’s most efficient for their business goals.

id

BJ44XaStrsbfSkQvy
authors

Pattern
score

2
omega_karma
votes

1
date_published

2021-07-09T21:54

https://www.lesswrong.com/posts/RyH8LtgMbRAJ9Dv6R/basalt-a-benchmark-for-learning-from-human-feedback?commentId=BJ44XaStrsbfSkQvy

It’s not clear that a system which doesn’t use reward doesn’t have the same issue (relative to "99,9% of the time").