[updated] how does gpt2's training corpus capture internet discussion? not well

https://www.lesswrong.com/posts/4JeAoTrAuByXGw6zm/updated-how-does-gpt2-s-training-corpus-capture-internet

Link post

[Updated to correct my earlier claim that this doesn’t affect GPT-3. Apparently it does?]

I’m out sick today, but had enough energy to do some GPT-related fiddling around. This time, I was curious what "internet discussions" tended to look like in the original training corpus. I thought this might point to a more natural way to represent tumblr threads for @nostalgebraist-autoresponder than my special character trick.

So, I looked around in the large shard provided as part of https://github.com/openai/gpt-2-output-dataset. Colab notebook here, so you can interactively reproduce my findings or try similar things.

The results were … revealing, but disappointing. I did find a lot of discussion threads in the data (couldn’t find many chatlogs). But
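If you don’t want to open the notebook, here is a minimal sketch of the kind of search this involves, assuming the shard is one of the webtext .jsonl files from that repo with a "text" field on each line (check the repo’s download script if your copy is laid out differently). The marker list and the looks_like_discussion heuristic below are invented for illustration, not the notebook’s actual method.

```python
import json

def load_docs(path, limit=None):
    """Yield document texts from a .jsonl webtext shard (one JSON object per line)."""
    with open(path) as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)["text"]

# Crude markers that tend to show up in scraped forum/comment threads.
DISCUSSION_MARKERS = ["wrote:", "Quote from", "said:", "replied:", "Reply #"]

def looks_like_discussion(text):
    # Call a document "discussion-like" if it contains at least two marker hits.
    return sum(text.count(m) for m in DISCUSSION_MARKERS) >= 2

hits = [doc for doc in load_docs("webtext.train.jsonl", limit=50_000)
        if looks_like_discussion(doc)]
print(f"{len(hits)} documents look discussion-like")
print(hits[0][:1000] if hits else "no hits")
```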

Comment

https://www.lesswrong.com/posts/4JeAoTrAuByXGw6zm/updated-how-does-gpt2-s-training-corpus-capture-internet?commentId=xHPrpHXPtSMM8gbD7

It can’t be too bad, though, because I have seen GPT-3 generate fairly plausible forum discussions with multiple participants, and how would it do that if it only ever saw single-commenter documents?

Comment

https://www.lesswrong.com/posts/4JeAoTrAuByXGw6zm/updated-how-does-gpt2-s-training-corpus-capture-internet?commentId=t3kHdr2KTxTPhWHn6

Do you have examples of that kind of output for comparison? (Is it reproducing formatting from an actual forum of some kind, or is it the additional "abstraction headroom" over GPT-2 that allows GPT-3 to output a forum-type structure without having matching examples in the training set?)

Comment

I didn’t copy it, but it was fairly reasonable plaintext, something like username \n date \n comment \n\n next comment.
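For concreteness, a thread in that shape would read roughly like this (usernames, dates, and text invented for illustration):

```
exampleuser1
July 14, 2020
I think the thread structure comes through surprisingly well.

exampleuser2
July 15, 2020
Agreed, though it sometimes loses track of who said what.
```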