Parameter counts in Machine Learning

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning

Contents

Caveats

Insights

Open questions

Next steps

Acknowledgements

This article was written by Jaime Sevilla, Pablo Villalobos and Juan Felipe Cerón. Jaime’s work is supported by a Marie Curie grant of the NL4XAI Horizon 2020 program. We thank Girish Sastry for advising us on the beginning of the project, the Spanish Effective Altruism community for creating a space to incubate projects such as this one, and Haydn Belfield, Pablo Moreno and Ehud Reiter for discussion and system submissions.

Bibliography

Comment

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=sFFBeva2fDgsoynDC

I think the story for the discontinuity is basically "around 2018 industry labs realized that language models would be the next big thing" (based on Attention is all you need, GPT-2, and/or BERT), and then they switched their largest experiments to be on language (as opposed to the previous contender, games). Similarly for games, if you take DQN to be the event causing people to realize "large games models will be the next big thing", it does kinda look like there’s a discontinuity there (though there are way fewer points so it’s harder to tell; also I’m inclined to ignore things like CURL, which came out of an academic lab with a limited compute budget).

This story doesn’t hold up for vision though (taking AlexNet as the event); I’m not sure why that is. One theory is that vision is tied to a fixed dataset (ImageNet), and that effectively puts a max size on how big your neural nets can be.

You might also think that model size underwent a discontinuity around 2018, independent of which domain it’s in. I think that’s because the biggest experiments moved from vision (2012-15) to games (2015-19) to language (2019-now), with the compute trend staying continuous. However, in games the model-size-to-compute ratio is way lower (since it involves RL, while vision and language involve SL). For example, AlphaZero had fewer parameters than AlexNet, despite taking almost 5 orders of magnitude more compute. So you see max model size stalling a bit in 2015-19, and then bursting upwards around 2019.

Aside: I hadn’t realized AlphaZero took 5 orders of magnitude more compute per parameter than AlexNet. The horizon length concept would have predicted ~2 orders (since a full Go game is a couple hundred moves). I wonder what gets the extra 3 orders. Probably at least part of it comes from the difference between using a differentiable vs. non-differentiable objective function.

Comment

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=jSpXWfLbhnSgNHWRP

The difference in compute between AlexNet and AlphaZero is because for AlexNet you are only counting the flops during training, while for AlphaZero you are counting both the training and the self-play data generation (which does 800 forward passes per move * ~200 moves to generate each game). If you were to compare supervised training numbers for both (e.g. training on human chess or Go games) then you’d get much closer.
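As a rough, illustrative check of the self-play overhead described above, the number of forward passes needed to generate one game follows directly from the two figures quoted (800 forward passes per move, about 200 moves per game). The variable names are my own, and the sketch deliberately ignores every other difference between the two training setups.

```python
import math

# Figures quoted in the comment above; the names are illustrative.
FORWARD_PASSES_PER_MOVE = 800   # forward passes per move during self-play
MOVES_PER_GAME = 200            # approximate length of one game

forward_passes_per_game = FORWARD_PASSES_PER_MOVE * MOVES_PER_GAME
orders_of_magnitude = math.log10(forward_passes_per_game)

print(forward_passes_per_game)         # 160000
print(round(orders_of_magnitude, 1))   # 5.2
```

So data generation alone multiplies the forward-pass count per game by roughly five orders of magnitude, which is the same scale as the gap being discussed.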

Comment

That’s fair. I was thinking of that as part of "compute needed during training", but you could also split it up into "compute needed for gradient updates" and "compute needed to create data of sufficient quality", and then say that the stable thing is the "compute needed for gradient updates".

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=yTGtAT8QLNHAkrd32

"Aside: I hadn’t realized AlphaZero took 5 orders of magnitude more compute per parameter than AlexNet. The horizon length concept would have predicted ~2 orders (since a full Go game is a couple hundred moves). I wonder what gets the extra 3 orders. Probably at least part of it comes from the difference between using a differentiable vs. non-differentiable objective function."

I think that in a forward pass, AlexNet uses about 10-15 flops per parameter (assuming 4 bytes per parameter and using this table), because it puts most of its parameters in the small convolutions and FC layers. But I think AlphaZero has most of its parameters in 19x19 convolutions, which involve 722 flops per parameter (19 x 19 x 2). If that’s right, it accounts for a factor of 50; combined with game length that’s 4 orders of magnitude explained.

I’m not sure what’s up with the last order of magnitude. I think that’s a normal amount of noise / variation across different tasks, though I would have expected AlexNet to be somewhat overtrained given the context. I also think the comparison is kind of complicated because of MCTS and distillation (e.g. AlphaZero uses much more than 1 forward pass per turn, and you can potentially learn from much shorter effective horizons when imitating the distilled targets).
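The flops-per-parameter arithmetic in this comment can be reproduced in a few lines. This is only a sketch of the estimate as stated: taking the midpoint of the 10-15 range for AlexNet and 200 moves for "a couple hundred moves" are my assumptions, and the constant names are illustrative.

```python
import math

# Inputs taken from the comment; the midpoint and the 200-move figure are assumptions.
ALEXNET_FLOPS_PER_PARAM = (10 + 15) / 2    # midpoint of the quoted 10-15 range
ALPHAZERO_FLOPS_PER_PARAM = 19 * 19 * 2    # each weight reused at 361 board positions, 2 flops each
MOVES_PER_GAME = 200                       # "a couple hundred moves"

ratio = ALPHAZERO_FLOPS_PER_PARAM / ALEXNET_FLOPS_PER_PARAM
combined = ratio * MOVES_PER_GAME

print(ALPHAZERO_FLOPS_PER_PARAM)           # 722
print(round(ratio, 1))                     # 57.8
print(round(math.log10(combined), 2))      # 4.06
```

With these inputs the weight-reuse factor comes out a little above 50, and together with game length it covers about four orders of magnitude, matching the figures in the comment.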

Comment

I also looked into the number of training points very briefly. Googling suggests AlexNet used 90 epochs on ImageNet’s 1.3 million train images, while AlphaZero played 44 million games for chess (I didn’t quickly find a number for Go), suggesting that the number of images was roughly similar to the number of games. So I think probably the remaining orders of magnitude are coming from the tree search part of MCTS (which causes there to be > 200 forward passes per game).
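To make "roughly similar" concrete, here is the same comparison as a short calculation. It uses only the figures quoted above; the names are illustrative, and counting one image presentation per epoch is my reading of the comparison.

```python
# Figures quoted in the comment above.
ALEXNET_EPOCHS = 90
IMAGENET_TRAIN_IMAGES = 1_300_000
ALPHAZERO_CHESS_GAMES = 44_000_000

# Total image presentations during AlexNet training (one per image per epoch).
images_seen = ALEXNET_EPOCHS * IMAGENET_TRAIN_IMAGES
ratio = images_seen / ALPHAZERO_CHESS_GAMES

print(images_seen)       # 117000000
print(round(ratio, 1))   # 2.7
```

The two counts differ by a factor of under 3, so they are within the same order of magnitude.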

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=iRv4o2gwCuswPpEfv

One reason it might not be fitting as well for vision is that vision has much more weight-tying / weight-reuse in convolutional filters. If the underlying variable that mattered was compute, then image processing neural networks would show up more prominently in compute (rather than parameters).

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=8HGkCDKtdWiKtyZgB

Could it be inefficient scaling? Most work that does not explicitly use scaling laws for planning seems to spend too much compute per parameter, i.e. it trains models that are too small. Anyone want to try to apply Jones 2021 to see if AlphaZero was scaled wrong?

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=TsXKcfqagFNNHcuLh

Great collection of results. I particularly found the interactive graph useful. I’m slightly confused by the trend lines (especially for Games and Other) - they don’t seem intuitively the best fits. It looks like they place a lot of importance on the high parameter recent models (possibly the cost for each datapoint is in parameter space rather than log(parameter) space?).

Comment

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=uReKtqbCLzzsHC2s2

Thank you! I think you are right—by default the Altair library (what we used to plot the regressions) does OLS fitting of an exponential instead of fitting a linear model over the log transform. We’ll look into this and report back.
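To illustrate the distinction being discussed, the sketch below fits the same synthetic data in two ways: ordinary least squares on log10(parameters), and least squares on the raw parameter counts for an exponential model. It is plain Python, not Altair; the data, the function names, and the grid search used for the raw-space fit are all my own assumptions, chosen only to show why large recent models dominate when the error is measured in parameter space.

```python
import math

def ols_slope_intercept(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def fit_log_space(xs, ys):
    """Linear model on log10(y); slope is in orders of magnitude per unit x."""
    return ols_slope_intercept(xs, [math.log10(y) for y in ys])

def fit_linear_space(xs, ys, max_slope=3.0, steps=3000):
    """Least squares on y itself for y = amp * 10**(slope * x).

    Grid search over slope; for each slope the best amp has a closed form.
    """
    best = (math.inf, 0.0, 1.0)
    for i in range(steps + 1):
        slope = max_slope * i / steps
        basis = [10.0 ** (slope * x) for x in xs]
        amp = sum(y * b for y, b in zip(ys, basis)) / sum(b * b for b in basis)
        sse = sum((y - amp * b) ** 2 for y, b in zip(ys, basis))
        if sse < best[0]:
            best = (sse, slope, amp)
    _, slope, amp = best
    return slope, math.log10(amp)

# Synthetic data: 0.5 orders of magnitude per step, plus one very large final model.
xs = [float(i) for i in range(10)]
ys = [10.0 ** (0.5 * x) for x in xs]
ys[-1] *= 100.0

log_slope, _ = fit_log_space(xs, ys)
lin_slope, _ = fit_linear_space(xs, ys)
print("log-space slope:", round(log_slope, 3))
print("raw-space slope:", round(lin_slope, 3))
```

In this toy setup the log-space fit stays close to the underlying 0.5 orders of magnitude per step, while the raw-space fit is pulled steeply toward the single largest point.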

Comment

If you are still interested in fiddling with this graph, here’s a variant I’d love to see: Remove all the datapoints in each AI category that are not record-setting, such that each category just tracks the largest available models at any given time. Then compute the best fit lines for the resulting categories. (Because this is what would be useful for predicting what the biggest models will be in year X, whereas the current design is for predicting what the average model size will be in year X… right?)
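A minimal sketch of the filtering step proposed here: keep, within each category, only the models that were larger than every earlier model in that category. The function name and the toy data are hypothetical (the numbers are invented placeholders, not entries from the dataset).

```python
from collections import defaultdict

def record_setters(models):
    """Return {category: [(year, params), ...]} keeping only record-setting models.

    `models` is an iterable of (category, year, params) tuples.
    """
    by_category = defaultdict(list)
    for category, year, params in models:
        by_category[category].append((year, params))

    records = {}
    for category, points in by_category.items():
        best_so_far = 0
        kept = []
        for year, params in sorted(points):   # chronological order
            if params > best_so_far:          # strictly larger than all earlier models
                kept.append((year, params))
                best_so_far = params
        records[category] = kept
    return records

# Invented placeholder data, for illustration only.
toy_models = [
    ("games", 2015, 1_000_000),
    ("games", 2017, 500_000),        # below the 2015 record, dropped
    ("games", 2019, 20_000_000),
    ("language", 2018, 100_000_000),
    ("language", 2019, 50_000_000),  # below the 2018 record, dropped
    ("language", 2020, 100_000_000_000),
]

records = record_setters(toy_models)
for category, points in records.items():
    print(category, points)
```

Trend lines would then be fitted to the filtered points per category rather than to the full dataset.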

Comment

Good suggestion! Understanding the trend of record-setting would be interesting indeed so that we avoid the pesky influence of the systems which are below the trend like CURL in the game domain. The problem with the naive setup of just regressing on record-setters is that it is quite sensitive to noise: one early outlier in the trend can completely alter the result. I explore a similar problem in my paper Forecasting timelines of quantum computing, where we try to extrapolate progress on some key metrics like qubit count and gate error rate. The method we use in the paper to address this issue is to bootstrap the input and predict a range of possible growth rates, so that outliers do not completely dominate the result. I will probably not do it right now for this dataset, though I’d be interested in having other people try that if they are so inclined!
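For anyone inclined to try it, the general idea of bootstrapping a growth rate can be sketched as follows. This is a generic resampling sketch, not the procedure from the quantum computing paper: the function names, the synthetic data, and the choice of a 5th to 95th percentile range are all my assumptions.

```python
import math
import random

def log10_slope(points):
    """OLS slope of log10(params) against year for a list of (year, params)."""
    n = len(points)
    years = [year for year, _ in points]
    logs = [math.log10(params) for _, params in points]
    year_mean = sum(years) / n
    log_mean = sum(logs) / n
    sxx = sum((y - year_mean) ** 2 for y in years)
    sxy = sum((y - year_mean) * (l - log_mean) for y, l in zip(years, logs))
    return sxy / sxx

def bootstrap_growth_rates(points, n_resamples=1000, seed=0):
    """Resample points with replacement and refit; returns sorted slopes."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_resamples:
        sample = rng.choices(points, k=len(points))
        if len({year for year, _ in sample}) < 2:
            continue  # a slope needs at least two distinct years
        slopes.append(log10_slope(sample))
    return sorted(slopes)

# Synthetic data: 0.5 orders of magnitude per year plus noise (illustrative only).
noise = random.Random(1)
points = [
    (year, 10.0 ** (6 + 0.5 * (year - 2010) + noise.gauss(0, 0.3)))
    for year in range(2010, 2022)
]

slopes = bootstrap_growth_rates(points)
low = slopes[int(0.05 * len(slopes))]
high = slopes[int(0.95 * len(slopes))]
print("growth rate range (orders of magnitude per year):", round(low, 3), "to", round(high, 3))
```

Reporting a range rather than a single fitted slope makes it visible when one or two points are driving the estimate.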

Comment

OK, sounds good! I know someone who might be interested... Another, very similar thing that would be good is to just delete all the non-record-setting data points and draw lines to connect the remaining dots. Also, it would be cool if we could delete all the Mixture of Experts models to see what the "dense" version of the trend looks like.

This is now fixed; see the updated graphs. We have also updated the eyeball estimates accordingly.

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=WemR9GkiZNivqJ2tE

Planned summary for the Alignment Newsletter:

This post presents a dataset of the parameter counts of 139 ML models from 1952 to 2021. The resulting graph is fairly noisy and hard to interpret, but suggests that:

  1. There was no discontinuity in model size in 2012 (the year that AlexNet was published, generally acknowledged as the start of the deep learning revolution).
  2. There was a discontinuity in model size for language in particular some time between 2016-18.

Planned opinion:

You can see my thoughts on the trends in model size in this comment.

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=k9HA2JG6pjR5k9qup

Thank you for collecting this dataset! What’s the difference between the squares, triangles, and plus-sign datapoints? If you say it somewhere, I haven’t been able to find it, I’m afraid.

Comment

https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning?commentId=ucjmvW9HzpoLczvse

Thank you! The shapes mean the same as the color (i.e. domain); they were meant to make the graph clearer. Ideally both shape and color would be reflected in the legend, but whenever I tried adding shapes to the legend, a new legend was created instead, which was more confusing. If somebody reading this knows how to make the code produce a correct legend I’d be very keen on hearing it!

EDIT: Now fixed