-
source
arxiv
-
source_type
latex
-
converted_with
pandoc
-
paper_version
1905.10498v2
-
title
Cold Case: The Lost MNIST Digits
-
authors
["Chhavi Yadav","Léon Bottou"]
-
date_published
2019-05-25 01:50:51+00:00
-
data_last_modified
2019-11-04 21:05:26+00:00
-
abstract
Although the popular MNIST dataset [LeCun et al., 1994] is derived from the NIST database [Grother and Hanaoka, 1995], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the balance 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by Recht et al. [2018, 2019]: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.
-
author_comment
Final NeurIPS version
-
journal_ref
null
-
doi
null
-
primary_category
cs.LG
-
categories
["cs.LG","cs.CV","stat.ML"]
-
citation_level
0
-
alignment_text
pos
-
confidence_score
1.0
-
main_tex_filename
./qmnist.tex
-
bibliography_bbl
\begin{thebibliography}{13}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
\providecommand{\doi}[1]{doi: #1}\else
\providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
\bibitem[Bonferroni(1936)]{bonferroni-1936}
Carlo~E. Bonferroni.
\newblock \emph{Teoria statistica delle classi e calcolo delle probabilità}.
\newblock Pubblicazioni del {R.} {Istituto} superiore di scienze economiche e
commerciali di {Firenze}. Libreria internazionale Seeber, 1936.
\bibitem[Bottou and {Le Cun}(1988)]{bottou-lecun-88}
{L\'eon} Bottou and Yann {Le Cun}.
\newblock {SN}: A simulator for connectionist models.
\newblock In \emph{Proceedings of NeuroNimes 88}, pages 371--382, Nimes,
France, 1988.
\bibitem[Bottou and LeCun(2001)]{lush}
L\'eon Bottou and Yann LeCun.
\newblock \emph{Lush Reference Manual}.
\newblock \url{http://lush.sf.net/doc}, 2001.
\bibitem[Bottou et~al.(1994)Bottou, Cortes, Denker, Drucker, Guyon, Jackel, {Le
Cun}, Muller, S\"{a}ckinger, Simard, and Vapnik]{bottou-cortes-94}
L\'{e}on Bottou, Corinna Cortes, John~S. Denker, Harris Drucker, Isabelle
Guyon, Lawrence~D. Jackel, Yann {Le Cun}, Urs~A. Muller, Eduard
S\"{a}ckinger, Patrice Simard, and Vladimir Vapnik.
\newblock Comparison of classifier methods: a case study in handwritten digit
recognition.
\newblock In \emph{Proceedings of the 12th IAPR International Conference on
Pattern Recognition, Conference B: Computer Vision \& Image Processing.},
volume~2, pages 77--82, Jerusalem, October 1994. IEEE.
\bibitem[Feldman et~al.(2019)Feldman, Frostig, and Hardt]{feldman-2019}
Vitaly Feldman, Roy Frostig, and Moritz Hardt.
\newblock The advantages of multiple classes for reducing overfitting from test
set reuse.
\newblock In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
\emph{Proceedings of the 36th International Conference on Machine Learning},
volume~97 of \emph{Proceedings of Machine Learning Research}, pages
1892--1900. PMLR, 2019.
\bibitem[Grother and Hanaoka(1995)]{nist-sd19}
Patrick~J. Grother and Kayee~K. Hanaoka.
\newblock {NIST} {Special} {Database} 19: Handprinted forms and characters
database.
\newblock \url{https://www.nist.gov/srd/nist-special-database-19}, 1995.
\newblock SD1 was released in 1990, SD3 and SD7 in 1992, SD19 in 1995, SD19 2nd
edition in 2016.
\bibitem[He et~al.(2016)He, Zhang, Ren, and Sun]{he2016deep}
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
\newblock Deep residual learning for image recognition.
\newblock In \emph{Proceedings of the IEEE conference on computer vision and
pattern recognition}, pages 770--778, 2016.
\bibitem[{Le Cun} et~al.(1998){Le Cun}, Bottou, Bengio, and Haffner]{lecun-98h}
Yann {Le Cun}, L\'{e}on Bottou, Yoshua Bengio, and Patrick Haffner.
\newblock Gradient based learning applied to document recognition.
\newblock \emph{Proceedings of IEEE}, 86\penalty0 (11):\penalty0 2278--2324,
1998.
\bibitem[LeCun et~al.(1994)LeCun, Cortes, and Burges]{mnist}
Yann LeCun, Corinna Cortes, and Christopher J.~C. Burges.
\newblock The {MNIST} database of handwritten digits.
\newblock \url{http://yann.lecun.com/exdb/mnist/}, 1994.
\newblock MNIST was created in 1994 and released in 1998.
\bibitem[Recht et~al.(2018)Recht, Roelofs, Schmidt, and
Shankar]{recht2018cifar}
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar.
\newblock Do {CIFAR-10} classifiers generalize to {CIFAR-10}?
\newblock \emph{arXiv preprint arXiv:1806.00451}, 2018.
\bibitem[Recht et~al.(2019)Recht, Roelofs, Schmidt, and
Shankar]{pmlr-v97-recht19a}
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar.
\newblock Do {I}mage{N}et classifiers generalize to {I}mage{N}et?
\newblock In \emph{Proceedings of the 36th International Conference on Machine
Learning}. PMLR, 2019.
\bibitem[Simonyan and Zisserman(2014)]{simonyan2014very}
Karen Simonyan and Andrew Zisserman.
\newblock Very deep convolutional networks for large-scale image recognition.
\newblock \emph{arXiv preprint arXiv:1409.1556}, 2014.
\bibitem[Vapnik(1982)]{vapnik-82}
V.~N. Vapnik.
\newblock \emph{Estimation of dependences based on empirical data}.
\newblock Springer Series in Statistics. Springer Verlag, Berlin, New York,
1982.
\end{thebibliography}
-
bibliography_bib
-
arxiv_citations
{"1806.00451":true,"1409.1556":true}
-
alignment_newsletter
{"source":"alignment-newsletter","source_type":"google-sheets","converted_with":"python","venue":"arXiv","newsletter_category":"Deep learning","highlight":false,"newsletter_number":"AN #103","newsletter_url":"https://mailchi.mp/60475c277263/an-103-arches-an-agenda-for-existential-safety-and-combining-natural-language-with-deep-rl","summarizer":"Flo","summary":"As the MNIST test set only contains 10,000 samples, concerns that further improvements are essentially overfitting on the test set have been voiced. Interestingly, MNIST was originally meant to have a test set of 60,000, as large as the training set, but the remaining 50,000 digits have been lost. The authors made many attempts to reconstruct the way MNIST was obtained from the NIST handwriting database as closely as possible and present QMNIST(v5) which features an additional 50,000 test images for MNIST, while the rest of the images are very close to the originals from MNIST. They test their dataset using multiple classification methods and find little difference in whether MNIST or QMNIST is used for training, but the test error on the additional 50,000 images is consistently higher than on the original 10,000 test images or their reconstruction of these. While the concerns about overuse of a test set are justified, the measured effects were mostly small and their relevance might be outweighed by the usefulness of paired differences for statistical model selection. ","opinion":"I am confused about the overfitting part, as most methods they try (like ResNets) don't seem to have been selected for performance on the MNIST test set. Granted, LeNet seems to degrade more than other models, but it seems like the additional test images in QMNIST are actually harder to classify. This seems especially plausible with the previous summary in mind and because the authors mention a dichotomy between the ease of classification for NIST images generated by highschoolers vs government employees but don’t seem to mention any attempts to deal with potential selection bias.","prerequisites":"nan","read_more":"nan","paper_version":"1905.10498v2","arxiv_id":"1905.10498","title":"Cold Case: The Lost MNIST Digits","authors":["Chhavi Yadav","Léon Bottou"],"date_published":"2019-05-25 01:50:51+00:00","data_last_modified":"2019-11-04 21:05:26+00:00","url":"http://arxiv.org/abs/1905.10498v2","abstract":"Although the popular MNIST dataset [LeCun et al., 1994] is derived from the NIST database [Grother and Hanaoka, 1995], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the balance 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by Recht et al. [2018, 2019]: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. 
We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.","author_comment":"Final NeurIPS version","journal_ref":"None","doi":"None","primary_category":"cs.LG","categories":"['cs.LG', 'cs.CV', 'stat.ML']","individual_summary":"Title: Cold Case: The Lost MNIST Digits\nAuthors: Chhavi Yadav, Léon Bottou\nPaper abstract: Although the popular MNIST dataset [LeCun et al., 1994] is derived from the NIST database [Grother and Hanaoka, 1995], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the balance 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by Recht et al. [2018, 2019]: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.\nSummary: As the MNIST test set only contains 10,000 samples, concerns that further improvements are essentially overfitting on the test set have been voiced. Interestingly, MNIST was originally meant to have a test set of 60,000, as large as the training set, but the remaining 50,000 digits have been lost. The authors made many attempts to reconstruct the way MNIST was obtained from the NIST handwriting database as closely as possible and present QMNIST(v5) which features an additional 50,000 test images for MNIST, while the rest of the images are very close to the originals from MNIST. They test their dataset using multiple classification methods and find little difference in whether MNIST or QMNIST is used for training, but the test error on the additional 50,000 images is consistently higher than on the original 10,000 test images or their reconstruction of these. While the concerns about overuse of a test set are justified, the measured effects were mostly small and their relevance might be outweighed by the usefulness of paired differences for statistical model selection. \nMy opinion: I am confused about the overfitting part, as most methods they try (like ResNets) don't seem to have been selected for performance on the MNIST test set. Granted, LeNet seems to degrade more than other models, but it seems like the additional test images in QMNIST are actually harder to classify. This seems especially plausible with the previous summary in mind and because the authors mention a dichotomy between the ease of classification for NIST images generated by highschoolers vs government employees but don’t seem to mention any attempts to deal with potential selection bias.","paper_text":"","text":"HIGHLIGHTS\n[AI Research Considerations for Human Existential Safety](http://acritch.com/media/arches.pdf) *(Andrew Critch et al)* (summarized by Rohin): This research agenda out of CHAI directly attacks the problem longtermists care about: **how to prevent AI-related existential catastrophe**. 
This is distinctly different from the notion of being \"provably beneficial\": a key challenge for provable beneficence is defining what we even mean by \"beneficial\". In contrast, there are avenues for preventing AI-caused human extinction that do not require an understanding of \"beneficial\": most trivially, we could coordinate to never build AI systems that could cause human extinction.Since the focus is on the *impact* of the AI system, the authors need a new phrase for this kind of AI system. They define a **prepotent AI system** to be one that cannot be controlled by humanity **and** has the potential to transform the world in a way that is at least as impactful as humanity as a whole. Such an AI system need not be superintelligent, or even an AGI; it may have powerful capabilities in a narrow domain such as technological autonomy, replication speed, or social acumen that enable prepotence.By definition, a prepotent AI system is capable of transforming the world drastically. However, there are a lot of conditions that are necessary for continued human existence, and most transformations of the world will not preserve these conditions. (For example, consider the temperature of the Earth or the composition of the atmosphere.) As a result, human extinction is the *default* outcome from deploying a prepotent AI system, and can only be prevented if the system is designed to preserve human existence with very high precision relative to the significance of its actions. They define a misaligned prepotent AI system (MPAI) as one whose deployment leads to human extinction, and so the main objective is to avert the deployment of MPAI.The authors break down the risk of deployment of MPAI into five subcategories, depending on the beliefs, actions and goals of the developers. The AI developers could fail to predict prepotence, fail to predict misalignment, fail to coordinate with other teams on deployment of systems that aggregate to form an MPAI, accidentally (unilaterally) deploy MPAI, or intentionally (unilaterally) deploy MPAI. There are also hazardous social conditions that could increase the likelihood of risks, such as unsafe development races, economic displacement of humans, human enfeeblement, and avoidance of talking about x-risk at all.Moving from risks to solutions, the authors categorize their research directions along three axes based on the setting they are considering. First, is there one or multiple humans; second, is there one or multiple AI systems; and third, is it helping the human(s) comprehend, instruct, or control the AI system(s). So, multi/single instruction would involve multiple humans instructing a single AI system. While we will eventually need multi/multi, the preceding cases are easier problems from which we could gain insights that help solve the general multi/multi case. Similarly, comprehension can help with instruction, and both can help with control.The authors then go on to list 29 different research directions, which I'm not going to summarize here. |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Rohin's opinion:** I love the abstract and introduction, because of their directness at actually stating what we want and care about. I am also a big fan of the distinction between provably beneficial and reducing x-risk, and the single/multi analysis.The human fragility argument, as applied to generally intelligent agents, is a bit tricky. 
One interpretation is that the \"hardness\" stems from the fact that you need a bunch of \"bits\" of knowledge / control in order to keep humans around. However, it seems like a generally intelligent AI should easily be able to keep humans around \"if it wants\", and so the bits already exist in the AI. (As an analogy: we make big changes to the environment, but we could easily preserve deer habitats if we wanted to.) Thus, it is really a question of what \"distribution\" you expect the AI system is sampled from: if you think we'll build AI systems that try to do what humanity wants, then we're probably fine, but if you think that there will be multiple AI systems that each do what their users want, but the users have conflicts, the overall system seems more \"random\" in its goals, and so more likely to fall into the \"default\" outcome of human extinction. The research directions are very detailed, and while there are some suggestions that don't seem particularly useful to me, overall I am happy with the list. (And as the paper itself notes, what is and isn't useful depends on your models of AI development.) |\n\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| [Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text](https://arxiv.org/abs/2005.09382) *(Felix Hill et al)* (summarized by Nicholas): This paper proposes the Simulation-to-Human Instruction Following via Transfer from Text (SHIFTT) method for training an RL agent to receive commands from humans in natural language. One approach to this problem is to train an RL agent to respond to commands based on a template; however, this is not robust to small changes in how humans phrase the commands. In SHIFTT, you instead begin with a pretrained language model such as BERT and first feed the templated commands through the language model. This is then combined with vision inputs to produce a policy. The human commands are later fed through the same language model, and they find that the model has zero-shot transfer to the human commands even if they differ in structure. |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Nicholas's opinion:** Natural language is a very flexible and intuitive way to convey instructions to AI. In some ways, this shifts the alignment problem from the RL agent to the supervised language model, which just needs to learn how to correctly interpret the meaning behind human speech. One advantage of this approach is that the language model is separately trained so it can be tested and verified for safety criteria before being used to train an RL agent. It also may be more competitive than alternatives such as reward modeling that require training a new reward model for each task. I do see a couple downsides to this approach, however. The first is that humans are not perfect at conveying their values in natural language (e.g. King Midas wishing for everything he touches to turn to gold), and natural language may not have enough information to convey complex preferences. Even if humans give precise and correct commands, the language model needs to verifiably interpret those commands correctly. This could be difficult as current language models are difficult to interpret and contain many harmful biases. 
|\n\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| [Grounding Language in Play](https://language-play.github.io/) *(Corey Lynch et al)* (summarized by Robert): This paper presents a new approach to learning to follow natural language human instruction in a robotics setting. It builds on similar ideas to [Learning Latent Plans from Play](https://learning-from-play.github.io/) ([AN #65](https://mailchi.mp/3d4e6c2c206f/an-65learning-useful-skills-by-watching-humans-play)), in that it uses unsupervised \"play\" data (trajectories of humans playing on the robot with no goal in mind).The paper combines several ideas to enable training a policy which can follow natural language instructions with only limited human annotations. \\* In *Hindsight Instruction Pairing*, human annotators watch small trajectories from the play data, and label them with the instruction which is being completed in the clip. This instruction can take any form, and means we don't need to choose the instructions and ask humans to perform specific tasks.\\* *Multicontext Imitation Learning* is a method designed to allow goal-conditioned policies to be learned with multiple different types of goals. For example, we can have lots of example trajectories where the goal is an end state image (as these can be generated automatically without humans), and just a small amount of example trajectories where the goal is a natural language instruction (gathered using *Hindsight Instruction Pairing*). The approach is to learn a goal embedding network for each type of goal specification, and a single shared policy which takes the goal embedding as input.Combining these two methods enables them to train a policy and embedding networks end to end using imitation learning from a large dataset of (trajectory, image goal) pairs and a small dataset of (trajectory, natural language goal) pairs. The policy can follow very long sequences of natural language instructions in a fairly complex grasping environment with a variety of buttons and objects. Their method performs better than the Learning from Play (LfP) method, even though LfP uses a goal image as the goal conditioning, instead of a natural language instruction.Further, they propose that instead of learning the goal embedding for the natural language instructions, they use a pretrained large language model to produce the embeddings. This improves the performance of their method over learning the embedding from scratch, which the authors claim is the first example of the knowledge in large language models being transferred and improving performance in a robotics domain. This model also performs well when they create purposefully out of distribution natural language instructions (i.e. with weird synonyms, or google-translated from a different language). |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Robert's opinion:** I think this paper shows two important things:1. Embedding the natural language instructions in the same space as the image conditioning works well, and is a good way of extending the usefulness of human annotations.2. 
Large pretrained language models can be used to improve the performance of language-conditioned reinforcement learning (in this case imitation learning) algorithms and policies.Methods which enable us to scale human feedback to complex settings are useful, and this method seems like it could scale well, especially with the use of pretrained large language models which might reduce the amount of language annotations needed further. |\n\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| \n\n \n\n TECHNICAL AI ALIGNMENT\n\n\n MISCELLANEOUS (ALIGNMENT)\n[From ImageNet to Image Classification](https://gradientscience.org/benchmarks/) *(Dimitris Tsipras et al)* (summarized by Flo): ImageNet was crowdsourced by presenting images to MTurk workers who had to select images that contain a given class from a pool of images obtained via search on the internet. This is problematic, as an image containing multiple classes will basically get assigned to a random suitable class which can lead to deviations between ImageNet performance and actual capability to recognize images. The authors used MTurk and allowed workers to select multiple classes, as well as one main class for a given image in a pool of 10000 ImageNet validation images. Around 20% of the images seem to contain objects representing multiple classes and the average accuracy for these images was around 10% worse than average for a wide variety of image classifiers. While this is a significant drop, it is still way better than predicting a random class that is in the image. Also, advanced models were still able to predict the ImageNet label in cases where it does not coincide with the main class identified by humans, which suggest that they exploit biases in the dataset generation. While the accuracy of model predictions with respect to the newly identified main class still increased with better accuracy in predicting labels, the accuracy gap seems to grow and we might soon hit a point where gains in ImageNet accuracy don't correspond to improved image classification. **Read more:** [Paper: From ImageNet to Image Classification: Contextualizing Progress on Benchmarks](https://arxiv.org/abs/2005.11295) |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Flo's opinion:** I generally find these empiricial tests of whether ML systems actually do what they are assumed to do quite useful for better calibrating intuitions about the speed of AI progress, and to make failure modes more salient. While we have the latter, I am confused about what this means for AI progress: on one hand, this supports the claim that improved benchmark progress does not necessarily translate to better real world applicability. On the other hand, it seems like image classification might be easier than exploiting the dataset biases present in ImageNet, which would mean that we would likely be able to reach even better accuracy than on ImageNet for image classification with the right dataset. 
|\n\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| [Focus: you are allowed to be bad at accomplishing your goals](https://www.alignmentforum.org/posts/X5WTgfX5Ly4ZNHWZD/focus-you-are-allowed-to-be-bad-at-accomplishing-your-goals) *(Adam Shimi)* (summarized by Rohin): [Goal-directedness](https://www.alignmentforum.org/posts/DfcywmqRSkBaCB6Ma/intuitions-about-goal-directed-behavior) ([AN #35](https://mailchi.mp/bbd47ba94e84/alignment-newsletter-35)) is one of the key drivers of AI risk: it's the underlying factor that leads to [convergent instrumental subgoals](https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf). However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, goal-directedness should not be tied directly to *competence*. So, instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G to be the distance of the system’s policy to the constructed set of policies. |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Rohin's opinion:** I certainly agree that we should not require full competence in order to call a system goal-directed. I am less convinced of the particular construction here: current RL policies are typically terrible at generalization, and tabular SARSA explicitly doesn’t even *try* to generalize, whereas I see generalization as a key feature of goal-directedness.You could imagine the RL policies get more resources and so are able to understand the whole environment without generalization, e.g. if they get to update on every state at least once. However, in this case realistic goal-directed policies would be penalized for “not knowing what they should have known”. For example, suppose I want to eat sweet things, and I come across a new fruit I’ve never seen before. So I try the fruit, and it turns out it is very bitter. This would count as “not being goal-directed”, since the RL policies for “eat sweet things” would already know that the fruit is bitter and so wouldn’t eat it. |\n\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| \n\n \n\n OTHER PROGRESS IN AI\n\n\n DEEP LEARNING\n[Identifying Statistical Bias in Dataset Replication](http://gradientscience.org/data_rep_bias/) *(Logan Engstrom et al)* (summarized by Flo): One way of dealing with finite and fixed test sets and the resulting possibility of overfitting on the test set is dataset replication, where one tries to closely mimic the original process of dataset creation to obtain a larger test set. This can lead to bias if the difficulty of the new test images is distributed differently than in the original test set. A previous attempt at [dataset replication on ImageNet](https://arxiv.org/abs/1902.10811) tried to get around this by measuring how often humans under time pressure correctly answered a yes/no question about an image's class (dubbed selection frequency), which can be seen as a proxy for classification difficulty. 
This data was then used to sample candidate images for every class which match the distribution of difficulty in the original test set. Still, all tested models performed worse on the replicated test set than on the original. Parts of this bias can be explained by noisy measurements combined with disparities in the initial distribution of difficulty, which are likely as the original ImageNet data was prefiltered for quality. Basically, the more noisy our estimates for the difficulty are, the more the original distribution of difficulty matters. As an extreme example, imagine a class for which all images in the original test set have a selection frequency of 100%, but 90% of candidates in the new test set have a selection frequency of 50%, while only 10% are as easy to classify as the images in the original test set. Then, if we only use a single human annotator, half of the difficult images in the candidate pool are indistinguishable from the easy ones, such that most images ending up in the new test set are more difficult to classify than the original ones, even after the adjustment.The authors then replicate the ImageNet dataset replication with varying amounts of annotators and find that the gap in accuracy between the original and the new test set progressively shrinks with reduced noise from 11.7% with one annotator to 5.7% with 40. Lastly, they discuss more sophisticated estimators for accuracy to further lower bias, which additionally decreases the accuracy gap down to around 3.5%. |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Flo's opinion:** This was a pretty interesting read and provides evidence against large effects of overfitting on the test set. On the other hand, results like this also seem to highlight how benchmarks are mostly useful for model comparison, and how nonrobust they can be to fairly benign distributional shift. |\n\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| [Cold Case: The Lost MNIST Digits](https://arxiv.org/abs/1905.10498) *(Chhavi Yadav et al)* (summarized by Flo): As the MNIST test set only contains 10,000 samples, concerns that further improvements are essentially overfitting on the test set have been voiced. Interestingly, MNIST was originally meant to have a test set of 60,000, as large as the training set, but the remaining 50,000 digits have been lost. The authors made many attempts to reconstruct the way MNIST was obtained from the NIST handwriting database as closely as possible and present QMNIST(v5) which features an additional 50,000 test images for MNIST, while the rest of the images are very close to the originals from MNIST. They test their dataset using multiple classification methods and find little difference in whether MNIST or QMNIST is used for training, but the test error on the additional 50,000 images is consistently higher than on the original 10,000 test images or their reconstruction of these. While the concerns about overuse of a test set are justified, the measured effects were mostly small and their relevance might be outweighed by the usefulness of paired differences for statistical model selection. |\n\n\n |\n\n\n\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| **Flo's opinion:** I am confused about the overfitting part, as most methods they try (like ResNets) don't seem to have been selected for performance on the MNIST test set. Granted, LeNet seems to degrade more than other models, but it seems like the additional test images in QMNIST are actually harder to classify. 
I believe that the issues discussed in the previous summary are responsible for most of the performance gap, especially since the authors mention a dichotomy between the ease of classification for NIST images generated by highschoolers vs government employees but don't seem to mention any attempts to deal with potential selection bias. |\n\n |\n\n\n |\n\n |\n\n |\n| \n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| FEEDBACK\nI'm always happy to hear feedback; you can send it to me, [Rohin Shah](https://rohinshah.com/), by **replying to this email**.\n |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| PODCAST\nAn audio podcast version of the **Alignment Newsletter** is available. This podcast is an audio version of the newsletter, recorded by [Robert Miles](http://robertskmiles.com).\n**Subscribe here:**\n\n[RSS Feed](http://alignment-newsletter.libsyn.com/rss)[Google Podcasts](https://podcasts.google.com/?feed=aHR0cDovL2FsaWdubWVudC1uZXdzbGV0dGVyLmxpYnN5bi5jb20vcnNz)[Spotify Podcasts](https://open.spotify.com/show/5pwApVP0wr1Q61S4LmONuX)[Apple Podcasts](https://podcasts.apple.com/us/podcast/alignment-newsletter-podcast/id1489248000) |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| *Copyright © 2020 Alignment Newsletter, All rights reserved.*\n\n**"}
abstract: |
Although the popular MNIST dataset [@mnist] is derived from the NIST database [@nist-sd19], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the balance 50,000 were never distributed, they can be used to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our limited results unambiguously confirm the trends observed by @recht2018cifar [@pmlr-v97-recht19a]: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.
author:
- |
Chhavi Yadav
New York University
New York, NY
chhavi@nyu.edu
- |
Léon Bottou
Facebook AI Research
and New York University
New York, NY
leon@bottou.org
bibliography:
- qmnist.bib
title: "Cold Case: the Lost MNIST Digits"
Introduction
The MNIST dataset [@mnist; @bottou-cortes-94] has been used as a standard machine learning benchmark for more than twenty years. During the last decade, many researchers have expressed the opinion that this dataset has been overused. In particular, the small size of its test set, merely 10,000 samples, has been a cause of concern. Hundreds of publications report increasingly good performance on this same test set. Did they overfit the test set? Can we trust any new conclusion drawn on this dataset? How quickly do machine learning datasets become useless?
The first partitions of the large NIST handwritten character collection [@nist-sd19] had been released one year earlier, with a training set written by 2000 Census Bureau employees and a substantially more challenging test set written by 500 high school students. One of the objectives of LeCun, Cortes, and Burges was to create a dataset with similarly distributed training and test sets. The process they describe produces two sets of 60,000 samples. The test set was then downsampled to only 10,000 samples, possibly because manipulating such a dataset with the computers of the times could be annoyingly slow. The remaining 50,000 test samples have since been lost.
The initial purpose of this work was to recreate the MNIST preprocessing algorithms in order to trace back each MNIST digit to its original writer in NIST. This reconstruction was first based on the available information and then considerably improved by iterative refinements. Section 2 describes this process and measures how closely our reconstructed samples match the official MNIST samples. The reconstructed training set contains 60,000 images matching each of the MNIST training images. Similarly, the first 10,000 images of the reconstructed test set match each of the MNIST test set images. The next 50,000 images are a reconstruction of the 50,000 lost MNIST test images.[^1]
In the same spirit as [@recht2018cifar; @pmlr-v97-recht19a], the rediscovery of the 50,000 lost MNIST test digits provides an opportunity to quantify the degradation of the official MNIST test set over a quarter-century of experimental research. Section 3 compares and discusses the performances of well known algorithms measured on the original MNIST test samples, on their reconstructions, and on the reconstructions of the 50,000 lost test samples. Our results provide a well controlled confirmation of the trends identified by @recht2018cifar [@pmlr-v97-recht19a] on a different dataset.
::: {#fig:mnist}
> The original NIST test contains 58,527 digit images written by 500 different writers. In contrast to the training set, where blocks of data from each writer appeared in sequence, the data in the NIST test set is scrambled. Writer identities for the test set is available and we used this information to unscramble the writers. We then split this NIST test set in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each.
>
> The new training set was completed with enough samples from the old NIST training set, starting at pattern #0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with old training examples starting at pattern #35,000 to make a full set with 60,000 test patterns. All the images were size normalized to fit in a 20 x 20 pixel box, and were then centered to fit in a 28 x 28 image using center of gravity. Grayscale pixel values were used to reduce the effects of aliasing. These are the training and test sets used in the benchmarks described in this paper. In this paper, we will call them the MNIST data.

Figure 1: The two paragraphs of [@bottou-cortes-94] describing the MNIST creation process.
:::
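The recipe quoted in Figure 1 can be made concrete with a short sketch. The code below is only our illustration of a generic crop / resize-to-20x20 / center-of-mass-centering pipeline, not the recovered MNIST algorithm: the interpolation call, the rounding choices, and the helper name `mnist_style_preprocess` are all assumptions, and Section 2 explains why these details matter.

```python
import numpy as np
from scipy import ndimage

def mnist_style_preprocess(binary_img):
    """Sketch of an MNIST-style preprocessing step: crop the character,
    fit it into a 20x20 box with antialiased resampling, then center it
    by center of mass inside a 28x28 frame (hypothetical reimplementation)."""
    ink = binary_img > 0
    rows, cols = np.any(ink, axis=1), np.any(ink, axis=0)
    crop = ink[rows.argmax():len(rows) - rows[::-1].argmax(),
               cols.argmax():len(cols) - cols[::-1].argmax()].astype(float)

    # Scale the longest side down to 20 pixels; linear interpolation stands in
    # for the unknown original resampling and produces the gray antialiasing levels.
    scale = 20.0 / max(crop.shape)
    small = np.clip(ndimage.zoom(crop, scale, order=1), 0.0, 1.0)

    # Paste into a 28x28 frame so the center of mass lands near the image center.
    h, w = small.shape
    total = small.sum()
    cy = (np.arange(h)[:, None] * small).sum() / total
    cx = (np.arange(w)[None, :] * small).sum() / total
    top = max(0, min(int(round(13.5 - cy)), 28 - h))
    left = max(0, min(int(round(13.5 - cx)), 28 - w))
    out = np.zeros((28, 28))
    out[top:top + h, left:left + w] = small
    return np.rint(out * 255).astype(np.uint8)
```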
Recreating MNIST {#sec:reconstruction}
Recreating the algorithms that were used to construct the MNIST dataset is a challenging task. Figure 1 shows the two paragraphs that describe this process in [@bottou-cortes-94]. Although this was the first paper mentioning MNIST, the creation of the dataset predates this benchmarking effort by several months.[^2] Curiously, this description incorrectly reports the number of digits in the hsf4 partition, that is, the original NIST testing set, as 58,527 instead of 58,646.[^3]
These two paragraphs give a relatively precise recipe for selecting the 60,000 digits that compose the MNIST training set. Alas, applying this recipe produces a set that contains one more zero and one less eight than the actual MNIST training set. Although they do not match, these class distributions are too close to make it plausible that 119 digits were really missing from the hsf4 partition.
The description of the image processing steps is much less precise. How are the 128x128 binary NIST images cropped? Which heuristics, if any, are used to disregard noisy pixels that do not belong to the digits themselves? How are rectangular crops centered in a square image? How are these square images resampled to 20x20 gray level images? How are the coordinates of the center of gravity rounded for the final centering step?
An iterative process
Our initial reconstruction algorithms were informed by the existing description and, crucially, by our knowledge of a mysterious resampling algorithm found in ancient parts of the Lush codebase: instead of using a bilinear or bicubic interpolation, this code computes the exact overlap of the input and output image pixels. [^4]
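As an illustration of that resampling style, here is a minimal sketch of pixel-overlap (area-weighted) downsampling under our reading of the idea; the function `overlap_resample` and its conventions are ours, not the original Lush routine.

```python
import numpy as np

def overlap_resample(img, out_h, out_w):
    """Area-weighted resampling: every output pixel averages the input pixels
    it covers, weighted by the exact geometric overlap (a sketch of the idea,
    not the original Lush implementation)."""
    in_h, in_w = img.shape

    def overlap_matrix(n_in, n_out):
        # M[i, j] = length of the overlap between output interval i and
        # input interval j, with both axes mapped onto [0, 1].
        edges_in = np.linspace(0.0, 1.0, n_in + 1)
        edges_out = np.linspace(0.0, 1.0, n_out + 1)
        lo = np.maximum(edges_out[:-1, None], edges_in[None, :-1])
        hi = np.minimum(edges_out[1:, None], edges_in[None, 1:])
        return np.clip(hi - lo, 0.0, None)

    rows = overlap_matrix(in_h, out_h)     # shape (out_h, in_h)
    cols = overlap_matrix(in_w, out_w)     # shape (out_w, in_w)
    out = rows @ img @ cols.T              # overlap-weighted sums
    norm = rows.sum(axis=1)[:, None] * cols.sum(axis=1)[None, :]
    return out / norm
```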
Although our first reconstructed dataset, dubbed QMNISTv1, behaves very much like MNIST in machine learning experiments, its digit images could not be reliably matched to the actual MNIST digits. In fact, because many digits have similar shapes, we must rely on subtler details such as the anti-aliasing pixel patterns. It was however possible to identify a few matches. For instance we found that the lightest zero in the QMNIST training set matches the lightest zero in the MNIST training set. We were able to reproduce their antialiasing patterns by fine-tuning the initial centering and resampling algorithms, leading to QMNISTv2.
We then found that the smallest $L_2$ distance between MNIST digits and jittered QMNIST digits was a reliable match indicator. Running the Hungarian assignment algorithm on the two training sets gave good matches for most digits. A careful inspection of the worst matches allowed us to further tune the cropping algorithms, and to discover, for instance, that the extra zero in the reconstructed training set was in fact a duplicate digit that the MNIST creators had identified and removed. The ability to obtain reliable matches allowed us to iterate much faster and explore more aspects of the image processing algorithm space, leading to QMNISTv3, v4, and v5. Note that all this tuning was achieved by matching training set images only.
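For readers who want to reproduce this kind of matching, the sketch below runs the Hungarian algorithm (via `scipy.optimize.linear_sum_assignment`) on plain $L_2$ distances between two small image sets. Our actual pipeline uses jittered distances and must block or prune the full 60,000 x 60,000 problem to remain tractable, so treat this as a simplified stand-in; the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_images(set_a, set_b):
    """One-to-one matching of two equally sized image sets by minimizing the
    total pairwise L2 distance (Hungarian algorithm)."""
    a = np.asarray(set_a, dtype=float).reshape(len(set_a), -1)
    b = np.asarray(set_b, dtype=float).reshape(len(set_b), -1)
    # Pairwise squared L2 distances via |a - b|^2 = |a|^2 - 2 a.b + |b|^2.
    cost = (a ** 2).sum(1)[:, None] - 2 * a @ b.T + (b ** 2).sum(1)[None, :]
    row_ind, col_ind = linear_sum_assignment(cost)
    # For a square cost matrix row_ind is sorted, so col_ind[i] is the
    # set_b image assigned to set_a image i.
    return col_ind
```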
![Figure 2: Side-by-side display of the first sixteen digits in the MNIST and QMNIST training set. The magnified view of the first one (MNIST \#0, NIST \#229421) illustrates the correct reconstruction of the antialiased pixels.](trainmag.png){#fig:train20 width="\linewidth"}
This seemingly pointless quest for an exact reconstruction was surprisingly addictive. Supposedly urgent tasks could be indefinitely delayed with this important procrastination pretext. Since all good things must come to an end, we eventually had to freeze one of these datasets and call it QMNIST.
::: {#tbl:qmnist}
                                     Min    25%    Med    75%    Max
  Jittered $L_2$ distance              0    7.1    8.7   10.5   17.3
  Jittered $L_\infty$ distance         0      1      1      1      3

: [[tbl:qmnist]]{#tbl:qmnist label="tbl:qmnist"} Quartiles of the jittered $L_2$ and $L_\infty$ distances between each MNIST training image and its QMNIST reconstruction.
:::
::: {#tbl:jitter}
  Jitter                 $\mathbf{0}$ pixels   $\mathbf{\pm1}$ pixels
  Number of matches                    59853                      147

: [[tbl:jitter]]{#tbl:jitter label="tbl:jitter"} Count of training samples for which the MNIST and QMNIST images align best without translation or with a $\pm1$ pixel translation.
:::
Evaluating the reconstruction quality {#eval}
Although the QMNIST reconstructions are closer to the MNIST images than we had envisioned, they remain imperfect.
Table 2 indicates that about $0.25\%$ of the QMNIST training set images are shifted by one pixel relative to their MNIST counterpart. This occurs when the center of gravity computed during the last centering step (see Figure 1) is very close to a pixel boundary. Because the image reconstruction is imperfect, the reconstructed center of gravity sometimes lands on the other side of the pixel boundary, and the alignment code shifts the image by a whole pixel.
Table 1 gives the quartiles of the $L_2$ and $L_\infty$ distances between the MNIST and QMNIST images, after accounting for these occasional single pixel shifts. An $L_2$ distance of $255$ would indicate a full pixel of difference. The $L_\infty$ distance represents the largest difference between image pixels, expressed as integers in range $0\dots 255$.
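The statistics of Tables 1 and 2 can be recomputed from matched image pairs with a few lines of code. The sketch below is an illustration using our conventions ($L_2$ taken as the root of the summed squared differences, so that a distance of 255 corresponds to one full pixel of difference), not the exact script used for the paper.

```python
import numpy as np

def best_shift_stats(a, b, max_shift=1):
    """Return (shift_magnitude, L2, Linf) for the +/- max_shift translation
    of b that best aligns with a (both 28x28 arrays with values in 0..255)."""
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            diff = a.astype(int) - np.roll(b, (dy, dx), axis=(0, 1)).astype(int)
            l2 = np.sqrt((diff ** 2).sum())
            if best is None or l2 < best[1]:
                best = (max(abs(dy), abs(dx)), l2, np.abs(diff).max())
    return best

def summarize(pairs):
    """Given matched (mnist, qmnist) pairs, count how many need a one-pixel
    shift and compute the quartiles of the jittered L2 / Linf distances."""
    stats = np.array([best_shift_stats(a, b) for a, b in pairs])
    shifted = int((stats[:, 0] > 0).sum())
    quartiles = np.percentile(stats[:, 1:], [0, 25, 50, 75, 100], axis=0)
    return shifted, quartiles
```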
::: {#tbl:lenet5}
  Test on            MNIST                     QMNIST10K                 QMNIST50K
  Train on MNIST     $0.82\%$ ($\pm 0.2\%$)    $0.81\%$ ($\pm 0.2\%$)    $1.08\%$ ($\pm 0.1\%$)
  Train on QMNIST    $0.81\%$ ($\pm 0.2\%$)    $0.80\%$ ($\pm 0.2\%$)    $1.08\%$ ($\pm 0.1\%$)

: [[tbl:lenet5]]{#tbl:lenet5 label="tbl:lenet5"} Misclassification rates of a Lenet5 convolutional network trained on both the MNIST and QMNIST training sets and tested on the MNIST test set, on the 10K QMNIST testing examples matching the MNIST testing set, and on the 50K remaining QMNIST testing examples.
:::
In order to further verify the reconstruction quality, we trained a variant of the Lenet5 network described by @lecun-98h. Its original implementation is still available as a demonstration in the Lush codebase. Lush [@lush] descends from the SN neural network software [@bottou-lecun-88] and from its AT&T Bell Laboratories variants developed in the nineties. This particular variant of Lenet5 omits the final Euclidean layer described in [@lecun-98h] without incurring a performance penalty. Following the pattern set by the original implementation, the training protocol consists of three sets of 10 epochs with global stepsizes $10^{-4}$, $10^{-5}$, and $10^{-6}$. Each set starts with estimating the diagonal of the Hessian. Per-weight stepsizes are then computed by dividing the global stepsize by the estimated curvature plus 0.02. Table 3 reports insignificant differences when one trains with the MNIST or QMNIST training set, or tests with the MNIST test set or the matching part of the QMNIST test set. On the other hand, we observe a more substantial difference when testing on the remaining part of the QMNIST test set, that is, the reconstructions of the lost MNIST test digits. Such discrepancies will be discussed more precisely in Section 3.
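The training protocol is easier to follow written out as code. The sketch below reproduces the schedule (three sets of 10 epochs with global stepsizes $10^{-4}$, $10^{-5}$, $10^{-6}$, and per-weight stepsizes equal to the global stepsize divided by curvature plus 0.02), but it substitutes averaged squared gradients for the diagonal Hessian estimate of the original Lush implementation; that substitution and the function names are ours.

```python
import torch

def estimate_curvature(model, loss_fn, loader, n_batches=30):
    """Crude per-parameter curvature proxy: averaged squared gradients over a
    few batches. The original Lenet5 code instead estimates the diagonal of
    the Hessian with a dedicated second-order backprop pass."""
    curv = [torch.zeros_like(p) for p in model.parameters()]
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for c, p in zip(curv, model.parameters()):
            c += p.grad.detach() ** 2 / n_batches
    return curv

def train_lenet5_style(model, loss_fn, loader):
    """Three sets of 10 epochs with global stepsizes 1e-4, 1e-5, 1e-6;
    per-weight stepsizes divide the global stepsize by curvature + 0.02."""
    for global_lr in (1e-4, 1e-5, 1e-6):
        curv = estimate_curvature(model, loss_fn, loader)
        lrs = [global_lr / (c + 0.02) for c in curv]
        for _ in range(10):
            for x, y in loader:
                model.zero_grad()
                loss_fn(model(x), y).backward()
                with torch.no_grad():
                    for p, lr in zip(model.parameters(), lrs):
                        p -= lr * p.grad
```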
MNIST trivia
The reconstruction effort allowed us to uncover a lot of previously unreported facts about MNIST.
- There are exactly three duplicate digits in the entire NIST handwritten character collection. Only one of them falls in the segments used to generate MNIST but was removed by the MNIST authors.
- The first 5001 images of the MNIST test set seem randomly picked from those written by writers #2350-#2599, all high school students. The next 4999 images are the consecutive NIST images #35,000-#39,998, in this order, written by only 48 Census Bureau employees, writers #326-#373, as shown in Figure 5. Although this small number could make us fear for statistical significance, these comparatively very clean images contribute little to the total test error.
- Even-numbered images among the first 58,100 MNIST training set samples exactly match the digits written by writers #2100-#2349, all high school students, in random order. The remaining images are the NIST images #0 to #30949, in that order. The beginning of this sequence is visible in Figure 2. Therefore, half of the images found in a typical minibatch of consecutive MNIST training images are likely to have been written by the same writer. We can only recommend shuffling the training set before assembling the minibatches.
- There is a rounding error in the final centering of the 28x28 MNIST images. The average center of mass of an MNIST digit is in fact located half a pixel away from the geometrical center of the image. This is important because training on correctly centered images yields substantially worse performance on the standard MNIST testing set.
- A slight defect in the MNIST resampling code generates low amplitude periodic patterns in the dark areas of thick characters. These patterns, illustrated in Figure [fig:waves], can be traced to a 0.99 fudge factor that is still visible in the Lush legacy code.[^5] Since the period of these patterns depends on the sizes of the input images passed to the resampling code, we were able to determine that the small NIST images were not upsampled by directly calling the resampling code, but by first doubling their resolution, then downsampling to size 20x20.
- Converting the continuous-valued pixels of the subsampled images into integer-valued pixels is delicate. Our code linearly maps the range observed in each image to the interval [0.0, 255.0], rounding to the closest integer (see the sketch after this list). Comparing the pixel histograms (see Figure [fig:pixels]) reveals that MNIST has substantially more pixels with value 128 and fewer pixels with value 255. We could not think of a plausibly simple algorithm compatible with this observation.
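Written out, the quantization rule used by our reconstruction is the short sketch below; the rule used to produce the original MNIST pixel values remains unknown, which is precisely the point of the last observation above.

```python
import numpy as np

def quantize(img):
    """Map the observed range of a continuous-valued image linearly onto
    [0.0, 255.0] and round to the nearest integer (our reconstruction's rule)."""
    lo, hi = float(img.min()), float(img.max())
    if hi == lo:
        return np.zeros_like(img, dtype=np.uint8)
    return np.rint((img - lo) / (hi - lo) * 255.0).astype(np.uint8)
```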
(Figures: Figure 5 [fig:histogram], the writers of the MNIST test set images; Figure [fig:waves], the periodic patterns in the dark areas of thick characters; Figure [fig:pixels], the pixel value histograms of MNIST and QMNIST.)
Generalization Experiments {#sec:genex}
This section takes advantage of the reconstruction of the lost 50,000 testing samples to revisit some MNIST performance results reported during the last twenty-five years. @recht2018cifar [@pmlr-v97-recht19a] perform a similar study on the CIFAR10 and ImageNet datasets and identify very interesting trends. However they also explain that they cannot fully ascertain how closely the distribution of the reconstructed dataset matches the distribution of the original dataset, raising the possibility of the reconstructed dataset being substantially harder than the original. Because the published MNIST test set was subsampled from a larger set, we have a much tighter control of the data distribution and can confidently confirm their findings.
Because the MNIST testing error rates are usually low, we start with a careful discussion of the computation of confidence intervals and of the statistical significance of error comparisons in the context of repeated experiments. We then report on MNIST results for several methods: k-nearest neighbors (KNN), support vector machines (SVM), multilayer perceptrons (MLP), and several flavors of convolutional networks (CNN).
About confidence intervals {#confint}
Since we want to know whether the actual performance of a learning system differs from the performance estimated using an overused testing set with run-of-the-mill confidence intervals, all confidence intervals reported in this work were obtained using the classic Wald method: when we observe $n_1$ misclassifications out of $n$ independent samples, the error rate $\nu=n_1/n$ is reported with confidence $1{-}\eta$ as
$$\label{eq:wald}
\nu ~\pm~ z \sqrt{\frac{\nu(1-\nu)}{n}}~,$$
where $z=\sqrt{2}\,\mathrm{erfc}^{-1}(\eta)$ is approximately equal to 2 for a 95% confidence interval. For instance, an error rate close to $1.0\%$ measured on the usual 10,000 test examples is reported as a $1\%\pm0.2\%$ error rate, that is, $100\pm20$ misclassifications. This approach is widely used despite the fact that it only holds for a single use of the testing set and that it relies on an imperfect central limit approximation.
The simplest way to account for repeated uses of the testing set is the Bonferroni correction [@bonferroni-1936], that is, dividing $\eta$ by the number $K$ of potential experiments, simultaneously defined before performing any measurement. Although relaxing this simultaneity constraint progressively requires all the apparatus of statistical learning theory [@vapnik-82 §6.3], the correction still takes the form of a divisor $K$ applied to confidence level $\eta$. Because of the asymptotic properties of the ${\mathrm{erfc}}$ function, the width of the actual confidence intervals essentially grows like $\sqrt{\log(K)}$.
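In code, the Wald interval and its Bonferroni-corrected variant look as follows; this is a small illustration of equation \eqref{eq:wald} with the divisor $K$ applied to $\eta$, and the helper name is ours.

```python
import numpy as np
from scipy.special import erfcinv

def wald_interval(n_errors, n, eta=0.05, K=1):
    """Wald interval for an error rate at confidence 1 - eta, with an optional
    Bonferroni divisor K for K simultaneous experiments."""
    nu = n_errors / n
    z = np.sqrt(2.0) * erfcinv(eta / K)
    return nu, z * np.sqrt(nu * (1.0 - nu) / n)

print(wald_interval(100, 10000))        # about 1% +/- 0.2%, as quoted above
print(wald_interval(200, 10000, K=50))  # about 2% +/- 0.5% after 50 reuses
```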
In order to complete this picture, one also needs to take into account the benefits of using the same testing set. Ordinary confidence intervals are overly pessimistic when we merely want to know whether a first classifier with error rate $\nu_1=n_1/n$ is worse than a second classifier with error rate $\nu_2=n_2/n$. Because these error rates are measured on the same test samples, we can instead rely on a pairing argument: the first classifier can be considered worse with confidence $1{-}\eta$ when
$$\label{eq:paired}
\nu_1 - \nu_2 ~=~ \frac{n_{12}-n_{21}}{n} ~~\geq~~ z\, \frac{\sqrt{n_{12}+n_{21}}}{n}~,$$
where $n_{12}$ represents the count of examples misclassified by the first classifier but not the second classifier, $n_{21}$ is the converse, and $z=\sqrt{2}\,\mathrm{erfc}^{-1}(2\eta)$ is approximately $1.7$ for a 95% confidence. For instance, four additional misclassifications out of 10,000 examples are sufficient to make such a determination. This corresponds to a difference in error rate of $0.04\%$, roughly ten times smaller than what would be needed to observe disjoint error bars \eqref{eq:wald}. This advantage becomes very significant when combined with a Bonferroni-style correction: $K$ pairwise comparisons remain simultaneously valid with confidence $1{-}\eta$ if all comparisons satisfy
$$n_{12}-n_{21} ~\geq~ \sqrt{2}\;\mathrm{erfc}^{-1}\!\left(\frac{2\eta}{K}\right)\,\sqrt{n_{12}+n_{21}}~.$$
For instance, in the realistic situation
$$n=10000\,,~~ n_1=200\,,~~ n_{12}=40\,,~~ n_{21}=10\,,~~ n_2=n_1-n_{12}+n_{21}=170\,,$$
the conclusion that classifier 1 is worse than classifier 2 remains valid with confidence 95% as long as it is part of a series of $K{\leq}4545$ pairwise comparisons. In contrast, after merely $K{=}50$ experiments, the 95% confidence interval for the absolute error rate of classifier 1 is already $2\%\pm0.5\%$, too large to distinguish it from the error rate of classifier 2. We should therefore expect that repeated model selection on the same test set leads to decisions that remain valid far longer than the corresponding absolute error rates.[^6]
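The paired criterion is equally easy to evaluate. The sketch below checks the inequality above and also computes the largest number of simultaneous comparisons for which a given outcome remains significant, reproducing the order of magnitude of the worked example; the helper names are ours.

```python
import numpy as np
from scipy.special import erfc, erfcinv

def paired_better(n12, n21, eta=0.05, K=1):
    """Is classifier 1 worse than classifier 2 with confidence 1 - eta, after a
    Bonferroni correction for K comparisons? n12 / n21 count examples
    misclassified by one classifier but not the other."""
    threshold = np.sqrt(2.0) * erfcinv(2.0 * eta / K) * np.sqrt(n12 + n21)
    return (n12 - n21) >= threshold

def max_simultaneous_comparisons(n12, n21, eta=0.05):
    """Largest K for which the paired comparison above still holds."""
    return 2.0 * eta / erfc((n12 - n21) / np.sqrt(2.0 * (n12 + n21)))

# The worked example from the text: n12 = 40, n21 = 10.
print(paired_better(40, 10, K=1000))         # still significant
print(max_simultaneous_comparisons(40, 10))  # a few thousand comparisons
```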
(Figures [fig:knn], [fig:svm], and 11 [fig:mlp]: testing error rates of the KNN, SVM, and MLP experiments discussed below.)
Results {#sec:experiments}
We report results using two training sets, namely the MNIST training set and the QMNIST reconstructions of the MNIST training digits, and three testing sets, namely the official MNIST testing set with 10,000 samples (MNIST), the reconstruction of the official MNIST testing digits (QMNIST10K), and the reconstruction of the lost 50,000 testing samples (QMNIST50K). We use the names TMTM, TMTQ10, TMTQ50 to identify results measured on these three testing sets after training on the MNIST training set. Similarly we use the names TQTM, TQTQ10, and TQTQ50, for results obtained after training on the QMNIST training set and testing on the three test sets. None of these results involves data augmentation or preprocessing steps such as deskewing, noise removal, blurring, jittering, elastic deformations, etc.
Figure [fig:knn] (left plot) reports the testing error rates obtained with KNN for various values of the parameter $k$ using the MNIST training set as reference points. The QMNIST50K results are slightly worse but within the confidence intervals. The best $k$ determined on MNIST is also the best $k$ for QMNIST50K. Figure [fig:knn] (right plot) reports similar results and conclusions when using the QMNIST training set as a reference point.
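A minimal version of such a KNN experiment can be run with off-the-shelf tools. The sketch below assumes the QMNIST loader shipped with recent torchvision releases and its `what="test50k"` split (both are assumptions about the library, not something established in this paper) and uses scikit-learn's KNeighborsClassifier; it is slow but straightforward.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from torchvision import datasets

def as_arrays(ds):
    # Each dataset item is a (PIL image, integer label) pair.
    xs = np.stack([np.asarray(img, dtype=np.float32).ravel() / 255.0 for img, _ in ds])
    ys = np.array([int(label) for _, label in ds])
    return xs, ys

train = datasets.QMNIST("data", what="train", download=True)      # assumed API
test50k = datasets.QMNIST("data", what="test50k", download=True)  # the 50k new digits

x_tr, y_tr = as_arrays(train)
x_te, y_te = as_arrays(test50k)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, as in Table 4
knn.fit(x_tr, y_tr)
err = 1.0 - knn.score(x_te, y_te)
print(f"KNN (k=3) error on the 50k new test digits: {err:.2%}")  # roughly 3%
```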
Figure [fig:svm] reports testing error rates obtained with RBF kernel SVMs after training on the MNIST training set with various values of the hyperparameters $C$ and $g$. The QMNIST50K results are consistently higher but still fall within the confidence intervals, except maybe for mis-regularized models. Again, the hyperparameters achieving the best MNIST performance also achieve the best QMNIST50K performance.
Figure 11 (left plot) provides similar results for a single hidden layer multilayer network with various hidden layer sizes, averaged over five runs. The QMNIST50K results again appear consistently worse than the MNIST test set results. On the one hand, the best QMNIST50K performance is achieved for a network with 1100 hidden units whereas the best MNIST testing error is achieved by a network with 700 hidden units. On the other hand, all networks with 300 to 1100 hidden units perform very similarly on both MNIST and QMNIST50K, as can be seen in the plot. A 95% confidence interval paired test on representative runs reveals no statistically significant differences between the MNIST test performances of these networks. Each point in Figure 11 (right plot) gives the MNIST and QMNIST50K testing error rates of one MLP experiment. This plot includes experiments with several hidden layer sizes and also several minibatch sizes and learning rates. We were only able to replicate the 1.6% error rate reported by @lecun-98h using minibatches of five or fewer examples.
{#fig:cluster width="0.8\linewidth"}
Finally, Figure 12{reference-type="ref" reference="fig:cluster"} summarizes all the experiments reported above. It also includes several flavors of convolutional networks: the Lenet5 results were already presented in Table 3{reference-type="ref" reference="tbl:lenet5"}, while the VGG-11 [@simonyan2014very] and ResNet-18 [@he2016deep] results are representative of the modern CNN architectures currently popular in computer vision. We also report results obtained with four models from the TF-KR MNIST challenge.[^7] Model TFKR-a[^8] is an ensemble of two VGG-like and one ResNet-like models trained on an augmented version of the MNIST training set. Models TFKR-b[^9], TFKR-c[^10], and TFKR-d[^11] are single CNN models with varied architectures. This scatter plot shows that the QMNIST50K error rates are consistently slightly higher than the MNIST testing errors. However, the plot also shows that comparing the MNIST testing set performances of various models provides a near-perfect ranking of the corresponding QMNIST50K performances. In particular, the best performing model on MNIST, TFKR-a, remains the best performing model on QMNIST50K.
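As a quick, hedged illustration of this ranking stability, the snippet below computes the Spearman rank correlation between the MNIST and QMNIST50K error rates of the four TF-KR models, using the numbers from the supplementary table tb6:kaggle; this is not the paper's analysis, which compares all models in the scatter plot.

```python
# Rank correlation between MNIST and QMNIST50K error rates for the four TF-KR
# models, using the percentages listed in the supplementary table (tb6:kaggle).
from scipy.stats import spearmanr

models       = ["TFKR-a", "TFKR-b", "TFKR-c", "TFKR-d"]
mnist_err    = [0.24, 0.86, 0.46, 0.58]   # % error on the MNIST test set
qmnist50_err = [0.32, 0.97, 0.56, 0.69]   # % error on QMNIST50K
rho, _ = spearmanr(mnist_err, qmnist50_err)
print(f"Spearman rank correlation: {rho:.2f}")   # 1.00: identical ordering
```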
Conclusion
We have recreated a close approximation of the MNIST preprocessing chain. Not only did we trace each MNIST digit to its NIST source image and associated metadata, but we also recreated the original MNIST test set, including the 50,000 samples that were never distributed. These fresh testing samples allow us to investigate how results reported on a standard testing set suffer from repeated experimentation. Our results confirm the trends observed by @recht2018cifar [@pmlr-v97-recht19a], albeit on a different dataset and in a substantially more controlled setup. All these results essentially show that the "testing set rot" problem exists but is far less severe than feared. Although the repeated usage of the same testing set impacts absolute performance numbers, it also delivers pairing advantages that help model selection in the long run. In practice, this suggests that a shifting data distribution is far more dangerous than overusing an adequately distributed testing set.
Acknowledgments {#acknowledgments .unnumbered}
We thank Chris Burges, Corinna Cortes, and Yann LeCun for the precious information they were able to share with us about the birth of MNIST. We thank Larry Jackel for instigating the whole MNIST project and for commenting on this "cold case". We thank Maithra Raghu for pointing out how QMNIST could be used to corroborate the results of @pmlr-v97-recht19a. We thank Ben Recht, Ludwig Schmidt and Roman Werpachowski for their constructive comments.
Supplementary Material
This section provides additional tables and plots.
::: {#tbl:KNN3}
| Test on         | MNIST                   | QMNIST10K               | QMNIST50K               |
|-----------------|-------------------------|-------------------------|-------------------------|
| Train on MNIST  | $2.95\%$ ($\pm 0.34\%$) | $2.94\%$ ($\pm 0.34\%$) | $3.19\%$ ($\pm 0.16\%$) |
| Train on QMNIST | $2.94\%$ ($\pm 0.34\%$) | $2.95\%$ ($\pm 0.34\%$) | $3.19\%$ ($\pm 0.16\%$) |

: [[tbl:KNN3]]{#tbl:KNN3 label="tbl:KNN3"} Misclassification rates of the best KNN model, obtained with $k=3$. Models are trained on the MNIST and QMNIST training sets and tested on the MNIST test set and on the two QMNIST test sets of 10,000 and 50,000 samples.
:::
::: {#tb2:svm}
| Test on         | MNIST                   | QMNIST10K               | QMNIST50K               |
|-----------------|-------------------------|-------------------------|-------------------------|
| Train on MNIST  | $1.47\%$ ($\pm 0.24\%$) | $1.47\%$ ($\pm 0.24\%$) | $1.8\%$ ($\pm 0.12\%$)  |
| Train on QMNIST | $1.47\%$ ($\pm 0.24\%$) | $1.48\%$ ($\pm 0.24\%$) | $1.8\%$ ($\pm 0.12\%$)  |

: [[tb2:svm]]{#tb2:svm label="tb2:svm"} Misclassification rates of an SVM with hyperparameters $C=10$ and $g=0.02$. Training and testing schemes are similar to Table 4{reference-type="ref" reference="tbl:KNN3"}.
:::
::: {#tb3:lenet5v2}
| Test on         | MNIST                   | QMNIST10K               | QMNIST50K               |
|-----------------|-------------------------|-------------------------|-------------------------|
| Train on MNIST  | $1.61\%$ ($\pm 0.25\%$) | $1.61\%$ ($\pm 0.25\%$) | $2.02\%$ ($\pm 0.13\%$) |
| Train on QMNIST | $1.63\%$ ($\pm 0.25\%$) | $1.63\%$ ($\pm 0.25\%$) | $2\%$ ($\pm 0.13\%$)    |

: [[tb3:lenet5v2]]{#tb3:lenet5v2 label="tb3:lenet5v2"} Misclassification rates of an MLP with an 800-unit hidden layer. Training and testing schemes are similar to Table 4{reference-type="ref" reference="tbl:KNN3"}.
:::
::: {#tb4:vgg}
| Test on         | MNIST                   | QMNIST10K               | QMNIST50K               |
|-----------------|-------------------------|-------------------------|-------------------------|
| Train on MNIST  | $0.37\%$ ($\pm 0.12\%$) | $0.37\%$ ($\pm 0.12\%$) | $0.53\%$ ($\pm 0.06\%$) |
| Train on QMNIST | $0.39\%$ ($\pm 0.12\%$) | $0.39\%$ ($\pm 0.12\%$) | $0.53\%$ ($\pm 0.06\%$) |

: [[tb4:vgg]]{#tb4:vgg label="tb4:vgg"} Misclassification rates of a VGG-11 model. Training and testing schemes are similar to Table 4{reference-type="ref" reference="tbl:KNN3"}.
:::
::: {#tb5:resnet}
| Test on         | MNIST                   | QMNIST10K               | QMNIST50K               |
|-----------------|-------------------------|-------------------------|-------------------------|
| Train on MNIST  | $0.41\%$ ($\pm 0.13\%$) | $0.42\%$ ($\pm 0.13\%$) | $0.51\%$ ($\pm 0.06\%$) |
| Train on QMNIST | $0.43\%$ ($\pm 0.13\%$) | $0.43\%$ ($\pm 0.13\%$) | $0.50\%$ ($\pm 0.06\%$) |

: [[tb5:resnet]]{#tb5:resnet label="tb5:resnet"} Misclassification rates of a ResNet-18 model. Training and testing schemes are similar to Table 4{reference-type="ref" reference="tbl:KNN3"}.
:::
::: {#tb6:kaggle}
| Model  | MNIST                   | QMNIST10K               | QMNIST50K               |
|--------|-------------------------|-------------------------|-------------------------|
| TFKR-a | $0.24\%$ ($\pm 0.10\%$) | $0.24\%$ ($\pm 0.10\%$) | $0.32\%$ ($\pm 0.05\%$) |
| TFKR-b | $0.86\%$ ($\pm 0.18\%$) | $0.86\%$ ($\pm 0.18\%$) | $0.97\%$ ($\pm 0.09\%$) |
| TFKR-c | $0.46\%$ ($\pm 0.14\%$) | $0.47\%$ ($\pm 0.14\%$) | $0.56\%$ ($\pm 0.07\%$) |
| TFKR-d | $0.58\%$ ($\pm 0.15\%$) | $0.58\%$ ($\pm 0.15\%$) | $0.69\%$ ($\pm 0.07\%$) |

: [[tb6:kaggle]]{#tb6:kaggle label="tb6:kaggle"} Misclassification rates of top TF-KR MNIST models (GitHub links in footnotes) trained on the MNIST training set and tested on the MNIST, QMNIST10K, and QMNIST50K testing sets.
:::
{#fig:allleft width=".48\linewidth"}
[^1]: Code and data are available at https://github.com/facebookresearch/qmnist. We of course intend to publish both the reconstruction code and the reconstructed dataset.
[^2]: When LB joined this effort during the summer 1994, the MNIST dataset was already ready.
[^3]: The same description also appears in [@mnist; @lecun-98h]. These more recent texts incorrectly use the names SD1 and SD3 to denote the original NIST test and training sets. An additional sentence explains that only a subset of 10,000 test images was used or made available, "5000 from SD1 and 5000 from SD3."
[^4]: See https://tinyurl.com/y5z7qtcg.
[^5]: See https://tinyurl.com/y5z7abyt
[^6]: See [@feldman-2019] for a different perspective on this issue.
[^8]: TFKR-a: https://github.com/khanrc/mnist
[^9]: TFKR-b: https://github.com/bart99/tensorflow/tree/master/mnist
[^10]: TFKR-c: https://github.com/chaeso/dnn-study
[^11]: TFKR-d: https://github.com/ByeongkiJeong/MostAccurableMNIST_keras