Abstract
We show that, for each of five datasets of increasing complexity, certain training samples are more informative of class membership than others. These samples can be identified prior to training by analyzing their positions in reduced-dimensional space relative to their classes' centroids. Specifically, we demonstrate that, for the datasets studied, samples nearer to their classes' centroids are less informative than those farther from them. For all five datasets, we show that there is no statistically significant difference between training on the entire training set and training with up to 2% of the data nearest to each class's centroid excluded.
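To make the selection criterion described above concrete, the following is a minimal sketch, not the authors' code. It assumes the training samples have already been projected into a low-dimensional space (e.g., a 3-dimensional t-SNE or UMAP reduction) and flags, per class, the fraction of samples nearest to that class's centroid as candidates for exclusion. The function name, variable names, and the 2% default are illustrative assumptions.

```python
# Hedged sketch of the centroid-distance selection described in the abstract.
# Assumes `embeddings` is an (n_samples, d) array of reduced-dimensional
# coordinates (e.g., from t-SNE or UMAP) and `labels` the class labels.
import numpy as np

def nearest_to_centroid_mask(embeddings, labels, exclude_fraction=0.02):
    """Return a boolean mask that is False for the samples closest to
    their own class centroid (the least informative candidates)."""
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        n_drop = int(len(idx) * exclude_fraction)
        if n_drop > 0:
            keep[idx[np.argsort(dists)[:n_drop]]] = False
    return keep

# Usage (illustrative): train only on the samples farther from their centroid.
# mask = nearest_to_centroid_mask(reduced_train, y_train, exclude_fraction=0.02)
# x_kept, y_kept = x_train[mask], y_train[mask]
```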
Acknowledgements
The authors would like to acknowledge the support provided by the Cognitively-inspired Agile Information and Knowledge Modelling (CALM) project, funded by the United States Air Force Office of Scientific Research.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
In Figs. 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 and 36, the values of each of the 3 dimensions of the reductions are plotted for each individual class of the MNIST and Imagenette training data. The first thing that can be learned from the MNIST plots is that the reduction for each class produces values significantly different from those of the other classes; further, the values in each dimension of each class are tightly grouped along the number line, with the notable exception of the third dimension of the class representing the digit 1. It is worth noting that the stylization of the Hindu-Arabic numeral '1' contains most of its information in two dimensions (accounting for translation), which provides an explanation for the larger variance in the third dimension. The plots of the Imagenette classes, on the other hand, are much more similar to one another and display greater variance in each of the 3 dimensions.
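Plots of this kind could be reproduced along the following lines. This is a sketch under stated assumptions, not the authors' plotting code: the variable names, the strip-plot layout, and the use of matplotlib are illustrative. It simply lays out, for one class, the values of each of the 3 reduced dimensions along the number line.

```python
# Illustrative sketch of per-class, per-dimension plots like those in the appendix.
# `embeddings` is assumed to be an (n_samples, 3) reduced-dimensional array and
# `labels` the corresponding class labels; names and layout are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def plot_class_dimensions(embeddings, labels, class_id):
    """Plot the values of each of the 3 reduced dimensions for one class."""
    points = embeddings[labels == class_id]
    fig, axes = plt.subplots(1, 3, figsize=(12, 2.5))
    for dim, ax in enumerate(axes):
        # One tick mark per sample along the number line of this dimension.
        ax.plot(points[:, dim], np.zeros(len(points)),
                linestyle="none", marker="|", alpha=0.3)
        ax.set_title(f"class {class_id}, dimension {dim + 1}")
        ax.set_yticks([])
    fig.tight_layout()
    return fig

# e.g., plot_class_dimensions(mnist_reduction, mnist_labels, class_id=1)
# would show the wider spread of the digit 1's third dimension noted above.
```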
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Byerly, A., Kalganova, T. Towards an Analytical Definition of Sufficient Data. SN COMPUT. SCI. 4, 144 (2023). https://doi.org/10.1007/s42979-022-01549-4