Abstract
We show that, for each of five datasets of increasing complexity, certain training samples are more informative of class membership than others. These samples can be identified prior to training by analyzing their positions in reduced-dimensional space relative to their classes' centroids. Specifically, we demonstrate that, for the datasets studied, samples nearer to their classes' centroids are less informative than those farther from them. For all five datasets, we show that there is no statistically significant difference between training on the entire training set and training with up to 2% of the data nearest to each class's centroid excluded.
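To make the selection criterion described above concrete, the following is a minimal sketch, not the authors' code. It assumes the training samples have already been projected into a low-dimensional space (e.g., a 3-dimensional t-SNE or UMAP reduction) and flags, per class, the fraction of samples nearest to that class's centroid as candidates for exclusion. The function name, variable names, and the 2% default are illustrative assumptions.

```python
# Hedged sketch of the centroid-distance selection described in the abstract.
# Assumes `embeddings` is an (n_samples, d) array of reduced-dimensional
# coordinates (e.g., from t-SNE or UMAP) and `labels` the class labels.
import numpy as np

def nearest_to_centroid_mask(embeddings, labels, exclude_fraction=0.02):
    """Return a boolean mask that is False for the samples closest to
    their own class centroid (the least informative candidates)."""
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        n_drop = int(len(idx) * exclude_fraction)
        if n_drop > 0:
            keep[idx[np.argsort(dists)[:n_drop]]] = False
    return keep

# Usage (illustrative): train only on the samples farther from their centroid.
# mask = nearest_to_centroid_mask(reduced_train, y_train, exclude_fraction=0.02)
# x_kept, y_kept = x_train[mask], y_train[mask]
```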
Acknowledgements
The authors would like to acknowledge the support provided by the Cognitively-inspired Agile Information and Knowledge Modelling (CALM) project, funded by the United States Air Force Office of Scientific Research.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
In Figs. 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 and 36, the values of each of the 3 dimensions of the reductions are plotted for each individual class of the MNIST and Imagenette training data. The first thing that can be learned from the MNIST plots is that the reduction for each class produces values significantly different from those of the other classes; further, the values in each dimension of each class are tightly grouped along the number line, with the notable exception of the third dimension of the class representing the digit 1. It is worth noting that the stylization of the Hindu-Arabic numeral '1' contains most of its information in two dimensions (accounting for translation), which provides an explanation for the larger variance in the third dimension. The plots of the Imagenette classes, on the other hand, are much more similar to one another and display greater variance in each of the 3 dimensions.
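Plots of this kind could be reproduced along the following lines. This is a sketch under stated assumptions, not the authors' plotting code: the variable names, the strip-plot layout, and the use of matplotlib are illustrative. It simply lays out, for one class, the values of each of the 3 reduced dimensions along the number line.

```python
# Illustrative sketch of per-class, per-dimension plots like those in the appendix.
# `embeddings` is assumed to be an (n_samples, 3) reduced-dimensional array and
# `labels` the corresponding class labels; names and layout are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def plot_class_dimensions(embeddings, labels, class_id):
    """Plot the values of each of the 3 reduced dimensions for one class."""
    points = embeddings[labels == class_id]
    fig, axes = plt.subplots(1, 3, figsize=(12, 2.5))
    for dim, ax in enumerate(axes):
        # One tick mark per sample along the number line of this dimension.
        ax.plot(points[:, dim], np.zeros(len(points)),
                linestyle="none", marker="|", alpha=0.3)
        ax.set_title(f"class {class_id}, dimension {dim + 1}")
        ax.set_yticks([])
    fig.tight_layout()
    return fig

# e.g., plot_class_dimensions(mnist_reduction, mnist_labels, class_id=1)
# would show the wider spread of the digit 1's third dimension noted above.
```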
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Byerly, A., Kalganova, T. Towards an Analytical Definition of Sufficient Data. SN COMPUT. SCI. 4, 144 (2023). https://doi.org/10.1007/s42979-022-01549-4