
Towards an Analytical Definition of Sufficient Data

  • Original Research
  • Published in SN Computer Science

Abstract

We show that, for each of five datasets of increasing complexity, certain training samples are more informative of class membership than others. These samples can be identified prior to training by analyzing their positions in a reduced-dimensional space relative to the classes’ centroids. Specifically, we demonstrate that for the datasets studied, samples nearer to their class’s centroid are less informative than those farther from it. For all five datasets studied, we show that there is no statistically significant difference between training on the entire training set and training after excluding up to 2% of the data nearest to each class’s centroid.
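The exclusion procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes PCA for the t-SNE/UMAP-style reductions used in the paper and operates on synthetic data, but the centroid-distance logic (reduce, find each class's centroid, drop the 2% of samples nearest to it) follows the described approach.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a training set: 3 classes in a 10-D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(300, 10)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 300)

# Step 1: reduce dimensionality. PCA stands in here as a simple,
# deterministic choice; the paper uses nonlinear reductions.
Z = PCA(n_components=3).fit_transform(X)

# Step 2: per class, compute distances to the class centroid in the
# reduced space and mark the nearest 2% of samples for exclusion.
frac = 0.02
keep = np.ones(len(y), dtype=bool)
for c in np.unique(y):
    idx = np.flatnonzero(y == c)
    centroid = Z[idx].mean(axis=0)
    dists = np.linalg.norm(Z[idx] - centroid, axis=1)
    n_drop = int(frac * len(idx))
    keep[idx[np.argsort(dists)[:n_drop]]] = False

X_train, y_train = X[keep], y[keep]
print(len(y), "->", keep.sum())  # 900 -> 882
```

The claim being tested is then that a model trained on `X_train, y_train` performs indistinguishably from one trained on the full set.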





Acknowledgements

The authors would like to acknowledge the support provided by the Cognitively-inspired Agile Information and Knowledge Modelling (CALM) project, funded by the United States Air Force Office of Scientific Research.

Author information


Corresponding author

Correspondence to Adam Byerly.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In Figs. 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 and 36, the values of each of the 3 dimensions of the reductions are plotted for each individual class of the training data for MNIST and Imagenette. The first thing that can be learned from the MNIST plots is that the reduction for each class produces values significantly different from those of the other classes; further, the values in each dimension of each class are tightly grouped on the number line, with the noticeable exception of the third dimension of the class that represents the digit 1. It is worth noting that the stylization of the Hindu-Arabic numeral ‘1’ contains most of its information in 2 dimensions (accounting for translation), providing an explanation for the larger variance in the third dimension. The plots of the Imagenette classes, on the other hand, are much more similar to one another and display greater variance in each of the 3 dimensions.
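The per-class, per-dimension spread the plots illustrate can be summarized numerically. The sketch below is a hypothetical illustration on synthetic data (with PCA standing in for the paper's reductions): it reduces a dataset to 3 dimensions and reports, for each class, the standard deviation along each reduced dimension, which is the quantity whose class-to-class differences the appendix discusses.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a training set: 2 classes in an 8-D feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(4.0, 1.0, (200, 8))])
y = np.repeat([0, 1], 200)

# Reduce to 3 dimensions, as in the appendix plots.
Z = PCA(n_components=3).fit_transform(X)

# Per-class spread along each reduced dimension: a small std means the
# class is tightly grouped on the number line in that dimension.
for c in np.unique(y):
    spread = Z[y == c].std(axis=0)
    print(f"class {c}: per-dimension std = {np.round(spread, 2)}")
```

A class whose values are tightly grouped in a dimension (as for most MNIST classes) yields a small standard deviation there; a dimension with large variance (as for the digit 1's third dimension) stands out immediately in this summary.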

Fig. 17 MNIST Reduction Distributions—Class ‘0’

Fig. 18 MNIST Reduction Distributions—Class ‘1’

Fig. 19 MNIST Reduction Distributions—Class ‘2’

Fig. 20 MNIST Reduction Distributions—Class ‘3’

Fig. 21 MNIST Reduction Distributions—Class ‘4’

Fig. 22 MNIST Reduction Distributions—Class ‘5’

Fig. 23 MNIST Reduction Distributions—Class ‘6’

Fig. 24 MNIST Reduction Distributions—Class ‘7’

Fig. 25 MNIST Reduction Distributions—Class ‘8’

Fig. 26 MNIST Reduction Distributions—Class ‘9’

Fig. 27 Imagenette Reduction Distributions—Class ‘Tench’

Fig. 28 Imagenette Reduction Distributions—Class ‘English Springer’

Fig. 29 Imagenette Reduction Distributions—Class ‘Cassette Player’

Fig. 30 Imagenette Reduction Distributions—Class ‘Chain Saw’

Fig. 31 Imagenette Reduction Distributions—Class ‘Church’

Fig. 32 Imagenette Reduction Distributions—Class ‘French Horn’

Fig. 33 Imagenette Reduction Distributions—Class ‘Garbage Truck’

Fig. 34 Imagenette Reduction Distributions—Class ‘Gas Pump’

Fig. 35 Imagenette Reduction Distributions—Class ‘Golf Ball’

Fig. 36 Imagenette Reduction Distributions—Class ‘Parachute’

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Byerly, A., Kalganova, T. Towards an Analytical Definition of Sufficient Data. SN COMPUT. SCI. 4, 144 (2023). https://doi.org/10.1007/s42979-022-01549-4
