
Self-paced ensemble and big data identification: a classification of substantial imbalance computational analysis

The Journal of Supercomputing

Abstract

This paper addresses the challenges of learning classifiers from the large-scale, highly imbalanced datasets that are prevalent in many real-world applications. Traditional learning algorithms often suffer from poor performance and low computational efficiency when dealing with imbalanced data. Factors such as class imbalance, noise, and class overlap make it difficult to learn effective classifiers. In this study, we propose a novel self-paced ensemble framework for classifying imbalanced data. The framework uses under-sampling to self-harmonize data hardness and build a robust ensemble. Extensive experiments show promising results in handling overlapping classes and skewed distributions while maintaining computational efficiency. The self-paced ensemble method addresses the challenges of high imbalance ratios, class overlap, and noise in large-scale imbalanced classification problems. By incorporating knowledge of these challenges into our learning framework, we establish the concept of a classification hardness distribution. We present the self-paced ensemble as a new learning paradigm for massive imbalanced classification, one that can improve the performance of existing learning algorithms on imbalanced data and provide better results in future applications.
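The abstract describes the framework only at a high level, so the following is a minimal sketch of how hardness-guided under-sampling could feed an ensemble, not the authors' implementation. Everything concrete here is an assumption made for illustration: the nearest-centroid base learner, hardness defined as the current ensemble's error on each majority sample, the binning of majority samples by hardness, the bin weight 1/(mean hardness + alpha), the tangent schedule for alpha, and all function names.

```python
import math
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroids(X, y):
    """Toy base learner (an assumption): one centroid per class."""
    c0 = centroid([x for x, t in zip(X, y) if t == 0])
    c1 = centroid([x for x, t in zip(X, y) if t == 1])
    return c0, c1

def predict_proba(model, x):
    """Minority-class score in [0, 1] from relative centroid distances."""
    c0, c1 = model
    d0, d1 = math.dist(x, c0), math.dist(x, c1)
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

def ensemble_proba(models, x):
    """Average the minority-class scores of all ensemble members."""
    return sum(predict_proba(m, x) for m in models) / len(models)

def self_paced_undersample(majority, hardness, n_keep, alpha, n_bins, rng):
    """Select roughly n_keep majority samples, bin by bin.

    Each non-empty hardness bin gets weight 1 / (mean hardness + alpha),
    so a larger alpha spreads the selection more evenly across bins.
    """
    lo, hi = min(hardness), max(hardness)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all samples equally hard: everything lands in bin 0
    bins = [[] for _ in range(n_bins)]
    for idx, h in enumerate(hardness):
        bins[min(int((h - lo) / width), n_bins - 1)].append(idx)
    filled = [b for b in bins if b]
    weights = [1.0 / (sum(hardness[i] for i in b) / len(b) + alpha) for b in filled]
    total = sum(weights)
    chosen = []
    for b, w in zip(filled, weights):
        k = min(len(b), max(1, round(n_keep * w / total)))
        chosen.extend(rng.sample(b, k))
    return [majority[i] for i in chosen]

def fit_self_paced_ensemble(X, y, n_estimators=5, n_bins=5, seed=0):
    """Train n_estimators base learners on hardness-guided balanced subsets."""
    rng = random.Random(seed)
    majority = [x for x, t in zip(X, y) if t == 0]
    minority = [x for x, t in zip(X, y) if t == 1]
    n_keep = len(minority)

    def train_on(subset):
        labels = [0] * len(subset) + [1] * len(minority)
        return fit_centroids(subset + minority, labels)

    # First member: plain random under-sampling to a balanced set.
    models = [train_on(rng.sample(majority, n_keep))]
    for i in range(1, n_estimators):
        # Hardness of a majority (label 0) sample = current ensemble score.
        hardness = [ensemble_proba(models, x) for x in majority]
        # Assumed schedule: alpha grows from near 0 toward large values.
        alpha = math.tan(math.pi / 2 * i / n_estimators)
        subset = self_paced_undersample(majority, hardness, n_keep, alpha, n_bins, rng)
        models.append(train_on(subset))
    return models

# Synthetic 20:1 imbalanced data: majority near (0, 0), minority near (3, 3).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(25)]
y = [0] * 500 + [1] * 25
models = fit_self_paced_ensemble(X, y)
print(ensemble_proba(models, [3.0, 3.0]), ensemble_proba(models, [0.0, 0.0]))
```

In this sketch each member trains on all minority samples plus a majority subset of about the same size, so training cost per member depends on the minority size rather than the full dataset. The hardness estimate is refreshed from the current ensemble before each new member is added.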



Author information

Contributions

SB: Conceptualization, Methodology, Software, Writing - Original Draft. WZ: Conceptualization, Methodology, Software. BQ: Conceptualization, Methodology. MMK: Software. MS: Conceptualization, Methodology. GA: Validation, Resources, Writing - Review, Editing. NA: Supervision, Project administration. MMK, SB, BQ: Validation, Resources, Writing - Review, Editing. WZ: Supervision, Project administration, Funding acquisition.

Corresponding author

Correspondence to Weimei Zhi.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bano, S., Zhi, W., Qiu, B. et al. Self-paced ensemble and big data identification: a classification of substantial imbalance computational analysis. J Supercomput 80, 9848–9869 (2024). https://doi.org/10.1007/s11227-023-05828-6

  • DOI: https://doi.org/10.1007/s11227-023-05828-6
