Abstract
Datasets with imbalanced class distributions are common in many real-world applications. A great number of approaches have been proposed to address the class-imbalance challenge, but most of them perform poorly on datasets characterized by high class imbalance, class overlap, and low data quality. In this study, we propose an effective meta-framework for highly imbalanced and overlapped classification, called DAPS (DynAmic self-Paced sampling enSemble), which (1) uses a principled and effective sampling strategy to maximize the utilization of informative instances and avoid serious information loss, and (2) assigns appropriate instance weights to mitigate the impact of noisy data. Furthermore, most canonical classifiers (e.g., Decision Tree, Random Forest) can be integrated into DAPS. Comprehensive experiments on synthetic datasets and three real-world datasets show that DAPS obtains considerable improvements in F1-score over a broad range of published models.
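To make the two ingredients concrete, the following is a minimal sketch of a dynamic self-paced sampling ensemble. It is an illustration under our own assumptions, not the authors' implementation (their code is linked in the Notes below): the hardness estimate, the easy-to-hard sampling schedule, and all names (self_paced_ensemble, hardness, alpha) are illustrative.

# A minimal, illustrative sketch of a dynamic self-paced sampling ensemble.
# NOT the authors' implementation (see https://github.com/ZhouF-ECNU/DAPS);
# the hardness definition and easy-to-hard schedule here are assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def self_paced_ensemble(X, y, base=None, n_estimators=10, seed=0):
    """Train an ensemble on balanced subsets drawn with self-paced weights.

    Assumes y is binary with 1 = minority class and X is a 2-D numpy array.
    """
    base = base if base is not None else DecisionTreeClassifier()
    rng = np.random.default_rng(seed)
    maj, mino = X[y == 0], X[y == 1]
    hardness = np.zeros(len(maj))  # ensemble's difficulty score per majority instance
    models = []
    for i in range(n_estimators):
        if models:
            # Hardness = average predicted minority probability so far;
            # majority instances near the class overlap get high scores.
            hardness = np.mean(
                [m.predict_proba(maj)[:, 1] for m in models], axis=0)
        # Self-paced schedule: start by sampling easy majority instances,
        # then gradually shift weight toward hard (more informative) ones.
        alpha = i / max(n_estimators - 1, 1)
        w = (1 - alpha) * (1 - hardness) + alpha * hardness + 1e-6
        idx = rng.choice(len(maj), size=len(mino), replace=False,
                         p=w / w.sum())
        X_bal = np.vstack([maj[idx], mino])
        y_bal = np.hstack([np.zeros(len(mino)), np.ones(len(mino))])
        models.append(clone(base).fit(X_bal, y_bal))
    return models

def predict(models, X, threshold=0.5):
    """Average member probabilities; 0.5 threshold as noted in the Notes."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= threshold).astype(int)

Any scikit-learn classifier exposing predict_proba (e.g., RandomForestClassifier) could be passed as base, mirroring the claim that canonical classifiers plug into the framework.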
Notes
The code is available at https://github.com/ZhouF-ECNU/DAPS.
Due to space limitations, we report only precision and recall results on the real-world datasets. AUPRC (the area under the precision-recall curve) does not properly reflect the performance of our model, as DAPS uses a fixed threshold of 0.5 to optimize predictions.
Additional information
Responsible editor: Albrecht Zimmermann and Peggy Cellier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research was supported in part by NSFC Grant 61902127 and by Natural Science Foundation of Shanghai Grant 19ZR1415700.
Cite this article
Zhou, F., Gao, S., Ni, L. et al. Dynamic self-paced sampling ensemble for highly imbalanced and class-overlapped data classification. Data Min Knowl Disc 36, 1601–1622 (2022). https://doi.org/10.1007/s10618-022-00838-z