Abstract
SMOTE has long been favored by researchers for improving imbalanced classification. Nevertheless, within-class imbalance among minority samples and noise generation are two main challenges for SMOTE. Recently, clustering-based oversampling methods have been developed to improve SMOTE by eliminating within-class imbalance and/or avoiding noise generation. Yet, they still suffer from the following drawbacks: a) some create more synthetic minority samples in large or high-density regions; b) most fail to remove noise from the training set; c) most rely heavily on more than one parameter; d) most cannot handle non-spherical data; e) almost all of the adopted clustering methods are poorly suited to class-imbalanced data. To overcome these issues, this paper proposes a novel oversampling approach based on local density peaks clustering (OALDPC). First, a novel local density peaks clustering (LDPC) is proposed to partition the class-imbalanced training set into separate sub-clusters with different sizes and densities. Second, a novel LDPC-based noise filter is proposed to identify and remove suspicious noise from the class-imbalanced training set. Third, a novel sampling weight is proposed, calculated by weighing the sample number and density of each minority-class sub-cluster. Fourth, a novel interpolation method based on the sampling weight and LDPC is proposed to create more synthetic minority-class samples in sparser minority-class regions. Extensive experiments show that OALDPC outperforms 8 state-of-the-art oversampling techniques in improving the F-measure and G-mean of Random Forest, Neural Network and XGBoost on synthetic data and extensive real benchmark data sets from industrial applications.
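The core idea behind the third and fourth steps, that sparser minority sub-clusters should receive more synthetic samples, can be illustrated with a minimal sketch. This is not the paper's exact formulation: the inverse-density weight, the spread-based density proxy, and the SMOTE-style segment interpolation below are simplifying assumptions chosen for clarity.

```python
import numpy as np

def sampling_weights(clusters):
    """Weight each minority sub-cluster by size and sparsity.

    `clusters` is a list of (n_samples, n_features) arrays, one per
    minority-class sub-cluster. Sparser sub-clusters get larger weights,
    so more synthetic samples are created there. The inverse-density
    form used here is an assumption, not the paper's exact weight.
    """
    weights = []
    for c in clusters:
        # Crude density proxy: samples per unit of mean spread around the centroid.
        spread = np.mean(np.linalg.norm(c - c.mean(axis=0), axis=1)) + 1e-12
        density = len(c) / spread
        weights.append(1.0 / density)
    w = np.array(weights)
    return w / w.sum()

def oversample(clusters, n_new, rng=None):
    """SMOTE-style interpolation within each sub-cluster, allocated by weight.

    Interpolating only between members of the same sub-cluster avoids the
    noise generation that plain SMOTE suffers when it bridges distinct regions.
    """
    rng = np.random.default_rng(rng)
    w = sampling_weights(clusters)
    synthetic = []
    for c, frac in zip(clusters, w):
        k = int(round(frac * n_new))  # this cluster's share of new samples
        for _ in range(k):
            a = c[rng.integers(len(c))]
            b = c[rng.integers(len(c))]
            synthetic.append(a + rng.random() * (b - a))  # point on segment a-b
    return np.array(synthetic)
```

Given a dense sub-cluster and a sparse one, `sampling_weights` assigns the sparse one the larger share, so `oversample` places most of the synthetic points in the sparse region.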
Data availability
The datasets and third-party libraries used in the experiments are open source and accessible online (http://archive.ics.uci.edu/ml/datasets.php).
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 62006029, the Postdoctoral Innovative Talent Support Program of Chongqing under Grant CQBX2021024, the Natural Science Foundation of Chongqing under Grant CSTB2022NSCQMSX0258, and the Chongqing Municipal Education Commission (China) under Grant KJQN202001434.
Author information
Authors and Affiliations
Contributions
Junnan Li: Software, Conceptualization, Methodology, Formal analysis.
Qingsheng Zhu: Supervision.
Corresponding author
Ethics declarations
Ethical and informed consent for data used
The authors declare that they have informed consent to publish and for the data used. This research did not involve human participants or animals.
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, J., Zhu, Q. OALDPC: oversampling approach based on local density peaks clustering for imbalanced classification. Appl Intell 53, 30987–31017 (2023). https://doi.org/10.1007/s10489-023-05030-4