Abstract
Imbalanced data classification is a prevalent problem in machine learning and data mining. However, conventional methods concentrate primarily on between-class imbalance and ignore noise, class overlap, and within-class imbalance. To address these issues, this study proposes a new adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy (ASCFNO). First, the method develops a new clustering algorithm, DPINF (Density Peak Clustering with Improved Noise Filter), which identifies minority class sub-clusters of various sizes and densities while simultaneously filtering out noisy instances. This handles the noise problem and prepares the data for the subsequent steps, which address between-class and within-class imbalance. Second, an adaptive strategy determines the over-sampling size for each minority class sub-cluster by assigning each sub-cluster a weight based on several factors, which addresses within-class imbalance. Finally, new synthetic minority instances are generated between two instances that lie in the same sub-cluster and are selected according to their probability distribution, which avoids the noisy or overlapping synthetic instances that the traditional SMOTE method can generate. The performance of the proposed ASCFNO was assessed on 32 benchmark imbalanced datasets. The experimental results demonstrate the effectiveness and feasibility of these improvements.
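The second and third steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sub-cluster labels are assumed to be given already (DPINF itself is not reproduced here), the label -1 is a hypothetical convention for instances filtered out as noise, the weight of each sub-cluster is assumed to be inversely proportional to its size, and the pair of instances is drawn uniformly, whereas the paper uses its own weighting factors and selection probabilities. The function names `allocate_sizes` and `oversample_within_clusters` are illustrative only.

```python
import numpy as np


def allocate_sizes(labels, total_new):
    """Split total_new synthetic samples across minority sub-clusters.

    Assumption: the weight of a sub-cluster is inversely proportional to
    its size, so small sub-clusters receive more synthetic samples. The
    weighting factors used in the paper are not given in the abstract.
    """
    clusters, counts = np.unique(labels, return_counts=True)
    inverse = 1.0 / counts
    weights = inverse / inverse.sum()
    sizes = np.floor(weights * total_new).astype(int)
    # Flooring can leave a few samples unassigned; give them to the
    # sub-clusters with the largest weights.
    remainder = int(total_new - sizes.sum())
    order = np.argsort(-weights)
    sizes[order[:remainder]] += 1
    return dict(zip(clusters.tolist(), sizes.tolist()))


def oversample_within_clusters(X_min, labels, total_new, seed=None):
    """Generate synthetic minority instances inside each sub-cluster.

    labels == -1 marks instances filtered out as noise (a hypothetical
    convention for this sketch); they never take part in interpolation.
    """
    rng = np.random.default_rng(seed)
    keep = labels != -1
    X_clean, labels_clean = X_min[keep], labels[keep]
    sizes = allocate_sizes(labels_clean, total_new)

    synthetic = []
    for cluster, n_new in sizes.items():
        members = X_clean[labels_clean == cluster]
        if len(members) < 2:
            # A singleton sub-cluster has no partner to interpolate with,
            # so its share of samples is skipped in this sketch.
            continue
        for _ in range(n_new):
            # Assumption: both instances are drawn uniformly at random.
            i, j = rng.choice(len(members), size=2, replace=False)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(members[i] + gap * (members[j] - members[i]))
    return np.array(synthetic).reshape(-1, X_min.shape[1])
```

Because both end points of every interpolation belong to the same sub-cluster, a synthetic instance cannot fall in the gap between two distant sub-clusters, which is the failure mode of plain SMOTE that the abstract refers to.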
Data availability
The datasets generated or analyzed during this study are available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets) and KEEL-dataset Repository (http://dx.doi.org/10.1016/j.suscom.2022.100665).
Author information
Contributions
Wei Chen conceived of the present idea and performed experiments. Wenjie Guo contributed to data curation and analysis. Weijie Mao supervised the project and directed the research. All authors contributed to the writing and editing of the manuscript.
Ethics declarations
Ethical and informed consent for data used
Not applicable.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Chen, W., Guo, W. & Mao, W. An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy. Appl Intell 54, 11430–11449 (2024). https://doi.org/10.1007/s10489-024-05754-x
DOI: https://doi.org/10.1007/s10489-024-05754-x