An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy


Abstract

Imbalanced data classification is a prevalent problem in machine learning and data mining. However, conventional methods concentrate primarily on between-class imbalance and ignore noise, class overlap, and within-class imbalance. To address these issues, this study proposes a new adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering of noisy instances (ASCFNO). First, the method develops a new clustering algorithm, DPINF (Density Peak Clustering with Improved Noise Filter), which identifies minority-class sub-clusters of various sizes and densities and simultaneously filters out noisy instances. This handles the noise problem and prepares the data for the subsequent steps, which address between-class and within-class imbalance. Second, an adaptive strategy determines the over-sampling size for each minority-class sub-cluster by assigning each sub-cluster a weight based on several factors, which addresses within-class imbalance. Finally, each new synthetic minority instance is generated between two instances that lie in the same sub-cluster and are selected according to their probability distribution, which avoids the noisy or overlapping synthetic instances that the traditional SMOTE method can generate. The performance of the proposed ASCFNO was assessed on 32 benchmark imbalanced datasets. The experimental results demonstrate the effectiveness and feasibility of these improvements.
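The last two steps of the abstract (per-sub-cluster over-sampling sizes, then interpolation between two instances of the same sub-cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the DPINF procedure, the weighting factors, or the selection probabilities, so the code assumes that sub-cluster labels are already available after noise filtering, uses a hypothetical inverse-size weight, and picks instance pairs uniformly at random. The names `allocate_sizes` and `oversample` are illustrative only.

```python
import random
from collections import defaultdict


def allocate_sizes(cluster_sizes, total_new):
    """Split total_new synthetic instances across minority sub-clusters.

    ASSUMPTION: the weight is inversely proportional to sub-cluster size,
    so small sub-clusters receive more synthetic instances. The paper
    combines several factors that the abstract does not specify.
    """
    inverse = {c: 1.0 / n for c, n in cluster_sizes.items()}
    norm = sum(inverse.values())
    sizes = {c: int(total_new * w / norm) for c, w in inverse.items()}
    # Rounding down can leave a few instances unassigned;
    # give them to the smallest sub-clusters first.
    leftover = total_new - sum(sizes.values())
    for c in sorted(cluster_sizes, key=cluster_sizes.get)[:leftover]:
        sizes[c] += 1
    return sizes


def oversample(points, labels, total_new, seed=0):
    """Generate synthetic minority instances inside each sub-cluster.

    points: list of equal-length tuples of floats (minority instances,
            assumed to be already noise-filtered)
    labels: sub-cluster id for each point (assumed to come from clustering)
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    sizes = allocate_sizes({c: len(m) for c, m in clusters.items()}, total_new)

    synthetic = []
    for c, members in clusters.items():
        for _ in range(sizes[c]):
            if len(members) < 2:
                # A singleton sub-cluster has no partner; duplicate the point.
                synthetic.append(members[0])
                continue
            # ASSUMPTION: uniform choice of the pair; the paper selects
            # instances according to a probability distribution.
            a, b = rng.sample(members, 2)
            t = rng.random()
            # SMOTE-style linear interpolation between a and b.
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Example: two minority sub-clusters of different sizes.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0), (11.0, 11.0)]
labels = ["A", "A", "A", "A", "B", "B"]
new_points = oversample(points, labels, total_new=9)
```

Because both endpoints of every interpolation come from the same sub-cluster, each synthetic instance stays on a segment inside that sub-cluster rather than bridging two distant groups, which is the property the abstract contrasts with plain SMOTE.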

Data availability

The datasets generated or analyzed during this study are available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets) and KEEL-dataset Repository (http://dx.doi.org/10.1016/j.suscom.2022.100665).

Author information

Weijie Mao
Present address: College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, China

Authors and Affiliations

College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, China
Wei Chen & Wenjie Guo

Authors

Wei Chen
Wenjie Guo
Weijie Mao

Contributions

Wei Chen conceived of the present idea and performed experiments. Wenjie Guo contributed to data curation and analysis. Weijie Mao supervised the project and directed the research. All authors contributed to the writing and editing of the manuscript.

Corresponding author

Correspondence to Weijie Mao.

Ethics declarations

Ethical and informed consent for data used

Not applicable.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, W., Guo, W. & Mao, W. An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy. Appl Intell 54, 11430–11449 (2024). https://doi.org/10.1007/s10489-024-05754-x

Accepted: 07 August 2024
Published: 21 August 2024
Issue Date: November 2024
DOI: https://doi.org/10.1007/s10489-024-05754-x
