
An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy

Applied Intelligence

Abstract

Imbalanced data classification is a prevalent concern in machine learning and data mining. However, conventional methods concentrate primarily on between-class imbalance and ignore noise, class overlap, and within-class imbalance. To address these issues, this study proposes a new adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy (ASCFNO). First, the method develops a new clustering algorithm, DPINF (Density Peak Clustering with Improved Noise Filter), which identifies minority class sub-clusters of various sizes and densities while simultaneously filtering noisy instances; this handles the noise problem and prepares the subsequent steps to address between-class and within-class imbalance. Second, an adaptive strategy determines the over-sampling size for each minority class sub-cluster, assigning a weight to each sub-cluster based on several factors in order to address within-class imbalance. Finally, new synthetic minority instances are generated between two instances located in the same sub-cluster, selected according to their probability distribution, which avoids the noisy or overlapping synthetic instances that the traditional SMOTE method can generate. The performance of the proposed ASCFNO was assessed on 32 benchmark imbalanced datasets. The experimental results demonstrate the effectiveness and feasibility of these improvements.
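
The second and third steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the DPINF algorithm, the weighting factors, or the selection probabilities, so the sketch makes its own assumptions. Sub-cluster labels are taken as given (with -1 marking instances already filtered as noise), weights are simply inversely proportional to sub-cluster size, and the two instances are drawn uniformly at random. The function names `allocate_sizes` and `oversample_within_clusters` are hypothetical.

```python
import numpy as np


def allocate_sizes(cluster_sizes, total_new):
    """Split `total_new` synthetic instances across sub-clusters.

    Assumption: a sub-cluster's weight is inversely proportional to its
    size, so small sub-clusters receive more synthetic instances. The
    weighting factors used by ASCFNO are not given in the abstract.
    """
    inv = 1.0 / np.asarray(cluster_sizes, dtype=float)
    quotas = inv / inv.sum() * total_new
    # Largest-remainder rounding so the counts sum exactly to total_new.
    counts = np.floor(quotas).astype(int)
    leftover = total_new - counts.sum()
    order = np.argsort(counts - quotas)  # largest fractional part first
    counts[order[:leftover]] += 1
    return counts


def oversample_within_clusters(X_min, labels, total_new, seed=None):
    """Generate synthetic minority instances inside each sub-cluster.

    X_min  : (n, d) minority instances.
    labels : (n,) sub-cluster label per instance; -1 marks noise
             (an assumed convention, not taken from the paper).
    Returns the synthetic points and the sub-cluster each came from.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = [c for c in np.unique(labels) if c != -1]
    sizes = [np.count_nonzero(labels == c) for c in cluster_ids]
    counts = allocate_sizes(sizes, total_new)

    new_points, new_labels = [], []
    for c, n_new in zip(cluster_ids, counts):
        members = X_min[labels == c]
        # Assumption: both endpoints are drawn uniformly from the same
        # sub-cluster; the paper selects them by a probability distribution.
        i = rng.integers(0, len(members), size=n_new)
        j = rng.integers(0, len(members), size=n_new)
        gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
        new_points.append(members[i] + gap * (members[j] - members[i]))
        new_labels.append(np.full(n_new, c))
    return np.vstack(new_points), np.concatenate(new_labels)


# Toy data: a small sub-cluster, a larger one, and one noise instance.
X_min = np.array([[0, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11],
                  [10.5, 10.5], [10.2, 10.8],
                  [5, -5]])
labels = np.array([0, 0, 1, 1, 1, 1, 1, 1, -1])
X_new, y_cluster = oversample_within_clusters(X_min, labels, total_new=8, seed=0)
```

Because every synthetic instance is a convex combination of two instances from one sub-cluster, none is placed on a line towards the noise instance or across the gap between sub-clusters, which is the failure mode of plain SMOTE that the abstract describes.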

Data availability

The datasets generated or analyzed during this study are available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets) and KEEL-dataset Repository (http://dx.doi.org/10.1016/j.suscom.2022.100665).

Author information

Contributions

Wei Chen conceived of the present idea and performed experiments. Wenjie Guo contributed to data curation and analysis. Weijie Mao supervised the project and directed the research. All authors contributed to the writing and editing of the manuscript.

Corresponding author

Correspondence to Weijie Mao.

Ethics declarations

Ethical and informed consent for data used

Not applicable.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, W., Guo, W. & Mao, W. An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy. Appl Intell 54, 11430–11449 (2024). https://doi.org/10.1007/s10489-024-05754-x

  • DOI: https://doi.org/10.1007/s10489-024-05754-x
