Enhanced SVM-SMOTE with Cluster Consistency for Imbalanced Data Classification

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2023 (IDEAL 2023)

Abstract

Imbalanced data has gained significant attention in machine learning in recent years. When one class has significantly fewer samples than the others, models tend to perform poorly, especially in detecting minority class samples. Various resampling techniques have been proposed to address this problem, including the popular SMOTE (Synthetic Minority Over-sampling TEchnique). However, SMOTE suffers from the overlapping problem, and models trained on its output may misclassify samples near the separation boundaries. This paper presents a novel framework to optimise border-based SMOTE variants, including Borderline-SMOTE and SVM-SMOTE, which were specifically developed to solve the problem of misclassifying border samples. The proposed method ensures that generated samples improve the decision boundaries and are free from overlapping issues. It is evaluated on synthetic and real-world datasets, and the results demonstrate its effectiveness in enhancing the performance of machine learning models, particularly in classifying minority class samples.
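For readers unfamiliar with the baseline, the following is a minimal sketch of the interpolation step used by plain SMOTE, which the abstract takes as its starting point. It is not the framework proposed in this paper, nor Borderline-SMOTE or SVM-SMOTE. The function name `smote_sketch` and its parameters `n_new`, `k` and `seed` are illustrative assumptions, and the neighbour search is a brute-force one suitable only for small arrays.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    Hypothetical helper for illustration only; assumes at least two
    minority samples are given.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Brute-force pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest minority neighbours.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place each synthetic point at a random position on the joining segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Usage: oversample four minority points on the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_minority, n_new=10, k=2)
```

Because each synthetic point lies on a segment between two minority samples, nothing in this step looks at the majority class. That is one way the overlapping problem mentioned above can arise when a segment crosses a region occupied by majority samples.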

Acknowledgment

Tajul Miftahushudur would like to acknowledge the Scholarship provided by the Indonesian Endowment Fund for Education (LPDP). Halil Mertkan Sahin would like to acknowledge the Scholarship provided by the Ministry of National Education of the Republic of Türkiye.

Author information

Corresponding author

Correspondence to Tajul Miftahushudur.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Miftahushudur, T., Sahin, H.M., Grieve, B., Yin, H. (2023). Enhanced SVM-SMOTE with Cluster Consistency for Imbalanced Data Classification. In: Quaresma, P., Camacho, D., Yin, H., Gonçalves, T., Julian, V., Tallón-Ballesteros, A.J. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2023. IDEAL 2023. Lecture Notes in Computer Science, vol 14404. Springer, Cham. https://doi.org/10.1007/978-3-031-48232-8_39

  • DOI: https://doi.org/10.1007/978-3-031-48232-8_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48231-1

  • Online ISBN: 978-3-031-48232-8

  • eBook Packages: Computer Science, Computer Science (R0)
