Abstract
Imbalanced data in machine learning has gained significant attention in recent years. When one class has significantly fewer samples than the others, machine learning models tend to perform poorly, especially in detecting minority class samples. Various resampling techniques have been proposed to address this problem, including the popular SMOTE (Synthetic Minority Over-sampling TEchnique). However, SMOTE suffers from the overlapping problem, and samples near the separation boundaries may still be misclassified. This paper presents a novel framework to optimise border-based SMOTE variants, including Borderline-SMOTE and SVM-SMOTE, which were specifically developed to solve the problem of misclassifying border samples. The proposed method ensures that generated samples improve the decision boundaries and are free from overlapping issues. The method is evaluated on synthetic and real-world datasets, and the results demonstrate its effectiveness in enhancing the performance of machine learning models, particularly in classifying minority class samples.
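As an illustration of the interpolation step that basic SMOTE relies on, the following is a minimal sketch in plain Python. It is not the framework proposed in this paper, nor Borderline-SMOTE or SVM-SMOTE; the function name `smote_sketch`, its parameters, and the toy data are hypothetical and chosen only for this example. Because the interpolation looks only at minority samples, a synthetic point can fall inside a region occupied by the majority class, which is one way the overlapping problem can arise.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the segment between base and neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class in two dimensions (assumed data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Every synthetic point lies on a line segment between two existing minority samples, so it stays inside the bounding box of the minority class. Border-based variants differ mainly in which minority samples they choose as the starting points for interpolation.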
Acknowledgment
Tajul Miftahushudur would like to acknowledge the Scholarship provided by the Indonesian Endowment Fund for Education (LPDP). Halil Mertkan Sahin would like to acknowledge the Scholarship provided by the Ministry of National Education of the Republic of Türkiye.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Miftahushudur, T., Sahin, H.M., Grieve, B., Yin, H. (2023). Enhanced SVM-SMOTE with Cluster Consistency for Imbalanced Data Classification. In: Quaresma, P., Camacho, D., Yin, H., Gonçalves, T., Julian, V., Tallón-Ballesteros, A.J. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2023. IDEAL 2023. Lecture Notes in Computer Science, vol 14404. Springer, Cham. https://doi.org/10.1007/978-3-031-48232-8_39
DOI: https://doi.org/10.1007/978-3-031-48232-8_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48231-1
Online ISBN: 978-3-031-48232-8
eBook Packages: Computer Science (R0)