Abstract
In imbalanced data classification, training a classifier with synthetic samples can effectively improve performance on the minority class. However, the majority class is more likely to contain noise samples than the minority class. Majority noise degrades the data generation algorithm and makes it difficult to produce synthetic boundary data for the minority class. This paper proposes a boundary synthetic data generation algorithm, CCR-GSVM, based on Combined Cleaning and Resampling (CCR) and the Granular Support Vector Machine (GSVM-RU). CCR-GSVM combines the boundary information of the SVM and GSVM-RU to filter the noise samples of the majority class, so that CCR generates synthetic data from the more informative majority samples. The synthetic samples located on the margin boundary are retained when they improve classification performance. Comparative experiments on 12 imbalanced datasets show that the boundary data generated by CCR-GSVM helps the support vector machine improve F1-measure and G-mean in the presence of majority noise, demonstrating that CCR-GSVM is more effective at generating synthetic boundary samples.
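To make the described pipeline concrete, the following is a minimal sketch in Python, assuming scikit-learn's SVC and imbalanced-learn's SMOTE as generic stand-ins for the GSVM-RU boundary-extraction and CCR resampling components; the helper names filter_majority_noise and boundary_resample_and_evaluate, the RBF kernel, and the decision-function threshold of 1.0 are illustrative assumptions, not the paper's implementation.

# Sketch of a CCR-GSVM-style pipeline: clean majority noise with an SVM
# boundary, oversample the minority, and keep the result only if it helps.
# Assumes X_tr, y_tr, X_te, y_te are NumPy arrays with minority label 1.
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score


def filter_majority_noise(X, y, majority_label=0, C=1.0):
    """Drop majority samples that an initial SVM places deep on the
    minority side of the margin (treated here as suspected noise)."""
    svm = SVC(kernel="rbf", C=C).fit(X, y)
    scores = svm.decision_function(X)  # signed distance to the boundary
    # Majority samples scored like confident minority samples are removed.
    suspect = (y == majority_label) & (scores > 1.0)
    return X[~suspect], y[~suspect]


def boundary_resample_and_evaluate(X_tr, y_tr, X_te, y_te):
    """Clean majority noise, oversample the minority class, and keep the
    resampled training set only if it improves both F1 and G-mean."""
    base = SVC(kernel="rbf").fit(X_tr, y_tr)
    base_pred = base.predict(X_te)
    base_f1 = f1_score(y_te, base_pred)
    base_gm = geometric_mean_score(y_te, base_pred)

    X_clean, y_clean = filter_majority_noise(X_tr, y_tr)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_clean, y_clean)

    cand = SVC(kernel="rbf").fit(X_res, y_res)
    cand_pred = cand.predict(X_te)
    cand_f1 = f1_score(y_te, cand_pred)
    cand_gm = geometric_mean_score(y_te, cand_pred)

    # Accept the synthetic boundary data only when both metrics improve.
    if cand_f1 >= base_f1 and cand_gm >= base_gm:
        return cand, (cand_f1, cand_gm)
    return base, (base_f1, base_gm)

The acceptance test at the end mirrors the abstract's criterion that synthetic samples near the margin are kept only when classification performance improves; the specific threshold and metrics used for acceptance in the published algorithm may differ.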
Cite this article
Huang, K., Wang, X. CCR-GSVM: A boundary data generation algorithm for support vector machine in imbalanced majority noise problem. Appl Intell 53, 1192–1204 (2023). https://doi.org/10.1007/s10489-022-03408-4