Abstract
Traditional approaches tend to cause classifier bias on imbalanced data sets, resulting in poor classification performance for minority classes. In particular, imbalanced data are common in financial fraud, network intrusion, and fault detection, where the recognition rate of minority classes matters more than the classification performance of majority classes. Therefore, there is a pressing need for efficient algorithms to solve the class imbalance problem. To this end, this article presents a novel hybrid algorithm, Negative Binary General (NBG), which improves the performance of imbalanced classification by combining oversampling with a feature selection algorithm. A novel oversampling algorithm, Negative-positive Synthetic Minority Oversampling Technique (NPSMOTE), improves the practicability of sample generation, while the Binary Ant Lion Optimizer (BALO) algorithm extracts the most significant features to improve classification performance. Simulation experiments on seven benchmark imbalanced data sets demonstrate that the proposed NBG algorithm significantly outperforms nine other existing and six recently published algorithms in classifying imbalanced small-sample data sets.
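The abstract does not specify how NPSMOTE or BALO work internally, so the following is only a minimal sketch of the general pattern it describes: oversample the minority class, then keep a subset of features. It uses plain SMOTE-style interpolation between a minority sample and a near neighbour as a stand-in for NPSMOTE, and a fixed binary mask as a stand-in for the feature subset that BALO would search for. The function names `smote_like_oversample` and `apply_feature_mask`, the toy data, and the mask values are all assumptions made for illustration.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    This is the plain SMOTE idea, not the paper's NPSMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def apply_feature_mask(rows, mask):
    """Keep only the columns whose mask bit is 1 (a binary feature subset).
    A binary optimizer such as BALO would search over masks like this one."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


# Hypothetical toy minority class with three features.
minority = [[1.0, 2.0, 0.5], [1.2, 1.8, 0.7], [0.9, 2.2, 0.4], [1.1, 2.1, 0.6]]
synthetic = smote_like_oversample(minority, n_new=6)
balanced_minority = minority + synthetic
reduced = apply_feature_mask(balanced_minority, mask=[1, 0, 1])
```

In a wrapper-style setup, each candidate mask would be scored by training a classifier on the reduced, oversampled data and measuring minority-class performance; that scoring loop is omitted here because the abstract gives no details about it.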
Acknowledgements
This work was supported by the Plan Project for Guizhou Provincial Basic Research (No. QKH-Basic-ZK[2022] General 018) and the 2021 school-level project of Guizhou University of Finance and Economics (No. 2021KYYB13).
Cite this article
Feng, F., Li, KC., Yang, E. et al. A novel oversampling and feature selection hybrid algorithm for imbalanced data classification. Multimed Tools Appl 82, 3231–3267 (2023). https://doi.org/10.1007/s11042-022-13240-0
DOI: https://doi.org/10.1007/s11042-022-13240-0