
ESMOTE: an overproduce-and-choose synthetic examples generation strategy based on evolutionary computation

Published: 03 December 2022

Abstract

The class imbalance learning problem is an important topic that has attracted considerable attention in machine learning and data mining. The most common method of addressing imbalanced datasets is the synthetic minority oversampling technique (SMOTE). However, SMOTE and its variants suffer from noise introduced by the interpolation of synthetic examples. In this paper, an overproduce-and-choose strategy, divided into an overproduction phase and a selection phase, is proposed to generate an appropriate set of synthetic examples for imbalanced learning problems. In the overproduction phase, a new interpolation mechanism produces a large pool of synthetic examples; in the selection phase, the synthetic examples that benefit the classification task are selected through instance selection based on evolutionary computation. Experiments are conducted on a large number of datasets drawn from real-world applications. The experimental results demonstrate that the proposed method significantly outperforms SMOTE and its well-known variants on several metrics, including the G-mean (GM) and the area under the curve (AUC).
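The two phases described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual ESMOTE implementation: `overproduce` performs standard SMOTE-style interpolation between minority examples, and `choose_subset` runs a plain generational genetic algorithm over binary inclusion masks. All function names, the fitness interface, and the GA settings are assumptions made for illustration.

```python
# Minimal sketch of an overproduce-and-choose oversampling loop.
# Hypothetical names throughout; not the paper's ESMOTE implementation.
import numpy as np

rng = np.random.default_rng(0)

def overproduce(X_min, n_new, k=5):
    """Overproduction: SMOTE-style interpolation between each minority
    example and one of its k nearest minority neighbours (k < len(X_min))."""
    n = len(X_min)
    # pairwise distances among minority examples
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours
    base = rng.integers(0, n, n_new)             # random seed example
    mate = nn[base, rng.integers(0, k, n_new)]   # random neighbour of it
    lam = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[mate] - X_min[base])

def fitness(mask, X_syn, evaluate):
    """Score one candidate subset (binary inclusion mask) of synthetic examples;
    `evaluate` would typically train a classifier and return e.g. the G-mean."""
    return evaluate(X_syn[mask.astype(bool)])

def choose_subset(X_syn, evaluate, pop=20, gens=30, p_mut=0.05):
    """Selection: a plain generational GA over binary inclusion masks."""
    n = len(X_syn)
    population = rng.integers(0, 2, (pop, n))
    for _ in range(gens):
        scores = np.array([fitness(m, X_syn, evaluate) for m in population])
        elite = population[np.argsort(scores)[::-1][: pop // 2]]  # truncation selection
        # uniform crossover between two random elite parents
        pa = elite[rng.integers(0, len(elite), pop)]
        pb = elite[rng.integers(0, len(elite), pop)]
        children = np.where(rng.random((pop, n)) < 0.5, pa, pb)
        # bit-flip mutation
        flip = rng.random((pop, n)) < p_mut
        population = np.where(flip, 1 - children, children)
    scores = np.array([fitness(m, X_syn, evaluate) for m in population])
    return X_syn[population[np.argmax(scores)].astype(bool)]
```

In a real pipeline the `evaluate` callback would train the target classifier on the original data plus the candidate subset and return a validation score such as the G-mean, so the GA keeps only synthetic examples that actually help classification.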



      Published In

      Neural Computing and Applications  Volume 35, Issue 9
      Mar 2023
      792 pages
      ISSN:0941-0643
      EISSN:1433-3058

      Publisher

      Springer-Verlag

      Berlin, Heidelberg

      Publication History

      Published: 03 December 2022
      Accepted: 26 October 2022
      Received: 22 May 2022

      Author Tags

      1. Imbalanced datasets
      2. SMOTE
      3. Oversampling
      4. Instance selection
      5. Evolutionary algorithm

      Qualifiers

      • Research-article


