
research-article

Learning class-imbalanced data with region-impurity synthetic minority oversampling technique

Authors:

Kuan-Cheng Huang,

Tung-I Tsai

Volume 607, Issue C

Pages 1391 - 1407

https://doi.org/10.1016/j.ins.2022.06.067

Published: 01 August 2022

Abstract

Learning from class-imbalanced data is difficult and often causes classifiers to fail to identify the minority class. To balance the class ratio, the synthetic minority oversampling technique (SMOTE) improves minority-class classification by generating synthetic minority instances. In some scenarios, however, SMOTE and its extensions generate noisy instances and thereby degrade performance. This is because they are built on kNN (k nearest neighbors), which cannot identify the class distribution between a pair of minority instances. Furthermore, how many synthetic instances should be generated remains an open question in this field. To address these issues, we propose a new algorithm named Region-Impurity Synthetic Minority Oversampling Technique (RIOT). Specifically, given a region radius, we locate the neighbors of each minority instance, use the class ratio within the region to identify the relatively hard-to-learn minority instances, and select them as the base for sample generation. Synthetic instances are then generated until the region is approximately balanced. In experiments on twelve real-world datasets, RIOT performed better than some SMOTE extensions on several model performance indicators while using fewer synthetic instances.
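The abstract gives only an outline of the procedure, so the following is a minimal sketch of the general idea rather than the authors' RIOT implementation. It assumes Euclidean distance, treats a region as hard to learn when majority instances outnumber minority instances inside it, skips minority instances that have no minority neighbor within the radius, and creates new points by SMOTE-style linear interpolation. The function name `region_oversample` and all of its parameters are hypothetical.

```python
import math
import random


def region_oversample(X, y, radius, minority=1, seed=0):
    """Illustrative region-based oversampling (NOT the published RIOT code).

    X is a list of feature vectors, y the matching list of class labels.
    Returns a list of synthetic minority-class vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Region: every other instance within `radius` of this minority instance.
        region = [j for j, xj in enumerate(X)
                  if j != i and math.dist(xi, xj) <= radius]
        min_nbrs = [j for j in region if y[j] == minority]
        n_maj = len(region) - len(min_nbrs)
        n_min = len(min_nbrs) + 1  # the base instance counts as minority
        # Assumed rule: oversample only where the majority class dominates.
        deficit = n_maj - n_min
        if deficit <= 0 or not min_nbrs:
            continue
        # Add points until the region's class counts are equal.
        for _ in range(deficit):
            xj = X[rng.choice(min_nbrs)]
            lam = rng.random()  # interpolation weight in [0, 1)
            synthetic.append([a + lam * (b - a) for a, b in zip(xi, xj)])
    return synthetic


# Two minority points surrounded by four nearby majority points.
X = [[0, 0], [1, 0], [0.5, 0.5], [0.2, 0.1], [0.8, 0.2], [0.4, 0.9], [10, 10]]
y = [1, 1, 0, 0, 0, 0, 0]
new_points = region_oversample(X, y, radius=1.5)
```

In this sketch the number of synthetic instances is driven by the local class counts inside each region instead of a global oversampling ratio, which is one way to read "until the region is approximately balanced". How the paper actually measures region impurity and chooses the radius is not stated in the abstract.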


Cited By

Wang N, Zhang Z, Luo X (2024) Iterative minority oversampling and its ensemble for ordinal imbalanced datasets. Engineering Applications of Artificial Intelligence 127:PA. DOI: 10.1016/j.engappai.2023.107211. Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1016/j.engappai.2023.107211
Zhang Y, Dai Y, Li J (2024) Incremental and sequence learning algorithms for weighted regularized extreme learning machines. Applied Intelligence 54:7 (5859-5878). DOI: 10.1007/s10489-024-05470-6. Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1007/s10489-024-05470-6
Zhang J, Li Y, Zhang B, Wang X, Gong H (2023) A new oversampling approach based differential evolution on the safe set for highly imbalanced datasets. Expert Systems with Applications: An International Journal 234:C. DOI: 10.1016/j.eswa.2023.121039. Online publication date: 30-Dec-2023
https://dl.acm.org/doi/10.1016/j.eswa.2023.121039

Index Terms

Learning class-imbalanced data with region-impurity synthetic minority oversampling technique
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.



Information & Contributors

Information

Published In


Information Sciences: an International Journal Volume 607, Issue C

Aug 2022

1637 pages

ISSN:0020-0255


Elsevier Inc.

Publisher

Elsevier Science Inc.

United States



