research-article

Learning class-imbalanced data with region-impurity synthetic minority oversampling technique

Published: 01 August 2022

Abstract

Learning from class-imbalanced data is difficult and often leads classifiers to fail to identify the minority class. To balance the class ratio, the synthetic minority oversampling technique (SMOTE) improves minority-class classification by generating synthetic minority instances. However, in some scenarios, SMOTE and its extensions generate noisy instances and thus degrade performance. This is because they are built on kNN (k nearest neighbors), which cannot identify the class distribution between a pair of minority instances. Furthermore, how many synthetic instances should be generated remains an open question in this field. To address these issues, we propose a new algorithm named Region-Impurity Synthetic Minority Oversampling Technique (RIOT). Specifically, given a region radius, we locate the neighbors of each minority instance, use the class ratio within the region to identify the relatively hard-to-learn minority instances, and select these as the base for sample generation. Synthetic instances are then generated until the region is approximately balanced. In experiments on twelve real-world datasets, RIOT performed better than some SMOTE extensions on several model performance indicators while generating fewer synthetic instances.
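
The region-based idea in the abstract can be illustrated with a small sketch. This is not the published RIOT procedure: the abstract does not give the exact impurity criterion or generation rule, so the code below assumes that a minority instance is "hard to learn" when majority instances outnumber minority instances inside its region, and it borrows SMOTE-style linear interpolation to create new points. The function name `region_oversample`, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def region_oversample(X, y, radius, minority=1, seed=0):
    """Illustrative region-based oversampling (NOT the authors' RIOT algorithm).

    X is a list of feature vectors, y the matching list of labels.
    Returns a list of synthetic minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Neighbours of xi inside the region of the given radius (xi excluded).
        region = [j for j, xj in enumerate(X)
                  if j != i and math.dist(xi, xj) <= radius]
        minority_nb = [j for j in region if y[j] == minority]
        n_majority = len(region) - len(minority_nb)
        # Minority count in the region includes xi itself.
        deficit = n_majority - (len(minority_nb) + 1)
        # Assumed rule: only regions where the majority dominates are
        # oversampled, and a minority neighbour is needed to interpolate with.
        if deficit <= 0 or not minority_nb:
            continue
        for _ in range(deficit):
            xj = X[rng.choice(minority_nb)]
            t = rng.random()
            # SMOTE-style interpolation on the segment between xi and xj.
            synthetic.append([a + t * (b - a) for a, b in zip(xi, xj)])
    return synthetic


# Toy data: two minority points (label 1) surrounded by majority points.
X = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1], [0.0, 0.2], [0.1, 0.0], [5.0, 5.0]]
y = [1, 1, 0, 0, 0, 0]
new_points = region_oversample(X, y, radius=0.5)
```

In this toy data each of the two minority points has three majority neighbours and one minority neighbour within the radius, so one synthetic point is generated per region, on the segment joining the two minority points. Because each region is balanced independently, overlapping regions can overshoot; a fuller implementation would need to account for points that earlier regions have already generated.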


Cited By

  • (2024) Iterative minority oversampling and its ensemble for ordinal imbalanced datasets. Engineering Applications of Artificial Intelligence 127:PA. DOI: 10.1016/j.engappai.2023.107211. Online publication date: 1-Feb-2024
  • (2024) Incremental and sequence learning algorithms for weighted regularized extreme learning machines. Applied Intelligence 54:7 (5859-5878). DOI: 10.1007/s10489-024-05470-6. Online publication date: 1-Apr-2024
  • (2023) A new oversampling approach based differential evolution on the safe set for highly imbalanced datasets. Expert Systems with Applications: An International Journal 234:C. DOI: 10.1016/j.eswa.2023.121039. Online publication date: 30-Dec-2023


Published In

Information Sciences: an International Journal, Volume 607, Issue C
Aug 2022
1637 pages

Publisher

Elsevier Science Inc.
United States

Author Tags

1. Class-imbalanced
2. SMOTE
3. Region-impurity
4. Overlapped data
