A MeanShift-guided oversampling with self-adaptive sizes for imbalanced data classification

Published: 18 July 2024

Highlights

The method can address within-class imbalance, small sample sizes, and small disjuncts.
The oversampling simultaneously considers the majority and minority class distributions.
The introduced randomness and cut-off threshold can avoid overlapping and overfitting.
The assigned oversampling sizes are inversely proportional to density and distance, as sketched below.
The oversampling size assignment strategy can enhance minority boundary information.
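The last two highlights state the size-assignment rule only in words. As a rough illustration (a minimal sketch, assuming the local density and the distance to the majority class are already available as per-instance arrays; this is not the authors' exact formula), the inverse-proportional weighting could be written as:

```python
import numpy as np

def assign_sizes(density, dist_to_majority, n_new):
    """Illustrative sketch: per-instance oversampling counts whose weights are
    inversely proportional to local density and to the distance from the
    majority class, so sparse and borderline minority instances receive more
    synthetic samples. The paper's exact weighting may differ."""
    w = 1.0 / (np.asarray(density) * np.asarray(dist_to_majority) + 1e-12)
    w /= w.sum()                            # normalize weights to a distribution
    return np.floor(n_new * w).astype(int)  # flooring may drop a few samples
```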

Abstract

Imbalanced data classification has gained popularity in the machine learning research community due to its prevalence in numerous applications and its inherent difficulty. However, most contemporary work focuses primarily on addressing between-class imbalance. Previous research has shown that, when combined with other factors such as within-class imbalance, small sample size, and the presence of small disjuncts, imbalanced data significantly increases the difficulty for traditional classifiers to learn. We therefore propose a novel MeanShift-guided oversampling method with self-adaptive sizes for imbalanced data classification. The proposed MeanShift-guided oversampling technique simultaneously considers the distributions of the minority and majority classes within a sphere centered on the current minority instance, which helps address the small-sample-size problem and avoid the overlapping issues often caused by nearest neighbor (NN)-based oversampling techniques. The incorporation of a random vector and a flexible cut-off mechanism for the vector length enhances the diversity among the generated synthetic minority instances and avoids overlapping, making the method suitable for small-sample-size and small-disjunct problems. To address both between-class and within-class imbalance, we also introduce a self-adaptive size assignment strategy for each minority instance to be oversampled, where the assigned size is inversely proportional to its density and its distance from the majority class. In addition to eliminating within-class imbalance, this strategy ensures that informative borderline minority instances have more opportunities to be oversampled, thus improving classification performance. Extensive experimental results on datasets with different distributions and imbalance ratios show that the proposed algorithm significantly outperforms the compared methods.
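Beyond naming its ingredients (a sphere centered on each minority instance, the mean-shift direction within that sphere, a random vector, and a cut-off on the vector length), the abstract does not spell out the generation step. The following Python sketch is therefore only one plausible reading rather than the authors' implementation; the bandwidth, the noise scale, and the use of the nearest-majority distance as the cut-off are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meanshift_guided_generation(X_min, X_maj, sizes, bandwidth, noise=0.1,
                                random_state=0):
    """Hedged sketch of MeanShift-guided synthetic generation.

    For each minority instance x, the classic mean-shift vector
    m(x) = mean(minority neighbors within `bandwidth`) - x gives the step
    direction; a random perturbation adds diversity, and the step length is
    drawn at random but cut off below the distance to x's nearest majority
    instance so synthetics do not cross into the majority region.
    `sizes[i]` is the number of synthetics assigned to X_min[i], e.g. from an
    inverse density/distance weighting as sketched above."""
    rng = np.random.default_rng(random_state)
    d_maj, _ = NearestNeighbors(n_neighbors=1).fit(X_maj).kneighbors(X_min)
    d_maj = d_maj[:, 0]  # distance from each minority instance to the majority class

    synthetics = []
    for i, n_i in enumerate(sizes):
        x = X_min[i]
        # Minority neighbors inside the sphere centered on x (x itself included).
        inside = X_min[np.linalg.norm(X_min - x, axis=1) <= bandwidth]
        shift = inside.mean(axis=0) - x                  # mean-shift direction
        for _ in range(int(n_i)):
            direction = shift + rng.normal(scale=noise * bandwidth, size=x.shape)
            direction /= np.linalg.norm(direction) + 1e-12
            length = rng.uniform(0.0, 1.0) * min(bandwidth, d_maj[i])  # cut-off
            synthetics.append(x + length * direction)
    return np.vstack([X_min] + synthetics) if synthetics else X_min
```

As a design note, drawing the step length uniformly below both the bandwidth and the nearest-majority distance is what keeps the synthetics diverse while preventing them from overlapping the majority region, which is the role the abstract assigns to the random vector and the flexible cut-off.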


Published In

Information Sciences: an International Journal  Volume 672, Issue C
Jun 2024
319 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 18 July 2024

Author Tags

  1. Imbalanced datasets
  2. Classification
  3. Over-sampling
  4. Overlapping
  5. Within-class imbalance

Qualifiers

  • Research-article
