Abstract
Imbalance in data distribution hinders the learning performance of classifiers. A popular class of methods for this problem is based on sampling (oversampling the minority class and undersampling the majority class) so that the imbalanced data become relatively balanced. However, existing methods usually focus on a single sampling technique, either oversampling or undersampling, and therefore suffer when the imbalance ratio (the number of majority instances over the number of minority instances) is large. In this paper, an active learning framework is proposed to deal with imbalanced data by alternately performing important sampling (ALIS), which consists of selecting important majority-class instances and generating informative minority-class instances. In ALIS, the two sampling strategies reinforce each other: the selected majority-class instances provide clearer information for the next oversampling step, while the generated minority-class instances provide more sufficient information for the next undersampling step. Extensive experiments have been conducted on real-world datasets covering a wide range of imbalance ratios. The results demonstrate the superiority of ALIS over state-of-the-art methods in terms of several well-known evaluation metrics.
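The alternation described in the abstract can be pictured as a simple loop in which each sampling step is conditioned on the other's latest output. Below is a minimal, hypothetical sketch of that loop, not the paper's actual algorithm: the margin-based selection of majority instances and the SMOTE-style interpolation for generating minority instances are illustrative stand-ins for the importance-sampling criteria the abstract does not specify, and all names (`alternate_sampling`, `keep_frac`, `n_gen`) are assumptions introduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def alternate_sampling(X_maj, X_min, rounds=5, keep_frac=0.8):
    """Alternately undersample the majority class and oversample the
    minority class; each step uses the other's latest output.
    Assumes at least two minority instances."""
    rng = np.random.default_rng(0)
    n_gen = len(X_min)  # synthetic minority instances added per round
    for _ in range(rounds):
        # Undersampling: refit a classifier on the current data and keep
        # the majority instances closest to the decision boundary, where
        # they carry the most discriminative information.
        X = np.vstack([X_maj, X_min])
        y = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        order = np.argsort(np.abs(clf.decision_function(X_maj)))
        X_maj = X_maj[order[: max(1, int(keep_frac * len(X_maj)))]]
        # Oversampling: interpolate between minority neighbors
        # (SMOTE-style) so the next undersampling round sees a denser,
        # more informative minority region.
        k = min(5, len(X_min))
        nbrs = NearestNeighbors(n_neighbors=k).fit(X_min)
        seeds = rng.integers(len(X_min), size=n_gen)
        nbr_idx = nbrs.kneighbors(X_min[seeds], return_distance=False)
        mates = X_min[nbr_idx[np.arange(n_gen), rng.integers(1, k, size=n_gen)]]
        lam = rng.random((n_gen, 1))
        X_min = np.vstack([X_min, X_min[seeds] + lam * (mates - X_min[seeds])])
    return X_maj, X_min

# Toy usage: 500 majority vs. 25 minority points, an imbalance ratio of 20.
rng = np.random.default_rng(1)
X_maj_s, X_min_s = alternate_sampling(rng.normal(0.0, 1.0, (500, 2)),
                                      rng.normal(2.0, 1.0, (25, 2)))
```

The point of the coupling is that each round's reduced majority set changes the fitted boundary, which in turn changes where the next batch of synthetic minority instances is most useful, mirroring the mutual influence the abstract describes.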
Acknowledgements
This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61822601, 61773050, 61632004, 61972132), Beijing Natural Science Foundation (Grant No. Z180006), National Key Research and Development Program (Grant No. 2017YFC1703506), Fundamental Research Funds for the Central Universities (Grant Nos. 2019JBZ110, 2019YJS040), Youth Foundation of Hebei Education Department (Grant No. QN2018084), Science and Technology Foundation of Hebei Agricultural University (Grant No. LG201804), and Research Project for Self-cultivating Talents of Hebei Agricultural University (Grant No. PY201810).
Supporting information
Appendixes A-C. The supporting information is available online at info.scichina.com and link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.
Cite this article
Wang, X., Liu, B., Cao, S. et al. Important sampling based active learning for imbalance classification. Sci. China Inf. Sci. 63, 182104 (2020). https://doi.org/10.1007/s11432-019-2771-0