Abstract
The classification of imbalanced data has been recognized as a crucial problem in machine learning and data mining. In an imbalanced dataset, there are significantly fewer training instances of one class than of another, so the minority class instances are much more likely to be misclassified. In the literature, the synthetic minority over-sampling technique (SMOTE) has been developed to deal with the classification of imbalanced datasets. It balances the dataset by synthesizing new minority class samples from the existing minority class instances. Nevertheless, existing SMOTE-based algorithms use the same sampling rate for all instances of the minority class, which results in sub-optimal performance. To address this issue, we propose a novel genetic algorithm-based SMOTE (GASMOTE) algorithm. GASMOTE assigns different sampling rates to different minority class instances and searches for the optimal combination of sampling rates. Experimental results on ten typical imbalanced datasets show that, compared with the SMOTE algorithm, GASMOTE improves the F-measure by 5.9% and the G-mean by 1.6%, and compared with the Borderline-SMOTE algorithm, it improves the F-measure by 3.7% and the G-mean by 2.3%. GASMOTE can therefore serve as a new over-sampling technique for imbalanced dataset classification. We further apply GASMOTE to a practical engineering problem: rockburst prediction on the VCR rockburst dataset. The experimental results indicate that GASMOTE can accurately predict rockburst occurrence and hence provide guidance for the design and construction of safe deep mining engineering structures.
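To make the idea concrete, the sketch below shows SMOTE-style interpolation with a separate sampling rate per minority instance, which is the quantity a genetic algorithm would optimize in GASMOTE. It is a minimal illustration assuming numeric features; the function and variable names (per_instance_smote, rates) are ours, not from the paper, and the GA itself is only described in the closing comment.

```python
# Minimal sketch of per-instance SMOTE over-sampling (illustrative, not the
# authors' reference implementation). Assumes numeric feature vectors.
import numpy as np

def per_instance_smote(X_min, rates, k=5, rng=None):
    """Generate synthetic minority samples.

    X_min : (n, d) array of minority-class instances.
    rates : length-n integer array; rates[i] is the number of synthetic
            samples to create from instance i (in GASMOTE, a GA chromosome
            encodes one such rate per minority instance).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    # Pairwise squared distances to find each instance's k nearest minority neighbours.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude the instance itself
    nn = np.argsort(d2, axis=1)[:, :k]

    synthetic = []
    for i in range(n):
        for _ in range(int(rates[i])):
            j = nn[i, rng.integers(k)]    # pick a random neighbour
            gap = rng.random()            # interpolation factor in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# A GA would evaluate each candidate `rates` vector by training a classifier on
# the rebalanced data and using F-measure / G-mean as the fitness to maximize.
X_min = np.random.rand(20, 4)
rates = np.random.randint(0, 4, size=20)  # one candidate chromosome
print(per_instance_smote(X_min, rates).shape)
```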
Cite this article
Jiang, K., Lu, J. & Xia, K. A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE. Arab J Sci Eng 41, 3255–3266 (2016). https://doi.org/10.1007/s13369-016-2179-2