Abstract
Imbalanced data sets are those in which the samples are unevenly distributed among the classes. Classical classifiers perform poorly when classifying such data, which has made class imbalance a challenging issue in the field of machine learning. To mitigate this problem, in this paper we propose a novel hyper-heuristic algorithm, called HTSS, to select the best training samples. In other words, the best training-sample subset is chosen with the goal of enhancing the performance of the classifier when confronting imbalanced data. To this end, several local search algorithms and a choice function are incorporated with a global search algorithm to improve its effectiveness. The global search used in this paper is the binary quantum-inspired gravitational search algorithm (BQIGSA), a recently proposed meta-heuristic for the optimization of binary-encoded problems. Experiments are performed on 75 imbalanced data sets, and the G-mean and AUC measures are employed for evaluation. Comparing the proposed method with other state-of-the-art algorithms shows the superiority of the proposed HTSS method.
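To make the core idea concrete, the sketch below shows how a binary mask over the training set could be scored by the G-mean of a classifier trained on the selected subset. This is a minimal illustration only, not the authors' HTSS implementation: the paper's method couples BQIGSA with local searches and a choice function, whereas here a plain 1-nearest-neighbour classifier and all function names are assumptions introduced for the example.

```python
# Illustrative sketch (assumed names, not the paper's code): score a binary
# training-sample mask by the G-mean it achieves on a validation set.
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = []
    for c in np.unique(y_true):
        idx = y_true == c
        recalls.append(np.mean(y_pred[idx] == c))
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def nn_predict(X_train, y_train, X_test):
    """Plain 1-nearest-neighbour prediction (stand-in classifier)."""
    preds = []
    for x in X_test:
        d = np.sum((X_train - x) ** 2, axis=1)  # squared distances
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)

def fitness(mask, X_train, y_train, X_val, y_val):
    """Score a candidate training subset (binary mask) by validation G-mean."""
    sel = mask.astype(bool)
    if not sel.any():
        return 0.0  # empty subset cannot classify anything
    y_pred = nn_predict(X_train[sel], y_train[sel], X_val)
    return g_mean(y_val, y_pred)
```

A binary metaheuristic such as BQIGSA would then search over candidate masks, keeping the subset whose `fitness` is highest; G-mean is a natural objective here because it rewards balanced recall on both the minority and the majority class.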
Cite this article
Nikpour, B., Nezamabadi-pour, H. HTSS: a hyper-heuristic training set selection method for imbalanced data sets. Iran J Comput Sci 1, 109–128 (2018). https://doi.org/10.1007/s42044-018-0009-2