Abstract
Imbalanced data sets are those in which the samples are unevenly distributed among the classes. Classical classifiers perform poorly when classifying such data, which has made class imbalance a challenging issue in the field of machine learning. To mitigate this problem, in this paper we propose a novel hyper-heuristic algorithm, called HTSS, to select the best training samples. In other words, the best training-sample subset is chosen with the goal of enhancing the performance of the classifier when confronting imbalanced data. To this end, several local search algorithms and a choice function are incorporated with a global search algorithm to improve its effectiveness. The global search used in this paper is the binary quantum-inspired gravitational search algorithm (BQIGSA), a recently proposed meta-heuristic for the optimization of binary-encoded problems. Experiments are performed on 75 imbalanced data sets, and the G-mean and AUC measures are employed for evaluation. Comparing the proposed method with other state-of-the-art algorithms shows the superiority of the proposed HTSS method.
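To make the core idea concrete, the sketch below shows how a binary mask over the training set could be scored by the G-mean of a classifier trained on the selected subset. This is a minimal illustration only, not the authors' HTSS implementation: the paper's method couples BQIGSA with local searches and a choice function, whereas here a plain 1-nearest-neighbour classifier and all function names are assumptions introduced for the example.

```python
# Illustrative sketch (assumed names, not the paper's code): score a binary
# training-sample mask by the G-mean it achieves on a validation set.
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = []
    for c in np.unique(y_true):
        idx = y_true == c
        recalls.append(np.mean(y_pred[idx] == c))
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def nn_predict(X_train, y_train, X_test):
    """Plain 1-nearest-neighbour prediction (stand-in classifier)."""
    preds = []
    for x in X_test:
        d = np.sum((X_train - x) ** 2, axis=1)  # squared distances
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)

def fitness(mask, X_train, y_train, X_val, y_val):
    """Score a candidate training subset (binary mask) by validation G-mean."""
    sel = mask.astype(bool)
    if not sel.any():
        return 0.0  # empty subset cannot classify anything
    y_pred = nn_predict(X_train[sel], y_train[sel], X_val)
    return g_mean(y_val, y_pred)
```

A binary metaheuristic such as BQIGSA would then search over candidate masks, keeping the subset whose `fitness` is highest; G-mean is a natural objective here because it rewards balanced recall on both the minority and the majority class.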
Cite this article
Nikpour, B., Nezamabadi-pour, H. HTSS: a hyper-heuristic training set selection method for imbalanced data sets. Iran J Comput Sci 1, 109–128 (2018). https://doi.org/10.1007/s42044-018-0009-2