HTSS: a hyper-heuristic training set selection method for imbalanced data sets

  • Original Article
  • Published in: Iran Journal of Computer Science

Abstract

Imbalanced data sets are those in which the samples are unevenly distributed among the classes. Classical classifiers run into difficulty when classifying such data, which has made class imbalance a challenging issue in machine learning. To alleviate this problem, we propose a novel hyper-heuristic algorithm, called HTSS, that selects the best training samples. In other words, the best training sample subset is chosen with the goal of enhancing the classifier's performance on imbalanced data. To this end, several local search algorithms and a choice function are incorporated into a global search algorithm to improve its effectiveness. The global search used in this paper is the binary quantum-inspired gravitational search algorithm (BQIGSA), a recently proposed meta-heuristic for optimizing binary-encoded problems. Experiments are performed on 75 imbalanced data sets, with the G-mean and AUC measures employed for evaluation. Comparisons with other state-of-the-art algorithms show the superiority of the proposed HTSS method.
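The abstract's idea can be illustrated with a much-simplified sketch: encode the training set as a binary inclusion mask, score a candidate subset by the G-mean a classifier achieves on held-out data, and let a choice function pick among low-level moves based on past reward. This is only a minimal illustration, not the authors' implementation: HTSS uses BQIGSA as the global search, whereas the `drop`/`restore` moves, the 1-NN base classifier, and the decayed-reward choice function below are assumptions made for the example.

```python
import random

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return (sens * spec) ** 0.5

def knn_predict(train_X, train_y, x, k=1):
    """Plain 1-NN on squared Euclidean distance; the base classifier is an assumption."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
                   for tx, ty in zip(train_X, train_y))
    votes = [ty for _, ty in dists[:k]]
    return max(set(votes), key=votes.count)

def fitness(mask, X, y, Xv, yv):
    """Score a candidate training subset by its validation G-mean."""
    sel = [(xi, yi) for xi, yi, m in zip(X, y, mask) if m]
    if not sel or len({yi for _, yi in sel}) < 2:
        return 0.0  # degenerate subsets (empty or one-class) get the worst score
    sX, sy = zip(*sel)
    return g_mean(yv, [knn_predict(sX, sy, x) for x in Xv])

def hyper_heuristic_select(X, y, Xv, yv, iters=200, seed=0):
    """Choice-function loop: favor the historically rewarding move, keep improvements."""
    rng = random.Random(seed)
    mask = [1] * len(X)
    best = fitness(mask, X, y, Xv, yv)
    scores = {"drop": 0.0, "restore": 0.0}  # one score per low-level move
    for _ in range(iters):
        move = max(scores, key=lambda m: scores[m] + rng.random())  # score + noise
        cand = mask[:]
        cand[rng.randrange(len(X))] = 0 if move == "drop" else 1
        f = fitness(cand, X, y, Xv, yv)
        scores[move] = 0.9 * scores[move] + (f - best)  # decayed improvement reward
        if f >= best:
            best, mask = f, cand
    return mask, best

# Tiny synthetic imbalanced problem: 40 majority vs 8 minority training samples.
data_rng = random.Random(1)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)] + \
    [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8
Xv = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)] + \
     [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(4)]
yv = [0] * 20 + [1] * 4
mask, score = hyper_heuristic_select(X, y, Xv, yv)
```

In the full method, the choice-function and local-search moves act on top of BQIGSA's population-based global search rather than on a single mask as here, but the fitness signal (validation G-mean on the selected subset) plays the same role.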



Author information

Correspondence to Hossein Nezamabadi-pour.

About this article

Cite this article

Nikpour, B., Nezamabadi-pour, H. HTSS: a hyper-heuristic training set selection method for imbalanced data sets. Iran J Comput Sci 1, 109–128 (2018). https://doi.org/10.1007/s42044-018-0009-2
