Abstract
This paper introduces a new hybrid method to address the problem of redundant and irrelevant features selected by filter-based methods for text classification. The method relies on an enhanced genetic algorithm called the Feature Correlation-based Genetic Algorithm (FC-GA). First, a filter-based method selects the feature subset with the highest classification accuracy; FC-GA then uses this subset to generate candidate solutions by exploiting the correlation between features with similar classification weights, rather than generating useless random solutions. During encoding, a feature is assigned the value 0 when its correlation with other features carrying almost the same classification information exceeds a specified threshold, while weakly correlated features retain their initial code of 1. Through iterative optimization with crossover and mutation operators, the algorithm removes strongly correlated, highly redundant features, which can improve classification performance at a lower computational cost. The aims of this study are to improve the efficiency of filter-based methods, incorporate feature-correlation information into genetic algorithms, and exploit pre-optimized feature subsets to identify optimal solutions efficiently. To evaluate the effectiveness of the proposed method, support vector machine (SVM) and naive Bayes (NB) classifiers are applied to six public datasets, and the method is compared to five well-known and effective filter-based methods. The results indicate that a significant portion (about 50%) of the features selected by the reference filter-based methods are redundant, and that eliminating these redundant features yields a significant improvement in classification performance as measured by micro-F1.
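To make the encoding idea concrete, the following minimal Python sketch seeds a GA chromosome from pairwise feature correlations: features whose absolute correlation with an already-kept feature exceeds a threshold are coded 0 (removal candidates), while the remaining features keep their initial code of 1. The threshold value, the greedy scan order, and the function name correlation_seed_chromosome are illustrative assumptions, not the exact FC-GA procedure described in the paper.

import numpy as np

def correlation_seed_chromosome(X, corr_threshold=0.9):
    # Seed a GA chromosome from pairwise feature correlations.
    # Features strongly correlated with an already-kept feature are coded 0
    # (candidates for removal); weakly correlated features keep the code 1.
    # The 0.9 threshold and the greedy left-to-right scan are illustrative
    # assumptions, not the authors' exact FC-GA encoding.
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |Pearson| between feature columns
    chromosome = np.ones(n_features, dtype=int)   # start with every feature kept
    for j in range(1, n_features):
        kept = np.flatnonzero(chromosome[:j])     # indices of features kept so far
        if kept.size and corr[j, kept].max() >= corr_threshold:
            chromosome[j] = 0                     # redundant w.r.t. a kept feature
    return chromosome

# Hypothetical usage: X_filtered is a dense document-term matrix restricted to
# the subset pre-selected by a filter method; the result can seed the GA population.
# seed = correlation_seed_chromosome(X_filtered, corr_threshold=0.9)

Crossover and mutation would then be applied to such seeded chromosomes to refine the subset iteratively, as outlined in the abstract.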
Availability of data and materials
The data described in this article is publicly available at https://www.kaggle.com/datasets and https://starling.utdallas.edu/datasets/.
Funding
Not applicable.
Author information
Contributions
The authors, Lazhar Farek and Amira Benaidja, contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest in preparing this article.
Ethical approval
This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Farek, L., Benaidja, A. A hybrid feature selection method for text classification using a feature-correlation-based genetic algorithm. Soft Comput 28, 13567–13593 (2024). https://doi.org/10.1007/s00500-024-10386-x