A hybrid feature selection method for text classification using a feature-correlation-based genetic algorithm


Abstract

This paper introduces a hybrid method that addresses the redundant and irrelevant features selected by filter-based methods for text classification. The method relies on an enhanced genetic algorithm, the Feature Correlation-based Genetic Algorithm (FC-GA). First, a filter-based method selects the feature subset with the highest classification accuracy. FC-GA then uses this subset to generate candidate solutions by considering the correlation between features that have similar classification weights, rather than starting from uninformative random solutions. In the encoding step, a feature is assigned the value 0 if it is highly correlated with other features that carry almost the same classification information beyond a specified context; weakly correlated features retain their initial code of 1. Through iterative optimization with crossover and mutation operators, the algorithm is expected to remove strongly correlated, redundant features, which could improve classification performance at a lower computational cost. The study aims to improve the efficiency of filter-based methods, incorporate feature correlation information into genetic algorithms, and use pre-optimized feature subsets to identify optimal solutions efficiently. The proposed method is evaluated with SVM and NB classifiers on six public datasets and compared with five well-known and effective filter-based methods. The results indicate that a substantial portion (about 50%) of the features selected by the reference filter-based methods are redundant, and that eliminating them significantly improves classification performance as measured by micro-F1.
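
The encoding step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the correlation measure, the thresholds, or the tie-breaking rule, so the use of Pearson correlation, the parameters `corr_threshold` and `weight_tol`, and the function names `pearson` and `seed_chromosome` are all assumptions made for illustration. The sketch takes the features already chosen by a filter method, visits them in order of decreasing filter weight, and codes a feature as 0 when it is strongly correlated with an already kept feature of similar weight.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def seed_chromosome(columns, weights, corr_threshold=0.9, weight_tol=0.05):
    """Build an initial 0/1 chromosome for a GA (hypothetical helper).

    columns: one list of per-document values for each filter-selected feature.
    weights: the filter score of each feature (e.g. an information-gain value).
    A feature gets code 0 when it has a similar weight to, and is strongly
    correlated with, a higher-ranked feature that was kept; otherwise it
    keeps its initial code of 1.
    """
    order = sorted(range(len(columns)), key=lambda i: -weights[i])
    code = [1] * len(columns)
    kept = []
    for i in order:
        for j in kept:
            similar_weight = abs(weights[i] - weights[j]) <= weight_tol
            if similar_weight and abs(pearson(columns[i], columns[j])) >= corr_threshold:
                code[i] = 0  # assumed redundant with feature j
                break
        else:
            kept.append(i)  # no redundant partner found, so the feature stays
    return code


# Toy data: feature 1 duplicates feature 0 and has almost the same weight.
columns = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
]
weights = [0.80, 0.78, 0.50]
print(seed_chromosome(columns, weights))
```

A chromosome produced this way would serve only as a starting point; in the scheme the abstract outlines, crossover and mutation then refine it, with classifier performance presumably acting as the fitness signal.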


Availability of data and materials

The data described in this article is publicly available at https://www.kaggle.com/datasets and https://starling.utdallas.edu/datasets/.


Funding

Not applicable.

Author information

Authors and Affiliations

Computer Science Department, University of Guelma, Guelma, Algeria
Lazhar Farek
Computer Science Department, University of Setif 1, Setif, Algeria
Amira Benaidja
Laboratory of Vision and Artificial Intelligence - LAVIA, Larbi Tebessi University, Tebessa, Algeria
Lazhar Farek & Amira Benaidja

Authors

Lazhar Farek
Amira Benaidja

Contributions

The authors, Lazhar Farek and Amira Benaidja, contributed equally to this work.

Corresponding author

Correspondence to Lazhar Farek.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest in preparing this article.

Ethical approval

This research did not involve animal or human participants, nor did it take place in any private or protected areas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Farek, L., Benaidja, A. A hybrid feature selection method for text classification using a feature-correlation-based genetic algorithm. Soft Comput 28, 13567–13593 (2024). https://doi.org/10.1007/s00500-024-10386-x


Accepted: 25 May 2024
Published: 19 November 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s00500-024-10386-x
