
An optimal feature selection method for text classification through redundancy and synergy analysis

Published in: Multimedia Tools and Applications

Abstract

Feature selection is an essential step in text classification: it enhances model performance, reduces computational complexity, and mitigates the risk of overfitting. Filter-based methods are popular for their effectiveness and efficiency in selecting informative features. However, these methods often overlook feature correlations, selecting redundant or irrelevant features while underestimating others. To address this limitation, this paper proposes FS-RSA (Feature Selection through Redundancy and Synergy Analysis), a novel method for text classification. FS-RSA identifies an optimal feature subset by considering feature interactions at a lower computational cost, evaluating features so as to maximize synergy information and minimize redundancy within small subsets. The core principle of FS-RSA is that features offering similar classification information about the class variable are likely to be correlated and redundant, whereas pairing features with high and low classification information can yield synergistic information. In experiments on five public datasets, FS-RSA was compared to five effective filter-based methods for text classification. It consistently achieved higher F1 scores with Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers, highlighting its effectiveness in feature selection while significantly reducing dimensionality.
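The redundancy/synergy distinction described above can be made concrete with interaction information from information theory: for two features F1, F2 and class variable C, I(F1;F2;C) = I(F1,F2;C) − I(F1;C) − I(F2;C) is negative when the pair carries overlapping (redundant) class information and positive when the pair is synergistic, i.e. jointly more informative than the sum of its parts. The sketch below is illustrative only; it is not the authors' FS-RSA implementation, and the function names and toy data are our own.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X), in bits, of a sequence of discrete symbols."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_info(f1, f2, c):
    """Interaction information I(F1;F2;C) = I(F1,F2;C) - I(F1;C) - I(F2;C).
    Positive -> the feature pair is synergistic w.r.t. the class;
    negative -> the pair is redundant."""
    joint = list(zip(f1, f2))
    return mutual_info(joint, c) - mutual_info(f1, c) - mutual_info(f2, c)

# Toy example: binary term-presence features and binary class labels.
c  = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 1, 1, 0, 0, 1, 1]   # individually uninformative about c
f2 = [0, 0, 1, 1, 1, 1, 0, 0]   # individually uninformative, but c = f1 XOR f2
f4 = list(c)                    # perfectly predicts c
f5 = list(c)                    # duplicate of f4 -> fully redundant with it

print(interaction_info(f1, f2, c))  # 1.0  (synergistic pair)
print(interaction_info(f4, f5, c))  # -1.0 (redundant pair)
```

In FS-RSA's terms, a scheme built on such a measure would penalize pairs like (f4, f5), whose class information overlaps, and reward pairs like (f1, f2), which are only informative jointly.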


Availability of supporting data

The data described in this article are publicly available at https://www.kaggle.com/datasets.


Author information


Contributions

The authors, Farek Lazhar and Benaidja Amira, contributed equally to this work.

Corresponding author

Correspondence to Lazhar Farek.

Ethics declarations

Ethical Approval

This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas.

Competing interests

The authors declare no conflicts of interest in preparing this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Farek, L., Benaidja, A. An optimal feature selection method for text classification through redundancy and synergy analysis. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19736-1

