DOI: 10.1145/584792.584911

High-performing feature selection for text classification

Published: 04 November 2002

Abstract

This paper reports a controlled study of a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method, and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Volume 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on χ² statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
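
The abstract does not spell out the χ² variants it compares, so the following is only a minimal sketch of the general idea, assuming the textbook 2x2 term/class contingency form of the statistic and a plain "keep the top fraction" rule. The function names, the `keep_fraction` parameter, and the toy documents are hypothetical and are not taken from the paper.

```python
import math


def chi2_score(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (e.g. the term occurs in every document) carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, target, keep_fraction=0.03):
    """Rank terms by chi-square against one target class and keep the top fraction.

    docs is a list of token collections, labels the parallel list of class names.
    The 3% default mirrors the figure quoted in the abstract; it is an assumption
    here, not a recommendation from the paper.
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos

    # Document frequencies of each term inside and outside the target class.
    in_class, out_class = {}, {}
    for tokens, y in zip(docs, labels):
        counts = in_class if y == target else out_class
        for term in set(tokens):
            counts[term] = counts.get(term, 0) + 1

    vocab = set(in_class) | set(out_class)
    scores = {}
    for term in vocab:
        a = in_class.get(term, 0)
        b = out_class.get(term, 0)
        scores[term] = chi2_score(a, b, n_pos - a, n_neg - b)

    k = max(1, math.ceil(keep_fraction * len(vocab)))
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda term: (-scores[term], term))
    return ranked[:k], scores


if __name__ == "__main__":
    toy_docs = [
        {"oil", "price"},
        {"oil", "barrel"},
        {"goal", "match"},
        {"goal", "price"},
    ]
    toy_labels = ["energy", "energy", "sport", "sport"]
    kept, all_scores = select_features(toy_docs, toy_labels, "energy", keep_fraction=0.4)
    print(kept)
```

Note that the statistic is two-sided: a term that is strongly associated with the absence of the class (here "goal") scores as highly as one associated with its presence (here "oil"), which is one reason published variants of the criterion differ.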

Published In

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN:1581134924
DOI:10.1145/584792
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. feature selection
  2. text classification

Qualifiers

  • Article

Conference

CIKM02

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
