Using kNN model for automatic text categorization

Published: 01 March 2006

Abstract

This paper investigates two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weaknesses and strengths of each technique, a new classifier called the kNN model-based classifier (kNN Model) is proposed. It combines the strengths of both kNN and Rocchio. A text categorization prototype, which implements kNN Model along with kNN and Rocchio, is described. The methods are evaluated experimentally on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the proposed kNN model-based method outperforms the kNN and Rocchio classifiers, and is therefore a good alternative to kNN and Rocchio in some application areas.
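For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of each, not the paper's implementation and not the proposed kNN Model. It assumes documents are sparse term-weight dictionaries compared by cosine similarity, and it uses a simplified centroid form of Rocchio that omits the negative-example term. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, query, k=3):
    """kNN: majority vote among the k training documents most similar to query."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def centroids(train):
    """Simplified Rocchio: one prototype per class, the mean of its vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for vec, label in train:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def rocchio_predict(protos, query):
    """Assign the class whose prototype is most similar to query."""
    return max(protos, key=lambda c: cosine(protos[c], query))


# Hypothetical toy corpus: (term weights, category)
train = [
    ({"goal": 2, "match": 1}, "sport"),
    ({"team": 1, "goal": 1}, "sport"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"market": 1, "bank": 2}, "finance"),
]
query = {"goal": 1, "market": 1, "team": 1}

knn_label = knn_predict(train, query, k=3)
rocchio_label = rocchio_predict(centroids(train), query)
```

The sketch shows the trade-off the abstract alludes to: kNN keeps every training document and compares the query against all of them at prediction time, while the centroid approach compresses each class into a single prototype.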


Published In

Soft Computing - A Fusion of Foundations, Methodologies and Applications  Volume 10, Issue 5
March 2006
64 pages
ISSN:1432-7643
EISSN:1433-7479

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Performance
  2. Rocchio
  3. Text categorization
  4. kNN
  5. kNN Model
