Abstract
Two well-known similarity-based learning approaches to text categorization are investigated: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weaknesses and strengths of each technique, a new classifier called the kNN model-based classifier (kNN Model) is proposed. It combines the strengths of both kNN and Rocchio. A text categorization prototype, which implements kNN Model along with kNN and Rocchio, is described. An experimental evaluation of the different methods is carried out on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the proposed kNN model-based method outperforms the kNN and Rocchio classifiers, and is therefore a good alternative to kNN and Rocchio in some application areas.
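For orientation, the two baseline classifiers named in the abstract can be sketched as follows. This is a minimal illustration and not the prototype described in the article: it assumes documents are already represented as sparse term-weight dictionaries, uses cosine similarity, and reduces Rocchio to its simplest form (one centroid per class, with no weighting of negative examples). The function names and the toy data are hypothetical. The proposed kNN Model itself is not reproduced here, since the abstract does not specify its construction.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, query, k=3):
    """kNN: majority vote among the k training documents most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def rocchio_fit(train):
    """Simplified Rocchio: one prototype per class, the mean of its document vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for vec, label in train:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }


def rocchio_predict(centroids, query):
    """Assign the class whose prototype is most similar to the query."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))


# Hypothetical toy corpus: (term-weight vector, category) pairs.
train = [
    ({"goal": 2.0, "match": 1.0}, "sport"),
    ({"team": 1.0, "goal": 1.0}, "sport"),
    ({"match": 1.0, "team": 2.0}, "sport"),
    ({"stock": 2.0, "market": 1.0}, "finance"),
    ({"market": 2.0, "bank": 1.0}, "finance"),
    ({"bank": 1.0, "stock": 1.0}, "finance"),
]
query = {"goal": 1.0, "team": 1.0}

print(knn_predict(train, query, k=3))
print(rocchio_predict(rocchio_fit(train), query))
```

The contrast the sketch makes visible is that kNN keeps every training document and defers all work to query time, while the centroid form of Rocchio compresses each class into a single prototype.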
Additional information
This work was partly supported by the European Commission project ICONS, project no. IST-2001-32429.
Cite this article
Guo, G., Wang, H., Bell, D. et al. Using kNN model for automatic text categorization. Soft Comput 10, 423–430 (2006). https://doi.org/10.1007/s00500-005-0503-y
DOI: https://doi.org/10.1007/s00500-005-0503-y