
An adaptive k-nearest neighbor text categorization strategy

Published: 01 December 2004

Abstract

The parameter k is the most important parameter in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, the k nearest documents in the training set are determined first. The categories of the document are then predicted from the category distribution among these k nearest neighbors. In general, the class distribution in a training set is uneven: some classes have more samples than others. The system's performance is very sensitive to the choice of k, and a fixed k value is very likely to bias results toward large categories and to fail to make full use of the information in the training set. To deal with these problems, this article proposes an improved kNN strategy in which different numbers of nearest neighbors are used for different categories, instead of a fixed number across all categories. More samples (nearest neighbors) are used to decide whether a test document should be classified in a category that has more samples in the training set; the number of nearest neighbors selected for each category adapts to that category's sample size in the training set. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones, and can properly classify documents belonging to smaller classes with a large k. The strategy is especially applicable and promising for cases where estimating the parameter k via cross-validation is not possible and the class distribution of a training set is skewed.
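
The idea of per-category neighbor counts can be illustrated with a minimal Python sketch. It is not the article's exact method: the abstract does not give the scaling rule or the decision function, so the proportional rule in `adaptive_k`, the similarity-share score in `classify`, and all function and parameter names (including `min_k`) are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_k(k, class_sizes, min_k=1):
    """Per-category neighbor counts, scaled by class size.

    Assumed rule (not taken from the article): the largest class uses
    the full k; smaller classes use a proportionally smaller count,
    never below min_k and never above k or the class size.
    """
    largest = max(class_sizes.values())
    return {
        c: max(min_k, min(k, n, math.ceil(k * n / largest)))
        for c, n in class_sizes.items()
    }


def classify(doc, training, k, min_k=1):
    """Score each category using its own number of nearest neighbors.

    training is a list of (vector, label) pairs. The score of a
    category is the share of similarity mass it holds among the top
    k_c neighbors (an assumed decision function).
    """
    sizes = Counter(label for _, label in training)
    k_per_class = adaptive_k(k, sizes, min_k)
    # Rank all training documents by similarity to the new document.
    ranked = sorted(
        ((cosine(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {}
    for c, kc in k_per_class.items():
        top = ranked[:kc]
        total = sum(s for s, _ in top)
        scores[c] = sum(s for s, y in top if y == c) / total if total else 0.0
    best = max(scores, key=scores.get)
    return best, scores
```

With a skewed training set, for example eight documents in one category and two in another, calling `classify` with `k=8` examines only the two nearest neighbors when scoring the small category, so a document close to the small category is not outvoted by the large one as it would be under a plain majority vote over a fixed k.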




Published In

ACM Transactions on Asian Language Information Processing, Volume 3, Issue 4
December 2004
57 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/1039621

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. k-nearest neighbor algorithm
  2. machine learning
  3. text categorization
  4. text classification

Qualifiers

  • Article
