An adaptive k-nearest neighbor text categorization strategy

Authors:

Yu Shiwen

ACM Transactions on Asian Language Information Processing (TALIP), Volume 3, Issue 4

Pages 215 - 226

https://doi.org/10.1145/1039621.1039623

Published: 01 December 2004

Abstract

The number of neighbors, k, is the most important parameter in a text categorization system based on the k-nearest neighbor (kNN) algorithm. To classify a new document, the k nearest documents in the training set are determined first; the document's categories are then predicted from the category distribution among those k nearest neighbors. In general, the class distribution of a training set is uneven: some classes have more samples than others. The system's performance is very sensitive to the choice of k, and a fixed k is very likely to bias predictions toward large categories and to leave some of the information in the training set unused. To address these problems, this article proposes an improved kNN strategy that uses a different number of nearest neighbors for each category instead of a fixed number across all categories. More nearest neighbors are used to decide whether a test document belongs to a category that has more samples in the training set, so the number of nearest neighbors selected for each category adapts to that category's sample size in the training set. Experiments on two different datasets show that the proposed methods are less sensitive to the parameter k than the traditional ones and can properly classify documents belonging to smaller classes even when k is large. The strategy is especially applicable and promising when estimating k via cross-validation is not possible and the class distribution of the training set is skewed.
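
The idea of a per-category neighbor count can be illustrated with a minimal Python sketch. The abstract does not give the exact formula or decision function, so everything specific here is an assumption made for illustration: the function names `fixed_knn_scores` and `adaptive_knn_scores`, the proportional rule `k_c = ceil(k * n_c / n_max)`, the `min_k` floor, and the similarity-weighted, normalized score are hypothetical and are not taken from the article.

```python
import math


def fixed_knn_scores(neighbors, k):
    """Baseline: sum the similarities of the top-k neighbors per label.

    neighbors: list of (similarity, label) pairs for training documents.
    """
    ranked = sorted(neighbors, key=lambda pair: pair[0], reverse=True)
    scores = {}
    for sim, label in ranked[:k]:
        scores[label] = scores.get(label, 0.0) + sim
    return scores


def adaptive_knn_scores(neighbors, class_sizes, k, min_k=1):
    """Score each category using its own number of nearest neighbors.

    ASSUMPTION (not from the article): category c uses
    k_c = ceil(k * n_c / n_max) neighbors, with a floor of min_k, and its
    score is the share of similarity mass among those k_c neighbors that
    comes from documents labeled c.
    """
    ranked = sorted(neighbors, key=lambda pair: pair[0], reverse=True)
    largest = max(class_sizes.values())
    scores = {}
    for label, size in class_sizes.items():
        # Larger categories look at more neighbors, smaller ones at fewer.
        k_c = max(min_k, math.ceil(k * size / largest))
        top = ranked[:min(k_c, len(ranked))]
        total = sum(sim for sim, _ in top)
        hits = sum(sim for sim, lab in top if lab == label)
        scores[label] = hits / total if total > 0 else 0.0
    return scores


# Toy data: the single closest document is from a small class, but the
# other four of the five nearest neighbors belong to a large class.
neighbors = [(0.9, "small"), (0.8, "big"), (0.7, "big"),
             (0.6, "big"), (0.5, "big")]
class_sizes = {"big": 100, "small": 10}

fixed = fixed_knn_scores(neighbors, k=5)
adaptive = adaptive_knn_scores(neighbors, class_sizes, k=5)
print(max(fixed, key=fixed.get))
print(max(adaptive, key=adaptive.get))
```

In this toy case the fixed-k baseline is dominated by the large class, while the adaptive variant judges the small class on its single nearest neighbor. This only shows the kind of bias the abstract describes; it does not reproduce the article's experiments.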

Published In

ACM Transactions on Asian Language Information Processing Volume 3, Issue 4

December 2004

57 pages

ISSN: 1530-0226

EISSN: 1558-3430

DOI: 10.1145/1039621

Copyright © 2004 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics

Total Citations: 82
Total Downloads: 1,923
Downloads (last 12 months): 17
Downloads (last 6 weeks): 1

Reflects downloads up to 10 Dec 2024

Cited By

Guo J, Du S (2024). Generating Fuzzy Membership Functions for Modeling Wetland Ecosystems From Multispectral Remote Sensing Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17 (7640-7654). Online publication date: 2024.
https://doi.org/10.1109/JSTARS.2024.3379371
Zhao Z, Fan Y (2024). Adaptive locally weighted support vector algorithm with asymmetrically parametric insensitive/margin model. Knowledge-Based Systems, 293 (111713). Online publication date: Jun-2024.
https://doi.org/10.1016/j.knosys.2024.111713
Alasadi J, Bati G, Al Hilli A (2024). A deep learning based approach for image retrieval extraction in mobile edge computing. Journal of Umm Al-Qura University for Engineering and Architecture, 15:3 (318-326). Online publication date: 21-May-2024.
https://doi.org/10.1007/s43995-024-00060-6
Chowdhury S, Kumpati N, Mukhopadhyay S (2024). Mutual Learning for News Classification. Intelligent Systems and Applications (37-54). Online publication date: 31-Jul-2024.
https://doi.org/10.1007/978-3-031-66428-1_3
Colonval J, Bouquet F (2023). Multidimensional Adaptative kNN over Tracking Outliers (Makoto). Advanced Data Mining and Applications (535-550). Online publication date: 27-Aug-2023.
https://dl.acm.org/doi/10.1007/978-3-031-46661-8_36
Patel H, Singh Rajput D, Petru Stan O, Cristian Miclea L (2022). A New Fuzzy Adaptive Algorithm to Classify Imbalanced Data. Computers, Materials & Continua, 70:1 (73-89). Online publication date: 2022.
https://doi.org/10.32604/cmc.2022.017114
Li Q, Peng H, Li J, Xia C, Yang R, Sun L, Yu P, He L (2022). A Survey on Text Classification: From Traditional to Deep Learning. ACM Transactions on Intelligent Systems and Technology, 13:2 (1-41). Online publication date: 8-Apr-2022.
https://dl.acm.org/doi/10.1145/3495162
Chen C, Wang Z, Wu J, Wang X, Guo L, Li Y, Liu S (2021). Interactive Graph Construction for Graph-Based Semi-Supervised Learning. IEEE Transactions on Visualization and Computer Graphics, 27:9 (3701-3716). Online publication date: 1-Sep-2021.
https://doi.org/10.1109/TVCG.2021.3084694
Papanikolaou M, Evangelidis G, Ougiaroglou S (2021). Dynamic k determination in k-NN classifier: A literature review. 2021 12th International Conference on Information, Intelligence, Systems & Applications (IISA) (1-8). Online publication date: 12-Jul-2021.
https://doi.org/10.1109/IISA52424.2021.9555525
Patel H, Singh Rajput D, Thippa Reddy G, Iwendi C, Kashif Bashir A, Jo O (2020). A review on classification of imbalanced data for wireless sensor networks. International Journal of Distributed Sensor Networks, 16:4 (155014772091640). Online publication date: 14-Apr-2020.
https://doi.org/10.1177/1550147720916404
