Abstract
This paper presents a comparative study of representing units in Chinese text categorization. Several kinds of representing units were investigated: byte 3-grams, Chinese characters, Chinese words, and Chinese words with part-of-speech tags. Empirical evidence shows that when the training data is large enough, higher-level representations, or those with larger feature spaces, perform better than lower-level representations or those with smaller feature spaces, whereas when the training data is limited the conclusion may be reversed. In general, higher-level representations, or those with larger feature spaces, need more training data to reach their best performance. For a specific representation, however, the size of the training data and the categorization performance are not always positively correlated.
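To make the difference between these representing units concrete, the following minimal sketch extracts three of them from a short string. It is an illustration only, not the authors' implementation: the function names, the GB2312 encoding, and the use of whitespace-separated input to stand in for the output of a word segmenter are all assumptions. Part-of-speech tagging is omitted because it requires an external tagger.

```python
from collections import Counter


def byte_ngrams(text, n=3, encoding="gb2312"):
    """Count overlapping byte n-grams of the encoded text.

    The encoding is an assumption; the abstract does not name one.
    """
    data = text.encode(encoding)
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def char_units(text):
    """Count individual characters, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


def word_units(segmented_text):
    """Count words in text that is ALREADY segmented with spaces.

    Real Chinese text has no spaces, so a segmenter would be needed
    upstream; that step is assumed here, not implemented.
    """
    return Counter(segmented_text.split())


raw = "中文文本分类"            # "Chinese text categorization"
segmented = "中文 文本 分类"    # hypothetical segmenter output

byte_features = byte_ngrams(raw)
char_features = char_units(raw)
word_features = word_units(segmented)
```

In this sketch the six-character string occupies 12 bytes in GB2312, so it yields 10 overlapping byte 3-grams, 5 distinct characters, and 3 words. On a full corpus the relationship typically inverts for words: the word vocabulary grows far larger than the character inventory, which is the sense in which word-level representations have a larger feature space.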
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baoli, L., Yuzhong, C., Xiaojing, B., Shiwen, Y. (2003). Experimental Study on Representing Units in Chinese Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_67
DOI: https://doi.org/10.1007/3-540-36456-0_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00532-2
Online ISBN: 978-3-540-36456-6
eBook Packages: Springer Book Archive