Experimental Study on Representing Units in Chinese Text Categorization

Li Baoli, Chen Yuzhong, Bai Xiaojing & Yu Shiwen

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2588)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents a comparative study of representing units in Chinese text categorization. Several kinds of representing units were investigated, including byte 3-grams, Chinese characters, Chinese words, and Chinese words with part-of-speech tags. Empirical evidence shows that when the training data is large enough, higher-level representations, or those with larger feature spaces, perform better than lower-level representations or those with smaller feature spaces, whereas with limited training data the conclusion may be reversed. In general, higher-level representations, or those with larger feature spaces, need more training data to reach their best performance. For any specific representation, however, the size of the training data and the categorization performance are not always positively correlated.
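
To make the four kinds of representing units concrete, the following is a minimal sketch of how each might be turned into a bag of features. It is not the authors' implementation: the function names are hypothetical, the GB2312 encoding for the byte 3-grams is an assumption (the abstract does not name an encoding), and word segmentation and part-of-speech tagging are assumed to be done by some external tool whose output is passed in.

```python
from collections import Counter

def byte_ngrams(text, n=3, encoding="gb2312"):
    """Count overlapping byte n-grams (encoding is an assumption)."""
    data = text.encode(encoding)
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def char_unigrams(text):
    """Count individual characters, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())

def word_features(tokens):
    """Count words; `tokens` comes from an external segmenter."""
    return Counter(tokens)

def word_pos_features(tagged_tokens):
    """Count word/tag pairs; `tagged_tokens` is a list of (word, tag)."""
    return Counter(f"{word}/{tag}" for word, tag in tagged_tokens)
```

Moving down this list, the feature space generally grows: a word with its tag is a finer-grained unit than the word alone, and words are far more numerous than characters. This is the dimension along which the abstract compares the need for training data.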

Author information

Authors and Affiliations

Institute of Computational Linguistics, Department of Computer Science and Technology, Peking University, 100871, Beijing, P.R. China
Li Baoli, Chen Yuzhong, Bai Xiaojing & Yu Shiwen

Editor information

Editors and Affiliations

Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Col. Zacatenco, CP 07738, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Baoli, L., Yuzhong, C., Xiaojing, B., Shiwen, Y. (2003). Experimental Study on Representing Units in Chinese Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_67

DOI: https://doi.org/10.1007/3-540-36456-0_67
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00532-2
Online ISBN: 978-3-540-36456-6
eBook Packages: Springer Book Archive
