Abstract
Text categorization (TC) has become one of the most researched fields in NLP. In this paper, we address TC with a two-step feature selection approach. First, we cluster the words that appear in the texts according to their distribution over categories. Then we extract concepts, which are DEF terms in HowNet, from these clusters. Extraction operates on word clusters rather than on single words. This method retains the generalization ability of concept-extraction-based TC while making full use of occurrences of new words that are not found in the concept thesaurus. We evaluate our feature selection method on the Sogou corpus for TC with an SVM classifier. Experimental results show that our method improves TC performance in all categories.
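To make the first step concrete, the following is a minimal Python sketch of clustering words by their distribution over categories. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not specify the clustering algorithm, so the sketch assumes a greedy agglomerative merge with a weighted Jensen-Shannon cost, in the style of information-bottleneck clustering. The function names (category_distributions, js_divergence, cluster_words, cluster_features) and the toy data are hypothetical. The HowNet concept extraction step is omitted because it depends on the thesaurus itself.

```python
import math
from collections import Counter, defaultdict


def category_distributions(docs, labels):
    """Estimate P(category | word) from tokenized documents and their labels."""
    categories = sorted(set(labels))
    counts = defaultdict(Counter)  # word -> occurrence counts per category
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    dists, totals = {}, {}
    for word, ctr in counts.items():
        total = sum(ctr.values())
        dists[word] = [ctr[c] / total for c in categories]
        totals[word] = total
    return dists, totals


def js_divergence(p, q, wp, wq):
    """Weighted Jensen-Shannon divergence between distributions p and q."""
    pi_p, pi_q = wp / (wp + wq), wq / (wp + wq)
    m = [pi_p * a + pi_q * b for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return pi_p * kl(p, m) + pi_q * kl(q, m)


def cluster_words(dists, totals, n_clusters):
    """Assumed scheme: repeatedly merge the pair of clusters with the lowest cost."""
    clusters = [([w], dists[w], totals[w]) for w in sorted(dists)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, p, wp = clusters[i]
                _, q, wq = clusters[j]
                # Merge cost: combined weight times the weighted JS divergence.
                cost = (wp + wq) * js_divergence(p, q, wp, wq)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        words_i, p, wp = clusters[i]
        words_j, q, wq = clusters[j]
        merged_dist = [(wp * a + wq * b) / (wp + wq) for a, b in zip(p, q)]
        merged = (words_i + words_j, merged_dist, wp + wq)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [words for words, _, _ in clusters]


def cluster_features(tokens, clusters):
    """Represent a document as counts of word-cluster occurrences."""
    index = {w: k for k, words in enumerate(clusters) for w in words}
    vec = [0] * len(clusters)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1
    return vec


# Toy data (hypothetical): two categories, six distinct words.
docs = [["goal", "match", "team"], ["team", "goal"],
        ["stock", "market"], ["market", "stock", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
dists, totals = category_distributions(docs, labels)
clusters = cluster_words(dists, totals, n_clusters=2)
features = cluster_features(["goal", "goal", "bank"], clusters)
```

In this sketch a word that is absent from a thesaurus still contributes to the feature vector through its cluster, which is the property the abstract attributes to cluster-level extraction. The resulting vectors could then be passed to any classifier, such as an SVM.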
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
He, Y., Jiang, M. (2007). Text Categorization Using Distributional Clustering and Concept Extraction. In: Huang, DS., Heutte, L., Loog, M. (eds) Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues. ICIC 2007. Lecture Notes in Computer Science, vol 4681. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74171-8_71
DOI: https://doi.org/10.1007/978-3-540-74171-8_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74170-1
Online ISBN: 978-3-540-74171-8
eBook Packages: Computer Science, Computer Science (R0)