Text Categorization Using Distributional Clustering and Concept Extraction

Yifan He & Minghu Jiang

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4681)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

Text categorization (TC) has become one of the most researched fields in NLP. In this paper, we address TC with a two-step feature selection approach. First, we cluster the words that appear in the texts according to their distribution over categories. Then we extract concepts, which are DEF terms in HowNet, from these clusters. Extraction operates on word clusters rather than on single words. This method maintains the generalization ability of concept-extraction-based TC while making full use of occurrences of new words that are not found in the concept thesaurus. We test our feature selection method on the Sogou corpus for TC with an SVM classifier. Our experimental results show that the method improves TC performance in all categories.
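
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering criterion or the extraction rule, so the sketch assumes a greedy agglomerative merge driven by Jensen-Shannon divergence between per-word category distributions, and a simple majority vote for assigning a concept to each cluster. The toy corpus, the function names, and `toy_thesaurus` (a stand-in for HowNet DEF terms) are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def class_distributions(docs, classes):
    """Estimate P(category | word) from (tokens, label) pairs with add-one smoothing."""
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for w in tokens:
            counts[w][label] += 1
    dists = {}
    for w, c in counts.items():
        total = sum(c.values()) + len(classes)
        dists[w] = [(c[k] + 1) / total for k in classes]
    return dists

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cluster_words(dists, k):
    """Step 1 (assumed criterion): merge the two closest clusters until k remain."""
    clusters = [([w], list(p)) for w, p in sorted(dists.items())]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: js_divergence(clusters[ab[0]][1], clusters[ab[1]][1]),
        )
        (wi, pi), (wj, pj) = clusters[i], clusters[j]
        ni, nj = len(wi), len(wj)
        # Merged distribution is the size-weighted average of the two clusters.
        merged = (wi + wj, [(ni * a + nj * b) / (ni + nj) for a, b in zip(pi, pj)])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return [sorted(words) for words, _ in clusters]

def label_clusters(clusters, thesaurus):
    """Step 2 (assumed rule): give every word in a cluster the cluster's majority concept."""
    labels = {}
    for idx, words in enumerate(clusters):
        votes = Counter(thesaurus[w] for w in words if w in thesaurus)
        concept = votes.most_common(1)[0][0] if votes else f"cluster_{idx}"
        for w in words:
            labels[w] = concept
    return labels

# Hypothetical toy data: "striker" and "fintech" are missing from the thesaurus.
docs = [
    (["goal", "match", "striker"], "sports"),
    (["match", "goal", "coach"], "sports"),
    (["stock", "bank", "fintech"], "finance"),
    (["bank", "stock", "market"], "finance"),
]
classes = ["sports", "finance"]
toy_thesaurus = {
    "goal": "sport", "match": "sport", "coach": "sport",
    "stock": "finance", "bank": "finance", "market": "finance",
}

dists = class_distributions(docs, classes)
clusters = cluster_words(dists, k=2)
labels = label_clusters(clusters, toy_thesaurus)
print(clusters)
print(labels["striker"], labels["fintech"])
```

In this toy run, the two out-of-thesaurus words inherit a concept from the cluster they fall into, which is the effect the abstract attributes to extracting concepts from clusters rather than from single words. The resulting concept labels would then serve as features for a classifier such as an SVM.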

Author information

Authors and Affiliations

Lab of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua University, Beijing, 100084, China
Yifan He & Minghu Jiang

Editor information

De-Shuang Huang, Laurent Heutte, Marco Loog

About this paper

Cite this paper

He, Y., Jiang, M. (2007). Text Categorization Using Distributional Clustering and Concept Extraction. In: Huang, DS., Heutte, L., Loog, M. (eds) Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues. ICIC 2007. Lecture Notes in Computer Science, vol 4681. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74171-8_71

DOI: https://doi.org/10.1007/978-3-540-74171-8_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74170-1
Online ISBN: 978-3-540-74171-8
eBook Packages: Computer Science, Computer Science (R0)
