Abstract
Document datasets can be described with a bipartite graph in which terms and documents are modeled as vertices on the two sides. Partitioning such a graph yields a co-clustering of words and documents, in the hope that the topic of each cluster can be captured by the top terms and documents in that cluster. However, single terms alone are often not enough to capture the semantics of documents. To address this, we propose employing hyperclique patterns of terms as additional features for document representation. We then use the F-score to select the most discriminative features for constructing the bipartite graph. Extensive experiments indicate that, compared with the standard bipartite formulation, our approach achieves better clustering performance with a smaller graph.
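To make the idea of hyperclique patterns as extra features concrete, the following is a minimal sketch and not the paper's implementation. It assumes the usual h-confidence definition from the hyperclique literature (the support of a term set divided by the largest single-term support within it), enumerates only term pairs by brute force, and uses a toy corpus. The function names, the `min_hconf` threshold, and the `"+"`-joined feature naming are all hypothetical; the F-score selection and graph partitioning steps are not shown.

```python
from itertools import combinations


def support(itemset, docs):
    """Fraction of documents that contain every term in itemset."""
    return sum(1 for d in docs if itemset <= d) / len(docs)


def h_confidence(itemset, docs):
    """Support of the itemset divided by the largest single-term support."""
    max_single = max(support(frozenset([t]), docs) for t in itemset)
    return support(itemset, docs) / max_single if max_single > 0 else 0.0


def hyperclique_pairs(docs, min_hconf):
    """Brute-force search over term pairs (illustrative only, not scalable)."""
    vocab = sorted(set().union(*docs))
    return [
        frozenset(pair)
        for pair in combinations(vocab, 2)
        if h_confidence(frozenset(pair), docs) >= min_hconf
    ]


def augment(doc, patterns):
    """Add one joined feature for each pattern fully contained in the document."""
    return set(doc) | {"+".join(sorted(p)) for p in patterns if p <= doc}


# Toy corpus: each document is a set of terms (assumed data).
docs = [
    frozenset({"data", "mining", "graph"}),
    frozenset({"data", "mining", "text"}),
    frozenset({"graph", "cut"}),
    frozenset({"data", "mining"}),
]

patterns = hyperclique_pairs(docs, min_hconf=0.9)
features = [augment(d, patterns) for d in docs]
```

In this toy corpus only the pair of "data" and "mining" clears the threshold, so documents containing both terms gain one extra feature vertex on the term side of the bipartite graph, while the remaining document is unchanged.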
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Qu, C., Li, Y., Zhang, J., Hu, T., Chen, Q. (2008). Selecting the Right Features for Bipartite-Based Text Clustering. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science(), vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_46
DOI: https://doi.org/10.1007/978-3-540-88192-6_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science (R0)