DOI: 10.1145/1390334.1390368
research-article

Knowledge transformation from word space to document space

Published: 20 July 2008

Abstract

In most IR clustering problems, we cluster the documents directly, working in the document space and using cosine similarity between documents as the similarity measure. In many real-world applications, however, we have knowledge on the word side and wish to transform this knowledge to the document (concept) side. In this paper, we provide a mechanism for this knowledge transformation. To the best of our knowledge, this is the first model for this type of knowledge transformation. The model uses a nonnegative matrix factorization X = FSG^T, where X is the word-document semantic matrix, F is the posterior probability of a word belonging to a word cluster and represents knowledge in the word space, G is the posterior probability of a document belonging to a document cluster and represents knowledge in the document space, and S is a scaled matrix factor that provides a condensed view of X. We show how knowledge on words can improve document clustering, i.e., how knowledge in the word space is transformed into the document space. We perform extensive experiments to validate our approach.
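
To make the factorization X = FSG^T concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's algorithm: it uses generic multiplicative updates for the unconstrained nonnegative tri-factorization objective ||X - FSG^T||^2, and it does not model how prior word knowledge is injected. The function name `tri_factorize`, its parameters, and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def tri_factorize(X, k_words, k_docs, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (words x documents) as F @ S @ G.T.

    Illustrative sketch: plain multiplicative updates for the squared
    Frobenius error, with no orthogonality or prior-knowledge constraints.
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = X.shape
    # Random positive initialization keeps every factor nonnegative.
    F = rng.random((n_words, k_words)) + eps   # word-cluster indicator
    S = rng.random((k_words, k_docs)) + eps    # condensed view of X
    G = rng.random((n_docs, k_docs)) + eps     # document-cluster indicator
    errors = []
    for _ in range(n_iter):
        # Each update rescales a factor by (negative gradient part) / (positive part).
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T) ** 2)
    return F, S, G, errors

# Hypothetical toy word-document matrix: 6 words (rows), 4 documents (columns).
X = np.array([
    [3, 3, 0, 0],
    [2, 3, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 4, 3],
    [0, 0, 3, 4],
    [0, 0, 4, 4],
], dtype=float)

F, S, G, errors = tri_factorize(X, k_words=2, k_docs=2)

# Assign each document to the cluster with the largest entry in its row of G.
doc_labels = G.argmax(axis=1)
# Assign each word to the cluster with the largest entry in its row of F.
word_labels = F.argmax(axis=1)
print("document clusters:", doc_labels)
print("word clusters:", word_labels)
print("reconstruction error:", errors[-1])
```

Reading cluster membership off the largest entry in each row of F and G is a common convention for this kind of factorization. Treating the rows as posterior probabilities, as the abstract describes, would additionally require normalizing them.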

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN:9781605581644
DOI:10.1145/1390334
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. knowledge transformation

Qualifiers

  • Research-article

Conference

SIGIR '08
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0
Reflects downloads up to 11 Dec 2024.

Cited By

  • (2021) A constrained optimization approach for cross-domain emotion distribution learning. Knowledge-Based Systems 227:C. DOI: 10.1016/j.knosys.2021.107160. Online publication date: 5-Sep-2021
  • (2016) Topic Detection from Short Text: A Term-based Consensus Clustering method. 2016 13th International Conference on Service Systems and Service Management (ICSSSM), pages 1-6. DOI: 10.1109/ICSSSM.2016.7538624. Online publication date: Jun-2016
  • (2016) Multi-type Co-clustering of General Heterogeneous Information Networks via Nonnegative Matrix Tri-Factorization. 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1353-1358. DOI: 10.1109/ICDM.2016.0185. Online publication date: Dec-2016
  • (2016) Heterogeneous Information Networks Bi-clustering with Similarity Regularization. Proceedings of the 11th Pacific Asia Workshop on Intelligence and Security Informatics - Volume 9650, pages 19-30. DOI: 10.1007/978-3-319-31863-9_2. Online publication date: 19-Apr-2016
  • (2014) Triplex Transfer Learning: Exploiting Both Shared and Distinct Concepts for Text Classification. IEEE Transactions on Cybernetics 44:7, pages 1191-1203. DOI: 10.1109/TCYB.2013.2281451. Online publication date: Jul-2014
  • (2014) Semi-supervised Nonnegative Matrix Factorization for Microblog Clustering Based on Term Correlation. Web Technologies and Applications, pages 511-516. DOI: 10.1007/978-3-319-11116-2_46. Online publication date: 2014
  • (2013) Triplex transfer learning. Proceedings of the sixth ACM international conference on Web search and data mining, pages 425-434. DOI: 10.1145/2433396.2433449. Online publication date: 4-Feb-2013
  • (2013) Constrained Text Coclustering with Supervised and Unsupervised Constraints. IEEE Transactions on Knowledge and Data Engineering 25:6, pages 1227-1239. DOI: 10.1109/TKDE.2012.45. Online publication date: 1-Jun-2013
  • (2013) A nonnegative matrix factorization framework for semi-supervised document clustering with dual constraints. Knowledge and Information Systems 36:3, pages 629-651. DOI: 10.1007/s10115-012-0560-3. Online publication date: 1-Sep-2013
  • (2013) A Probabilistic Model Based on Uncertainty for Data Clustering. Agents and Data Mining Interaction, pages 126-138. DOI: 10.1007/978-3-642-36288-0_12. Online publication date: 2013