DOI: 10.1145/2187836.2187936
research-article

Document hierarchies from text and links

Published: 16 April 2012

Abstract

Hierarchical taxonomies provide a multi-level view of large document collections, allowing users to rapidly drill down to fine-grained distinctions in topics of interest. We show that automatically induced taxonomies can be made more robust by combining text with relational links. The underlying mechanism is a Bayesian generative model in which a latent hierarchical structure explains the observed data, so that inference finds hierarchical groups of documents with similar word distributions and dense network connections. As a nonparametric Bayesian model, our approach does not require the branching factor at each non-terminal to be specified in advance; it finds the appropriate level of detail directly from the data. Unlike many prior latent space models of network structure, the complexity of our approach does not grow quadratically in the number of documents, enabling application to networks with more than ten thousand nodes. Experimental results on hypertext and citation network corpora demonstrate the advantages of our hierarchical, multimodal approach.
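
To illustrate how a nonparametric prior can leave the branching factor open, the following is a minimal sketch of a Chinese restaurant process draw at a single tree node. It is an assumption chosen for illustration only: the abstract does not state which prior the model uses, and the function name `crp_assignments` and the concentration parameter `alpha` are hypothetical. Each document joins an existing child with probability proportional to that child's size, or opens a new child with probability proportional to `alpha`, so the number of children is determined by the draw rather than fixed beforehand.

```python
import random
from collections import Counter


def crp_assignments(n_items, alpha, seed=0):
    """Assign n_items to groups with a Chinese restaurant process.

    Illustrative only: not the paper's model. `alpha` (> 0) is a
    hypothetical concentration parameter; larger values open new
    groups more readily.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items already in group k
    assignments = []   # assignments[i] = group index of item i
    for _ in range(n_items):
        # Existing group k has weight counts[k]; a new group has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new group (a new child of the node)
        else:
            counts[k] += 1     # join an existing group
        assignments.append(k)
    return assignments


if __name__ == "__main__":
    groups = crp_assignments(n_items=200, alpha=1.0)
    sizes = Counter(groups)
    print("number of groups:", len(sizes))
    print("largest groups:", sizes.most_common(3))
```

The number of groups is an outcome of the sampling, not an input. Applying the same draw recursively inside each group would give a tree with a data-dependent branching factor at every level.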

Information

Published In

WWW '12: Proceedings of the 21st international conference on World Wide Web
April 2012
1078 pages
ISBN: 9781450312295
DOI: 10.1145/2187836
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • Univ. de Lyon: Universite de Lyon

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bayesian generative models
  2. hierarchical clustering
  3. stochastic block models
  4. topic models

Qualifiers

  • Research-article

Conference

WWW 2012
Sponsor:
  • Univ. de Lyon
WWW 2012: 21st World Wide Web Conference 2012
April 16 - 20, 2012
Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

  • Downloads (last 12 months): 4
  • Downloads (last 6 weeks): 1
Reflects downloads up to 23 Dec 2024

Cited By

  • (2019) A review of stochastic block models and extensions for graph clustering. Applied Network Science, 4:1. DOI: 10.1007/s41109-019-0232-2. Online publication date: 23-Dec-2019
  • (2019) Modeling Community Structure and Topics in Dynamic Text Networks. Journal of Classification, 36:2 (322-349). DOI: 10.1007/s00357-018-9289-3. Online publication date: 1-Jul-2019
  • (2017) Joint label inference in networks. The Journal of Machine Learning Research, 18:1 (1941-1979). DOI: 10.5555/3122009.3153015. Online publication date: 1-Jan-2017
  • (2016) Topic-adjusted visibility metric for scientific articles. The Annals of Applied Statistics, 10:1. DOI: 10.1214/15-AOAS887. Online publication date: 1-Mar-2016
  • (2016) Discriminative Link Prediction using Local, Community, and Global Signals. IEEE Transactions on Knowledge and Data Engineering, 28:8 (2057-2070). DOI: 10.1109/TKDE.2016.2553665. Online publication date: 1-Aug-2016
  • (2016) SLR: A scalable latent role model for attribute completion and tie prediction in social networks. 2016 IEEE 32nd International Conference on Data Engineering (ICDE) (1062-1073). DOI: 10.1109/ICDE.2016.7498313. Online publication date: May-2016
  • (2016) Scalable models for computing hierarchies in information networks. Knowledge and Information Systems, 49:2 (687-717). DOI: 10.1007/s10115-016-0917-0. Online publication date: 1-Nov-2016
  • (2016) Discovering hierarchical topic evolution in time-stamped documents. Journal of the Association for Information Science and Technology, 67:4 (915-927). DOI: 10.1002/asi.23439. Online publication date: 1-Apr-2016
  • (2015) Generating reading orders over document collections. 2015 IEEE 31st International Conference on Data Engineering (507-518). DOI: 10.1109/ICDE.2015.7113310. Online publication date: Apr-2015
  • (2014) Modeling citation networks using Latent random offsets. Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (633-642). DOI: 10.5555/3020751.3020817. Online publication date: 23-Jul-2014