DOI: 10.1145/1963405.1963445
research-article

Unified analysis of streaming news

Published: 28 March 2011

Abstract

News clustering, categorization and analysis are key components of any news portal. They require algorithms that can handle dynamic data in order to cluster, interpret and temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm based on sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate its efficiency and accuracy on the publicly available TDT dataset and on data from a major internet news site.
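
As a rough illustration of what sequential Monte Carlo inference over a document stream can look like, the sketch below runs a small particle filter that assigns each incoming document to a cluster. It is not the paper's hybrid clustering and topic model: it assumes a plain Chinese restaurant process prior over clusters and a Dirichlet-multinomial word model, and the names `smc_step`, `log_predictive`, `new_particle` and the constants `ALPHA`, `BETA`, `VOCAB` are hypothetical choices made for this example only.

```python
import copy
import math
import random
from collections import Counter

ALPHA = 1.0    # CRP concentration parameter (assumed value)
BETA = 0.1     # symmetric Dirichlet smoothing over words (assumed value)
VOCAB = 1000   # assumed vocabulary size


def log_predictive(doc, counts, total):
    """Log probability of a word list under a Dirichlet-multinomial cluster."""
    logp, seen = 0.0, Counter()
    for i, word in enumerate(doc):
        logp += math.log((counts[word] + seen[word] + BETA)
                         / (total + i + BETA * VOCAB))
        seen[word] += 1
    return logp


def new_particle():
    """One particle holds per-cluster sufficient statistics only."""
    return {"sizes": [], "words": [], "totals": []}


def smc_step(particles, log_weights, doc):
    """Assign one document in every particle, reweight, and maybe resample."""
    for idx, p in enumerate(particles):
        n = sum(p["sizes"])
        # Prior times likelihood for each existing cluster ...
        scores = [math.log(size / (n + ALPHA))
                  + log_predictive(doc, p["words"][k], p["totals"][k])
                  for k, size in enumerate(p["sizes"])]
        # ... plus the option of opening a new cluster.
        scores.append(math.log(ALPHA / (n + ALPHA))
                      + log_predictive(doc, Counter(), 0))
        top = max(scores)
        probs = [math.exp(s - top) for s in scores]
        k = random.choices(range(len(probs)), weights=probs)[0]
        if k == len(p["sizes"]):
            p["sizes"].append(0)
            p["words"].append(Counter())
            p["totals"].append(0)
        p["sizes"][k] += 1
        p["words"][k].update(doc)
        p["totals"][k] += len(doc)
        # Incremental weight: predictive probability of the document.
        log_weights[idx] += top + math.log(sum(probs))
    # Normalise the weights in log space.
    norm = max(log_weights)
    norm += math.log(sum(math.exp(w - norm) for w in log_weights))
    log_weights = [w - norm for w in log_weights]
    # Resample when the effective sample size drops below half.
    ess = 1.0 / sum(math.exp(2 * w) for w in log_weights)
    if ess < len(particles) / 2:
        picks = random.choices(particles,
                               weights=[math.exp(w) for w in log_weights],
                               k=len(particles))
        particles = [copy.deepcopy(p) for p in picks]
        log_weights = [-math.log(len(particles))] * len(particles)
    return particles, log_weights


random.seed(0)
docs = [["election", "vote"], ["vote", "senate"],
        ["goal", "match"], ["match", "league"]]
particles = [new_particle() for _ in range(20)]
log_weights = [-math.log(20)] * 20
for doc in docs:
    particles, log_weights = smc_step(particles, log_weights, doc)
best = particles[max(range(20), key=lambda i: log_weights[i])]
print(best["sizes"])  # cluster sizes in the highest-weight particle
```

Each particle keeps only per-cluster counts rather than the full document history, which is one simple way to keep the per-document cost independent of how many documents have already been seen. The number of clusters can still grow in this sketch.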

Published In

WWW '11: Proceedings of the 20th international conference on World wide web
March 2011
840 pages
ISBN: 9781450306324
DOI: 10.1145/1963405

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Dirichlet processes
  2. online inference
  3. topic models

Conference

WWW '11: 20th International World Wide Web Conference
March 28 - April 1, 2011
Hyderabad, India

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 3
  • Downloads (last 6 weeks): 0
Reflects downloads up to 25 Feb 2025

Cited By

  • (2024) A Method for Parameter Updating in Distributed Deep Learning. 2024 Sixth International Conference on Next Generation Data-driven Networks (NGDN), pages 344-348. DOI: 10.1109/NGDN61651.2024.10744125. Online publication date: 26-Apr-2024.
  • (2024) CoRBS: a dynamic storytelling algorithm using a novel contextualization approach for documents utilizing BERT features. Knowledge and Information Systems, 67:2, pages 1213-1248. DOI: 10.1007/s10115-024-02263-8. Online publication date: 14-Oct-2024.
  • (2023) Multilingual News Feed Analysis using Intelligent Linguistic Particle Filtering Techniques. ACM Transactions on Asian and Low-Resource Language Information Processing, 22:3, pages 1-19. DOI: 10.1145/3569899. Online publication date: 10-Mar-2023.
  • (2021) Unsupervised latent event representation learning and storyline extraction from news articles based on neural networks. Intelligent Data Analysis, 25:3, pages 589-603. DOI: 10.3233/IDA-195061. Online publication date: 20-Apr-2021.
  • (2021) Bringing order to episodes: Mining timeline in social media. Neurocomputing, 450, pages 80-90. DOI: 10.1016/j.neucom.2021.04.020. Online publication date: Aug-2021.
  • (2020) Storyline extraction from news articles with dynamic dependency. Intelligent Data Analysis, 24:1, pages 183-197. DOI: 10.3233/IDA-184448. Online publication date: 18-Feb-2020.
  • (2019) Automated Monitoring and Forecasting of the Development of Educational Technologies. Handbook of Research on Engineering Education in a Global Context, pages 311-330. DOI: 10.4018/978-1-5225-3395-5.ch027. Online publication date: 2019.
  • (2019) Accounting for Temporal Dynamics in Document Streams. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1813-1822. DOI: 10.1145/3357384.3358022. Online publication date: 3-Nov-2019.
  • (2018) Exploring Entity-centric Networks in Entangled News Streams. Companion Proceedings of the The Web Conference 2018, pages 555-563. DOI: 10.1145/3184558.3188726. Online publication date: 23-Apr-2018.
  • (2018) Topic Chronicle Forest for Topic Discovery and Tracking. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 315-323. DOI: 10.1145/3159652.3159653. Online publication date: 2-Feb-2018.
