DOI: 10.1145/2911451.2911499
research-article

Topic Modeling for Short Texts with Auxiliary Word Embeddings

Published: 07 July 2016

Abstract

For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is so short, short texts are much sparser in terms of word co-occurrences. Data sparsity therefore becomes a bottleneck that keeps conventional topic models from achieving good results on short texts. On the other hand, when a human being interprets a piece of short text, the understanding is based not solely on its content words but also on the reader's background knowledge (e.g., semantically related words). Recent advances in word embedding offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Polya urn (GPU) model. In this sense, the background knowledge about word semantic relatedness learned from millions of external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, as measured by topic coherence. The learned topic representation also leads to the best accuracy in a text classification task, which is used as an indirect evaluation.
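The promotion idea described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm or code: it only shows, under stated assumptions, how a generalized Polya urn style count update might add fractional counts for semantically related words when a document is assigned to a topic. The function names (`related_words`, `add_document`), the cosine-similarity threshold of 0.5, the promotion weight of 0.3, and the toy two-dimensional embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def related_words(embeddings, threshold=0.5):
    """Map each word index to the other words whose cosine similarity
    with it exceeds `threshold` (threshold value is an assumption)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)  # a word is not treated as its own neighbour
    return {w: np.flatnonzero(sim[w] > threshold).tolist()
            for w in range(len(embeddings))}


def add_document(topic_word, doc, topic, related, promotion=0.3):
    """Urn-style update: each word in the document adds one full count to
    the topic, and each related word receives a fractional `promotion`
    count (weight is an assumption, not a value from the paper)."""
    for w in doc:
        topic_word[topic, w] += 1.0
        for r in related[w]:
            topic_word[topic, r] += promotion


# Toy vocabulary of three words; words 0 and 1 point in nearly the same
# direction, word 2 is orthogonal to word 0.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
related = related_words(embeddings)

topic_word = np.zeros((2, 3))  # 2 topics x 3 words
add_document(topic_word, doc=[0], topic=0, related=related)
print(topic_word[0])  # full count for word 0, fractional count for word 1
```

In a full sampler the same update would be reversed (counts subtracted) before a document's topic is resampled; that step, and the topic sampling itself, are omitted here.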

Published In

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
July 2016
1296 pages
ISBN: 9781450340694
DOI: 10.1145/2911451
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. short texts
  2. topic model
  3. word embeddings

Qualifiers

  • Research-article

Conference

SIGIR '16

Acceptance Rates

SIGIR '16 Paper Acceptance Rate 62 of 341 submissions, 18%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
