research-article

Topic Modeling for Short Texts with Auxiliary Word Embeddings

Authors:

Zongyang Ma

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

Pages 165 - 174

https://doi.org/10.1145/2911451.2911499

Published: 07 July 2016

Abstract

For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics is a critical and fundamental task. Conventional topic models rely largely on word co-occurrences to derive topics from a collection of documents. However, because each document is short, short texts are much sparser in terms of word co-occurrences. Data sparsity therefore becomes a bottleneck that keeps conventional topic models from achieving good results on short texts. On the other hand, when a human being interprets a piece of short text, the understanding is based not solely on its content words but also on background knowledge (e.g., semantically related words). Recent advances in word embedding offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Polya urn (GPU) model. In this way, background knowledge about word semantic relatedness learned from millions of external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, as measured by topic coherence. The learned topic representation also leads to the best accuracy in a text classification task, which is used as an indirect evaluation.
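
The promotion idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the cosine-similarity `threshold`, and the `promotion` weight are hypothetical choices made for illustration, and the sketch leaves out the rest of the DMM sampler. It shows only the generalized Polya urn step, in which assigning a word to a topic also adds a fractional count for its semantically related words under that topic.

```python
import numpy as np

def related_words(embeddings, threshold):
    """Map each word id to the ids of other words whose cosine
    similarity to it exceeds `threshold` (an assumed cutoff)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)  # a word is not its own neighbour
    return {w: np.flatnonzero(sim[w] > threshold).tolist()
            for w in range(len(embeddings))}

def gpu_add(topic_word_counts, topic, word, related, promotion):
    """Generalized Polya urn style update: the sampled word gets a full
    count, and each related word gets an extra `promotion` count
    (an assumed weight) under the same topic."""
    topic_word_counts[topic, word] += 1.0
    for r in related[word]:
        topic_word_counts[topic, r] += promotion

# Toy setup: 3 words with 2-d embeddings, 2 topics.
# Words 0 and 1 point in nearly the same direction; word 2 is unrelated.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
related = related_words(embeddings, threshold=0.8)

counts = np.zeros((2, 3))  # float counts: topics x vocabulary
gpu_add(counts, topic=0, word=0, related=related, promotion=0.3)
print(counts)
```

In a plain Polya urn only the sampled word's count would grow. The extra fractional counts are what let related words reinforce each other within a topic even when they rarely co-occur in the short texts themselves.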


Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        1. Topic modeling
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document topic models

Published In

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

July 2016

1296 pages

ISBN: 9781450340694

DOI: 10.1145/2911451

General Chairs:
Raffaele Perego, ISTI-CNR, Italy
Fabrizio Sebastiani, Qatar Computing Research Institute, HBKU, Qatar

Program Chairs:
Javed Aslam, Northeastern University, US
Ian Ruthven, University of Strathclyde, UK
Justin Zobel, University of Melbourne, Australia

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGIR '16: The 39th International ACM SIGIR conference on research and development in Information Retrieval

July 17 - 21, 2016

Pisa, Italy

Acceptance Rates

SIGIR '16 Paper Acceptance Rate: 62 of 341 submissions, 18%

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
