Research article · DOI: 10.1145/3197026.3197043

WELDA: Enhancing Topic Models by Incorporating Local Word Context

Published: 23 May 2018

Abstract

The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models on the other hand utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and for solving word intrusion tasks.
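The core step the abstract describes — estimating a topic's distribution in the embedding space and occasionally exchanging a sampled topic word for a nearby word drawn from that distribution — can be sketched roughly as follows. This is a purely illustrative toy (2-D "embeddings", invented vocabulary, a diagonal Gaussian per topic, and a hypothetical mixing weight `lam`), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values hypothetical): a tiny vocabulary with 2-D "embeddings"
# forming two clusters, and an LDA-style assignment of word ids to two topics.
vocab = ["cat", "dog", "pet", "bank", "loan", "money"]
emb = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
                [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
topic_words = {0: [0, 1, 2], 1: [3, 4, 5]}

def fit_topic_gaussian(word_ids):
    """Estimate a diagonal Gaussian over the embeddings of a topic's words."""
    vecs = emb[word_ids]
    return vecs.mean(axis=0), vecs.std(axis=0) + 1e-6

def resample_word(topic, lam=0.5):
    """With probability lam, draw a point from the topic's Gaussian in
    embedding space and return its nearest-neighbour vocabulary word;
    otherwise keep a word sampled from the topic's current words."""
    word_ids = topic_words[topic]
    w = rng.choice(word_ids)
    if rng.random() < lam:
        mu, sigma = fit_topic_gaussian(word_ids)
        point = rng.normal(mu, sigma)          # sample in embedding space
        dists = np.linalg.norm(emb - point, axis=1)
        w = int(np.argmin(dists))              # snap back to the vocabulary
    return vocab[w]

print(resample_word(0))  # a word from the "animal" cluster
```

In the paper this exchange happens inside Gibbs sampling over a real corpus; the sketch only shows why pulling replacements from the embedding-space distribution keeps topic words semantically close while letting locally similar words enter the topic.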




Published In

JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
May 2018
453 pages
ISBN:9781450351782
DOI:10.1145/3197026

Publisher: Association for Computing Machinery, New York, NY, United States


Author Tags

  1. document representations
  2. topic models
  3. word embeddings


Acceptance Rates

JCDL '18 Paper Acceptance Rate 26 of 71 submissions, 37%;
Overall Acceptance Rate 415 of 1,482 submissions, 28%


Cited By

  • (2024) Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics. PeerJ Computer Science 10 (e1758). doi:10.7717/peerj-cs.1758, 3 Jan 2024
  • (2024) A Topic Modeling Based on Prompt Learning. Electronics 13(16), 3212. doi:10.3390/electronics13163212, 14 Aug 2024
  • (2024) Latent Dirichlet Allocation (LDA) topic models for Space Syntax studies on spatial experience. City, Territory and Architecture 11(1). doi:10.1186/s40410-023-00223-3, 9 Jan 2024
  • (2024) Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model. Multimedia Tools and Applications 83(30), 74205-74232. doi:10.1007/s11042-024-18359-w, 15 Feb 2024
  • (2023) A decision support system in precision medicine: contrastive multimodal learning for patient stratification. Annals of Operations Research. doi:10.1007/s10479-023-05545-6, 29 Aug 2023
  • (2022) Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts. Sensors 22(3), 852. doi:10.3390/s22030852, 23 Jan 2022
  • (2022) Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations. Proceedings of the ACM Web Conference 2022, 3143-3152. doi:10.1145/3485447.3512034, 25 Apr 2022
  • (2022) A Semantic Embedding Enhanced Topic Model For User-Generated Textual Content Modeling In Social Ecosystems. The Computer Journal 65(11), 2953-2968. doi:10.1093/comjnl/bxac091, 1 Oct 2022
  • (2021) Topic Modeling Using Latent Dirichlet Allocation. ACM Computing Surveys 54(7), 1-35. doi:10.1145/3462478, 17 Sep 2021
  • (2020) Enhancing Short Text Topic Modeling with FastText Embeddings. 2020 International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), 255-259. doi:10.1109/ICBAIE49996.2020.00060, Jun 2020
