Research article · DOI: 10.1145/3197026.3197043

WELDA: Enhancing Topic Models by Incorporating Local Word Context

Published: 23 May 2018

Abstract

The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models on the other hand utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and for solving word intrusion tasks.
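The core step the abstract describes — estimating a topic's distribution in the embedding space and occasionally exchanging a sampled topic word for a nearby word drawn from that distribution — can be sketched roughly as follows. This is a purely illustrative toy (2-D "embeddings", invented vocabulary, a diagonal Gaussian per topic, and a hypothetical mixing weight `lam`), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values hypothetical): a tiny vocabulary with 2-D "embeddings"
# forming two clusters, and an LDA-style assignment of word ids to two topics.
vocab = ["cat", "dog", "pet", "bank", "loan", "money"]
emb = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
                [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
topic_words = {0: [0, 1, 2], 1: [3, 4, 5]}

def fit_topic_gaussian(word_ids):
    """Estimate a diagonal Gaussian over the embeddings of a topic's words."""
    vecs = emb[word_ids]
    return vecs.mean(axis=0), vecs.std(axis=0) + 1e-6

def resample_word(topic, lam=0.5):
    """With probability lam, draw a point from the topic's Gaussian in
    embedding space and return its nearest-neighbour vocabulary word;
    otherwise keep a word sampled from the topic's current words."""
    word_ids = topic_words[topic]
    w = rng.choice(word_ids)
    if rng.random() < lam:
        mu, sigma = fit_topic_gaussian(word_ids)
        point = rng.normal(mu, sigma)          # sample in embedding space
        dists = np.linalg.norm(emb - point, axis=1)
        w = int(np.argmin(dists))              # snap back to the vocabulary
    return vocab[w]

print(resample_word(0))  # a word from the "animal" cluster
```

In the paper this exchange happens inside Gibbs sampling over a real corpus; the sketch only shows why pulling replacements from the embedding-space distribution keeps topic words semantically close while letting locally similar words enter the topic.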




Published In

JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
May 2018
453 pages
ISBN:9781450351782
DOI:10.1145/3197026

Publisher: Association for Computing Machinery, New York, NY, United States


Author Tags

  1. document representations
  2. topic models
  3. word embeddings


Acceptance Rates

JCDL '18 Paper Acceptance Rate 26 of 71 submissions, 37%;
Overall Acceptance Rate 415 of 1,482 submissions, 28%


Cited By

  • (2024) Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics. PeerJ Computer Science 10 (e1758). doi:10.7717/peerj-cs.1758, 3 Jan 2024
  • (2024) A Topic Modeling Based on Prompt Learning. Electronics 13(16), 3212. doi:10.3390/electronics13163212, 14 Aug 2024
  • (2024) Latent Dirichlet Allocation (LDA) topic models for Space Syntax studies on spatial experience. City, Territory and Architecture 11(1). doi:10.1186/s40410-023-00223-3, 9 Jan 2024
  • (2024) Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model. Multimedia Tools and Applications 83(30), 74205-74232. doi:10.1007/s11042-024-18359-w, 15 Feb 2024
  • (2023) A decision support system in precision medicine: contrastive multimodal learning for patient stratification. Annals of Operations Research. doi:10.1007/s10479-023-05545-6, 29 Aug 2023
  • (2022) Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts. Sensors 22(3), 852. doi:10.3390/s22030852, 23 Jan 2022
  • (2022) Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations. Proceedings of the ACM Web Conference 2022, 3143-3152. doi:10.1145/3485447.3512034, 25 Apr 2022
  • (2022) A Semantic Embedding Enhanced Topic Model For User-Generated Textual Content Modeling In Social Ecosystems. The Computer Journal 65(11), 2953-2968. doi:10.1093/comjnl/bxac091, 1 Oct 2022
  • (2021) Topic Modeling Using Latent Dirichlet Allocation. ACM Computing Surveys 54(7), 1-35. doi:10.1145/3462478, 17 Sep 2021
  • (2020) Enhancing Short Text Topic Modeling with FastText Embeddings. 2020 International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), 255-259. doi:10.1109/ICBAIE49996.2020.00060, Jun 2020
