Abstract
This paper proposes a method for embedding prior knowledge about topics of interest into Latent Dirichlet Allocation (LDA). Conventional LDA sometimes fails to detect specific topics of interest, so our approach uses word2vec to acquire linkages between words related to those topics. The extracted linkages serve as prior knowledge about the topics in the subsequent LDA process, and they can also be used to annotate words consistently; such consistent annotation cannot be achieved with conventional LDA, which relies on bag-of-words-based clustering. We evaluate our approach on travelers' reviews, detecting topics related to Japanese shrines. The experimental results show that our approach is effective in three respects: (1) its average coherence, i.e., the semantic consistency among topic words, exceeds that of conventional LDA; (2) the words in each sentence receive annotations that consistently reflect the topic of the sentence, whereas conventional LDA sometimes assigns confusing, mixed annotations to the words of a single sentence; and (3) it can detect very specific topics that match users' interests.
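To make the pipeline concrete, the following is a minimal sketch of the general idea rather than the authors' exact formulation: word2vec expands a small set of seed words for a topic of interest into a set of linked words, and those linked words are then given extra mass in the topic-word prior (eta) of a gensim LDA model. The toy corpus, seed words, boost value, and the use of gensim are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch, not the authors' exact method: expand seed words with
# word2vec similarity, then bias LDA's topic-word prior toward the linked words.
# The toy corpus, seed words, and boost value are illustrative assumptions.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel, Word2Vec

# Toy corpus of tokenized review sentences (placeholder data).
sentences = [
    ["shrine", "torii", "gate", "beautiful"],
    ["shrine", "priest", "ritual", "ceremony"],
    ["street", "food", "stall", "snack"],
    ["garden", "pond", "bridge", "walk"],
]

# 1. Learn word vectors and expand the seed words for the topic of interest
#    via embedding similarity (the "linkages").
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)
seed_words = ["shrine"]
linked = set(seed_words)
for word in seed_words:
    linked.update(w for w, _ in w2v.wv.most_similar(word, topn=3))

# 2. Encode the linkages as prior knowledge: boost the topic-word prior (eta)
#    for the linked words in one designated topic.
dictionary = Dictionary(sentences)
corpus = [dictionary.doc2bow(s) for s in sentences]
num_topics = 3
eta = np.full((num_topics, len(dictionary)), 0.01)
for word in linked:
    eta[0, dictionary.token2id[word]] = 1.0  # topic 0 is reserved for the topic of interest

# 3. Train LDA with the informed prior and inspect the seeded topic.
lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
               eta=eta, random_state=0, passes=10)
print(lda.show_topic(0))

# 4. Topic coherence (u_mass, chosen purely for convenience on this toy corpus).
print(CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                     coherence="u_mass").get_coherence())
```

The last two lines illustrate the kind of coherence measurement referred to in result (1); the specific coherence variant and all hyperparameters here are placeholder choices for the sketch.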
Cite this paper
Uehara, H., Ito, A., Saito, Y., Yoshida, K. (2019). Prior-Knowledge-Embedded LDA with Word2vec – for Detecting Specific Topics in Documents. In: Ohara, K., Bai, Q. (eds) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2019. Lecture Notes in Computer Science, vol. 11669. Springer, Cham. https://doi.org/10.1007/978-3-030-30639-7_10