
CitationLDA++: an Extension of LDA for Discovering Topics in Document Network

Published: 06 December 2018

Abstract

Along with the rapid development of electronic repositories of scientific publications, automatic topic identification from papers has become a valuable aid to researchers. Latent Dirichlet Allocation (LDA) is the most popular method for discovering hidden topics in text, based on the co-occurrence of words in a corpus. LDA achieves good results on long documents. However, publication repositories usually store only the title and abstract, which are too short for LDA to work effectively. In this paper, we propose CitationLDA++, a model that improves the performance of LDA in inferring the topics of papers from the title and/or abstract together with citation information. The proposed model is based on the assumption that the topics of the cited papers also reflect the topics of the citing paper. In this study, we divide the dataset into two sets: the first is used to build a prior knowledge source with LDA, and the second is the training set for CitationLDA++. During inference with Gibbs sampling, CitationLDA++ uses the topic distributions of the prior knowledge source and the citation information to guide the assignment of topics to words in the text. Using the topics of cited papers helps overcome the scarcity of word co-occurrence in linked short texts. In experiments on the AMiner dataset, which includes titles and/or abstracts of papers and citation information, CitationLDA++ achieves better perplexity than LDA without additional knowledge. The experimental results suggest that citation information can improve the ability of LDA to discover the topics of papers when their full content is not available.
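The guided Gibbs step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the count arrays, and the way the citation prior multiplies into the standard collapsed-Gibbs sampling distribution are all assumptions; the paper's exact weighting of the prior knowledge source is not given here.

```python
import numpy as np

def sample_topic(w, d, n_dk, n_kw, n_k, cite_prior, alpha=0.1, beta=0.01):
    """One collapsed-Gibbs draw for word w in document d (sketch).

    n_dk[d, k]: topic counts in document d (current token excluded)
    n_kw[k, w]: counts of word w under topic k (current token excluded)
    n_k[k]:     total tokens assigned to topic k (current token excluded)
    cite_prior[d, k]: a topic distribution aggregated from the papers
        that document d cites (hypothetical name; in practice it would
        be smoothed so no topic has exactly zero mass).
    """
    V = n_kw.shape[1]
    # Standard LDA conditional:
    # P(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    # Citation guidance: bias the draw toward topics of cited papers.
    p *= cite_prior[d]
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```

With a degenerate prior that puts all mass on one topic, the draw is forced to that topic, which shows how the citation term steers assignments when word co-occurrence alone is too sparse.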


Cited By

  • (2024) Research Frontiers in the Field of Agricultural Resources and the Environment. Applied Sciences 14(12), 4996. DOI: 10.3390/app14124996. Online publication date: 7-Jun-2024
  • (2022) Knowledge Source Rankings for Semi-Supervised Topic Modeling. Information 13(2), 57. DOI: 10.3390/info13020057. Online publication date: 24-Jan-2022
  • (2022) Incremental Refinement of Relevance Rankings: Introducing a New Method Supported with Pennant Retrieval. Turk Kutuphaneciligi - Turkish Librarianship. DOI: 10.24146/tk.1062751. Online publication date: 10-Apr-2022


Published In

SoICT '18: Proceedings of the 9th International Symposium on Information and Communication Technology
December 2018
496 pages
ISBN:9781450365390
DOI:10.1145/3287921

In-Cooperation

  • SOICT: School of Information and Communication Technology - HUST
  • NAFOSTED: The National Foundation for Science and Technology Development

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. citation network analysis
  2. distributed computing
  3. document network
  4. text mining
  5. topic model

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • University of Information Technology VNU-HCM

Conference

SoICT 2018

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
