
CitationLDA++: an Extension of LDA for Discovering Topics in Document Network

Published: 06 December 2018

Abstract

Along with the rapid development of electronic repositories of scientific publications, automatic topic identification from papers has become a valuable aid to researchers. Latent Dirichlet Allocation (LDA) is the most popular method for discovering hidden topics in text, based on the co-occurrence of words in a corpus. LDA achieves good results on long documents. However, publication repositories usually store only the title and abstract, which are too short for LDA to work effectively. In this paper, we propose CitationLDA++, a model that improves the performance of LDA in inferring the topics of papers from the title and/or abstract together with citation information. The proposed model is based on the assumption that the topics of the cited papers also reflect the topics of the citing paper. In this study, we divide the dataset into two sets: the first is used to build a prior knowledge source with LDA, and the second is the training set for CitationLDA++. During inference with Gibbs sampling, CitationLDA++ uses the topic distributions of the prior knowledge source and the citation information to guide the assignment of topics to words in the text. Using the topics of cited papers helps overcome the scarcity of word co-occurrence in linked short texts. In experiments on the AMiner dataset, which includes titles and/or abstracts of papers and citation information, CitationLDA++ achieves better perplexity than LDA without additional knowledge. The experimental results suggest that citation information can improve the ability of LDA to discover the topics of papers when their full content is not available.
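The guided Gibbs step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the count arrays, and the way the citation prior multiplies into the standard collapsed-Gibbs sampling distribution are all assumptions; the paper's exact weighting of the prior knowledge source is not given here.

```python
import numpy as np

def sample_topic(w, d, n_dk, n_kw, n_k, cite_prior, alpha=0.1, beta=0.01):
    """One collapsed-Gibbs draw for word w in document d (sketch).

    n_dk[d, k]: topic counts in document d (current token excluded)
    n_kw[k, w]: counts of word w under topic k (current token excluded)
    n_k[k]:     total tokens assigned to topic k (current token excluded)
    cite_prior[d, k]: a topic distribution aggregated from the papers
        that document d cites (hypothetical name; in practice it would
        be smoothed so no topic has exactly zero mass).
    """
    V = n_kw.shape[1]
    # Standard LDA conditional:
    # P(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    # Citation guidance: bias the draw toward topics of cited papers.
    p *= cite_prior[d]
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```

With a degenerate prior that puts all mass on one topic, the draw is forced to that topic, which shows how the citation term steers assignments when word co-occurrence alone is too sparse.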


Cited By

  • (2024) Research Frontiers in the Field of Agricultural Resources and the Environment. Applied Sciences 14(12), 4996. DOI: 10.3390/app14124996. Online publication date: 7-Jun-2024
  • (2022) Knowledge Source Rankings for Semi-Supervised Topic Modeling. Information 13(2), 57. DOI: 10.3390/info13020057. Online publication date: 24-Jan-2022
  • (2022) Incremental Refinement of Relevance Rankings: Introducing a New Method Supported with Pennant Retrieval. Turk Kutuphaneciligi - Turkish Librarianship. DOI: 10.24146/tk.1062751. Online publication date: 10-Apr-2022


Published In

SoICT '18: Proceedings of the 9th International Symposium on Information and Communication Technology
December 2018
496 pages
ISBN:9781450365390
DOI:10.1145/3287921

In-Cooperation

  • SOICT: School of Information and Communication Technology - HUST
  • NAFOSTED: The National Foundation for Science and Technology Development

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. citation network analysis
  2. distributed computing
  3. document network
  4. text mining
  5. topic model

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • University of Information Technology VNU-HCM

Conference

SoICT 2018

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
