Heterogeneous-Length Text Topic Modeling for Reader-Aware Multi-Document Summarization

Published: 08 August 2019

Abstract

User comments such as tweets are increasingly available and often reflect user concerns. To meet the demands of users, a good summary generated from multiple documents should consider reader interests as reflected in reader comments. In this article, we focus on how to generate a summary from multiple documents by considering reader comments, a task named reader-aware multi-document summarization (RA-MDS). We present an innovative topic-based method for RA-MDS, which exploits latent topics to obtain the most salient and least redundant summary from multiple documents. Since finding latent topics for RA-MDS is a crucial step, we also present Heterogeneous-length Text Topic Modeling (HTTM) to extract topics from a corpus that includes both news reports and user comments, denoted as heterogeneous-length texts. In this case, the latent topics extracted by HTTM cover not only important aspects of the event but also aspects that attract reader interest. Comparisons on summarization benchmark datasets confirm that the proposed RA-MDS method is effective in improving the quality of extracted summaries. In addition, experimental results demonstrate that the proposed topic modeling method outperforms existing topic modeling algorithms.

Information

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 13, Issue 4
August 2019
235 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3343141

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 August 2019
Accepted: 01 May 2019
Revised: 01 May 2019
Received: 01 July 2018
Published in TKDD Volume 13, Issue 4

Author Tags

  1. LDA
  2. Topic modeling
  3. heterogeneous-length text
  4. multi-document summarization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China
  • National Key Research and Development Program of China
  • National Natural Science Foundation of China
  • Natural Science Foundation of Jiangsu Province of China
  • Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China
