research-article

Heterogeneous-Length Text Topic Modeling for Reader-Aware Multi-Document Summarization

Authors:

Xindong Wu

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 13, Issue 4

Article No.: 42, Pages 1 - 21

https://doi.org/10.1145/3333030

Published: 08 August 2019

Abstract

User comments such as tweets are increasingly available and often express readers' concerns. To meet users' needs, a good summary generated from multiple documents should take into account reader interests as reflected in reader comments. In this article, we focus on generating a summary from multiple documents while considering reader comments, a task named reader-aware multi-document summarization (RA-MDS). We present an innovative topic-based method for RA-MDS that exploits latent topics to obtain the most salient and least redundant summary from multiple documents. Since finding latent topics is a crucial step for RA-MDS, we also present Heterogeneous-length Text Topic Modeling (HTTM) to extract topics from a corpus that includes both news reports and user comments, denoted as heterogeneous-length texts. In this case, the latent topics extracted by HTTM cover not only important aspects of the event but also aspects that attract reader interest. Comparisons on summarization benchmark datasets confirm that the proposed RA-MDS method is effective in improving the quality of extracted summaries. In addition, experimental results demonstrate that the proposed topic modeling method outperforms existing topic modeling algorithms.
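The idea of using latent topics to pick salient but non-redundant sentences can be illustrated with a small sketch. This is not the article's method and it does not implement HTTM: it assumes the topics are already available as word-probability tables, and it swaps in a generic greedy, maximal-marginal-relevance-style selection. The names `salience`, `jaccard`, `select_summary`, the `weights` list (standing in for how much readers care about each topic), and the `trade_off` parameter are all hypothetical.

```python
from typing import Dict, List, Sequence

Topic = Dict[str, float]  # hypothetical format: word -> probability under one topic


def salience(tokens: Sequence[str], topics: List[Topic], weights: List[float]) -> float:
    """Weighted topic probability mass of a sentence, averaged over its tokens."""
    if not tokens:
        return 0.0
    total = 0.0
    for topic, weight in zip(topics, weights):
        total += weight * sum(topic.get(tok, 0.0) for tok in tokens)
    return total / len(tokens)


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Word-overlap similarity, used here as a simple redundancy measure."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_summary(
    sentences: List[List[str]],
    topics: List[Topic],
    weights: List[float],
    k: int = 2,
    trade_off: float = 0.7,
) -> List[int]:
    """Greedily pick up to k sentence indices, trading salience against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy is the highest overlap with any sentence already chosen.
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            sal = salience(sentences[i], topics, weights)
            return trade_off * sal - (1.0 - trade_off) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (assumed, not from the article): two near-duplicate news sentences,
# one sentence on a topic raised in reader comments, and one off-topic sentence.
sentences = [
    ["quake", "hits", "city"],
    ["quake", "hits", "city"],
    ["readers", "ask", "about", "aid"],
    ["weather", "is", "fine"],
]
topics = [
    {"quake": 0.5, "city": 0.3, "hits": 0.2},
    {"aid": 0.6, "readers": 0.4},
]
weights = [0.5, 0.5]
print(select_summary(sentences, topics, weights, k=2))
```

With the toy data, the duplicate of the first sentence is penalized for redundancy, so the second pick comes from the reader-comment topic instead. A real system would replace the hand-written topic tables with the output of a topic model fitted to both news reports and comments.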

Index Terms

Heterogeneous-Length Text Topic Modeling for Reader-Aware Multi-Document Summarization
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Document representation
  2. Information systems applications
    1. Data mining

Information & Contributors

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 13, Issue 4

August 2019

235 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3343141

Editors:
Charu Aggarwal
IBM T. J. Watson Research, USA
,
Xindong Wu
University of Louisiana at Lafayette, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 August 2019

Accepted: 01 May 2019

Revised: 01 May 2019

Received: 01 July 2018

Published in TKDD Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China
National Key Research and Development Program of China
National Natural Science Foundation of China
Natural Science Foundation of Jiangsu Province of China
Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 11
Total Downloads: 351
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Jan 2025

Citations

Cited By

Jin W, Zhao B, Zhang Y, Sun G, Yu H. (2023). Fintech Key-Phrase: A New Chinese Financial High-Tech Dataset Accelerating Expression-Level Information Retrieval. ACM Transactions on Asian and Low-Resource Language Information Processing 22:11 (1-37). Online publication date: 1-Nov-2023
https://dl.acm.org/doi/10.1145/3627989
Zhu Y, Wu X, Qiang J, Yuan Y, Li Y. (2023). Representation learning via an integrated autoencoder for unsupervised domain adaptation. Frontiers of Computer Science: Selected Publications from Chinese Universities 17:5. Online publication date: 5-Jan-2023
https://dl.acm.org/doi/10.1007/s11704-022-1349-5
Jin W, Zhao B, Liu C. (2023). Fintech Key-Phrase: A New Chinese Financial High-Tech Dataset Accelerating Expression-Level Information Retrieval. Database Systems for Advanced Applications (425-440). Online publication date: 17-Apr-2023
https://dl.acm.org/doi/10.1007/978-3-031-30675-4_31
Wei L, Li Y, Zhu Y, Li B, Zhang L. (2022). Prompt Tuning for Multi-Label Text Classification: How to Link Exercises to Knowledge Concepts? Applied Sciences 12:20 (10363). Online publication date: 14-Oct-2022
https://doi.org/10.3390/app122010363
Wang Y, Tong H, Zhu Z, Li Y. (2022). Nested Named Entity Recognition: A Survey. ACM Transactions on Knowledge Discovery from Data 16:6 (1-29). Online publication date: 15-Mar-2022
https://dl.acm.org/doi/10.1145/3522593
Osmani A, Mohasefi J. (2022). Opinion Mining Using Enriched Joint Sentiment-Topic Model. International Journal of Information Technology & Decision Making 22:01 (313-375). Online publication date: 28-Sep-2022
https://doi.org/10.1142/S0219622022500584
Algamdi S, Albanyan A, Shah S, Tariq Z. (2022). Twitter Accounts Suggestion: Pipeline Technique SpaCy Entity Recognition. 2022 IEEE International Conference on Big Data (Big Data) (5121-5125). Online publication date: 17-Dec-2022
https://doi.org/10.1109/BigData55660.2022.10020570
Widyassari A, Rustad S, Shidik G, Noersasongko E, Syukur A, Affandy A, Setiadi D. (2022). Review of automatic text summarization techniques & methods. Journal of King Saud University - Computer and Information Sciences 34:4 (1029-1046). Online publication date: Apr-2022
https://doi.org/10.1016/j.jksuci.2020.05.006
Eybers S, Kahts H. (2022). In Search of Insight from Unstructured Text Data: Towards an Identification of Text Mining Techniques. Digital Science (591-603). Online publication date: 17-Jan-2022
https://doi.org/10.1007/978-3-030-93677-8_52
Dai L, Wang H, Liu X. (2020). ST-ETM: A Spatial-Temporal Emergency Topic Model for Public Opinion Identifying in Social Networks. IEEE Access 8 (125659-125670). Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3001072