Grouping news events using semantic representations of hierarchical elements of articles and named entities

Authors:

Abhishek Desai,

Prateek Nagwanshi

ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

Article No.: 72, Pages 1 - 6

https://doi.org/10.1145/3446132.3446399

Published: 09 March 2021

Abstract

An enormous number of news articles are generated by different news agencies. The variation in journalistic content and the online availability of news make it difficult to monitor and interpret this content in real time. Organizing news articles would therefore play a crucial role in their consumption and interpretation. Our work assists the end user by grouping news articles based on the story they cover. We present a novel approach to grouping news articles based on a multi-level embedding representation of articles, coupled with a standard TF-IDF score based on named entities. Our results show that combining the syntactic (TF-IDF) and semantic (BERT) representations can boost performance on the news grouping task.

We also experiment with transfer learning and fine-tuning of state-of-the-art BERT models for the task of document similarity, and use the output embeddings as document representations.
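The abstract does not state how the two signals are combined, so the following is only a minimal sketch of one plausible reading: a weighted sum of a named-entity TF-IDF cosine similarity and an embedding cosine similarity, followed by threshold-based single-link grouping. The function names, the weight `alpha`, the `threshold`, and the grouping rule are all assumptions made for illustration, not the authors' method. The embeddings are passed in as plain lists of floats; in practice they might come from a BERT-style model.

```python
import math
from collections import Counter


def tfidf_vectors(entity_lists):
    """Build sparse TF-IDF vectors (dicts) from per-article named-entity lists."""
    n_docs = len(entity_lists)
    doc_freq = Counter()
    for entities in entity_lists:
        doc_freq.update(set(entities))
    vectors = []
    for entities in entity_lists:
        counts = Counter(entities)
        total = sum(counts.values()) or 1
        vectors.append({
            # Smoothed IDF: an entity found in every article keeps a nonzero weight.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_dense(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(tfidf_a, tfidf_b, emb_a, emb_b, alpha=0.5):
    """Weighted sum of entity TF-IDF and embedding similarity (alpha is assumed)."""
    return (alpha * cosine_sparse(tfidf_a, tfidf_b)
            + (1.0 - alpha) * cosine_dense(emb_a, emb_b))


def group_articles(entity_lists, embeddings, alpha=0.5, threshold=0.5):
    """Assign a group label to each article via single-link thresholding."""
    vectors = tfidf_vectors(entity_lists)
    n = len(entity_lists)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            score = combined_similarity(
                vectors[i], vectors[j], embeddings[i], embeddings[j], alpha
            )
            if score >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Toy input: entity lists and 3-dimensional stand-in embeddings (hypothetical data).
toy_entities = [
    ["Apple", "Tim Cook", "iPhone"],
    ["Apple", "iPhone", "Cupertino"],
    ["NASA", "Mars", "Perseverance"],
]
toy_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.95]]
toy_labels = group_articles(toy_entities, toy_embeddings)
```

With this toy input the first two articles share entities and have nearby embeddings, so they receive the same label, while the third stays in its own group. A real system would likely tune `alpha` and `threshold` on labeled story clusters and might replace the single-link rule with a standard clustering algorithm.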

Published In

ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

December 2020

576 pages

ISBN:9781450388115

DOI:10.1145/3446132

Copyright © 2020 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ACAI 2020

ACAI 2020: 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

December 24 - 26, 2020

Sanya, China

Acceptance Rates

Overall Acceptance Rate 173 of 395 submissions, 44%
