DOI: 10.1145/3446132.3446399
research-article

Grouping news events using semantic representations of hierarchical elements of articles and named entities

Published: 09 March 2021

Abstract

An enormous number of news articles is generated by different news agencies. The variation in journalistic content and the online availability of news make it difficult to monitor and interpret this content in real time. Organizing news articles would play a crucial role in their consumption and interpretation. Our work assists end users by grouping news articles by story. We present a novel approach to grouping news articles based on a multi-level embedding representation of articles, coupled with a standard TF-IDF score based on named entities. Our results show that combining the syntactic (TF-IDF) and the semantic (BERT) representations can boost performance on the news grouping task.
We also experiment with transfer learning and fine-tuning of state-of-the-art BERT models for the task of document similarity, and use the output embeddings as document representations.
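
The idea of combining a named-entity TF-IDF score with an embedding similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two signals are combined or which clustering algorithm is used. The sketch assumes a weighted sum of two cosine similarities (the weight `alpha` and the `threshold` are hypothetical parameters), assumes named entities have already been extracted, uses small hand-written vectors as stand-ins for BERT document embeddings, and groups articles by simple threshold linking.

```python
import math
from collections import Counter


def tfidf_vectors(entity_lists):
    """Build sparse TF-IDF vectors (dicts) over each article's named entities."""
    n = len(entity_lists)
    # Document frequency: number of articles mentioning each entity.
    df = Counter(e for ents in entity_lists for e in set(ents))
    vectors = []
    for ents in entity_lists:
        tf = Counter(ents)
        # Smoothed IDF; the exact weighting scheme is an assumption.
        vectors.append(
            {e: c * (math.log((1 + n) / (1 + df[e])) + 1.0) for e, c in tf.items()}
        )
    return vectors


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cosine_dense(a, b):
    """Cosine similarity between two dense vectors stored as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_articles(entity_lists, embeddings, alpha=0.5, threshold=0.6):
    """Return one group label per article.

    Two articles are linked when the weighted sum of their entity TF-IDF
    similarity and embedding similarity reaches `threshold`; linked articles
    end up in the same group (union-find over the links).
    """
    n = len(entity_lists)
    tfidf = tfidf_vectors(entity_lists)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            score = alpha * cosine_sparse(tfidf[i], tfidf[j]) + (
                1 - alpha
            ) * cosine_dense(embeddings[i], embeddings[j])
            if score >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(n)]


# Toy input: entity lists and made-up 3-d vectors standing in for BERT outputs.
entities = [
    ["NASA", "Mars", "Perseverance"],
    ["NASA", "Perseverance", "Jezero"],
    ["ECB", "Lagarde", "eurozone"],
    ["ECB", "eurozone", "inflation"],
]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
labels = group_articles(entities, embeddings)
print(labels)
```

In this toy input the first two articles share entities and have similar vectors, as do the last two, so they fall into two separate groups. In practice the embeddings would come from a BERT-style encoder and the weight and threshold would need tuning on labelled story groups.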


Cited By

  • (2022) Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods. Applied Sciences 12:21 (11220). DOI: 10.3390/app122111220. Online publication date: 5-Nov-2022


Information & Contributors

Information

Published In

ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence
December 2020
576 pages
ISBN:9781450388115
DOI:10.1145/3446132

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Document Clustering
  2. News Clustering
  3. News grouping
  4. Text Embedding
  5. Transfer learning
  6. Unsupervised Clustering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACAI 2020

Acceptance Rates

Overall Acceptance Rate 173 of 395 submissions, 44%
