DOI: 10.1145/3511808.3557573
short-paper
Open access

CStory: A Chinese Large-scale News Storyline Dataset

Published: 17 October 2022

Abstract

In today's massive news streams, storylines help readers discover related event pairs and understand how hot events evolve, so many efforts have been devoted to constructing news storylines automatically. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets. News storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around this difficulty, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset containing 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences supporting the annotation judgments. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance and should be a focus of future work. Our dataset is publicly available at https://github.com/THU-KEG/CStory.
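
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the overlap measure, the embedding model, or any thresholds, so the Jaccard overlap, the cosine similarity on precomputed vectors, the threshold values, the function names, and the toy articles below are all assumptions chosen for illustration. The helper total_pairs shows the quadratic growth the abstract mentions; applied to 11,978 articles it gives an upper bound on unordered pairs, not a figure reported in the paper.

```python
from itertools import combinations
from math import sqrt


def total_pairs(n: int) -> int:
    """Number of unordered article pairs, n * (n - 1) / 2 (quadratic in n)."""
    return n * (n - 1) // 2


def jaccard(a: set, b: set) -> float:
    """Entity overlap between two articles; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: list, v: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def candidate_pairs(articles, min_entity_overlap=0.2, min_similarity=0.5):
    """Keep only pairs that pass both filters (thresholds are hypothetical)."""
    kept = []
    for a, b in combinations(articles, 2):
        # Cheap check first: skip pairs that share too few entities.
        if jaccard(a["entities"], b["entities"]) < min_entity_overlap:
            continue
        # Then require the article embeddings to be semantically close.
        if cosine(a["embedding"], b["embedding"]) < min_similarity:
            continue
        kept.append((a["id"], b["id"]))
    return kept


# Toy data: entities and embeddings are invented for this example.
articles = [
    {"id": "a1", "entities": {"Beijing", "Olympics"}, "embedding": [0.9, 0.1, 0.0]},
    {"id": "a2", "entities": {"Beijing", "Olympics", "IOC"}, "embedding": [0.8, 0.2, 0.1]},
    {"id": "a3", "entities": {"Nasdaq"}, "embedding": [0.0, 0.1, 0.9]},
]

print(total_pairs(len(articles)))  # all possible pairs among the toy articles
print(candidate_pairs(articles))   # pairs that survive both filters
print(total_pairs(11978))          # upper bound for a corpus of CStory's size
```

Running the entity check before the similarity check is a design choice made here for cost reasons: set intersection is cheap, so most unrelated pairs are discarded before any vector arithmetic.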

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022
5274 pages
ISBN:9781450392365
DOI:10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. event evolution
  2. imbalanced dataset
  3. storyline datasets
  4. storyline relation
  5. topic detection and tracking

Qualifiers

  • Short-paper

Conference

CIKM '22

Acceptance Rates

CIKM '22 Paper Acceptance Rate 621 of 2,257 submissions, 28%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
