CStory: A Chinese Large-scale News Storyline Dataset

Authors
Shi, Kaijie [1]
Wang, Xiaozhi [1]
Yu, Jifan [1]
Hou, Lei [2]
Li, Juanzi [2]
Wu, Jingtong [3]
Yong, Dingyu [4]
Xiao, Jinghui [5]
Liu, Qun [5]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China
[4] Huawei Device Co Ltd, Beijing, Peoples R China
[5] Huawei Noahs Ark Lab, Beijing, Peoples R China
Keywords
Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset
DOI
10.1145/3511808.3557573
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around these difficulties, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance and should be a focus of future work. Our dataset is publicly available at https://github.com/THU-KEG/CStory.
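The abstract names two filtering signals, entity co-occurrence and semantic similarity, without giving details here. The following is only a minimal sketch of how such a candidate-pair filter could look, not the authors' method: the function names, the thresholds, the input layout (`entities`, `tokens`), and the use of bag-of-words cosine similarity in place of a learned semantic model are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Return index pairs (i, j), i < j, that pass both filters.

    Hypothetical input layout: each article is a dict with
    'entities' (a set of entity strings) and 'tokens' (a list of words).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    vectors = [Counter(a["tokens"]) for a in articles]
    kept = []
    for i, j in combinations(range(len(articles)), 2):
        # Filter 1: the two articles must share enough named entities.
        shared = articles[i]["entities"] & articles[j]["entities"]
        if len(shared) < min_shared_entities:
            continue
        # Filter 2: the two articles must be similar enough in content.
        if cosine(vectors[i], vectors[j]) >= min_similarity:
            kept.append((i, j))
    return kept


articles = [
    {"entities": {"A", "B"}, "tokens": ["flood", "city", "rescue"]},
    {"entities": {"A"}, "tokens": ["flood", "city", "damage"]},
    {"entities": {"C"}, "tokens": ["match", "goal", "team"]},
]
pairs = candidate_pairs(articles)
```

In this toy input only the first two articles share an entity and overlap in wording, so they form the single surviving pair. For scale, pairing each of the 11,978 articles with every other one would give 11,978 × 11,977 / 2 = 71,730,253 unordered pairs, which illustrates why some filter is needed before manual annotation.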
Pages: 4475-4479
Page count: 5