CStory: A Chinese Large-scale News Storyline Dataset

Cited by: 0

Authors:

Shi, Kaijie [1]

Wang, Xiaozhi [1]

Yu, Jifan [1]

Hou, Lei [2]

Li, Juanzi [2]

Wu, Jingtong [3]

Yong, Dingyu [4]

Xiao, Jinghui [5]

Liu, Qun [5]

Affiliations:

[1] Tsinghua Univ, Beijing, Peoples R China

[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China

[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China

[4] Huawei Device Co Ltd, Beijing, Peoples R China

[5] Huawei Noahs Ark Lab, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022 | 2022

Keywords:

Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;

DOI:

10.1145/3511808.3557573

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets, since news storylines are expensive to annotate: they contain a myriad of unlabeled relationships that grow quadratically with the number of news events. To work around these difficulties, we propose a sophisticated pre-processing method to filter candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance and should be a focus of future work. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
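The abstract names two filtering signals, entity co-occurrence and semantic similarity, but gives no implementation detail. The sketch below is only an illustration of that general idea, not the authors' method: it assumes each article already comes with a set of extracted entities and a token list, substitutes bag-of-words cosine similarity for whatever semantic model the paper uses, and picks hypothetical thresholds (`min_shared_entities`, `min_similarity`). For scale, all unordered pairs among 11,978 articles would number 11,978 × 11,977 / 2, roughly 71.7 million, which is why some filter is needed before manual labeling.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Yield (id_a, id_b) pairs that pass both filters.

    `articles` maps an id to a dict with "entities" (a set of strings) and
    "tokens" (a list of strings). Both thresholds are hypothetical values
    chosen for illustration, not taken from the paper.
    """
    vectors = {key: Counter(value["tokens"]) for key, value in articles.items()}
    for a, b in combinations(sorted(articles), 2):
        # Filter 1: the two articles must mention enough common entities.
        shared = articles[a]["entities"] & articles[b]["entities"]
        if len(shared) < min_shared_entities:
            continue
        # Filter 2: their texts must be similar enough (bag-of-words stand-in).
        if cosine(vectors[a], vectors[b]) >= min_similarity:
            yield a, b


# Toy input, invented for this example.
articles = {
    "n1": {"entities": {"Huawei"}, "tokens": ["huawei", "launches", "phone"]},
    "n2": {"entities": {"Huawei"}, "tokens": ["huawei", "phone", "sales", "rise"]},
    "n3": {"entities": {"Tsinghua"}, "tokens": ["tsinghua", "opens", "lab"]},
}
pairs = list(candidate_pairs(articles))
```

On this toy input only the pair (`n1`, `n2`) survives, since `n3` shares no entity with the other two articles. Checking entity overlap first keeps the cheaper test ahead of the similarity computation.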


Pages: 4475-4479

Page count: 5
