CStory: A Chinese Large-scale News Storyline Dataset

Cited by: 0

Authors:

Shi, Kaijie [1]

Wang, Xiaozhi [1]

Yu, Jifan [1]

Hou, Lei [2]

Li, Juanzi [2]

Wu, Jingtong [3]

Yong, Dingyu [4]

Xiao, Jinghui [5]

Liu, Qun [5]

Affiliations:

[1] Tsinghua Univ, Beijing, Peoples R China

[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China

[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China

[4] Huawei Device Co Ltd, Beijing, Peoples R China

[5] Huawei Noahs Ark Lab, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022 | 2022

Keywords:

Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;

DOI:

10.1145/3511808.3557573

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around these difficulties, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance, which should be a focus of future work. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
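The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how entities are extracted or how semantic similarity is computed, so the function names, the thresholds, and the use of Jaccard token overlap as a stand-in for semantic similarity are all assumptions made for illustration. The sketch only shows how two cheap filters can shrink the quadratic set of article pairs before annotation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two token collections.

    Assumption: used here as a simple stand-in for semantic similarity;
    the abstract does not say which similarity measure the authors use.
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.2):
    """Return index pairs (i, j), i < j, that pass both filters.

    Both thresholds are hypothetical values chosen for the toy example.
    """
    kept = []
    # All unordered pairs: n * (n - 1) / 2 of them, hence quadratic growth.
    for i, j in combinations(range(len(articles)), 2):
        shared = set(articles[i]["entities"]) & set(articles[j]["entities"])
        if len(shared) < min_shared_entities:
            continue  # entity co-occurrence filter
        if jaccard(articles[i]["tokens"], articles[j]["tokens"]) < min_similarity:
            continue  # similarity filter
        kept.append((i, j))
    return kept


# Toy data (invented for illustration, not taken from CStory).
articles = [
    {"entities": {"A", "B"}, "tokens": ["flood", "city", "rescue"]},
    {"entities": {"B"}, "tokens": ["flood", "rescue", "aid"]},
    {"entities": {"C"}, "tokens": ["election", "vote"]},
]
total = len(articles) * (len(articles) - 1) // 2
pairs = candidate_pairs(articles)
print(f"{len(pairs)} of {total} pairs kept for annotation: {pairs}")
```

In this toy run only the first two articles share an entity and overlap in vocabulary, so one of the three possible pairs would be passed on to annotators.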


Pages: 4475-4479

Page count: 5

