CStory: A Chinese Large-scale News Storyline Dataset

Cited by: 0

Authors:

Shi, Kaijie [1]

Wang, Xiaozhi [1]

Yu, Jifan [1]

Hou, Lei [2]

Li, Juanzi [2]

Wu, Jingtong [3]

Yong, Dingyu [4]

Xiao, Jinghui [5]

Liu, Qun [5]

Affiliations:

[1] Tsinghua Univ, Beijing, Peoples R China

[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China

[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China

[4] Huawei Device Co Ltd, Beijing, Peoples R China

[5] Huawei Noahs Ark Lab, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022 | 2022

Keywords:

Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;

DOI:

10.1145/3511808.3557573

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence, many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around these difficulties, we propose a sophisticated pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance, which should be a focus of future work. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
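To illustrate the kind of pair filtering the abstract describes, here is a minimal sketch. It is not the authors' implementation: the record gives no details. The sketch assumes that a pair must pass both checks, and it stands in bag-of-words cosine similarity for whatever semantic similarity measure the paper uses. The function names, field names, and thresholds are all hypothetical. For scale, 11,978 articles alone would form 11,978 × 11,977 / 2 = 71,730,253 unordered pairs, which is why pruning before annotation matters.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.2):
    """Keep only article pairs that share entities AND look textually similar.

    `articles` is a list of dicts with hypothetical keys:
    'id' (str), 'entities' (set of str), 'tokens' (list of str).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    vectors = {a["id"]: Counter(a["tokens"]) for a in articles}
    kept = []
    for a, b in combinations(articles, 2):  # all n*(n-1)/2 unordered pairs
        if len(a["entities"] & b["entities"]) < min_shared_entities:
            continue  # entity co-occurrence check failed
        if cosine(vectors[a["id"]], vectors[b["id"]]) < min_similarity:
            continue  # similarity check failed
        kept.append((a["id"], b["id"]))
    return kept


# Toy data, invented for this example only.
sample = [
    {"id": "n1", "entities": {"A", "B"}, "tokens": ["flood", "city", "rescue"]},
    {"id": "n2", "entities": {"A"}, "tokens": ["flood", "rescue", "team"]},
    {"id": "n3", "entities": {"C"}, "tokens": ["football", "match"]},
    {"id": "n4", "entities": {"A"}, "tokens": ["stock", "market"]},
]
pairs = candidate_pairs(sample)
```

Of the six possible pairs in the toy data, only one survives both checks, so annotators would label one pair instead of six. A real pipeline would likely replace the token counts with sentence embeddings and avoid enumerating every pair, for example by indexing articles by entity first.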


Pages: 4475-4479

Page count: 5
