CStory: A Chinese Large-scale News Storyline Dataset

Cited by: 0

Authors:

Shi, Kaijie [1]

Wang, Xiaozhi [1]

Yu, Jifan [1]

Hou, Lei [2]

Li, Juanzi [2]

Wu, Jingtong [3]

Yong, Dingyu [4]

Xiao, Jinghui [5]

Liu, Qun [5]

Affiliations:

[1] Tsinghua Univ, Beijing, Peoples R China

[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China

[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China

[4] Huawei Device Co Ltd, Beijing, Peoples R China

[5] Huawei Noah's Ark Lab, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022 | 2022

Keywords:

Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;

DOI:

10.1145/3511808.3557573

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around these difficulties, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance and should be a focus of future work. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
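The abstract names the two filtering signals, entity co-occurrence and semantic similarity, but does not say how they are computed. The following is only a minimal sketch of that kind of candidate-pair filter, not the authors' method: the function names, the input layout (`entities`, `tokens`), both thresholds, and the use of bag-of-words cosine similarity as a stand-in for semantic similarity are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists.

    Assumption: a simple stand-in for whatever semantic similarity
    measure a real pipeline would use.
    """
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Return article-id pairs that pass both filters.

    `articles` maps an article id to a dict with hypothetical keys
    "entities" (a set of strings) and "tokens" (a list of strings).
    Both thresholds are illustrative values, not taken from the paper.
    """
    kept = []
    for id_a, id_b in combinations(sorted(articles), 2):
        art_a, art_b = articles[id_a], articles[id_b]
        # Filter 1: the two articles must mention at least N common entities.
        shared = art_a["entities"] & art_b["entities"]
        if len(shared) < min_shared_entities:
            continue
        # Filter 2: the two articles must be textually similar enough.
        if cosine_similarity(art_a["tokens"], art_b["tokens"]) < min_similarity:
            continue
        kept.append((id_a, id_b))
    return kept
```

The sketch still enumerates every unordered pair, which for 11,978 articles is 11,978 × 11,977 / 2 = 71,730,253 pairs; the point of such a filter is that only the pairs it keeps need to be passed on to human annotators.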


Pages: 4475-4479

Number of pages: 5
