CStory: A Chinese Large-scale News Storyline Dataset

Cited by: 0
Authors
Shi, Kaijie [1 ]
Wang, Xiaozhi [1 ]
Yu, Jifan [1 ]
Hou, Lei [2 ]
Li, Juanzi [2 ]
Wu, Jingtong [3 ]
Yong, Dingyu [4 ]
Xiao, Jinghui [5 ]
Liu, Qun [5 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China
[4] Huawei Device Co Ltd, Beijing, Peoples R China
[5] Huawei Noahs Ark Lab, Beijing, Peoples R China
Keywords
Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;
DOI
10.1145/3511808.3557573
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets since news storylines are expensive to annotate as they contain a myriad of unlabeled relationships growing quadratically with the number of news events. Working around these difficulties, we propose a sophisticated pre-processing method to filter candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that the sample unbalance issue significantly influences model performance, which shall be the focus of future works. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
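The candidate-pair filter mentioned in the abstract could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' pre-processing method: the function names (`cosine`, `candidate_pairs`), the input fields (`id`, `entities`, `tokens`), and both thresholds are hypothetical, and bag-of-words cosine similarity stands in for whatever semantic similarity measure the paper actually uses.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Keep article pairs that share entities and are similar enough.

    `articles` is a list of dicts with hypothetical keys:
    'id', 'entities' (a set of strings), and 'tokens' (a list of strings).
    Both thresholds are illustrative defaults, not values from the paper.
    """
    vectors = {a["id"]: Counter(a["tokens"]) for a in articles}
    kept = []
    for a, b in combinations(articles, 2):
        # Step 1: entity co-occurrence, a cheap test applied first.
        if len(a["entities"] & b["entities"]) < min_shared_entities:
            continue
        # Step 2: similarity check (bag-of-words stand-in for semantic similarity).
        if cosine(vectors[a["id"]], vectors[b["id"]]) >= min_similarity:
            kept.append((a["id"], b["id"]))
    return kept
```

For a sense of scale, if every unordered pair among 11,978 articles were a candidate, `math.comb(11978, 2)` gives 71,730,253 pairs, which illustrates why some filtering before manual annotation is needed.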
Pages: 4475-4479
Page count: 5
Related papers
50 in total
  • [41] CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset
    Zhu, Qi
    Huang, Kaili
    Zhang, Zheng
    Zhu, Xiaoyan
    Huang, Minlie
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2020, 8 (08) : 281 - 295
  • [42] CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
    Zhang, Hanchong
    Li, Jieyu
    Chen, Lu
    Cao, Ruisheng
    Zhang, Yunyan
    Huang, Yu
    Zheng, Yefeng
    Yu, Kai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 6970 - 6983
  • [43] AgCNER, the First Large-Scale Chinese Named Entity Recognition Dataset for Agricultural Diseases and Pests
    Yao, Xiaochuang
    Hao, Xia
    Liu, Ruilin
    Li, Lin
    Guo, Xuchao
    SCIENTIFIC DATA, 2024, 11 (01)
  • [44] A large-scale audit of dataset licensing and attribution in AI
    Longpre, Shayne
    Mahari, Robert
    Chen, Anthony
    Obeng-Marnu, Naana
    Sileo, Damien
    Brannon, William
    Muennighoff, Niklas
    Khazam, Nathan
    Kabbara, Jad
    Perisetla, Kartik
    Wu, Xinyi
    Shippole, Enrico
    Bollacker, Kurt
    Wu, Tongshuang
    Villa, Luis
    Pentland, Sandy
    Hooker, Sara
    NATURE MACHINE INTELLIGENCE, 2024, 6 (08) : 975 - 987
  • [45] A large-scale dataset of in vivo pharmacology assay results
    Fiona M. I. Hunter
    Francis L. Atkinson
    A. Patrícia Bento
    Nicolas Bosc
    Anna Gaulton
    Anne Hersey
    Andrew R. Leach
    Scientific Data, 5
  • [46] Fraud Detection Using Large-scale Imbalance Dataset
    Rubaidi, Zainab Saad
    Ben Ammar, Boulbaba
    Ben Aouicha, Mohamed
    INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2022, 31 (08)
  • [47] KoDF: A Large-scale Korean DeepFake Detection Dataset
    Kwon, Patrick
    You, Jaeseong
    Nam, Gyuhyeon
    Park, Sungwoo
    Chae, Gyeongsu
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 10724 - 10733
  • [48] Modernizing Analytics for Melanoma with a Large-Scale Research Dataset
    Richter, Aaron N.
    Khoshgoftaar, Taghi M.
    2017 IEEE 18TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IEEE IRI 2017), 2017, : 551 - 558
  • [49] BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization
    Sharma, Eva
    Li, Chen
    Wang, Lu
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 2204 - 2213
  • [50] Large-scale Cloze Test Dataset Created by Teachers
    Xie, Qizhe
    Lai, Guokun
    Dai, Zihang
    Hovy, Eduard
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 2344 - 2356