CStory: A Chinese Large-scale News Storyline Dataset

Authors
Shi, Kaijie [1 ]
Wang, Xiaozhi [1 ]
Yu, Jifan [1 ]
Hou, Lei [2 ]
Li, Juanzi [2 ]
Wu, Jingtong [3 ]
Yong, Dingyu [4 ]
Xiao, Jinghui [5 ]
Liu, Qun [5 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China
[4] Huawei Device Co Ltd, Beijing, Peoples R China
[5] Huawei Noahs Ark Lab, Beijing, Peoples R China
Keywords
Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;
DOI
10.1145/3511808.3557573
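Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract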
In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets, since news storylines are expensive to annotate: they contain a myriad of unlabeled relationships that grow quadratically with the number of news events. Working around these difficulties, we propose a sophisticated pre-processing method to filter candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that the sample imbalance issue significantly influences model performance, which should be the focus of future work. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
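The candidate-pair filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pre-processing method: the function names, the thresholds, the toy articles, and the use of token-level Jaccard overlap as a stand-in for semantic similarity are all assumptions made for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap; a crude stand-in for a semantic similarity score."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_candidate_pairs(articles, min_shared_entities=1, min_similarity=0.2):
    """Keep only article pairs that share entities AND look similar.

    Hypothetical helper: thresholds and field names are illustrative,
    not taken from the paper.
    """
    kept = []
    # All unordered pairs: n * (n - 1) / 2 of them, i.e. quadratic in n.
    for (i, x), (j, y) in combinations(enumerate(articles), 2):
        shared = set(x["entities"]) & set(y["entities"])
        if len(shared) < min_shared_entities:
            continue  # no entity co-occurrence, skip annotation
        if jaccard(x["tokens"], y["tokens"]) < min_similarity:
            continue  # too dissimilar, skip annotation
        kept.append((i, j))
    return kept


# Toy data (invented for this example).
articles = [
    {"entities": {"CompanyA", "CityX"},
     "tokens": ["companya", "launches", "phone", "cityx"]},
    {"entities": {"CompanyA"},
     "tokens": ["companya", "phone", "sales", "rise"]},
    {"entities": {"UniversityB"},
     "tokens": ["universityb", "opens", "new", "lab"]},
]

pairs = filter_candidate_pairs(articles)
print(pairs)
```

On the three toy articles, only the pair sharing an entity and enough vocabulary survives, so annotators would see one pair instead of all three. In practice the similarity function would more plausibly be an embedding-based score, which is left out here to keep the sketch dependency-free.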
Pages: 4475-4479
Page count: 5