CStory: A Chinese Large-scale News Storyline Dataset

Authors
Shi, Kaijie [1]
Wang, Xiaozhi [1]
Yu, Jifan [1]
Hou, Lei [2]
Li, Juanzi [2]
Wu, Jingtong [3]
Yong, Dingyu [4]
Xiao, Jinghui [5]
Liu, Qun [5]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Beijing Huawei Digital Technol Co Ltd, Beijing, Peoples R China
[4] Huawei Device Co Ltd, Beijing, Peoples R China
[5] Huawei Noah's Ark Lab, Beijing, Peoples R China
Keywords
Storyline Datasets; Event Evolution; Storyline Relation; Topic Detection and Tracking; Imbalanced Dataset;
DOI
10.1145/3511808.3557573
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
In today's massive news streams, storylines can help us discover related event pairs and understand the evolution of hot events. Hence, many efforts have been devoted to automatically constructing news storylines. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because they contain a myriad of unlabeled relationships, whose number grows quadratically with the number of news events. To work around these difficulties, we propose a sophisticated pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that the sample imbalance issue significantly influences model performance, which should be the focus of future work. Our dataset is now publicly available at https://github.com/THU-KEG/CStory.
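The candidate filtering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the thresholds, and the bag-of-words cosine similarity (standing in for whatever semantic similarity model the paper uses) are all assumptions made for illustration. It also shows the quadratic growth the abstract mentions: 11,978 articles yield 71,730,253 unordered pairs before any filtering.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def total_pairs(n: int) -> int:
    """Number of unordered article pairs among n articles: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (a stand-in for a
    learned semantic similarity; this choice is an assumption)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Keep only pairs that share entities AND are similar enough.

    `articles` maps a hypothetical article id to a dict with an
    `entities` set and a `tokens` list. Both thresholds are
    illustrative defaults, not values from the paper.
    """
    vectors = {aid: Counter(a["tokens"]) for aid, a in articles.items()}
    kept = []
    for x, y in combinations(sorted(articles), 2):
        shared = articles[x]["entities"] & articles[y]["entities"]
        if len(shared) < min_shared_entities:
            continue  # no entity co-occurrence: skip the costlier similarity check
        if cosine(vectors[x], vectors[y]) >= min_similarity:
            kept.append((x, y))
    return kept
```

Checking entity overlap first is a design choice in this sketch: a set intersection is cheap, so it prunes most pairs before any similarity score is computed.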
Pages: 4475-4479
Page count: 5