SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation

Cited by: 0
Authors
Kasanishi, Tetsu [1]
Isonuma, Masaru [1]
Mori, Junichiro [1, 2]
Sakata, Ichiro [1]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] RIKEN Ctr Adv Intelligence Project, Wako, Saitama, Japan
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic literature review generation is one of the most challenging tasks in natural language processing. Although large language models have tackled literature review generation, the absence of large-scale datasets has been a stumbling block to progress. We release SciReviewGen, consisting of over 10,000 literature reviews and 690,000 papers cited in the reviews. Based on the dataset, we evaluate recent transformer-based summarization models on the literature review generation task, including Fusion-in-Decoder (Izacard and Grave, 2021) extended for literature review generation. Human evaluation results show that some machine-generated summaries are comparable to human-written reviews, while revealing the challenges of automatic literature review generation, such as hallucinations and a lack of detailed information. Our dataset and code are available at https://github.com/tetsu9923/SciReviewGen.
Pages: 6695-6708
Page count: 14
Related papers
50 in total
  • [1] A Large-Scale Dataset for Empathetic Response Generation
    Welivita, Anuradha
    Xie, Yubo
    Pu, Pearl
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021: 1251-1264
  • [2] RNSum: A Large-Scale Dataset for Automatic Release Note Generation via Commit Logs Summarization
    Kamezawa, Hisashi
    Nishida, Noriki
    Shimizu, Nobuyuki
    Miyazaki, Takashi
    Nakayama, Hideki
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022: 8718-8735
  • [3] Generation and Analysis of a Large-Scale Urban Vehicular Mobility Dataset
    Uppoor, Sandesh
    Trullols-Cruces, Oscar
    Fiore, Marco
    Barcelo-Ordinas, Jose M.
    IEEE TRANSACTIONS ON MOBILE COMPUTING, 2014, 13(5): 1061-1075
  • [4] Introduction and Analysis of a Large-Scale Benchmark Automatic Vehicle Identification Dataset
    He, Zhaocheng
    Chen, Kaiying
    Chen, Xinyu
    Sun, Weiwei
    INTERNATIONAL CONFERENCE ON TRANSPORTATION AND DEVELOPMENT 2018: CONNECTED AND AUTONOMOUS VEHICLES AND TRANSPORTATION SAFETY, 2018: 35-43
  • [5] Large-Scale Ontology Matching: a Review of the Literature
    Babalou, Samira
    Kargar, Mohammad Javad
    Davarpanah, Seyyed Hashem
    2016 SECOND INTERNATIONAL CONFERENCE ON WEB RESEARCH (ICWR), 2016: 158-165
  • [6] Varta: A Large-Scale Headline-Generation Dataset for Indic Languages
    Aralikatte, Rahul
    Cheng, Ziling
    Doddapaneni, Sumanth
    Cheung, Jackie Chi Kit
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023: 3468-3492
  • [7] DACSA: A large-scale Dataset for Automatic summarization of Catalan and Spanish newspaper Articles
    Segarra, Encarna
    Ahuir, Vicent
    Hurtado, Lluis-F
    Angel Gonzalez, Jose
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022: 5931-5943
  • [8] DMDD: A Large-Scale Dataset for Dataset Mentions Detection
    Pan, Huitong
    Zhang, Qi
    Dragut, Eduard
    Caragea, Cornelia
    Latecki, Longin Jan
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2023, 11: 1132-1146
  • [9] Large-scale RDF Dataset Slicing
    Marx, Edgard
    Shekarpour, Saeedeh
    Auer, Soeren
    Ngomo, Axel-Cyrille Ngonga
    2013 IEEE SEVENTH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC 2013), 2013: 228-235
  • [10] Euler Clustering on Large-scale Dataset
    Wu, Jian-Sheng
    Zheng, Wei-Shi
    Lai, Jian-Huang
    Suen, Ching Y.
    IEEE TRANSACTIONS ON BIG DATA, 2018, 4(4): 502-515