SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation

Citations: 0
Authors
Kasanishi, Tetsu [1]
Isonuma, Masaru [1]
Mori, Junichiro [1, 2]
Sakata, Ichiro [1]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] RIKEN Ctr Adv Intelligence Project, Wako, Saitama, Japan
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic literature review generation is one of the most challenging tasks in natural language processing. Although large language models have tackled literature review generation, the absence of large-scale datasets has been a stumbling block to progress. We release SciReviewGen, consisting of over 10,000 literature reviews and 690,000 papers cited in the reviews. Based on the dataset, we evaluate recent transformer-based summarization models on the literature review generation task, including Fusion-in-Decoder (Izacard and Grave, 2021) extended for literature review generation. Human evaluation results show that some machine-generated summaries are comparable to human-written reviews, while revealing the challenges of automatic literature review generation such as hallucinations and a lack of detailed information. Our dataset and code are available at https://github.com/tetsu9923/SciReviewGen.
Pages: 6695-6708
Number of pages: 14
Related papers
50 in total
  • [21] PatchDB: A Large-Scale Security Patch Dataset
    Wang, Xinda
    Wang, Shu
    Feng, Pengbin
    Sun, Kun
    Jajodia, Sushil
    51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN 2021), 2021, : 149 - 160
  • [22] openDD: A Large-Scale Roundabout Drone Dataset
    Breuer, Antonia
    Termoehlen, Jan-Aike
    Homoceanu, Silviu
    Fingscheidt, Tim
    2020 IEEE 23RD INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), 2020,
  • [23] A large-scale dataset of buildings and construction sites
    Cheng, Xuanhao
    Jia, Mingming
    He, Jian
    COMPUTER-AIDED CIVIL AND INFRASTRUCTURE ENGINEERING, 2024, 39 (09) : 1390 - 1406
  • [24] MineRL: A Large-Scale Dataset of Minecraft Demonstrations
    Guss, William H.
    Houghton, Brandon
    Topin, Nicholay
    Wang, Phillip
    Codel, Cayden
    Veloso, Manuela
    Salakhutdinov, Ruslan
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 2442 - 2448
  • [25] SGF: A Crowdsourced Large-scale Event Dataset
    Heuschkel, Jens
    Froemmgen, Alexander
    PROCEEDINGS OF THE 9TH ACM MULTIMEDIA SYSTEMS CONFERENCE (MMSYS'18), 2018, : 351 - 356
  • [26] MultiSubs: A Large-scale Multimodal and Multilingual Dataset
    Wang, Josiah
    Figueiredo, Josiel
    Specia, Lucia
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6776 - 6785
  • [27] EdNet: A Large-Scale Hierarchical Dataset in Education
    Choi, Youngduck
    Lee, Youngnam
    Shin, Dongmin
    Cho, Junghyun
    Park, Seoyon
    Lee, Seewoo
    Baek, Jineon
    Bae, Chan
    Kim, Byungsoo
    Heo, Jaewe
    ARTIFICIAL INTELLIGENCE IN EDUCATION (AIED 2020), PT II, 2020, 12164 : 69 - 73
  • [28] A large-scale and global car dataset for verification
    Hu, Lingji
    Luo, Xingcheng
    Deng, Jianhua
    Lai, Fengjie
    Hu, Jian
    Yu, Yongbin
    PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ELECTRONIC TECHNOLOGY, 2016, 48 : 49 - 52
  • [29] VoxCeleb: a large-scale speaker identification dataset
    Nagrani, Arsha
    Chung, Joon Son
    Zisserman, Andrew
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2616 - 2620
  • [30] A large-scale hyperspectral dataset for flower classification
    Zheng, Yongrong
    Zhang, Tao
    Fu, Ying
    KNOWLEDGE-BASED SYSTEMS, 2022, 236