A new dataset for French and multilingual keyphrase generation

Cited by: 0
Authors
Piedboeuf, Frederic [1]
Langlais, Philippe [1]
Affiliations
[1] Univ Montreal, RALI, Diro, Montreal, PQ, Canada
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022
Keywords
EXTRACTION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16,427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.
Pages: 14