A new dataset for French and multilingual keyphrase generation

Cited by: 0

Authors:

Piedboeuf, Frederic [1]

Langlais, Philippe [1]

Affiliations:

[1] Univ Montreal, RALI, Diro, Montreal, PQ, Canada

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022

Keywords:

EXTRACTION;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.


Pages: 14

