A new dataset for French and multilingual keyphrase generation

Authors
Piedboeuf, Frederic [1]
Langlais, Philippe [1]
Affiliations
[1] Univ Montreal, RALI, Diro, Montreal, PQ, Canada
Keywords
EXTRACTION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.
Pages: 14