A new dataset for French and multilingual keyphrase generation

Cited by: 0
Authors
Piedboeuf, Frederic [1]
Langlais, Philippe [1]
Affiliations
[1] Univ Montreal, RALI, Diro, Montreal, PQ, Canada
Keywords
EXTRACTION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16,427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.
Pages: 14