A new dataset for French and multilingual keyphrase generation

Cited by: 0
Authors
Piedboeuf, Frederic [1]
Langlais, Philippe [1]
Affiliation
[1] Univ Montreal, RALI, Diro, Montreal, PQ, Canada
Keywords
EXTRACTION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.
Pages: 14