HAKE: an Unsupervised Approach to Automatic Keyphrase Extraction for Multiple Domains

Cited by: 4
Authors
Merrouni, Zakariae Alami [1]
Frikh, Bouchra [1]
Ouhbi, Brahim [2]
Affiliations
[1] Sidi Mohamed Ben Abdellah Univ, Natl Sch Appl Sci ENSA, LIASSE Lab, BP 72,Route Dimouzer, Fes, Morocco
[2] Moulay Ismail Univ UMI, Natl Higher Sch Arts & Crafts ENSAM, Math Modeling & Comp Lab LM2I, Marjane 2,BP 4024, Meknes, Morocco
Keywords
Automatic keyphrase extraction; Unsupervised machine learning; Feature selection; FEATURE-SELECTION; KEYWORD EXTRACTION; SYSTEM;
DOI
10.1007/s12559-021-09979-7
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyphrases capture the main content of a free-text document. The task of automatic keyphrase extraction (AKPE) plays a significant role in retrieving and summarizing valuable information from documents in different domains. Various techniques have been proposed for this task. However, supervised AKPE requires large annotated datasets and depends on the tested domain. An alternative is a domain-independent method that can be applied to several domains (such as medical and social). In this paper, we tackle keyphrase extraction from single documents with HAKE, a novel unsupervised method that mines linguistic, statistical, structural, and semantic text features simultaneously to select the most relevant keyphrases in a text. HAKE achieves higher F-scores than the unsupervised state-of-the-art systems on standard datasets and is suitable for real-time processing of large amounts of Web and text data across different domains. With HAKE, we also explicitly increase coverage and diversity among the selected keyphrases by introducing a novel technique (based on a parse tree approach, part-of-speech tagging, and filtering) for candidate keyphrase identification and extraction. This technique allows us to generate a comprehensive and meaningful list of candidate keyphrases and to reduce the candidate set's size without increasing the computational complexity. HAKE's effectiveness is compared to twelve state-of-the-art and recent unsupervised approaches, as well as to some supervised approaches. Experimental analysis is conducted to validate the proposed method using five of the top available benchmark corpora from different domains and shows that HAKE significantly outperforms both the existing unsupervised and supervised methods. Our method does not require training on a particular set of documents, nor does it depend on external corpora, dictionaries, domain, or text size. Our experiments confirm that HAKE's candidate selection model and its ranking model are effective.
Pages: 852-874
Page count: 23