RoBERTa-Based Keyword Extraction from Small Number of Korean Documents

Cited by: 0
Authors
Kim, So-Eon [1]
Lee, Jun-Beom [1]
Park, Gyu-Min [1]
Sohn, Seok-Man [2]
Park, Seong-Bae [1]
Affiliations
[1] Kyung Hee Univ, Sch Comp, Yongin 17104, South Korea
[2] Korea Elect Power Res Inst, Daejeon 34056, South Korea
Keywords
keyword extraction; sequence labeling; post-processing; RoBERTa; learning with small dataset; GENERATION
DOI
10.3390/electronics12224560
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Keyword extraction is the task of identifying the essential words in a lengthy document. It is usually performed with supervised methods, and when the training dataset is small, a classification-based approach is typically employed. This paper therefore introduces a novel keyword extractor based on a classification approach. The proposed keyword extractor comprises three key components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability that each token in the document is a keyword, and the decision rule determines whether each token is a keyword based on these probabilities. However, training the proposed model with a small dataset presents two challenges. The first is that none of the tokens in a document may be identified as a keyword, and the second is that a single word can be composed of both keyword tokens and non-keyword tokens. Two novel heuristics are proposed to address these problems. Experiments show that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data, and an ablation study further validates the efficacy of the heuristics. In summary, the proposed heuristics are effective for developing a supervised keyword extractor with a small dataset.
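The decision step described in the abstract can be illustrated with a minimal Python sketch. It assumes the per-token keyword probabilities have already been produced by an encoder and estimator, and it works on plain lists only. The function name `decide_keywords`, the 0.5 threshold, and both fallback rules are hypothetical illustrations of the two problems named in the abstract; the abstract does not specify the authors' actual heuristics, so this should not be read as their method.

```python
from typing import List, Sequence


def decide_keywords(
    token_probs: Sequence[float],
    word_ids: Sequence[int],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[bool]:
    """Return one keyword flag per token.

    token_probs[i] is the estimated probability that token i is a keyword.
    word_ids[i] is the index of the word that token i belongs to.
    """
    if len(token_probs) != len(word_ids):
        raise ValueError("token_probs and word_ids must have the same length")
    if not token_probs:
        return []

    # Basic decision rule: flag tokens whose probability reaches the threshold.
    flags = [p >= threshold for p in token_probs]

    # Assumed fallback 1: if no token was flagged, keep the most probable one
    # so that the document does not end up with zero keywords.
    if not any(flags):
        best = max(range(len(token_probs)), key=token_probs.__getitem__)
        flags[best] = True

    # Assumed fallback 2: make labels consistent within a word, so a word is
    # never split into keyword and non-keyword tokens.
    keyword_words = {w for w, f in zip(word_ids, flags) if f}
    return [w in keyword_words for w in word_ids]
```

For example, with probabilities `[0.9, 0.2, 0.1, 0.3]` and word indices `[0, 0, 1, 2]`, only the first token passes the threshold, and the word-level step then flags both tokens of word 0. Promoting a whole word when any of its tokens is flagged is one possible design choice; a majority vote or an average over the word's tokens would be equally plausible alternatives.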
Pages: 13