RoBERTa-Based Keyword Extraction from Small Number of Korean Documents

Cited by: 0
Authors
Kim, So-Eon [1]
Lee, Jun-Beom [1]
Park, Gyu-Min [1]
Sohn, Seok-Man [2]
Park, Seong-Bae [1]
Affiliations
[1] Kyung Hee Univ, Sch Comp, Yongin 17104, South Korea
[2] Korea Elect Power Res Inst, Daejeon 34056, South Korea
Keywords
keyword extraction; sequence labeling; post-processing; RoBERTa; learning with small dataset; GENERATION;
DOI
10.3390/electronics12224560
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Keyword extraction is the task of identifying essential words in a lengthy document. It is primarily performed through supervised keyword extraction. When the dataset is limited in size, a classification-based approach is typically employed. Therefore, this paper introduces a novel keyword extractor based on a classification approach. The proposed keyword extractor comprises three key components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability of each token in the document being a keyword, and the decision rule ultimately determines whether each token is a keyword based on these probabilities. However, training the proposed model with a small dataset presents two challenges. One is the case in which none of the tokens in a document is a keyword, and the other is that a single word can be composed of both keyword tokens and non-keyword tokens. Two novel heuristics are proposed to address these problems. Experiments show that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data. The efficacy of the heuristics is further validated through an ablation study. In summary, the proposed heuristics are effective for developing a supervised keyword extractor with a small dataset.
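
The abstract does not spell out the decision rule or the two heuristics, so the following is only a minimal sketch of how a probability-based decision step could handle the two stated problems. The function name `decide_keywords`, the 0.5 threshold, the word-level averaging, and the fallback to the highest-scoring word are all assumptions made for illustration and are not the paper's method.

```python
from typing import List, Tuple

# A word is a non-empty list of (token, keyword_probability) pairs.
Word = List[Tuple[str, float]]


def word_score(word: Word) -> float:
    """Average the token probabilities of one word (assumed aggregation)."""
    return sum(prob for _, prob in word) / len(word)


def decide_keywords(words: List[Word], threshold: float = 0.5) -> List[str]:
    """Hypothetical decision rule over per-token keyword probabilities."""
    selected = []
    for word in words:
        # Decide per word, not per token, so a word is never split into
        # keyword tokens and non-keyword tokens.
        if word_score(word) >= threshold:
            selected.append("".join(token for token, _ in word))
    if not selected and words:
        # Assumed fallback: if no word passes the threshold, keep the
        # single highest-scoring word so the output is not empty.
        best = max(words, key=word_score)
        selected.append("".join(token for token, _ in best))
    return selected


doc = [[("key", 0.9), ("word", 0.4)], [("the", 0.1)]]
print(decide_keywords(doc))
```

In this sketch the per-token probabilities would come from the keyword estimator applied on top of the RoBERTa encoding; here they are supplied by hand.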
Pages: 13
Related papers
  • [1] Neural based approach to keyword extraction from documents
    Jo, TH
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2003, PT 1, PROCEEDINGS, 2003, 2667 : 456 - 461
  • [2] Biomedical event causal relation extraction with deep knowledge fusion and Roberta-based data augmentation
    Li, Lishuang
    Xiang, Yi
    Hao, Jing
    METHODS, 2024, 231 : 8 - 14
  • [3] Contrastive Keyword Extraction from Versioned Documents
    Eder, Lukas
    Campos, Ricardo
    Jatowt, Adam
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 5026 - 5030
  • [4] Automatic keyword extraction from documents based on multiple content-based measures
    Yue, Kun
    Liu, Wei-Yi
    Zhou, Li-Ping
    COMPUTER SYSTEMS SCIENCE AND ENGINEERING, 2011, 26 (02): : 133 - 145
  • [5] An Algorithm for Cross-Language Keyword Extraction from Multiple Documents Based on HowNet
    Dai, Liuling
    Wang, ShuMei
    Hu, JinWu
    Liu, WanChun
    PROCEEDINGS OF 2008 INTERNATIONAL PRE-OLYMPIC CONGRESS ON COMPUTER SCIENCE, VOL II: INFORMATION SCIENCE AND ENGINEERING, 2008, : 1 - 7
  • [6] Keyword Extraction from Hindi Documents Using Statistical Approach
    Sharan, Aditi
    Siddiqi, Sifatullah
    Singh, Jagendra
    INTELLIGENT COMPUTING, COMMUNICATION AND DEVICES, 2015, 309 : 507 - 513
  • [7] Keyword Extraction From Specification Documents for Planning Security Mechanisms
    Poozhithara, Jeffy Jahfar
    Asuncion, Hazeline U.
    Lagesse, Brent
    2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE, 2023, : 1661 - 1673
  • [8] Effect of Centrality Measures for Keyword Extraction from Turkish Documents
    Goz, Furkan
    Mutlu, Alev
    Kucuk, Kerem
    Temur, Mahir
    Gun, Abdurrahman
    29TH IEEE CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS (SIU 2021), 2021,
  • [9] An Empirical Study of Important Keyword Extraction Techniques from Documents
    Hasan, H. M. Mahedi
    Sanyal, Falguni
    Chaki, Dipankar
    Ali, Md. Haider
    2017 1ST INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND INFORMATION MANAGEMENT (ICISIM), 2017, : 91 - 94
  • [10] Keyword extraction from documents using a neural network model
    Jo, Taeho
    Lee, Malrey
    Gatton, Thomas M.
    2006 INTERNATIONAL CONFERENCE ON HYBRID INFORMATION TECHNOLOGY, VOL 2, PROCEEDINGS, 2006, : 194 - +