RoBERTa-Based Keyword Extraction from Small Number of Korean Documents

Cited by: 0
Authors
Kim, So-Eon [1]
Lee, Jun-Beom [1]
Park, Gyu-Min [1]
Sohn, Seok-Man [2]
Park, Seong-Bae [1]
Affiliations
[1] Kyung Hee Univ, Sch Comp, Yongin 17104, South Korea
[2] Korea Elect Power Res Inst, Daejeon 34056, South Korea
Keywords
keyword extraction; sequence labeling; post-processing; RoBERTa; learning with small dataset; GENERATION
DOI
10.3390/electronics12224560
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Keyword extraction is the task of identifying the essential words in a lengthy document. It is most often treated as a supervised learning problem, and when the dataset is small, a classification-based approach is typically employed. This paper therefore introduces a novel keyword extractor based on a classification approach. The proposed extractor comprises three components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability that each token in the document is a keyword, and the decision rule determines whether each token is a keyword based on these probabilities. However, training the proposed model with a small dataset presents two challenges. The first is that every token in a document may be judged a non-keyword, and the second is that a single word can be composed of both keyword tokens and non-keyword tokens. Two novel heuristics are proposed to address these problems. Experiments show that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data, and an ablation study further validates the efficacy of the heuristics. In summary, the proposed heuristics are effective for developing a supervised keyword extractor with a small dataset.
Pages: 13
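The two challenges named in the abstract can be made concrete with a small sketch. The code below is not the authors' method and does not reproduce their heuristics, which the abstract does not describe. It only assumes a plain threshold as the decision rule (the 0.5 value, the function names, the sub-word pieces, and the probabilities are all hypothetical) and shows how token-level decisions can leave a document with no keywords or a word with mixed labels.

```python
from typing import List, Tuple


def token_decisions(probs: List[float], threshold: float = 0.5) -> List[bool]:
    """Assumed decision rule: a token is a keyword if its probability
    reaches the threshold. The threshold value is illustrative only."""
    return [p >= threshold for p in probs]


def word_status(
    tokens: List[str], word_ids: List[int], flags: List[bool]
) -> List[Tuple[str, str]]:
    """Group sub-word tokens by word index and label each word as
    'keyword', 'non-keyword', or 'mixed' (some tokens flagged, some not)."""
    grouped = {}
    for tok, wid, flag in zip(tokens, word_ids, flags):
        text, hits, total = grouped.get(wid, ("", 0, 0))
        grouped[wid] = (text + tok, hits + int(flag), total + 1)

    result = []
    for text, hits, total in grouped.values():
        if hits == 0:
            label = "non-keyword"
        elif hits == total:
            label = "keyword"
        else:
            label = "mixed"
        result.append((text, label))
    return result


# Hypothetical sub-word pieces and estimator outputs for three words.
tokens = ["elec", "tric", "grid", "fail", "ure"]
word_ids = [0, 0, 1, 2, 2]
probs = [0.9, 0.3, 0.8, 0.2, 0.1]

flags = token_decisions(probs)
status = word_status(tokens, word_ids, flags)

# Challenge 1: every token falls below the threshold, so nothing is extracted.
no_keywords = not any(token_decisions([0.1, 0.2, 0.4]))

# Challenge 2: at least one word has both keyword and non-keyword tokens.
mixed_words = [word for word, label in status if label == "mixed"]
```

Here `status` labels "electric" as mixed because only its first piece passes the threshold, which is the kind of case a post-processing step would have to resolve one way or the other.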