RoBERTa-Based Keyword Extraction from Small Number of Korean Documents

Cited by: 0

Authors:

Kim, So-Eon ^{[1]}

Lee, Jun-Beom ^{[1]}

Park, Gyu-Min ^{[1]}

Sohn, Seok-Man ^{[2]}

Park, Seong-Bae ^{[1]}

Affiliations:

[1] Kyung Hee Univ, Sch Comp, Yongin 17104, South Korea

[2] Korea Elect Power Res Inst, Daejeon 34056, South Korea

Source:

ELECTRONICS | 2023 / Vol. 12 / Issue 22

Keywords:

keyword extraction; sequence labeling; post-processing; RoBERTa; learning with small dataset; GENERATION;

DOI:

10.3390/electronics12224560

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Keyword extraction is the task of identifying essential words in a lengthy document. It is primarily performed through supervised learning. When the dataset is limited in size, a classification-based approach is typically employed. This paper therefore introduces a novel keyword extractor based on a classification approach. The proposed keyword extractor comprises three key components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability that each token in the document is a keyword, and the decision rule ultimately determines whether each token is a keyword based on these probabilities. However, training the proposed model with a small dataset presents two challenges. One is the case in which none of the tokens in a document is a keyword, and the other is that a single word can be composed of both keyword tokens and non-keyword tokens. To address these issues, two novel heuristics are proposed. Experiments demonstrate that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data. The efficacy of the heuristics is further validated through an ablation study. In summary, the proposed heuristics have proven to be effective in developing a supervised keyword extractor with a small dataset.
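
The decision step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the two heuristics, so everything below is an assumption made for illustration only: the function name `extract_keywords`, the 0.5 threshold, averaging token probabilities within a word, and falling back to the highest-scoring word when nothing passes the threshold. The per-token probabilities are taken as given; in the paper they would come from the keyword estimator on top of RoBERTa.

```python
from typing import Dict, List, Tuple


def extract_keywords(
    tokens: List[Tuple[str, int, float]],
    threshold: float = 0.5,
) -> List[str]:
    """Turn per-token keyword probabilities into word-level keywords.

    tokens: (subword text, index of the word it belongs to, keyword probability).
    All rules here are hypothetical stand-ins, not the paper's heuristics.
    """
    # Group subword tokens by the word they belong to.
    words: Dict[int, List[Tuple[str, float]]] = {}
    for text, word_id, prob in tokens:
        words.setdefault(word_id, []).append((text, prob))

    # Assumed word-level rule: average the token probabilities so a word is
    # never split into keyword tokens and non-keyword tokens.
    scored = [
        ("".join(t for t, _ in parts), sum(p for _, p in parts) / len(parts))
        for _, parts in sorted(words.items())
    ]

    keywords = [word for word, score in scored if score >= threshold]

    # Assumed fallback: if no word passes the threshold, return the best one
    # so that a non-empty document never yields an empty keyword list.
    if not keywords and scored:
        keywords = [max(scored, key=lambda ws: ws[1])[0]]
    return keywords
```

For example, a word split into two subword tokens with probabilities 0.9 and 0.3 is treated as one unit with score 0.6 and kept as a keyword, rather than being half-selected.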

Pages: 13
