Domain-Specific Language Model Pre-Training for Korean Tax Law Classification

Cited by: 1
Authors:
Gu, Yeong Hyeon [1 ]
Piao, Xianghua [1 ,2 ]
Yin, Helin [1 ]
Jin, Dong [1 ,2 ]
Zheng, Ri [1 ,2 ]
Yoo, Seong Joon [1 ]
Affiliations:
[1] Sejong Univ, Dept Comp Sci & Engn, Seoul 05006, South Korea
[2] Sejong Univ, Dept Convergence Engn Intelligent Drone, Seoul 05006, South Korea
Keywords:
Task analysis; Biological system modeling; Finance; Bit error rate; Licenses; Employee welfare; Vocabulary; BERT; domain-specific; Korean tax law; pre-trained language model; text classification
DOI: 10.1109/ACCESS.2022.3164098
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline code: 0812
Abstract:
Owing to the increasing number of amendments and the complexity of tax laws, most taxpayers lack the required knowledge of them, which causes problems in everyday life. To use an online tax counseling service, a person must first select the category of tax law corresponding to their question; however, a layperson without prior knowledge of tax laws may not know which category to select in the first place. A model capable of automatically classifying tax-law categories is therefore needed. BERT-based models are frequently used for text classification, but they are generally trained on open-domain corpora and often suffer degraded performance on domain-specific technical terms such as those in tax law. Furthermore, because BERT is a large-scale model, a significant amount of time is required to train it. To address these issues, this study proposes Korean tax law-BERT (KTL-BERT) for the automatic classification of tax questions into categories. For the proposed KTL-BERT, a new language model was pre-trained from scratch with a static masking method applied to a DistilRoBERTa architecture. The pre-trained language model was then fine-tuned to classify five categories of tax law. A total of 327,735 tax-law questions were used to verify the performance of the proposed KTL-BERT. Its F1-score was approximately 91.06%, exceeding the benchmark models by approximately 1.07%-15.46%, and its training speed was approximately 0.89%-56.07% higher.
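
As a rough illustration of the recipe the abstract describes, the sketch below (Python, using the Hugging Face transformers and datasets libraries, which the paper does not necessarily use) pre-trains a DistilRoBERTa-sized encoder from scratch with static masking and then loads it for five-way classification. This is a minimal sketch, not the authors' code: the tokenizer path, corpus file, and all hyperparameters are illustrative assumptions; only the 6-layer DistilRoBERTa-style depth, the static masking step, and the five output classes come from the abstract.

# Minimal sketch, NOT the authors' released code: pre-train a DistilRoBERTa-sized
# encoder from scratch on a tax-law corpus with *static* masking, then load it for
# 5-way tax-question classification. Paths and hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# 1) Domain tokenizer (assumed to have been trained beforehand on the tax-law
#    corpus) and a randomly initialized 6-layer encoder, i.e. learning from scratch.
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/tax-law-tokenizer")  # hypothetical path
config = RobertaConfig(
    vocab_size=len(tokenizer),
    num_hidden_layers=6,            # DistilRoBERTa depth
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,
)
mlm_model = RobertaForMaskedLM(config)

# 2) Static masking: each example is masked once during preprocessing, so the same
#    mask pattern is reused every epoch (unlike RoBERTa's dynamic masking).
masker = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
corpus = load_dataset("text", data_files={"train": "tax_law_corpus.txt"})["train"]  # hypothetical file

def tokenize_and_mask(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    feats = [{"input_ids": i, "attention_mask": m}
             for i, m in zip(enc["input_ids"], enc["attention_mask"])]
    masked = masker(feats)          # masks are fixed here -> "static" masking
    return {k: v.tolist() for k, v in masked.items()}

masked_corpus = corpus.map(tokenize_and_mask, batched=True, remove_columns=["text"])

pretrainer = Trainer(
    model=mlm_model,
    args=TrainingArguments("ktl-bert-pretrain", per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=masked_corpus,
)
pretrainer.train()
pretrainer.save_model("ktl-bert-pretrain")

# 3) Fine-tune the pre-trained encoder to classify the five tax-law categories;
#    the 327,735 labelled questions would be tokenized and trained with another Trainer.
classifier = RobertaForSequenceClassification.from_pretrained("ktl-bert-pretrain",
                                                              num_labels=5)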
Pages: 46342-46353 (12 pages)
Related papers (50 in total):
  • [1] Pre-training language model incorporating domain-specific heterogeneous knowledge into a unified representation. Zhu, Hongyin; Peng, Hao; Lyu, Zhiheng; Hou, Lei; Li, Juanzi; Xiao, Jinghui. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 215.
  • [2] Domain-Specific Pre-training Improves Confidence in Whole Slide Image Classification. Chitnis, Soham Rohit; Liu, Sidong; Dash, Tirtharaj; Verlekar, Tanmay Tulsidas; Di Ieva, Antonio; Berkovsky, Shlomo; Vig, Lovekesh; Srinivasan, Ashwin. 2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023.
  • [3] On Domain-Specific Pre-Training for Effective Semantic Perception in Agricultural Robotics. Roggiolani, Gianmarco; Magistri, Federico; Guadagnino, Tiziano; Weyler, Jan; Grisetti, Giorgio; Stachniss, Cyrill; Behley, Jens. 2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023: 11786-11793.
  • [4] A Joint Domain-Specific Pre-Training Method Based on Data Enhancement. Gan, Yi; Lu, Gaoyong; Su, Zhihui; Wang, Lei; Zhou, Junlin; Jiang, Jiawei; Chen, Duanbing. APPLIED SCIENCES-BASEL, 2023, 13 (07).
  • [5] Framework for automation of short answer grading based on domain-specific pre-training. Bonthu, Sridevi; Sree, S. Rama; Prasad, M. H. M. Krishna. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 137.
  • [6] Subset selection for domain adaptive pre-training of language model. Hwang, Junha; Lee, Seungdong; Kim, Haneul; Jeong, Young-Seob. SCIENTIFIC REPORTS, 2025, 15 (01).
  • [7] Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Trewartha, Amalie; Walker, Nicholas; Huo, Haoyan; Lee, Sanghoon; Cruse, Kevin; Dagdelen, John; Dunn, Alexander; Persson, Kristin A.; Ceder, Gerbrand; Jain, Anubhav. PATTERNS, 2022, 3 (04).
  • [8] A domain-specific language for model coupling. Bulatewicz, Tom; Cuny, Janice. PROCEEDINGS OF THE 2006 WINTER SIMULATION CONFERENCE, VOLS 1-5, 2006: 1091+.
  • [9] MindLLM: Lightweight large language model pre-training, evaluation and domain application. Yang, Yizhe; Sun, Huashan; Li, Jiawei; Liu, Runheng; Li, Yinghao; Liu, Yuhang; Gao, Yang; Huang, Heyan. AI OPEN, 2024, 5: 155-180.
  • [10] KU_ai at MEDIQA 2019: Domain-specific Pre-training and Transfer Learning for Medical NLI. Cengiz, Cemil; Sert, Ulas; Yuret, Deniz. SIGBIOMED WORKSHOP ON BIOMEDICAL NATURAL LANGUAGE PROCESSING (BIONLP 2019), 2019: 427-436.