Domain-Specific Language Model Pre-Training for Korean Tax Law Classification

Cited by: 1
Authors:
Gu, Yeong Hyeon [1 ]
Piao, Xianghua [1 ,2 ]
Yin, Helin [1 ]
Jin, Dong [1 ,2 ]
Zheng, Ri [1 ,2 ]
Yoo, Seong Joon [1 ]
Affiliations:
[1] Sejong Univ, Dept Comp Sci & Engn, Seoul 05006, South Korea
[2] Sejong Univ, Dept Convergence Engn Intelligent Drone, Seoul 05006, South Korea
Keywords:
Task analysis; Biological system modeling; Finance; Bit error rate; Licenses; Employee welfare; Vocabulary; BERT; domain-specific; Korean tax law; pre-trained language model; text classification
DOI: 10.1109/ACCESS.2022.3164098
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline code: 0812
Abstract:
Owing to the increasing number of amendments and the complexity of tax laws, most taxpayers lack the required knowledge of them, which causes problems in everyday life. To use an online tax counseling service, a person must first select the category of tax law corresponding to their question; however, a layperson without prior knowledge of tax laws may not know which category to select in the first place. A model capable of automatically classifying tax-law categories is therefore needed. BERT-based models are frequently used for text classification, but they are generally trained on open-domain corpora and often suffer degraded performance on domain-specific technical terms such as those in tax law. Furthermore, because BERT is a large-scale model, a significant amount of time is required to train it. To address these issues, this study proposes Korean tax law-BERT (KTL-BERT) for the automatic classification of tax questions into categories. For the proposed KTL-BERT, a new language model was pre-trained from scratch with a static masking method applied to a DistilRoBERTa architecture. The pre-trained language model was then fine-tuned to classify five categories of tax law. A total of 327,735 tax-law questions were used to verify the performance of the proposed KTL-BERT. Its F1-score was approximately 91.06%, exceeding the benchmark models by approximately 1.07%-15.46%, and its training speed was approximately 0.89%-56.07% higher.
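
As a rough illustration of the recipe the abstract describes, the sketch below (Python, using the Hugging Face transformers and datasets libraries, which the paper does not necessarily use) pre-trains a DistilRoBERTa-sized encoder from scratch with static masking and then loads it for five-way classification. This is a minimal sketch, not the authors' code: the tokenizer path, corpus file, and all hyperparameters are illustrative assumptions; only the 6-layer DistilRoBERTa-style depth, the static masking step, and the five output classes come from the abstract.

# Minimal sketch, NOT the authors' released code: pre-train a DistilRoBERTa-sized
# encoder from scratch on a tax-law corpus with *static* masking, then load it for
# 5-way tax-question classification. Paths and hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# 1) Domain tokenizer (assumed to have been trained beforehand on the tax-law
#    corpus) and a randomly initialized 6-layer encoder, i.e. learning from scratch.
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/tax-law-tokenizer")  # hypothetical path
config = RobertaConfig(
    vocab_size=len(tokenizer),
    num_hidden_layers=6,            # DistilRoBERTa depth
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,
)
mlm_model = RobertaForMaskedLM(config)

# 2) Static masking: each example is masked once during preprocessing, so the same
#    mask pattern is reused every epoch (unlike RoBERTa's dynamic masking).
masker = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
corpus = load_dataset("text", data_files={"train": "tax_law_corpus.txt"})["train"]  # hypothetical file

def tokenize_and_mask(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    feats = [{"input_ids": i, "attention_mask": m}
             for i, m in zip(enc["input_ids"], enc["attention_mask"])]
    masked = masker(feats)          # masks are fixed here -> "static" masking
    return {k: v.tolist() for k, v in masked.items()}

masked_corpus = corpus.map(tokenize_and_mask, batched=True, remove_columns=["text"])

pretrainer = Trainer(
    model=mlm_model,
    args=TrainingArguments("ktl-bert-pretrain", per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=masked_corpus,
)
pretrainer.train()
pretrainer.save_model("ktl-bert-pretrain")

# 3) Fine-tune the pre-trained encoder to classify the five tax-law categories;
#    the 327,735 labelled questions would be tokenized and trained with another Trainer.
classifier = RobertaForSequenceClassification.from_pretrained("ktl-bert-pretrain",
                                                              num_labels=5)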
Pages: 46342-46353 (12 pages)
Related papers (50 in total):
  • [1] Pre-training language model incorporating domain-specific heterogeneous knowledge into a unified representation. Zhu, Hongyin; Peng, Hao; Lyu, Zhiheng; Hou, Lei; Li, Juanzi; Xiao, Jinghui. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 215.
  • [2] Domain-Specific Pre-training Improves Confidence in Whole Slide Image Classification. Chitnis, Soham Rohit; Liu, Sidong; Dash, Tirtharaj; Verlekar, Tanmay Tulsidas; Di Ieva, Antonio; Berkovsky, Shlomo; Vig, Lovekesh; Srinivasan, Ashwin. 2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023.
  • [3] On Domain-Specific Pre-Training for Effective Semantic Perception in Agricultural Robotics. Roggiolani, Gianmarco; Magistri, Federico; Guadagnino, Tiziano; Weyler, Jan; Grisetti, Giorgio; Stachniss, Cyrill; Behley, Jens. 2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023: 11786-11793.
  • [4] A Joint Domain-Specific Pre-Training Method Based on Data Enhancement. Gan, Yi; Lu, Gaoyong; Su, Zhihui; Wang, Lei; Zhou, Junlin; Jiang, Jiawei; Chen, Duanbing. APPLIED SCIENCES-BASEL, 2023, 13 (07).
  • [5] Framework for automation of short answer grading based on domain-specific pre-training. Bonthu, Sridevi; Sree, S. Rama; Prasad, M. H. M. Krishna. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 137.
  • [6] Subset selection for domain adaptive pre-training of language model. Hwang, Junha; Lee, Seungdong; Kim, Haneul; Jeong, Young-Seob. SCIENTIFIC REPORTS, 2025, 15 (01).
  • [7] Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Trewartha, Amalie; Walker, Nicholas; Huo, Haoyan; Lee, Sanghoon; Cruse, Kevin; Dagdelen, John; Dunn, Alexander; Persson, Kristin A.; Ceder, Gerbrand; Jain, Anubhav. PATTERNS, 2022, 3 (04).
  • [8] A domain-specific language for model coupling. Bulatewicz, Tom; Cuny, Janice. PROCEEDINGS OF THE 2006 WINTER SIMULATION CONFERENCE, VOLS 1-5, 2006: 1091+.
  • [9] MindLLM: Lightweight large language model pre-training, evaluation and domain application. Yang, Yizhe; Sun, Huashan; Li, Jiawei; Liu, Runheng; Li, Yinghao; Liu, Yuhang; Gao, Yang; Huang, Heyan. AI OPEN, 2024, 5: 155-180.
  • [10] KU_ai at MEDIQA 2019: Domain-specific Pre-training and Transfer Learning for Medical NLI. Cengiz, Cemil; Sert, Ulas; Yuret, Deniz. SIGBIOMED WORKSHOP ON BIOMEDICAL NATURAL LANGUAGE PROCESSING (BIONLP 2019), 2019: 427-436.