TC-BERT: large-scale language model for Korean technology commercialization documents

Cited by: 0
Authors
Kim, Taero [1 ]
Oh, Changdae [2 ]
Hwang, Hyeji [6 ]
Lee, Eunkyeong [3 ,7 ]
Kim, Yewon [8 ]
Choi, Yunjeong [4 ]
Kim, Sungjin [4 ]
Choi, Hosik [3 ]
Song, Kyungwoo [1 ,5 ]
Affiliations
[1] Yonsei Univ, Dept Stat & Data Sci, 50 Yonsei Ro, Seoul 03722, South Korea
[2] Univ Wisconsin Madison, Dept Comp Sci, Madison, WI USA
[3] Univ Seoul, Dept Urban Big Data Convergence, 163 Seoulsiripdaero, Seoul 02504, South Korea
[4] Korea Inst Sci & Technol Informat, Technol Commercializat Res Ctr, Seoul 02456, South Korea
[5] Yonsei Univ, Dept Appl Stat, 50 Yonsei Ro, Seoul 03722, South Korea
[6] Woori Bank, Seoul, South Korea
[7] KT, Seoul, South Korea
[8] Univ Seoul, Dept Artificial Intelligence, Seoul, South Korea
Source
JOURNAL OF SUPERCOMPUTING | 2025, Vol. 81, No. 1
Funding
National Research Foundation, Singapore
Keywords
Natural language processing; Language model; BERT; Technology commercialization; Keyword extraction
DOI
10.1007/s11227-024-06597-6
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Pre-trained language models (LMs) have shown remarkable success across diverse tasks and domains. An LM trained on the documents of a specific area (e.g., biomedicine, education, or finance) provides expert-level knowledge of that domain, and many efforts have been made to develop such domain-specific LMs. Despite its potential benefits, however, LM development for the technology commercialization (TC) domain has not yet been investigated. In this study, we build a TC-specialized large LM pre-trained on a Korean TC corpus. First, we collect a large-scale dataset containing 199,857,586 general Korean sentences and 17,562,751 TC-related Korean sentences. Second, based on this dataset, we pre-train a Transformer-based language model, resulting in TC-BERT. Third, we validate TC-BERT on three practical applications: document classification, keyword extraction, and a recommender system. To this end, we devise a new keyword extraction algorithm and propose a document recommender algorithm based on TC-BERT's document embeddings. Through various quantitative and qualitative experiments, we comprehensively verify TC-BERT's effectiveness and its practical applicability.
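Both downstream applications named in the abstract rest on the same primitive: a document embedding produced by the pre-trained encoder. The sketch below (Python, Hugging Face transformers) illustrates that general pattern under stated assumptions: the public klue/bert-base checkpoint stands in for TC-BERT (whose weights are not given here), mean pooling stands in for the paper's unspecified embedding scheme, and the KeyBERT-style keyword scorer is a generic stand-in, not the authors' devised algorithm.

```python
# Hedged sketch: BERT document embeddings for recommendation and keyword scoring.
# klue/bert-base is a stand-in for TC-BERT; mean pooling and cosine scoring are
# assumptions for illustration, not the paper's actual method.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "klue/bert-base"  # any Korean BERT checkpoint works for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states into one unit-norm vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1); masks padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(pooled, dim=-1)


def recommend(query: str, corpus: list[str], top_k: int = 3):
    """Rank corpus documents by cosine similarity to the query document."""
    sims = (embed([query]) @ embed(corpus).T).squeeze(0)  # unit vectors -> cosine
    scores, idx = sims.topk(min(top_k, len(corpus)))
    return [(corpus[i], float(s)) for i, s in zip(idx.tolist(), scores.tolist())]


def extract_keywords(doc: str, candidates: list[str], top_k: int = 5):
    """KeyBERT-style stand-in: score candidate phrases against the document
    embedding. The paper devises its own algorithm, not reproduced here."""
    sims = (embed([doc]) @ embed(candidates).T).squeeze(0)
    scores, idx = sims.topk(min(top_k, len(candidates)))
    return [(candidates[i], float(s)) for i, s in zip(idx.tolist(), scores.tolist())]
```

As a usage example, recommend(query_doc, corpus) returns the top-3 corpus documents ranked by cosine similarity to the query's embedding, which is the standard form of the embedding-based document recommendation the abstract describes.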
Pages: 20