TC-BERT: large-scale language model for Korean technology commercialization documents

Cited by: 0
Authors
Kim, Taero [1]
Oh, Changdae [2]
Hwang, Hyeji [6]
Lee, Eunkyeong [3, 7]
Kim, Yewon [8]
Choi, Yunjeong [4]
Kim, Sungjin [4]
Choi, Hosik [3]
Song, Kyungwoo [1, 5]
Affiliations
[1] Yonsei Univ, Dept Stat & Data Sci, 50 Yonsei Ro, Seoul 03722, South Korea
[2] Univ Wisconsin Madison, Dept Comp Sci, Madison, WI USA
[3] Univ Seoul, Dept Urban Big Data Convergence, 163 Seoulsiripdaero, Seoul 02504, South Korea
[4] Korea Inst Sci & Technol Informat, Technol Commercializat Res Ctr, Seoul 02456, South Korea
[5] Yonsei Univ, Dept Appl Stat, 50 Yonsei Ro, Seoul 03722, South Korea
[6] WOORI BANK, Seoul, South Korea
[7] KT, Seoul, South Korea
[8] Univ Seoul, Dept Artificial Intelligence, Seoul, South Korea
Source
JOURNAL OF SUPERCOMPUTING | 2025 / Vol. 81 / Issue 01
Funding
National Research Foundation of Singapore;
Keywords
Natural language processing; Language model; BERT; Technology commercialization; Keyword extraction;
DOI
10.1007/s11227-024-06597-6
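Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract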
Pre-trained language models (LMs) have shown remarkable success in diverse tasks and domains. An LM trained on documents from a specific area (e.g., biomedicine, education, or finance) provides expert-level knowledge about that domain, and there have been many efforts to develop such domain-specific LMs. Despite the potential benefits, however, the development of an LM for the technology commercialization (TC) domain has not been investigated. In this study, we build a TC-specialized large LM pre-trained on a Korean TC corpus. First, we collect a large-scale dataset containing 199,857,586 general Korean sentences and 17,562,751 TC-related Korean sentences. Second, based on this large dataset, we pre-train a Transformer-based language model, resulting in TC-BERT. Third, we validate TC-BERT on three practical applications: document classification, keyword extraction, and a recommender system. For this, we devise a new keyword extraction algorithm and propose a document recommender algorithm based on TC-BERT's document embeddings. Through various quantitative and qualitative experiments, we comprehensively verify TC-BERT's effectiveness and its applications.
Pages: 20
Related papers
50 in total
  • [11] Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing
    Tiedemann, Jorg
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2014, PT I, 2014, 8403 : 102 - 112
  • [12] STRATEGIC MANAGEMENT OF A LARGE-SCALE TECHNOLOGY DEVELOPMENT - THE CASE OF THE KOREAN TELECOMMUNICATIONS INDUSTRY
    LEE, J
    BAE, ZT
    LEE, J
    JOURNAL OF ENGINEERING AND TECHNOLOGY MANAGEMENT, 1994, 11 (02) : 149 - 170
  • [13] Implementation of a large-scale language model adaptation in a cloud environment
    Kwang-Ho Kim
    Dae-Young Jung
    Donghyun Lee
    Hyuk-Jun Lee
    Sung-Yong Park
    Myoung-Wan Koo
    Ji-Hwan Kim
    Jeong-sik Park
    Hyung-Bae Jeon
    Yun-Keun Lee
    Multimedia Tools and Applications, 2016, 75 : 5029 - 5045
  • [14] Implementation of a large-scale language model adaptation in a cloud environment
    Kim, Kwang-Ho
    Jung, Dae-Young
    Lee, Donghyun
    Lee, Hyuk-Jun
    Park, Sung-Yong
    Koo, Myoung-Wan
    Kim, Ji-Hwan
    Park, Jeong-sik
    Jeon, Hyung-Bae
    Lee, Yun-Keun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2016, 75 (09) : 5029 - 5045
  • [15] Large-scale distributed language modeling
    Emami, Ahmad
    Papineni, Kishore
    Sorensen, Jeffrey
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 37 - +
  • [16] Finding a representative subset from large-scale documents
    Zhang, Jin
    Liu, Guannan
    Ren, Ming
    JOURNAL OF INFORMETRICS, 2016, 10 (03) : 762 - 775
  • [17] A Large-Scale Exploration of Terms of Service Documents on the Web
    Sundareswara, Soundarya Nurani
    Srinath, Mukund
    Wilson, Shomir
    Giles, C. Lee
    PROCEEDINGS OF THE 21ST ACM SYMPOSIUM ON DOCUMENT ENGINEERING (DOCENG '21), 2021,
  • [18] QUERY-BASED COMPOSITION FOR LARGE-SCALE LANGUAGE MODEL IN LVCSR
    Han, Yang
    Zhang, Chenwei
    Li, Xiangang
    Liu, Yi
    Wu, Xihong
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [19] The technology of large-scale CFD simulations
    Gorobets A.V.
    Mathematical Models and Computer Simulations, 2016, 8 (6) : 660 - 670
  • [20] TECHNOLOGY OF LARGE-SCALE SOLAR ENERGETICS
    Strebkov, Demetrius S.
    LIGHT & ENGINEERING, 2008, 16 (04) : 5 - 11