Gradual Syntactic Label Replacement for Language Model Pre-Training

Cited by: 0
Authors
Wang, Yile [1 ]
Zhang, Yue [2 ]
Li, Peng [1 ]
Liu, Yang [3 ]
Affiliations
[1] Tsinghua Univ, Inst AI Ind Res, Beijing 100084, Peoples R China
[2] Westlake Univ, Sch Engn, Hangzhou 310024, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Language model pre-training; syntactic label replacement; curriculum learning; data-centric;
DOI
10.1109/TASLP.2023.3331096
Chinese Library Classification
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Pre-training serves as the foundation of recent NLP models, where language modeling tasks are performed over large corpora. Typical models such as BERT and GPT take the corpus as a whole and treat every word equally during language modeling. However, recent work shows that the frequency bias naturally present in raw corpora may limit the power of the language model. In this article, we propose a multi-stage training strategy that gradually enlarges the training vocabulary by modifying the training data. Specifically, we use syntactic structure as a bridge for infrequent words, replacing them with their corresponding syntactic labels and later recovering their original lexical surface for further training. This strategy yields an easy-to-hard curriculum learning process: the model first learns the most common words and basic syntactic concepts, and then recognizes a large number of uncommon words through their specific usages and the previously learned category knowledge. Experimental results show that this method improves the performance of both discriminative and generative pre-trained language models on benchmarks and various downstream tasks.
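To make the replacement idea concrete, the sketch below shows one way the data modification described in the abstract could be staged: tokens whose corpus frequency falls below a per-stage threshold are swapped for a syntactic label, and the threshold is relaxed stage by stage until the original surface forms return. The toy corpus, the min_count thresholds, and the hand-written pos_of tag map are illustrative assumptions only; the paper itself would obtain the labels from an actual syntactic tagger or parser.

```python
from collections import Counter

# A minimal sketch of staged syntactic-label replacement, assuming a toy corpus
# and a hand-written POS lookup (a real tagger would supply these labels).
corpus = [
    "the cat sat on the mat",
    "the dog chased the ferret across the meadow",
    "a cat and a dog met near the mat",
]
pos_of = {  # hypothetical tag map covering the rare words of this toy corpus
    "chased": "[VERB]", "ferret": "[NOUN]", "across": "[ADP]",
    "meadow": "[NOUN]", "met": "[VERB]", "near": "[ADP]",
}

# Corpus-level token frequencies decide which words count as "infrequent".
freq = Counter(tok for line in corpus for tok in line.split())

def replace_rare(line: str, min_count: int) -> str:
    """Replace tokens seen fewer than min_count times with their syntactic label."""
    return " ".join(
        pos_of[tok] if freq[tok] < min_count and tok in pos_of else tok
        for tok in line.split()
    )

# Easy-to-hard curriculum: early stages use a high threshold (small vocabulary,
# many syntactic labels); the final stage restores every original surface form.
for stage, threshold in enumerate([3, 2, 1], start=1):
    print(f"--- stage {stage} (min_count={threshold}) ---")
    for line in corpus:
        print(replace_rare(line, threshold))
```

In a full setup, the same staging would presumably be applied to the pre-training data fed to BERT- or GPT-style models, with the syntactic labels added to the tokenizer vocabulary before the original surface forms are gradually reintroduced.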
Pages: 486 - 496
Number of pages: 11