Gradual Syntactic Label Replacement for Language Model Pre-Training

Cited by: 0
Authors
Wang, Yile [1 ]
Zhang, Yue [2 ]
Li, Peng [1 ]
Liu, Yang [3 ]
Affiliations
[1] Tsinghua Univ, Inst AI Ind Res, Beijing 100084, Peoples R China
[2] Westlake Univ, Sch Engn, Hangzhou 310024, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Language model pre-training; syntactic label replacement; curriculum learning; data-centric;
DOI
10.1109/TASLP.2023.3331096
Chinese Library Classification
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Pre-training serves as the foundation of recent NLP models, where language modeling tasks are performed over large corpora. Typical models such as BERT and GPT take the corpus as a whole and treat every word equally during language modeling. However, recent work shows that the frequency bias naturally present in raw corpora may limit the power of the language model. In this article, we propose a multi-stage training strategy that gradually enlarges the training vocabulary by modifying the training data. Specifically, we use syntactic structure as a bridge for infrequent words, replacing them with their corresponding syntactic labels and later recovering their original lexical surface for further training. This strategy yields an easy-to-hard curriculum learning process: the model first learns the most common words and basic syntactic concepts, and then recognizes a large number of uncommon words through their specific usages and the previously learned category knowledge. Experimental results show that this method improves the performance of both discriminative and generative pre-trained language models on benchmarks and various downstream tasks.
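To make the replacement idea concrete, the sketch below shows one way the data modification described in the abstract could be staged: tokens whose corpus frequency falls below a per-stage threshold are swapped for a syntactic label, and the threshold is relaxed stage by stage until the original surface forms return. The toy corpus, the min_count thresholds, and the hand-written pos_of tag map are illustrative assumptions only; the paper itself would obtain the labels from an actual syntactic tagger or parser.

```python
from collections import Counter

# A minimal sketch of staged syntactic-label replacement, assuming a toy corpus
# and a hand-written POS lookup (a real tagger would supply these labels).
corpus = [
    "the cat sat on the mat",
    "the dog chased the ferret across the meadow",
    "a cat and a dog met near the mat",
]
pos_of = {  # hypothetical tag map covering the rare words of this toy corpus
    "chased": "[VERB]", "ferret": "[NOUN]", "across": "[ADP]",
    "meadow": "[NOUN]", "met": "[VERB]", "near": "[ADP]",
}

# Corpus-level token frequencies decide which words count as "infrequent".
freq = Counter(tok for line in corpus for tok in line.split())

def replace_rare(line: str, min_count: int) -> str:
    """Replace tokens seen fewer than min_count times with their syntactic label."""
    return " ".join(
        pos_of[tok] if freq[tok] < min_count and tok in pos_of else tok
        for tok in line.split()
    )

# Easy-to-hard curriculum: early stages use a high threshold (small vocabulary,
# many syntactic labels); the final stage restores every original surface form.
for stage, threshold in enumerate([3, 2, 1], start=1):
    print(f"--- stage {stage} (min_count={threshold}) ---")
    for line in corpus:
        print(replace_rare(line, threshold))
```

In a full setup, the same staging would presumably be applied to the pre-training data fed to BERT- or GPT-style models, with the syntactic labels added to the tokenizer vocabulary before the original surface forms are gradually reintroduced.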
Pages: 486 - 496
Number of pages: 11