Towards Efficient Pre-Trained Language Model via Feature Correlation Distillation

Cited by: 0
Authors
Huang, Kun [1 ]
Guo, Xin [1 ]
Wang, Meng [1 ]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Knowledge Distillation (KD) has emerged as a promising approach for compressing large Pre-trained Language Models (PLMs). The performance of KD depends on how effectively the knowledge of the teacher model is formulated and transferred to the student model. Prior work mainly focuses on directly aligning the output features of the Transformer blocks, which may impose overly strict constraints on the student's learning and complicate training by introducing extra parameters and computational cost. Moreover, our analysis indicates that the self-attention relations adopted in other works involve higher computational complexity and are easily constrained by the number of attention heads, potentially leading to suboptimal solutions. To address these issues, we propose a novel approach that builds relations directly from the output features. Specifically, we introduce token-level and sequence-level relations concurrently to fully exploit the knowledge of the teacher model. Furthermore, we propose a correlation-based distillation loss to alleviate the exact-match property inherent in the traditional KL-divergence or MSE loss functions. Our method, dubbed FCD, offers a simple yet effective way to compress various architectures (BERT, RoBERTa, and GPT) and model sizes (base and large). Extensive experimental results demonstrate that our distilled, smaller language models significantly outperform existing KD methods across various NLP tasks.
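The abstract only describes the method at a high level, so the following PyTorch snippet is a minimal, hypothetical sketch of feature-correlation distillation rather than the authors' implementation: the relation definitions (cosine similarities over last-block hidden states), the Pearson-style correlation loss, and all function names (token_relation, sequence_relation, correlation_loss, fcd_loss) are assumptions introduced here for illustration.

# Minimal sketch of feature-correlation distillation, assuming cosine-similarity
# relations and a Pearson-style correlation loss. Not the paper's exact formulation.
import torch
import torch.nn.functional as F


def token_relation(h: torch.Tensor) -> torch.Tensor:
    # Token-level relation: pairwise cosine similarity between the tokens of each
    # sequence. h: [batch, seq_len, dim] hidden states. Returns [batch, seq_len, seq_len].
    h = F.normalize(h, dim=-1)
    return h @ h.transpose(-1, -2)


def sequence_relation(h: torch.Tensor) -> torch.Tensor:
    # Sequence-level relation: pairwise cosine similarity between mean-pooled
    # sequence representations across the batch. Returns [batch, batch].
    s = F.normalize(h.mean(dim=1), dim=-1)
    return s @ s.t()


def correlation_loss(r_s: torch.Tensor, r_t: torch.Tensor) -> torch.Tensor:
    # 1 - Pearson correlation between student and teacher relation matrices.
    # Unlike MSE or KL divergence, this only asks the student's relations to
    # co-vary linearly with the teacher's, relaxing the exact-match constraint.
    x = r_s.flatten() - r_s.mean()
    y = r_t.flatten() - r_t.mean()
    return 1.0 - (x * y).sum() / (x.norm() * y.norm() + 1e-8)


def fcd_loss(h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
    # Combine token-level and sequence-level correlation distillation terms.
    tok = correlation_loss(token_relation(h_student), token_relation(h_teacher.detach()))
    seq = correlation_loss(sequence_relation(h_student), sequence_relation(h_teacher.detach()))
    return tok + seq


if __name__ == "__main__":
    # Toy check: the student hidden size may differ from the teacher's; relations
    # are built within each model, so no projection layer (no extra parameters) is needed.
    h_s = torch.randn(4, 16, 384, requires_grad=True)  # student hidden states
    h_t = torch.randn(4, 16, 768)                      # teacher hidden states
    loss = fcd_loss(h_s, h_t)
    loss.backward()
    print(float(loss))

Because the relations are computed inside each model before comparison, student and teacher hidden sizes can differ without any learned projection, and the correlation loss only requires the student's relation pattern to track the teacher's rather than reproduce it exactly.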
Pages: 15
Related papers
50 records in total
  • [1] Knowledge Base Grounded Pre-trained Language Models via Distillation
    Sourty, Raphael
    Moreno, Jose G.
    Servant, Francois-Paul
    Tamine, Lynda
    39TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2024, 2024, : 1617 - 1625
  • [2] Domain Knowledge Transferring for Pre-trained Language Model via Calibrated Activation Boundary Distillation
    Choi, Dongha
    Choi, HongSeok
    Lee, Hyunju
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 1658 - 1669
  • [3] Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression
    Yang, Zhao
    Zhang, Yuanzhe
    Sui, Dianbo
    Ju, Yiming
    Zhao, Jun
    Liu, Kang
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (02)
  • [4] Dynamic Knowledge Distillation for Pre-trained Language Models
    Li, Lei
    Lin, Yankai
    Ren, Shuhuai
    Li, Peng
    Zhou, Jie
    Sun, Xu
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 379 - 389
  • [5] Hyperbolic Pre-Trained Language Model
    Chen, Weize
    Han, Xu
    Lin, Yankai
    He, Kaichen
    Xie, Ruobing
    Zhou, Jie
    Liu, Zhiyuan
    Sun, Maosong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3101 - 3112
  • [6] MERGEDISTILL: Merging Pre-trained Language Models using Distillation
    Khanuja, Simran
    Johnson, Melvin
    Talukdar, Partha
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2874 - 2887
  • [7] Towards Efficient Post-training Quantization of Pre-trained Language Models
    Bai, Haoli
    Hou, Lu
    Shang, Lifeng
    Jiang, Xin
    King, Irwin
    Lyu, Michael R.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [8] AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation
    Zhou, Qinhong
    Li, Peng
    Liu, Yang
    Guan, Yuyang
    Xing, Qizhou
    Chen, Ming
    Sun, Maosong
    Liu, Yang
    AI OPEN, 2023, 4 : 56 - 63
  • [9] Efficient feature selection for pre-trained vision transformers
    Huang, Lan
    Zeng, Jia
    Yu, Mengqiang
    Ding, Weiping
    Bai, Xingyu
    Wang, Kangping
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2025, 254
  • [10] Pre-trained Language Model Representations for Language Generation
    Edunov, Sergey
    Baevski, Alexei
    Auli, Michael
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 4052 - 4059