Towards Efficient Pre-Trained Language Model via Feature Correlation Distillation

Cited by: 0
Authors
Huang, Kun [1 ]
Guo, Xin [1 ]
Wang, Meng [1 ]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Knowledge Distillation (KD) has emerged as a promising approach for compressing large Pre-trained Language Models (PLMs). The performance of KD depends on how effectively the knowledge in the teacher model is formulated and transferred to the student model. Prior work mainly focuses on directly aligning the output features of the transformer blocks, which may impose overly strict constraints on the student's learning and complicates training by introducing extra parameters and computational cost. Moreover, our analysis indicates that the relations within self-attention adopted in other works involve additional computational complexity and are easily constrained by the number of attention heads, potentially leading to suboptimal solutions. To address these issues, we propose a novel approach that builds relations directly from the output features. Specifically, we introduce token-level and sequence-level relations concurrently to fully exploit the knowledge of the teacher model. Furthermore, we propose a correlation-based distillation loss that alleviates the exact-match property inherent in the traditional KL-divergence and MSE loss functions. Our method, dubbed FCD, offers a simple yet effective way to compress various architectures (BERT, RoBERTa, and GPT) and model sizes (base and large). Extensive experimental results demonstrate that our distilled, smaller language models significantly outperform existing KD methods across various NLP tasks.
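The record contains no implementation details beyond the abstract, but the core idea it describes (building token-level and sequence-level relation matrices from transformer output features and matching them between teacher and student with a correlation-based loss rather than an exact-match KL/MSE objective) can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: hidden states of shape [batch, seq_len, dim], cosine-similarity relation matrices, and a Pearson-correlation matching loss; the helper names (token_relation, sequence_relation, fcd_loss) are illustrative and are not the paper's actual implementation.

```python
# Minimal, illustrative sketch of feature-correlation distillation.
# Assumptions: hidden states are [batch, seq_len, dim]; the paper's exact
# relation and loss definitions may differ from these choices.
import torch
import torch.nn.functional as F


def token_relation(h):
    # Token-level relation: pairwise cosine similarity between tokens
    # within each sequence.
    h = F.normalize(h, dim=-1)              # [B, T, D]
    return h @ h.transpose(-1, -2)          # [B, T, T]


def sequence_relation(h):
    # Sequence-level relation: pairwise cosine similarity between
    # mean-pooled sequence representations in the batch.
    s = F.normalize(h.mean(dim=1), dim=-1)  # [B, D]
    return s @ s.t()                        # [B, B]


def correlation_loss(r_student, r_teacher):
    # Correlation-based matching: 1 - Pearson correlation between the
    # flattened relation matrices, which is invariant to scale and shift
    # and therefore softer than an exact-match MSE or KL objective.
    rs = r_student.flatten(1).float()
    rt = r_teacher.flatten(1).float()
    rs = rs - rs.mean(dim=1, keepdim=True)
    rt = rt - rt.mean(dim=1, keepdim=True)
    corr = (rs * rt).sum(dim=1) / (rs.norm(dim=1) * rt.norm(dim=1) + 1e-8)
    return (1.0 - corr).mean()


def fcd_loss(student_hidden, teacher_hidden):
    # Combine token-level and sequence-level relation distillation.
    teacher_hidden = teacher_hidden.detach()  # no gradient to the teacher
    loss_tok = correlation_loss(token_relation(student_hidden),
                                token_relation(teacher_hidden))
    loss_seq = correlation_loss(sequence_relation(student_hidden),
                                sequence_relation(teacher_hidden))
    return loss_tok + loss_seq
```

In a distillation loop, such a loss would be added to the student's usual task loss; how the paper weights the token-level and sequence-level terms is not specified here, so the unweighted sum above is only an assumption.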
Pages: 15