Towards Efficient Pre-Trained Language Model via Feature Correlation Distillation

Cited by: 0
Authors
Huang, Kun [1 ]
Guo, Xin [1 ]
Wang, Meng [1 ]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
DOI
Not available
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Knowledge Distillation (KD) has emerged as a promising approach for compressing large Pre-trained Language Models (PLMs). The performance of KD depends on how effectively knowledge is formulated and transferred from the teacher model to the student model. Prior work mainly focuses on directly aligning the output features of transformer blocks, which may impose overly strict constraints on the student model's learning and complicates training by introducing extra parameters and computational cost. Moreover, our analysis indicates that the relations within self-attention, as adopted in other works, incur additional computational complexity and are easily constrained by the number of attention heads, potentially leading to suboptimal solutions. To address these issues, we propose a novel approach that builds relations directly from output features. Specifically, we introduce token-level and sequence-level relations concurrently to fully exploit the knowledge in the teacher model. Furthermore, we propose a correlation-based distillation loss to alleviate the exact-match property inherent in traditional KL-divergence or MSE loss functions. Our method, dubbed FCD, offers a simple yet effective way to compress various architectures (BERT, RoBERTa, and GPT) and model sizes (base and large). Extensive experimental results demonstrate that our distilled, smaller language models significantly outperform existing KD methods across various NLP tasks.
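The abstract describes the method only at a high level, so the following is a minimal PyTorch sketch of what a correlation-based feature distillation loss along these lines could look like. Everything here is an assumption rather than the paper's actual formulation: the function names (token_relation, sequence_relation, fcd_style_loss), the choice of cosine similarity for building relations, and the Pearson-correlation objective are illustrative stand-ins for details the abstract does not specify.

```python
# Hypothetical sketch of correlation-based feature distillation:
# relations are built directly from transformer output features at the
# token level (within a sequence) and the sequence level (across a
# batch), then matched with a correlation loss instead of exact-match
# MSE/KL. Not the paper's actual code or API.
import torch
import torch.nn.functional as F

def token_relation(h):
    # h: [batch, seq_len, hidden]; cosine similarity between every
    # pair of token features within each sequence.
    h = F.normalize(h, dim=-1)
    return h @ h.transpose(1, 2)            # [batch, seq_len, seq_len]

def sequence_relation(h):
    # Mean-pool tokens into one vector per sequence, then relate
    # sequences to each other across the batch.
    s = F.normalize(h.mean(dim=1), dim=-1)  # [batch, hidden]
    return s @ s.T                          # [batch, batch]

def correlation_loss(r_s, r_t):
    # 1 - Pearson correlation between corresponding rows of the
    # student/teacher relation matrices: penalizes mismatched
    # relational patterns rather than demanding exact-value matches.
    r_s = r_s.flatten(1) - r_s.flatten(1).mean(dim=1, keepdim=True)
    r_t = r_t.flatten(1) - r_t.flatten(1).mean(dim=1, keepdim=True)
    return (1.0 - F.cosine_similarity(r_s, r_t, dim=1)).mean()

def fcd_style_loss(h_student, h_teacher, alpha=1.0, beta=1.0):
    # Combine token-level and sequence-level relation matching;
    # alpha/beta are illustrative weighting hyperparameters.
    tok = correlation_loss(token_relation(h_student), token_relation(h_teacher))
    seq = correlation_loss(sequence_relation(h_student), sequence_relation(h_teacher))
    return alpha * tok + beta * seq

# Example: hidden widths may differ between student and teacher.
h_t = torch.randn(8, 128, 768)   # teacher output features
h_s = torch.randn(8, 128, 384)   # student output features
loss = fcd_style_loss(h_s, h_t)
```

One property worth noting in this sketch: the token-level relation matrix is seq_len x seq_len and the sequence-level one is batch x batch, so their shapes are independent of hidden size and the student needs no learned projection onto the teacher's dimension. This matches the abstract's motivation of avoiding the extra parameters and computation that direct feature alignment introduces.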
Pages: 15