Contrastive Distillation on Intermediate Representations for Language Model Compression

Cited by: 0
Authors
Sun, Siqi [1 ]
Gan, Zhe [1 ]
Cheng, Yu [1 ]
Fang, Yuwei [1 ]
Wang, Shuohang [1 ]
Liu, Jingjing [1 ]
Affiliations
[1] Microsoft Dynamics 365 Research, Redmond, WA 98008, USA
Keywords
DOI: Not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Classification Codes: 081104; 0812; 0835; 1405
Abstract
Existing language model compression methods mostly use a simple L2 loss to distill knowledge from the intermediate representations of a large BERT model into a smaller one. Although widely used, this objective by design assumes that all dimensions of the hidden representations are independent, and thus fails to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework in which the student is trained to distill knowledge from the teacher's intermediate layers via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR enables the student to exploit the rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both the pre-training and fine-tuning stages, and it achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
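To make the contrastive objective described in the abstract concrete, the following is a minimal sketch of an InfoNCE-style loss over pooled intermediate representations. It assumes mean-pooled layer summaries, linear projection heads, and a pre-computed bank of teacher negatives; the names (contrastive_distillation_loss, proj_s, proj_t, temperature) are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_hidden, teacher_hidden,
                                  negative_bank, proj_s, proj_t,
                                  temperature=0.1):
    # student_hidden: [batch, seq_len, dim_s] intermediate states of the student
    # teacher_hidden: [batch, seq_len, dim_t] intermediate states of the teacher
    # negative_bank:  [num_neg, dim_t] pooled teacher summaries of other samples
    s = student_hidden.mean(dim=1)                       # [batch, dim_s]
    t = teacher_hidden.mean(dim=1)                       # [batch, dim_t]

    # Project both views into a shared space and L2-normalize.
    z_s = F.normalize(proj_s(s), dim=-1)                 # [batch, d]
    z_t = F.normalize(proj_t(t), dim=-1)                 # [batch, d]
    z_neg = F.normalize(proj_t(negative_bank), dim=-1)   # [num_neg, d]

    pos = (z_s * z_t).sum(dim=-1, keepdim=True)          # positive-pair score
    neg = z_s @ z_neg.t()                                 # scores vs. negatives

    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)            # positive is class 0
    return F.cross_entropy(logits, labels)

# Illustrative usage with random tensors and linear projection heads.
proj_s = torch.nn.Linear(256, 128)
proj_t = torch.nn.Linear(768, 128)
loss = contrastive_distillation_loss(torch.randn(8, 32, 256),
                                     torch.randn(8, 32, 768),
                                     torch.randn(100, 768), proj_s, proj_t)

In this sketch the positive pair is the teacher's summary of the same input, while negatives are teacher summaries of other training examples (in practice drawn from a large memory bank), so the student is pushed to match the overall structure of the teacher's hidden representation rather than each dimension independently.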
Pages: 498-508
Number of pages: 11
Related Papers
50 records in total
  • [31] INTERMEDIATE SYMMETRY, THE SUPERPOSITION MODEL AND INDUCED REPRESENTATIONS
    NEWMAN, DJ
    NG, B
    MOLECULAR PHYSICS, 1987, 61 (06) : 1443 - 1453
  • [32] Language model representations for the GOPOLIS database
    Zibert, J
    Gros, J
    Dobrisek, S
    Mihelic, F
    TEXT, SPEECH AND DIALOGUE, 1999, 1692 : 380 - 383
  • [33] COMPRESSION DISTILLATION
    [Anonymous]
    INDUSTRIAL AND ENGINEERING CHEMISTRY, 1945, 37 (12): 5 - &
  • [34] LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
    Fu, Hao
    Zhou, Shaojun
    Yang, Qihong
    Tang, Junjie
    Liu, Guiquan
    Liu, Kaikui
    Li, Xiaolong
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 12830 - 12838
  • [35] Compression of Acoustic Model via Knowledge Distillation and Pruning
    Li, Chenxing
    Zhu, Lei
    Xu, Shuang
    Gao, Peng
    Xu, Bo
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 2785 - 2790
  • [36] On Contrastive Representations of Stochastic Processes
    Mathieu, Emile
    Foster, Adam
    Teh, Yee Whye
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [37] Quantization via Distillation and Contrastive Learning
    Pei, Zehua
    Yao, Xufeng
    Zhao, Wenqian
    Yu, Bei
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (12) : 17164 - 17176
  • [38] Multilingual Representation Distillation with Contrastive Learning
    Tan, Weiting
    Heffernan, Kevin
    Schwenk, Holger
    Koehn, Philipp
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 1477 - 1490
  • [39] Pixel-Wise Contrastive Distillation
    Huang, Junqiang
    Guo, Zichao
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16313 - 16323
  • [40] STABLE KNOWLEDGE TRANSFER FOR CONTRASTIVE DISTILLATION
    Tang, Qiankun
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 4995 - 4999