Compressing Visual-linguistic Model via Knowledge Distillation

Citations: 24
Authors
Fang, Zhiyuan [1 ]
Wang, Jianfeng [2 ]
Hu, Xiaowei [2 ]
Wang, Lijuan [2 ]
Yang, Yezhou [1 ]
Liu, Zicheng [2 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Funding
US National Science Foundation
Keywords
LANGUAGE
DOI
10.1109/ICCV48922.2021.00146
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Despite exciting progress in pre-training for visual-linguistic (VL) representations, very little work has aimed at a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a large transformer-based VL model into a small one. The major challenge arises from the inconsistent regional visual tokens extracted by the different detectors of the Teacher and the Student, which misaligns their hidden representations and attention distributions. To address this, we retrain and adapt the Teacher to use the same region proposals as the Student's detector, while the features still come from the Teacher's own object detector. With aligned network inputs, the adapted Teacher can transfer knowledge through the intermediate representations. Specifically, we use a mean-squared-error loss to mimic the attention distributions inside the transformer blocks, and present a token-wise noise contrastive loss that aligns the hidden states by contrasting against negative representations stored in a sample queue. We show that the proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering. It reaches a CIDEr score of 120.8 on COCO captioning, an improvement of 5.1 over its non-distilled counterpart, and an accuracy of 69.8 on VQA 2.0, a 0.8 gain over the baseline. Extensive experiments and ablations confirm the effectiveness of VL distillation in both the pre-training and fine-tuning stages.
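The abstract names two intermediate-representation objectives: an MSE loss on attention distributions and a token-wise noise contrastive loss against a queue of negatives. Below is a minimal PyTorch sketch of both, for illustration only: the tensor shapes, function names, queue handling, and the temperature of 0.07 are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def attention_mse_loss(student_attn, teacher_attn):
    """MSE between Student and Teacher attention distributions.

    Both tensors: (batch, heads, seq_len, seq_len), rows already softmaxed.
    This only makes sense once the adapted Teacher consumes the same region
    proposals as the Student, so the token positions line up one-to-one.
    """
    return F.mse_loss(student_attn, teacher_attn)

def token_nce_loss(student_hidden, teacher_hidden, queue, temperature=0.07):
    """Token-wise noise contrastive loss with queued negatives.

    student_hidden, teacher_hidden: (num_tokens, dim) hidden states.
    queue: (queue_size, dim) Teacher representations from earlier batches,
    used as negatives. Each Student token should match its own Teacher
    token (the positive) rather than anything stored in the queue.
    """
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    q = F.normalize(queue, dim=-1)
    pos = (s * t).sum(dim=-1, keepdim=True)        # (num_tokens, 1)
    neg = s @ q.t()                                # (num_tokens, queue_size)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)         # positive sits at index 0
```

In training, these terms would presumably be combined with the task loss, with the queue refreshed from each batch's Teacher states, in the style of MoCo-like contrastive learning.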
Pages: 1408-1418
Page count: 11
Related papers
50 records in total
  • [31] Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context
    Paul, Rohan
    Barbu, Andrei
    Felshin, Sue
    Katz, Boris
    Roy, Nicholas
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4506 - 4514
  • [32] Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation
    Huang, Yating
    Hao, Yunzhe
    Xu, Jiaming
    Xu, Bo
    NEURAL NETWORKS, 2022, 154 : 13 - 21
  • [33] Private Model Compression via Knowledge Distillation
    Wang, Ji
    Bao, Weidong
    Sun, Lichao
    Zhu, Xiaomin
    Cao, Bokai
    Yu, Philip S.
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 1190 - +
  • [34] Monitoring saccades during a visual search task to uncover the time course of visual-linguistic integration.
    Van de Velde, C
    de Almeida, RG
    Galera, C
    von Grunau, MW
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2001, 42 (04) : S619 - S619
  • [35] AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
    Yeo, Jeong Hun
    Kim, Minsu
    Choi, Jeongsoo
    Kim, Dae Hoe
    Ro, Yong Man
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6462 - 6474
  • [36] VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
    Yamazaki, Kashu
    Vo, Khoa
    Truong, Quang Sang
    Raj, Bhiksha
    Le, Ngan
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3081 - 3090
  • [37] Compression of Acoustic Model via Knowledge Distillation and Pruning
    Li, Chenxing
    Zhu, Lei
    Xu, Shuang
    Gao, Peng
    Xu, Bo
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 2785 - 2790
  • [38] Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition
    Yang, Chuanguang
    An, Zhulin
    Zhou, Helong
    Zhuang, Fuzhen
    Xu, Yongjun
    Zhang, Qian
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (08) : 10212 - 10227
  • [39] Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling
    Wang, Xu
    Li, Yifan
    Zhang, Qiudan
    Wu, Wenhui
    Li, Mark Junjie
    Ma, Lin
    Jiang, Jianmin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 11164 - 11175
  • [40] Triangle-Reward Reinforcement Learning: Visual-Linguistic Semantic Alignment for Image Captioning
    Nie, Weizhi
    Li, Jiesi
    Xu, Ning
    Liu, An-An
    Li, Xuanya
    Zhang, Yongdong
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4510 - 4518