Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 24
Authors
Fang, Zhiyuan [1]
Wang, Jianfeng [2]
Hu, Xiaowei [2]
Wang, Lijuan [2]
Yang, Yezhou [1]
Liu, Zicheng [2]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Funding
U.S. National Science Foundation;
Keywords
LANGUAGE;
DOI
10.1109/ICCV48922.2021.00146
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer-based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean squared error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.
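The two distillation losses named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of the Student token as the anchor with the Teacher token at the same position as the positive, the cosine similarity, and the temperature value are all assumptions made for illustration. The abstract only states that a mean squared error loss is applied to attention distributions and that a token-wise noise contrastive loss uses negatives from a sample queue.

```python
import math


def attention_mse(teacher_attn, student_attn):
    """Mean squared error between two attention maps given as lists of rows."""
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_attn, student_attn):
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    return total / count


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_nce_loss(student_tokens, teacher_tokens, queue, temperature=0.1):
    """InfoNCE-style loss averaged over token positions.

    Assumption: for each position, the Teacher hidden state is the positive
    and every vector in `queue` is a negative.
    """
    negatives = [_normalize(n) for n in queue]
    losses = []
    for s, t in zip(student_tokens, teacher_tokens):
        s, t = _normalize(s), _normalize(t)
        # Index 0 holds the positive logit; the rest are negatives.
        logits = [_dot(s, t) / temperature]
        logits += [_dot(s, n) / temperature for n in negatives]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[0])
    return sum(losses) / len(losses)
```

In this sketch the contrastive loss is near zero when each Student token matches its Teacher counterpart and is far from the queued negatives, and it grows when a Student token is closer to a queued negative than to its positive. How the two losses are weighted and how the queue is refreshed are not specified in the abstract and are left out here.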
Pages: 1408-1418
Page count: 11