Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 24
Authors
Fang, Zhiyuan [1 ]
Wang, Jianfeng [2 ]
Hu, Xiaowei [2 ]
Wang, Lijuan [2 ]
Yang, Yezhou [1 ]
Liu, Zicheng [2 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Funding
US National Science Foundation;
Keywords
LANGUAGE;
DOI
10.1109/ICCV48922.2021.00146
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few studies aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a large transformer-based VL model into a small one. The major challenge arises from the inconsistent regional visual tokens extracted by the different detectors of the Teacher and the Student, which misaligns their hidden representations and attention distributions. To address this problem, we retrain and adapt the Teacher using the region proposals from the Student's detector, while the features still come from the Teacher's own object detector. With aligned network inputs, the adapted Teacher can transfer knowledge through its intermediate representations. Specifically, we use a mean-square-error loss to mimic the attention distributions inside the transformer blocks, and present a token-wise noise contrastive loss that aligns the hidden states by contrasting them with negative representations stored in a sample queue. We show that the proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering: it reaches a CIDEr score of 120.8 on COCO captioning, an improvement of 5.1 over its non-distilled counterpart, and an accuracy of 69.8 on VQA 2.0, a 0.8 gain over the baseline. Extensive experiments and ablations confirm the effectiveness of VL distillation in both the pre-training and fine-tuning stages.
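The two intermediate losses named in the abstract lend themselves to a compact sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: it shows an attention-map MSE and a token-wise noise contrastive loss with a FIFO negative queue, where `dim`, `queue_size`, and `temperature` are illustrative defaults rather than the paper's settings.

```python
import torch
import torch.nn.functional as F


def attention_mse_loss(student_attn: torch.Tensor,
                       teacher_attn: torch.Tensor) -> torch.Tensor:
    """MSE between Student and Teacher attention distributions
    (batch x heads x tokens x tokens); the Teacher is frozen."""
    return F.mse_loss(student_attn, teacher_attn.detach())


class TokenNCELoss:
    """Token-wise noise contrastive loss: each Student hidden state is
    matched to the Teacher state of the same token and contrasted
    against negative representations held in a FIFO sample queue.
    Hyperparameters here are illustrative assumptions."""

    def __init__(self, dim: int = 768, queue_size: int = 4096,
                 temperature: float = 0.07):
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=1)
        self.ptr = 0
        self.temperature = temperature

    def __call__(self, student_h: torch.Tensor,
                 teacher_h: torch.Tensor) -> torch.Tensor:
        # Flatten (batch, tokens, dim) -> (batch*tokens, dim), L2-normalize.
        s = F.normalize(student_h.reshape(-1, student_h.size(-1)), dim=1)
        k = F.normalize(teacher_h.reshape(-1, teacher_h.size(-1)), dim=1).detach()
        pos = (s * k).sum(dim=1, keepdim=True)   # positive: same-token pair
        neg = s @ self.queue.t()                 # negatives: queued states
        logits = torch.cat([pos, neg], dim=1) / self.temperature
        labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive at 0
        loss = F.cross_entropy(logits, labels)
        self._enqueue(k)
        return loss

    def _enqueue(self, keys: torch.Tensor) -> None:
        # FIFO replacement of the oldest negatives with fresh Teacher states.
        idx = (self.ptr + torch.arange(keys.size(0))) % self.queue.size(0)
        self.queue[idx] = keys
        self.ptr = int(idx[-1].item() + 1) % self.queue.size(0)
```

In use, the total distillation term would simply sum the two, e.g. `attention_mse_loss(s_attn, t_attn) + token_nce(s_hidden, t_hidden)`; the queue design mirrors common contrastive-learning practice of decoupling the number of negatives from the batch size.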
Pages: 1408-1418
Page count: 11
Related papers
50 records in total
  • [22] Visual-Linguistic Alignment and Composition for Image Retrieval with Text Feedback
    Li, Dafeng
    Zhu, Yingying
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 108-113
  • [23] Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
    Yang, Xu
    Zhang, Hanwang
    Gao, Chongyang
    Cai, Jianfei
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131(01): 82-100
  • [24] AdsCVLR: Commercial Visual-Linguistic Representation Modeling in Sponsored Search
    Zhu, Yongjie
    Han, Chunhui
    Zhan, Yuefeng
    Pang, Bochen
    Li, Zhaoju
    Sun, Hao
    Li, Si
    Shi, Boxin
    Duan, Nan
    Deng, Weiwei
    Zhang, Ruofei
    Zhang, Liangjie
    Zhang, Qi
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022
  • [25] VLMAH: Visual-Linguistic Modeling of Action History for Effective Action Anticipation
    Manousaki, Victoria
    Bacharidis, Konstantinos
    Papoutsakis, Konstantinos
    Argyros, Antonis
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023: 1909-1919
  • [27] Faster Zero-shot Multi-modal Entity Linking via Visual-Linguistic Representation
    Zheng, Qiushuo
    Wen, Hao
    Wang, Meng
    Qi, Guilin
    Bai, Chaoyu
DATA INTELLIGENCE, 2022, 4(03): 493-508
  • [28] LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering
    Jiang, Jingjing
    Liu, Ziyi
    Zheng, Nanning
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 5002-5013
  • [29] Weakly-Supervised Grounding for VQA with Dual Visual-Linguistic Interaction
    Liu, Yi
    Pan, Junwen
    Wang, Qilong
    Chen, Guanlin
    Nie, Weiguo
    Zhang, Yudong
    Gao, Qian
    Hu, Qinghua
    Zhu, Pengfei
ARTIFICIAL INTELLIGENCE, CICAI 2023, PT I, 2024, 14473: 156-169
  • [30] Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation
    Huang, Yating
    Hao, Yunzhe
    Xu, Jiaming
    Xu, Bo
NEURAL NETWORKS, 2022, 154: 13-21