Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 24
Authors
Fang, Zhiyuan [1 ]
Wang, Jianfeng [2 ]
Hu, Xiaowei [2 ]
Wang, Lijuan [2 ]
Yang, Yezhou [1 ]
Liu, Zicheng [2 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Funding
U.S. National Science Foundation;
Keywords
LANGUAGE;
DOI
10.1109/ICCV48922.2021.00146
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer-based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted by the different detectors of Teacher and Student, resulting in misaligned hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher using the same region proposals as the Student's detector, while the features still come from the Teacher's own object detector. With aligned network inputs, the adapted Teacher can transfer knowledge through the intermediate representations. Specifically, we use a mean-square-error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden states by contrasting them with negative representations stored in a sample queue. We show that the proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches a CIDEr score of 120.8 on COCO captioning, an improvement of 5.1 over its non-distilled counterpart, and an accuracy of 69.8 on VQA 2.0, a 0.8 gain over the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both the pre-training and fine-tuning stages.
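For concreteness, here is a minimal PyTorch sketch of the two distillation losses described in the abstract: the mean-square-error loss on attention distributions and the token-wise noise contrastive loss against a queue of negative representations. The tensor shapes, the projection head proj, and the queue handling are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def attention_mse_loss(student_attn, teacher_attn):
    # Both: (batch, heads, seq_len, seq_len), softmax-normalized attention maps.
    # Comparing them elementwise is only meaningful because the adapted Teacher
    # consumes the same region proposals as the Student, so the token sequences
    # (and hence attention maps) are aligned.
    return F.mse_loss(student_attn, teacher_attn)

def token_nce_loss(student_hidden, teacher_hidden, neg_queue, proj, tau=0.07):
    # student_hidden: (batch, seq_len, dim_s); teacher_hidden: (batch, seq_len, dim_t)
    # neg_queue: (queue_size, dim_t), teacher states cached from earlier batches.
    # proj: hypothetical linear layer mapping dim_s -> dim_t.
    b, s, _ = student_hidden.shape
    q = F.normalize(proj(student_hidden).reshape(b * s, -1), dim=-1)  # queries
    k_pos = F.normalize(teacher_hidden.reshape(b * s, -1), dim=-1)    # positives
    k_neg = F.normalize(neg_queue, dim=-1)                            # negatives

    pos_logits = (q * k_pos).sum(dim=-1, keepdim=True)  # (b*s, 1)
    neg_logits = q @ k_neg.t()                          # (b*s, queue_size)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / tau
    # The positive key sits at index 0 of every row.
    labels = torch.zeros(b * s, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

# Example usage with hypothetical dimensions:
# proj = torch.nn.Linear(384, 768)
# loss = attention_mse_loss(s_attn, t_attn) + token_nce_loss(s_h, t_h, queue, proj)

In this sketch the negative queue would be refreshed first-in-first-out with teacher hidden states from previous batches, in the spirit of MoCo-style memory queues; the abstract's "sample queue" plays the same role.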
Pages: 1408 - 1418
Page count: 11
Related papers
50 entries in total
  • [41] STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
    Su, Rui
    Yu, Qian
    Xu, Dong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 1513 - 1522
  • [42] Distilling a Powerful Student Model via Online Knowledge Distillation
    Li, Shaojie
    Lin, Mingbao
    Wang, Yan
    Wu, Yongjian
    Tian, Yonghong
    Shao, Ling
    Ji, Rongrong
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (11) : 8743 - 8752
  • [43] Model Compression Algorithm via Reinforcement Learning and Knowledge Distillation
    Liu, Botao
    Hu, Bing-Bing
    Zhao, Ming
    Peng, Sheng-Lung
    Chang, Jou-Ming
    Tsoulos, Ioannis G.
    MATHEMATICS, 2023, 11 (22)
  • [44] PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation
    Kim, Jangho
    Chang, Simyung
    Kwak, Nojun
    INTERSPEECH 2021, 2021: 4568 - 4572
  • [45] DeepVID: Deep Visual Interpretation and Diagnosis for Image Classifiers via Knowledge Distillation
    Wang, Junpeng
    Gou, Liang
    Zhang, Wei
    Yang, Hao
    Shen, Han-Wei
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2019, 25 (06) : 2168 - 2180
  • [46] VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
    Tian, Changyao
    Wang, Wenhai
    Zhu, Xizhou
    Dai, Jifeng
    Qiao, Yu
    COMPUTER VISION, ECCV 2022, PT XXV, 2022, 13685 : 73 - 91
  • [47] Compressing Transfer: Mutual Learning-Empowered Knowledge Distillation for Temporal Knowledge Graph Reasoning
    Qian, Ye
    Wang, Xiaoyan
    Sun, Fuhui
    Pan, Li
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025
  • [48] Visual Grounding With Dual Knowledge Distillation
    Wu, Wansen
    Cao, Meng
    Hu, Yue
    Peng, Yong
    Qin, Long
    Yin, Quanjun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 10399 - 10410
  • [49] Locally controllable network based on visual-linguistic relation alignment for text-to-image generation
    Li, Zaike
    Liu, Li
    Zhang, Huaxiang
    Liu, Dongmei
    Song, Yu
    Li, Boqun
    MULTIMEDIA SYSTEMS, 2024, 30 (01)
  • [50] Deep visual-linguistic fusion network considering cross-modal inconsistency for rumor detection
    Yang, Yang
    Bao, Ran
    Guo, Weili
    Zhan, De-Chuan
    Yin, Yilong
    Yang, Jian
    SCIENCE CHINA INFORMATION SCIENCES, 2023, 66 (12) : 16 - 32