Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 24

Authors:

Fang, Zhiyuan [1]

Wang, Jianfeng [2]

Hu, Xiaowei [2]

Wang, Lijuan [2]

Yang, Yezhou [1]

Liu, Zicheng [2]

Affiliations:

[1] Arizona State Univ, Tempe, AZ 85287 USA

[2] Microsoft Corp, Redmond, WA 98052 USA

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

National Science Foundation (USA);

Keywords:

LANGUAGE;

DOI:

10.1109/ICCV48922.2021.00146

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.
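The two distillation objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, cosine-similarity scoring, and the temperature value of 0.07 are all assumptions made for illustration, and the maintenance of the negative sample queue is not shown. The abstract states only that a mean square error loss is applied to attention distributions and that a token-wise noise contrastive loss aligns hidden states against negatives stored in a queue.

```python
import numpy as np


def attention_mse_loss(student_attn, teacher_attn):
    """Mean squared error between two attention maps of identical shape."""
    return float(np.mean((student_attn - teacher_attn) ** 2))


def token_nce_loss(student_h, teacher_h, queue, temperature=0.07):
    """Token-wise InfoNCE-style loss (hypothetical formulation).

    student_h, teacher_h: arrays of shape (num_tokens, dim).
    queue: array of shape (num_negatives, dim) holding negative representations.
    The positive for each Student token is the Teacher token at the same
    position; every queue entry is treated as a negative.
    """
    def l2_normalize(x):
        # Small epsilon guards against division by zero (an added safeguard).
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    s = l2_normalize(student_h)
    t = l2_normalize(teacher_h)
    q = l2_normalize(queue)

    pos = np.sum(s * t, axis=-1, keepdims=True)   # (num_tokens, 1)
    neg = s @ q.T                                 # (num_tokens, num_negatives)
    logits = np.concatenate([pos, neg], axis=1) / temperature

    # Numerically stable log-softmax; the positive sits in column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

Under these assumptions, the contrastive loss is small when each Student token representation points in the same direction as its Teacher counterpart and grows as the two diverge, while the attention loss is zero only when the two attention maps match exactly.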


Pages: 1408-1418

Page count: 11

