Visual Grounding With Dual Knowledge Distillation

Cited by: 0
Authors
Wu, Wansen [1 ]
Cao, Meng [2 ]
Hu, Yue [1 ]
Peng, Yong [1 ]
Qin, Long [1 ]
Yin, Quanjun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410072, Peoples R China
[2] Tencent AI Lab, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Hunan Province
Keywords
Visualization; Task analysis; Semantics; Grounding; Feature extraction; Location awareness; Proposals; Visual grounding; vision and language; knowledge distillation;
DOI
10.1109/TCSVT.2024.3407785
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline codes
0808; 0809
Abstract
Visual grounding is a task that seeks to predict the specific location of an object or region described by a linguistic expression within an image. Despite recent success, existing methods still suffer from two problems. First, most methods use independently pre-trained unimodal feature encoders to extract expressive feature embeddings, which results in a significant semantic gap between unimodal embeddings and limits the effective interaction of visual-linguistic contexts. Second, existing attention-based approaches equipped with a global receptive field tend to neglect the local information present in images. This limitation restricts the semantic understanding required to distinguish referred objects from the background, consequently leading to inadequate localization performance. Inspired by recent advances in knowledge distillation, in this paper we propose a DUal knowlEdge disTillation (DUET) method for visual grounding models to bridge the cross-modal semantic gap and improve localization performance simultaneously. Specifically, we utilize the CLIP model as the teacher model to transfer semantic knowledge to a student model, in which the vision and language modalities are linked into a unified embedding space. In addition, we design a self-distillation method for the student model to acquire localization knowledge by performing region-level contrastive learning that pulls the predicted region close to positive samples. To this end, this work further proposes a Semantics-Location Aware sampling mechanism to generate high-quality self-distillation samples. Extensive experiments on five datasets and ablation studies demonstrate the state-of-the-art performance of DUET and its orthogonality with different student models, making DUET adaptable to a wide range of visual grounding architectures. Code is available on DUET.
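The region-level contrastive self-distillation described in the abstract can be illustrated with a minimal InfoNCE-style sketch. Note this is an assumption-laden illustration, not the paper's implementation: the function name, embedding shapes, and temperature value are hypothetical, and the paper's actual loss and sampling mechanism may differ.

```python
import numpy as np

def region_contrastive_loss(pred, positives, negatives, tau=0.07):
    """InfoNCE-style loss (illustrative sketch, not the paper's code):
    pulls the predicted-region embedding toward positive region samples
    and pushes it away from negative (background) samples.

    pred:      (d,)   embedding of the predicted region
    positives: (P, d) embeddings of positive (well-overlapping) regions
    negatives: (N, d) embeddings of negative (background) regions
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = l2norm(pred)
    pos = l2norm(positives) @ q / tau   # (P,) temperature-scaled similarities
    neg = l2norm(negatives) @ q / tau   # (N,)
    # Denominator sums over all samples; each positive appears in the numerator.
    log_denom = np.log(np.sum(np.exp(np.concatenate([pos, neg]))))
    return float(np.mean(log_denom - pos))
```

With this shape of loss, a prediction whose embedding aligns with a positive sample yields a loss near zero, while alignment with a background sample yields a large loss, which is the localization signal the self-distillation step would exploit.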
Pages: 10399-10410
Page count: 12
Related Papers
50 records total
  • [21] Dual knowledge distillation for bidirectional neural machine translation
    Zhang, Huaao
    Qiu, Shigui
    Wu, Shilong
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [22] Efficient Crowd Counting via Dual Knowledge Distillation
    Wang, Rui
    Hao, Yixue
    Hu, Long
    Li, Xianzhi
    Chen, Min
    Miao, Yiming
    Humar, Iztok
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 569 - 583
  • [23] Dual model knowledge distillation for industrial anomaly detection
    Thomine, Simon
    Snoussi, Hichem
    PATTERN ANALYSIS AND APPLICATIONS, 2024, 27 (03)
  • [24] DUAL KNOWLEDGE DISTILLATION FOR EFFICIENT SOUND EVENT DETECTION
    Xiao, Yang
    Das, Rohan Kumar
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 690 - 694
  • [25] Weakly Supervised Referring Expression Grounding via Dynamic Self-Knowledge Distillation
    Mi, Jinpeng
    Chen, Zhiqian
    Zhang, Jianwei
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1254 - 1260
  • [26] Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation
    Mi, Jinpeng
    Tang, Song
    Ma, Zhiyuan
    Liu, Dan
    Li, Qingdu
    Zhang, Jianwei
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023, : 8299 - 8305
  • [27] VISUAL GROUNDING
    CUMBOW, RC
    AMERICAN FILM, 1978, 3 (10): : 16 - 16
  • [28] A Teacher-Free Graph Knowledge Distillation Framework With Dual Self-Distillation
    Wu, Lirong
    Lin, Haitao
    Gao, Zhangyang
    Zhao, Guojiang
    Li, Stan Z.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (09) : 4375 - 4385
  • [29] Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation
    Yu, Ruichi
    Li, Ang
    Morariu, Vlad I.
    Davis, Larry S.
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1068 - 1076
  • [30] Compressing Visual-linguistic Model via Knowledge Distillation
    Fang, Zhiyuan
    Wang, Jianfeng
    Hu, Xiaowei
    Wang, Lijuan
    Yang, Yezhou
    Liu, Zicheng
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1408 - 1418