Visual Grounding With Dual Knowledge Distillation

Cited by: 0
Authors
Wu, Wansen [1 ]
Cao, Meng [2 ]
Hu, Yue [1 ]
Peng, Yong [1 ]
Qin, Long [1 ]
Yin, Quanjun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410072, Peoples R China
[2] Tencent AI Lab, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Hunan Province;
Keywords
Visualization; Task analysis; Semantics; Grounding; Feature extraction; Location awareness; Proposals; Visual grounding; vision and language; knowledge distillation;
DOI
10.1109/TCSVT.2024.3407785
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification
0808; 0809;
Abstract
Visual grounding is a task that seeks to predict the specific location of an object or region described by a linguistic expression within an image. Despite recent success, existing methods still suffer from two problems. First, most methods use independently pre-trained unimodal feature encoders for extracting expressive feature embeddings, resulting in a significant semantic gap between unimodal embeddings and limiting the effective interaction of visual-linguistic contexts. Second, existing attention-based approaches equipped with the global receptive field tend to neglect the local information present in the images. This limitation restricts the semantic understanding required to distinguish between referred objects and the background, consequently leading to inadequate localization performance. Inspired by the recent advances in knowledge distillation, in this paper, we propose a DUal knowlEdge disTillation (DUET) method for visual grounding models to bridge the cross-modal semantic gap and improve localization performance simultaneously. Specifically, we utilize the CLIP model as the teacher model to transfer the semantic knowledge to a student model, in which the vision and language modalities are linked into a unified embedding space. Besides, we design a self-distillation method for the student model to acquire localization knowledge by performing region-level contrastive learning to make the predicted region close to the positive samples. To this end, this work further proposes a Semantics-Location Aware sampling mechanism to generate high-quality self-distillation samples. Extensive experiments on five datasets and ablation studies demonstrate the state-of-the-art performance of DUET and its orthogonality with different student models, thereby making DUET adaptable to a wide range of visual grounding architectures. Code is available on DUET.
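The abstract describes two distillation signals: aligning the student's embeddings with a frozen CLIP teacher in a shared space, and a region-level contrastive (InfoNCE-style) self-distillation that pulls the predicted region toward positive samples and away from negatives. The record includes no code, so the following is a minimal pure-Python sketch of those two loss ingredients; the function names, the cosine-alignment form, and the temperature value are illustrative assumptions, not DUET's actual implementation:

```python
import math

def cosine_align_loss(student_emb, teacher_emb):
    """Cross-modal distillation sketch: penalize the angle between the
    student embedding and the (frozen) teacher embedding. 0 = aligned."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    ns = math.sqrt(sum(s * s for s in student_emb))
    nt = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (ns * nt)

def region_info_nce(anchor, positives, negatives, tau=0.07):
    """Region-level contrastive sketch: an InfoNCE-style loss over the
    predicted-region embedding (anchor), positive region samples, and
    negative region samples."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    pos = sum(math.exp(cos(anchor, p) / tau) for p in positives)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    # Loss is small when positives dominate the similarity mass.
    return -math.log(pos / (pos + neg))
```

In the paper's setup, the positives and negatives would come from the proposed Semantics-Location Aware sampling mechanism rather than being chosen arbitrarily.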
Pages: 10399-10410
Page count: 12
Related Papers
50 records in total
  • [1] Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
    Wang, Liwei
    Huang, Jing
    Li, Yin
    Xu, Kun
    Yang, Zhengyuan
    Yu, Dong
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 14085 - 14095
  • [2] Dual-student knowledge distillation for visual anomaly detection
    Hao, Jutao
    Huang, Kai
    Chen, Chen
    Mao, Jian
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (04) : 4853 - 4865
  • [3] Dual knowledge distillation for visual tracking with teacher-student network
    Wang, Yuanyun
    Sun, Chuanyu
    Wang, Jun
    Chai, Bingfei
    SIGNAL IMAGE AND VIDEO PROCESSING, 2024, 18 (6-7) : 5203 - 5211
  • [4] Novel Visual Category Discovery with Dual Ranking Statistics and Mutual Knowledge Distillation
    Zhao, Bingchen
    Han, Kai
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [5] Visual "Nosing Around". On the Theoretical Grounding of Communicating Visual Knowledge
    Miko, Katharina
    SOZIALE WELT-ZEITSCHRIFT FUR SOZIALWISSENSCHAFTLICHE FORSCHUNG UND PRAXIS, 2013, 64 (1-2): : 153 - +
  • [6] Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
    Bakr, Eslam Mohamed
    Alsaedy, Yasmeen
    Elhoseiny, Mohamed
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [7] Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
    Chen, Zhihong
    Zhang, Ruifei
    Song, Yibing
    Wan, Xiang
    Li, Guanbin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15039 - 15049
  • [8] Gaze-assisted visual grounding via knowledge distillation for referred object grasping with under-specified object referring
    Zhang, Zhuoyang
    Qian, Kun
    Zhou, Bo
    Fang, Fang
    Ma, Xudong
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
  • [9] A Fast Knowledge Distillation Framework for Visual Recognition
    Shen, Zhiqiang
    Xing, Eric
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 673 - 690
  • [10] Spatial Knowledge Distillation to aid Visual Reasoning
    Aditya, Somak
    Saha, Rudra
    Yang, Yezhou
    Baral, Chitta
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 227 - 235