Visual Grounding With Dual Knowledge Distillation

Cited by: 0
Authors
Wu, Wansen [1 ]
Cao, Meng [2 ]
Hu, Yue [1 ]
Peng, Yong [1 ]
Qin, Long [1 ]
Yin, Quanjun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410072, Peoples R China
[2] Tencent AI Lab, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Hunan Province
Keywords
Visualization; Task analysis; Semantics; Grounding; Feature extraction; Location awareness; Proposals; Visual grounding; vision and language; knowledge distillation
DOI
10.1109/TCSVT.2024.3407785
CLC Classification Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Code
0808; 0809
Abstract
Visual grounding is a task that seeks to predict the specific location of an object or region described by a linguistic expression within an image. Despite recent success, existing methods still suffer from two problems. First, most methods use independently pre-trained unimodal feature encoders to extract expressive feature embeddings, resulting in a significant semantic gap between unimodal embeddings and limiting the effective interaction of visual-linguistic contexts. Second, existing attention-based approaches equipped with a global receptive field tend to neglect the local information present in images. This limitation restricts the semantic understanding required to distinguish referred objects from the background, leading to inadequate localization performance. Inspired by recent advances in knowledge distillation, in this paper we propose a DUal knowlEdge disTillation (DUET) method for visual grounding models to bridge the cross-modal semantic gap and improve localization performance simultaneously. Specifically, we utilize the CLIP model as the teacher model to transfer semantic knowledge to a student model, in which the vision and language modalities are linked into a unified embedding space. Besides, we design a self-distillation method for the student model to acquire localization knowledge by performing region-level contrastive learning, pulling the predicted region close to positive samples. To this end, this work further proposes a Semantics-Location Aware sampling mechanism to generate high-quality self-distillation samples. Extensive experiments on five datasets and ablation studies demonstrate the state-of-the-art performance of DUET and its orthogonality with different student models, making DUET adaptable to a wide range of visual grounding architectures. Code is available at DUET.
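The abstract describes two distillation signals: a cross-modal term that aligns the student's vision and language embeddings with CLIP's unified embedding space, and a region-level contrastive self-distillation term that pulls the predicted region toward positive samples. Below is a minimal PyTorch sketch of how such a dual objective could be composed. The function names, the cosine-matching form of the teacher-alignment term, and the multi-positive InfoNCE form of the region-level term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_distillation_loss(student_vis, student_txt, teacher_vis, teacher_txt):
    """Cross-modal distillation (assumed cosine-matching form): pull the
    student's unimodal embeddings toward the frozen CLIP teacher's unified
    embedding space. All inputs are (batch, dim) tensors."""
    s_v, s_t = F.normalize(student_vis, dim=-1), F.normalize(student_txt, dim=-1)
    t_v, t_t = F.normalize(teacher_vis, dim=-1), F.normalize(teacher_txt, dim=-1)
    # 1 - cosine similarity, averaged over the batch and summed over modalities.
    return (1 - (s_v * t_v).sum(-1)).mean() + (1 - (s_t * t_t).sum(-1)).mean()

def region_contrastive_loss(pred_region, pos_regions, neg_regions, tau=0.07):
    """Region-level contrastive self-distillation (assumed multi-positive
    InfoNCE form): make the predicted region's embedding (dim,) close to
    positive region samples (P, dim) and far from negatives (N, dim)."""
    q = F.normalize(pred_region, dim=-1)
    pos = F.normalize(pos_regions, dim=-1)
    neg = F.normalize(neg_regions, dim=-1)
    pos_logits = pos @ q / tau                            # (P,)
    all_logits = torch.cat([pos_logits, neg @ q / tau])   # (P + N,)
    # -log( sum_p exp(pos_p) / sum_a exp(all_a) )
    return torch.logsumexp(all_logits, dim=0) - torch.logsumexp(pos_logits, dim=0)

if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 512  # CLIP ViT-B joint embedding width
    kd = clip_distillation_loss(torch.randn(8, dim), torch.randn(8, dim),
                                torch.randn(8, dim), torch.randn(8, dim))
    sd = region_contrastive_loss(torch.randn(dim),
                                 torch.randn(4, dim), torch.randn(16, dim))
    total = kd + 0.5 * sd  # the 0.5 weight is a placeholder, not from the paper
    print(f"kd={kd.item():.4f}  sd={sd.item():.4f}  total={total.item():.4f}")
```

In practice these terms would be added to the student model's usual grounding (localization) loss; the weighting above is purely illustrative.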
Pages: 10399 - 10410
Number of pages: 12