Visual Grounding With Dual Knowledge Distillation

Cited by: 0
Authors
Wu, Wansen [1 ]
Cao, Meng [2 ]
Hu, Yue [1 ]
Peng, Yong [1 ]
Qin, Long [1 ]
Yin, Quanjun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410072, Peoples R China
[2] Tencent AI Lab, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Hunan Province
Keywords
Visualization; Task analysis; Semantics; Grounding; Feature extraction; Location awareness; Proposals; Visual grounding; vision and language; knowledge distillation
DOI
10.1109/TCSVT.2024.3407785
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Visual grounding is the task of localizing the object or region in an image that a linguistic expression describes. Despite recent progress, existing methods still suffer from two problems. First, most methods use independently pre-trained unimodal feature encoders to extract expressive feature embeddings, which leaves a significant semantic gap between the unimodal embeddings and limits effective interaction between visual and linguistic contexts. Second, existing attention-based approaches, equipped with a global receptive field, tend to neglect local information in the images; this restricts the semantic understanding needed to distinguish referred objects from the background and leads to inadequate localization performance. Inspired by recent advances in knowledge distillation, we propose a DUal knowlEdge disTillation (DUET) method for visual grounding models that bridges the cross-modal semantic gap and improves localization performance simultaneously. Specifically, we use the CLIP model as the teacher to transfer semantic knowledge to a student model, linking the vision and language modalities into a unified embedding space. In addition, we design a self-distillation method through which the student acquires localization knowledge by performing region-level contrastive learning, pulling the predicted region toward positive samples. To this end, we further propose a Semantics-Location Aware sampling mechanism to generate high-quality self-distillation samples. Extensive experiments on five datasets and ablation studies demonstrate the state-of-the-art performance of DUET and its orthogonality with different student models, making DUET adaptable to a wide range of visual grounding architectures. Code is available at DUET.
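To make the two distillation signals concrete, the following PyTorch sketch shows one plausible form of the losses the abstract describes: a cross-modal distillation term that matches the student's image-text similarity distribution to a frozen CLIP teacher's, and a region-level InfoNCE-style self-distillation term that pulls the predicted region toward positive samples. All function names, tensor shapes, and loss weights are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def semantic_distillation_loss(student_img, student_txt, clip_img, clip_txt, tau=0.07):
        # Cross-modal distillation (assumed form): align the student's image-text
        # similarity distribution with that of the frozen CLIP teacher, so the
        # student's unimodal embeddings move into a shared semantic space.
        s_img = F.normalize(student_img, dim=-1)   # (B, D) student image embeddings
        s_txt = F.normalize(student_txt, dim=-1)   # (B, D) student text embeddings
        t_img = F.normalize(clip_img, dim=-1)      # (B, D) CLIP image embeddings (frozen)
        t_txt = F.normalize(clip_txt, dim=-1)      # (B, D) CLIP text embeddings (frozen)
        s_logits = s_img @ s_txt.t() / tau                    # (B, B) student similarities
        t_probs = (t_img @ t_txt.t() / tau).softmax(dim=-1)   # teacher distribution
        return F.kl_div(s_logits.log_softmax(dim=-1), t_probs, reduction="batchmean")

    def region_contrastive_loss(pred_region, pos_regions, neg_regions, tau=0.07):
        # Self-distillation (assumed form): an InfoNCE-style objective that pulls
        # the predicted-region feature toward positive region samples (e.g., drawn
        # by a semantics-location aware sampler) and away from negatives.
        q = F.normalize(pred_region, dim=-1)       # (B, D) predicted-region feature
        pos = F.normalize(pos_regions, dim=-1)     # (B, P, D) positive samples
        neg = F.normalize(neg_regions, dim=-1)     # (B, N, D) negative samples
        pos_sim = torch.einsum("bd,bpd->bp", q, pos) / tau
        neg_sim = torch.einsum("bd,bnd->bn", q, neg) / tau
        logits = torch.cat([pos_sim, neg_sim], dim=1)   # (B, P + N)
        log_prob = logits.log_softmax(dim=1)
        # Average the log-probability mass assigned to the positive columns.
        return -log_prob[:, : pos_sim.size(1)].mean()

    # Hypothetical total objective: the student's grounding loss plus the two
    # distillation terms with illustrative weights lambda1 and lambda2:
    #   loss = loss_grounding + lambda1 * semantic_kd + lambda2 * region_kd

Treating every positive column as a correct class keeps the contrastive term well defined when the sampler returns more than one positive region per image.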
Pages: 10399-10410
Number of pages: 12
Related Papers
50 records in total
  • [31] Long-Term Knowledge Distillation of Visual Place Classifiers
    Tomoe, Hiroki
    Kanji, Tanaka
    2019 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 2019, : 541 - 546
  • [32] Dual cross knowledge distillation for image super-resolution
    Fang, Hangxiang
    Long, Yongwen
    Hu, Xinyi
    Ou, Yangtao
    Huang, Yuanjia
    Hu, Haoji
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 95
  • [33] Dual-decoder transformer network for answer grounding in visual question answering
    Zhu, Liangjun
    Peng, Li
    Zhou, Weinan
    Yang, Jielong
    PATTERN RECOGNITION LETTERS, 2023, 171 : 53 - 60
  • [34] Weakly-Supervised Grounding for VQA with Dual Visual-Linguistic Interaction
    Liu, Yi
    Pan, Junwen
    Wang, Qilong
    Chen, Guanlin
    Nie, Weiguo
    Zhang, Yudong
    Gao, Qian
    Hu, Qinghua
    Zhu, Pengfei
    ARTIFICIAL INTELLIGENCE, CICAI 2023, PT I, 2024, 14473 : 156 - 169
  • [35] Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders
    Zhu, Xiaofeng
    Mandivarapu, Jaya Krishna
    1ST WORKSHOP ON CUSTOMIZABLE NLP: PROGRESS AND CHALLENGES IN CUSTOMIZING NLP FOR A DOMAIN, APPLICATION, GROUP, OR INDIVIDUAL (CUSTOMNLP4U 2024), 2024, : 156 - 166
  • [36] Deconfounded Visual Grounding
    Huang, Jianqiang
    Qin, Yu
    Qi, Jiaxin
    Sun, Qianru
    Zhang, Hanwang
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 998 - 1006
  • [37] Grounding Visual Explanations
    Hendricks, Lisa Anne
    Hu, Ronghang
    Darrell, Trevor
    Akata, Zeynep
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 269 - 286
  • [38] Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders
    Zhu, Xiaofeng
    Mandivarapu, Jaya Krishna
    arXiv
  • [39] Flexible Visual Grounding
    Kim, Yongmin
    Chu, Chenhui
    Kurohashi, Sadao
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): STUDENT RESEARCH WORKSHOP, 2022, : 285 - 299
  • [40] Revisiting knowledge distillation for light-weight visual object detection
    Gao, Tianze
    Gao, Yunfeng
    Li, Yu
    Qin, Peiyuan
    TRANSACTIONS OF THE INSTITUTE OF MEASUREMENT AND CONTROL, 2021, 43 (13) : 2888 - 2898