Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

Cited by: 27

Authors:

Wang, Liwei [1]

Huang, Jing [2]

Li, Yin [3]

Xu, Kun [4]

Yang, Zhengyuan [5]

Yu, Dong [4]

Affiliations:

[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[2] Univ Illinois, Champaign, IL USA

[3] Univ Wisconsin Madison, Madison, WI USA

[4] Tencent AI Lab, Bellevue, WA USA

[5] Univ Rochester, Rochester, NY 14627 USA

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

Keywords:

DOI:

10.1109/CVPR46437.2021.01387

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Weakly supervised phrase grounding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this challenge, we leverage a generic object detector at training time, and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed. Importantly, our region-phrase score function is learned by distilling from soft matching scores between the detected object names and candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pairs. The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.
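
The abstract names two score functions and two training signals but gives no formulas, so the following is only a minimal sketch of how such a setup might look. Everything specific in it is an assumption for illustration, not the authors' method: cosine similarity as the region-phrase score, max-over-regions then mean-over-phrases as the image-sentence score, an InfoNCE-style contrastive loss, a cross-entropy distillation loss, the `temperature` value, and all function names.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def region_phrase_scores(regions, phrases):
    """Assumed region-phrase score: cosine similarity, shape (R, P)."""
    return l2_normalize(regions) @ l2_normalize(phrases).T


def image_sentence_score(regions, phrases):
    """Assumed aggregation: best region per phrase, averaged over phrases."""
    return region_phrase_scores(regions, phrases).max(axis=0).mean()


def image_sentence_contrastive_loss(batch_regions, batch_phrases, temperature=0.1):
    """InfoNCE-style loss: each image should score highest with its own sentence."""
    n = len(batch_regions)
    scores = np.array([
        [image_sentence_score(batch_regions[i], batch_phrases[j]) for j in range(n)]
        for i in range(n)
    ])
    probs = softmax(scores / temperature, axis=1)
    # Ground-truth image-sentence pairs sit on the diagonal.
    return float(-np.mean(np.log(probs[np.arange(n), np.arange(n)])))


def distillation_loss(student_scores, teacher_scores, temperature=0.1):
    """Cross-entropy from teacher to student distributions over regions, per phrase.

    teacher_scores stands in for soft matching scores between detected object
    names and candidate phrases (hypothetical input, shape (R, P)).
    """
    teacher = softmax(teacher_scores / temperature, axis=0)
    student = softmax(student_scores / temperature, axis=0)
    return float(-np.mean(np.sum(teacher * np.log(student + 1e-12), axis=0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three image-sentence pairs: 5 region features and 2 phrase features each.
    batch_regions = [rng.normal(size=(5, 16)) for _ in range(3)]
    batch_phrases = [rng.normal(size=(2, 16)) for _ in range(3)]
    teacher = rng.normal(size=(5, 2))  # placeholder for detector-derived scores
    student = region_phrase_scores(batch_regions[0], batch_phrases[0])
    print(image_sentence_contrastive_loss(batch_regions, batch_phrases))
    print(distillation_loss(student, teacher))
```

In this sketch the detector-derived `teacher` scores appear only in the training loss. At inference, `region_phrase_scores` alone would rank regions for a phrase, which mirrors the abstract's point that no object detector is needed at test time.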


Pages: 14085-14095

Page count: 11
