Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Citations: 0
Authors
Lin, Chengzhi [1 ]
Wu, Ancong [1 ]
Liang, Junwei [2 ]
Zhang, Jun [3 ]
Ge, Wenhang [1 ]
Zheng, Wei-Shi [1 ,4 ,5 ]
Shen, Chunhua [6 ]
Affiliations
[1] Sun Yat-sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Hong Kong Univ Sci & Technol, AI Thrust, Guangzhou, Peoples R China
[3] Tencent Youtu Lab, Shenzhen, Peoples R China
[4] Guangdong Prov Key Lab Informat Secur, Guangzhou, Peoples R China
[5] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[6] Zhejiang Univ, Hangzhou, Peoples R China
Funding
US National Science Foundation;
Keywords
DOI
N/A
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while a query text describes only part of it. Thus, a single video can correspond to multiple different text descriptions and queries; we call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (e.g., object to entity and action to verb). Such methods struggle to alleviate the correspondence ambiguity because they describe a video with a single feature, which must then match multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptively aggregating video token features. Given a query text, the similarity is determined by the prototype most similar to the text, which is termed text-adaptive matching. To learn diverse prototypes that represent the rich information in videos, we propose a variance loss that encourages different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.
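
Based only on the abstract, the sketch below illustrates the three stated ideas in PyTorch: (1) aggregating video token features into multiple prototypes, (2) text-adaptive matching that scores a query against its most similar prototype, and (3) a variance-style loss pushing prototypes toward different content. All names and design details (VideoPrototypes, the attention-based aggregator, the exact penalty form, K and dim) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoPrototypes(nn.Module):
    """Aggregate video token features into K visual prototypes.

    Hypothetical reading of "adaptive aggregation of video token
    features": one learned query per prototype attends over the tokens.
    """

    def __init__(self, dim: int = 512, num_prototypes: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim))
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) features, e.g., from a frame/patch encoder.
        attn = torch.einsum("kd,bnd->bkn", self.queries, tokens) * self.scale
        attn = attn.softmax(dim=-1)                          # (B, K, N)
        protos = torch.einsum("bkn,bnd->bkd", attn, tokens)  # (B, K, dim)
        return F.normalize(protos, dim=-1)


def text_adaptive_similarity(protos: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    # protos: (B, K, dim) L2-normalized prototypes; text: (C, dim)
    # L2-normalized text features. Each text is scored against its most
    # similar prototype (max over K), per the abstract.
    sims = torch.einsum("bkd,cd->bck", protos, text)  # (B, C, K)
    return sims.max(dim=-1).values                    # (B, C)


def variance_loss(protos: torch.Tensor) -> torch.Tensor:
    # One plausible form of the diversity objective: penalize high
    # pairwise cosine similarity between prototypes of the same video.
    sim = torch.einsum("bkd,bjd->bkj", protos, protos)   # (B, K, K)
    off_diag = sim - torch.eye(sim.size(-1), device=sim.device)
    return off_diag.clamp(min=0).mean()


# Example: 2 videos of 8 tokens, 3 query texts, 4 prototypes per video.
tokens = torch.randn(2, 8, 512)
texts = F.normalize(torch.randn(3, 512), dim=-1)
model = VideoPrototypes(dim=512, num_prototypes=4)
protos = model(tokens)                           # (2, 4, 512)
sims = text_adaptive_similarity(protos, texts)   # video-text scores (2, 3)
diversity = variance_loss(protos)
```

A retrieval objective (e.g., a symmetric InfoNCE over `sims`) combined with this diversity term would train the prototypes end to end; the abstract does not specify the exact combination, so the weighting is left open here.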
Pages: 12
Related Papers
50 records in total
  • [41] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
    Chen, Shizhe
    Zhao, Yida
    Jin, Qin
    Wu, Qi
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, : 10635 - 10644
  • [42] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [43] KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
    Zhuang, Xianwei
    Li, Hongxiang
    Cheng, Xuxin
    Zhu, Zhihong
    Xie, Yuxin
    Zou, Yuexian
    COMPUTER VISION - ECCV 2024, PT XXXIV, 2025, 15092 : 313 - 331
  • [44] EA-VTR: Event-Aware Video-Text Retrieval
    Ma, Zongyang
    Zhang, Ziqi
    Chen, Yuxin
    Qi, Zhongang
    Yuan, Chunfeng
    Li, Bing
    Luo, Yingmin
    Li, Xu
    Qi, Xiaojuan
    Shan, Ying
    Hu, Weiming
    COMPUTER VISION - ECCV 2024, PT LII, 2025, 15110 : 76 - 94
  • [45] Debiased Video-Text Retrieval via Soft Positive Sample Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5257 - 5270
  • [46] Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
    Wang, Yimu
    Shi, Peng
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 633 - 649
  • [47] LSECA: local semantic enhancement and cross aggregation for video-text retrieval
    Wang, Zhiwen
    Zhang, Donglin
    Hu, Zhikai
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)
  • [48] Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
    Ma, Wufei
    Li, Kai
    Jiang, Zhongshi
    Meshry, Moustafa
    Liu, Qihao
    Wang, Huiyu
    Hane, Christian
    Yuille, Alan
    COMPUTER VISION - ECCV 2024, PT XIII, 2025, 15071 : 254 - 269
  • [49] Self-expressive induced clustered attention for video-text retrieval
    Zhu, Jingxuan
    Shen, Xiangjun
    Mehta, Sumet
    Abeo, Timothy Apasiba
    Zhan, Yongzhao
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [50] Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features
    Nabati, Masoomeh
    Behrad, Alireza
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184