Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Cited by: 0
|
Authors
Lin, Chengzhi [1 ]
Wu, Ancong [1 ]
Liang, Junwei [2 ]
Zhang, Jun [3 ]
Ge, Wenhang [1 ]
Zheng, Wei-Shi [1 ,4 ,5 ]
Shen, Chunhua [6 ]
Affiliations
[1] Sun Yatsen Univ China, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Hong Kong Univ Sci & Technol, AI Thrust, Guangzhou, Peoples R China
[3] Tencent Youtu Lab, Shenzhen, Peoples R China
[4] Guangdong Prov Key Lab Informat Secur, Guangzhou, Peoples R China
[5] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[6] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Science Foundation (US);
Keywords
DOI
None available
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information and the query text only describes a part of the information. Thus, a video can correspond to multiple different text descriptions and queries. We call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between contents of a video and text (e.g., object to entity and action to verb). It is difficult for these methods to alleviate the video-text correspondence ambiguity by describing a video using only one single feature, which is required to be matched with multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptive aggregation of video token features. Given a query text, the similarity is determined by the most similar prototype to find correspondence in the video, which is termed text-adaptive matching. To learn diverse prototypes for representing the rich information in videos, we propose a variance loss to encourage different prototypes to attend to different contents of the video. Our method outperforms the state-of-the-art methods on four public video retrieval datasets.
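The abstract's core ideas — scoring a query text against the most similar of several video prototypes, and a variance-style loss that pushes prototypes to attend to different video tokens — can be sketched as follows. This is a minimal illustration with NumPy, not the paper's implementation: the function names, the cosine-similarity choice, and the particular variance-based loss form are assumptions for exposition only.

```python
import numpy as np

def text_adaptive_similarity(prototypes, text):
    """Text-adaptive matching: score a text query against a video's
    K prototype vectors and keep the best match.

    prototypes: (K, D) array of visual prototype features for one video.
    text: (D,) text query feature.
    Returns the max cosine similarity over prototypes (illustrative choice).
    """
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    t = text / np.linalg.norm(text)
    return float(np.max(p @ t))

def variance_loss(attn):
    """Illustrative diversity loss (not the paper's exact formulation):
    reward per-token variance of attention across prototypes, so that
    different prototypes attend to different video tokens.

    attn: (K, N) attention weights of K prototypes over N video tokens.
    Returns a loss that is 0 when all prototypes attend identically and
    decreases (improves) as their attention patterns diverge.
    """
    return float(-np.mean(np.var(attn, axis=0)))
```

With orthogonal prototypes, a query aligned to one prototype scores 1.0 regardless of the others, which is how one video can match several different text descriptions. Identical attention rows give a variance loss of 0, while disjoint attention patterns give a strictly lower (better) loss.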
Pages: 12
Related Papers
50 items in total
  • [21] Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval
    Fang, Han
    Yang, Zhifei
    Zang, Xianghao
    Ban, Chao
    He, Zhongjiang
    Sun, Hao
    Zhou, Lanxiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3847 - 3856
  • [22] Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
    Hao, Xiaoshuai
    Zhang, Wanqian
    Wu, Dayan
    Zhu, Fei
    Li, Bo
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18962 - 18972
  • [23] Complementarity-Aware Space Learning for Video-Text Retrieval
    Zhu, Jinkuan
    Zeng, Pengpeng
    Gao, Lianli
    Li, Gongfu
    Liao, Dongliang
    Song, Jingkuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4362 - 4374
  • [24] Uncertainty-Aware with Negative Samples for Video-Text Retrieval
    Song, Weitao
    Chen, Weiran
    Xu, Jialiang
    Ji, Yi
    Li, Ying
    Liu, Chunping
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 318 - 332
  • [25] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497
  • [26] Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval
    Lai, Huakai
    Yang, Wenfei
    Zhang, Tianzhu
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 12019 - 12031
  • [27] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
    Liu, Song
    Fan, Haoqi
    Qian, Shengsheng
    Chen, Yiru
    Ding, Wenkui
    Wang, Zhongyuan
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11895 - 11905
  • [28] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58
  • [29] Robust Video-Text Retrieval Via Noisy Pair Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8632 - 8645
  • [30] SEMANTIC-PRESERVING METRIC LEARNING FOR VIDEO-TEXT RETRIEVAL
    Choo, Sungkwon
    Ha, Seong Jong
    Lee, Joonsoo
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2388 - 2392