Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Citations: 0
Authors
Lin, Chengzhi [1 ]
Wu, Ancong [1 ]
Liang, Junwei [2 ]
Zhang, Jun [3 ]
Ge, Wenhang [1 ]
Zheng, Wei-Shi [1 ,4 ,5 ]
Shen, Chunhua [6 ]
Affiliations
[1] Sun Yat-sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Hong Kong Univ Sci & Technol, AI Thrust, Guangzhou, Peoples R China
[3] Tencent Youtu Lab, Shenzhen, Peoples R China
[4] Guangdong Prov Key Lab Informat Secur, Guangzhou, Peoples R China
[5] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[6] Zhejiang Univ, Hangzhou, Peoples R China
Funding
US National Science Foundation;
Keywords
DOI
N/A
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while a query text describes only part of it. Thus, a single video can correspond to multiple different text descriptions and queries; we call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (e.g., object to entity and action to verb). Such methods struggle to alleviate the correspondence ambiguity because they describe a video with a single feature, which must then match multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptively aggregating video token features. Given a query text, the similarity is determined by the prototype most similar to the text, which is termed text-adaptive matching. To learn diverse prototypes that represent the rich information in videos, we propose a variance loss that encourages different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.
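
Based only on the abstract, the sketch below illustrates the three stated ideas in PyTorch: (1) aggregating video token features into multiple prototypes, (2) text-adaptive matching that scores a query against its most similar prototype, and (3) a variance-style loss pushing prototypes toward different content. All names and design details (VideoPrototypes, the attention-based aggregator, the exact penalty form, K and dim) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoPrototypes(nn.Module):
    """Aggregate video token features into K visual prototypes.

    Hypothetical reading of "adaptive aggregation of video token
    features": one learned query per prototype attends over the tokens.
    """

    def __init__(self, dim: int = 512, num_prototypes: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim))
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) features, e.g., from a frame/patch encoder.
        attn = torch.einsum("kd,bnd->bkn", self.queries, tokens) * self.scale
        attn = attn.softmax(dim=-1)                          # (B, K, N)
        protos = torch.einsum("bkn,bnd->bkd", attn, tokens)  # (B, K, dim)
        return F.normalize(protos, dim=-1)


def text_adaptive_similarity(protos: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    # protos: (B, K, dim) L2-normalized prototypes; text: (C, dim)
    # L2-normalized text features. Each text is scored against its most
    # similar prototype (max over K), per the abstract.
    sims = torch.einsum("bkd,cd->bck", protos, text)  # (B, C, K)
    return sims.max(dim=-1).values                    # (B, C)


def variance_loss(protos: torch.Tensor) -> torch.Tensor:
    # One plausible form of the diversity objective: penalize high
    # pairwise cosine similarity between prototypes of the same video.
    sim = torch.einsum("bkd,bjd->bkj", protos, protos)   # (B, K, K)
    off_diag = sim - torch.eye(sim.size(-1), device=sim.device)
    return off_diag.clamp(min=0).mean()


# Example: 2 videos of 8 tokens, 3 query texts, 4 prototypes per video.
tokens = torch.randn(2, 8, 512)
texts = F.normalize(torch.randn(3, 512), dim=-1)
model = VideoPrototypes(dim=512, num_prototypes=4)
protos = model(tokens)                           # (2, 4, 512)
sims = text_adaptive_similarity(protos, texts)   # video-text scores (2, 3)
diversity = variance_loss(protos)
```

A retrieval objective (e.g., a symmetric InfoNCE over `sims`) combined with this diversity term would train the prototypes end to end; the abstract does not specify the exact combination, so the weighting is left open here.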
Pages: 12
Related Papers
50 records in total
  • [41] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
    Chen, Shizhe
    Zhao, Yida
    Jin, Qin
    Wu, Qi
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, : 10635 - 10644
  • [42] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [43] KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
    Zhuang, Xianwei
    Li, Hongxiang
    Cheng, Xuxin
    Zhu, Zhihong
    Xie, Yuxin
    Zou, Yuexian
    COMPUTER VISION - ECCV 2024, PT XXXIV, 2025, 15092 : 313 - 331
  • [44] EA-VTR: Event-Aware Video-Text Retrieval
    Ma, Zongyang
    Zhang, Ziqi
    Chen, Yuxin
    Qi, Zhongang
    Yuan, Chunfeng
    Li, Bing
    Luo, Yingmin
    Li, Xu
    Qi, Xiaojuan
    Shan, Ying
    Hu, Weiming
    COMPUTER VISION - ECCV 2024, PT LII, 2025, 15110 : 76 - 94
  • [45] Debiased Video-Text Retrieval via Soft Positive Sample Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5257 - 5270
  • [46] Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
    Wang, Yimu
    Shi, Peng
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 633 - 649
  • [47] LSECA: local semantic enhancement and cross aggregation for video-text retrieval
    Wang, Zhiwen
    Zhang, Donglin
    Hu, Zhikai
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)
  • [48] Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
    Ma, Wufei
    Li, Kai
    Jiang, Zhongshi
    Meshry, Moustafa
    Liu, Qihao
    Wang, Huiyu
    Hane, Christian
    Yuille, Alan
    COMPUTER VISION - ECCV 2024, PT XIII, 2025, 15071 : 254 - 269
  • [49] Self-expressive induced clustered attention for video-text retrieval
    Zhu, Jingxuan
    Shen, Xiangjun
    Mehta, Sumet
    Abeo, Timothy Apasiba
    Zhan, Yongzhao
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [50] Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features
    Nabati, Masoomeh
    Behrad, Alireza
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184