Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Cited: 0
Authors
Lin, Chengzhi [1 ]
Wu, Ancong [1 ]
Liang, Junwei [2 ]
Zhang, Jun [3 ]
Ge, Wenhang [1 ]
Zheng, Wei-Shi [1 ,4 ,5 ]
Shen, Chunhua [6 ]
Affiliations
[1] Sun Yat-sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Hong Kong Univ Sci & Technol, AI Thrust, Guangzhou, Peoples R China
[3] Tencent Youtu Lab, Shenzhen, Peoples R China
[4] Guangdong Prov Key Lab Informat Secur, Guangzhou, Peoples R China
[5] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[6] Zhejiang Univ, Hangzhou, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while a query text describes only part of that information. Thus, a video can correspond to multiple different text descriptions and queries; we call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (e.g., object to entity and action to verb). Such methods struggle to alleviate the video-text correspondence ambiguity because they describe a video with one single feature, which must be matched with multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptively aggregating video token features. Given a query text, the similarity is determined by the prototype most similar to the text, which we term text-adaptive matching. To learn diverse prototypes that represent the rich information in videos, we propose a variance loss that encourages different prototypes to attend to different contents of the video. Our method outperforms the state-of-the-art methods on four public video retrieval datasets.
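The two ideas in the abstract — scoring a text query against the most similar of several video prototypes, and a variance penalty that pushes prototypes apart — can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names, the cosine-similarity choice, and the exact form of the variance loss are assumptions for illustration.

```python
import numpy as np

def text_adaptive_similarity(prototypes, text_feature):
    """Score a video against a text query (text-adaptive matching sketch).

    prototypes:  (K, D) array, K visual prototypes for one video
    text_feature: (D,) array, the query text embedding
    Returns the cosine similarity of the prototype closest to the text.
    """
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    t = text_feature / np.linalg.norm(text_feature)
    sims = p @ t           # one cosine similarity per prototype
    return sims.max()      # the best-matching prototype determines the score

def variance_loss(prototypes):
    """Diversity penalty sketch: minimizing this loss maximizes the
    variance across prototypes, encouraging them to attend to different
    contents of the video (a hypothetical instantiation of the idea)."""
    deviation = prototypes - prototypes.mean(axis=0)
    return -np.mean(deviation ** 2)
```

For example, with orthogonal prototypes `[[1, 0], [0, 1]]` and a query `[1, 0]`, the first prototype matches perfectly, so the video-text score is 1.0 regardless of the second prototype; a different query could instead select the second prototype, which is how one video accommodates multiple distinct text descriptions.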
Pages: 12
Related Papers
50 records in total
  • [31] STACKED CONVOLUTIONAL DEEP ENCODING NETWORK FOR VIDEO-TEXT RETRIEVAL
    Zhao, Rui
    Zheng, Kecheng
    Zha, Zheng-jun
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [32] Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval
    Liu, Zhi
    Cai, Jincen
    Zhang, Mengmeng
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2022, 16 (07): : 2407 - 2424
  • [33] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
    Wang, Ziyang
    Sung, Yi-Lin
    Cheng, Feng
    Bertasius, Gedas
    Bansal, Mohit
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2804 - 2815
  • [34] MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval
    Ge, Yuying
    Ge, Yixiao
    Liu, Xihui
    Wang, Jinpeng
    Wu, Jianping
    Shan, Ying
    Qie, Xiaohu
    Luo, Ping
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 691 - 708
  • [35] Query-Adaptive Late Fusion for Hierarchical Fine-Grained Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Liu, Fang
    Zhou, Tongqing
    Cai, Zhiping
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (05) : 7150 - 7161
  • [36] Video-text extraction and recognition
    Chen, TB
    Ghosh, D
    Ranganath, S
    TENCON 2004 - 2004 IEEE REGION 10 CONFERENCE, VOLS A-D, PROCEEDINGS: ANALOG AND DIGITAL TECHNIQUES IN ELECTRICAL ENGINEERING, 2004, : A319 - A322
  • [37] Alignment of Image-Text and Video-Text Datasets
    Ozkose, Yunus Emre
    Gokce, Zeynep
    Duygulu, Pinar
    2023 31ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU, 2023,
  • [38] LOOK, TELL AND MATCH: REFINING VIDEO-TEXT RETRIEVAL WITH SEMANTIC INFORMATION
    Zhu Jinkuan
    Hu Weiyi
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [39] Boosting Video-Text Retrieval with Explicit High-Level Semantics
    Wang, Haoran
    Xu, Di
    He, Dongliang
    Li, Fu
    Ji, Zhong
    Han, Jungong
    Ding, Errui
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4887 - 4898
  • [40] Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval
    Shi, Yaya
    Liu, Haowei
    Xu, Haiyang
    Ma, Zongyang
    Ye, Qinghao
    Hu, Anwen
    Yan, Ming
    Zhang, Ji
    Huang, Fei
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4460 - 4470