Video-Text Retrieval by Supervised Sparse Multi-Grained Learning

Cited by: 0
Authors
Wang, Yimu [1]
Shi, Peng [1]
Affiliations
[1] Univ Waterloo, Waterloo, ON, Canada
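Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract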
Recent progress in video-text retrieval has been driven by the exploration of better representation learning. In this paper, we present S3MA, a novel multi-grained sparse learning framework that learns an aligned sparse space shared between video and text for video-text retrieval. The shared sparse space is initialized with a finite number of sparse concepts, each of which refers to a number of words. With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to calculate fine-grained and coarse-grained similarities. Extensive experiments on several video-text retrieval benchmarks demonstrate that S3MA, benefiting from the learned shared sparse space and multi-grained similarities, outperforms existing methods. Our code is available at link.
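As a rough illustration of the ideas named in the abstract, the sketch below shows one way a dense embedding could be mapped onto a small set of concepts and how coarse-grained (whole video vs. text) and fine-grained (best frame vs. text) similarities could be computed. It is not the S3MA implementation: the abstract does not give the formulas, so the cosine scoring, the top-k sparsification, the mean pooling, the max over frames, and all function names (`to_sparse`, `coarse_similarity`, `fine_similarity`) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(frames):
    """Average per-frame vectors into one video-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def to_sparse(vec, concepts, k):
    """Hypothetical sparsifier (an assumption, not the paper's method):
    score vec against every concept vector, keep the k highest scores,
    and set all other entries to zero."""
    scores = [cosine(vec, c) for c in concepts]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def coarse_similarity(frames, text):
    """Coarse-grained: pooled video vector against the text vector."""
    return cosine(mean_pool(frames), text)


def fine_similarity(frames, text):
    """Fine-grained: best-matching single frame against the text vector."""
    return max(cosine(f, text) for f in frames)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]   # two toy frame embeddings
    text = [1.0, 0.0]                   # toy sentence embedding
    concepts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy concept vectors
    print(to_sparse(text, concepts, k=1))
    print(coarse_similarity(frames, text))
    print(fine_similarity(frames, text))
```

In this toy setup the fine-grained score is higher than the coarse-grained one because a single frame matches the text exactly while the pooled video vector is diluted by the unrelated frame, which is the kind of signal frame-level representations are meant to preserve.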
Pages: 633-649
Page count: 17