Video-Text Retrieval by Supervised Sparse Multi-Grained Learning

Authors
Wang, Yimu [1]
Shi, Peng [1]
Affiliation
[1] Univ Waterloo, Waterloo, ON, Canada
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
While recent progress in video-text retrieval has been advanced by the exploration of better representation learning, in this paper, we present a novel multi-grained sparse learning framework, S3MA, to learn an aligned sparse space shared between the video and the text for video-text retrieval. The shared sparse space is initialized with a finite number of sparse concepts, each of which refers to a number of words. With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations for better modeling the video modality and calculating fine-grained and coarse-grained similarities. Benefiting from the learned shared sparse space and multi-grained similarities, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of S3MA over existing methods. Our code is available at link.
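The multi-grained similarity mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' S3MA implementation: the abstract does not specify how similarities are computed, so the use of cosine similarity, mean-pooling of frames for the coarse-grained score, max over frames for the fine-grained score, and the `weight` mixing parameter are all assumptions made for illustration. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_similarity(frames, text):
    """Assumed coarse-grained score: mean-pool the frames into one video vector, then compare."""
    n = len(frames)
    video = [sum(column) / n for column in zip(*frames)]
    return cosine(video, text)


def fine_similarity(frames, text):
    """Assumed fine-grained score: similarity of the best-matching single frame."""
    return max(cosine(frame, text) for frame in frames)


def multi_grained_similarity(frames, text, weight=0.5):
    """Hypothetical combination: weighted sum of the coarse and fine scores."""
    return weight * coarse_similarity(frames, text) + (1 - weight) * fine_similarity(frames, text)


# Toy example: two 2-dimensional frame embeddings and one text embedding.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = multi_grained_similarity(frames, text)
```

In this toy case the fine-grained score is driven entirely by the first frame, which matches the text exactly, while the coarse-grained score is lowered by the unrelated second frame. That gap is the kind of signal frame-level representations are meant to recover.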
Pages: 633-649
Page count: 17