Text-guided distillation learning to diversify video embeddings for text-video retrieval

Cited by: 0
Authors
Lee, Sangmin [1]
Kim, Hyung-Il [2]
Ro, Yong Man [3]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Elect & Telecommun Res Inst, Visual Intelligence Res Sect, Daejeon 34129, South Korea
[3] Korea Adv Inst Sci & Technol, Image & Video Syst Lab, Daejeon 34141, South Korea
Keywords
Text-video retrieval; Diverse video embedding; Text-guided distillation learning; Text-agnostic; One-to-many correspondence
DOI
10.1016/j.patcog.2024.110754
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to address temporal and semantic diversity, consequently constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire meaningful, distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, allowing the prepared video embeddings to be used continuously for any new text query. Experiments show that our method outperforms existing methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses on diverse video embeddings.
Pages: 10