Text-guided distillation learning to diversify video embeddings for text-video retrieval

Cited by: 0
Authors
Lee, Sangmin [1]
Kim, Hyung-Il [2]
Ro, Yong Man [3]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Elect & Telecommun Res Inst, Visual Intelligence Res Sect, Daejeon 34129, South Korea
[3] Korea Adv Inst Sci & Technol, Image & Video Syst Lab, Daejeon 34141, South Korea
Keywords
Text-video retrieval; Diverse video embedding; Text-guided distillation learning; Text-agnostic; One-to-many correspondence
DOI
10.1016/j.patcog.2024.110754
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to address temporal and semantic diversity, consequently constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire meaningful, distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, allowing the prepared video embeddings to be reused for any new text query. Experiments show that our method outperforms existing methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses of diverse video embeddings.
Pages: 10