Learning Text-to-Video Retrieval from Image Captioning

Cited by: 0
Authors
Ventura, Lucas [1 ,2 ]
Schmid, Cordelia [2 ]
Varol, Gul [1 ]
Affiliations
[1] Univ Gustave Eiffel, Ecole Ponts, LIGM, CNRS, Marne La Vallee, France
[2] PSL Res Univ, Inria, CNRS, ENS, Paris, France
Keywords
Text-to-video retrieval; Image captioning; Multimodal learning
DOI
10.1007/s11263-024-02202-8
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text-image pairs. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal on unlabeled videos. We show that automatically labeling video frames with image captioning enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from the video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights, and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
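The two training-time mechanisms described in the abstract, picking the generated caption that best matches the video, and pooling frame features weighted by their relevance to that caption, can be sketched roughly as follows. This is an illustrative NumPy sketch under assumed conventions (L2-normalized CLIP-style embeddings, softmax weighting, and a temperature of 0.07), not the authors' implementation; all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_caption(frame_feats, caption_feats):
    """Among captions generated for individual frames, return the index
    of the caption whose embedding best matches any frame of the video
    (serving as the pseudo-label for retrieval training)."""
    sims = caption_feats @ frame_feats.T          # (num_captions, num_frames)
    return int(sims.max(axis=1).argmax())

def pool_video_feature(frame_feats, caption_feat, temperature=0.07):
    """Caption-conditioned temporal pooling: weight each frame embedding
    by its cosine similarity to the caption, then re-normalize.

    frame_feats:  (num_frames, dim) L2-normalized frame embeddings
    caption_feat: (dim,) L2-normalized caption embedding
    """
    scores = frame_feats @ caption_feat           # (num_frames,) relevance
    weights = softmax(scores / temperature)       # relevant frames dominate
    video_feat = weights @ frame_feats            # (dim,) weighted average
    return video_feat / np.linalg.norm(video_feat)
```

With a low temperature, the pooled video embedding is dominated by the frames most relevant to the caption, which is what lets frame-level (image) supervision train a video-level retrieval model.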
Pages: 1834-1854
Number of pages: 21
Related Papers
50 records
  • [21] Rethink video retrieval representation for video captioning
    Tian, Mingkai
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Sheng, Quan Z.
    Huang, Qingming
    PATTERN RECOGNITION, 2024, 156
  • [22] Predicting Visual Features From Text for Image and Video Caption Retrieval
    Dong, Jianfeng
    Li, Xirong
    Snoek, Cees G. M.
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (12) : 3377 - 3388
  • [23] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    Wu, Jay Zhangjie
    Ge, Yixiao
    Wang, Xintao
    Lei, Stan Weixian
    Gu, Yuchao
    Shi, Yufei
    Hsu, Wynne
    Shan, Ying
    Qie, Xiaohu
    Shou, Mike Zheng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7589 - 7599
  • [24] MEVG: Multi-event Video Generation with Text-to-Video Models
    Oh, Gyeongrok
    Jeong, Jaehwan
    Kim, Sieun
    Byeon, Wonmin
    Kim, Jinkyu
    Kim, Sungwoong
    Kim, Sangpil
    COMPUTER VISION-ECCV 2024, PT XLIII, 2025, 15101 : 401 - 418
  • [25] ImproveYourVideos: Architectural Improvements for Text-to-Video Generation Pipeline
    Arkhipkin, Vladimir
    Shaheen, Zein
    Vasilev, Viacheslav
    Dakhova, Elizaveta
    Sobolev, Konstantin
    Kuznetsov, Andrey
    Dimitrov, Denis
    IEEE ACCESS, 2025, 13 : 1986 - 2003
  • [26] Bilingual video captioning model for enhanced video retrieval
    Alrebdi, Norah
    Al-Shargabi, Amal A.
    JOURNAL OF BIG DATA, 2024, 11 (01)
  • [27] Breathing Life Into Sketches Using Text-to-Video Priors
    Gal, Rinon
    Vinker, Yael
    Alaluf, Yuval
    Bermano, Amit
    Cohen-Or, Daniel
    Shamir, Ariel
    Chechik, Gal
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 4325 - 4336
  • [28] Text-to-video generative artificial intelligence: sora in neurosurgery
    Mohamed, Ali A.
    Lucke-Wold, Brandon
    NEUROSURGICAL REVIEW, 2024, 47 (01)
  • [29] Text-to-video: a semantic search engine for internet videos
    Jiang, Lu
    Yu, Shoou-I
    Meng, Deyu
    Mitamura, Teruko
    Hauptmann, Alexander G.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2016, 5 (01) : 3 - 18