Learning Text-to-Video Retrieval from Image Captioning

Authors
Ventura, Lucas [1, 2]
Schmid, Cordelia [2 ]
Varol, Gul [1 ]
Affiliations
[1] Univ Gustave Eiffel, Ecole Ponts, LIGM, CNRS, Marne La Vallee, France
[2] PSL Res Univ, Inria, CNRS, ENS, Paris, France
Keywords
Text-to-video retrieval; Image captioning; Multimodal learning
DOI
10.1007/s11263-024-02202-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore more scalable, in contrast to expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
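The two training-time steps named in the abstract (keeping the captions that best match the visual content, then pooling frames by their relevance to a caption) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `select_top_captions` and `query_scored_pooling`, the use of cosine similarity between a caption and the frame it was generated from, the softmax weighting, and the `temperature` value are all assumptions made for illustration; embeddings are plain Python lists standing in for the outputs of an image-text model such as CLIP.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def select_top_captions(caption_embs, frame_embs, k):
    """Return indices of the k captions most similar to their own source frame.

    Assumption: caption i was generated from frame i, and cosine similarity
    is used as the matching score.
    """
    scores = [
        dot(normalize(c), normalize(f))
        for c, f in zip(caption_embs, frame_embs)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def query_scored_pooling(frame_embs, text_emb, temperature=0.1):
    """Pool frame embeddings with weights given by their relevance to the text.

    Assumption: relevance is cosine similarity, turned into weights with a
    temperature-scaled softmax. Returns (pooled_embedding, weights).
    """
    text = normalize(text_emb)
    frames = [normalize(f) for f in frame_embs]
    logits = [dot(f, text) / temperature for f in frames]
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(s - max_logit) for s in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)
    ]
    return pooled, weights


# Toy example: three 2-D "frame" embeddings and one caption embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
top = select_top_captions(captions, frames, k=2)
pooled, weights = query_scored_pooling(frames, [1.0, 0.0])
```

In this toy setup the frame most aligned with the caption receives the largest weight, so the pooled video representation is dominated by the frames the caption actually describes. A lower `temperature` sharpens this effect, while a very high one approaches plain mean pooling.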
Pages: 1834-1854
Number of pages: 21