Learning Text-to-Video Retrieval from Image Captioning

Authors
Ventura, Lucas [1, 2]
Schmid, Cordelia [2]
Varol, Gul [1]
Affiliations
[1] Univ Gustave Eiffel, Ecole Ponts, LIGM, CNRS, Marne La Vallee, France
[2] PSL Res Univ, Inria, CNRS, ENS, Paris, France
Keywords
Text-to-video retrieval; Image captioning; Multimodal learning;
DOI
10.1007/s11263-024-02202-8
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from the multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
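The two training-time steps named in the abstract (keeping the captions that best match their frames, then pooling frame representations with caption-dependent weights) can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the cosine-similarity scoring, the softmax weighting, and the temperature value are all hypothetical choices made for illustration, and the embeddings are plain Python lists standing in for the outputs of an image-text model such as CLIP.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_captions(frame_embs, caption_embs, k):
    """Assumed selection rule: keep the k frames whose own generated
    caption has the highest cosine similarity to that frame."""
    scores = [
        dot(normalize(f), normalize(c))
        for f, c in zip(frame_embs, caption_embs)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def text_conditioned_pool(frame_embs, text_emb, temperature=0.1):
    """Assumed pooling rule: weight each frame by the softmax of its
    similarity to the caption, then take the weighted sum of frames."""
    frames = [normalize(f) for f in frame_embs]
    text = normalize(text_emb)
    weights = softmax([dot(f, text) for f in frames], temperature)
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frames))
        for d in range(len(text))
    ]
    return pooled, weights


# Toy usage: two frames, one caption aligned with the first frame.
pooled, weights = text_conditioned_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

In this toy setup the frame aligned with the caption receives almost all of the pooling weight; a larger temperature would move the result toward plain mean pooling over frames.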
Pages: 1834-1854
Number of pages: 21