Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Cited by: 53
Authors
Yang, Antoine [2,3,6]
Nagrani, Arsha [1]
Seo, Paul Hongsuck [1]
Miech, Antoine [4]
Pont-Tuset, Jordi [1]
Laptev, Ivan [2,3]
Sivic, Josef [5]
Schmid, Cordelia [1]
Affiliations
[1] Google Research, Mountain View, CA, USA
[2] Inria Paris, Paris, France
[3] PSL Research University, CNRS, Department of Informatics, ENS, Paris, France
[4] DeepMind, London, England
[5] Czech Technical University, Czech Institute of Informatics, Robotics and Cybernetics, Prague, Czech Republic
[6] Google, Mountain View, CA, USA
Keywords
DOI
10.1109/CVPR52729.2023.01032
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning by reformulating sentence boundaries of transcribed speech as pseudo event boundaries and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at [1].
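To make the time-token formulation and the pseudo-labelling of narrated videos concrete, here is a minimal Python sketch, not the authors' code: NUM_TIME_BINS and the helpers quantize_time, build_target_sequence and pseudo_events_from_asr are illustrative assumptions. It shows how transcribed-speech sentences and their timestamps could be serialized into a single output sequence that interleaves event boundaries (as special time tokens) with event captions.

```python
# Minimal sketch (assumed names, not the released Vid2Seq code) of serializing
# dense captions into one sequence with special time tokens, and of turning
# ASR sentences into pseudo events for pretraining.

from dataclasses import dataclass
from typing import List

NUM_TIME_BINS = 100  # assumed number of special time tokens added to the vocabulary


@dataclass
class Event:
    start: float   # seconds
    end: float     # seconds
    caption: str


def quantize_time(t: float, duration: float) -> str:
    """Map an absolute timestamp to one of NUM_TIME_BINS special time tokens."""
    bin_idx = min(int(t / duration * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return f"<time_{bin_idx}>"


def build_target_sequence(events: List[Event], duration: float) -> str:
    """Interleave time tokens and captions so boundaries and text share one sequence."""
    parts = []
    for ev in sorted(events, key=lambda e: e.start):
        parts.append(quantize_time(ev.start, duration))
        parts.append(quantize_time(ev.end, duration))
        parts.append(ev.caption)
    return " ".join(parts)


def pseudo_events_from_asr(asr_sentences: List[dict]) -> List[Event]:
    """Treat transcribed-speech sentence boundaries as pseudo event boundaries."""
    return [Event(s["start"], s["end"], s["text"]) for s in asr_sentences]


if __name__ == "__main__":
    asr = [
        {"start": 3.2, "end": 8.9, "text": "add the chopped onions to the pan"},
        {"start": 12.0, "end": 17.5, "text": "stir until they turn golden"},
    ]
    target = build_target_sequence(pseudo_events_from_asr(asr), duration=60.0)
    print(target)
    # e.g. "<time_5> <time_14> add the chopped onions ... <time_20> <time_29> stir until ..."
```

Under this (assumed) serialization, a single sequence-to-sequence decoder can emit both when events occur and what they describe, which is what allows ASR sentence boundaries and sentences to act as pseudo supervision at scale.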
Pages: 10714-10726
Page count: 13
Related papers
50 entries in total
  • [1] Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model
    Wang, Xiao
    Zhou, Weikang
    Zhang, Qi
    Zhou, Jie
    Gao, Songyang
    Wang, Junzhe
    Zhang, Menghan
    Gao, Xiang
    Chen, Yunwen
    Gui, Tao
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 555 - 568
  • [2] On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model
    Shin, Seongjin
    Lee, Sang-Woo
    Ahn, Hwijeen
    Kim, Sungdong
    Kim, HyoungSeok
    Kim, Boseop
    Cho, Kyunghyun
    Lee, Gichang
    Park, Woomyoung
    Ha, Jung-Woo
    Sung, Nako
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5168 - 5186
  • [3] FedID: Federated Interactive Distillation for Large-Scale Pretraining Language Models
    Ma, Xinge
    Liu, Jiangming
    Wang, Jin
    Zhang, Xuejie
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 8566 - 8577
  • [4] MSVD-Turkish: A Large-Scale Dataset for Video Captioning in Turkish
    Citamak, Begum
    Kuyu, Menekse
    Erdem, Aykut
    Erdem, Erkut
    2019 27TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2019,
  • [5] 3D Vision and Language Pretraining with Large-Scale Synthetic Data
    Yang, Dejie
    Xu, Zhu
    Mo, Wentao
    Chen, Qingchao
    Huang, Siyuan
    Liu, Yang
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1552 - 1560
  • [6] ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks
    Han, Kai
    Wang, Yunhe
    Guo, Jianyuan
    Wu, Enhua
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 15751 - 15761
  • [7] Omnidirectional Image Quality Captioning: A Large-Scale Database and a New Model
    Yan, Jiebin
    Tan, Ziwen
    Fang, Yuming
    Chen, Junjie
    Jiang, Wenhui
    Wang, Zhou
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 1326 - 1339
  • [8] Audio-visual large-scale video copy detection
    Liu, Yang
    Xu, Changsheng
    Lu, Hanqing
    INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS, 2011, 88 (18) : 3803 - 3816
  • [9] Large-Scale Visual Language Model Boosted by Contrast Domain Adaptation for Intelligent Industrial Visual Monitoring
    Wang, Huan
    Li, Chenxi
    Li, Yan-Fu
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, : 14114 - 14123
  • [10] Visual Analytics of Large-Scale Climate Model Data
    Wong, Pak Chung
    Shen, Han-Wei
    Leung, Ruby
    Hagos, Samson
    Lee, Teng-Yok
    Tong, Xin
    Lu, Kewei
    2014 IEEE 4TH SYMPOSIUM ON LARGE DATA ANALYSIS AND VISUALIZATION (LDAV), 2014, : 85 - 92