Revisiting the "Video" in Video-Language Understanding

Citations: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022
DOI
10.1109/CVPR52688.2022.00293
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
Pages: 2907-2917 (11 pages)