Revisiting the "Video" in Video-Language Understanding

Cited by: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00293
CLC (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
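To make the abstract's core idea concrete, the following is a minimal PyTorch sketch of an ATP-style probe, not the authors' released implementation: a lightweight selector scores frozen, order-less per-frame embeddings from an image-language model and commits to a single frame embedding, which alone is used to score the video-language task. The Gumbel-softmax selection, embedding dimensions, and the random placeholder embeddings are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ATPSelector(nn.Module):
    """Sketch of an atemporal probe (ATP)-style frame selector.

    Input: frozen per-frame embeddings from an image-language model
    (e.g., CLIP), given WITHOUT temporal order (no positional encodings),
    so the probe cannot exploit event temporality. Output: a single
    selected frame embedding used for the downstream task.
    """

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, frame_embs: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # frame_embs: (batch, num_frames, embed_dim), frozen and unordered.
        hidden = self.encoder(frame_embs)
        logits = self.scorer(hidden).squeeze(-1)      # (batch, num_frames)
        # Differentiable (near-)discrete selection of one frame
        # (assumed here via straight-through Gumbel-softmax).
        weights = F.gumbel_softmax(logits, tau=tau, hard=True)
        return torch.einsum("bf,bfd->bd", weights, frame_embs)


if __name__ == "__main__":
    batch, num_frames, dim = 2, 8, 512
    frames = torch.randn(batch, num_frames, dim)  # placeholder frozen frame embeddings
    text = torch.randn(batch, dim)                # placeholder frozen text embedding
    atp = ATPSelector(embed_dim=dim)
    selected = atp(frames)
    # Image-level similarity score, e.g., for retrieval or answer ranking.
    score = F.cosine_similarity(selected, text, dim=-1)
    print(score.shape)  # torch.Size([2])

If such a probe matches or exceeds full temporal models on a benchmark, the benchmark is largely solvable from image-level understanding, which is the diagnostic use of ATP the abstract describes.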
Pages: 2907 - 2917
Number of pages: 11
Related papers
50 records in total
  • [21] Test of Time: Instilling Video-Language Models with a Sense of Time
    Bagad, Piyush
    Tapaswi, Makarand
    Snoek, Cees G. M.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2503 - 2516
  • [22] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
    Xue, Hongwei
    Hang, Tiankai
    Zeng, Yanhong
    Sun, Yuchong
    Liu, Bei
    Yang, Huan
    Fu, Jianlong
    Guo, Baining
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5026 - 5035
  • [23] PAXION: Patching Action Knowledge in Video-Language Foundation Models
    Wang, Zhenhailong
    Blume, Ansel
    Li, Sha
    Liu, Genglin
    Cho, Jaemin
    Tang, Zineng
    Bansal, Mohit
    Ji, Heng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [24] Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization
    Cui, Chenhao
    Liang, Xinnian
    Wu, Shuangzhi
    Li, Zhoujun
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [25] Depth-Aware Sparse Transformer for Video-Language Learning
    Zhang, Haonan
    Gao, Lianli
    Zeng, Pengpeng
    Hanjalic, Alan
    Shen, Heng Tao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4778 - 4787
  • [26] Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
    Jin, Peng
    Li, Hao
    Yuan, Li
    Yan, Shuicheng
    Chen, Jie
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (03) : 2125 - 2139
  • [27] Clover: Towards A Unified Video-Language Alignment and Fusion Model
    Huang, Jingjia
    Li, Yinan
    Feng, Jiashi
    Wu, Xinglong
    Sun, Xiaoshuai
    Ji, Rongrong
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14856 - 14866
  • [28] Survey: Transformer based video-language pre-training
    Ruan, Ludan
    Jin, Qin
    AI OPEN, 2022, 3 : 1 - 13
  • [29] VideoCon: Robust Video-Language Alignment via Contrast Captions
    Bansal, Hritik
    Bitton, Yonatan
    Szpektor, Idan
    Chang, Kai-Wei
    Grover, Aditya
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13927 - 13937
  • [30] Learning Trajectory-Word Alignments for Video-Language Tasks
    Yang, Xu
    Li, Zhangzikang
    Xu, Haiyang
    Zhang, Hanwang
    Ye, Qinghao
    Li, Chenliang
    Yan, Ming
    Zhang, Yu
    Huang, Fei
    Huang, Songfang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2504 - 2514