Revisiting the "Video" in Video-Language Understanding

Cited by: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
DOI: 10.1109/CVPR52688.2022.00293
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
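The abstract's core idea, a probe that answers a video-language query from a single frame-level embedding while deliberately ignoring temporal order, can be sketched as follows. This is a minimal illustrative assumption, not the paper's actual architecture (ATP learns frame selection with a small transformer over frozen image-language features); the function name `atemporal_probe` and the dot-product scorer are hypothetical.

```python
import numpy as np

def atemporal_probe(frame_embeds, query_embed, rng=None):
    """Hedged sketch of an atemporal probe.

    Given per-frame image-language embeddings (shape [T, D]) and a
    text query embedding (shape [D]), shuffle the frames to destroy
    temporal order, score each frame against the query, and return
    the single best frame's embedding. Any task solvable from this
    output needs no event-temporality understanding.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    order = rng.permutation(len(frame_embeds))  # discard temporal order
    shuffled = frame_embeds[order]
    scores = shuffled @ query_embed             # dot-product relevance (illustrative)
    return shuffled[int(np.argmax(scores))]     # one frame stands in for the video
```

The shuffle makes the atemporality explicit: since the selection is permutation-invariant, the probe's accuracy bounds what an image-constrained model can achieve on the task.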
Pages: 2907-2917
Page count: 11
Related papers (50 total)
  • [1] Deep Video Understanding with Video-Language Model
    Liu, Runze
    Fang, Yaqun
    Yu, Fan
    Tian, Ruiqi
    Ren, Tongwei
    Wu, Gangshan
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9551 - 9555
  • [2] Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
    Wang, Xiao
    Wu, Jianlong
    Lin, Zijia
    Zhang, Fuzheng
    Zhang, Di
    Nie, Liqiang
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (04) : 2912 - 2923
  • [3] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
    Li, Linjie
Gan, Zhe
    Lin, Kevin
    Lin, Chung-Ching
    Liu, Zicheng
    Liu, Ce
    Wang, Lijuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23119 - 23129
  • [4] Verbs in Action: Improving verb understanding in video-language models
    Momeni, Liliane
    Caron, Mathilde
    Nagrani, Arsha
    Zisserman, Andrew
    Schmid, Cordelia
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15533 - 15545
  • [5] Egocentric Video-Language Pretraining
    Lin, Kevin Qinghong
    Wang, Alex Jinpeng
    Soldan, Mattia
    Wray, Michael
    Yan, Rui
    Xu, Eric Zhongcong
    Gao, Difei
    Tu, Rongcheng
    Zhao, Wenzhe
    Kong, Weijie
    Cai, Chengfei
    Wang, Hongfa
    Damen, Dima
    Ghanem, Bernard
    Liu, Wei
    Shou, Mike Zheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] DeVAn: Dense Video Annotation for Video-Language Models
    Liu, Tingkai
    Tao, Yunzhe
    Liu, Haogeng
    Fan, Qihang
    Zhou, Ding
    Huang, Huaibo
    He, Ran
    Yang, Hongxia
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 14305 - 14321
  • [7] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
    Xu, Hu
    Ghosh, Gargi
    Huang, Po-Yao
    Arora, Prahal
    Aminzadeh, Masoumeh
    Feichtenhofer, Christoph
    Metze, Florian
    Zettlemoyer, Luke
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4227 - 4239
  • [8] ViLA: Efficient Video-Language Alignment for Video Question Answering
    Wang, Xijun
    Liang, Junbang
    Wang, Chun-Kai
    Deng, Kenan
    Lou, Yu
    Lin, Ming C.
    Yang, Shan
    COMPUTER VISION - ECCV 2024, PT LXII, 2025, 15120 : 186 - 204
  • [9] VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
    Li, Shicheng
    Li, Lei
    Liu, Yi
    Ren, Shuhuai
    Liu, Yuanxin
    Gao, Rundong
    Sun, Xu
    Hou, Lu
    COMPUTER VISION - ECCV 2024, PT LXX, 2025, 15128 : 331 - 348
  • [10] VidLA: Video-Language Alignment at Scale
    Rizve, Mamshad Nayeem
    Fei, Fan
    Unnikrishnan, Jayakrishnan
    Tran, Son
    Yao, Benjamin Z.
    Zeng, Belinda
    Shah, Mubarak
    Chilimbi, Trishul
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 14043 - 14055