Revisiting the "Video" in Video-Language Understanding

Cited by: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
DOI: 10.1109/CVPR52688.2022.00293
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
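The abstract's core idea, a probe that answers a video-language query from a single frame-level embedding while deliberately ignoring temporal order, can be sketched as follows. This is a minimal illustrative assumption, not the paper's actual architecture (ATP learns frame selection with a small transformer over frozen image-language features); the function name `atemporal_probe` and the dot-product scorer are hypothetical.

```python
import numpy as np

def atemporal_probe(frame_embeds, query_embed, rng=None):
    """Hedged sketch of an atemporal probe.

    Given per-frame image-language embeddings (shape [T, D]) and a
    text query embedding (shape [D]), shuffle the frames to destroy
    temporal order, score each frame against the query, and return
    the single best frame's embedding. Any task solvable from this
    output needs no event-temporality understanding.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    order = rng.permutation(len(frame_embeds))  # discard temporal order
    shuffled = frame_embeds[order]
    scores = shuffled @ query_embed             # dot-product relevance (illustrative)
    return shuffled[int(np.argmax(scores))]     # one frame stands in for the video
```

The shuffle makes the atemporality explicit: since the selection is permutation-invariant, the probe's accuracy bounds what an image-constrained model can achieve on the task.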
Pages: 2907-2917
Page count: 11
Related papers (50 total)
  • [1] Deep Video Understanding with Video-Language Model
    Liu, Runze
    Fang, Yaqun
    Yu, Fan
    Tian, Ruiqi
    Ren, Tongwei
    Wu, Gangshan
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9551 - 9555
  • [2] Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
    Wang, Xiao
    Wu, Jianlong
    Lin, Zijia
    Zhang, Fuzheng
    Zhang, Di
    Nie, Liqiang
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (04) : 2912 - 2923
  • [3] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
    Li, Linjie
Gan, Zhe
    Lin, Kevin
    Lin, Chung-Ching
    Liu, Zicheng
    Liu, Ce
    Wang, Lijuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23119 - 23129
  • [4] Verbs in Action: Improving verb understanding in video-language models
    Momeni, Liliane
    Caron, Mathilde
    Nagrani, Arsha
    Zisserman, Andrew
    Schmid, Cordelia
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15533 - 15545
  • [5] Egocentric Video-Language Pretraining
    Lin, Kevin Qinghong
    Wang, Alex Jinpeng
    Soldan, Mattia
    Wray, Michael
    Yan, Rui
    Xu, Eric Zhongcong
    Gao, Difei
    Tu, Rongcheng
    Zhao, Wenzhe
    Kong, Weijie
    Cai, Chengfei
    Wang, Hongfa
    Damen, Dima
    Ghanem, Bernard
    Liu, Wei
    Shou, Mike Zheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] DeVAn: Dense Video Annotation for Video-Language Models
    Liu, Tingkai
    Tao, Yunzhe
    Liu, Haogeng
    Fan, Qihang
    Zhou, Ding
    Huang, Huaibo
    He, Ran
    Yang, Hongxia
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 14305 - 14321
  • [7] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
    Xu, Hu
    Ghosh, Gargi
    Huang, Po-Yao
    Arora, Prahal
    Aminzadeh, Masoumeh
    Feichtenhofer, Christoph
    Metze, Florian
    Zettlemoyer, Luke
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4227 - 4239
  • [8] ViLA: Efficient Video-Language Alignment for Video Question Answering
    Wang, Xijun
    Liang, Junbang
    Wang, Chun-Kai
    Deng, Kenan
    Lou, Yu
    Lin, Ming C.
    Yang, Shan
    COMPUTER VISION - ECCV 2024, PT LXII, 2025, 15120 : 186 - 204
  • [9] VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
    Li, Shicheng
    Li, Lei
    Liu, Yi
    Ren, Shuhuai
    Liu, Yuanxin
    Gao, Rundong
    Sun, Xu
    Hou, Lu
    COMPUTER VISION - ECCV 2024, PT LXX, 2025, 15128 : 331 - 348
  • [10] VidLA: Video-Language Alignment at Scale
    Rizve, Mamshad Nayeem
    Fei, Fan
    Unnikrishnan, Jayakrishnan
    Tran, Son
    Yao, Benjamin Z.
    Zeng, Belinda
    Shah, Mubarak
    Chilimbi, Trishul
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 14043 - 14055