Revisiting the "Video" in Video-Language Understanding

Cited by: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00293
CLC (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
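To make the abstract's core idea concrete, the following is a minimal PyTorch sketch of an ATP-style probe, not the authors' released implementation: a lightweight selector scores frozen, order-less per-frame embeddings from an image-language model and commits to a single frame embedding, which alone is used to score the video-language task. The Gumbel-softmax selection, embedding dimensions, and the random placeholder embeddings are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ATPSelector(nn.Module):
    """Sketch of an atemporal probe (ATP)-style frame selector.

    Input: frozen per-frame embeddings from an image-language model
    (e.g., CLIP), given WITHOUT temporal order (no positional encodings),
    so the probe cannot exploit event temporality. Output: a single
    selected frame embedding used for the downstream task.
    """

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, frame_embs: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # frame_embs: (batch, num_frames, embed_dim), frozen and unordered.
        hidden = self.encoder(frame_embs)
        logits = self.scorer(hidden).squeeze(-1)      # (batch, num_frames)
        # Differentiable (near-)discrete selection of one frame
        # (assumed here via straight-through Gumbel-softmax).
        weights = F.gumbel_softmax(logits, tau=tau, hard=True)
        return torch.einsum("bf,bfd->bd", weights, frame_embs)


if __name__ == "__main__":
    batch, num_frames, dim = 2, 8, 512
    frames = torch.randn(batch, num_frames, dim)  # placeholder frozen frame embeddings
    text = torch.randn(batch, dim)                # placeholder frozen text embedding
    atp = ATPSelector(embed_dim=dim)
    selected = atp(frames)
    # Image-level similarity score, e.g., for retrieval or answer ranking.
    score = F.cosine_similarity(selected, text, dim=-1)
    print(score.shape)  # torch.Size([2])

If such a probe matches or exceeds full temporal models on a benchmark, the benchmark is largely solvable from image-level understanding, which is the diagnostic use of ATP the abstract describes.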
Pages: 2907 - 2917
Number of pages: 11
Related papers
50 records in total
  • [21] Test of Time: Instilling Video-Language Models with a Sense of Time
    Bagad, Piyush
    Tapaswi, Makarand
    Snoek, Cees G. M.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2503 - 2516
  • [22] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
    Xue, Hongwei
    Hang, Tiankai
    Zeng, Yanhong
    Sun, Yuchong
    Liu, Bei
    Yang, Huan
    Fu, Jianlong
    Guo, Baining
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5026 - 5035
  • [23] PAXION: Patching Action Knowledge in Video-Language Foundation Models
    Wang, Zhenhailong
    Blume, Ansel
    Li, Sha
    Liu, Genglin
    Cho, Jaemin
    Tang, Zineng
    Bansal, Mohit
    Ji, Heng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [24] Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization
    Cui, Chenhao
    Liang, Xinnian
    Wu, Shuangzhi
    Li, Zhoujun
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [25] Depth-Aware Sparse Transformer for Video-Language Learning
    Zhang, Haonan
    Gao, Lianli
    Zeng, Pengpeng
    Hanjalic, Alan
    Shen, Heng Tao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4778 - 4787
  • [26] Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
    Jin, Peng
    Li, Hao
    Yuan, Li
    Yan, Shuicheng
    Chen, Jie
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (03) : 2125 - 2139
  • [27] Clover: Towards A Unified Video-Language Alignment and Fusion Model
    Huang, Jingjia
    Li, Yinan
    Feng, Jiashi
    Wu, Xinglong
    Sun, Xiaoshuai
    Ji, Rongrong
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14856 - 14866
  • [28] Survey: Transformer based video-language pre-training
    Ruan, Ludan
    Jin, Qin
    AI OPEN, 2022, 3 : 1 - 13
  • [29] VideoCon: Robust Video-Language Alignment via Contrast Captions
    Bansal, Hritik
    Bitton, Yonatan
    Szpektor, Idan
    Chang, Kai-Wei
    Grover, Aditya
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13927 - 13937
  • [30] Learning Trajectory-Word Alignments for Video-Language Tasks
    Yang, Xu
    Li, Zhangzikang
    Xu, Haiyang
    Zhang, Hanwang
    Ye, Qinghao
    Li, Chenliang
    Yan, Ming
    Zhang, Yu
    Huang, Fei
    Huang, Songfang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2504 - 2514