Revisiting the "Video" in Video-Language Understanding

Citations: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022
DOI
10.1109/CVPR52688.2022.00293
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
Pages: 2907-2917 (11 pages)