Video Question Answering with Procedural Programs

Cited by: 0

Authors:

Choudhury, Rohan [1]

Niinuma, Koichiro [2]

Kitani, Kris M. [1]

Jeni, Laszlo A. [1]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Fujitsu Res Amer, Santa Clara, CA USA

Source:

COMPUTER VISION-ECCV 2024, PT XXXVIII | 2025 / Vol. 15096

DOI:

10.1007/978-3-031-72920-1_18

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/.
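The generate-then-execute loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the module names `find_frames` and `count_frames`, the `generate_program` placeholder that stands in for the language-model call, and the representation of a video as a list of text captions are all hypothetical.

```python
from typing import Callable, Dict, List

# Toy stand-in for a video: each "frame" is a text caption (assumption).
Video = List[str]


def find_frames(video: Video, query: str) -> Video:
    """Return frames whose caption mentions the query (toy retrieval)."""
    return [frame for frame in video if query in frame]


def count_frames(frames: Video) -> int:
    """Count the frames that were passed in."""
    return len(frames)


# Hypothetical module API that would be described to the language model.
API: Dict[str, Callable] = {
    "find_frames": find_frames,
    "count_frames": count_frames,
}


def generate_program(question: str, api_doc: str) -> str:
    """Placeholder for the LLM call: returns a fixed program for the demo."""
    return (
        "def answer(video):\n"
        "    frames = find_frames(video, 'dog')\n"
        "    return 'yes' if count_frames(frames) > 0 else 'no'\n"
    )


def run_query(video: Video, question: str) -> str:
    # Build the API description that would go into the prompt.
    api_doc = "\n".join(f"{name}: {fn.__doc__}" for name, fn in API.items())
    program = generate_program(question, api_doc)
    namespace = dict(API)      # the program can call only the listed modules
    exec(program, namespace)   # defines answer() inside the namespace
    return namespace["answer"](video)


video = ["a man opens a door", "a dog runs outside", "the dog catches a ball"]
print(run_query(video, "Is there a dog in the video?"))
```

In this sketch the generated program is plain Python that composes the listed modules, so adding a capability means adding a function to `API` and describing it in the prompt. Running model-generated code with `exec` would need sandboxing in any real setting.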


Pages: 315-332

Page count: 18

