Video Question Answering with Procedural Programs

Authors
Choudhury, Rohan [1]
Niinuma, Koichiro [2]
Kitani, Kris M. [1]
Jeni, Laszlo A. [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Fujitsu Res Amer, Santa Clara, CA USA
DOI
10.1007/978-3-031-72920-1_18
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/.
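The generate-then-execute loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the ProViQ code: the `Frame` type, the `find_frames` and `describe` modules, the prompt layout, and `fake_llm`, which returns a canned program in place of a real language-model call. Captions stand in for pixel data so that the example runs without any vision model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    index: int
    caption: str  # toy stand-in for image content


# Hypothetical visual modules exposed to the generated program.
def find_frames(video: List[Frame], query: str) -> List[Frame]:
    """Toy retrieval: keep frames whose caption mentions the query."""
    return [f for f in video if query.lower() in f.caption.lower()]


def describe(frame: Frame) -> str:
    """Toy captioner: return the stored caption."""
    return frame.caption


# Hypothetical API description that would be placed in the prompt.
API_DOC = (
    "find_frames(video, query) -> list of frames\n"
    "describe(frame) -> str"
)


def build_prompt(question: str) -> str:
    """Combine the module API and the question into one prompt string."""
    return f"{API_DOC}\n\nQuestion: {question}\nWrite a function answer(video)."


def fake_llm(prompt: str) -> str:
    """Stand-in for a language-model call; always returns the same program."""
    return (
        "def answer(video):\n"
        "    hits = find_frames(video, 'dog')\n"
        "    if not hits:\n"
        "        return 'unknown'\n"
        "    return describe(hits[-1])\n"
    )


def run_program(source: str, video: List[Frame]) -> str:
    """Execute the generated source with only the modules in scope."""
    namespace = {"find_frames": find_frames, "describe": describe}
    exec(source, namespace)  # defines answer() inside namespace
    return namespace["answer"](video)


video = [
    Frame(0, "a dog sits on the grass"),
    Frame(1, "a child throws a ball"),
    Frame(2, "the dog catches the ball"),
]
question = "What does the dog do last?"
program = fake_llm(build_prompt(question))
result = run_program(program, video)
print(result)  # the dog catches the ball
```

The generated program only sees the functions placed in its namespace, which is why the same loop could in principle serve other video tasks by swapping in different modules. Executing model-generated code with `exec` is unsafe outside a sandbox; the sketch ignores that concern.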
Pages: 315-332
Page count: 18