Video Question Answering with Procedural Programs

Authors
Choudhury, Rohan [1]
Niinuma, Koichiro [2]
Kitani, Kris M. [1]
Jeni, Laszlo A. [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Fujitsu Res Amer, Santa Clara, CA USA
DOI
10.1007/978-3-031-72920-1_18
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/.
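The generate-then-execute loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the ProViQ code: the module names `find_frames` and `describe`, the prompt layout, the toy "video" represented as a list of per-frame captions, and `stub_llm`, which stands in for a real language model by returning a fixed program.

```python
from typing import Callable, Dict, List

# Toy video: one text caption per frame (an assumption; real modules see pixels).
Video = List[str]

def find_frames(video: Video, query: str) -> List[int]:
    """Hypothetical module: indices of frames whose caption contains the query."""
    return [i for i, caption in enumerate(video) if query in caption]

def describe(video: Video, index: int) -> str:
    """Hypothetical module: text description of a single frame."""
    return video[index]

# Short API description placed in the prompt so the model knows what it may call.
API_DOC = (
    "find_frames(video, query) -> list of frame indices matching query\n"
    "describe(video, index) -> text description of one frame"
)

def build_prompt(question: str) -> str:
    """Combine the module API and the question into one prompt string."""
    return f"Modules:\n{API_DOC}\n\nQuestion: {question}\nWrite answer(video)."

def stub_llm(prompt: str) -> str:
    """Stand-in for a language model: always returns the same short program."""
    return (
        "def answer(video):\n"
        "    hits = find_frames(video, 'opens door')\n"
        "    if not hits:\n"
        "        return 'unknown'\n"
        "    nxt = min(hits[0] + 1, len(video) - 1)\n"
        "    return describe(video, nxt)\n"
    )

def run_program(source: str, video: Video, modules: Dict[str, Callable]) -> str:
    """Execute the generated program with the modules in scope.

    exec on model output is unsafe outside a sandbox; this is a sketch only.
    """
    namespace = dict(modules)
    exec(source, namespace)
    return namespace["answer"](video)

video = ["person walks to door", "person opens door", "person picks up box"]
prompt = build_prompt("What does the person do after opening the door?")
program = stub_llm(prompt)
result = run_program(program, video, {"find_frames": find_frames, "describe": describe})
print(result)
```

The sketch keeps the modules separate from the generated program so that the same module set can serve any question without retraining, which mirrors the training-free generalization the abstract claims.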
Pages: 315-332
Page count: 18