MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Cited by: 1
Authors
Min, Juhong [1, 2]
Buch, Shyamal [1]
Nagrani, Arsha [1 ]
Cho, Minsu [2 ]
Schmid, Cordelia [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] POSTECH, Pohang, South Korea
DOI
10.1109/CVPR52733.2024.01257
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
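The staged design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stage bodies below are hypothetical, deterministic stand-ins for the few-shot prompted large models, and every name (`Memory`, `event_parser`, `grounding`, `reasoning`, `run_pipeline`) is an assumption made for illustration. It shows only the control flow: three stages run in order, each reading from and writing to a shared external memory and leaving an inspectable intermediate output.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    """Shared external memory that every stage reads from and writes to."""
    question: str
    num_frames: int
    events: List[str] = field(default_factory=list)
    frame_ids: List[int] = field(default_factory=list)
    answer: str = ""
    trace: List[str] = field(default_factory=list)  # per-stage log for inspection


Stage = Callable[[Memory], None]


def event_parser(mem: Memory) -> None:
    # Hypothetical stand-in for a prompted language model:
    # split the question into events around the word "after".
    mem.events = [part.strip(" ?") for part in mem.question.split(" after ")]
    mem.trace.append(f"event_parser: events={mem.events}")


def grounding(mem: Memory) -> None:
    # Hypothetical stand-in for a vision-language model:
    # keep every other frame as "relevant" to the parsed events.
    mem.frame_ids = list(range(0, mem.num_frames, 2))
    mem.trace.append(f"grounding: frame_ids={mem.frame_ids}")


def reasoning(mem: Memory) -> None:
    # Hypothetical stand-in for the final reasoning model:
    # summarize what the earlier stages stored in memory.
    mem.answer = (
        f"answer derived from {len(mem.frame_ids)} frames "
        f"and {len(mem.events)} events"
    )
    mem.trace.append(f"reasoning: answer={mem.answer!r}")


def run_pipeline(question: str, num_frames: int, stages: List[Stage]) -> Memory:
    """Run the stages in order over one shared memory object."""
    mem = Memory(question=question, num_frames=num_frames)
    for stage in stages:
        stage(mem)
    return mem


if __name__ == "__main__":
    result = run_pipeline(
        "What does the boy do after picking up the ball?",
        num_frames=8,
        stages=[event_parser, grounding, reasoning],
    )
    for line in result.trace:
        print(line)
```

Because each stage communicates only through `Memory`, a stage can be swapped or inspected on its own, which is the property the abstract attributes to decomposing the planning into stages.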
Pages: 13235-13245
Page count: 11
    PROCEEDINGS OF THE 1ST WORKSHOP AND CHALLENGE ON COMPREHENSIVE VIDEO UNDERSTANDING IN THE WILD (COVIEW'18), 2018, : 5 - 5