MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

被引:1
|
作者
Min, Juhong [1 ,2 ]
Buchl, Shyamal [1 ]
Nagrani, Arsha [1 ]
Cho, Minsu [2 ]
Schm, Cordelia [1 ]
机构
[1] Google Res, Mountain View, CA 94043 USA
[2] POSTEC, Pohang, South Korea
关键词
D O I
10.1109/CVPR52733.2024.01257
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
引用
收藏
页码:13235 / 13245
页数:11
相关论文
共 50 条
  • [21] Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering
    Zang, Chuanqi
    Wang, Hanqing
    Pei, Mingtao
    Liang, Wei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19027 - 19036
  • [22] Graph-based relational reasoning network for video question answering
    Tan, Tao
    Sun, Guanglu
    MACHINE VISION AND APPLICATIONS, 2025, 36 (01)
  • [23] Dynamic Spatio-Temporal Modular Network for Video Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
  • [24] From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
    Li, Jiangtong
    Niu, Li
    Zhang, Liqing
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 21241 - 21250
  • [25] Tree -of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models
    Zhang, Kun
    Zeng, Jiali
    Meng, Fandong
    Wang, Yuanzhuo
    Sun, Shiqi
    Bai, Long
    Shen, Huawei
    Zhou, Jie
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19560 - 19568
  • [26] DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
    Wang, Jianyu
    Bao, Bing-Kun
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 24 : 3369 - 3380
  • [27] LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering
    Jiang, Jingjing
    Liu, Ziyi
    Zheng, Nanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5002 - 5013
  • [28] HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering
    Liu, Fei
    Liu, Jing
    Wang, Weining
    Lu, Hanqing
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1678 - 1687
  • [29] ReGR: Relation-aware graph reasoning framework for video question answering
    Wang, Zheng
    Li, Fangtao
    Ota, Kaoru
    Dong, Mianxiong
    Wu, Bin
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (04)
  • [30] Chain of Reasoning for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Dong, Xuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31