MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Cited by: 1
Authors
Min, Juhong [1, 2]
Buch, Shyamal [1]
Nagrani, Arsha [1]
Cho, Minsu [2]
Schmid, Cordelia [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] POSTECH, Pohang, South Korea
DOI
10.1109/CVPR52733.2024.01257
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
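The three-stage flow named in the abstract (event parser, grounding stage, reasoning stage, all sharing an external memory) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Memory` fields, the function names, the prompt strings, and the `toy_model` stub that stands in for a few-shot prompted large model are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a few-shot prompted large model: prompt in, text out.
Model = Callable[[str], str]


@dataclass
class Memory:
    """External memory shared by all stages (layout assumed for illustration)."""
    events: List[str] = field(default_factory=list)
    frames: List[int] = field(default_factory=list)
    trace: Dict[str, str] = field(default_factory=dict)  # raw output of each stage


def parse_events(question: str, model: Model, memory: Memory) -> None:
    """Stage 1: split the question into events, without looking at the video."""
    out = model(f"[event parsing] {question}")
    memory.events = [e.strip() for e in out.split(";") if e.strip()]
    memory.trace["parse"] = out


def ground(num_frames: int, model: Model, memory: Memory) -> None:
    """Stage 2: map the parsed events to frame indices, keeping only valid ones."""
    out = model(f"[grounding] events={memory.events} frames=0..{num_frames - 1}")
    candidates = [t.strip() for t in out.split(",")]
    memory.frames = [int(t) for t in candidates if t.isdigit() and int(t) < num_frames]
    memory.trace["ground"] = out


def reason(question: str, model: Model, memory: Memory) -> str:
    """Stage 3: answer using what the earlier stages wrote to memory."""
    answer = model(
        f"[reasoning] {question} | events={memory.events} | frames={memory.frames}"
    )
    memory.trace["reason"] = answer
    return answer


def answer_question(question: str, num_frames: int, model: Model) -> Tuple[str, Memory]:
    """Run the stages in order; the returned memory holds the intermediate outputs."""
    memory = Memory()
    parse_events(question, model, memory)
    ground(num_frames, model, memory)
    return reason(question, model, memory), memory


def toy_model(prompt: str) -> str:
    """Canned responses so the sketch runs without any real model."""
    if prompt.startswith("[event parsing]"):
        return "dog jumps; dog catches ball"
    if prompt.startswith("[grounding]"):
        return "3, 4, 9"  # 9 is out of range for an 8-frame clip and gets dropped
    return "catches the ball"


if __name__ == "__main__":
    answer, memory = answer_question("What does the dog do after jumping?", 8, toy_model)
    print(answer)
    print(memory.trace)
```

Because every stage writes its raw output to `memory.trace`, the intermediate results stay inspectable after the run, which mirrors the interpretable intermediate outputs the abstract describes.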
Pages: 13235-13245
Page count: 11