MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Cited by: 1

Authors:

Min, Juhong [1,2]

Buch, Shyamal [1]

Nagrani, Arsha [1]

Cho, Minsu [2]

Schmid, Cordelia [1]

Affiliations:

[1] Google Res, Mountain View, CA 94043 USA

[2] POSTECH, Pohang, South Korea

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.01257

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
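The three-stage structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Memory`, `parse_events`, `ground_events`, `reason`, `answer`), the prompt wording, and the use of text captions in place of video frames are assumptions made purely for illustration. The `llm` and `vlm` arguments are hypothetical callables standing in for few-shot prompted large models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a few-shot prompted model: prompt text in, text out.
Model = Callable[[str], str]


@dataclass
class Memory:
    """Illustrative external memory shared by all stages (structure is assumed)."""
    events: List[str] = field(default_factory=list)
    grounded: Dict[str, List[int]] = field(default_factory=dict)
    notes: List[str] = field(default_factory=list)


def parse_events(question: str, llm: Model, mem: Memory) -> None:
    # Stage 1 (event parser): split the question into events, using language only.
    out = llm(f"List the events in: {question}")
    mem.events = [e.strip() for e in out.split(";") if e.strip()]


def ground_events(frames: List[str], vlm: Model, mem: Memory) -> None:
    # Stage 2 (grounding): record which frames match each parsed event.
    # Frames are plain captions here; a real system would pass visual input.
    for event in mem.events:
        mem.grounded[event] = [
            i for i, frame in enumerate(frames)
            if vlm(f"Does frame '{frame}' show '{event}'?") == "yes"
        ]


def reason(question: str, frames: List[str], llm: Model, mem: Memory) -> str:
    # Stage 3 (reasoning): answer from the evidence accumulated in memory.
    for event, idxs in mem.grounded.items():
        mem.notes.append(f"{event}: " + ", ".join(frames[i] for i in idxs))
    evidence = " | ".join(mem.notes)
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")


def answer(question: str, frames: List[str], llm: Model, vlm: Model) -> Tuple[str, Memory]:
    # Run the stages in order; the returned memory exposes intermediate outputs.
    mem = Memory()
    parse_events(question, llm, mem)
    ground_events(frames, vlm, mem)
    return reason(question, frames, llm, mem), mem
```

Returning the memory alongside the answer mirrors the abstract's point that each stage produces interpretable intermediate outputs: the parsed events and grounded frame indices can be inspected after the call.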

Citation

Pages: 13235-13245

Page count: 11
