MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

被引:1
|
作者
Min, Juhong [1 ,2 ]
Buchl, Shyamal [1 ]
Nagrani, Arsha [1 ]
Cho, Minsu [2 ]
Schm, Cordelia [1 ]
机构
[1] Google Res, Mountain View, CA 94043 USA
[2] POSTEC, Pohang, South Korea
关键词
D O I
10.1109/CVPR52733.2024.01257
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
引用
收藏
页码:13235 / 13245
页数:11
相关论文
共 50 条
  • [41] Video Reference: A Video Question Answering Engine
    Gao, Lei
    Li, Guangda
    Zheng, Yan-Tao
    Hong, Richang
    Chua, Tat-Seng
    ADVANCES IN MULTIMEDIA MODELING, PROCEEDINGS, 2010, 5916 : 799 - +
  • [42] JointLK: Joint Reasoning with Language Models and Knowledge Graphs for Commonsense Question Answering
    Sun, Yueqing
    Shi, Qi
    Qi, Le
    Zhang, Yu
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5049 - 5060
  • [43] Large Language Models are Temporal and Causal Reasoners for Video Question Answering
    Ko, Dohwan
    Lee, Ji Soo
    Kang, Wooyoung
    Roh, Byungseok
    Kim, Hyunwoo J.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4300 - 4316
  • [44] Locate Before Answering: Answer Guided Question Localization for Video Question Answering
    Qian, Tianwen
    Cui, Ran
    Chen, Jingjing
    Peng, Pai
    Guo, Xiaowei
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4554 - 4563
  • [45] Knowledge and reasoning for question answering: Research perspectives
    Saint-Dizier, Patrick
    Moens, Marie-Francine
    INFORMATION PROCESSING & MANAGEMENT, 2011, 47 (06) : 899 - 906
  • [46] Variational Reasoning for Question Answering with Knowledge Graph
    Zhang, Yuyu
    Dai, Hanjun
    Kozareva, Zornitsa
    Smola, Alexander J.
    Song, Le
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 6069 - 6076
  • [47] Sequential Visual Reasoning for Visual Question Answering
    Liu, Jinlai
    Wu, Chenfei
    Wang, Xiaojie
    Dong, Xuan
    PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 410 - 415
  • [48] Preference reasoning in advanced question answering systems
    Benamara, Farah
    Kaci, Souhila
    AI 2006: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4304 : 1116 - +
  • [49] Visual question answering by pattern matching and reasoning
    Zhan, Huayi
    Xiong, Peixi
    Wang, Xin
    Yang, Lan
    NEUROCOMPUTING, 2022, 467 : 323 - 336
  • [50] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30