From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Times cited: 17

Authors:

Li, Jiangtong [1]

Niu, Li [1]

Zhang, Liqing [1]

Affiliation:

[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

Funding:

National Key Research and Development Program of China; U.S. National Science Foundation

DOI:

10.1109/CVPR52688.2022.02059

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Video understanding has achieved great success in representation learning, such as video captioning, video object grounding, and descriptive video question answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To facilitate deeper video understanding towards video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and then providing a proper reason for the answer. Through extensive experiments on existing VideoQA methods, we find that state-of-the-art methods are strong in description but weak in reasoning. We hope that Causal-VidQA can guide the research of video understanding from representation learning to deeper reasoning. The dataset and related resources are available at https://github.com/bcmi/Causal-VidQA.git.
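The two-step format for commonsense questions can be illustrated with a short scoring sketch. It assumes, as a hypothesis rather than the dataset's documented protocol, that each commonsense question is multiple choice with a separate answer step and reason step. It also assumes that a question counts as fully solved only when both choices are correct. The function `two_step_accuracy` and the tuple layout `(answer_index, reason_index)` are hypothetical names chosen for illustration. Consult the Causal-VidQA repository for the official evaluation code.

```python
from typing import Sequence, Tuple

# Hypothetical layout: each item is (answer_index, reason_index) for one
# commonsense question. This is an illustrative assumption, not the
# official Causal-VidQA data format.
Choice = Tuple[int, int]


def two_step_accuracy(
    preds: Sequence[Choice], labels: Sequence[Choice]
) -> Tuple[float, float, float]:
    """Return (answer acc, reason acc, joint acc) over paired items.

    The joint score counts an item only when both the answer step and the
    reason step match the label (assumed scoring rule).
    """
    if len(preds) != len(labels) or not preds:
        raise ValueError("preds and labels must be non-empty and equally long")
    n = len(preds)
    answer_hits = sum(p[0] == g[0] for p, g in zip(preds, labels))
    reason_hits = sum(p[1] == g[1] for p, g in zip(preds, labels))
    # Tuple equality: both the answer and the reason must match.
    joint_hits = sum(p == g for p, g in zip(preds, labels))
    return answer_hits / n, reason_hits / n, joint_hits / n


if __name__ == "__main__":
    preds = [(0, 1), (2, 2), (3, 0)]
    labels = [(0, 1), (2, 4), (1, 0)]
    print(two_step_accuracy(preds, labels))
```

Consider predictions [(0, 1), (2, 2), (3, 0)] scored against labels [(0, 1), (2, 4), (1, 0)]. The sketch returns answer and reason accuracies of 2/3 each, but a joint accuracy of only 1/3. Under this assumed rule, the joint score is the stricter test of whether a model answers for the right reason.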


Pages: 21241-21250

Page count: 10
