AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Cited by: 3

Authors:

Li, Rengang ^{[1,2]}

Xu, Cong ^{[1,2]}

Guo, Zhenhua ^{[3]}

Fan, Baoyu ^{[3]}

Zhang, Runze ^{[3]}

Liu, Wei ^{[3]}

Zhao, Yaqian ^{[3]}

Gong, Weifeng ^{[3]}

Wang, Endong ^{[3]}

Affiliations:

[1] Inspur Beijing Elect Informat Ind Co Ltd, Beijing, Peoples R China

[2] State Key Lab High End Server & Storage Technol, Jinan, Peoples R China

[3] Inspur Elect Informat Ind Co Ltd, State Key Lab High End Server & Storage Technol, Jinan, Peoples R China

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

dataset; visual question answer; vision and language;

DOI:

10.1145/3503161.3548387

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Visual Question Answering (VQA) serves as a proxy for evaluating the scene understanding of an intelligent agent by having it answer questions about images. Most VQA benchmarks to date focus on questions that can be answered by understanding the visual content of the scene, such as simple counting and visual attributes, or on somewhat more challenging questions that require extra encyclopedic knowledge. However, humans have a remarkable capacity to reason about dynamic interactions in a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that investigates deep scene understanding when the agent takes a certain action. For this task, a model needs not only to answer action-related questions but also to locate the objects involved in the interaction, to ensure that it truly comprehends the action. Accordingly, we build a new dataset based on Visual Genome and the ATOMIC knowledge graph, including more than 19,000 manually annotated questions, and will make it publicly available. In addition, we provide an annotation of the reasoning path followed in deriving the answer to each question. Based on the dataset, we further propose a novel method, called ARE, that can comprehend the interaction and explain the reason based on a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.

Citation

Pages: 5274-5282

Page count: 9
