AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Cited by: 3
Authors
Li, Rengang [1 ,2 ]
Xu, Cong [1 ,2 ]
Guo, Zhenhua [3 ]
Fan, Baoyu [3 ]
Zhang, Runze [3 ]
Liu, Wei [3 ]
Zhao, Yaqian [3 ]
Gong, Weifeng [3 ]
Wang, Endong [3 ]
Affiliations
[1] Inspur Beijing Elect Informat Ind Co Ltd, Beijing, Peoples R China
[2] State Key Lab High End Server & Storage Technol, Jinan, Peoples R China
[3] Inspur Elect Informat Ind Co Ltd, State Key Lab High End Server & Storage Technol, Jinan, Peoples R China
Keywords
dataset; visual question answering; vision and language
DOI
10.1145/3503161.3548387
CLC number
TP39 [Computer Applications]
Subject classification codes
081203; 0835
Abstract
Visual Question Answering (VQA) serves as a proxy for evaluating the scene understanding of an intelligent agent by answering questions about images. Most VQA benchmarks to date focus on questions that can be answered by understanding the visual content of the scene, such as simple counting, visual attributes, and somewhat more challenging questions that require extra encyclopedic knowledge. However, humans have a remarkable capacity to reason about dynamic interactions with a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that investigates deep scene understanding when an agent takes a certain action. For this task, a model not only needs to answer action-related questions but also to locate the objects involved in the interaction, guaranteeing that it truly comprehends the action. Accordingly, we build a new dataset based on Visual Genome and the ATOMIC knowledge graph, comprising more than 19,000 manually annotated questions, and will make it publicly available. We also annotate the reasoning path followed in deriving the answer to each question. Based on this dataset, we further propose a novel method, called ARE, that comprehends the interaction and explains the reasoning using a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.
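To make the task format concrete, the sketch below shows how one AI-VQA example might be represented: an action-related question paired with its answer, the bounding box of the key interacting object, and an ATOMIC-style reasoning path. This is a minimal illustration only; the class name AIVQAExample, all field names, and the sample values are assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AIVQAExample:
    """One hypothetical AI-VQA record; field names are illustrative assumptions."""
    image_id: str                               # Visual Genome image identifier
    question: str                               # action-related question about the scene
    answer: str                                 # manually annotated answer
    key_object_bbox: Tuple[int, int, int, int]  # (x, y, w, h) of the object involved in the interaction
    reasoning_path: List[str] = field(default_factory=list)  # ordered inference steps drawn from the event knowledge base

# Placeholder instance; the id, coordinates, and text are invented, not real annotations.
example = AIVQAExample(
    image_id="vg_0000001",
    question="If the man pushes the table, what may happen to the cup on it?",
    answer="The cup may tip over and spill.",
    key_object_bbox=(120, 85, 40, 55),
    reasoning_path=[
        "PersonX pushes the table -> as a result, the table moves",
        "the cup rests on the table -> the cup loses support and may fall",
    ],
)

# Per the abstract, a model must produce both `answer` and `key_object_bbox`,
# so evaluation checks answer accuracy and object grounding jointly.
```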
Pages: 5274-5282
Number of pages: 9