AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Cited by: 3
Authors
Li, Rengang [1 ,2 ]
Xu, Cong [1 ,2 ]
Guo, Zhenhua [3 ]
Fan, Baoyu [3 ]
Zhang, Runze [3 ]
Liu, Wei [3 ]
Zhao, Yaqian [3 ]
Gong, Weifeng [3 ]
Wang, Endong [3 ]
Affiliations
[1] Inspur Beijing Elect Informat Ind Co Ltd, Beijing, Peoples R China
[2] State Key Lab High End Server & Storage Technol, Jinan, Peoples R China
[3] Inspur Elect Informat Ind Co Ltd, State Key Lab High End Server & Storage Technol, Jinan, Peoples R China
Keywords
dataset; visual question answering; vision and language
DOI
10.1145/3503161.3548387
CLC number
TP39 [Computer Applications]
Subject classification codes
081203; 0835
Abstract
Visual Question Answering (VQA) serves as a proxy for evaluating the scene understanding of an intelligent agent by answering questions about images. Most VQA benchmarks to date focus on questions that can be answered by understanding the visual content of the scene, such as simple counting, visual attributes, and somewhat more challenging questions that require extra encyclopedic knowledge. However, humans have a remarkable capacity to reason about dynamic interactions with a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that investigates deep scene understanding when an agent takes a certain action. For this task, a model not only needs to answer action-related questions but also to locate the objects involved in the interaction, guaranteeing that it truly comprehends the action. Accordingly, we build a new dataset based on Visual Genome and the ATOMIC knowledge graph, comprising more than 19,000 manually annotated questions, and will make it publicly available. We also annotate the reasoning path followed in deriving the answer to each question. Based on this dataset, we further propose a novel method, called ARE, that comprehends the interaction and explains the reasoning using a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.
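To make the task format concrete, the sketch below shows how one AI-VQA example might be represented: an action-related question paired with its answer, the bounding box of the key interacting object, and an ATOMIC-style reasoning path. This is a minimal illustration only; the class name AIVQAExample, all field names, and the sample values are assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AIVQAExample:
    """One hypothetical AI-VQA record; field names are illustrative assumptions."""
    image_id: str                               # Visual Genome image identifier
    question: str                               # action-related question about the scene
    answer: str                                 # manually annotated answer
    key_object_bbox: Tuple[int, int, int, int]  # (x, y, w, h) of the object involved in the interaction
    reasoning_path: List[str] = field(default_factory=list)  # ordered inference steps drawn from the event knowledge base

# Placeholder instance; the id, coordinates, and text are invented, not real annotations.
example = AIVQAExample(
    image_id="vg_0000001",
    question="If the man pushes the table, what may happen to the cup on it?",
    answer="The cup may tip over and spill.",
    key_object_bbox=(120, 85, 40, 55),
    reasoning_path=[
        "PersonX pushes the table -> as a result, the table moves",
        "the cup rests on the table -> the cup loses support and may fall",
    ],
)

# Per the abstract, a model must produce both `answer` and `key_object_bbox`,
# so evaluation checks answer accuracy and object grounding jointly.
```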
Pages: 5274-5282
Number of pages: 9