DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and Explanation Generation

Cited by: 0
Authors
Zhang, Weifeng [1]
Yu, Jing [2]
Zhao, Wenhong [3]
Ran, Chuan [4]
Affiliations
[1] Jiaxing University, Zhejiang, China
[2] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
[3] Nanhu College, Jiaxing University, Zhejiang, China
[4] IBM Corporation, NC, United States
Keywords
Artificial intelligence; Natural language processing systems; Visual languages
DOI
Not available
Abstract
Visual Question Answering (VQA), which aims to answer natural-language questions according to the content of an image, has attracted extensive attention from the artificial intelligence community. Multimodal reasoning and fusion is a central component of recent VQA models. However, most existing VQA models still fall short in reasoning over and fusing clues from multiple modalities. Furthermore, they lack interpretability because they disregard explanations. We argue that reasoning over and fusing the multiple relations implied in multimodal data contributes to more accurate answers and explanations. In this paper, we design an effective model to achieve fine-grained multimodal reasoning and fusion. Specifically, we propose a Multi-Graph Reasoning and Fusion (MGRF) layer, which adopts pre-trained semantic relation embeddings to reason about complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. MGRF layers can be stacked in depth to form the Deep Multimodal Reasoning and Fusion Network (DMRFNet), which sufficiently reasons about and fuses multimodal relations. Furthermore, an explanation generation module is designed to justify the predicted answer. This justification reveals the rationale behind the model's decision and enhances the model's interpretability. Quantitative and qualitative experimental results on the VQA 2.0 and VQA-E datasets show DMRFNet's effectiveness. © 2021 Elsevier B.V.
Pages: 70-79