Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering

Cited by: 1
Authors
Xu, Ning [1 ]
Gao, Yifei [1 ]
Liu, An-An [1 ]
Tian, Hongshuo [1 ]
Zhang, Yongdong [2 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Univ Sci & Technol China, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Knowledge based systems; Transformers; Databases; Knowledge graphs; Question answering (information retrieval); Task analysis; Multi-modal validation; domain interaction learning; knowledge-based visual question answering; GRAPH;
DOI
10.1109/TKDE.2024.3384270
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Knowledge-based Visual Question Answering (KB-VQA) aims to answer image-aware questions using external knowledge, which requires an agent not only to understand images but also to explicitly retrieve and integrate knowledge facts. Intuitively, to answer a question accurately, humans validate the retrieved knowledge against their memory and then align the knowledge facts with image regions to infer the answer. However, most existing methods ignore this process of knowledge validation and alignment. In this paper, we propose the Multi-Modal Validation and Domain Interaction Learning method, which consists of two components: 1) multi-modal validation for knowledge retrieval, where the multi-modal validation module (MMV) evaluates the confidence of each retrieved knowledge fact against the image and question, preserving only the knowledge candidates that are effective for inferring answers; and 2) domain interaction for knowledge integration, where the Domain Interaction TRansformer module (DI-TR) aligns visual regions with knowledge facts through interaction learning in an improved transformer. Specifically, inter-domain and intra-domain masks are injected into each self-attention layer to control the scope of integration. The proposed method outperforms several strong baselines on three widely used knowledge-based datasets: KRVQA, OK-VQA, and VQA2.0. Extensive experiments and ablation studies demonstrate the effectiveness of multi-modal knowledge validation and domain interaction learning.
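The two components described in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical sketch (not the authors' released code) of (1) MMV-style validation, which scores each retrieved fact against a fused image-question query and keeps only confident candidates, and (2) DI-TR-style domain masking, where intra-domain and inter-domain boolean masks are injected into a self-attention layer over the concatenated visual and knowledge tokens. All function names, shapes, and the confidence threshold are illustrative assumptions.

```python
# Sketch of the two ideas in the abstract, under assumed interfaces:
# (1) validate_facts: keep only retrieved knowledge facts whose confidence
#     against the image-question query clears a threshold (MMV-like step);
# (2) build_domain_mask + masked_self_attention: inject intra-/inter-domain
#     masks into self-attention over [visual tokens; knowledge tokens]
#     (DI-TR-like step). Names and threshold are hypothetical.

import torch
import torch.nn.functional as F

def validate_facts(fact_emb, query_emb, threshold=0.5):
    """Score each knowledge fact against the fused image-question query
    and keep confident candidates.
    fact_emb: (num_facts, d), query_emb: (d,)"""
    conf = torch.sigmoid(fact_emb @ query_emb)   # (num_facts,) confidences
    keep = conf >= threshold
    return fact_emb[keep], conf[keep]

def build_domain_mask(num_vis, num_kb, allow_inter=True):
    """Boolean attention mask over the concatenated sequence
    [visual tokens; knowledge tokens]. Intra-domain attention is always
    allowed; inter-domain attention is gated by allow_inter."""
    n = num_vis + num_kb
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_vis, :num_vis] = True              # intra: visual-visual
    mask[num_vis:, num_vis:] = True              # intra: knowledge-knowledge
    if allow_inter:
        mask[:num_vis, num_vis:] = True          # inter: visual -> knowledge
        mask[num_vis:, :num_vis] = True          # inter: knowledge -> visual
    return mask

def masked_self_attention(x, mask):
    """One single-head self-attention layer with the domain mask injected
    (Q/K/V projections omitted for brevity). x: (n, d)."""
    d = x.size(-1)
    scores = (x @ x.T) / d ** 0.5                # (n, n) attention logits
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ x         # (n, d) attended features
```

For example, with 36 region features and 5 validated facts, `masked_self_attention(x, build_domain_mask(36, 5))` lets attention flow across both domains, while `allow_inter=False` restricts each token to its own domain, which is the kind of integration-scope control the mask injection is meant to provide.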
Pages: 6628-6640
Number of pages: 13
Related papers (50 items in total)
  • [1] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    ELECTRONICS, 2023, 12 (06)
  • [2] MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering
    Khademi, Mahmoud
    Yang, Ziyi
    Frujeri, Felipe Vieira
    Zhu, Chenguang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 6571 - 6581
  • [3] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09) : 3894 - 3908
  • [4] Multi-modal Question Answering System Driven by Domain Knowledge Graph
    Zhao, Zhengwei
    Wang, Xiaodong
    Xu, Xiaowei
    Wang, Qing
    5TH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING AND COMMUNICATIONS (BIGCOM 2019), 2019, : 43 - 47
  • [5] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    PATTERN RECOGNITION, 2020, 108
  • [6] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [7] Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation
    Xu, Yiming
    Chen, Lin
    Cheng, Zhongwei
    Duan, Lixin
    Luo, Jiebo
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 367 - 376
  • [8] Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance
    Wang, Jianfeng
    Zhang, Anda
    Du, Huifang
    Wang, Haofen
    Zhang, Wenqiang
    PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE GRAPHS, IJCKG 2022, 2022, : 115 - 120
  • [9] Multi-Modal Answer Validation for Knowledge-Based VQA
    Wu, Jialin
    Lu, Jiasen
    Sabharwal, Ashish
    Mottaghi, Roozbeh
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2712 - 2721
  • [10] Medical Visual Question-Answering Model Based on Knowledge Enhancement and Multi-Modal Fusion
    Zhang, Dianyuan
    Yu, Chuanming
    An, Lu
    PROCEEDINGS OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2024, 61 (01) : 703 - 708