Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering

被引:1
|
作者
Xu, Ning [1 ]
Gao, Yifei [1 ]
Liu, An-An [1 ]
Tian, Hongshuo [1 ]
Zhang, Yongdong [2 ]
机构
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Univ Sci & Technol China, Hefei 230026, Peoples R China
基金
中国国家自然科学基金;
关键词
Visualization; Knowledge based systems; Transformers; Databases; Knowledge graphs; Question answering (information retrieval); Task analysis; Multi-modal validation; domain interaction learning; knowledge-based visual question answering; GRAPH;
D O I
10.1109/TKDE.2024.3384270
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Knowledge-based Visual Question Answering (KB-VQA) aims to answer the image-aware question via the external knowledge, which requires an agent to not only understand images but also explicitly retrieve and integrate knowledge facts. Intuitively, to accurately answer the question, we humans can validate the retrieved knowledge based on our memory, and then align the knowledge facts with the image regions to infer answers. However, most existing methods ignore the process of knowledge validation and alignment. In this paper, we propose the Multi-Modal Validation and Domain Interaction Learning method, which consists of two components: 1) Multi-modal validation for knowledge retrieval. We propose the multi-modal validation module (MMV) to evaluate the confidence of each retrieved knowledge fact via images and questions, which preserves knowledge candidates effective for inferring answers. 2) Domain interaction for knowledge integration. We propose the Domain Interaction TRansformer module (DI-TR) to align visual regions with knowledge facts by the interaction learning in the improved transformer. Specifically, the inter-domain and intra-domain masks are injected into each self-attention layer to control the integration scope. The proposed method outperforms several strong baselines on three widely-used knowledge-based datasets: KRVQA, OK-VQA and VQA2.0. Extensive experiments and ablation studies demonstrate the effectiveness of multi-modal knowledge validation and domain interaction learning.
引用
收藏
页码:6628 / 6640
页数:13
相关论文
共 50 条
  • [21] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    PEERJ COMPUTER SCIENCE, 2023, 9
  • [22] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
    Guo, Zihan
    Han, Dezhi
    SENSORS, 2020, 20 (23) : 1 - 15
  • [23] Hierarchical deep multi-modal network for medical visual question answering
    Gupta D.
    Suman S.
    Ekbal A.
    Expert Systems with Applications, 2021, 164
  • [24] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
  • [25] Multi-Modal Knowledge-Aware Attention Network for Question Answering
    Zhang Y.
    Qian S.
    Fang Q.
    Xu C.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (05): : 1037 - 1045
  • [26] Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
    Hu, Xinyue
    Gu, Lin
    Kobayashi, Kazuma
    Liu, Liangchen
    Zhang, Mengliang
    Harada, Tatsuya
    Summers, Ronald M.
    Zhu, Yingying
    MEDICAL IMAGE ANALYSIS, 2024, 97
  • [27] Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
    Salemi, Alireza
    Rafiee, Mahta
    Zamani, Hamed
    PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 169 - 176
  • [28] Knowledge-based question answering
    Rinaldi, F
    Dowdall, J
    Hess, M
    Mollá, D
    Schwitter, R
    Kaljurand, K
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 1, PROCEEDINGS, 2003, 2773 : 785 - 792
  • [29] Knowledge-based question answering
    Hermjakob, U
    Hovy, EH
    Lin, CY
    6TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XVI, PROCEEDINGS: COMPUTER SCIENCE III, 2002, : 66 - 71
  • [30] Interactive Multi-Modal Question-Answering
    Orasan, Constantin
    COMPUTATIONAL LINGUISTICS, 2012, 38 (02) : 451 - 453