Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering

Cited by: 1

Authors:

Xu, Ning [1]

Gao, Yifei [1]

Liu, An-An [1]

Tian, Hongshuo [1]

Zhang, Yongdong [2]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

[2] Univ Sci & Technol China, Hefei 230026, Peoples R China

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2024 / Vol. 36 / No. 11

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Knowledge based systems; Transformers; Databases; Knowledge graphs; Question answering (information retrieval); Task analysis; Multi-modal validation; domain interaction learning; knowledge-based visual question answering; GRAPH;

DOI:

10.1109/TKDE.2024.3384270

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Knowledge-based Visual Question Answering (KB-VQA) aims to answer image-aware questions using external knowledge, which requires an agent not only to understand images but also to explicitly retrieve and integrate knowledge facts. Intuitively, to answer a question accurately, humans validate the retrieved knowledge against memory and then align the knowledge facts with image regions to infer the answer. However, most existing methods ignore this process of knowledge validation and alignment. In this paper, we propose the Multi-Modal Validation and Domain Interaction Learning method, which consists of two components: 1) Multi-modal validation for knowledge retrieval. We propose the multi-modal validation module (MMV) to evaluate the confidence of each retrieved knowledge fact using images and questions, which preserves the knowledge candidates that are effective for inferring answers. 2) Domain interaction for knowledge integration. We propose the Domain Interaction TRansformer module (DI-TR) to align visual regions with knowledge facts through interaction learning in an improved transformer. Specifically, inter-domain and intra-domain masks are injected into each self-attention layer to control the integration scope. The proposed method outperforms several strong baselines on three widely used knowledge-based datasets: KRVQA, OK-VQA and VQA2.0. Extensive experiments and ablation studies demonstrate the effectiveness of multi-modal knowledge validation and domain interaction learning.
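
The abstract states that inter-domain and intra-domain masks are injected into each self-attention layer to control the integration scope, but it gives no formulas. The following is a minimal sketch of one way such masking could work, not the authors' DI-TR implementation. The function names `domain_masks` and `masked_attention`, the domain labels "v" (visual region) and "k" (knowledge fact), and the choice to use the same vectors as queries, keys and values are all assumptions made for illustration.

```python
import math

def domain_masks(domains):
    """Build boolean attention masks from per-token domain labels.

    intra[i][j] is True when tokens i and j share a domain;
    inter[i][j] is True when they belong to different domains.
    (Hypothetical helper; the paper's exact mask definition is not given.)
    """
    n = len(domains)
    intra = [[domains[i] == domains[j] for j in range(n)] for i in range(n)]
    inter = [[domains[i] != domains[j] for j in range(n)] for i in range(n)]
    return intra, inter

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out positions get zero weight."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Scores only for positions the mask allows; others stay None.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if mask[i][j] else None
            for j, kj in enumerate(k)
        ]
        allowed = [s for s in scores if s is not None]
        if not allowed:
            # No visible position for this token: return a zero vector.
            out.append([0.0] * len(v[0]))
            continue
        m = max(allowed)  # subtract the max for numerical stability
        weights = [math.exp(s - m) if s is not None else 0.0 for s in scores]
        z = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / z
            for c in range(len(v[0]))
        ])
    return out

# Two visual-region tokens followed by one knowledge-fact token (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
intra_mask, inter_mask = domain_masks(["v", "v", "k"])
within_domain = masked_attention(tokens, tokens, tokens, intra_mask)
across_domain = masked_attention(tokens, tokens, tokens, inter_mask)
```

In this sketch the intra-domain mask restricts each token to its own modality, while the inter-domain mask forces visual tokens to attend only to knowledge tokens and vice versa. Whether the method alternates, stacks or combines the two masks is not specified in the abstract.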

Citation

Pages: 6628-6640

Page count: 13

50 items in total

[21] The multi-modal fusion in visual question answering: a review of attention mechanisms
Lu, Siyu
Liu, Mingzhe
Yin, Lirong
Yin, Zhengtong
Liu, Xuan
Zheng, Wenfeng
PEERJ COMPUTER SCIENCE, 2023, 9
[22] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
Guo, Zihan
Han, Dezhi
SENSORS, 2020, 20 (23) : 1 - 15
[23] Hierarchical deep multi-modal network for medical visual question answering
Gupta D.
Suman S.
Ekbal A.
Expert Systems with Applications, 2021, 164
[24] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
Yu, Zhou
Yu, Jun
Fan, Jianping
Tao, Dacheng
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
[25] Multi-Modal Knowledge-Aware Attention Network for Question Answering
Zhang Y.
Qian S.
Fang Q.
Xu C.
Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (05) : 1037 - 1045
[26] Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
Hu, Xinyue
Gu, Lin
Kobayashi, Kazuma
Liu, Liangchen
Zhang, Mengliang
Harada, Tatsuya
Summers, Ronald M.
Zhu, Yingying
MEDICAL IMAGE ANALYSIS, 2024, 97
[27] Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
Salemi, Alireza
Rafiee, Mahta
Zamani, Hamed
PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 169 - 176
[28] Knowledge-based question answering
Rinaldi, F
Dowdall, J
Hess, M
Mollá, D
Schwitter, R
Kaljurand, K
KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 1, PROCEEDINGS, 2003, 2773 : 785 - 792
[29] Knowledge-based question answering
Hermjakob, U
Hovy, EH
Lin, CY
6TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XVI, PROCEEDINGS: COMPUTER SCIENCE III, 2002, : 66 - 71
[30] Interactive Multi-Modal Question-Answering
Orasan, Constantin
COMPUTATIONAL LINGUISTICS, 2012, 38 (02) : 451 - 453
