Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism

Cited by: 5
Authors
Xia, Qihao [1 ]
Yu, Chao [1 ,2 ,3 ]
Hou, Yinong [1 ]
Peng, Pingping [1 ]
Zheng, Zhengqi [1 ,2 ]
Chen, Wen [1 ,2 ,3 ]
Affiliations
[1] East China Normal Univ, Engn Ctr SHMEC Space Informat & GNSS, Shanghai 200241, Peoples R China
[2] East China Normal Univ, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200241, Peoples R China
[3] East China Normal Univ, Key Lab Geog Informat Sci, Minist Educ, Shanghai 200241, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multi-modal alignment; multi-hop attention; visual question answering; feature fusion; SIGMOID FUNCTION; MODEL;
DOI
10.3390/electronics11111778
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Aligning information between the image and the question is of great importance in the visual question answering (VQA) task. Self-attention is commonly used to generate attention weights between the image and the question, and these weights align the two modalities: through them, the model selects the image regions relevant to the question. However, in standard self-attention the weight between two objects is determined solely by the representations of those two objects, ignoring the influence of the surrounding objects. This contribution proposes a novel multi-hop attention alignment method that enriches self-attention with surrounding information when aligning the two modalities. To exploit position information during alignment, we further propose a position embedding mechanism that extracts the position of each object so that question words are aligned with the correct locations in the image. On the VQA2.0 dataset, our model achieves a validation accuracy of 65.77%, outperforming several state-of-the-art methods; this result demonstrates the effectiveness of the proposed approach.
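The core idea in the abstract can be illustrated with a minimal sketch: single-hop self-attention scores each pair of objects using only those two representations, while a multi-hop variant propagates the attention matrix so that neighbors of neighbors also contribute. This is an illustrative NumPy reconstruction under assumed names (`self_attention_weights`, `multi_hop_weights`, `hops`), not the authors' implementation; the paper's actual formulation and position embedding are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_weights(X):
    """Single-hop attention: the weight between objects i and j
    depends only on the representations x_i and x_j."""
    d_k = X.shape[1]
    scores = X @ X.T / np.sqrt(d_k)   # pairwise scaled dot-product scores
    return softmax(scores, axis=-1)   # row-stochastic attention matrix

def multi_hop_weights(X, hops=2):
    """Multi-hop attention: repeatedly propagate the attention matrix,
    so each extra hop mixes in information from surrounding objects."""
    A = self_attention_weights(X)
    M = A
    for _ in range(hops - 1):
        M = M @ A                     # two hops reach neighbors of neighbors
    return M

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))       # 5 objects (e.g., image regions), dim 8
W = multi_hop_weights(X, hops=2)
print(W.shape)                        # (5, 5); each row still sums to 1
```

Because each hop multiplies by a row-stochastic matrix, the multi-hop weights remain a valid attention distribution over objects while incorporating context beyond the original pair.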
Pages: 14
Related Papers
50 records in total
  • [41] A Multi-modal Debiasing Model with Dynamical Constraint for Robust Visual Question Answering
    Li, Yu
    Hu, Bojie
    Zhang, Fengshuo
    Yu, Yahan
    Liu, Jian
    Chen, Yufeng
    Xu, Jinan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 5032 - 5045
  • [42] Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation
    Xu, Yiming
    Chen, Lin
    Cheng, Zhongwei
    Duan, Lixin
    Luo, Jiebo
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 367 - 376
  • [43] Multi-modal Contextual Graph Neural Network for Text Visual Question Answering
    Liang, Yaoyuan
    Wang, Xin
    Duan, Xuguang
    Zhu, Wenwu
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 3491 - 3498
  • [44] Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance
    Wang, Jianfeng
    Zhang, Anda
    Du, Huifang
    Wang, Haofen
    Zhang, Wenqiang
    PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE GRAPHS, IJCKG 2022, 2022, : 115 - 120
  • [45] Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering
    Xu, Ning
    Gao, Yifei
    Liu, An-An
    Tian, Hongshuo
    Zhang, Yongdong
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 6628 - 6640
  • [46] Medical Visual Question-Answering Model Based on Knowledge Enhancement and Multi-Modal Fusion
    Zhang, Dianyuan
    Yu, Chuanming
    An, Lu
    Proceedings of the Association for Information Science and Technology, 2024, 61 (01) : 703 - 708
  • [47] HGMAN: Multi-Hop and Multi-Answer Question Answering Based on Heterogeneous Knowledge Graph
    Wang, Xu
    Zhao, Shuai
    Cheng, Bo
    Han, Jiale
    Li, Yingling
    Yang, Hao
    Nan, Guoshun
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 13953 - 13954
  • [48] A Survey of Multi-modal Question Answering Systems for Robotics
    Liu, Xiaomeng
    Long, Fei
    2017 2ND INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM), 2017, : 189 - 194
  • [49] Multi-modal multi-hop interaction network for dialogue response generation
    Zhou, Jie
    Tian, Junfeng
    Wang, Rui
    Wu, Yuanbin
    Yan, Ming
    He, Liang
    Huang, Xuanjing
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 227
  • [50] Multi-view Semantic Reasoning Networks for Multi-hop Question Answering
    Long, X.
    Zhao, R.
    Sun, J.
    Ju, S.
    Gongcheng Kexue Yu Jishu/Advanced Engineering Sciences, 2023, 55 (02): : 285 - 297