Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism

Cited by: 5
Authors
Xia, Qihao [1 ]
Yu, Chao [1 ,2 ,3 ]
Hou, Yinong [1 ]
Peng, Pingping [1 ]
Zheng, Zhengqi [1 ,2 ]
Chen, Wen [1 ,2 ,3 ]
Affiliations
[1] East China Normal Univ, Engn Ctr SHMEC Space Informat & GNSS, Shanghai 200241, Peoples R China
[2] East China Normal Univ, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200241, Peoples R China
[3] East China Normal Univ, Key Lab Geog Informat Sci, Minist Educ, Shanghai 200241, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multi-modal alignment; multi-hop attention; visual question answering; feature fusion; SIGMOID FUNCTION; MODEL;
DOI
10.3390/electronics11111778
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
The alignment of information between the image and the question is of great significance in the visual question answering (VQA) task. Self-attention is commonly used to generate attention weights between the image and the question; these weights align the two modalities by letting the model select the image regions that are relevant to the question. However, in standard self-attention the weight between two objects is determined solely by the representations of those two objects, ignoring the influence of the objects around them. This paper proposes a novel multi-hop attention alignment method that enriches self-attention with surrounding information when aligning the two modalities. To exploit position information during alignment, a position embedding mechanism is also proposed: it extracts the position of each object and embeds it so that each question word is aligned with the correct location in the image. On the VQA2.0 dataset, the model achieves a validation accuracy of 65.77%, outperforming several state-of-the-art methods and demonstrating the effectiveness of the proposed approach.
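To make the abstract's mechanism concrete, below is a minimal PyTorch sketch of multi-hop cross-modal alignment with a bounding-box position embedding. The module name MultiHopAlignment, the two-hop default, the 4-coordinate box embedding, and the residual context mixing are illustrative assumptions based only on the abstract, not the paper's released implementation.

    # Minimal sketch of multi-hop attention alignment (hypothetical;
    # not the authors' code). One hop scores a (region, word) pair from
    # the pair alone; each extra hop re-scores it after mixing in context
    # gathered on the previous hop, so surrounding objects also influence
    # the alignment.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHopAlignment(nn.Module):
        def __init__(self, dim: int, num_hops: int = 2):
            super().__init__()
            self.num_hops = num_hops
            self.proj_v = nn.Linear(dim, dim)  # image-region projection
            self.proj_q = nn.Linear(dim, dim)  # question-word projection
            self.proj_pos = nn.Linear(4, dim)  # embeds (x1, y1, x2, y2) boxes

        def forward(self, regions, boxes, words):
            # regions: (B, R, dim) image object features
            # boxes:   (B, R, 4)   normalized bounding-box coordinates
            # words:   (B, W, dim) question word features
            v = self.proj_v(regions) + self.proj_pos(boxes)  # position embedding
            q = self.proj_q(words)
            for _ in range(self.num_hops):
                scores = torch.bmm(v, q.transpose(1, 2))  # (B, R, W) affinities
                attn_r = F.softmax(scores, dim=1)         # distribution over regions
                attn_w = F.softmax(scores, dim=2)         # distribution over words
                # Residual context: the next hop's scores now depend on the
                # neighbours of each region and word, not only on the pair.
                word_ctx = torch.bmm(attn_r.transpose(1, 2), v)  # (B, W, dim)
                region_ctx = torch.bmm(attn_w, q)                # (B, R, dim)
                q = q + word_ctx
                v = v + region_ctx
            return v, q  # context-enriched, aligned features

For example, with 36 region features and their boxes plus 14 word embeddings of width 512, MultiHopAlignment(512)(regions, boxes, words) returns enriched features for both modalities, which a downstream fusion layer could combine to predict the answer.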
Pages: 14
Related Papers
50 records in total
  • [31] Repurposing Entailment for Multi-Hop Question Answering Tasks
    Trivedi, Harsh
    Kwon, Heeyoung
    Khot, Tushar
    Sabharwal, Ashish
    Balasubramanian, Niranjan
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 2948 - 2958
  • [32] Rethinking Label Smoothing on Multi-Hop Question Answering
    Yin, Zhangyue
    Wang, Yuxin
    Hu, Xiannian
    Wu, Yiguang
    Yan, Hang
    Zhang, Xinyu
    Cao, Zhao
    Huang, Xuanjing
    Qiu, Xipeng
    CHINESE COMPUTATIONAL LINGUISTICS, CCL 2023, 2023, 14232 : 72 - 87
  • [33] Multi-Hop Reasoning for Question Answering with Knowledge Graph
    Zhang, Jiayuan
    Cai, Yifei
    Zhang, Qian
    Cao, Zehao
    Cheng, Zhenrong
    Li, Dongmei
    Meng, Xianghao
    2021 IEEE/ACIS 20TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2021-SUMMER), 2021, : 121 - 125
  • [34] Commonsense for Generative Multi-Hop Question Answering Tasks
    Bauer, Lisa
    Wang, Yicheng
    Bansal, Mohit
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 4220 - 4230
  • [35] Multi-hop community question answering based on multi-aspect heterogeneous graph
    Wu, Yongliang
    Yin, Hu
    Zhou, Qianqian
    Liu, Dongbo
    Wei, Dan
    Dong, Jiahao
    INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (01)
  • [36] Constraint-based Multi-hop Question Answering with Knowledge Graph
    Mitra, Sayantan
    Ramnani, Roshni
    Sengupta, Shubhashis
    2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, NAACL-HLT 2022, 2022, : 280 - 288
  • [37] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
WORLD WIDE WEB, 2022, 25 (04): 1607 - 1623
  • [39] Multi-modal Multi-scale State Space Model for Medical Visual Question Answering
    Chen, Qishen
    Bian, Minjie
    He, Wenxuan
    Xu, Huahu
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING-ICANN 2024, PT VIII, 2024, 15023 : 328 - 342