Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism

Cited by: 5
Authors
Xia, Qihao [1]
Yu, Chao [1,2,3]
Hou, Yinong [1]
Peng, Pingping [1]
Zheng, Zhengqi [1,2]
Chen, Wen [1,2,3]
Affiliations
[1] East China Normal Univ, Engn Ctr SHMEC Space Informat & GNSS, Shanghai 200241, Peoples R China
[2] East China Normal Univ, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200241, Peoples R China
[3] East China Normal Univ, Key Lab Geog Informat Sci, Minist Educ, Shanghai 200241, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
multi-modal alignment; multi-hop attention; visual question answering; feature fusion; SIGMOID FUNCTION; MODEL;
DOI
10.3390/electronics11111778
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Aligning information between the image and the question is of great significance in the visual question answering (VQA) task. Self-attention is commonly used to generate attention weights between the image and the question, and these weights align the two modalities: through them, the model selects the image regions that are relevant to the question. However, with the self-attention mechanism, the attention weight between two objects is determined only by the representations of those two objects, ignoring the influence of the other objects around them. This contribution proposes a novel multi-hop attention alignment method that enriches surrounding information when self-attention is used to align the two modalities. To also utilize position information in the alignment, we propose a position embedding mechanism that extracts the position information of each object and uses it to align each question word with the correct position in the image. In experiments on the VQA2.0 dataset, our model achieves a validation accuracy of 65.77%, outperforming several state-of-the-art methods. The experimental results indicate that the proposed methods are effective and improve performance.
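The contrast the abstract draws between pairwise attention and multi-hop attention can be illustrated with a small sketch. This record does not include the paper's equations, so the code below is not the authors' formulation: the function names, the `hops` and `alpha` parameters, and the idea of propagating weights through region-to-region affinities are all assumptions chosen only to show how surrounding objects could influence a word-to-region weight.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scaled_dot_attention(queries, keys):
    """Row-normalized weights that depend only on each query/key pair."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def multi_hop_attention(words, regions, hops=2, alpha=0.5):
    """Hypothetical multi-hop variant (an assumption, not the paper's method).

    One-hop word-to-region weights are mixed with weights propagated
    through region-to-region affinities, so neighbouring regions can
    influence each word-to-region weight.
    """
    one_hop = scaled_dot_attention(words, regions)
    affinity = scaled_dot_attention(regions, regions)
    result = one_hop
    for _ in range(hops - 1):
        propagated = matmul(result, affinity)
        result = [[alpha * x + (1 - alpha) * y for x, y in zip(r1, rp)]
                  for r1, rp in zip(one_hop, propagated)]
    return result


# Toy features: 2 question words and 3 image regions, dimension 2.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
one_hop = scaled_dot_attention(words, regions)
multi_hop = multi_hop_attention(words, regions, hops=2, alpha=0.5)
```

Because both the one-hop weights and the affinity matrix are row-normalized, each row of the mixed result still sums to 1, so it can be used as an attention distribution in the same way as the one-hop weights.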
Pages: 14
Related papers
50 in total
  • [21] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Beguem
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [22] Hierarchical deep multi-modal network for medical visual question answering
    Gupta D.
    Suman S.
    Ekbal A.
    Expert Systems with Applications, 2021, 164
  • [23] Question Calibration and Multi-Hop Modeling for Temporal Question Answering
    Xue, Chao
    Liang, Di
    Wang, Pengfei
    Zhang, Jing
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19332 - 19340
  • [24] Ask to Understand: Question Generation for Multi-hop Question Answering
    Li, Jiawei
    Ren, Mucheng
    Gao, Yang
    Yang, Yizhe
    CHINESE COMPUTATIONAL LINGUISTICS, CCL 2023, 2023, 14232 : 19 - 36
  • [25] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    ELECTRONICS, 2023, 12 (06)
  • [26] Interactive Multi-Modal Question-Answering
    Orasan, Constantin
    COMPUTATIONAL LINGUISTICS, 2012, 38 (02) : 451 - 453
  • [27] MoQA - A Multi-modal Question Answering Architecture
    Haurilet, Monica
    Al-Halah, Ziad
    Stiefelhagen, Rainer
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 106 - 113
  • [28] Hierarchical Graph Network for Multi-hop Question Answering
    Fang, Yuwei
    Sun, Siqi
    Gan, Zhe
    Pillai, Rohit
    Wang, Shuohang
    Liu, Jingjing
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 8823 - 8838
  • [29] Multi-hop question answering using sparse graphs
    Hemmati, Nima
    Ghassem-Sani, Gholamreza
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 126
  • [30] Is Graph Structure Necessary for Multi-hop Question Answering?
    Shao, Nan
    Cui, Yiming
    Liu, Ting
    Wang, Shijin
    Hu, Guoping
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7187 - 7192