Multimodal Bi-direction Guided Attention Networks for Visual Question Answering

Cited by: 0
Authors
Cai, Linqin [1]
Xu, Nuoying [1]
Tian, Hang [1]
Chen, Kejia [2]
Fan, Haodu [1]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Res Ctr Artificial Intelligence & Smart Educ, Chongqing 400065, Peoples R China
[2] Chengdu Huawei Technol Co Ltd, Chengdu 500643, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Attention mechanism; Position attention; Deep learning; FUSION; KNOWLEDGE;
DOI
10.1007/s11063-023-11403-0
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Visual question answering (VQA) has become a research hotspot at the intersection of computer vision and natural language processing. A core problem in VQA is how to fuse multi-modal features from images and questions. This paper proposes a Multimodal Bi-direction Guided Attention Network (MBGAN) for VQA that combines visual relationships with attention to achieve more refined feature fusion. Specifically, self-attention is used to extract image and text features, while guided attention captures the correlation between each image region and the related question. To model the relative positions of different objects, position attention is further introduced to realize relationship modeling and enhance the matching ability of multi-modal features. Given an image and a natural language question, the proposed MBGAN learns visual relation inference and question attention networks in parallel to achieve fine-grained fusion of visual and textual features, and the final answers are then obtained through model stacking. MBGAN achieves 69.41% overall accuracy on the VQA-v1 dataset, 70.79% on VQA-v2, and 68.79% on COCO-QA, showing that it outperforms most state-of-the-art models.
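To make the fusion pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of the bi-directional guided attention idea: self-attention within each modality, followed by each modality attending to the other. The module name GuidedAttentionBlock, the dimensions, and the residual/LayerNorm wiring are illustrative assumptions based only on the abstract; position attention and model stacking are omitted, and this is not the authors' released implementation.

```python
# Hypothetical sketch of bi-directional guided attention for VQA feature fusion.
# Assumptions (not from the paper): module structure, 512-d features, 8 heads,
# residual connections with LayerNorm.
import torch
import torch.nn as nn


class GuidedAttentionBlock(nn.Module):
    """Intra-modal self-attention followed by bi-directional guided (cross) attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Self-attention within each modality (image regions, question tokens).
        img = self.norm[0](img + self.self_img(img, img, img)[0])
        txt = self.norm[1](txt + self.self_txt(txt, txt, txt)[0])
        # Guided attention in both directions: image regions attend to the
        # question, and question tokens attend to the image.
        img = self.norm[2](img + self.img_from_txt(img, txt, txt)[0])
        txt = self.norm[3](txt + self.txt_from_img(txt, img, img)[0])
        return img, txt


if __name__ == "__main__":
    block = GuidedAttentionBlock(dim=512, heads=8)
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected image regions per image
    tokens = torch.randn(2, 14, 512)   # e.g. 14 embedded question tokens
    v, q = block(regions, tokens)
    print(v.shape, q.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 14, 512])
```

In the paper such blocks are reportedly stacked, with position attention injecting relative spatial relations between detected objects; the sketch above only shows the guided-attention fusion step.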
Pages: 11921-11943
Number of pages: 23
Related Papers
50 records in total
  • [31] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6274 - 6283
  • [32] Positional Attention Guided Transformer-Like Architecture for Visual Question Answering
    Mao, Aihua
    Yang, Zhi
    Lin, Ken
    Xuan, Jun
    Liu, Yong-Jin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 6997 - 7009
  • [33] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
    PATTERN RECOGNITION, 2021, 117 (117)
  • [34] Question-Led object attention for visual question answering
    Gao, Lianli
    Cao, Liangfu
    Xu, Xing
    Shao, Jie
    Song, Jingkuan
    NEUROCOMPUTING, 2020, 391 : 227 - 233
  • [35] Question-Agnostic Attention for Visual Question Answering
    Farazi, Moshiur
    Khan, Salman
    Barnes, Nick
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 3542 - 3549
  • [36] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [37] Faithful Multimodal Explanation for Visual Question Answering
    Wu, Jialin
    Mooney, Raymond J.
    BLACKBOXNLP WORKSHOP ON ANALYZING AND INTERPRETING NEURAL NETWORKS FOR NLP AT ACL 2019, 2019, : 103 - 112
  • [38] Question Answering with Hierarchical Attention Networks
    Alpay, Tayfun
    Heinrich, Stefan
    Nelskamp, Michael
    Wermter, Stefan
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [39] Question guided multimodal receptive field reasoning network for fact-based visual question answering
    Zuo, Zicheng
    Sun, Yanhan
    Zhu, Zhenfang
    Wu, Mei
    Zhao, Hui
    Multimedia Tools and Applications, 2025, 84 (12) : 11063 - 11078
  • [40] Multimodal feature-wise co-attention method for visual question answering
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Zou, Fuhao
    Li, Yuan-Fang
    Lu, Ping
    INFORMATION FUSION, 2021, 73 : 1 - 10