Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 241
Authors
Nguyen, Duy-Kien [1]
Okatani, Takayuki [1,2]
Affiliations
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Wako, Saitama, Japan
DOI
10.1109/CVPR.2018.00637
CLC classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A key to visual question answering (VQA) lies in how visual and language features extracted from the input image and question are fused. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. The architecture can be stacked into a hierarchy that performs multi-step interactions between an image-question pair. Experiments show that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating that the proposed attention mechanism generates reasonable attention maps over images and questions, which leads to correct answer prediction.
Pages: 6087-6096
Page count: 10
Related papers
50 items in total
  • [11] Multi-Channel Co-Attention Network for Visual Question Answering
    Tian, Weidong
    He, Bin
    Wang, Nanxun
    Zhao, Zhongqiu
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [12] Cross-modality co-attention networks for visual question answering
    Han, Dezhi
    Zhou, Shuli
    Li, Kuan Ching
    de Mello, Rodrigo Fernandes
    SOFT COMPUTING, 2021, 25 (07) : 5411 - 5421
  • [13] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
    PATTERN RECOGNITION, 2021, 117
  • [15] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [16] A medical visual question answering approach based on co-attention networks
    Cui, W.
    Shi, W.
    Shao, H.
    SHENGWU YIXUE GONGCHENGXUE ZAZHI / JOURNAL OF BIOMEDICAL ENGINEERING, 2024, 41 (03) : 560 - 568
  • [18] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    THE VISUAL COMPUTER, 2023, 39 (11) : 5783 - 5795
  • [19] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), 2021 : 1653 - 1662