Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 241
Authors
Duy-Kien Nguyen [1]
Okatani, Takayuki [1, 2]
Affiliations
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Wako, Saitama, Japan
DOI
10.1109/CVPR.2018.00637
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boost accuracy of prediction of answers. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to the correct answer prediction.
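The bi-directional attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scaled dot-product scoring, and the fusion by plain concatenation are all assumptions made for illustration, and a practical layer would normally add learned projections so that the feature size stays fixed when layers are stacked.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys):
    """For each query vector, return a softmax-weighted average of the key vectors.

    Scaled dot-product scoring is an assumption of this sketch.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended

def symmetric_co_attention(regions, words):
    """One symmetric step: regions attend to words, and words attend to regions.

    Fusion here is plain concatenation (an assumption); it doubles the
    feature size, so stacking would need a projection back to `dim`.
    """
    words_for_regions = attend(regions, words)   # one summary of the question per region
    regions_for_words = attend(words, regions)   # one summary of the image per word
    fused_regions = [r + a for r, a in zip(regions, words_for_regions)]
    fused_words = [w + a for w, a in zip(words, regions_for_words)]
    return fused_regions, fused_words

# Toy input: 3 image regions and 2 question words, each a 2-dimensional feature.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [0.5, 0.5]]
fused_regions, fused_words = symmetric_co_attention(regions, words)
```

Both modalities are treated identically, which is the symmetry the abstract refers to: the same `attend` routine is applied in each direction, only with the roles of query and key swapped.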
Pages: 6087 - 6096
Page count: 10
Related papers
50 in total
  • [21] LRCN: Layer-residual Co-Attention Networks for visual question answering
    Han, Dezhi
    Shi, Jingya
    Zhao, Jiahao
    Wu, Huafeng
    Zhou, Yachao
    Li, Ling-Huey
    Khan, Muhammad Khurram
    Li, Kuan-Ching
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 263
  • [22] Multimodal feature-wise co-attention method for visual question answering
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Zou, Fuhao
    Li, Yuan-Fang
    Lu, Ping
    INFORMATION FUSION, 2021, 73 : 1 - 10
  • [23] Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism
    Sharma, Himanshu
    Srivastava, Swati
    IMAGING SCIENCE JOURNAL, 2021, 69 (1-4) : 177 - 189
  • [24] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019 : 412 - 416
  • [25] Bi-direction Co-Attention Network on Visual Question Answering for Blind People
    Tung Le
    Thong Bui
    Huy Tien Nguyen
    Minh Le Nguyen
    FOURTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2021), 2022, 12084
  • [26] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017 : 1839 - 1848
  • [27] SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering
    Cao, Feiqi
    Luo, Siwen
    Nunez, Felipe
    Wen, Zean
    Poon, Josiah
    Han, Soyeon Caren
    ROBOTICS, 2023, 12 (04)
  • [28] Integrating multimodal features by a two-way co-attention mechanism for visual question answering
    Sharma, Himanshu
    Srivastava, Swati
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (21) : 59577 - 59595
  • [29] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017 : 997 - 1002
  • [30] SPCA-Net: a based on spatial position relationship co-attention network for visual question answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbin
    Chai, Yachuang
    VISUAL COMPUTER, 2022, 38 (9-10) : 3097 - 3108