Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 241
Authors:
Duy-Kien Nguyen [1]
Okatani, Takayuki [1,2]
Affiliations:
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Wako, Saitama, Japan
DOI: 10.1109/CVPR.2018.00637
Chinese Library Classification (CLC): TP18 (Artificial Intelligence Theory)
Subject classification codes: 081104; 0812; 0835; 1405
Abstract:
A key to visual question answering (VQA) lies in how visual and language features extracted from an input image and question are fused. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. The architecture can be stacked to form a hierarchy for multi-step interactions between an image-question pair. Through experiments, we show that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism generates reasonable attention maps over images and questions, leading to correct answer prediction.
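The dense, symmetric co-attention the abstract describes can be sketched as follows. This is a minimal, illustrative example (not the authors' implementation): an affinity matrix between image-region and question-word feature vectors is computed, then normalized in both directions so that each region attends over words and each word attends over regions; the function name and toy inputs are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dense_coattention(V, Q):
    """One symmetric co-attention step (illustrative sketch).

    V: list of image-region feature vectors (n_regions x d)
    Q: list of question-word feature vectors (n_words x d)
    Returns context features for both modalities.
    """
    # Dense affinity between every region-word pair: A[i][j] = <V[i], Q[j]>.
    A = [[sum(v_k * q_k for v_k, q_k in zip(v, q)) for q in Q] for v in V]
    # Each image region attends over question words (softmax across each row).
    words_per_region = [softmax(row) for row in A]
    # Each question word attends over image regions (softmax down each column).
    regions_per_word = [softmax(list(col)) for col in zip(*A)]
    d = len(V[0])
    # Question context for each region, and image context for each word.
    V_ctx = [[sum(w * Q[j][k] for j, w in enumerate(words_per_region[i]))
              for k in range(d)] for i in range(len(V))]
    Q_ctx = [[sum(w * V[i][k] for i, w in enumerate(regions_per_word[j]))
              for k in range(d)] for j in range(len(Q))]
    return V_ctx, Q_ctx

regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 image regions, dim 2
words = [[1.0, 0.0], [0.0, 1.0]]                # 2 question words, dim 2
V_ctx, Q_ctx = dense_coattention(regions, words)
print(len(V_ctx), len(Q_ctx))  # 3 2
```

Stacking several such steps, as the paper proposes, corresponds to feeding the attended outputs back in as the next layer's inputs, giving multi-step interactions between the image-question pair.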
Pages: 6087 - 6096 (10 pages)
Related Papers (50 in total)
  • [31] Enhancing visual question answering with a two-way co-attention mechanism and integrated multimodal features
    Agrawal, Mayank
    Jalal, Anand Singh
    Sharma, Himanshu
    COMPUTATIONAL INTELLIGENCE, 2024, 40 (01)
  • [32] SPCA-Net: a based on spatial position relationship co-attention network for visual question answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbin
    Chai, Yachuang
    THE VISUAL COMPUTER, 2022, 38 : 3097 - 3108
  • [33] JGRCAN: A Visual Question Answering Co-Attention Network via Joint Grid-Region Features
    Liang, Jianpeng
    Xu, Tianjiao
    Chen, Shihong
    Ao, Zhuopan
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
  • [34] Graph-enhanced visual representations and question-guided dual attention for visual question answering
    Yusuf, Abdulganiyu Abdu
    Feng, Chong
    Mao, Xianling
    Haruna, Yunusa
    Li, Xinyan
    Duma, Ramadhani Ally
    NEUROCOMPUTING, 2025, 614
  • [35] CAT-ViL: Co-attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
    Bai, Long
    Islam, Mobarakol
    Ren, Hongliang
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT IX, 2023, 14228 : 397 - 407
  • [36] An Improved Attention and Hybrid Optimization Technique for Visual Question Answering
    Sharma, Himanshu
    Jalal, Anand Singh
    NEURAL PROCESSING LETTERS, 2022, 54 (01) : 709 - 730
  • [38] Visual Question Answering using Explicit Visual Attention
    Lioutas, Vasileios
    Passalis, Nikolaos
    Tefas, Anastasios
    2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2018,
  • [39] Differential Attention for Visual Question Answering
    Patro, Badri
    Namboodiri, Vinay P.
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7680 - 7688
  • [40] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792