Local self-attention in transformer for visual question answering

被引:33
|
作者
Shen, Xiang [1 ]
Han, Dezhi [1 ]
Guo, Zihan [1 ]
Chen, Chongqing [1 ]
Hua, Jie [2 ]
Luo, Gaofeng [3 ]
机构
[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China
[2] Univ Technol, TD Sch, Ultimo, NSW 2007, Australia
[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China
基金
上海市自然科学基金; 中国国家自然科学基金;
关键词
Transformer; Local self-attention; Grid; regional visual features; Visual question answering;
D O I
10.1007/s10489-022-04355-w
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have applied the Transformer structure due to its excellent ability to model self-attention global dependencies. However, balancing global and local dependency modeling in traditional Transformer structures is an ongoing issue. A Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. Thus, this paper proposes a novel Local Self-Attention in Transformer (LSAT) for a visual question answering model to address these issues. The LSAT model simultaneously models intra-window and inter-window attention by setting local windows for visual features. Therefore, the LSAT model can effectively avoid redundant information in global self-attention while capturing rich contextual information. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results show that the LSAT model outperforms the benchmark model in all indicators when the appropriate local window size is selected. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. Experimental results and ablation studies demonstrate that the proposed method has good performance. Source code is available at
引用
收藏
页码:16706 / 16723
页数:18
相关论文
共 50 条
  • [31] Visual Question Answering using Explicit Visual Attention
    Lioutas, Vasileios
    Passalis, Nikolaos
    Tefas, Anastasios
    2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2018,
  • [32] Global-Local Self-Attention Based Transformer for Speaker Verification
    Xie, Fei
    Zhang, Dalong
    Liu, Chengming
    APPLIED SCIENCES-BASEL, 2022, 12 (19):
  • [33] Relative molecule self-attention transformer
    Łukasz Maziarka
    Dawid Majchrowski
    Tomasz Danel
    Piotr Gaiński
    Jacek Tabor
    Igor Podolak
    Paweł Morkisz
    Stanisław Jastrzębski
    Journal of Cheminformatics, 16
  • [34] Scaling Local Self-Attention for Parameter Efficient Visual Backbones
    Vaswani, Ashish
    Ramachandran, Prajit
    Srinivas, Aravind
    Parmar, Niki
    Hechtman, Blake
    Shlens, Jonathon
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12889 - 12899
  • [35] Relative molecule self-attention transformer
    Maziarka, Lukasz
    Majchrowski, Dawid
    Danel, Tomasz
    Gainski, Piotr
    Tabor, Jacek
    Podolak, Igor
    Morkisz, Pawel
    Jastrzebski, Stanislaw
    JOURNAL OF CHEMINFORMATICS, 2024, 16 (01)
  • [36] Guiding Visual Question Answering with Attention Priors
    Le, Thao Minh
    Le, Vuong
    Gupta, Sunil
    Venkatesh, Svetha
    Tran, Truyen
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 4370 - 4379
  • [37] Re-Attention for Visual Question Answering
    Guo, Wenya
    Zhang, Ying
    Yang, Jufeng
    Yuan, Xiaojie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 6730 - 6743
  • [38] Re-Attention for Visual Question Answering
    Guo, Wenya
    Zhang, Ying
    Wu, Xiaoping
    Yang, Jufeng
    Cai, Xiangrui
    Yuan, Xiaojie
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 91 - 98
  • [39] Feature Enhancement in Attention for Visual Question Answering
    Lin, Yuetan
    Pang, Zhangyang
    Wang, Donghui
    Zhuang, Yueting
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4216 - 4222
  • [40] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416