Local self-attention in transformer for visual question answering

Cited by: 33
Authors
Shen, Xiang [1]
Han, Dezhi [1]
Guo, Zihan [1]
Chen, Chongqing [1]
Hua, Jie [2]
Luo, Gaofeng [3]
Affiliations
[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China
[2] Univ Technol, TD Sch, Ultimo, NSW 2007, Australia
[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering;
DOI
10.1007/s10489-022-04355-w
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have adopted the Transformer structure because of its excellent ability to model global dependencies through self-attention. However, balancing global and local dependency modeling in traditional Transformer structures remains an open issue: a Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. To address this, the paper proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. By setting local windows for visual features, the LSAT model simultaneously models intra-window and inter-window attention. As a result, it can effectively avoid redundant information in global self-attention while capturing rich contextual information. The paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results show that the LSAT model outperforms the benchmark model on all indicators when an appropriate local window size is selected. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. Experimental results and ablation studies demonstrate that the proposed method has good performance. Source code is available at
Pages: 16706-16723
Page count: 18
Related papers
50 records in total
  • [1] Local self-attention in transformer for visual question answering
    Xiang Shen
    Dezhi Han
    Zihan Guo
    Chongqing Chen
    Jie Hua
    Gaofeng Luo
    Applied Intelligence, 2023, 53 : 16706 - 16723
  • [2] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 207 - 211
  • [3] ASAM: Asynchronous Self-Attention Model for Visual Question Answering
    Liu, Han
    Han, Dezhi
    Zhang, Shukai
    Shi, Jingya
    Wu, Huafeng
    Zhou, Yachao
    Li, Kuan-Ching
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2025, 22 (01)
  • [4] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
    PATTERN RECOGNITION, 2021, 117 (117)
  • [5] Intra-Modality Feature Interaction Using Self-attention for Visual Question Answering
    Shao, Huan
    Xu, Yunlong
    Ji, Yi
    Yang, Jianyu
    Liu, Chunping
    NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 215 - 222
  • [6] A novel self-attention enriching mechanism for biomedical question answering
    Kaddari, Zakaria
    Bouchentouf, Toumi
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 225
  • [7] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
    Zhou, Yiyi
    Ren, Tianhe
    Zhu, Chaoyang
    Sun, Xiaoshuai
    Liu, Jianzhuang
    Ding, Xinghao
    Xu, Mingliang
    Ji, Rongrong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2054 - 2064
  • [8] Multi-page Document Visual Question Answering Using Self-attention Scoring Mechanism
    Kang, Lei
    Tito, Ruben
    Valveny, Ernest
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT VI, 2024, 14809 : 219 - 232
  • [9] SAFFNet: self-attention based on Fourier frequency domain filter network for visual question answering
    Shi, Jingya
    Han, Dezhi
    Chen, Chongqing
    Shen, Xiang
    VISUAL COMPUTER, 2025,
  • [10] Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,