Local self-attention in transformer for visual question answering

Cited by: 33
Authors
Shen, Xiang [1 ]
Han, Dezhi [1 ]
Guo, Zihan [1 ]
Chen, Chongqing [1 ]
Hua, Jie [2 ]
Luo, Gaofeng [3 ]
Affiliations
[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China
[2] Univ Technol Sydney, TD Sch, Ultimo, NSW 2007, Australia
[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering;
DOI
10.1007/s10489-022-04355-w
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer structure for its strength in modeling global dependencies through self-attention. However, balancing global and local dependency modeling in the traditional Transformer structure remains an open issue: a Transformer-based VQA model that models only global dependencies cannot effectively capture image context information. To address these issues, this paper proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. By partitioning visual features into local windows, the LSAT model simultaneously models intra-window and inter-window attention, capturing rich contextual information while avoiding the redundant information of global self-attention. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results and ablation studies show that, with an appropriate local window size, the LSAT model outperforms the benchmark model on all metrics; its best test accuracies with grid visual features reach 71.94% on VQA 2.0 and 98.72% on CLEVR. Source code is available at
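The paper's implementation is not reproduced in this record; the following is a minimal PyTorch sketch of the windowed-attention idea the abstract describes. The 1-D window partition over the flattened grid, the mean-pooled window summaries used for inter-window attention, the residual fusion, and the class name LocalWindowAttention with its parameters are all illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Intra-window + inter-window self-attention over grid visual features.

    Sketch only: windows are taken over the flattened grid sequence, and
    inter-window attention runs on mean-pooled window summaries.
    """

    def __init__(self, dim=512, num_heads=8, win=16):
        super().__init__()
        self.win = win
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        B, N, C = x.shape                       # e.g. N = 64 grid cells
        assert N % self.win == 0, "sequence must split into whole windows"
        nw = N // self.win

        # Intra-window: attend only among the tokens of each local window.
        xw = x.reshape(B * nw, self.win, C)
        local, _ = self.intra(xw, xw, xw)
        local = local.reshape(B, N, C)

        # Inter-window: summarize each window, attend across the summaries,
        # then broadcast the window-level context back to every token.
        summary = local.reshape(B, nw, self.win, C).mean(dim=2)   # (B, nw, C)
        ctx, _ = self.inter(summary, summary, summary)
        ctx = ctx.unsqueeze(2).expand(B, nw, self.win, C).reshape(B, N, C)

        return x + local + ctx                  # residual fusion (assumed)

# Toy usage: a batch of 8x8 grid features with 16-token windows.
feats = torch.randn(2, 64, 512)
out = LocalWindowAttention(dim=512, num_heads=8, win=16)(feats)
print(out.shape)  # torch.Size([2, 64, 512])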
Pages: 16706-16723
Page count: 18
Related papers
50 records in total
  • [41] Dynamic Capsule Attention for Visual Question Answering
    Zhou, Yiyi
    Ji, Rongrong
    Su, Jinsong
    Sun, Xiaoshuai
    Chen, Weiqiu
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9324 - 9331
  • [42] How Self-Attention Improves Rare Class Performance in a Question-Answering Dialogue Agent
    Stiff, Adam
    Song, Qi
    Fosler-Lussier, Eric
    SIGDIAL 2020: 21ST ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2020), 2020, : 196 - 202
  • [43] Generative Attention Model with Adversarial Self-learning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 415 - 423
  • [44] RVT-Transformer: Residual Attention in Answerability Prediction on Visual Question Answering for Blind People
    Nguyen-Tran, Duy-Minh
    Le, Tung
    Pho, Khoa
    Nguyen, Minh Le
    Nguyen, Huy Tien
    ADVANCES IN COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2022, 2022, 1653 : 423 - 435
  • [45] Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration
    Nguyen, Ngoc Son
    Nguyen, Van Son
    Le, Tung
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119
  • [46] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [47] Light-Weight Vision Transformer with Parallel Local and Global Self-Attention
    Ebert, Nikolas
    Reichardt, Laurenz
    Stricker, Didier
    Wasenmueller, Oliver
    2023 IEEE 26TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS, ITSC, 2023, : 452 - 459
  • [48] PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
    Ebert, Nikolas
    Stricker, Didier
    Wasenmueller, Oliver
    SENSORS, 2023, 23 (07)
  • [49] Local-Global Self-Attention for Transformer-Based Object Tracking
    Chen, Langkun
    Gao, Long
    Jiang, Yan
    Li, Yunsong
    He, Gang
    Ning, Jifeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12316 - 12329
  • [50] Universal Graph Transformer Self-Attention Networks
    Nguyen, Dai Quoc
    Nguyen, Tu Dinh
    Phung, Dinh
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 193 - 196