Local self-attention in transformer for visual question answering

Cited by: 33

Authors:

Shen, Xiang [1]

Han, Dezhi [1]

Guo, Zihan [1]

Chen, Chongqing [1]

Hua, Jie [2]

Luo, Gaofeng [3]

Affiliations:

[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China

[2] Univ Technol, TD Sch, Ultimo, NSW 2007, Australia

[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China

Source:

APPLIED INTELLIGENCE | 2023 / Vol. 53 / Issue 13

Funding:

Natural Science Foundation of Shanghai; National Natural Science Foundation of China;

Keywords:

Transformer; Local self-attention; Grid; regional visual features; Visual question answering;

DOI:

10.1007/s10489-022-04355-w

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have applied the Transformer structure due to its excellent ability to model self-attention global dependencies. However, balancing global and local dependency modeling in traditional Transformer structures is an ongoing issue. A Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. Thus, this paper proposes a novel Local Self-Attention in Transformer (LSAT) for a visual question answering model to address these issues. The LSAT model simultaneously models intra-window and inter-window attention by setting local windows for visual features. Therefore, the LSAT model can effectively avoid redundant information in global self-attention while capturing rich contextual information. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results show that the LSAT model outperforms the benchmark model in all indicators when the appropriate local window size is selected. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. Experimental results and ablation studies demonstrate that the proposed method has good performance. Source code is available at
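The abstract describes restricting self-attention to local windows over grid visual features. As a rough illustration only, the sketch below computes intra-window attention on an (H, W, D) feature grid. It is not the authors' LSAT implementation: the non-overlapping square windows, the identity query/key/value projections, the single attention head, and the function names `window_partition` and `local_self_attention` are all assumptions made for this example, and the inter-window attention mentioned in the abstract is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(feats, window):
    """Split an (H, W, D) grid into (num_windows, window*window, D).

    Assumes non-overlapping square windows (an assumption of this sketch).
    """
    H, W, D = feats.shape
    assert H % window == 0 and W % window == 0, "grid must divide evenly"
    x = feats.reshape(H // window, window, W // window, window, D)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, w, w, D)
    return x.reshape(-1, window * window, D)

def local_self_attention(feats, window):
    """Scaled dot-product self-attention restricted to each local window.

    Queries, keys, and values are the raw features (no learned projections),
    which keeps the sketch short; a real model would learn them.
    """
    H, W, D = feats.shape
    wins = window_partition(feats, window)                 # (N, w*w, D)
    scores = wins @ wins.transpose(0, 2, 1) / np.sqrt(D)   # (N, w*w, w*w)
    out = softmax(scores, axis=-1) @ wins                  # (N, w*w, D)
    # Undo the partition to recover the (H, W, D) grid layout.
    out = out.reshape(H // window, W // window, window, window, D)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.normal(size=(8, 8, 16))   # hypothetical 8x8 grid of 16-d features
    attended = local_self_attention(grid, window=4)
    print(attended.shape)
```

Under these assumptions, each grid cell attends only to the cells in its own window, so the attention matrix per window has size (w*w) x (w*w) rather than (H*W) x (H*W). Setting `window` equal to the full grid size recovers ordinary global self-attention.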


Pages: 16706 - 16723

Page count: 18
