Local self-attention in transformer for visual question answering

Cited: 33
Authors
Shen, Xiang [1 ]
Han, Dezhi [1 ]
Guo, Zihan [1 ]
Chen, Chongqing [1 ]
Hua, Jie [2 ]
Luo, Gaofeng [3 ]
Affiliations
[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China
[2] Univ Technol, TD Sch, Ultimo, NSW 2007, Australia
[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering;
DOI
10.1007/s10489-022-04355-w
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture for its strength in modeling global dependencies via self-attention. However, balancing global and local dependency modeling in the standard Transformer remains an open problem: a Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. To address this issue, this paper proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. By placing local windows over the visual features, LSAT simultaneously models intra-window and inter-window attention, capturing rich contextual information while avoiding the redundancy of global self-attention. Extensive experiments and ablation studies with grid visual features on the benchmark datasets VQA 2.0 and CLEVR show that, with an appropriate local window size, LSAT outperforms the baseline model on all metrics; its best test accuracies on VQA 2.0 and CLEVR are 71.94% and 98.72%, respectively. Source code is available at
Pages: 16706-16723
Number of Pages: 18
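
Below is a minimal PyTorch sketch of the windowed local self-attention idea the abstract describes: grid visual features are partitioned into fixed-size local windows, self-attention is computed inside each window (intra-window), and a second attention pass mixes information across window summaries (inter-window). Every name, shape, and design detail here (LocalWindowAttention, the mean-pooled window summaries, the broadcast residual) is an illustrative assumption, not the authors' published implementation.

```python
# Sketch of local (windowed) self-attention over grid visual features.
# Assumptions for illustration only: mean-pooled window summaries for
# the inter-window pass, and H, W divisible by the window size.
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    def __init__(self, dim: int, window: int, heads: int = 8):
        super().__init__()
        self.window = window
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of visual features.
        B, H, W, C = x.shape
        k = self.window
        # Partition the grid into non-overlapping k x k windows,
        # flattened to (B * num_windows, k*k, C).
        wins = (x.view(B, H // k, k, W // k, k, C)
                 .permute(0, 1, 3, 2, 4, 5)
                 .reshape(-1, k * k, C))
        # Intra-window attention: tokens attend only within their window.
        wins, _ = self.intra(wins, wins, wins)
        # Inter-window attention: one mean-pooled summary token per window
        # attends across all windows (one plausible way to share context).
        summ = wins.mean(dim=1).view(B, -1, C)   # (B, num_windows, C)
        summ, _ = self.inter(summ, summ, summ)
        # Broadcast each window's updated summary back to its tokens.
        wins = wins + summ.reshape(-1, 1, C)
        # Reverse the window partition to restore the (B, H, W, C) grid.
        return (wins.view(B, H // k, W // k, k, k, C)
                    .permute(0, 1, 3, 2, 4, 5)
                    .reshape(B, H, W, C))


if __name__ == "__main__":
    feats = torch.randn(2, 8, 8, 512)            # hypothetical 8x8 grid
    out = LocalWindowAttention(dim=512, window=4)(feats)
    print(out.shape)                             # torch.Size([2, 8, 8, 512])
```

A window of size 1 degenerates toward pointwise mixing, while a window covering the whole grid recovers ordinary global self-attention; the abstract's claim is that an intermediate window size balances the two.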