Local self-attention in transformer for visual question answering

Cited by: 33

Authors:

Shen, Xiang [1]

Han, Dezhi [1]

Guo, Zihan [1]

Chen, Chongqing [1]

Hua, Jie [2]

Luo, Gaofeng [3]

Affiliations:

[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China

[2] Univ Technol, TD Sch, Ultimo, NSW 2007, Australia

[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China

Source:

APPLIED INTELLIGENCE | 2023 / Volume 53 / Issue 13

Funding:

Natural Science Foundation of Shanghai; National Natural Science Foundation of China;

Keywords:

Transformer; Local self-attention; Grid; regional visual features; Visual question answering;

DOI:

10.1007/s10489-022-04355-w

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have applied the Transformer structure due to its excellent ability to model self-attention global dependencies. However, balancing global and local dependency modeling in traditional Transformer structures is an ongoing issue. A Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. Thus, this paper proposes a novel Local Self-Attention in Transformer (LSAT) for a visual question answering model to address these issues. The LSAT model simultaneously models intra-window and inter-window attention by setting local windows for visual features. Therefore, the LSAT model can effectively avoid redundant information in global self-attention while capturing rich contextual information. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results show that the LSAT model outperforms the benchmark model in all indicators when the appropriate local window size is selected. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. Experimental results and ablation studies demonstrate that the proposed method has good performance. Source code is available at
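To illustrate the local-window idea the abstract describes, the following is a minimal NumPy sketch, not the authors' LSAT implementation. It assumes an (H, W, D) grid of visual features, non-overlapping square windows, and attention without learned projections. The inter-window step, which attends over mean-pooled window summaries, is an assumption made for illustration only, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention with identity projections
    # (no learned weights); works on (T, D) or (N, T, D) inputs.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def window_partition(grid, w):
    # Split an (H, W, D) grid into (num_windows, w*w, D)
    # non-overlapping windows of side length w.
    H, W, D = grid.shape
    assert H % w == 0 and W % w == 0, "grid must be divisible by window size"
    g = grid.reshape(H // w, w, W // w, w, D)
    g = g.transpose(0, 2, 1, 3, 4)
    return g.reshape(-1, w * w, D)

def local_window_attention(grid, w):
    windows = window_partition(grid, w)   # (N, w*w, D)
    intra = self_attention(windows)       # attention inside each window
    summaries = intra.mean(axis=1)        # (N, D): one token per window (assumption)
    inter = self_attention(summaries)     # attention across window summaries
    return intra + inter[:, None, :]      # add window-level context to each token

grid = np.random.default_rng(0).normal(size=(8, 8, 16))
out = local_window_attention(grid, w=4)
print(out.shape)
```

With an 8 x 8 grid and a window size of 4, each token attends to 16 positions inside its window rather than all 64 grid positions, which is the kind of restriction the abstract credits with avoiding redundant information in global self-attention.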


Pages: 16706-16723

Page count: 18

50 items in total

[31] Visual Question Answering using Explicit Visual Attention
Lioutas, Vasileios
Passalis, Nikolaos
Tefas, Anastasios
2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2018
[32] Global-Local Self-Attention Based Transformer for Speaker Verification
Xie, Fei
Zhang, Dalong
Liu, Chengming
APPLIED SCIENCES-BASEL, 2022, 12 (19)
[33] Relative molecule self-attention transformer
Łukasz Maziarka
Dawid Majchrowski
Tomasz Danel
Piotr Gaiński
Jacek Tabor
Igor Podolak
Paweł Morkisz
Stanisław Jastrzębski
Journal of Cheminformatics, 16
[34] Scaling Local Self-Attention for Parameter Efficient Visual Backbones
Vaswani, Ashish
Ramachandran, Prajit
Srinivas, Aravind
Parmar, Niki
Hechtman, Blake
Shlens, Jonathon
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12889 - 12899
[35] Relative molecule self-attention transformer
Maziarka, Lukasz
Majchrowski, Dawid
Danel, Tomasz
Gainski, Piotr
Tabor, Jacek
Podolak, Igor
Morkisz, Pawel
Jastrzebski, Stanislaw
JOURNAL OF CHEMINFORMATICS, 2024, 16 (01)
[36] Guiding Visual Question Answering with Attention Priors
Le, Thao Minh
Le, Vuong
Gupta, Sunil
Venkatesh, Svetha
Tran, Truyen
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 4370 - 4379
[37] Re-Attention for Visual Question Answering
Guo, Wenya
Zhang, Ying
Yang, Jufeng
Yuan, Xiaojie
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 6730 - 6743
[38] Re-Attention for Visual Question Answering
Guo, Wenya
Zhang, Ying
Wu, Xiaoping
Yang, Jufeng
Cai, Xiangrui
Yuan, Xiaojie
THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 91 - 98
[39] Feature Enhancement in Attention for Visual Question Answering
Lin, Yuetan
Pang, Zhangyang
Wang, Donghui
Zhuang, Yueting
PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4216 - 4222
[40] Feature Fusion Attention Visual Question Answering
Wang, Chunlin
Sun, Jianyong
Chen, Xiaolin
ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
