Local self-attention in transformer for visual question answering

Cited by: 33
Authors
Shen, Xiang [1 ]
Han, Dezhi [1 ]
Guo, Zihan [1 ]
Chen, Chongqing [1 ]
Hua, Jie [2 ]
Luo, Gaofeng [3 ]
Affiliations
[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China
[2] Univ Technol Sydney, TD Sch, Ultimo, NSW 2007, Australia
[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering;
DOI
10.1007/s10489-022-04355-w
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer structure for its strength in modeling global dependencies through self-attention. However, balancing global and local dependency modeling in the traditional Transformer structure remains an open issue: a Transformer-based VQA model that models only global dependencies cannot effectively capture image context information. To address these issues, this paper proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. By partitioning visual features into local windows, the LSAT model simultaneously models intra-window and inter-window attention, capturing rich contextual information while avoiding the redundant information of global self-attention. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results and ablation studies show that, with an appropriate local window size, the LSAT model outperforms the benchmark model on all metrics; its best test accuracies with grid visual features reach 71.94% on VQA 2.0 and 98.72% on CLEVR. Source code is available at
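The paper's implementation is not reproduced in this record; the following is a minimal PyTorch sketch of the windowed-attention idea the abstract describes. The 1-D window partition over the flattened grid, the mean-pooled window summaries used for inter-window attention, the residual fusion, and the class name LocalWindowAttention with its parameters are all illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Intra-window + inter-window self-attention over grid visual features.

    Sketch only: windows are taken over the flattened grid sequence, and
    inter-window attention runs on mean-pooled window summaries.
    """

    def __init__(self, dim=512, num_heads=8, win=16):
        super().__init__()
        self.win = win
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        B, N, C = x.shape                       # e.g. N = 64 grid cells
        assert N % self.win == 0, "sequence must split into whole windows"
        nw = N // self.win

        # Intra-window: attend only among the tokens of each local window.
        xw = x.reshape(B * nw, self.win, C)
        local, _ = self.intra(xw, xw, xw)
        local = local.reshape(B, N, C)

        # Inter-window: summarize each window, attend across the summaries,
        # then broadcast the window-level context back to every token.
        summary = local.reshape(B, nw, self.win, C).mean(dim=2)   # (B, nw, C)
        ctx, _ = self.inter(summary, summary, summary)
        ctx = ctx.unsqueeze(2).expand(B, nw, self.win, C).reshape(B, N, C)

        return x + local + ctx                  # residual fusion (assumed)

# Toy usage: a batch of 8x8 grid features with 16-token windows.
feats = torch.randn(2, 64, 512)
out = LocalWindowAttention(dim=512, num_heads=8, win=16)(feats)
print(out.shape)  # torch.Size([2, 64, 512])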
Pages: 16706-16723
Page count: 18
Related papers
50 records in total
  • [41] Dynamic Capsule Attention for Visual Question Answering
    Zhou, Yiyi
    Ji, Rongrong
    Su, Jinsong
    Sun, Xiaoshuai
    Chen, Weiqiu
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9324 - 9331
  • [42] How Self-Attention Improves Rare Class Performance in a Question-Answering Dialogue Agent
    Stiff, Adam
    Song, Qi
    Fosler-Lussier, Eric
    SIGDIAL 2020: 21ST ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2020), 2020, : 196 - 202
  • [43] Generative Attention Model with Adversarial Self-learning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 415 - 423
  • [44] RVT-Transformer: Residual Attention in Answerability Prediction on Visual Question Answering for Blind People
    Nguyen-Tran, Duy-Minh
    Le, Tung
    Pho, Khoa
    Nguyen, Minh Le
    Nguyen, Huy Tien
    ADVANCES IN COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2022, 2022, 1653 : 423 - 435
  • [45] Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration
    Nguyen, Ngoc Son
    Nguyen, Van Son
    Le, Tung
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119
  • [46] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [47] Light-Weight Vision Transformer with Parallel Local and Global Self-Attention
    Ebert, Nikolas
    Reichardt, Laurenz
    Stricker, Didier
    Wasenmueller, Oliver
    2023 IEEE 26TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS, ITSC, 2023, : 452 - 459
  • [48] PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
    Ebert, Nikolas
    Stricker, Didier
    Wasenmueller, Oliver
    SENSORS, 2023, 23 (07)
  • [49] Local-Global Self-Attention for Transformer-Based Object Tracking
    Chen, Langkun
    Gao, Long
    Jiang, Yan
    Li, Yunsong
    He, Gang
    Ning, Jifeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12316 - 12329
  • [50] Universal Graph Transformer Self-Attention Networks
    Nguyen, Dai Quoc
    Nguyen, Tu Dinh
    Phung, Dinh
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 193 - 196