Transformer Module Networks for Systematic Generalization in Visual Question Answering

Cited by: 0

Authors:

Yamada, Moyuru ^{[1,2]}

D'Amario, Vanessa ^{[3,4,5,6]}

Takemoto, Kentaro ^{[1]}

Boix, Xavier ^{[4,5,7]}

Sasaki, Tomotake ^{[1,5,8]}

Affiliations:

[1] Fujitsu Ltd, Kawasaki, Kanagawa 2138502, Japan

[2] Fujitsu Res India Pvt Ltd, Bangalore 560037, Karnataka, India

[3] Fujitsu Res Amer Inc, Sunnyvale, CA 94085 USA

[4] MIT, Cambridge, MA 02139 USA

[5] Ctr Brains Minds & Machines, Cambridge, MA 02139 USA

[6] Nova Southeastern Univ, Ft Lauderdale, FL 33328 USA

[7] Fujitsu Res Amer, Santa Clara, CA 95054 USA

[8] Japan Elect Coll, Tokyo 1698522, Japan

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2024 / Vol. 46 / Issue 12

Keywords:

Transformers; Systematics; Visualization; Libraries; Cognition; Question answering (information retrieval); Training; Neural module network; systematic generalization; transformer; visual question answering;

DOI:

10.1109/TPAMI.2024.3438887

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Transformers achieve strong performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task is key to this performance gain.

Pages: 10096-10105

Number of pages: 10

50 results in total

[21] CAT: Re-Conv Attention in Transformer for Visual Question Answering
Zhang, Haotian
Wu, Wei
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1471 - 1477
[22] Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering
Bogin, Ben
Subramanian, Sanjay
Gardner, Matt
Berant, Jonathan
TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2021, 9 : 195 - 210
[23] Visual Question Answering
Nada, Ahmed
Chen, Min
2024 INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKING AND COMMUNICATIONS, ICNC, 2024, : 6 - 10
[24] MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
Gokhale, Tejas
Banerjee, Pratyay
Baral, Chitta
Yang, Yezhou
PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 878 - 892
[25] Question Modifiers in Visual Question Answering
Britton, William
Sarkhel, Somdeb
Venugopal, Deepak
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1472 - 1479
[26] Surgical-VQA: Visual Question Answering in Surgical Scenes Using Transformer
Seenivasan, Lalithkumar
Islam, Mobarakol
Krishna, Adithya K.
Ren, Hongliang
MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT VII, 2022, 13437 : 33 - 43
[27] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
Siebert, Tim
Clasen, Kai Norman
Ravanbakhsh, Mahdyar
Demir, Begüm
IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
[28] ST-VQA: shrinkage transformer with accurate alignment for visual question answering
Xia, Haiying
Lan, Richeng
Li, Haisheng
Song, Shuxiang
APPLIED INTELLIGENCE, 2023, 53 (18) : 20967 - 20978
[29] ST-VQA: shrinkage transformer with accurate alignment for visual question answering
Haiying Xia
Richeng Lan
Haisheng Li
Shuxiang Song
Applied Intelligence, 2023, 53 : 20967 - 20978
[30] Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering
Zhang, Haotian
Wu, Wei
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
