Transformer Module Networks for Systematic Generalization in Visual Question Answering

Cited by: 0
Authors
Yamada, Moyuru [1 ,2 ]
D'Amario, Vanessa [3 ,4 ,5 ,6 ]
Takemoto, Kentaro [1 ]
Boix, Xavier [4 ,5 ,7 ]
Sasaki, Tomotake [1 ,5 ,8 ]
Affiliations
[1] Fujitsu Ltd, Kawasaki, Kanagawa 2138502, Japan
[2] Fujitsu Res India Pvt Ltd, Bangalore 560037, Karnataka, India
[3] Fujitsu Res Amer Inc, Sunnyvale, CA 94085 USA
[4] MIT, Cambridge, MA 02139 USA
[5] Ctr Brains Minds & Machines, Cambridge, MA 02139 USA
[6] Nova Southeastern Univ, Ft Lauderdale, FL 33328 USA
[7] Fujitsu Res Amer, Santa Clara, CA 95054 USA
[8] Japan Elect Coll, Tokyo 1698522, Japan
Keywords
Transformers; Systematics; Visualization; Libraries; Cognition; Question answering (information retrieval); Training; Neural module network; systematic generalization; transformer; visual question answering;
DOI
10.1109/TPAMI.2024.3438887
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers relative to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task is key to this performance gain.
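The abstract's core idea, composing per-sub-task specialized modules according to a question-specific program, can be sketched in miniature. The sketch below is purely illustrative and not from the paper: the module names (`filter_color`, `filter_shape`, `count`), the toy scene representation, and the program format are all assumptions; in the actual TMN, each module is a Transformer block rather than a hand-written function.

```python
# Illustrative sketch of the compositional execution behind Neural Module
# Networks / Transformer Module Networks: each sub-task gets its own
# specialized module, and a question is answered by chaining the modules
# named in that question's program. All names here are hypothetical.

from typing import Callable, Dict, List, Tuple

Scene = List[dict]  # toy "image": a list of object attribute dicts


def filter_color(scene: Scene, color: str) -> Scene:
    """Specialized module: keep only objects of the given color."""
    return [obj for obj in scene if obj["color"] == color]


def filter_shape(scene: Scene, shape: str) -> Scene:
    """Specialized module: keep only objects of the given shape."""
    return [obj for obj in scene if obj["shape"] == shape]


def count(scene: Scene, _arg: str = "") -> int:
    """Terminal module: answer with the number of remaining objects."""
    return len(scene)


MODULE_LIBRARY: Dict[str, Callable] = {
    "filter_color": filter_color,
    "filter_shape": filter_shape,
    "count": count,
}


def execute(program: List[Tuple[str, str]], scene: Scene):
    """Compose modules in the order given by the question's program."""
    state = scene
    for name, arg in program:
        state = MODULE_LIBRARY[name](state, arg)
    return state


scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]
# "How many red cubes are there?" -> a program of sub-tasks
program = [("filter_color", "red"), ("filter_shape", "cube"), ("count", "")]
print(execute(program, scene))  # -> 1
```

Systematic generalization is then the ability to answer a novel program, e.g. `[("filter_shape", "sphere"), ("count", "")]`, built from sub-tasks seen only in other combinations during training; the paper's finding is that giving each sub-task its own Transformer module preserves this ability far better than a single monolithic Transformer.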
Pages: 10096-10105 (10 pages)