Transformer Module Networks for Systematic Generalization in Visual Question Answering

Cited by: 0
Authors
Yamada, Moyuru [1,2]
D'Amario, Vanessa [3,4,5,6]
Takemoto, Kentaro [1]
Boix, Xavier [4,5,7]
Sasaki, Tomotake [1,5,8]
Affiliations
[1] Fujitsu Ltd, Kawasaki, Kanagawa 2138502, Japan
[2] Fujitsu Res India Pvt Ltd, Bangalore 560037, Karnataka, India
[3] Fujitsu Res Amer Inc, Sunnyvale, CA 94085 USA
[4] MIT, Cambridge, MA 02139 USA
[5] Ctr Brains Minds & Machines, Cambridge, MA 02139 USA
[6] Nova Southeastern Univ, Ft Lauderdale, FL 33328 USA
[7] Fujitsu Res Amer, Santa Clara, CA 95054 USA
[8] Japan Elect Coll, Tokyo 1698522, Japan
Keywords
Transformers; Systematics; Visualization; Libraries; Cognition; Question answering (information retrieval); Training; Neural module network; systematic generalization; transformer; visual question answering;
DOI
10.1109/TPAMI.2024.3438887
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Transformers achieve strong performance on Visual Question Answering (VQA). However, their capacity for systematic generalization, i.e., handling novel combinations of known concepts, is unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or comparable to conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only the module composition but also the specialization of each module for its sub-task is key to this performance gain.
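The abstract describes a TMN as a question-specific chain of Transformer modules, each specialized for one sub-task. Below is a minimal PyTorch sketch of that composition pattern; the module names, dimensions, assumed answer-vocabulary size, and the direct chaining of encoder blocks are illustrative assumptions for this record, not the authors' released implementation.

# Minimal sketch of a TMN-style composition of Transformer modules.
# All names and sizes here are hypothetical simplifications.
import torch
import torch.nn as nn

class TransformerModule(nn.Module):
    """One sub-task module: a shallow Transformer encoder that
    refines the visual token sequence."""
    def __init__(self, dim=256, heads=4, layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):
        return self.encoder(tokens)

class TMN(nn.Module):
    """Composes specialized modules following a per-question program,
    e.g. ['filter_color', 'relate', 'query_shape']."""
    def __init__(self, module_names, dim=256, num_answers=32):
        super().__init__()
        self.library = nn.ModuleDict({name: TransformerModule(dim) for name in module_names})
        self.classifier = nn.Linear(dim, num_answers)  # num_answers is an assumption

    def forward(self, visual_tokens, program):
        h = visual_tokens
        for name in program:              # chain modules in program order
            h = self.library[name](h)
        return self.classifier(h.mean(dim=1))  # pool tokens, predict an answer

# Usage: one image as 10 visual tokens, answered with a three-step program.
model = TMN(['filter_color', 'relate', 'query_shape'])
logits = model(torch.randn(1, 10, 256), ['filter_color', 'relate', 'query_shape'])

The key design point the paper argues for is visible even in this toy version: each sub-task gets its own specialized parameters (the ModuleDict entries), and the composition order is dictated by the question rather than fixed in the architecture.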
Pages: 10096-10105
Number of Pages: 10