Transformer Module Networks for Systematic Generalization in Visual Question Answering

Cited by: 0
Authors
Yamada, Moyuru [1 ,2 ]
D'Amario, Vanessa [3 ,4 ,5 ,6 ]
Takemoto, Kentaro [1 ]
Boix, Xavier [4 ,5 ,7 ]
Sasaki, Tomotake [1 ,5 ,8 ]
Affiliations
[1] Fujitsu Ltd, Kawasaki, Kanagawa 2138502, Japan
[2] Fujitsu Res India Pvt Ltd, Bangalore 560037, Karnataka, India
[3] Fujitsu Res Amer Inc, Sunnyvale, CA 94085 USA
[4] MIT, Cambridge, MA 02139 USA
[5] Ctr Brains Minds & Machines, Cambridge, MA 02139 USA
[6] Nova Southeastern Univ, Ft Lauderdale, FL 33328 USA
[7] Fujitsu Res Amer, Santa Clara, CA 95054 USA
[8] Japan Elect Coll, Tokyo 1698522, Japan
Keywords
Transformers; Systematics; Visualization; Libraries; Cognition; Question answering (information retrieval); Training; Neural module network; systematic generalization; transformer; visual question answering
DOI
10.1109/TPAMI.2024.3438887
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers achieve strong performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., their ability to handle novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task is key to this performance gain.
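The idea of a "question-specific composition of modules that each tackle a sub-task" can be illustrated with a toy sketch. This is not the paper's implementation: in a TMN the modules are Transformer-based and operate on learned visual features, whereas here each module is a plain Python function over a hand-written scene. The names `MODULE_LIBRARY`, `make_filter`, and `run_program` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

# Toy stand-in for visual features: one dict of attributes per object.
Scene = List[dict]


def make_filter(attribute: str, value: str) -> Callable[[Scene], Scene]:
    """Build a module specialized for one sub-task (e.g. keep red objects)."""
    def module(objects: Scene) -> Scene:
        return [obj for obj in objects if obj.get(attribute) == value]
    return module


# Hypothetical module library: one specialized module per sub-task.
MODULE_LIBRARY: Dict[str, Callable[[Scene], Scene]] = {
    "filter_red": make_filter("color", "red"),
    "filter_blue": make_filter("color", "blue"),
    "filter_cube": make_filter("shape", "cube"),
    "filter_sphere": make_filter("shape", "sphere"),
}


def run_program(program: List[str], scene: Scene) -> int:
    """Apply the modules named by a question-specific program in order,
    then count the objects that remain."""
    state = scene
    for name in program:
        state = MODULE_LIBRARY[name](state)
    return len(state)


scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

# "How many blue cubes are there?" mapped to a two-module program.
answer = run_program(["filter_blue", "filter_cube"], scene)
print(answer)
```

Because each module handles a single sub-task, a combination that never appeared together before (say, `filter_red` followed by `filter_sphere`) reuses the same modules without modification, which is the intuition behind testing systematic generalization on novel compositions of sub-tasks.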
Pages: 10096 - 10105
Page count: 10
Related papers
50 in total
  • [1] Self-Adaptive Neural Module Transformer for Visual Question Answering
    Zhong, Huasong
    Chen, Jingyuan
    Shen, Chen
    Zhang, Hanwang
    Huang, Jianqiang
    Hua, Xian-Sheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1264 - 1273
  • [2] Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
    Saqur, Raeid
    Narasimhan, Karthik
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [3] Graph neural networks for visual question answering: a systematic review
    Abdulganiyu Abdu Yusuf
    Chong Feng
    Xianling Mao
    Ramadhani Ally Duma
    Mohammed Salah Abood
    Abdulrahman Hamman Adama Chukkol
    Multimedia Tools and Applications, 2024, 83 : 55471 - 55508
  • [4] Graph neural networks for visual question answering: a systematic review
    Yusuf, Abdulganiyu Abdu
    Feng, Chong
    Mao, Xianling
    Ally Duma, Ramadhani
    Abood, Mohammed Salah
    Chukkol, Abdulrahman Hamman Adama
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (18) : 55471 - 55508
  • [5] Learning to Reason: End-to-End Module Networks for Visual Question Answering
    Hu, Ronghang
    Andreas, Jacob
    Rohrbach, Marcus
    Darrell, Trevor
    Saenko, Kate
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 804 - 813
  • [6] Unshuffling Data for Improved Generalization in Visual Question Answering
    Teney, Damien
    Abbasnejad, Ehsan
    van den Hengel, Anton
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1397 - 1407
  • [7] Differential Networks for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Li, Ruifan
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8997 - 9004
  • [8] Advancing Vietnamese Visual Question Answering with Transformer and Convolutional
    Nguyen, Ngoc Son
    Nguyen, Van Son
    Le, Tung
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119
  • [9] Bilinear Graph Networks for Visual Question Answering
    Guo, Dalu
    Xu, Chang
    Tao, Dacheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (02) : 1023 - 1034
  • [10] Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
    Yu, Zhou
    Jin, Zitian
    Yu, Jun
    Xu, Mingliang
    Wang, Hongbo
    Fan, Jianping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9543 - 9556