Transformer Module Networks for Systematic Generalization in Visual Question Answering

Cited: 0
Authors
Yamada, Moyuru [1 ,2 ]
D'Amario, Vanessa [3 ,4 ,5 ,6 ]
Takemoto, Kentaro [1 ]
Boix, Xavier [4 ,5 ,7 ]
Sasaki, Tomotake [1 ,5 ,8 ]
Affiliations
[1] Fujitsu Ltd, Kawasaki, Kanagawa 2138502, Japan
[2] Fujitsu Res India Pvt Ltd, Bangalore 560037, Karnataka, India
[3] Fujitsu Res Amer Inc, Sunnyvale, CA 94085 USA
[4] MIT, Cambridge, MA 02139 USA
[5] Ctr Brains Minds & Machines, Cambridge, MA 02139 USA
[6] Nova Southeastern Univ, Ft Lauderdale, FL 33328 USA
[7] Fujitsu Res Amer, Santa Clara, CA 95054 USA
[8] Japan Elect Coll, Tokyo 1698522, Japan
Keywords
Transformers; Systematics; Visualization; Libraries; Cognition; Question answering (information retrieval); Training; Neural module network; systematic generalization; transformer; visual question answering;
DOI
10.1109/TPAMI.2024.3438887
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., their ability to handle novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance that is better than or comparable to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers relative to NMNs, we investigate in this paper whether and how modularity can benefit Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only the module composition but also the specialization of each module for its sub-task is key to this performance gain.
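The compositional idea behind TMNs can be sketched in a few lines: maintain a library of modules, each specialized for one sub-task, and chain the modules named by a question-specific program. This is a minimal plain-Python illustration of that composition pattern only; the module names, the toy "program" format, and the string-based state are illustrative assumptions (in the actual model, each module is a Transformer block operating on image features).

```python
from typing import Callable, Dict, List

# Each module maps an intermediate state to a new state. In the real model
# these would be Transformer modules attending over visual features; here a
# string stands in for the intermediate representation.
ModuleFn = Callable[[str], str]

# Library of specialized modules, one per sub-task (names are hypothetical).
MODULE_LIBRARY: Dict[str, ModuleFn] = {
    "filter_red":  lambda state: state + " -> red objects",
    "filter_cube": lambda state: state + " -> cubes",
    "count":       lambda state: state + " -> count",
}

def compose(program: List[str]) -> ModuleFn:
    """Chain the modules named in `program` into one question-specific network."""
    def network(state: str) -> str:
        for name in program:
            state = MODULE_LIBRARY[name](state)
        return state
    return network

# "How many red cubes?" decomposed into sub-tasks:
network = compose(["filter_red", "filter_cube", "count"])
print(network("image"))  # image -> red objects -> cubes -> count
```

A novel combination of known sub-tasks (e.g. a program order never seen in training) is handled by the same library, which is the systematic-generalization setting the abstract describes.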
Pages: 10096 - 10105
Page count: 10