Multimodal Local Perception Bilinear Pooling for Visual Question Answering

Cited by: 8
Authors
Lao, Mingrui [1]
Guo, Yanming [1]
Wang, Hui [1]
Zhang, Xin [1]
Affiliation
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410073, Hunan, Peoples R China
Source
IEEE ACCESS | 2018 / Vol. 6
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; bilinear pooling; local perception; parameter-sharing mechanism;
DOI
10.1109/ACCESS.2018.2873570
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Visual question answering is a challenging multimodal task that has received increasing attention in recent years. A key problem in visual question answering is how to fuse the visual and textual features extracted from the image and the question, so that information from both modalities can be used comprehensively to deliver correct answers. Bilinear pooling is a powerful fusion approach owing to its exhaustive interaction between every pair of elements of the two modalities, but its large number of parameters limits its practical application. In this paper, we aim to retain the advantages of bilinear pooling for feature interaction and propose a novel multimodal feature fusion approach named multimodal local perception bilinear (MLPB) pooling, which retains the second-order interactions between visual and textual features with a limited number of learnable parameters. Specifically, MLPB uses a local perception mechanism, which transforms the bilinear pooling between two high-dimensional raw features into bilinear pooling between multiple low-dimensional part features. To further reduce the computational cost, we propose to share the learnable parameters across the local bilinear pooling operations. In this way, MLPB can achieve the complex interactions of bilinear pooling without consuming excessive computational resources. Extensive experiments show that the proposed method achieves performance competitive with or better than the state of the art.
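The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the features are partitioned, which parts are paired, or how the local outputs are combined. The sketch assumes contiguous equal-sized parts, interaction between every visual part and every textual part, and concatenation of the local outputs. The names `local_shared_bilinear` and `full_bilinear_param_count` are hypothetical.

```python
import numpy as np


def full_bilinear_param_count(dx, dq, out_dim):
    """Weights in a full bilinear map: one dx-by-dq matrix per output unit."""
    return dx * dq * out_dim


def local_shared_bilinear(x, q, W):
    """Local bilinear pooling with one weight tensor shared by all part pairs.

    x: visual feature, shape (dx,)
    q: textual feature, shape (dq,)
    W: shared weights, shape (out_dim, px, pq), where px divides dx and pq divides dq
    Assumption (not from the abstract): parts are contiguous slices, every
    visual part is paired with every textual part, and the local outputs
    are concatenated.
    """
    out_dim, px, pq = W.shape
    if x.size % px or q.size % pq:
        raise ValueError("part sizes must divide the feature dimensions")
    x_parts = x.reshape(-1, px)  # (nx, px) low-dimensional visual parts
    q_parts = q.reshape(-1, pq)  # (nq, pq) low-dimensional textual parts
    # z[i, j, o] = x_parts[i] @ W[o] @ q_parts[j], using the same W for every (i, j)
    z = np.einsum("ip,opr,jr->ijo", x_parts, W, q_parts)
    return z.reshape(-1)  # concatenated local outputs, length nx * nq * out_dim


rng = np.random.default_rng(0)
x = rng.standard_normal(8)            # toy visual feature
q = rng.standard_normal(6)            # toy textual feature
W = rng.standard_normal((2, 4, 3))    # shared local weights: 24 parameters
z = local_shared_bilinear(x, q, W)
print(z.shape)                        # (8,)
print(W.size, full_bilinear_param_count(8, 6, z.size))  # 24 384
```

In this toy setting the shared local weights hold 24 parameters, whereas a full bilinear map from the same inputs to an 8-dimensional output would need 384, which is the kind of saving the parameter-sharing mechanism is meant to deliver.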
Pages: 57923 - 57932
Number of pages: 10
Related papers
50 in total
  • [1] Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Xiang, Chenchao
    Fan, Jianping
    Tao, Dacheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2018, 29 (12) : 5947 - 5959
  • [2] Bilinear Graph Networks for Visual Question Answering
    Guo, Dalu
    Xu, Chang
    Tao, Dacheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (02) : 1023 - 1034
  • [3] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [4] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
  • [5] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [6] Faithful Multimodal Explanation for Visual Question Answering
    Wu, Jialin
    Mooney, Raymond J.
    BLACKBOXNLP WORKSHOP ON ANALYZING AND INTERPRETING NEURAL NETWORKS FOR NLP AT ACL 2019, 2019, : 103 - 112
  • [7] Deep Modular Bilinear Attention Network for Visual Question Answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbing
    SENSORS, 2022, 22 (03)
  • [8] MUTAN: Multimodal Tucker Fusion for Visual Question Answering
    Ben-younes, Hedi
    Cadene, Remi
    Cord, Matthieu
    Thome, Nicolas
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2631 - 2639
  • [9] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
    Hussain, Afzaal
    Maqsood, Ifrah
    Shahzad, Muhammad
    Fraz, Muhammad Moazam
    2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
  • [10] MUREL: Multimodal Relational Reasoning for Visual Question Answering
    Cadene, Remi
    Ben-younes, Hedi
    Cord, Matthieu
    Thome, Nicolas
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998