Multimodal Local Perception Bilinear Pooling for Visual Question Answering

Cited by: 8
Authors
Lao, Mingrui [1]
Guo, Yanming [1]
Wang, Hui [1]
Zhang, Xin [1]
Affiliation
[1] National University of Defense Technology, College of Systems Engineering, Changsha 410073, Hunan, People's Republic of China
Source
IEEE ACCESS | 2018 / Vol. 6
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; bilinear pooling; local perception; parameter-sharing mechanism;
DOI
10.1109/ACCESS.2018.2873570
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Visual question answering is a challenging multimodal task that has received increasing attention in recent years. A key problem in visual question answering is how to fuse the visual and textual features extracted from the image and the question, so that information from both modalities can be used together to deliver correct answers. Bilinear pooling is a powerful fusion approach because it models the interaction between every pair of elements from the two modalities, but its large number of parameters limits its practical application. In this paper, we aim to retain the advantages of bilinear pooling for feature interaction and propose a novel multimodal feature fusion approach named multimodal local perception bilinear (MLPB) pooling, which retains the second-order interactions between visual and textual features with a limited number of learnable parameters. Specifically, MLPB uses a local perception mechanism, which transforms the bilinear pooling between two high-dimensional raw features into bilinear pooling between multiple low-dimensional part features. To further reduce the computational cost, we propose to share the learnable parameters across the local bilinear poolings. In this way, MLPB can achieve the complex interactions of bilinear pooling without consuming excessive computational resources. Extensive experiments show that the proposed method achieves performance competitive with or better than the state of the art.
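The local perception and parameter-sharing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the part features are formed, paired, or combined, so the sketch assumes that each feature vector is split into equal contiguous chunks, that chunk p of the visual feature is paired with chunk p of the question feature, and that the local outputs are concatenated. All function names and dimensions are hypothetical.

```python
import random


def bilinear(v, q, W):
    """Full bilinear pooling: z[o] = sum over i, j of v[i] * W[o][i][j] * q[j]."""
    return [
        sum(v[i] * W_o[i][j] * q[j] for i in range(len(v)) for j in range(len(q)))
        for W_o in W
    ]


def split(x, num_parts):
    """Split x into num_parts equal contiguous chunks (an assumed partition)."""
    size = len(x) // num_parts
    return [x[p * size:(p + 1) * size] for p in range(num_parts)]


def local_shared_bilinear(v, q, W_shared, num_parts):
    """Apply one shared bilinear map to each pair of chunks.

    Assumption: chunk p of v is paired with chunk p of q, and the local
    outputs are concatenated. The abstract does not specify either choice.
    """
    out = []
    for v_part, q_part in zip(split(v, num_parts), split(q, num_parts)):
        out.extend(bilinear(v_part, q_part, W_shared))
    return out


def param_count(dv, dq, out_dim, num_parts=1):
    """Weights in one bilinear tensor of shape (out_dim, dv/num_parts, dq/num_parts)."""
    return out_dim * (dv // num_parts) * (dq // num_parts)


# Toy dimensions chosen only for illustration.
random.seed(0)
dv, dq, out_dim, k = 8, 6, 4, 2
v = [random.gauss(0, 1) for _ in range(dv)]
q = [random.gauss(0, 1) for _ in range(dq)]
W_shared = [
    [[random.gauss(0, 1) for _ in range(dq // k)] for _ in range(dv // k)]
    for _ in range(out_dim)
]

z = local_shared_bilinear(v, q, W_shared, k)
full_params = param_count(dv, dq, out_dim)       # one bilinear map on the raw features
shared_params = param_count(dv, dq, out_dim, k)  # one shared map on the chunks
```

With these toy sizes, a full bilinear tensor has 4 * 8 * 6 = 192 weights, while the shared local tensor has 4 * 4 * 3 = 48, which shows why splitting the features and sharing the weights reduces the parameter count.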
Pages: 57923-57932
Number of pages: 10