Multimodal Local Perception Bilinear Pooling for Visual Question Answering

Cited by: 8
|
Authors
Lao, Mingrui [1 ]
Guo, Yanming [1 ]
Wang, Hui [1 ]
Zhang, Xin [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410073, Hunan, Peoples R China
Source
IEEE ACCESS | 2018 / Vol. 6
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; bilinear pooling; local perception; parameter-sharing mechanism;
DOI
10.1109/ACCESS.2018.2873570
CLC Classification
TP [automation technology, computer technology];
Discipline Code
0812;
Abstract
Visual question answering is a challenging multimodal task that has received increasing attention in recent years. One key problem in visual question answering is how to fuse the visual and textual features extracted from the image and the question, so that the information from both modalities can be comprehensively exploited to deliver correct answers. Bilinear pooling has been a powerful fusion approach owing to its exhaustive interaction between each element of the two modalities, but its excessive number of parameters limits its practical application. In this paper, we aim to retain the advantages of bilinear pooling for feature interaction and propose a novel multimodal feature fusion approach named multimodal local perception bilinear (MLPB) pooling, which preserves the second-order interactions between visual and textual features with a limited number of learnable parameters. Specifically, MLPB utilizes a local perception mechanism, which transforms the bilinear pooling between two high-dimensional raw features into multiple bilinear poolings between low-dimensional part features. To further reduce the computational cost, we propose to share the learnable parameters across the local bilinear poolings. In this way, MLPB achieves the complex interactions of bilinear pooling without consuming excessive computational resources. Extensive experiments show that the proposed method achieves competitive or better performance compared with the state of the art.
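The mechanism described in the abstract can be illustrated with a minimal sketch: split each modality's feature vector into k low-dimensional parts and apply one shared bilinear tensor to each corresponding pair of parts, instead of one large bilinear tensor over the full vectors. This is an assumption-laden illustration of the general idea (dimensions, the concatenation of part outputs, and the use of NumPy are all choices made here, not details from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, o = 512, 8, 16   # feature dim, number of local parts, output dim per part
p = d // k             # dimension of each local part

v = rng.normal(size=d)  # visual feature (e.g., from a CNN) -- illustrative
q = rng.normal(size=d)  # textual feature (e.g., from an RNN) -- illustrative

# One SHARED bilinear tensor for all local parts (parameter-sharing mechanism):
# p*p*o parameters instead of the d*d*o needed by full bilinear pooling.
W = rng.normal(size=(p, p, o))

# Local perception: split each modality into k low-dimensional parts and apply
# the shared bilinear pooling to each corresponding pair of parts.
parts = []
for i in range(k):
    vi = v[i * p:(i + 1) * p]
    qi = q[i * p:(i + 1) * p]
    zi = np.einsum('a,abo,b->o', vi, W, qi)  # second-order interaction
    parts.append(zi)

z = np.concatenate(parts)  # fused multimodal feature of length k*o

full_params = d * d * o    # full bilinear pooling
local_params = p * p * o   # shared local bilinear pooling
print(z.shape)                        # (128,)
print(full_params // local_params)    # parameter reduction factor = k*k = 64
```

With k = 8 local parts, the shared bilinear tensor uses k² = 64 times fewer parameters than a full bilinear interaction over the raw features, while each part still retains exhaustive second-order interactions.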
Pages: 57923-57932
Number of pages: 10
Related Papers
50 records in total
  • [21] Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
    Saqur, Raeid
    Narasimhan, Karthik
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [22] EduVQA: A multimodal Visual Question Answering framework for smart education
    Xiao, Jiongen
    Zhang, Zifeng
    ALEXANDRIA ENGINEERING JOURNAL, 2025, 122 : 615 - 624
  • [23] Visual Question Answering based on multimodal triplet knowledge accumulation
    Wang, Fengjuan
    An, Gaoyun
    2022 16TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP2022), VOL 1, 2022, : 81 - 84
  • [24] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    arXiv, 2021,
  • [25] Improving Visual Question Answering by Multimodal Gate Fusion Network
    Xiang, Shenxiang
    Chen, Qiaohong
    Fang, Xian
    Guo, Menghao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [26] Comprehensive-perception dynamic reasoning for visual question answering
    Shuang, Kai
    Guo, Jinyu
    Wang, Zihan
    PATTERN RECOGNITION, 2022, 131
  • [27] Latent Attention Network With Position Perception for Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Wang, Zhe
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (03) : 5059 - 5069
  • [28] Latent Attention Network With Position Perception for Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Wang, Zhe
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 11
  • [29] Visual Experience-Based Question Answering with Complex Multimodal Environments
    Kim, Incheol
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2020, 2020 (2020)
  • [30] Multimodal Encoder-Decoder Attention Networks for Visual Question Answering
    Chen, Chongqing
    Han, Dezhi
    Wang, Jun
    IEEE ACCESS, 2020, 8 : 35662 - 35671