Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Cited by: 3
Authors
Liu, Gang [1 ,2 ]
He, Jinlong [1 ,2 ]
Li, Pengfei [1 ,2 ]
Zhong, Shenjun [3 ,4 ]
Li, Hongyang [1 ,2 ]
He, Genrong [1 ,2 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Natl Engn Lab E Govt Modeling & Simulat, Harbin 150001, Peoples R China
[3] Monash Biomed Imaging, Clayton, Australia
[4] Monash Univ, Natl Imaging Facil, Clayton, Vic 3800, Australia
Keywords
remote-sensing visual question answering; cross-modal mixture experts; cross-modal attention; transformer; vision transformer; BERT; classification
DOI
10.3390/rs15194682
Chinese Library Classification
X [Environmental Science, Safety Science]
Subject Classification
08; 0830
Abstract
Remote-sensing visual question answering (RSVQA) aims to answer natural-language questions about remote-sensing images accurately by leveraging both visual and textual information during inference. However, most existing methods overlook the interaction between visual and language features: they typically adopt simple feature-fusion strategies, fail to adequately model cross-modal attention, and consequently struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) to address the RSVQA problem. Specifically, we use a vision transformer (ViT) and BERT to extract visual and language features, respectively, and incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, together with the modality experts, our model effectively captures the intricate interactions between visual and language features and better attends to their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that the proposed method surpasses current state-of-the-art (SOTA) techniques, and an extensive analysis validates the effectiveness of the different components of our framework.
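The record does not include implementation details, so the following PyTorch sketch is illustrative only: it shows one plausible reading of a CMME-style layer, in which shared self-attention over the concatenated visual and language tokens also realizes cross-modal attention, and separate feed-forward "modality experts" process each modality's tokens. The class name CrossModalMixtureExpertBlock, all dimensions, and the routing-by-modality scheme are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a cross-modal mixture-of-experts block,
# assuming ViT/BERT features of a common hidden size; NOT the paper's code.
import torch
import torch.nn as nn

class CrossModalMixtureExpertBlock(nn.Module):
    """One transformer layer: shared attention + per-modality FFN experts."""
    def __init__(self, dim=768, num_heads=12, ffn_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared self-attention over the concatenated [visual; language]
        # sequence; tokens of each modality attend to the other, which is
        # how this sketch realizes cross-modal attention.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Modality experts: separate feed-forward networks for visual and
        # language tokens (a two-expert mixture routed by modality).
        self.vision_expert = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.language_expert = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, vis_tokens, lang_tokens):
        # Concatenate modalities so the shared attention mixes them.
        x = torch.cat([vis_tokens, lang_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        # Route each modality's tokens through its own expert FFN.
        n_vis = vis_tokens.size(1)
        h = self.norm2(x)
        out_vis = x[:, :n_vis] + self.vision_expert(h[:, :n_vis])
        out_lang = x[:, n_vis:] + self.language_expert(h[:, n_vis:])
        return out_vis, out_lang

# Usage with dummy features shaped like ViT-B/16 patches and BERT tokens.
vis = torch.randn(2, 197, 768)   # batch of 2, CLS + 196 patch embeddings
lang = torch.randn(2, 40, 768)   # batch of 2, 40 question-token embeddings
v, l = CrossModalMixtureExpertBlock()(vis, lang)
print(v.shape, l.shape)          # (2, 197, 768) and (2, 40, 768)
```

Routing by modality rather than by a learned gate is the simplest way to realize "modality experts" as the abstract describes them; whether the paper additionally uses shared or gated experts is not stated in this record.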
Pages: 21
Related Papers (50 records in total)
  • [1] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael
    Repasky, Boris
    Hodge, Samuel
    Zolfaghari, Reza
    Abbasnejad, Ehsan
    Sherrah, Jamie
    2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 57 - 65
  • [2] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Begüm
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [3] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [4] Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval
    Tang, Xu
    Wang, Yijing
    Ma, Jingjing
    Zhang, Xiangrong
    Liu, Fang
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [5] Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery
    Bazi, Yakoub
    Al Rahhal, Mohamad Mahmoud
    Mekhalfi, Mohamed Lamine
    Al Zuair, Mansour Abdulaziz
    Melgani, Farid
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [6] Visual question answering with attention transfer and a cross-modal gating mechanism
    Li, Wei
    Sun, Jianhui
    Liu, Ge
    Zhao, Linglan
    Fang, Xiangzhong
    PATTERN RECOGNITION LETTERS, 2020, 133 : 334 - 340
  • [7] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [8] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing
    Zhang, Weifeng
    Lu, Yuhang
    Qin, Zengchang
    Hu, Yue
    Tan, Jianlong
    Wu, Qi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
  • [9] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    PATTERN RECOGNITION, 2020, 108
  • [10] Medical visual question answering with symmetric interaction attention and cross-modal gating
    Chen, Zhi
    Zou, Beiji
    Dai, Yulan
    Zhu, Chengzhang
    Kong, Guilan
    Zhang, Wensheng
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 85