Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Cited by: 3
Authors
Liu, Gang [1 ,2 ]
He, Jinlong [1 ,2 ]
Li, Pengfei [1 ,2 ]
Zhong, Shenjun [3 ,4 ]
Li, Hongyang [1 ,2 ]
He, Genrong [1 ,2 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Natl Engn Lab E Govt Modeling & Simulat, Harbin 150001, Peoples R China
[3] Monash Biomed Imaging, Clayton, Australia
[4] Monash Univ, Natl Imaging Facil, Clayton, Vic 3800, Australia
Keywords
remote-sensing visual question answering; cross-modal mixture experts; cross-modal attention; transformer; vision transformer; BERT; classification
DOI
10.3390/rs15194682
Chinese Library Classification (CLC)
X [Environmental Science, Safety Science]
Discipline Classification Codes
08; 0830
Abstract
Remote-sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing images accurately by leveraging both visual and textual information during inference. However, most existing methods overlook the interaction between visual and language features: they adopt simple feature-fusion strategies, fail to adequately model cross-modal attention, and therefore struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) model to address the RSVQA problem. Specifically, we use a vision transformer (ViT) and BERT to extract visual and language features, respectively. We then incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, together with the modality experts, we effectively capture the intricate interactions between visual and language features and focus more precisely on their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that our method surpasses current state-of-the-art (SOTA) techniques. We also present an extensive analysis validating the effectiveness of the different components of our framework.
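To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of one CMME-style block: a shared multi-head self-attention layer runs over the concatenated ViT patch tokens and BERT token embeddings (so every token attends both within and across modalities), after which each modality's tokens are routed to its own expert feed-forward network. All module names, dimensions, and the pre-norm residual layout are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class CMMEBlock(nn.Module):
    """One cross-modal mixture-experts block (illustrative sketch)."""

    def __init__(self, dim: int = 768, num_heads: int = 12, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared self-attention over the concatenated visual + language
        # sequence; cross-modal attention emerges because queries from one
        # modality can attend to keys/values from the other.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

        def ffn() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(dim, ffn_mult * dim),
                nn.GELU(),
                nn.Linear(ffn_mult * dim, dim),
            )

        # Modality experts: separate FFNs that specialize on each modality.
        self.vision_expert = ffn()
        self.language_expert = ffn()

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (B, Nv, D) ViT patch features; txt: (B, Nt, D) BERT features.
        n_vis = vis.size(1)
        x = torch.cat([vis, txt], dim=1)       # (B, Nv + Nt, D)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # shared self/cross-modal attention
        x = x + attn_out                       # residual connection
        h = self.norm2(x)
        # Hard routing: visual tokens go to the vision expert, language
        # tokens to the language expert, then the sequences are rejoined.
        expert_out = torch.cat(
            [self.vision_expert(h[:, :n_vis]), self.language_expert(h[:, n_vis:])],
            dim=1,
        )
        x = x + expert_out                     # residual connection
        return x[:, :n_vis], x[:, n_vis:]

# Usage with dummy features (ViT-B/16 yields 197 tokens for a 224x224 image):
block = CMMEBlock()
vis = torch.randn(2, 197, 768)   # visual features from a ViT backbone
txt = torch.randn(2, 32, 768)    # question features from BERT-base
vis_out, txt_out = block(vis, txt)

A full RSVQA model along these lines would stack several such blocks on top of ViT/BERT encoders and feed a pooled fused representation to an answer classifier; the exact attention sharing and expert routing in TCMME may differ from this sketch.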
Pages: 21