Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Cited by: 3
Authors
Liu, Gang [1 ,2 ]
He, Jinlong [1 ,2 ]
Li, Pengfei [1 ,2 ]
Zhong, Shenjun [3 ,4 ]
Li, Hongyang [1 ,2 ]
He, Genrong [1 ,2 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Natl Engn Lab E Govt Modeling & Simulat, Harbin 150001, Peoples R China
[3] Monash Biomed Imaging, Clayton, Australia
[4] Monash Univ, Natl Imaging Facil, Clayton, Vic 3800, Australia
Keywords
remote-sensing visual question answering; cross-modal mixture experts; cross-modal attention; transformer; vision transformer; BERT; classification
DOI
10.3390/rs15194682
Chinese Library Classification (CLC)
X [Environmental science; safety science]
Discipline Code
08; 0830
Abstract
Remote-sensing visual question answering (RSVQA) aims to answer questions about remote sensing images accurately by leveraging both visual and textual information during inference. However, most existing methods adopt simple feature-fusion strategies that overlook the interaction between visual and language features; because they fail to adequately model cross-modal attention, they struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) model to address the RSVQA problem. Specifically, we use a vision transformer (ViT) and BERT to extract visual and language features, respectively, and incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, together with the modality experts, we effectively capture the intricate interactions between visual and language features and better model their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that the proposed method surpasses current state-of-the-art (SOTA) techniques. Additionally, we perform an extensive analysis to validate the effectiveness of each component in our framework.
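The record does not include the authors' implementation, but the core idea in the abstract (visual and language tokens sharing one self-attention layer so attention is also cross-modal, then being routed to separate modality experts) can be sketched in PyTorch. All module names, dimensions, and routing details below are assumptions for illustration, not the TCMME code:

```python
import torch
import torch.nn as nn

class CMMEBlock(nn.Module):
    """Illustrative cross-modal mixture-experts block (not the authors' code).

    Visual and text tokens are concatenated and passed through a single
    shared self-attention layer, so each modality also attends to the
    other (cross-modal attention). The tokens are then routed to
    modality-specific expert feed-forward networks.
    """

    def __init__(self, dim=64, heads=4, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One expert FFN per modality: index 0 = vision, index 1 = language.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                          nn.Linear(ffn_mult * dim, dim))
            for _ in range(2)
        ])

    def forward(self, vis, txt):
        # Shared attention over the concatenated token sequence.
        x = torch.cat([vis, txt], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Route each modality's tokens to its own expert FFN.
        n_vis = vis.size(1)
        out_vis = x[:, :n_vis] + self.experts[0](h[:, :n_vis])
        out_txt = x[:, n_vis:] + self.experts[1](h[:, n_vis:])
        return out_vis, out_txt

# Example: 49 ViT patch tokens and 16 BERT tokens, embedding dim 64.
block = CMMEBlock(dim=64)
vis_out, txt_out = block(torch.randn(2, 49, 64), torch.randn(2, 16, 64))
```

In the full model described by the abstract, such blocks would sit on top of ViT and BERT encoders, with an answer-classification head over the fused representation.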
Pages: 21