Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Times Cited: 3
Authors
Liu, Gang [1 ,2 ]
He, Jinlong [1 ,2 ]
Li, Pengfei [1 ,2 ]
Zhong, Shenjun [3 ,4 ]
Li, Hongyang [1 ,2 ]
He, Genrong [1 ,2 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Natl Engn Lab E Govt Modeling & Simulat, Harbin 150001, Peoples R China
[3] Monash Biomed Imaging, Clayton, Australia
[4] Monash Univ, Natl Imaging Facil, Clayton, Vic 3800, Australia
Keywords
remote-sensing visual question answering; cross-modal mixture experts; cross-modal attention; transformer; vision transformer; BERT; CLASSIFICATION;
DOI
10.3390/rs15194682
Chinese Library Classification (CLC)
X [Environmental Science, Safety Science];
Discipline Code
08; 0830;
Abstract
Remote-sensing visual question answering (RSVQA) aims to answer questions about remote-sensing images accurately by leveraging both visual and textual information during inference. However, most existing methods overlook the importance of the interaction between visual and language features: they typically adopt simple feature-fusion strategies and fail to adequately model cross-modal attention, so they struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) model to address the RSVQA problem. Specifically, we utilize the vision transformer (ViT) and BERT to extract visual and language features, respectively. Furthermore, we incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, as well as the modality experts, we effectively capture the intricate interactions between visual and language features and better focus on their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that our proposed method surpasses the current state-of-the-art (SOTA) techniques. Additionally, we perform an extensive analysis to validate the effectiveness of the different components of our framework.
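The abstract describes the CMME architecture only at a high level; the paper's actual implementation is not reproduced in this record. As a rough illustration, the following minimal PyTorch sketch shows one plausible reading of a single CMME layer: shared multi-head self-attention over the concatenated ViT and BERT token sequences (so each token attends both within and across modalities), followed by modality-specific expert feed-forward networks. All names (CMMEBlock, d_model, the expert modules) and dimensions are hypothetical assumptions, not taken from the paper.

import torch
import torch.nn as nn

class CMMEBlock(nn.Module):
    """Hypothetical sketch of one cross-modal mixture-experts layer:
    shared self-attention over concatenated visual and text tokens,
    then separate (modality-expert) feed-forward networks."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Shared attention: queries, keys, and values span both modalities,
        # covering self-attention and cross-modal attention in one pass.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Modality experts: one feed-forward network per modality.
        self.vision_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, v_tokens, t_tokens):
        # Concatenate visual and text tokens along the sequence dimension.
        x = torch.cat([v_tokens, t_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Split back by modality and route each part to its own expert.
        n_v = v_tokens.size(1)
        v, t = x[:, :n_v], x[:, n_v:]
        v = v + self.vision_expert(self.norm2(v))
        t = t + self.text_expert(self.norm2(t))
        return v, t

# Usage: features from ViT (visual) and BERT (text), projected to d_model.
v = torch.randn(2, 197, 768)   # e.g. ViT patch tokens
t = torch.randn(2, 32, 768)    # e.g. BERT token embeddings
v_out, t_out = CMMEBlock()(v, t)

Under these assumptions, the design choice worth noting is that the attention parameters are shared across modalities while the feed-forward experts are not, which is consistent with the abstract's distinction between shared self-/cross-modal attention and per-modality experts.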
Pages: 21
Related Papers
50 records
  • [31] Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering
    Lyu, Chenyang
    Li, Wenxi
    Ji, Tianbo
    Zhou, Liting
    Gurrin, Cathal
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260: 427-438
  • [32] Enhancing Visual Question Answering with Prompt-based Learning: A Cross-modal Approach for Deep Semantic Understanding
    Zhu, Shuaiyu
    Peng, Shuo
    Chen, Shengbo
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ALGORITHMS, SOFTWARE ENGINEERING, AND NETWORK SECURITY, ASENS 2024, 2024: 713-717
  • [33] CroMIC-QA: The Cross-Modal Information Complementation Based Question Answering
    Qian, Shun
    Liu, Bingquan
    Sun, Chengjie
    Xu, Zhen
    Ma, Lin
    Wang, Baoxun
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 8348-8359
  • [34] VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
    Liu, Yang
    Tan, Ying
    Luo, Jingzhou
    Chen, Weixing
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431: 309-322
  • [35] Cross-Modal Feature Fusion and Interaction Strategy for CNN-Transformer-Based Object Detection in Visual and Infrared Remote Sensing Imagery
    Nie, Jinyan
    Sun, He
    Sun, Xu
    Ni, Li
    Gao, Lianru
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21: 1-5
  • [36] MULTI-SCALE INTERACTIVE TRANSFORMER FOR REMOTE SENSING CROSS-MODAL IMAGE-TEXT RETRIEVAL
    Wang, Yijing
    Ma, Jingjing
    Li, Mingteng
    Tang, Xu
    Han, Xiao
    Jiao, Licheng
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022: 839-842
  • [37] Robust visual question answering via semantic cross modal augmentation
    Mashrur, Akib
    Luo, Wei
    Zaidi, Nayyar A.
    Robles-Kelly, Antonio
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
  • [38] A General Cross-Modal Correlation Learning Method for Remote Sensing
    Lü Y.
    Xiong W.
    Zhang X.
    Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomatics and Information Science of Wuhan University, 2022, 47 (11): 1887-1895
  • [39] Deep Cross-Modal Retrieval for Remote Sensing Image and Audio
    Guo Mao
    Yuan Yuan
    Lu Xiaoqiang
    2018 10TH IAPR WORKSHOP ON PATTERN RECOGNITION IN REMOTE SENSING (PRRS), 2018
  • [40] UNSUPERVISED CONTRASTIVE HASHING FOR CROSS-MODAL RETRIEVAL IN REMOTE SENSING
    Mikriukov, Georgii
    Ravanbakhsh, Mahdyar
    Demir, Begum
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 4463-4467