Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Cited by: 3
Authors
Liu, Gang [1 ,2 ]
He, Jinlong [1 ,2 ]
Li, Pengfei [1 ,2 ]
Zhong, Shenjun [3 ,4 ]
Li, Hongyang [1 ,2 ]
He, Genrong [1 ,2 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Natl Engn Lab E Govt Modeling & Simulat, Harbin 150001, Peoples R China
[3] Monash Biomed Imaging, Clayton, Australia
[4] Monash Univ, Natl Imaging Facil, Clayton, Vic 3800, Australia
Keywords
remote-sensing visual question answering; cross-modal mixture experts; cross-modal attention; transformer; vision transformer; BERT; classification
DOI
10.3390/rs15194682
Chinese Library Classification
X [Environmental Science, Safety Science]
Subject Classification
08; 0830
Abstract
Remote-sensing visual question answering (RSVQA) aims to answer natural-language questions about remote-sensing images accurately by leveraging both visual and textual information during inference. However, most existing methods overlook the interaction between visual and language features: they typically adopt simple feature-fusion strategies, fail to adequately model cross-modal attention, and consequently struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) to address the RSVQA problem. Specifically, we use a vision transformer (ViT) and BERT to extract visual and language features, respectively, and incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, together with the modality experts, our model effectively captures the intricate interactions between visual and language features and better attends to their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that the proposed method surpasses current state-of-the-art (SOTA) techniques, and an extensive analysis validates the effectiveness of the different components of our framework.
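The record does not include implementation details, so the following PyTorch sketch is illustrative only: it shows one plausible reading of a CMME-style layer, in which shared self-attention over the concatenated visual and language tokens also realizes cross-modal attention, and separate feed-forward "modality experts" process each modality's tokens. The class name CrossModalMixtureExpertBlock, all dimensions, and the routing-by-modality scheme are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a cross-modal mixture-of-experts block,
# assuming ViT/BERT features of a common hidden size; NOT the paper's code.
import torch
import torch.nn as nn

class CrossModalMixtureExpertBlock(nn.Module):
    """One transformer layer: shared attention + per-modality FFN experts."""
    def __init__(self, dim=768, num_heads=12, ffn_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared self-attention over the concatenated [visual; language]
        # sequence; tokens of each modality attend to the other, which is
        # how this sketch realizes cross-modal attention.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Modality experts: separate feed-forward networks for visual and
        # language tokens (a two-expert mixture routed by modality).
        self.vision_expert = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.language_expert = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, vis_tokens, lang_tokens):
        # Concatenate modalities so the shared attention mixes them.
        x = torch.cat([vis_tokens, lang_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        # Route each modality's tokens through its own expert FFN.
        n_vis = vis_tokens.size(1)
        h = self.norm2(x)
        out_vis = x[:, :n_vis] + self.vision_expert(h[:, :n_vis])
        out_lang = x[:, n_vis:] + self.language_expert(h[:, n_vis:])
        return out_vis, out_lang

# Usage with dummy features shaped like ViT-B/16 patches and BERT tokens.
vis = torch.randn(2, 197, 768)   # batch of 2, CLS + 196 patch embeddings
lang = torch.randn(2, 40, 768)   # batch of 2, 40 question-token embeddings
v, l = CrossModalMixtureExpertBlock()(vis, lang)
print(v.shape, l.shape)          # (2, 197, 768) and (2, 40, 768)
```

Routing by modality rather than by a learned gate is the simplest way to realize "modality experts" as the abstract describes them; whether the paper additionally uses shared or gated experts is not stated in this record.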
Pages: 21
Related Papers (50 records in total)
  • [1] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael
    Repasky, Boris
    Hodge, Samuel
    Zolfaghari, Reza
    Abbasnejad, Ehsan
    Sherrah, Jamie
    2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 57 - 65
  • [2] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Begüm
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [3] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [4] Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval
    Tang, Xu
    Wang, Yijing
    Ma, Jingjing
    Zhang, Xiangrong
    Liu, Fang
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [5] Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery
    Bazi, Yakoub
    Al Rahhal, Mohamad Mahmoud
    Mekhalfi, Mohamed Lamine
    Al Zuair, Mansour Abdulaziz
    Melgani, Farid
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [6] Visual question answering with attention transfer and a cross-modal gating mechanism
    Li, Wei
    Sun, Jianhui
    Liu, Ge
    Zhao, Linglan
    Fang, Xiangzhong
    PATTERN RECOGNITION LETTERS, 2020, 133 : 334 - 340
  • [7] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [8] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing
    Zhang, Weifeng
    Lu, Yuhang
    Qin, Zengchang
    Hu, Yue
    Tan, Jianlong
    Wu, Qi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
  • [9] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    PATTERN RECOGNITION, 2020, 108
  • [10] Medical visual question answering with symmetric interaction attention and cross-modal gating
    Chen, Zhi
    Zou, Beiji
    Dai, Yulan
    Zhu, Chengzhang
    Kong, Guilan
    Zhang, Wensheng
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 85