Hierarchical Multimodality Graph Reasoning for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Zhang, Han [1 ]
Wang, Keming [1 ]
Zhang, Laixian [2 ]
Wang, Bingshu [3 ,4 ]
Li, Xuelong [5 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
[2] Space Engn Univ, Key Lab Intelligent Space TTC&O, Beijing 101416, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710129, Peoples R China
[4] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[5] China Telecom Corp, Inst Artificial Intelligence TeleAI, Beijing 100033, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Semantics; Cognition; Remote sensing; Question answering (information retrieval); Roads; Feature extraction; Attention mechanisms; Sensors; Convolution; Hierarchical learning; parallel multimodality graph reasoning; remote sensing visual question answering (RSVQA);
DOI
10.1109/TGRS.2024.3502800
Chinese Library Classification
P3 [Geophysics]; P59 [Geochemistry];
Discipline Codes
0708; 070902;
Abstract
Remote sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing (RS) images. RSVQA in real-world applications is challenging, as it may involve wide-field visual information and complicated queries. Current RSVQA methods overlook the semantic hierarchy of visual and linguistic information and ignore the complex relations among multimodal instances; consequently, they fall short of comprehensively representing and associating vision-language semantics. In this research, we design an innovative end-to-end model, the Hierarchical Multimodality Graph Reasoning (HMGR) network, which hierarchically learns multigranular vision-language joint representations and interactively parses heterogeneous multimodal relationships. Specifically, we design a hierarchical vision-language encoder (HVLE) that simultaneously represents multiscale vision features and multilevel language features. On top of these representations, vision-language semantic graphs are built and parallel multimodal graph relation reasoning is performed, exploring the complex interaction patterns and implicit semantic relations of both intramodality and intermodality instances. Moreover, we propose a distinctive vision-question (VQ) feature fusion module that combines information at different semantic levels. Extensive experiments on three public large-scale datasets (RSVQA-LR, RSVQA-HRv1, and RSVQA-HRv2) demonstrate that our method surpasses the state of the art across a wide range of vision and query types.
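The graph relation reasoning described in the abstract can be illustrated with a generic attention-based message-passing step over a joint graph of vision and language nodes. This is a minimal sketch under stated assumptions, not the authors' implementation: the node counts, feature dimension, and fully connected adjacency are illustrative placeholders, and `graph_attention` is a hypothetical helper standing in for the paper's parallel multimodal graph reasoning.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(nodes, adj):
    """One round of attention-weighted message passing over a semantic graph.

    nodes: (N, d) node features; adj: (N, N) adjacency (nonzero = edge).
    Returns updated node features of the same shape.
    """
    scores = nodes @ nodes.T / np.sqrt(nodes.shape[1])  # pairwise affinities
    scores = np.where(adj > 0, scores, -1e9)            # mask non-edges
    weights = softmax(scores, axis=1)                   # attention over neighbors
    return weights @ nodes                              # aggregate messages

rng = np.random.default_rng(0)
vision = rng.standard_normal((4, 8))    # e.g., multiscale region features
language = rng.standard_normal((3, 8))  # e.g., word/phrase-level features
nodes = np.vstack([vision, language])   # joint multimodal graph

# Fully connect intra- and inter-modality nodes (self-loops included),
# so both within-modality and cross-modality relations are attended to.
adj = np.ones((7, 7))
updated = graph_attention(nodes, adj)
print(updated.shape)  # (7, 8)
```

In the paper's setting, separate intramodality and intermodality graphs would run such updates in parallel before the VQ fusion module merges the resulting features.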
Pages: 12
Related Papers
50 in total
  • [1] A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering
    Zhang, Zixiao
    Jiao, Licheng
    Li, Lingling
    Liu, Xu
    Chen, Puhua
    Liu, Fang
    Li, Yuxuan
    Guo, Zhicheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [2] Semantic Relation Graph Reasoning Network for Visual Question Answering
    Lan, Hong
    Zhang, Pufen
    TWELFTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2021, 11719
  • [3] VISUAL QUESTION ANSWERING FROM REMOTE SENSING IMAGES
    Lobry, Sylvain
    Murray, Jesse
    Marcos, Diego
    Tuia, Devis
    2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 4951 - 4954
  • [4] RSVQA: Visual Question Answering for Remote Sensing Data
    Lobry, Sylvain
    Marcos, Diego
    Murray, Jesse
    Tuia, Devis
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2020, 58 (12): : 8555 - 8566
  • [5] LANGUAGE TRANSFORMERS FOR REMOTE SENSING VISUAL QUESTION ANSWERING
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [6] Multistep Question-Driven Visual Question Answering for Remote Sensing
    Zhang, Meimei
    Chen, Fang
    Li, Bin
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [7] Hierarchical reasoning based on perception action cycle for visual question answering
    Mohamud, Safaa Abdullahi Moallim
    Jalali, Amin
    Lee, Minho
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 241
  • [8] Learning Hierarchical Reasoning for Text-Based Visual Question Answering
    Li, Caiyuan
    Du, Qinyi
    Wang, Qingqing
    Jin, Yaohui
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 305 - 316
  • [9] Embedding Spatial Relations in Visual Question Answering for Remote Sensing
    Faure, Maxime
    Lobry, Sylvain
    Kurtz, Camille
    Wendling, Laurent
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 310 - 316
  • [10] Sequential Visual Reasoning for Visual Question Answering
    Liu, Jinlai
    Wu, Chenfei
    Wang, Xiaojie
    Dong, Xuan
    PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 410 - 415