Hierarchical Multimodality Graph Reasoning for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Zhang, Han [1]
Wang, Keming [1]
Zhang, Laixian [2]
Wang, Bingshu [3, 4]
Li, Xuelong [5]
Affiliations
[1] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
[2] Space Engn Univ, Key Lab Intelligent Space TTC&O, Beijing 101416, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710129, Peoples R China
[4] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[5] China Telecom Corp, Inst Artificial Intelligence TeleAI, Beijing 100033, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Semantics; Cognition; Remote sensing; Question answering (information retrieval); Roads; Feature extraction; Attention mechanisms; Sensors; Convolution; Hierarchical learning; parallel multimodality graph reasoning; remote sensing visual question answering (RSVQA)
DOI
10.1109/TGRS.2024.3502800
Chinese Library Classification (CLC) number
P3 [Geophysics]; P59 [Geochemistry]
Subject classification code
0708; 070902
Abstract
Remote sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing (RS) images. In real-world applications RSVQA is challenging, because the images can contain wide-field visual information and the queries can be complicated. Current RSVQA methods overlook the semantic hierarchy of visual and linguistic information and ignore the complex relations among multimodal instances, so they fall short in comprehensively representing and associating vision-language semantics. In this research, we design an end-to-end model, the Hierarchical Multimodality Graph Reasoning (HMGR) network, which hierarchically learns multigranular vision-language joint representations and interactively parses heterogeneous multimodal relationships. Specifically, we design a hierarchical vision-language encoder (HVLE) that simultaneously represents multiscale vision features and multilevel language features. Based on these representations, vision-language semantic graphs are built and parallel multimodal graph relation reasoning is performed, which explores the complex interaction patterns and implicit semantic relations of both intramodality and intermodality instances. Moreover, we propose a vision-question (VQ) feature fusion module that combines information at different semantic levels. Extensive experiments on three public large-scale datasets (RSVQA-LR, RSVQA-HRv1, and RSVQA-HRv2) demonstrate that our method outperforms state-of-the-art results across a wide range of vision and query types.
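The abstract names intramodality and intermodality relation reasoning followed by vision-question fusion, but gives no equations. The sketch below is only an illustration of that general idea, not the HMGR implementation: it swaps in plain scaled dot-product attention as a stand-in for the paper's graph relation reasoning, and mean pooling plus concatenation as a stand-in for the VQ fusion module. All function names (`attend`, `reason_step`, `fuse`) and the toy node features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys):
    """For each query node, return the attention-weighted average of the key nodes."""
    dim = len(keys[0])
    messages = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        messages.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return messages


def add_residual(nodes, intra, inter):
    """Add intra- and inter-modality messages to the original node features."""
    return [
        [x + a + b for x, a, b in zip(node, m_intra, m_inter)]
        for node, m_intra, m_inter in zip(nodes, intra, inter)
    ]


def reason_step(vision, language):
    """One parallel update (assumed form): both modalities read the same inputs."""
    new_vision = add_residual(
        vision, attend(vision, vision), attend(vision, language)
    )
    new_language = add_residual(
        language, attend(language, language), attend(language, vision)
    )
    return new_vision, new_language


def mean_pool(nodes):
    dim = len(nodes[0])
    return [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]


def fuse(vision, language):
    """Stand-in fusion: concatenate the pooled vision and language graphs."""
    return mean_pool(vision) + mean_pool(language)


# Toy graphs: three vision nodes (e.g. image regions) and two language nodes
# (e.g. question tokens), each with a 2-dimensional feature. Values are made up.
vision_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language_nodes = [[0.5, 0.5], [1.0, 0.0]]

new_vision, new_language = reason_step(vision_nodes, language_nodes)
fused = fuse(new_vision, new_language)
```

Here `fused` is a single joint vector that a classifier could map to an answer. A real model would use learned projections and multiple scales or levels per modality, which this sketch omits.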
Pages: 12