Hierarchical Multimodality Graph Reasoning for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Zhang, Han [1 ]
Wang, Keming [1 ]
Zhang, Laixian [2 ]
Wang, Bingshu [3 ,4 ]
Li, Xuelong [5 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
[2] Space Engn Univ, Key Lab Intelligent Space TTC&O, Beijing 101416, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710129, Peoples R China
[4] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[5] China Telecom Corp, Inst Artificial Intelligence TeleAI, Beijing 100033, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Semantics; Cognition; Remote sensing; Question answering (information retrieval); Roads; Feature extraction; Attention mechanisms; Sensors; Convolution; Hierarchical learning; parallel multimodality graph reasoning; remote sensing visual question answering (RSVQA);
DOI
10.1109/TGRS.2024.3502800
Chinese Library Classification (CLC)
P3 [Geophysics]; P59 [Geochemistry]
Discipline codes
0708; 070902
Abstract
Remote sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing (RS) images. Real-world RSVQA is challenging, as it may involve wide-field visual information and complicated queries. Current RSVQA methods overlook the semantic hierarchy of visual and linguistic information and ignore the complex relations among multimodal instances; consequently, they fall short in comprehensively representing and associating vision-language semantics. In this research, we design an innovative end-to-end model, the Hierarchical Multimodality Graph Reasoning (HMGR) network, which hierarchically learns multigranular vision-language joint representations and interactively parses heterogeneous multimodal relationships. Specifically, we design a hierarchical vision-language encoder (HVLE) that simultaneously represents multiscale vision features and multilevel language features. On top of these representations, vision-language semantic graphs are built and parallel multimodal graph relation reasoning is performed, exploring the complex interaction patterns and implicit semantic relations of both intramodality and intermodality instances. Moreover, we propose a distinctive vision-question (VQ) feature fusion module for combining information at different semantic levels. Extensive experiments on three public large-scale datasets (RSVQA-LR, RSVQA-HRv1, and RSVQA-HRv2) demonstrate that our method surpasses state-of-the-art results across a wide range of vision and query types.
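The pipeline the abstract describes (multigranular encoding, intramodality and intermodality graph reasoning, then VQ fusion) can be illustrated with a highly simplified NumPy sketch. This is not the authors' HMGR implementation; all dimensions, the dense softmax affinity graph, and the mean-pool fusion are hypothetical placeholders standing in for the learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_reasoning(nodes, n_steps=2):
    """One graph-reasoning branch: nodes attend to each other via a
    dense affinity (adjacency) matrix, then aggregate messages."""
    for _ in range(n_steps):
        adj = softmax(nodes @ nodes.T / np.sqrt(nodes.shape[1]))
        nodes = nodes + adj @ nodes  # residual message passing
    return nodes

# Hypothetical granularities: 3 visual scales of 4 region nodes each,
# 2 language levels (word, phrase) of 5 token nodes each, dim d = 8.
d = 8
vision = [rng.standard_normal((4, d)) for _ in range(3)]
language = [rng.standard_normal((5, d)) for _ in range(2)]

# Intramodality reasoning per modality, then a joint intermodality graph.
v_nodes = graph_reasoning(np.concatenate(vision))    # (12, 8)
l_nodes = graph_reasoning(np.concatenate(language))  # (10, 8)
joint = graph_reasoning(np.concatenate([v_nodes, l_nodes]))

# Stand-in for VQ fusion: pool each modality's reasoned nodes and combine.
n_v = len(v_nodes)
fused = np.tanh(joint[:n_v].mean(axis=0) + joint[n_v:].mean(axis=0))
print(fused.shape)  # (8,)
```

In the actual model the affinity matrices and fusion would be learned, and the fused vector would feed an answer classifier; here the sketch only shows how multiscale and multilevel node sets can share one reasoning mechanism.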
Pages: 12