Hierarchical Multimodality Graph Reasoning for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Zhang, Han [1 ]
Wang, Keming [1 ]
Zhang, Laixian [2 ]
Wang, Bingshu [3 ,4 ]
Li, Xuelong [5 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
[2] Space Engn Univ, Key Lab Intelligent Space TTC&O, Beijing 101416, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710129, Peoples R China
[4] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[5] China Telecom Corp, Inst Artificial Intelligence TeleAI, Beijing 100033, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Visualization; Semantics; Cognition; Remote sensing; Question answering (information retrieval); Roads; Feature extraction; Attention mechanisms; Sensors; Convolution; Hierarchical learning; parallel multimodality graph reasoning; remote sensing visual question answering (RSVQA);
DOI
10.1109/TGRS.2024.3502800
Chinese Library Classification
P3 [Geophysics]; P59 [Geochemistry];
Discipline Code
0708; 070902;
Abstract
Remote sensing visual question answering (RSVQA) aims to answer natural-language questions about RS images. RSVQA in real-world applications is challenging, as it may involve wide-field visual information and complicated queries. Current RSVQA methods overlook the semantic hierarchy of visual and linguistic information and ignore the complex relations among multimodal instances; they therefore suffer from serious deficiencies in comprehensively representing and associating vision-language semantics. In this research, we design an innovative end-to-end model, named the Hierarchical Multimodality Graph Reasoning (HMGR) network, which hierarchically learns multigranular vision-language joint representations and interactively parses heterogeneous multimodal relationships. Specifically, we design a hierarchical vision-language encoder (HVLE) that simultaneously represents multiscale vision features and multilevel language features. Based on these representations, vision-language semantic graphs are built and parallel multimodal graph relation reasoning is performed, which explores the complex interaction patterns and implicit semantic relations of both intramodality and intermodality instances. Moreover, we propose a distinctive vision-question (VQ) feature fusion module for the collaboration of information at different semantic levels. Extensive experiments on three public large-scale datasets (RSVQA-LR, RSVQA-HRv1, and RSVQA-HRv2) demonstrate that our work is superior to state-of-the-art results across a wide range of image and query types.
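The abstract describes cross-modal graph reasoning only at a high level. As an illustration of the general idea, the following minimal NumPy sketch shows one round of intermodality message passing over a cross-modal affinity graph, where vision-region nodes aggregate language-node features via attention and vice versa. The function names, feature shapes, and the dot-product attention form are assumptions for exposition, not the paper's actual HMGR implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_reason(nodes_a, nodes_b):
    """One round of intermodality message passing: each node in
    nodes_a aggregates features from nodes_b, weighted by attention
    over the cross-modal affinity graph, with a residual update."""
    affinity = nodes_a @ nodes_b.T           # (Na, Nb) edge weights
    attn = softmax(affinity, axis=-1)        # normalize per source node
    return nodes_a + attn @ nodes_b          # residual message passing

# toy features: 4 vision-region nodes and 3 word nodes, dim 8
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
q = rng.normal(size=(3, 8))

v_updated = graph_reason(v, q)  # vision nodes enriched with language context
q_updated = graph_reason(q, v)  # language nodes enriched with visual context
# a simple stand-in for VQ feature fusion: pool each modality and concatenate
fused = np.concatenate([v_updated.mean(axis=0), q_updated.mean(axis=0)])
```

A real model would use learned projections for the affinity, run intramodality passes as well, and stack several such rounds per semantic level; the sketch keeps only the graph-attention skeleton.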
Pages: 12
Related Papers
50 items
  • [41] Bilinear Graph Networks for Visual Question Answering
    Guo, Dalu
    Xu, Chang
    Tao, Dacheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (02) : 1023 - 1034
  • [42] Graph Strategy for Interpretable Visual Question Answering
    Sarkisyan, Christina
    Savelov, Mikhail
    Kovalev, Alexey K.
    Panov, Aleksandr I.
    ARTIFICIAL GENERAL INTELLIGENCE, AGI 2022, 2023, 13539 : 86 - 99
  • [43] DriveLM: Driving with Graph Visual Question Answering
    Sima, Chonghao
    Renz, Katrin
    Chitta, Kashyap
    Chen, Li
    Zhang, Hanxue
    Xie, Chengen
    Beisswenger, Jens
    Luo, Ping
    Geiger, Andreas
    Li, Hongyang
    COMPUTER VISION - ECCV 2024, PT LII, 2025, 15110 : 256 - 274
  • [44] Multi-scale dual-stream visual feature extraction and graph reasoning for visual question answering
    Yusuf, Abdulganiyu Abdu
    Feng, Chong
    Mao, Xianling
    Li, Xinyan
    Haruna, Yunusa
    Duma, Ramadhani Ally
    APPLIED INTELLIGENCE, 2025, 55 (06)
  • [45] Coarse-to-Fine Reasoning for Visual Question Answering
    Nguyen, Binh X.
    Tuong Do
    Huy Tran
    Tjiputra, Erman
    Tran, Quang D.
    Anh Nguyen
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4557 - 4565
  • [46] Medical Visual Question Answering via Conditional Reasoning
    Zhan, Li-Ming
    Liu, Bo
    Fan, Lu
    Chen, Jiaxin
    Wu, Xiao-Ming
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2345 - 2354
  • [47] Interpretable Visual Question Answering by Reasoning on Dependency Trees
    Cao, Qingxing
    Liang, Xiaodan
    Li, Bailin
    Lin, Liang
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (03) : 887 - 901
  • [48] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
    Hussain, Afzaal
    Maqsood, Ifrah
    Shahzad, Muhammad
    Fraz, Muhammad Moazam
    2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
  • [49] Relational reasoning and adaptive fusion for visual question answering
    Shen, Xiang
    Han, Dezhi
    Zong, Liang
    Guo, Zihan
    Hua, Jie
    APPLIED INTELLIGENCE, 2024, 54 (06) : 5062 - 5080
  • [50] INTERPRETABLE VISUAL QUESTION ANSWERING VIA REASONING SUPERVISION
    Parelli, Maria
    Mallis, Dimitrios
    Diomataris, Markos
    Pitsikalis, Vassilis
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2525 - 2529