Cross-Modal Visual Question Answering for Remote Sensing Data

Cited by: 1
Authors
Felix, Rafael [1 ]
Repasky, Boris [1 ,2 ]
Hodge, Samuel [1 ]
Zolfaghari, Reza [3 ]
Abbasnejad, Ehsan [2 ]
Sherrah, Jamie [2 ]
Affiliations
[1] Australian Inst Machine Learning, Adelaide, SA, Australia
[2] Lockheed Martin Australia STELaRLab, Mawson Lakes, Australia
[3] Def Sci & Technol Grp, Canberra, ACT, Australia
Keywords
Visual Question Answering; Deep Learning; Natural Language Processing; Convolutional Neural Networks; Recurrent Neural Networks; OpenStreetMap; Classification;
DOI
10.1109/DICTA52665.2021.9647287
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While querying of structured geo-spatial data such as Google Maps has become commonplace, there remains a wealth of unstructured information in overhead imagery that is largely inaccessible to users. This information can be made accessible using machine learning for Visual Question Answering (VQA) about remote sensing imagery. We propose a novel method for Earth observation, based on answering natural language questions about satellite images, that uses cross-modal attention between image objects and text. The image is encoded in an object-centric feature space with self-attention between objects, and the question is encoded with a language transformer network. The image and question representations are fed to a cross-modal transformer network that uses cross-attention between the image and text modalities to generate the answer. Our method is applied to the RSVQA remote sensing dataset and achieves a significant accuracy increase over the previous benchmark.
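The cross-attention step the abstract describes (question tokens attending over detected image objects) can be sketched as follows. This is a minimal single-head illustration with randomly initialised matrices standing in for learned projection weights; the dimensions, head count, and layer structure of the paper's actual model are not specified here and are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_objects, d_k=64, seed=0):
    """Single-head cross-attention: text queries attend over image-object keys/values.

    text_tokens:   (n_tokens, d_text)  question-token embeddings
    image_objects: (n_objects, d_img)  object-centric image features
    Returns fused text features (n_tokens, d_k) and the attention map.
    """
    rng = np.random.default_rng(seed)
    d_text = text_tokens.shape[-1]
    d_img = image_objects.shape[-1]
    # Random projections stand in for the learned W_Q, W_K, W_V.
    W_q = rng.standard_normal((d_text, d_k)) / np.sqrt(d_text)
    W_k = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)
    W_v = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)
    Q = text_tokens @ W_q        # (n_tokens, d_k)
    K = image_objects @ W_k      # (n_objects, d_k)
    V = image_objects @ W_v      # (n_objects, d_k)
    # Each text token produces a distribution over image objects.
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_tokens, n_objects)
    return attn @ V, attn

# Example: 5 question tokens attending over 10 detected image objects.
tokens = np.random.default_rng(1).standard_normal((5, 32))
objects = np.random.default_rng(2).standard_normal((10, 128))
fused, attn = cross_attention(tokens, objects)
```

Each row of `attn` sums to 1, so every question token forms a weighted summary of the image objects; in the full model these fused features would pass through further transformer layers before answer generation.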
Pages: 57-65 (9 pages)
Related Papers
50 in total
  • [11] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Begüm
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [12] Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
    Liu, Yang
    Li, Guanbin
    Lin, Liang
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 11624 - 11641
  • [13] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [14] HUMAN GUIDED CROSS-MODAL REASONING WITH SEMANTIC ATTENTION LEARNING FOR VISUAL QUESTION ANSWERING
    Liao, Lei
    Feng, Mao
    Yang, Meng
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 2775 - 2779
  • [15] Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Chen, Mingzhe
    Wang, Zhe
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7151 - 7159
  • [16] Lightweight recurrent cross-modal encoder for video question answering
    Immanuel, Steve Andreas
    Jeong, Cheol
    KNOWLEDGE-BASED SYSTEMS, 2023, 276
  • [17] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [18] VISUAL QUESTION ANSWERING FROM REMOTE SENSING IMAGES
    Lobry, Sylvain
    Murray, Jesse
    Marcos, Diego
    Tuia, Devis
    2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 4951 - 4954
  • [19] LANGUAGE TRANSFORMERS FOR REMOTE SENSING VISUAL QUESTION ANSWERING
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [20] Multistep Question-Driven Visual Question Answering for Remote Sensing
    Zhang, Meimei
    Chen, Fang
    Li, Bin
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61