Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

Cited: 0
Authors
Tang, Xu [1 ]
Huang, Dabiao [1 ]
Ma, Jingjing [1 ]
Zhang, Xiangrong [1 ]
Liu, Fang [2 ]
Jiao, Licheng [1 ]
Affiliations
[1] Xidian Univ, Minist Educ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Minist Educ, Sch Comp Sci & Engn, Key Lab of Intelligent Perception and Systems for High-Dimensional Information, Nanjing 210094, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Visualization; Feature extraction; Transformers; Semantics; Training; Convolutional neural networks; Recurrent neural networks; Learning from prior experiences (LPEs); multiscale feature fusion; remote sensing image-text retrieval (RSITR); transformer; BIG DATA; FUSION;
DOI
10.1109/TGRS.2024.3464468
CLC Number
P3 [Geophysics]; P59 [Geochemistry];
Subject Classification Codes
0708; 070902;
Abstract
Remote sensing (RS) image-text retrieval (RSITR) aims to retrieve relevant texts (RS images) based on the content of a given RS image (text). Existing methods typically employ a convolutional neural network (CNN) and a recurrent neural network (RNN) as encoders to learn visual and textual features for retrieval. Although feasible, this design does not give the global information hidden in the different modalities the attention it deserves. To mitigate this problem, transformers have been introduced. Nevertheless, the complexity of RS images presents challenges in directly applying transformer-based architectures to multimodal learning in RS scenes, particularly in visual feature extraction and cross-modal interaction. In addition, textual captions are always simpler than the complex RS images they describe, so the same semantic description can apply to different images. This typical false-negative (FN) sample problem increases the difficulty of RSITR tasks. To address the above limitations, we propose a new RSITR model named prior-experience-based RS vision-language (PERSVL). First, specific visual and text encoders are used to extract features from RS images and texts, and a high-level feature complement (HFC) module based on the self-attention mechanism (SAM) is developed for the visual encoder to fully explore the complex contents of RS images. Second, a dual-branch multimodal fusion encoder (DBMFE) is designed to complete the cross-modal learning. It comprises a dual-branch multimodal interaction (DBMI) module and a branch fusion module. DBMI fully explores the relationships between the modalities, enriching the visual and textual features, while the branch fusion module integrates the cross-modal features and uses a classification head to produce matching scores for retrieval. Finally, a learning from prior experiences (LPEs) module is designed to reduce the influence of FN samples by analyzing the historical data produced during model training. Experiments conducted on three popular datasets show that our PERSVL model achieves superior performance compared with previous methods. By integrating the advantages of natural language and RS images, PERSVL can serve various applications, such as environmental monitoring, disaster evaluation, and urban planning. Our source code is available at: https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/PERSVL.
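To make the matching pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of a dual-branch cross-modal interaction followed by branch fusion and a classification head that outputs an image-text matching score. Module names, feature dimensions, and the pooling/fusion choices are illustrative assumptions for exposition only; they do not reproduce the released PERSVL implementation.

```python
# Hypothetical sketch of dual-branch cross-modal interaction + branch fusion,
# assuming token-level features are already produced by the visual/text encoders.
import torch
import torch.nn as nn


class DualBranchInteraction(nn.Module):
    """Cross-attend image tokens to text tokens and vice versa (assumed design)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Branch 1: image features enriched by attending to the text.
        img_enriched, _ = self.img2txt(img_tokens, txt_tokens, txt_tokens)
        # Branch 2: text features enriched by attending to the image.
        txt_enriched, _ = self.txt2img(txt_tokens, img_tokens, img_tokens)
        return img_enriched, txt_enriched


class MatchingHead(nn.Module):
    """Fuse the two branches and predict an image-text matching score."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.interaction = DualBranchInteraction(dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, img_tokens, txt_tokens):
        img_e, txt_e = self.interaction(img_tokens, txt_tokens)
        # Pool each branch, concatenate, and score with the classification head.
        fused = torch.cat([img_e.mean(dim=1), txt_e.mean(dim=1)], dim=-1)
        return self.classifier(fused).squeeze(-1)  # higher = better match


if __name__ == "__main__":
    img = torch.randn(2, 49, 256)   # e.g., 7x7 visual tokens from the visual encoder
    txt = torch.randn(2, 20, 256)   # e.g., 20 word tokens from the text encoder
    print(MatchingHead()(img, txt).shape)  # torch.Size([2])
```

In this reading, the two attention branches play the role of the DBMI module and the concatenation plus MLP stands in for the branch fusion module and classification head; the HFC and LPE components are not sketched here.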
Pages: 13
Related Papers
50 in total
  • [31] MULTI-SCALE INTERACTIVE TRANSFORMER FOR REMOTE SENSING CROSS-MODAL IMAGE-TEXT RETRIEVAL
    Wang, Yijing
    Ma, Jingjing
    Li, Mingteng
    Tang, Xu
    Han, Xiao
    Jiao, Licheng
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 839 - 842
  • [32] A novel approach for image retrieval in remote sensing using vision-language-based image caption generation
    Yadav, Prem Shanker
    Tyagi, Dinesh Kumar
    Vipparthi, Santosh Kumar
    Multimedia Tools and Applications, 2025, 84 (6) : 2985 - 3014
  • [33] SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model
    Zhan, Yang
    Xiong, Zhitong
    Yuan, Yuan
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2025, 221 : 64 - 77
  • [34] ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
    Chen, Yuxin
    Zhang, Zongyang
    Zhang, Ziqi
    Qi, Zhongang
    Yuan, Chunfeng
    Shan, Ying
    Li, Bing
    Hu, Weiming
    Qie, Xiaohu
    Wu, JianPing
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11018 - 11027
  • [35] Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval
    Zhou, Zihui
    Feng, Yong
    Qiu, Agen
    Duan, Guofan
    Zhou, Mingliang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 19194 - 19210
  • [36] Cross-Modal Remote Sensing Image-Text Retrieval via Context and Uncertainty-Aware Prompt
    Wang, Yijing
    Tang, Xu
    Ma, Jingjing
    Zhang, Xiangrong
    Liu, Fang
    Jiao, Licheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024,
  • [37] SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
    Wang, Zhecheng
    Prabha, Rajanie
    Huang, Tianyuan
    Wu, Jiajun
    Rajagopal, Ram
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5805 - 5813
  • [38] Vision-Language Models in Remote Sensing: Current progress and future trends
    Li, Xiang
    Wen, Congcong
    Hu, Yuan
    Yuan, Zhenghang
    Zhu, Xiao Xiang
    IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE, 2024, 12 (02) : 32 - 66
  • [39] Efficient text-image semantic search: A multi-modal vision-language approach for fashion retrieval
    Moro, Gianluca
    Salvatori, Stefano
    Frisoni, Giacomo
    NEUROCOMPUTING, 2023, 538
  • [40] Scene Graph based Fusion Network for Image-Text Retrieval
    Wang, Guoliang
    Shang, Yanlei
    Chen, Yong
    Zhen, Chaoqi
    Cheng, Dequan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 138 - 143