Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

Times Cited: 0
Authors
Tang, Xu [1 ]
Huang, Dabiao [1 ]
Ma, Jingjing [1 ]
Zhang, Xiangrong [1 ]
Liu, Fang [2 ]
Jiao, Licheng [1 ]
Affiliations
[1] Xidian Univ, Minist Educ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Minist Educ, Sch Comp Sci & Engn, Key Lab Intelligent Percept & Systems for High Dime, Nanjing 210094, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Transformers; Semantics; Training; Convolutional neural networks; Recurrent neural networks; Learning from prior experiences (LPEs); multiscale feature fusion; remote sensing image-text retrieval (RSITR); transformer; BIG DATA; FUSION;
DOI
10.1109/TGRS.2024.3464468
CLC Number
P3 [Geophysics]; P59 [Geochemistry];
Subject Classification Code
0708; 070902;
Abstract
Remote sensing (RS) image-text retrieval (RSITR) aims to retrieve relevant texts (RS images) based on the content of a given RS image (text). Existing methods typically employ convolutional neural networks (CNNs) and recurrent neural networks (RNNs) as encoders to learn visual and textual features for retrieval. Although feasible, this practice leaves the global information hidden in the different modalities underexploited. To mitigate this problem, transformers have been introduced. Nevertheless, the complexity of RS images presents challenges in directly applying transformer-based architectures to multimodal learning in RS scenes, particularly in visual feature extraction and cross-modal interaction. In addition, the textual captions are always simpler than the complex RS images, so the same semantic description can match different images. This typical false-negative (FN) sample problem increases the difficulty of RSITR tasks. To address the above limitations, we propose a new RSITR model named prior-experience-based RS vision-language (PERSVL). First, specific visual and text encoders are used to extract features from RS images and texts, and a high-level feature complement (HFC) module based on the self-attention mechanism (SAM) is developed for the visual encoder to fully explore the complex contents of RS images. Second, a dual-branch multimodal fusion encoder (DBMFE) is designed to complete the cross-modal learning. It comprises a dual-branch multimodal interaction (DBMI) module and a branch fusion module. DBMI fully explores the relationships between the modalities, enriching both visual and textual features, while the branch fusion module integrates the cross-modal features and uses a classification head to generate matching scores for retrieval. Finally, a learning from prior experiences (LPEs) module is designed to reduce the influence of FN samples by analyzing the historical data produced during model training. Experiments conducted on three popular datasets show that our PERSVL model achieves superior performance compared with previous methods. By integrating the advantages of natural language and RS images, PERSVL can serve various applications, such as environmental monitoring, disaster evaluation, and urban planning. Our source codes are available at: https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/PERSVL.
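As a rough illustration of the dual-branch cross-modal interaction and matching head described in the abstract, the following minimal PyTorch sketch pairs two cross-attention branches (text attending to image regions, and image attending to text tokens) with a simple fusion layer that outputs a matching score. The class name DualBranchInteraction, the mean-pooling strategy, and all dimensions are assumptions made for illustration only; they are not taken from the authors' released PERSVL code.

    import torch
    import torch.nn as nn

    class DualBranchInteraction(nn.Module):
        """Hypothetical sketch of a DBMI-style block followed by branch fusion.

        One branch attends from text queries to image keys/values, the other
        from image queries to text keys/values; the two enriched streams are
        pooled, concatenated, and passed to a head that predicts a matching
        score. Illustrative only, not the authors' implementation.
        """

        def __init__(self, dim: int = 256, num_heads: int = 4):
            super().__init__()
            # Cross-attention: text tokens attend to image regions.
            self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Cross-attention: image regions attend to text tokens.
            self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Branch fusion plus a classification head producing one score per pair.
            self.fusion = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
            )

        def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
            # img_feats: (B, N_regions, dim); txt_feats: (B, N_tokens, dim)
            txt_enriched, _ = self.txt2img(txt_feats, img_feats, img_feats)
            img_enriched, _ = self.img2txt(img_feats, txt_feats, txt_feats)
            # Mean-pool each enriched branch, then fuse for the matching score.
            fused = torch.cat(
                [img_enriched.mean(dim=1), txt_enriched.mean(dim=1)], dim=-1
            )
            return self.fusion(fused).squeeze(-1)  # (B,) matching scores

    if __name__ == "__main__":
        model = DualBranchInteraction()
        scores = model(torch.randn(2, 49, 256), torch.randn(2, 20, 256))
        print(scores.shape)  # torch.Size([2])

In practice the image-text pairs with the highest scores would be returned as retrieval results; the LPE idea in the paper would additionally down-weight pairs that the training history suggests are false negatives.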
Pages: 13
Related Papers
50 records in total
  • [1] GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval
    Hong, Weixiang
    Ji, Kaixiang
    Liu, Jiajia
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1379 - 1388
  • [2] A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval
    Pan, Jiancheng
    Ma, Qing
    Bai, Cong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 611 - 620
  • [3] Remote sensing image-text retrieval based on layout semantic joint representation
    Zhang R.
    Nie J.
    Song N.
    Zheng C.
    Wei Z.
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): : 671 - 683
  • [4] An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval
    He, Liu
    Liu, Shuyan
    An, Ran
    Zhuo, Yudong
    Tao, Jian
    MATHEMATICS, 2023, 11 (10)
  • [5] Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval
    Yao, Tao
    Peng, Shouyong
    Wang, Lili
    Li, Ying
    Sun, Yujuan
    APPLIED INTELLIGENCE, 2024, 54 (23) : 12230 - 12245
  • [6] Text-Guided Knowledge Transfer for Remote Sensing Image-Text Retrieval
    Liu, An-An
    Yang, Bo
    Li, Wenhui
    Song, Dan
    Sun, Zhengya
    Ren, Tongwei
    Wei, Zhiqiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
  • [7] A FAST AND ACCURATE METHOD FOR REMOTE SENSING IMAGE-TEXT RETRIEVAL BASED ON LARGE MODEL KNOWLEDGE DISTILLATION
    Liao, Yu
    Yang, Rui
    Xie, Tao
    Xing, Hantong
    Quan, Dou
    Wang, Shuang
    Hou, Biao
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 5077 - 5080
  • [8] FSVLM: A Vision-Language Model for Remote Sensing Farmland Segmentation
    Wu, Haiyang
    Du, Zhuofei
    Zhong, Dandan
    Wang, Yuze
    Tao, Chao
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [9] Practical Techniques for Vision-Language Segmentation Model in Remote Sensing
    Lin, Yuting
    Suzuki, Kumiko
    Sogo, Shinichiro
    MID-TERM SYMPOSIUM THE ROLE OF PHOTOGRAMMETRY FOR A SUSTAINABLE WORLD, VOL. 48-2, 2024, : 203 - 210
  • [10] Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image-Text Retrieval
    Yang, Rui
    Wang, Shuang
    Han, Yingping
    Li, Yuanheng
    Zhao, Dong
    Quan, Dou
    Guo, Yanhe
    Jiao, Licheng
    Yang, Zhi
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62