Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

Times Cited: 0
Authors
Tang, Xu [1 ]
Huang, Dabiao [1 ]
Ma, Jingjing [1 ]
Zhang, Xiangrong [1 ]
Liu, Fang [2 ]
Jiao, Licheng [1 ]
Affiliations
[1] Xidian Univ, Minist Educ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Minist Educ, Sch Comp Sci & Engn, Key Lab Intelligent Percept & Syst High Dimens Informat, Nanjing 210094, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Transformers; Semantics; Training; Convolutional neural networks; Recurrent neural networks; Learning from prior experiences (LPEs); multiscale feature fusion; remote sensing image-text retrieval (RSITR); transformer; BIG DATA; FUSION;
DOI
10.1109/TGRS.2024.3464468
CLC Classification
P3 [Geophysics]; P59 [Geochemistry];
Subject Classification Codes
0708 ; 070902 ;
Abstract
Remote sensing (RS) image-text retrieval (RSITR) aims to retrieve relevant texts (RS images) based on the content of a given RS image (text). Existing methods typically employ convolutional neural networks (CNNs) and recurrent neural networks (RNNs) as encoders to learn visual and textual features for retrieval. Although feasible, these encoders do not give the global information hidden in the different modalities the attention it deserves. To mitigate this problem, transformers have been introduced. Nevertheless, the complexity of RS images presents challenges in directly applying transformer-based architectures to multimodal learning in RS scenes, particularly in visual feature extraction and cross-modal interaction. In addition, textual captions are always simpler than the complex RS images they describe, so the same semantic description can apply to different images. This typical false-negative (FN) sample problem increases the difficulty of RSITR tasks. To address the above limitations, we propose a new RSITR model named prior-experience-based RS vision-language (PERSVL) model. First, specific visual and text encoders are used to extract features from RS images and texts, and a high-level feature complement (HFC) module built on the self-attention mechanism (SAM) is developed for the visual encoder to fully explore the complex contents of RS images. Second, a dual-branch multimodal fusion encoder (DBMFE) is designed to complete the cross-modal learning. It comprises a dual-branch multimodal interaction (DBMI) module and a branch fusion module. DBMI fully explores the relationships between different modalities, enriching the visual and textual features, while the branch fusion module integrates the cross-modal features and uses a classification head to generate matching scores for retrieval. Finally, a learning from prior experiences (LPEs) module is designed to reduce the influence of FN samples by analyzing the historical data produced during model training. Experiments are conducted on three popular datasets, and the positive results show that our PERSVL model achieves superior performance compared with previous methods. By integrating the advantages of natural language and RS images, PERSVL can serve various applications, such as environmental monitoring, disaster evaluation, and urban planning. Our source codes are available at: https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/PERSVL.
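The abstract describes the PERSVL pipeline only at a conceptual level. The PyTorch-style sketch below is purely illustrative and is not the authors' implementation: the class names, token dimensions, use of nn.MultiheadAttention, and mean-pooled branch fusion are assumptions intended to show how a self-attention feature complement (HFC), a dual-branch cross-modal interaction (DBMI), and a classification head producing matching scores could fit together. The LPE module, which down-weights FN samples using training history, is omitted here.

```python
# Illustrative sketch only; module internals and dimensions are assumptions.
import torch
import torch.nn as nn


class HFC(nn.Module):
    """High-level feature complement: self-attention over visual patch tokens (assumed design)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        out, _ = self.attn(patch_tokens, patch_tokens, patch_tokens)
        return self.norm(patch_tokens + out)  # residual connection


class DBMI(nn.Module):
    """Dual-branch multimodal interaction: cross-attention in both directions (assumed design)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis_enriched, _ = self.t2v(vis, txt, txt)  # visual tokens attend to text
        txt_enriched, _ = self.v2t(txt, vis, vis)  # text tokens attend to image
        return vis_enriched, txt_enriched


class PERSVLSketch(nn.Module):
    """Toy flow: token features -> HFC -> DBMI -> branch fusion -> matching score."""
    def __init__(self, dim=512):
        super().__init__()
        self.hfc = HFC(dim)
        self.dbmi = DBMI(dim)
        self.fusion_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, vis_tokens, txt_tokens):
        vis_tokens = self.hfc(vis_tokens)                 # complement visual features
        vis_f, txt_f = self.dbmi(vis_tokens, txt_tokens)  # cross-modal interaction
        fused = torch.cat([vis_f.mean(dim=1), txt_f.mean(dim=1)], dim=-1)
        return self.fusion_head(fused).squeeze(-1)        # image-text matching score


# Usage: pre-extracted token features (batch, tokens, dim) from the visual/text encoders.
scores = PERSVLSketch()(torch.randn(2, 49, 512), torch.randn(2, 20, 512))
print(scores.shape)  # torch.Size([2])
```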
Pages: 13