Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Cited by: 0
Authors
Lei, Sen [1]
Xiao, Xinyu [2]
Zhang, Tianlin [3]
Li, Heng-Chao [1]
Shi, Zhenwei [4]
Zhu, Qing [5]
Affiliations
[1] Southwest Jiaotong Univ, Sch Informat Sci & Technol, Chengdu 611756, Peoples R China
[2] Co Ant Grp, Hangzhou 688688, Peoples R China
[3] AVIC, Luoyang Inst Electroopt Equipment, Luoyang 471000, Peoples R China
[4] Beihang Univ, Image Proc Ctr, Sch Astronaut, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[5] Southwest Jiaotong Univ, Fac Geosci & Engn, Chengdu 611756, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Remote sensing; Image segmentation; Visualization; Feature extraction; Linguistics; Transformers; Electronic mail; Adaptation models; Object recognition; Grounding; Fine-grained image-text alignment; referring image segmentation; remote sensing images; CLASSIFICATION; NETWORK;
DOI
10.1109/TGRS.2024.3522293
Chinese Library Classification
P3 [Geophysics]; P59 [Geochemistry];
Subject classification codes
0708; 070902;
Abstract
Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify ground objects and assign pixelwise labels within the imagery. One of the key challenges for this task is to capture discriminative multimodal features via image-text alignment. However, existing RRSIS methods rely on a vanilla, coarse alignment, where the language expression is directly extracted to be fused with the visual features. In this article, we argue that a "fine-grained image-text alignment" can improve the extraction of multimodal information. To this end, we propose a new RRSIS method to fully exploit the visual and linguistic representations. Specifically, the original referring expression is regarded as context text, which is further decoupled into ground object and spatial position texts. The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts, obtaining a more discriminative multimodal representation. Meanwhile, to handle the various scales of ground objects in remote sensing, we introduce a text-aware multiscale enhancement module (TMEM) to adaptively perform cross-scale fusion and intersections. We evaluate the proposed method on two public referring remote sensing datasets, RefSegRS and RRSIS-D, where it obtains superior performance over several state-of-the-art methods. The code will be publicly available at https://github.com/Shaosifan/FIANet.
Pages: 11