Fuse and Calibrate: A Bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Times Cited: 0
Authors
Yan, Yichen [1 ,2 ]
He, Xingjian [1 ]
Chen, Sihan [2 ]
Lu, Shichen [3 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Referring Image Segmentation; Vision-Language Models; Fusion & Calibration;
DOI
10.1007/978-981-97-5612-4_27
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being the establishment of fine-grained text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features alone, to guide the multi-modal fusion process. However, this limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during decoding. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach in which both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key visual information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they capture the context of the input sentence. This bi-directional vision-language guided approach produces higher-quality multi-modal features for the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show that our approach outperforms state-of-the-art methods.
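The abstract describes the two-stage design only at a high level. The following is a minimal sketch of one way such a bi-directional scheme could be wired up, assuming cross-attention for the vision-guided fusion stage and a sentence-conditioned channel gate for the language-guided calibration stage; the module names (VisionGuidedFusion, LanguageGuidedCalibration, BiDirectionalFusion), feature shapes, and mean pooling over words are illustrative assumptions, not the published FCNet design.

```python
import torch
import torch.nn as nn


class VisionGuidedFusion(nn.Module):
    """Visual tokens query word features via cross-attention to form
    the initial multi-modal features (illustrative assumption)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened visual tokens; lang: (B, L, C) word features
        fused, _ = self.attn(query=vis, key=lang, value=lang)
        return self.norm(vis + fused)


class LanguageGuidedCalibration(nn.Module):
    """A pooled sentence embedding gates the fused features channel-wise so
    they stay consistent with the sentence context (illustrative assumption)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused, sent):
        # fused: (B, HW, C); sent: (B, C) pooled sentence embedding
        g = self.gate(sent).unsqueeze(1)   # (B, 1, C) channel-wise gate
        return self.norm(fused + g * self.proj(fused))


class BiDirectionalFusion(nn.Module):
    """Vision-guided fusion followed by language-guided calibration."""

    def __init__(self, dim):
        super().__init__()
        self.fuse = VisionGuidedFusion(dim)
        self.calibrate = LanguageGuidedCalibration(dim)

    def forward(self, vis, lang):
        sent = lang.mean(dim=1)            # simple mean pooling over words (assumption)
        return self.calibrate(self.fuse(vis, lang), sent)


if __name__ == "__main__":
    # Toy shapes: batch of 2, a 16x16 visual grid (256 tokens), a 12-word sentence, 256-d features.
    vis = torch.randn(2, 256, 256)
    lang = torch.randn(2, 12, 256)
    out = BiDirectionalFusion(256)(vis, lang)
    print(out.shape)  # torch.Size([2, 256, 256])
```

In this sketch the visual tokens first query the word features (vision guides the fusion), and the pooled sentence embedding then re-weights the fused features channel-wise (language calibrates the result), mirroring the two guiding stages named in the abstract.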
Pages: 313-324
Page Count: 12