Fuse and Calibrate: A Bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Cited by: 0
Authors
Yan, Yichen [1 ,2 ]
He, Xingjian [1 ]
Chen, Sihan [2 ]
Lu, Shichen [3 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Referring Image Segmentation; Vision-Language Models; Fusion & Calibration;
DOI
10.1007/978-981-97-5612-4_27
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring Image Segmentation (RIS) aims to segment the object described by a natural-language expression from an image; the main challenge is establishing text-to-pixel correlation. Previous methods typically rely on the features of a single modality, either vision or language, to guide the multi-modal fusion process. This limits the interaction between vision and language and leaves the decoding process without a fine-grained correlation between the language description and pixel-level details. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach in which both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct the initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module that further calibrates these multi-modal features so that they reflect the context of the input sentence. This bi-directional vision-language guided approach sends higher-quality multi-modal features to the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods.
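The two-stage idea in the abstract (fuse with vision as the guide, then calibrate with language as the guide) can be illustrated with a small, dependency-free sketch. This is not the FCNet implementation: the abstract does not specify the modules, so the use of plain scaled dot-product cross-attention with residual additions, and every name below (`cross_attention`, `fuse_and_calibrate`, `pixels`, `words`), are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention; `context` serves as both keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors (residual connection)."""
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def fuse_and_calibrate(pixels, words):
    # Stage 1 (ASSUMED form of "vision-guided fusion"):
    # pixel features act as queries and pull in language context.
    fused = add(pixels, cross_attention(pixels, words))
    # Stage 2 (ASSUMED form of "language-guided calibration"):
    # word features act as queries over the fused features, then the
    # fused features are re-weighted against that sentence-aware context.
    word_ctx = add(words, cross_attention(words, fused))
    return add(fused, cross_attention(fused, word_ctx))


# Toy input: 3 "pixel" vectors and 2 "word" vectors, all of dimension 2.
pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, -1.0]]
calibrated = fuse_and_calibrate(pixels, words)
```

The output keeps one vector per pixel, so in this sketch it could be passed to a mask decoder in place of the purely visual features. A real model would add learned projections, multiple heads, and normalization, all of which are omitted here.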
Pages: 313-324
Page count: 12