Fuse and Calibrate: A Bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Cited by: 0
Authors
Yan, Yichen [1, 2]
He, Xingjian [1]
Chen, Sihan [2]
Lu, Shichen [3]
Liu, Jing [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Referring Image Segmentation; Vision-Language Models; Fusion & Calibration;
DOI
10.1007/978-981-97-5612-4_27
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Referring Image Segmentation (RIS) aims to segment the object described by a natural-language expression from an image; the main challenge is establishing text-to-pixel correlation. Previous methods typically rely on features from a single modality, either vision or language, to guide multi-modal fusion. This limits the interaction between vision and language and leaves the decoder without fine-grained correlation between the language description and pixel-level details. In this paper, we introduce FCNet, a framework with bi-directional guided fusion in which both vision and language play guiding roles. Specifically, a vision-guided step first performs the initial multi-modal fusion, producing multi-modal features that focus on key visual information. A language-guided calibration module then calibrates these features so that they reflect the context of the input sentence. The resulting higher-quality multi-modal features are sent to the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods.
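The two-stage flow described in the abstract (vision-guided fusion followed by language-guided calibration) can be illustrated with a small sketch. The abstract does not specify the internals of either stage, so everything below is an assumption made for illustration: cross-attention is used for the fusion step, a sentence-level sigmoid gate is used for the calibration step, and the names `cross_attend` and `bidirectional_fusion` are hypothetical rather than taken from FCNet.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention with a residual connection.

    Each query vector is updated with an attention-weighted sum of the
    vectors in keys_values (used as both keys and values).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def bidirectional_fusion(pixels, words):
    """Illustrative two-stage fusion; NOT the published FCNet modules."""
    dim = len(pixels[0])
    # Stage 1 (assumed form of vision-guided fusion):
    # pixel features act as queries over the word features.
    fused = cross_attend(pixels, words)
    # Stage 2 (assumed form of language-guided calibration):
    # words gather context from the fused features, are pooled into one
    # sentence vector, and that vector gates every fused pixel feature.
    contextual_words = cross_attend(words, fused)
    sentence = [
        sum(w[d] for w in contextual_words) / len(contextual_words)
        for d in range(dim)
    ]
    calibrated = []
    for f in fused:
        logit = sum(a * b for a, b in zip(f, sentence)) / math.sqrt(dim)
        gate = 1.0 / (1.0 + math.exp(-logit))  # sigmoid gate in (0, 1)
        calibrated.append([gate * x for x in f])
    return calibrated


# Toy inputs: 3 pixel features and 2 word features, each of dimension 2.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0], [0.0, 1.0]]
features_for_decoder = bidirectional_fusion(pixels, words)
```

The output keeps one feature vector per pixel, so it could be passed to a segmentation decoder in place of the vision-only features. A real model would add learned projections, multiple heads, and multi-scale features, none of which are shown here.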
Pages: 313-324
Number of pages: 12
Related papers
50 in total
  • [41] Bi-Directional Image-to-Text Mapping for NLP-Based Schedule Generation and Computer Vision Progress Monitoring
    Nunez-Morales, Juan D.
    Jung, Yoonhwa
    Golparvar-Fard, Mani
    CONSTRUCTION RESEARCH CONGRESS 2024: ADVANCED TECHNOLOGIES, AUTOMATION, AND COMPUTER APPLICATIONS IN CONSTRUCTION, 2024 : 826 - 835
  • [42] Medical Image Segmentation Using Grey Wolf-Based U-Net with Bi-Directional Convolutional LSTM
    Tamilmani, G.
    Varma, CH. Phaneendra
    Brindha Devi, V.
    Ramesh Babu, G.
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (02)
  • [43] BIG-FG: A Bi-directional Interaction Graph Framework with Filter Gate Mechanism for Chinese Spoken Language Understanding
    Zhang, Wentao
    Zeng, Bi
    Wei, Pengfei
    Hu, Huiting
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT IV, 2023, 14257 : 304 - 315
  • [44] SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification
    Peng, Fang
    Yang, Xiaoshan
    Xiao, Linhui
    Wang, Yaowei
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3469 - 3480
  • [45] EEG-based image classification via a region-level stacked bi-directional deep learning framework
    Fares, Ahmed
    Zhong, Sheng-hua
    Jiang, Jianmin
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2019, 19 (Suppl 6)
  • [46] EEG-based image classification via a region-level stacked bi-directional deep learning framework
    Ahmed Fares
    Sheng-hua Zhong
    Jianmin Jiang
    BMC Medical Informatics and Decision Making, 19
  • [47] Cross-Modal Person Search: A Coarse-to-Fine Framework using Bi-directional Text-Image Matching
    Yu, Xiaojing
    Chen, Tianlong
    Yang, Yang
    Mugo, Michael
    Wang, Zhangyang
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019 : 1799 - 1804
  • [48] A Modified Bi-Directional Convolutional U-Net (BCDU-Net) Neural Network Approach for Lung CT Image Segmentation
    Vu, Tran Anh
    Van Kien, Phung
    Tram, Nguyen Ngoc
    Huy, Hoang Quang
    Huong, Pham Thi Viet
    JOURNAL OF BIOMIMETICS BIOMATERIALS AND BIOMEDICAL ENGINEERING, 2025, 67 : 9 - 20
  • [49] An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval
    He, Liu
    Liu, Shuyan
    An, Ran
    Zhuo, Yudong
    Tao, Jian
    MATHEMATICS, 2023, 11 (10)
  • [50] BMCS-Net: A Bi-directional multi-scale cascaded segmentation network based on transformer-guided feature Aggregation for medical images
    Li, Bicao
    Wang, Jing
    Wang, Bei
    Shao, Zhuhong
    Li, Wei
    Huang, Jie
    Li, Panpan
    Computers in Biology and Medicine, 2024, 180