Fuse and Calibrate: A Bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Cited by: 0

Authors:

Yan, Yichen [1,2]

He, Xingjian [1]

Chen, Sihan [2]

Lu, Shichen [3]

Liu, Jing [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China

[3] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024 | 2024 / Vol. 14872

Funding:

National Natural Science Foundation of China;

Keywords:

Referring Image Segmentation; Vision-Language Models; Fusion & Calibration;

DOI:

10.1007/978-981-97-5612-4_27

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being a text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence. This bi-directional vision-language guided approach produces higher-quality multi-modal features sent to the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods.
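The abstract names two stages, a vision-guided fusion followed by a language-guided calibration, without giving their internals. The following is only a minimal sketch of that two-stage idea under assumptions of our own, not FCNet itself: the function names, the use of plain scaled dot-product attention without learned projections, and the sigmoid gate driven by a mean-pooled sentence vector are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention; no learned projections (assumption)."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def vision_guided_fusion(pixels, words):
    """Hypothetical stage 1: each pixel feature queries the word features,
    and the attended language context is added back as a residual."""
    attended = cross_attention(pixels, words, words)
    return [[p + a for p, a in zip(pix, att)]
            for pix, att in zip(pixels, attended)]


def language_guided_calibration(fused, words):
    """Hypothetical stage 2: a sentence vector (mean of word features)
    produces a sigmoid gate in (0, 1) that rescales each fused feature."""
    d = len(words[0])
    sentence = [sum(w[j] for w in words) / len(words) for j in range(d)]
    calibrated = []
    for f in fused:
        gate = 1.0 / (1.0 + math.exp(-dot(f, sentence) / math.sqrt(d)))
        calibrated.append([gate * x for x in f])
    return calibrated


# Toy inputs (made up): three "pixel" features and two "word" features.
pixels = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
words = [[1.0, 0.0], [0.5, 0.5]]

fused = vision_guided_fusion(pixels, words)
calibrated = language_guided_calibration(fused, words)
```

In this toy version the gate only shrinks features that disagree with the sentence vector; a real model would use learned projections, multi-scale visual features, and a decoder that turns the calibrated features into a mask.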

Pages: 313-324

Number of pages: 12
