Mask Grounding for Referring Image Segmentation

Cited by: 0

Authors:

Chng, Yong Xien [1, 2]

Zheng, Henry [1]

Han, Yizeng [1]

Qiu, Xuchong [2]

Huang, Gao [1]

Affiliations:

[1] Tsinghua Univ, Dept Automat, BNRist, Beijing, Peoples R China

[2] Bosch Corp Res, Renningen, Germany

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China

Keywords:

DOI:

10.1109/CVPR52733.2024.02509

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. Combining all these techniques, our approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
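The abstract describes Mask Grounding only at a high level: some textual tokens are masked, and the model is trained to recover them using the matching visual content. The sketch below illustrates just the generic masked-token part of such an objective, assuming a standard cross-entropy loss over masked positions. The names `mask_tokens` and `masked_token_loss`, the `[MASK]` placeholder, and the masking probability are hypothetical and do not come from the paper; the predictor that would produce the logits from visual and language features is not shown.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK with probability mask_prob.

    Returns the masked sequence and the indices that were masked.
    """
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_token_loss(logits_per_position, target_ids, positions):
    """Mean cross-entropy over the masked positions only.

    logits_per_position[i] is assumed to be the vocabulary logits that some
    predictor (seeing the masked expression plus visual features) emits for
    token i; target_ids[i] is the vocabulary id of the original token.
    """
    if not positions:
        return 0.0
    losses = []
    for i in positions:
        probs = softmax(logits_per_position[i])
        losses.append(-math.log(probs[target_ids[i]]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    expression = ["the", "man", "in", "the", "red", "shirt"]
    masked, positions = mask_tokens(expression, 0.3, rng)
    print(masked, positions)
```

Because the loss is computed only at masked positions, unmasked tokens contribute no gradient; how the paper conditions the prediction on visual objects is not specified in this record and is left out of the sketch.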

Citation:

Pages: 26563-26573

Page count: 11
