Mask Grounding for Referring Image Segmentation

Cited by: 0

Authors:

Chng, Yong Xien ^{[1,2]}

Zheng, Henry ^{[1]}

Han, Yizeng ^{[1]}

Qiu, Xuchong ^{[2]}

Huang, Gao ^{[1]}

Affiliations:

[1] Tsinghua Univ, Dept Automat, BNRist, Beijing, Peoples R China

[2] Bosch Corp Res, Renningen, Germany

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China

Keywords:

DOI:

10.1109/CVPR52733.2024.02509

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. Combining all these techniques, our approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.

Citation:

Pages: 26563-26573

Page count: 11

50 entries in total

[41] Referring Image Segmentation via Language-Driven Attention
Chen, Ding-Jie
Hsieh, He-Yen
Liu, Tyng-Luh
2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021), 2021: 13997-14003
[42] Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation
Yan, Yichen
He, Xingjian
Chen, Sihan
Liu, Jing
PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024: 451-459
[43] Beyond One-to-One: Rethinking the Referring Image Segmentation
Hu, Yutao
Wang, Qixiong
Shao, Wenqi
Xie, Enze
Li, Zhenguo
Han, Jungong
Luo, Ping
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023: 4044-4054
[44] Vision-Aware Language Reasoning for Referring Image Segmentation
Xu, Fayou
Luo, Bing
Zhang, Chao
Xu, Li
Pu, Mingxing
Li, Bo
NEURAL PROCESSING LETTERS, 2023, 55 (08): 11313-11331
[45] Global Selection and Local Attention Network for Referring Image Segmentation
Ding, Haixin
Zhang, Shengchuan
Cao, Liujuan
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431: 284-295
[46] Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation
Feng, Guang
Hu, Zhiwei
Zhang, Lihe
Sun, Jiayu
Lu, Huchuan
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (05): 2246-2258
[47] Global and Local Interactive Perception Network for Referring Image Segmentation
Liu, Jing
Tan, Hongchen
Hu, Yongli
Sun, Yanfeng
Wang, Huasheng
Yin, Baocai
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 35 (12): 1-14
[48] Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Jain, Kanishk
Gandhi, Vineet
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022: 3427-3435
[49] Vision-Aware Language Reasoning for Referring Image Segmentation
Fayou Xu
Bing Luo
Chao Zhang
Li Xu
Mingxing Pu
Bo Li
Neural Processing Letters, 2023, 55: 11313-11331
[50] See-Through-Text Grouping for Referring Image Segmentation
Chen, Ding-Jie
Jia, Songhao
Lo, Yi-Chen
Chen, Hwann-Tzong
Liu, Tyng-Luh
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 7453-7462
