Mask Grounding for Referring Image Segmentation

Cited by: 0
Authors
Chng, Yong Xien [1, 2]
Zheng, Henry [1]
Han, Yizeng [1]
Qiu, Xuchong [2]
Huang, Gao [1]
Affiliations
[1] Tsinghua University, Department of Automation, BNRist, Beijing, People's Republic of China
[2] Bosch Corporate Research, Renningen, Germany
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
DOI
10.1109/CVPR52733.2024.02509
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word levels. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. Combining all of these techniques yields MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
Pages: 26563-26573
Number of pages: 11