Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Cited: 0
Authors
Jiao, Siyu [1 ,2 ,3 ,5 ]
Wei, Yunchao [1 ,2 ,3 ]
Wang, Yaowei [3 ]
Zhao, Yao [1 ,2 ,3 ]
Shi, Humphrey [4 ,5 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Beijing Key Lab Adv Informat Sci & Network, Beijing, Peoples R China
[4] Georgia Inst Technol, Atlanta, GA USA
[5] Picsart AI Res PAIR, Miami, FL USA
Funding
National Key Research and Development Program of China;
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals, and mainly stems from the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals without sacrificing transferability. In this way, mask-aware representations can be easily learned, making the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on popular zero-shot benchmarks. With MAFT, the performance of state-of-the-art methods is improved by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. Code is available at github.com/jiaosiyu1999/MAFT.git.
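To make the two objectives named in the abstract concrete, the sketch below gives a minimal PyTorch rendition of a mask-aware loss (proposal classification scores supervised by proposal-to-ground-truth IoU) and a self-distillation loss (keeping the fine-tuned encoder close to a frozen CLIP copy). The argument names, tensor shapes, and exact loss forms are illustrative assumptions, not the authors' reference implementation:

    # A minimal sketch of the two fine-tuning objectives described in the
    # abstract. Shapes, names, and loss forms are assumptions for
    # illustration only, not the authors' reference implementation.
    import torch
    import torch.nn.functional as F

    def mask_aware_loss(proposal_scores: torch.Tensor,
                        proposal_gt_ious: torch.Tensor) -> torch.Tensor:
        # proposal_scores:  (N, C) class scores for N mask proposals from
        #                   the fine-tuned IP-CLIP Encoder.
        # proposal_gt_ious: (N, C) IoU of each proposal with the
        #                   ground-truth mask of each seen class, used as a
        #                   soft target so high-quality proposals score high.
        return F.smooth_l1_loss(proposal_scores.sigmoid(), proposal_gt_ious)

    def self_distillation_loss(student_scores: torch.Tensor,
                               frozen_scores: torch.Tensor,
                               tau: float = 1.0) -> torch.Tensor:
        # Penalise divergence between the fine-tuned encoder's predictions
        # and those of a frozen CLIP, to preserve zero-shot transferability.
        log_p = F.log_softmax(student_scores / tau, dim=-1)
        q = F.softmax(frozen_scores.detach() / tau, dim=-1)
        return F.kl_div(log_p, q, reduction="batchmean") * tau ** 2

    # Hypothetical total fine-tuning objective (lambda_sd is a weight):
    # loss = mask_aware_loss(s, ious) + lambda_sd * self_distillation_loss(s, t)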
Pages: 23
Related Papers
50 records in total
  • [1] Uncertainty-Aware Learning for Zero-Shot Semantic Segmentation
    Hu, Ping
    Sclaroff, Stan
    Saenko, Kate
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [2] Application of CLIP for efficient zero-shot learning
    Yang, Hairui
    Wang, Ning
    Li, Haojie
    Wang, Lei
    Wang, Zhihui
    MULTIMEDIA SYSTEMS, 2024, 30 (04)
  • [3] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
    Zhou, Ziqin
    Lei, Yinjie
    Zhang, Bowen
    Liu, Lingqiao
    Liu, Yifan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11175 - 11185
  • [4] ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation
    Li, Shengze
    Cao, Jianjian
    Ye, Peng
    Ding, Yuhan
    Tu, Chongjun
    Chen, Tao
    NEUROCOMPUTING, 2025, 618
  • [5] Learning Latent Representations for Generalized Zero-Shot Learning
    Ye, Yalan
    Pan, Tongjie
    Luo, Tonghoujun
    Li, Jingjing
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2252 - 2265
  • [6] Bidirectional Mask Selection for Zero-Shot Referring Image Segmentation
    Li, Wenhui
    Pang, Chao
    Nie, Weizhi
    Tian, Hongshuo
    Liu, An-An
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 911 - 921
  • [7] Online Zero-Shot Classification with CLIP
    Qian, Qi
    Hu, Juhua
    COMPUTER VISION - ECCV 2024, PT LXXVII, 2024, 15135 : 462 - 477
  • [8] Delving into Shape-aware Zero-shot Semantic Segmentation
    Liu, Xinyu
    Tian, Beiwen
    Wang, Zhen
    Wang, Rui
    Sheng, Kehua
    Zhang, Bo
    Zhao, Hao
    Zhou, Guyue
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2999 - 3009
  • [9] Learning Invariant Visual Representations for Compositional Zero-Shot Learning
    Zhang, Tian
    Liang, Kongming
    Du, Ruoyi
    Sun, Xian
    Ma, Zhanyu
    Guo, Jun
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 339 - 355
  • [10] A meaningful learning method for zero-shot semantic segmentation
    Liu, Xianglong
    Bai, Shihao
    An, Shan
    Wang, Shuo
    Liu, Wei
    Zhao, Xiaowei
    Ma, Yuqing
    SCIENCE CHINA-INFORMATION SCIENCES, 2023, 66 (11)