Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Cited by: 0
Authors
Jiao, Siyu [1 ,2 ,3 ,5 ]
Wei, Yunchao [1 ,2 ,3 ]
Wang, Yaowei [3 ]
Zhao, Yao [1 ,2 ,3 ]
Shi, Humphrey [4 ,5 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Beijing Key Lab Adv Informat Sci & Network, Beijing, Peoples R China
[4] Georgia Inst Technol, Atlanta, GA USA
[5] Picsart AI Res PAIR, Miami, FL USA
Funding
National Key R&D Program of China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals, and it mainly stems from the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals without sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is improved by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. Code is available at github.com/jiaosiyu1999/MAFT.git.
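To make the two losses concrete, here is a minimal conceptual sketch of what a mask-aware objective and a self-distillation objective could look like. This is NOT the authors' actual MAFT implementation (see the code link above for that); the function names, the use of max-class confidence as a proxy for mask quality, and the plain MSE/KL forms are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class dimension.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_aware_loss(logits, ious, tau=1.0):
    """Illustrative mask-aware loss: push each proposal's classification
    confidence to track its mask quality (IoU with the ground-truth mask),
    so that high-quality proposals stand out and low-quality ones are
    suppressed. logits: (N, C) class scores for N proposals; ious: (N,)."""
    probs = softmax(logits / tau, axis=-1)
    conf = probs.max(axis=-1)  # per-proposal confidence (assumed proxy)
    return float(np.mean((conf - ious) ** 2))

def self_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Illustrative self-distillation loss: KL divergence from the frozen
    CLIP's predictions (teacher) to the fine-tuned encoder's predictions
    (student), intended to preserve zero-shot transferability."""
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy usage: two proposals, two classes.
logits = np.array([[2.0, 0.1],   # confident proposal with good mask
                   [0.1, 2.0]])  # confident proposal with poor mask
ious = np.array([0.9, 0.2])
print(mask_aware_loss(logits, ious))           # penalises the mismatch
print(self_distillation_loss(logits, logits))  # zero when predictions match
```

The intuition carried by the sketch: the first term forces per-proposal predictions to depend on the mask itself (addressing the insensitivity described above), while the second anchors the fine-tuned encoder to the frozen CLIP so the zero-shot behaviour is not destroyed.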
Pages: 23