Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Cited by: 0
Authors
Jiao, Siyu [1 ,2 ,3 ,5 ]
Wei, Yunchao [1 ,2 ,3 ]
Wang, Yaowei [3 ]
Zhao, Yao [1 ,2 ,3 ]
Shi, Humphrey [4 ,5 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Beijing Key Lab Adv Informat Sci & Network, Beijing, Peoples R China
[4] Georgia Inst Technol, Atlanta, GA USA
[5] Picsart AI Res PAIR, Miami, FL USA
Funding
National Key Research and Development Program of China;
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals, and mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle an image and an arbitrary number of mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring that CLIP is responsive to different mask proposals without sacrificing transferability. In this way, mask-aware representations can be easily learned, making the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on popular zero-shot benchmarks. With MAFT, the performance of state-of-the-art methods improves by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. Code is available at github.com/jiaosiyu1999/MAFT.git.
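
The abstract names the two fine-tuning objectives but not their form. Below is a minimal PyTorch sketch of how such objectives could look: the tensor shapes, the smooth-L1 and KL-divergence loss choices, and all function names are illustrative assumptions made for this record, not the authors' released implementation (see the linked repository for the latter).

import torch
import torch.nn.functional as F

def mask_aware_loss(proposal_logits, proposal_ious):
    """Encourage proposal-class scores to track mask quality (assumed form).

    proposal_logits: (N, C) class scores for N mask proposals from the
                     fine-tuned IP-CLIP Encoder.
    proposal_ious:   (N, C) IoU of each proposal with each class's
                     ground-truth mask (0 where the class is absent).
    """
    # Normalize scores to [0, 1] so they are comparable to IoU targets;
    # a high-quality mask should then receive a high class score.
    scores = proposal_logits.softmax(dim=-1)
    return F.smooth_l1_loss(scores, proposal_ious)

def self_distillation_loss(student_logits, frozen_logits, temperature=1.0):
    """Keep the fine-tuned encoder close to frozen CLIP so zero-shot
    transferability to unseen classes is not sacrificed (assumed form)."""
    t = temperature
    teacher = (frozen_logits / t).softmax(dim=-1)        # frozen CLIP, no grad
    student = (student_logits / t).log_softmax(dim=-1)   # fine-tuned IP-CLIP
    return F.kl_div(student, teacher, reduction="batchmean") * (t * t)

if __name__ == "__main__":
    N, C = 8, 5                        # 8 mask proposals, 5 candidate classes
    student = torch.randn(N, C, requires_grad=True)
    frozen = torch.randn(N, C)         # outputs of the original, frozen CLIP
    ious = torch.rand(N, C)            # toy proposal-vs-ground-truth IoUs
    loss = mask_aware_loss(student, ious) + self_distillation_loss(student, frozen)
    loss.backward()
    print(f"total fine-tuning loss: {loss.item():.4f}")

Under these assumptions, the mask-aware term makes scores rise and fall with mask quality, while the self-distillation term anchors the fine-tuned encoder to frozen CLIP, which the abstract credits with preserving zero-shot transferability.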
Pages: 23