CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Cited by: 15
Authors
Jiang, Ruixiang [1 ]
Liu, Lingbo [1 ]
Chen, Changwen [1 ]
Affiliations
[1] Hong Kong Polytech Univ, HKSAR, Hong Kong, Peoples R China
Keywords
class-agnostic object counting; CLIP; zero-shot; text-guided
DOI
10.1145/3581783.3611789
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count.
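The patch-text contrastive loss described in the abstract aligns patch-level visual embeddings with the text embedding of the class prompt. The following is a minimal PyTorch sketch of that idea under stated assumptions, not the authors' exact formulation: it assumes (B, N, D) patch embeddings from the image encoder, one (B, D) prompt embedding per image, and a (B, N) patch-pooled ground-truth density map whose nonzero entries mark object patches; the function name, threshold `thresh`, and temperature `tau` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_emb, text_emb, density, tau=0.07, thresh=0.0):
    """Hypothetical InfoNCE-style patch-text alignment loss (sketch).

    patch_emb: (B, N, D) patch-level visual embeddings
    text_emb:  (B, D)    text embedding of the class prompt
    density:   (B, N)    patch-pooled ground-truth density map
    Patches with density above `thresh` are treated as positives for
    the prompt; all remaining patches act as in-image negatives.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every patch and the prompt: (B, N)
    sim = torch.einsum("bnd,bd->bn", patch_emb, text_emb) / tau
    pos_mask = (density > thresh).float()
    # Log-probability each patch receives under a softmax over all patches
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Maximize the probability mass assigned to object (positive) patches
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

# Usage sketch with random tensors: 2 images, 196 patches, 512-dim embeddings
loss = patch_text_contrastive_loss(
    torch.randn(2, 196, 512), torch.randn(2, 512), torch.rand(2, 196)
)
```

Pulling object patches toward the prompt while pushing the remaining patches away is what makes the patch-level representations informative for dense prediction; per the abstract, the hierarchical patch-text interaction module then propagates this text-aligned semantic information across the different resolution levels of the visual features.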
Pages: 4535-4545
Page count: 11
Related Papers (50 total)
  • [1] Zero-Shot Text-Guided Object Generation with Dream Fields
    Jain, Ajay
    Mildenhall, Ben
    Barron, Jonathan T.
    Abbeel, Pieter
    Poole, Ben
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 857 - 866
  • [2] CLIPCAM: A Simple Baseline for Zero-Shot Text-Guided Object and Action Localization
    Hsia, Hsuan-An
    Lin, Che-Hsien
    Kung, Bo-Han
    Chen, Jhao-Ting
    Tan, Daniel Stanley
    Chen, Jun-Cheng
    Hua, Kai-Lung
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4453 - 4457
  • [3] Language-Guided Zero-Shot Object Counting
    Wang, Mingjie
    Yuan, Song
    Li, Zhuohang
    Zhu, Longlong
    Buys, Eric
    Gong, Minglun
    2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS, ICMEW 2024, 2024,
  • [4] Zero-Shot Object Counting
    Xu, Jingyi
    Le, Hieu
    Nguyen, Vu
    Ranjan, Viresh
    Samaras, Dimitris
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15548 - 15557
  • [5] Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer
    Yang, Serin
    Hwang, Hyunmin
    Ye, Jong Chul
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22816 - 22825
  • [6] Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
    Yang, Shuai
    Zhou, Yifan
    Liu, Ziwei
    Loy, Chen Change
    PROCEEDINGS OF THE SIGGRAPH ASIA 2023 CONFERENCE PAPERS, 2023,
  • [7] CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
    Sanghi, Aditya
    Chu, Hang
    Lambourne, Joseph G.
    Wang, Ye
    Cheng, Chin-Yi
    Fumero, Marco
    Malekshan, Kamal Rahimi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18582 - 18592
  • [8] Zero-Shot Object Counting with Good Exemplars
    Zhu, Huilin
    Yuan, Jingling
    Yang, Zhengwei
    Guo, Yu
    Wang, Zheng
    Zhong, Xian
    He, Shengfeng
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 368 - 385
  • [9] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
    Zhou, Ziqin
    Lei, Yinjie
    Zhao, Bowen
    Liu, Lingqiao
    Liu, Yifan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11175 - 11185
  • [10] Online Zero-Shot Classification with CLIP
    Qian, Qi
    Hu, Juhua
    COMPUTER VISION - ECCV 2024, PT LXXVII, 2024, 15135 : 462 - 477