CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Cited by: 15
Authors
Jiang, Ruixiang [1 ]
Liu, Lingbo [1 ]
Chen, Changwen [1 ]
Affiliations
[1] Hong Kong Polytech Univ, HKSAR, Hong Kong, Peoples R China
Keywords
class-agnostic object counting; CLIP; zero-shot; text-guided;
DOI
10.1145/3581783.3611789
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count.
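The patch-text contrastive loss mentioned in the abstract can be sketched at a high level: patches containing the queried object are pulled toward the text embedding, while background patches are pushed away. The helper below is an illustrative reconstruction only, not the paper's actual formulation; the function name, the binary object mask, and the sigmoid/BCE form of the objective are assumptions made for this sketch.

```python
import numpy as np

def patch_text_contrastive_loss(patch_feats, text_feat, obj_mask, tau=0.07):
    """Illustrative patch-text contrastive loss (sketch, not the paper's exact loss).

    patch_feats: (N, D) patch-level visual embeddings
    text_feat:   (D,)   text embedding of the class prompt
    obj_mask:    (N,)   1 for patches covering the object, 0 for background
    tau:         temperature scaling the cosine similarities
    """
    # Cosine similarity between each patch embedding and the text embedding.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    logits = (p @ t) / tau                      # shape (N,)

    # Binary cross-entropy on sigmoid similarities: object patches should
    # score high against the text, background patches low.
    prob = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-8
    loss = -(obj_mask * np.log(prob + eps)
             + (1 - obj_mask) * np.log(1 - prob + eps)).mean()
    return loss
```

Under this sketch, a correct mask (object patches aligned with the text direction) yields a much smaller loss than a flipped mask, which is the behavior a loss of this kind is meant to enforce.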
Pages: 4535-4545
Number of pages: 11