CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Cited by: 15
Authors
Jiang, Ruixiang [1]
Liu, Lingbo [1]
Chen, Changwen [1]
Affiliations
[1] Hong Kong Polytech Univ, HKSAR, Hong Kong, Peoples R China
Keywords
class-agnostic object counting; CLIP; zero-shot; text-guided
DOI
10.1145/3581783.3611789
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count.
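The abstract describes counting by estimating density maps. As a minimal illustration of that general idea (not the CLIP-Count pipeline itself), the sketch below builds a density map from point annotations and recovers the count by summing it. The function names, kernel size, and sigma are hypothetical choices made for this example; none of them come from the paper or its repository.

```python
import numpy as np


def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def density_map(points, height: int, width: int,
                size: int = 15, sigma: float = 3.0) -> np.ndarray:
    """Place one unit-mass Gaussian per annotated point (row, col).

    Assumption for this sketch: the map is returned with a border of
    size // 2 pixels on every side, so Gaussians near the image edge are
    not truncated and the total mass equals the number of points.
    """
    pad = size // 2
    canvas = np.zeros((height + 2 * pad, width + 2 * pad))
    kernel = gaussian_kernel(size, sigma)
    for row, col in points:
        # In padded coordinates the kernel is centered at (row + pad, col + pad).
        canvas[row:row + size, col:col + size] += kernel
    return canvas


# Three hypothetical object centers in a 32 x 48 image.
points = [(0, 0), (10, 20), (31, 47)]
density = density_map(points, height=32, width=48)

# The predicted count is the integral (sum) of the density map.
estimated_count = float(density.sum())
```

In a learned counter, a network regresses a map like `density` from the image (and, for text-guided methods, from the prompt), and the count is read off the same way, by summing the predicted map.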
Pages: 4535-4545
Page count: 11