Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited by: 4
Authors
Chen, Jinyong [1 ]
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhai, Wenzhe [1 ]
Jeon, Gwanggil [3 ]
Camacho, David [4 ]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot;
DOI
10.1016/j.inffus.2024.102537
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds or vehicles. The recent emergence of multi-modal foundational models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. First, the CLIP model lacks sensitivity to location information: it generally captures global content rather than the fine-grained locations of objects, so adapting the CLIP model directly is suboptimal. Second, these models generally freeze the pre-trained vision and language encoders while neglecting the potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. The DSPI leverages the spatial-awareness ability of a large-scale pre-trained object grounding model, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature space across different modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
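The abstract's pipeline can be caricatured in a few lines: a frozen text embedding is mapped by a small adapter into an object query, every spatial location of a grounding-model feature map is scored against that query, and the per-location responses are summed as a count estimate. The sketch below is purely illustrative; the dimensions, the random stand-in features, and the names `meta_adapter` and `fuse_and_count` are assumptions, not the paper's actual implementation (which uses real CLIP and Grounding DINO encoders and a learned density head).

```python
import numpy as np

D, HW = 64, 16 * 16                # embedding dim; flattened feature map size
rng = np.random.default_rng(0)

text_emb = rng.standard_normal(D)             # stand-in for a frozen text embedding
spatial_prior = rng.standard_normal((HW, D))  # stand-in for grounding-model features
W_adapter = rng.standard_normal((D, D)) / np.sqrt(D)

def meta_adapter(t):
    """Map the text embedding into an object query; a single linear
    projection stands in for the paper's meta adapter."""
    return W_adapter @ t

def fuse_and_count(prior, query):
    """Score each spatial location against the object query (cross-modality
    matching) and sum the per-location responses as a crude count proxy."""
    logits = prior @ query / np.sqrt(D)        # (HW,) similarity per location
    density = 1.0 / (1.0 + np.exp(-logits))    # squash to a (0, 1) response map
    return density, density.sum()

density, count = fuse_and_count(spatial_prior, meta_adapter(text_emb))
print(density.shape)
```

In the real model the response map is a predicted density map and the count is its integral; here the sigmoid responses merely show where the matching signal would live.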
Pages: 12
Related Papers
50 records in total
  • [1] Infrared colorization with cross-modality zero-shot learning
    Wei, Chiheng
    Chen, Huawei
    Bai, Lianfa
    Han, Jing
    Chen, Xiaoyu
    NEUROCOMPUTING, 2024, 579
  • [2] Zero-shot learning with regularized cross-modality ranking
    Yu, Yunlong
    Ji, Zhong
    Guo, Jichang
    Pang, Yanwei
    NEUROCOMPUTING, 2017, 259 : 14 - 20
  • [3] Zero-Shot Object Counting
    Xu, Jingyi
    Le, Hieu
    Nguyen, Vu
    Ranjan, Viresh
    Samaras, Dimitris
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15548 - 15557
  • [4] Zero-Shot Object Counting With Vision-Language Prior Guidance Network
    Zhai, Wenzhe
    Xing, Xianglei
    Gao, Mingliang
    Li, Qilei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (03) : 2487 - 2498
  • [5] Zero-Shot Object Counting with Good Exemplars
    Zhu, Huilin
    Yuan, Jingling
    Yang, Zhengwei
    Guo, Yu
    Wang, Zheng
    Zhong, Xian
    He, Shengfeng
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 368 - 385
  • [6] Mutual Information Guided Diffusion for Zero-Shot Cross-Modality Medical Image Translation
    Wang, Zihao
    Yang, Yingyu
    Chen, Yuzhou
    Yuan, Tingting
    Sermesant, Maxime
    Delingette, Herve
    Wu, Ona
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2024, 43 (08) : 2825 - 2838
  • [7] Language-guided zero-shot object counting
    Wang, Mingjie
    Yuan, Song
    Li, Zhuohang
    Zhu, Longlong
    Buys, Eric
    Gong, Minglun
    2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS, ICMEW 2024, 2024,
  • [8] CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
    Jiang, Ruixiang
    Liu, Lingbo
    Chen, Changwen
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4535 - 4545
  • [9] Parameter-Free Latent Space Transformer for Zero-Shot Bidirectional Cross-modality Liver Segmentation
    Li, Yang
    Zou, Beiji
    Dai, Yulan
    Zhu, Chengzhang
    Yang, Fan
    Li, Xin
    Bai, Harrison X.
    Jiao, Zhicheng
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT IV, 2022, 13434 : 619 - 628
  • [10] COMe-SEE: Cross-modality Semantic Embedding Ensemble for Generalized Zero-Shot Diagnosis of Chest Radiographs
    Paul, Angshuman
    Shen, Thomas C.
    Balachandar, Niranjan
    Tang, Yuxing
    Peng, Yifan
    Lu, Zhiyong
    Summers, Ronald M.
    INTERPRETABLE AND ANNOTATION-EFFICIENT LEARNING FOR MEDICAL IMAGE COMPUTING, IMIMIC 2020, MIL3ID 2020, LABELS 2020, 2020, 12446 : 103 - 111