Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited: 4
Authors
Chen, Jinyong [1 ]
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhai, Wenzhe [1 ]
Jeon, Gwanggil [3 ]
Camacho, David [4 ]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot;
DOI
10.1016/j.inffus.2024.102537
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds and vehicles. The recent emergence of multi-modal foundation models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. First, the CLIP model lacks sensitivity to location information: it generally attends to global content rather than the fine-grained locations of objects, so adapting the CLIP model directly is suboptimal. Second, these models generally freeze the pre-trained vision and language encoders while neglecting potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. The DSPI leverages the spatial-awareness ability of large-scale pre-trained object grounding models, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature space across different modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
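The abstract describes the two modules only at a high level. The following is a minimal, hypothetical PyTorch sketch of that idea: a meta adapter that turns a frozen text embedding into an object query, and a head that fuses a Grounding-DINO-style spatial prior with query-modulated image features to regress a density map whose sum is the count. All module names, tensor shapes, and layer choices here are illustrative assumptions, not the authors' released implementation (see the GitHub link above for the actual code).

```python
# Hypothetical sketch of the DSPI idea from the abstract. Shapes, layers,
# and names are assumptions for illustration only.
import torch
import torch.nn as nn

class MetaAdapter(nn.Module):
    """Projects a frozen text embedding into an object query that acts as a
    cross-modality matching instruction (hypothetical design)."""
    def __init__(self, text_dim=512, query_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, query_dim),
            nn.ReLU(inplace=True),
            nn.Linear(query_dim, query_dim),
        )

    def forward(self, text_emb):       # (B, text_dim)
        return self.proj(text_emb)     # (B, query_dim)

class DSPISketch(nn.Module):
    """Toy counting head: correlate the object query with image features,
    concatenate a spatial prior map, and regress a density map."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.adapter = MetaAdapter(query_dim=feat_dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, img_feat, spatial_prior, text_emb):
        # img_feat: (B, C, H, W) features from a frozen CLIP-like encoder
        # spatial_prior: (B, 1, H, W), e.g. a box heatmap from Grounding DINO
        # text_emb: (B, 512) frozen text-encoder embedding of the class name
        query = self.adapter(text_emb)                           # (B, C)
        sim = (img_feat * query[:, :, None, None]).sum(1, keepdim=True)
        fused = torch.cat([img_feat * sim.sigmoid(), spatial_prior], dim=1)
        density = self.decoder(fused).relu()                     # (B, 1, H, W)
        return density.sum(dim=(1, 2, 3))                        # counts (B,)

# Smoke test with random tensors.
model = DSPISketch()
counts = model(torch.randn(2, 256, 32, 32), torch.rand(2, 1, 32, 32),
               torch.randn(2, 512))
print(counts.shape)  # torch.Size([2])
```

In the paper's setting, img_feat and text_emb would presumably come from frozen pre-trained encoders and spatial_prior from Grounding DINO detections for the query class, with the density map supervised against point annotations; those training details are not specified in the abstract.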
Pages: 12
Related Papers
50 records in total
  • [21] SAVE: Self-Attention on Visual Embedding for Zero-Shot Generic Object Counting
    Zgaren, Ahmed
    Bouachir, Wassim
    Bouguila, Nizar
    JOURNAL OF IMAGING, 2025, 11 (02)
  • [22] Cross-modality interaction for few-shot multispectral object detection with semantic knowledge
    Huang, Lian
    Peng, Zongju
    Chen, Fen
    Dai, Shaosheng
    He, Ziqiang
    Liu, Kesheng
NEURAL NETWORKS, 2024, 173
  • [23] Cascaded Cross-Modality Fusion Network for 3D Object Detection
    Chen, Zhiyu
    Lin, Qiong
    Sun, Jing
    Feng, Yujian
    Liu, Shangdong
    Liu, Qiang
    Ji, Yimu
    Xu, He
    SENSORS, 2020, 20 (24) : 1 - 14
  • [24] MCAFNet: Multiscale cross-modality adaptive fusion network for multispectral object detection
    Zheng, Shangpo
    Liu, Junfeng
    Jun, Zeng
    DIGITAL SIGNAL PROCESSING, 2025, 159
  • [25] Towards zero-shot cross-lingual named entity disambiguation
    Barrena, Ander
    Soroa, Aitor
    Agirre, Eneko
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [26] Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
    Mettes, Pascal
    Snoek, Cees G. M.
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 4453 - 4462
  • [27] Zero-shot Learning via the fusion of generation and embedding for image recognition
    Zhao, Peng
    Zhang, Siying
    Liu, Jinhui
    Liu, Huiting
INFORMATION SCIENCES, 2021, 578 : 831 - 847
  • [28] Cross-Modality Binary Code Learning via Fusion Similarity Hashing
    Liu, Hong
    Ji, Rongrong
    Wu, Yongjian
    Huang, Feiyue
    Zhang, Baochang
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 6345 - 6353
  • [29] Towards zero-shot learning generalization via a cosine distance loss
    Pan, Chongyu
    Huang, Jian
    Hao, Jianguo
    Gong, Jianxing
    NEUROCOMPUTING, 2020, 381 : 167 - 176
  • [30] Deep Cross-Modality Alignment for Multi-Shot Person Re-IDentification
    Song, Zhichao
    Ni, Bingbing
    Yan, Yichao
    Ren, Zhe
    Xu, Yi
    Yang, Xiaokang
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 645 - 653