Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited: 4
Authors
Chen, Jinyong [1 ]
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhai, Wenzhe [1 ]
Jeon, Gwanggil [3 ]
Camacho, David [4 ]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot;
DOI
10.1016/j.inffus.2024.102537
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds and vehicles. The recent emergence of multi-modal foundation models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. First, the CLIP model lacks sensitivity to location information: it generally attends to global content rather than the fine-grained locations of objects, so adapting the CLIP model directly is suboptimal. Second, these models generally freeze the pre-trained vision and language encoders while neglecting potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. The DSPI leverages the spatial-awareness ability of large-scale pre-trained object grounding models, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature space across different modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
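The abstract describes the two modules only at a high level. The following is a minimal, hypothetical PyTorch sketch of that idea: a meta adapter that turns a frozen text embedding into an object query, and a head that fuses a Grounding-DINO-style spatial prior with query-modulated image features to regress a density map whose sum is the count. All module names, tensor shapes, and layer choices here are illustrative assumptions, not the authors' released implementation (see the GitHub link above for the actual code).

```python
# Hypothetical sketch of the DSPI idea from the abstract. Shapes, layers,
# and names are assumptions for illustration only.
import torch
import torch.nn as nn

class MetaAdapter(nn.Module):
    """Projects a frozen text embedding into an object query that acts as a
    cross-modality matching instruction (hypothetical design)."""
    def __init__(self, text_dim=512, query_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, query_dim),
            nn.ReLU(inplace=True),
            nn.Linear(query_dim, query_dim),
        )

    def forward(self, text_emb):       # (B, text_dim)
        return self.proj(text_emb)     # (B, query_dim)

class DSPISketch(nn.Module):
    """Toy counting head: correlate the object query with image features,
    concatenate a spatial prior map, and regress a density map."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.adapter = MetaAdapter(query_dim=feat_dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, img_feat, spatial_prior, text_emb):
        # img_feat: (B, C, H, W) features from a frozen CLIP-like encoder
        # spatial_prior: (B, 1, H, W), e.g. a box heatmap from Grounding DINO
        # text_emb: (B, 512) frozen text-encoder embedding of the class name
        query = self.adapter(text_emb)                           # (B, C)
        sim = (img_feat * query[:, :, None, None]).sum(1, keepdim=True)
        fused = torch.cat([img_feat * sim.sigmoid(), spatial_prior], dim=1)
        density = self.decoder(fused).relu()                     # (B, 1, H, W)
        return density.sum(dim=(1, 2, 3))                        # counts (B,)

# Smoke test with random tensors.
model = DSPISketch()
counts = model(torch.randn(2, 256, 32, 32), torch.rand(2, 1, 32, 32),
               torch.randn(2, 512))
print(counts.shape)  # torch.Size([2])
```

In the paper's setting, img_feat and text_emb would presumably come from frozen pre-trained encoders and spatial_prior from Grounding DINO detections for the query class, with the density map supervised against point annotations; those training details are not specified in the abstract.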
Pages: 12
Related Papers
50 records in total
  • [21] SAVE: Self-Attention on Visual Embedding for Zero-Shot Generic Object Counting
    Zgaren, Ahmed
    Bouachir, Wassim
    Bouguila, Nizar
    JOURNAL OF IMAGING, 2025, 11 (02)
  • [22] Cross-modality interaction for few-shot multispectral object detection with semantic knowledge
    Huang, Lian
    Peng, Zongju
    Chen, Fen
    Dai, Shaosheng
    He, Ziqiang
    Liu, Kesheng
NEURAL NETWORKS, 2024, 173
  • [23] Cascaded Cross-Modality Fusion Network for 3D Object Detection
    Chen, Zhiyu
    Lin, Qiong
    Sun, Jing
    Feng, Yujian
    Liu, Shangdong
    Liu, Qiang
    Ji, Yimu
    Xu, He
    SENSORS, 2020, 20 (24) : 1 - 14
  • [24] MCAFNet: Multiscale cross-modality adaptive fusion network for multispectral object detection
    Zheng, Shangpo
    Liu, Junfeng
    Jun, Zeng
    DIGITAL SIGNAL PROCESSING, 2025, 159
  • [25] Towards zero-shot cross-lingual named entity disambiguation
    Barrena, Ander
    Soroa, Aitor
    Agirre, Eneko
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [26] Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
    Mettes, Pascal
    Snoek, Cees G. M.
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 4453 - 4462
  • [27] Zero-shot Learning via the fusion of generation and embedding for image recognition
    Zhao, Peng
    Zhang, Siying
    Liu, Jinhui
    Liu, Huiting
INFORMATION SCIENCES, 2021, 578 : 831 - 847
  • [28] Cross-Modality Binary Code Learning via Fusion Similarity Hashing
    Liu, Hong
    Ji, Rongrong
    Wu, Yongjian
    Huang, Feiyue
    Zhang, Baochang
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 6345 - 6353
  • [29] Towards zero-shot learning generalization via a cosine distance loss
    Pan, Chongyu
    Huang, Jian
    Hao, Jianguo
    Gong, Jianxing
    NEUROCOMPUTING, 2020, 381 : 167 - 176
  • [30] Deep Cross-Modality Alignment for Multi-Shot Person Re-IDentification
    Song, Zhichao
    Ni, Bingbing
    Yan, Yichao
    Ren, Zhe
    Xu, Yi
    Yang, Xiaokang
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 645 - 653