Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited by: 4
Authors
Chen, Jinyong [1 ]
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhai, Wenzhe [1 ]
Jeon, Gwanggil [3 ]
Camacho, David [4 ]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot;
DOI
10.1016/j.inffus.2024.102537
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds or vehicles. The recent emergence of multi-modal foundational models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. First, the CLIP model lacks sensitivity to location information: it generally captures global content rather than the fine-grained locations of objects, so adapting the CLIP model directly is suboptimal. Second, these models generally freeze the pre-trained vision and language encoders while neglecting the potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. The DSPI leverages the spatial-awareness ability of a large-scale pre-trained object grounding model, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature space across different modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
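The abstract's pipeline can be caricatured in a few lines: a frozen text embedding is mapped by a small adapter into an object query, every spatial location of a grounding-model feature map is scored against that query, and the per-location responses are summed as a count estimate. The sketch below is purely illustrative; the dimensions, the random stand-in features, and the names `meta_adapter` and `fuse_and_count` are assumptions, not the paper's actual implementation (which uses real CLIP and Grounding DINO encoders and a learned density head).

```python
import numpy as np

D, HW = 64, 16 * 16                # embedding dim; flattened feature map size
rng = np.random.default_rng(0)

text_emb = rng.standard_normal(D)             # stand-in for a frozen text embedding
spatial_prior = rng.standard_normal((HW, D))  # stand-in for grounding-model features
W_adapter = rng.standard_normal((D, D)) / np.sqrt(D)

def meta_adapter(t):
    """Map the text embedding into an object query; a single linear
    projection stands in for the paper's meta adapter."""
    return W_adapter @ t

def fuse_and_count(prior, query):
    """Score each spatial location against the object query (cross-modality
    matching) and sum the per-location responses as a crude count proxy."""
    logits = prior @ query / np.sqrt(D)        # (HW,) similarity per location
    density = 1.0 / (1.0 + np.exp(-logits))    # squash to a (0, 1) response map
    return density, density.sum()

density, count = fuse_and_count(spatial_prior, meta_adapter(text_emb))
print(density.shape)
```

In the real model the response map is a predicted density map and the count is its integral; here the sigmoid responses merely show where the matching signal would live.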
Pages: 12
Related Papers
50 records in total
  • [1] Infrared colorization with cross-modality zero-shot learning
    Wei, Chiheng
    Chen, Huawei
    Bai, Lianfa
    Han, Jing
    Chen, Xiaoyu
    NEUROCOMPUTING, 2024, 579
  • [2] Zero-shot learning with regularized cross-modality ranking
    Yu, Yunlong
    Ji, Zhong
    Guo, Jichang
    Pang, Yanwei
    NEUROCOMPUTING, 2017, 259 : 14 - 20
  • [3] Zero-Shot Object Counting
    Xu, Jingyi
    Le, Hieu
    Nguyen, Vu
    Ranjan, Viresh
    Samaras, Dimitris
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15548 - 15557
  • [4] Zero-Shot Object Counting With Vision-Language Prior Guidance Network
    Zhai, Wenzhe
    Xing, Xianglei
    Gao, Mingliang
    Li, Qilei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (03) : 2487 - 2498
  • [5] Zero-Shot Object Counting with Good Exemplars
    Zhu, Huilin
    Yuan, Jingling
    Yang, Zhengwei
    Guo, Yu
    Wang, Zheng
    Zhong, Xian
    He, Shengfeng
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 368 - 385
  • [6] Mutual Information Guided Diffusion for Zero-Shot Cross-Modality Medical Image Translation
    Wang, Zihao
    Yang, Yingyu
    Chen, Yuzhou
    Yuan, Tingting
    Sermesant, Maxime
    Delingette, Herve
    Wu, Ona
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2024, 43 (08) : 2825 - 2838
  • [7] Language-guided zero-shot object counting
    Wang, Mingjie
    Yuan, Song
    Li, Zhuohang
    Zhu, Longlong
    Buys, Eric
    Gong, Minglun
    2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS, ICMEW 2024, 2024,
  • [8] CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
    Jiang, Ruixiang
    Liu, Lingbo
    Chen, Changwen
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4535 - 4545
  • [9] Parameter-Free Latent Space Transformer for Zero-Shot Bidirectional Cross-modality Liver Segmentation
    Li, Yang
    Zou, Beiji
    Dai, Yulan
    Zhu, Chengzhang
    Yang, Fan
    Li, Xin
    Bai, Harrison X.
    Jiao, Zhicheng
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT IV, 2022, 13434 : 619 - 628
  • [10] COMe-SEE: Cross-modality Semantic Embedding Ensemble for Generalized Zero-Shot Diagnosis of Chest Radiographs
    Paul, Angshuman
    Shen, Thomas C.
    Balachandar, Niranjan
    Tang, Yuxing
    Peng, Yifan
    Lu, Zhiyong
    Summers, Ronald M.
    INTERPRETABLE AND ANNOTATION-EFFICIENT LEARNING FOR MEDICAL IMAGE COMPUTING, IMIMIC 2020, MIL3ID 2020, LABELS 2020, 2020, 12446 : 103 - 111