Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited by: 4
Authors
Chen, Jinyong [1 ]
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhai, Wenzhe [1 ]
Jeon, Gwanggil [3 ]
Camacho, David [4 ]
Institutions
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot;
DOI
10.1016/j.inffus.2024.102537
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104; 0812; 0835; 1405;
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds or vehicles. The recent emergence of multi-modal foundation models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. First, the CLIP model lacks sensitivity to location information: it generally captures global content rather than the fine-grained locations of objects, so adapting CLIP directly is suboptimal. Second, these models generally freeze the pre-trained vision and language encoders, neglecting potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. DSPI leverages the spatial-awareness ability of a large-scale pre-trained object grounding model, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature spaces across modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
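The meta-adapter idea described in the abstract, distilling a textual instruction into an object query that attends over per-location image features, can be sketched with a toy numpy cross-attention. All shapes, weights, and function names below are illustrative assumptions for exposition, not the authors' implementation (which is at the linked repository):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(text_feat, spatial_feats, d=64, seed=0):
    """Project a text embedding into an object query, then let that query
    attend over flattened per-location image features (toy sketch).

    text_feat:     (Dt,)     text embedding for the query class
    spatial_feats: (HW, Dv)  per-location visual features (spatial prior)
    returns:       (HW,)     attention weights over spatial locations
    """
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned query/key projections.
    Wq = rng.standard_normal((text_feat.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((spatial_feats.shape[-1], d)) / np.sqrt(d)
    q = text_feat @ Wq          # (d,)  object query distilled from text
    k = spatial_feats @ Wk      # (HW, d) keys from spatial locations
    # Scaled dot-product scores, normalized over locations.
    return softmax(k @ q / np.sqrt(d))

# Usage: a 4x4 feature grid (flattened to 16 locations) and a dummy text vector.
text = np.ones(32)
feats = np.random.default_rng(1).standard_normal((16, 48))
w = cross_modality_attention(text, feats)
print(w.shape)  # (16,) — one weight per spatial location, summing to 1
```

In the real model a density-prediction head would regress counts from such aligned features; here the weights merely illustrate how a text-derived query can localize its attention over a spatial prior.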
Pages: 12
Related Papers
50 records
  • [31] Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning
    Li, Wenrui
    Ma, Zhengyu
    Deng, Liang-Jian
    Man, Hengyu
    Fan, Xiaopeng
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 426 - 431
  • [32] Domain-aware multi-modality fusion network for generalized zero-shot learning
    Wang, Jia
    Wang, Xiao
    Zhang, Han
    NEUROCOMPUTING, 2022, 488 : 23 - 35
  • [33] Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks
    Wang, Wenguan
    Lu, Xiankai
    Shen, Jianbing
    Crandall, David
    Shao, Ling
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9235 - 9244
  • [34] Zero-Shot Human-Object Interaction Detection via Similarity Propagation
    Zong, Daoming
    Sun, Shiliang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (12) : 17805 - 17816
  • [35] Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery
    Fang Qingyun
    Wang Zhaokui
    PATTERN RECOGNITION, 2022, 130
  • [37] 6D Object Pose Estimation Based on Cross-Modality Feature Fusion
    Jiang, Meng
    Zhang, Liming
    Wang, Xiaohua
    Li, Shuang
    Jiao, Yijie
    SENSORS, 2023, 23 (19)
  • [38] One-shot Retinal Artery and Vein Segmentation via Cross-modality Pretraining
    Shi, Danli
    He, Shuang
    Yang, Jiancheng
    Zheng, Yingfeng
    He, Mingguang
    OPHTHALMOLOGY SCIENCE, 2024, 4 (02)
  • [39] Image manipulation localization via dynamic cross-modality fusion and progressive integration
    Jin, Xiao
    Yu, Wen
    Shi, Wei
    NEUROCOMPUTING, 2024, 610
  • [40] Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning
    Han, Janghoon
    Lee, Changho
    Shin, Joongbo
    Choi, Stanley Jungkyu
    Lee, Honglak
    Bae, Kyunghoon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 15436 - 15452