Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited by: 4
Authors
Chen, Jinyong [1]
Li, Qilei [1,2]
Gao, Mingliang [1]
Zhai, Wenzhe [1]
Jeon, Gwanggil [3]
Camacho, David [4]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot
DOI
10.1016/j.inffus.2024.102537
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds and vehicles. The recent emergence of multi-modal foundational models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. Firstly, the CLIP model lacks sensitivity to location information: it generally captures global content rather than the fine-grained locations of objects, so adapting the CLIP model directly is suboptimal. Secondly, these models generally freeze the pre-trained vision and language encoders while neglecting the potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. The DSPI leverages the spatial-awareness ability of a large-scale pre-trained object grounding model, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature spaces across the two modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
Pages: 12
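
The abstract describes two cooperating modules: a meta adapter that turns a text embedding into an object query, and a fusion step that injects a spatial prior (e.g., derived from Grounding DINO detections) into cross-modality matching. The PyTorch sketch below is a minimal illustration of how such a pipeline could be wired, not the published DSPI implementation; every module name, dimension, and the gating scheme are assumptions (the authors' code is at https://github.com/jinyongch/DSPI).

# Hypothetical sketch of a DSPI-style counting pipeline. Module names,
# dimensions, and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class MetaAdapter(nn.Module):
    """Projects a frozen text embedding into an object query (assumed MLP form)."""
    def __init__(self, text_dim: int = 512, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, text_embed: torch.Tensor) -> torch.Tensor:
        # (B, text_dim) -> (B, feat_dim): one object query per class prompt.
        return self.proj(text_embed)

class SpatialPriorFusion(nn.Module):
    """Correlates image features with the object query, gated by a spatial prior."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, 1),
            nn.ReLU(inplace=True),  # densities are non-negative
        )

    def forward(self, img_feat, obj_query, prior):
        # img_feat: (B, C, H, W); obj_query: (B, C); prior: (B, 1, H, W) in [0, 1].
        sim = (img_feat * obj_query[:, :, None, None]).sum(1, keepdim=True)
        gated = img_feat * sim.sigmoid() * prior          # focus on likely object locations
        return self.head(torch.cat([gated, prior], 1))   # density map (B, 1, H, W)

# Toy usage: the predicted count is the integral of the density map.
adapter, fusion = MetaAdapter(), SpatialPriorFusion()
img_feat = torch.randn(1, 256, 32, 32)   # stand-in backbone features
text_embed = torch.randn(1, 512)         # stand-in CLIP text embedding
prior = torch.rand(1, 1, 32, 32)         # stand-in Grounding DINO prior
count = fusion(img_feat, adapter(text_embed), prior).sum()

Summing the predicted density map yields the count; at inference, the spatial prior would plausibly be rasterized from Grounding DINO's box predictions for the queried class.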