Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited by: 4
Authors
Chen, Jinyong [1]
Li, Qilei [1,2]
Gao, Mingliang [1]
Zhai, Wenzhe [1]
Jeon, Gwanggil [3]
Camacho, David [4]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot
DOI
10.1016/j.inffus.2024.102537
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds and vehicles. The recent emergence of multi-modal foundational models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. Firstly, the CLIP model lacks sensitivity to location information: it generally captures global content rather than the fine-grained locations of objects, so adapting the CLIP model directly is suboptimal. Secondly, these models generally freeze the pre-trained vision and language encoders while neglecting the potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. The DSPI leverages the spatial-awareness ability of a large-scale pre-trained object grounding model, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature spaces across the two modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
Pages: 12
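
The abstract describes two cooperating modules: a meta adapter that turns a text embedding into an object query, and a fusion step that injects a spatial prior (e.g., derived from Grounding DINO detections) into cross-modality matching. The PyTorch sketch below is a minimal illustration of how such a pipeline could be wired, not the published DSPI implementation; every module name, dimension, and the gating scheme are assumptions (the authors' code is at https://github.com/jinyongch/DSPI).

# Hypothetical sketch of a DSPI-style counting pipeline. Module names,
# dimensions, and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class MetaAdapter(nn.Module):
    """Projects a frozen text embedding into an object query (assumed MLP form)."""
    def __init__(self, text_dim: int = 512, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, text_embed: torch.Tensor) -> torch.Tensor:
        # (B, text_dim) -> (B, feat_dim): one object query per class prompt.
        return self.proj(text_embed)

class SpatialPriorFusion(nn.Module):
    """Correlates image features with the object query, gated by a spatial prior."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, 1),
            nn.ReLU(inplace=True),  # densities are non-negative
        )

    def forward(self, img_feat, obj_query, prior):
        # img_feat: (B, C, H, W); obj_query: (B, C); prior: (B, 1, H, W) in [0, 1].
        sim = (img_feat * obj_query[:, :, None, None]).sum(1, keepdim=True)
        gated = img_feat * sim.sigmoid() * prior          # focus on likely object locations
        return self.head(torch.cat([gated, prior], 1))   # density map (B, 1, H, W)

# Toy usage: the predicted count is the integral of the density map.
adapter, fusion = MetaAdapter(), SpatialPriorFusion()
img_feat = torch.randn(1, 256, 32, 32)   # stand-in backbone features
text_embed = torch.randn(1, 512)         # stand-in CLIP text embedding
prior = torch.rand(1, 1, 32, 32)         # stand-in Grounding DINO prior
count = fusion(img_feat, adapter(text_embed), prior).sum()

Summing the predicted density map yields the count; at inference, the spatial prior would plausibly be rasterized from Grounding DINO's box predictions for the queried class.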