Towards zero-shot object counting via deep spatial prior cross-modality fusion

Cited by: 4
Authors
Chen, Jinyong [1 ]
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhai, Wenzhe [1 ]
Jeon, Gwanggil [3 ]
Camacho, David [4 ]
Institutions
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
[4] Univ Politecn Madrid, Comp Sci Dept, Madrid 28040, Spain
Keywords
Object counting; Cross-modality; Deep Spatial Prior; Grounding DINO; Zero-shot;
DOI
10.1016/j.inffus.2024.102537
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104; 0812; 0835; 1405;
Abstract
Existing counting models predominantly operate on a specific category of objects, such as crowds or vehicles. The recent emergence of multi-modal foundation models, e.g., Contrastive Language-Image Pretraining (CLIP), has facilitated class-agnostic counting, i.e., counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. First, the CLIP model lacks sensitivity to location information: it generally captures global content rather than the fine-grained locations of objects, so adapting CLIP directly is suboptimal. Second, these models generally freeze the pre-trained vision and language encoders, neglecting potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. DSPI leverages the spatial-awareness ability of a large-scale pre-trained object grounding model, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more precisely on the locations of the objects. Additionally, to align the feature spaces across modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multimodal representations while preserving their discriminative nature. Comprehensive experiments on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
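The meta-adapter idea described in the abstract, distilling a textual instruction into an object query that attends over per-location image features, can be sketched with a toy numpy cross-attention. All shapes, weights, and function names below are illustrative assumptions for exposition, not the authors' implementation (which is at the linked repository):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(text_feat, spatial_feats, d=64, seed=0):
    """Project a text embedding into an object query, then let that query
    attend over flattened per-location image features (toy sketch).

    text_feat:     (Dt,)     text embedding for the query class
    spatial_feats: (HW, Dv)  per-location visual features (spatial prior)
    returns:       (HW,)     attention weights over spatial locations
    """
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned query/key projections.
    Wq = rng.standard_normal((text_feat.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((spatial_feats.shape[-1], d)) / np.sqrt(d)
    q = text_feat @ Wq          # (d,)  object query distilled from text
    k = spatial_feats @ Wk      # (HW, d) keys from spatial locations
    # Scaled dot-product scores, normalized over locations.
    return softmax(k @ q / np.sqrt(d))

# Usage: a 4x4 feature grid (flattened to 16 locations) and a dummy text vector.
text = np.ones(32)
feats = np.random.default_rng(1).standard_normal((16, 48))
w = cross_modality_attention(text, feats)
print(w.shape)  # (16,) — one weight per spatial location, summing to 1
```

In the real model a density-prediction head would regress counts from such aligned features; here the weights merely illustrate how a text-derived query can localize its attention over a spatial prior.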
Pages: 12
Related Papers
50 records
  • [31] Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning
    Li, Wenrui
    Ma, Zhengyu
    Deng, Liang-Jian
    Man, Hengyu
    Fan, Xiaopeng
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 426 - 431
  • [32] Domain-aware multi-modality fusion network for generalized zero-shot learning
    Wang, Jia
    Wang, Xiao
    Zhang, Han
    NEUROCOMPUTING, 2022, 488 : 23 - 35
  • [33] Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks
    Wang, Wenguan
    Lu, Xiankai
    Shen, Jianbing
    Crandall, David
    Shao, Ling
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9235 - 9244
  • [34] Zero-Shot Human-Object Interaction Detection via Similarity Propagation
    Zong, Daoming
    Sun, Shiliang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (12) : 17805 - 17816
  • [35] Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery
    Fang Qingyun
    Wang Zhaokui
    PATTERN RECOGNITION, 2022, 130
  • [37] 6D Object Pose Estimation Based on Cross-Modality Feature Fusion
    Jiang, Meng
    Zhang, Liming
    Wang, Xiaohua
    Li, Shuang
    Jiao, Yijie
    SENSORS, 2023, 23 (19)
  • [38] One-shot Retinal Artery and Vein Segmentation via Cross-modality Pretraining
    Shi, Danli
    He, Shuang
    Yang, Jiancheng
    Zheng, Yingfeng
    He, Mingguang
    OPHTHALMOLOGY SCIENCE, 2024, 4 (02)
  • [39] Image manipulation localization via dynamic cross-modality fusion and progressive integration
    Jin, Xiao
    Yu, Wen
    Shi, Wei
    NEUROCOMPUTING, 2024, 610
  • [40] Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning
    Han, Janghoon
    Lee, Changho
    Shin, Joongbo
    Choi, Stanley Jungkyu
    Lee, Honglak
    Bae, Kyunghoon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 15436 - 15452