ViT-CAPS: Vision transformer with contrastive adaptive prompt segmentation

Cited: 1
Authors
Rashid, Khawaja Iftekhar [1 ]
Yang, Chenhui [1 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Contrastive learning; Feature extraction; Few-shot segmentation; Semantic segmentation; Vision transformer; Challenge;
DOI
10.1016/j.neucom.2025.129578
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Real-time segmentation plays an important role in numerous applications, including autonomous driving and medical imaging, where accurate and instantaneous segmentation influences essential decisions. Previous approaches suffer from poor cross-domain transferability and a need for large amounts of labeled data, which prevent them from being applied successfully to real-world scenarios. This study presents a new model, ViT-CAPS, that uses Vision Transformers in the encoder to improve segmentation performance in challenging, large-scale scenes. We employ the Adaptive Context Embedding (ACE) module, which incorporates contrastive learning to improve domain adaptation by matching features from support and query images. In addition, the Meta Prompt Generator (MPG) is designed to generate prompts from the aligned features, enabling segmentation of complicated environments without requiring much human input. ViT-CAPS has shown promising results in resolving domain-shift problems and improving few-shot segmentation in dynamic, low-annotation settings. We conducted extensive experiments on four well-known datasets, FSS-1000, Cityscapes, ISIC, and DeepGlobe, and achieved noteworthy performance: gains of 4.6 % on FSS-1000, 4.2 % on DeepGlobe, and 6.1 % on Cityscapes, with a slight decrease of 3 % on the ISIC dataset compared with previous approaches. We achieved average mean IoU scores of 60.52 and 69.3, which are 2.7 % and 5.1 % higher than state-of-the-art Cross-Domain Few-Shot Segmentation (CD-FSS) models in the 1-shot and 5-shot settings, respectively.
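The abstract describes the ACE module as using contrastive learning to match support and query image features. The paper's own implementation is not given here; the following is a minimal NumPy sketch of one common way such support-query alignment is formulated (an InfoNCE-style loss where matched support/query feature pairs are positives and all other pairings are negatives). The function name, the pairing convention, and the temperature value are illustrative assumptions, not the authors' code.

```python
import numpy as np

def contrastive_alignment_loss(support_feats, query_feats, temperature=0.1):
    """InfoNCE-style sketch (assumed formulation, not the paper's code):
    the i-th support feature and the i-th query feature are treated as a
    positive pair; every other support-query pairing acts as a negative.

    support_feats, query_feats: arrays of shape (N, D).
    Returns the mean cross-entropy over the N positive pairs.
    """
    # L2-normalize so the dot product is cosine similarity
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    logits = s @ q.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Cross-entropy with targets on the diagonal (the matched pairs)
    return float(-np.log(np.diag(probs)).mean())
```

Minimizing such a loss pulls each query embedding toward the support embedding of the same class and pushes it away from the others, which is the alignment effect the abstract attributes to the ACE module; well-aligned feature pairs yield a lower loss than shuffled ones.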
Pages: 11
Related papers
50 records
  • [1] A-ViT: Adaptive Tokens for Efficient Vision Transformer
    Yin, Hongxu
    Vahdat, Arash
    Alvarez, Jose M.
    Mallya, Arun
    Kautz, Jan
    Molchanov, Pavlo
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10799 - 10808
  • [2] BinaryFormer: A Hierarchical-Adaptive Binary Vision Transformer (ViT) for Efficient Computing
    Wang, Miaohui
    Xu, Zhuowei
    Zheng, Bin
    Xie, Wuyuan
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (08) : 10657 - 10668
  • [3] eX-ViT: A Novel explainable vision transformer for weakly supervised semantic segmentation
    Yu, Lu
    Xiang, Wei
    Fang, Juan
    Chen, Yi-Ping Phoebe
    Chi, Lianhua
    PATTERN RECOGNITION, 2023, 142
  • [4] Gait-ViT: Gait Recognition with Vision Transformer
    Mogan, Jashila Nair
    Lee, Chin Poo
    Lim, Kian Ming
    Muthu, Kalaiarasi Sonai
    SENSORS, 2022, 22 (19)
  • [5] Vision Transformer (ViT)-based Applications in Image Classification
    Huo, Yingzi
    Jin, Kai
    Cai, Jiahong
    Xiong, Huixuan
    Pang, Jiacheng
    2023 IEEE 9TH INTL CONFERENCE ON BIG DATA SECURITY ON CLOUD, BIGDATASECURITY, IEEE INTL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING, HPSC AND IEEE INTL CONFERENCE ON INTELLIGENT DATA AND SECURITY, IDS, 2023, : 135 - 140
  • [6] Appearance Prompt Vision Transformer for Connectome Reconstruction
    Sun, Rui
    Luo, Naisong
    Pan, Yuwen
    Mai, Huayu
    Zhang, Tianzhu
    Xiong, Zhiwei
    Wu, Feng
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1423 - 1431
  • [7] Online Continual Learning with Contrastive Vision Transformer
    Wang, Zhen
    Liu, Liu
    Kong, Yajing
    Guo, Jiaxian
    Tao, Dacheng
    COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 631 - 650
  • [8] ViT-UperNet: a hybrid vision transformer with unified-perceptual-parsing network for medical image segmentation
    Ruiping, Yang
    Kun, Liu
    Shaohua, Xu
    Jian, Yin
    Zhen, Zhang
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3819 - 3831
  • [9] ViT-FRD: A Vision Transformer Model for Cardiac MRI Image Segmentation Based on Feature Recombination Distillation
    Fan, Chunyu
    Su, Qi
    Xiao, Zhifeng
    Su, Hao
    Hou, Aijie
    Luan, Bo
    IEEE ACCESS, 2023, 11 : 129763 - 129772
  • [10] Contrastive hashing with vision transformer for image retrieval
    Ren, Xiuxiu
    Zheng, Xiangwei
    Zhou, Huiyu
    Liu, Weilong
    Dong, Xiao
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (12) : 12192 - 12211