ViT-CAPS: Vision transformer with contrastive adaptive prompt segmentation

Cited by: 1
|
Authors
Rashid, Khawaja Iftekhar [1]
Yang, Chenhui [1]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Contrastive learning; Feature extraction; Few-shot segmentation; Semantic segmentation; Vision transformer; CHALLENGE;
DOI
10.1016/j.neucom.2025.129578
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Real-time segmentation plays an important role in numerous applications, including autonomous driving and medical imaging, where accurate and instantaneous segmentation influences essential decisions. Previous approaches suffer from a lack of cross-domain transferability and a need for large amounts of labeled data, which prevent them from being applied successfully to real-world scenarios. This study presents a new model, ViT-CAPS, that utilizes Vision Transformers in the encoder to improve segmentation performance in challenging and large-scale scenes. We employ the Adaptive Context Embedding (ACE) module, incorporating contrastive learning to improve domain adaptation by matching features from support and query images. In addition, the Meta Prompt Generator (MPG) is designed to generate prompts from the aligned features, enabling segmentation in complicated environments without requiring much human input. ViT-CAPS has shown promising results in resolving domain-shift problems and improving few-shot segmentation in dynamic, low-annotation settings. We conducted extensive experiments on four well-known datasets, FSS-1000, Cityscapes, ISIC, and DeepGlobe, and achieved noteworthy performance. Compared to previous approaches, we achieved performance gains of 4.6 % on FSS-1000, 4.2 % on DeepGlobe, and 6.1 % on Cityscapes, with a slight decrease of 3 % on the ISIC dataset. We achieved average mean IoU scores of 60.52 and 69.3, which are 2.7 % and 5.1 % higher than state-of-the-art Cross-Domain Few-Shot Segmentation (CD-FSS) models in the 1-shot and 5-shot settings, respectively.
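The abstract describes the ACE module as using contrastive learning to align support- and query-image features. The paper does not spell out the loss here, so the following is only a generic sketch of the kind of support-query contrastive alignment described: an InfoNCE-style objective in which each query feature is pulled toward its corresponding support feature and pushed away from the others. The function name and temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def info_nce_loss(support_feats: np.ndarray,
                  query_feats: np.ndarray,
                  temperature: float = 0.1) -> float:
    """InfoNCE-style loss aligning paired support/query features.

    support_feats, query_feats: (N, D) arrays; row i of each is a
    positive pair, all other rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    logits = (q @ s.T) / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # log-softmax over each row; positives lie on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss drives matched support/query features together in embedding space, which is the mechanism the abstract credits for improved domain adaptation: well-aligned pairs yield a near-zero loss, while mismatched pairs are penalized.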
Pages: 11
Related papers
50 records
  • [31] SC-ViT: Semantic Contrast Vision Transformer for Scene Recognition
    Niu, Jiahui (niujiahui418@mail.sdu.edu.cn), Institute of Electrical and Electronics Engineers Inc.
  • [32] HIRI-ViT: Scaling Vision Transformer With High Resolution Inputs
    Yao, Ting
    Li, Yehao
    Pan, Yingwei
    Mei, Tao
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (09) : 6431 - 6442
  • [33] ViT-A*: Legged Robot Path Planning using Vision Transformer A*
    Liu, Jianwei
    Lyu, Shirui
    Hadjivelichkov, Denis
    Modugno, Valerio
    Kanoulas, Dimitrios
    2023 IEEE-RAS 22ND INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, HUMANOIDS, 2023,
  • [34] VISION TRANSFORMER-BASED RETINA VESSEL SEGMENTATION WITH DEEP ADAPTIVE GAMMA CORRECTION
    Yu, Hyunwoo
    Shim, Jae-hun
    Kwak, Jaeho
    Song, Jou Won
    Kang, Suk-Ju
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 1456 - 1460
  • [35] Contrastive Feature Masking Open-Vocabulary Vision Transformer
    Kim, Dahun
    Angelova, Anelia
    Kuo, Weicheng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15556 - 15566
  • [36] Supervised Contrastive Vision Transformer for Breast Histopathological Image Classification
    Shiri, Mohammad
    Reddy, Monalika Padma
    Sun, Jiangwen
    2024 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE, IRI 2024, 2024, : 296 - 301
  • [37] Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge
    Dumoulin, Joren
    Houshmand, Pouya
    Jain, Vikram
    Verhelst, Marian
    2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [38] Gearbox Fault Detection Using Continuous Wavelet Transform and Vision Transformer (ViT)
    Asadian, Ali
    Riyazi, Yassin
    Ayati, Moosa
    2024 32ND INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING, ICEE 2024, 2024, : 688 - 692
  • [39] Shot-ViT: Cricket Batting Shots Classification with Vision Transformer Network
    Dey, A.
    Biswas, S.
    INTERNATIONAL JOURNAL OF ENGINEERING, 2024, 37 (12): : 2463 - 2472
  • [40] MIL-ViT: A multiple instance vision transformer for fundus image classification
    Bi, Qi
    Sun, Xu
    Yu, Shuang
    Ma, Kai
    Bian, Cheng
    Ning, Munan
    He, Nanjun
    Huang, Yawen
    Li, Yuexiang
    Liu, Hanruo
    Zheng, Yefeng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 97