Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Citations: 0

Authors:

Shao, Tong [1]

Tian, Zhuotao [1]

Zhao, Hang [1]

Su, Jingyong [1]

Affiliations:

[1] Harbin Institute of Technology, Shenzhen, People's Republic of China

Source:

COMPUTER VISION - ECCV 2024, PT LXXXVI | 2025 / Vol. 15144

Funding:

National Natural Science Foundation of China;

Keywords:

CLIP; Training-free; Semantic Segmentation;

DOI:

10.1007/978-3-031-73016-0_9

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite this success, its application to OVSS is limited by its image-level alignment training, which degrades performance on tasks that require detailed local context. Our study examines the impact of CLIP's [CLS] token on patch feature correlations and reveals a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach yields notable improvements in segmentation accuracy and in maintaining semantic coherence across objects. Experiments show that CLIPtrase outperforms CLIP by 22.3% on average across 9 segmentation benchmarks and surpasses existing state-of-the-art training-free methods. The code is publicly available at https://github.com/leaves162/CLIPtrase.

Pages: 139-156

Page count: 18

50 items in total

[41] Source-Free Open Compound Domain Adaptation in Semantic Segmentation
Zhao, Yuyang
Zhong, Zhun
Luo, Zhiming
Lee, Gim Hee
Sebe, Nicu
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) : 7019 - 7032
[42] A Real-Time Training-Free Laughter Detection System Based on Novel Syllable Segmentation and Correlation Methods
Chou, Chih-Hung
Li, Chih-Hung
Chen, Bo-Wei
Wang, Jhing-Fa
Lin, Po-Chuan
4TH INTERNATIONAL CONFERENCE ON AWARENESS SCIENCE AND TECHNOLOGY (ICAST 2012), 2012, : 294 - 297
[43] From ViT Features to Training-free Video Object Segmentation via Streaming-data Mixture Models
Uziel, Roy
Dinari, Or
Freifeld, Oren
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[44] Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
Zhu, Xiaoyu
Zhou, Hao
Xing, Pengfei
Zhao, Long
Xu, Hao
Liang, Junwei
Hauptmann, Alexander
Liu, Ting
Gallagher, Andrew
COMPUTER VISION - ECCV 2024, PT XXIX, 2025, 15087 : 357 - 375
[45] LANGUAGE-DRIVEN OPEN-VOCABULARY 3D SEMANTIC SEGMENTATION WITH KNOWLEDGE DISTILLATION
Wu, Yuting
Han, Xian-Feng
Xiao, Guoqiang
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 3320 - 3324
[46] MVP-SEG: Multi-view Prompt Learning for Open-Vocabulary Semantic Segmentation
Guo, Jie
Wang, Qimeng
Gao, Yan
Jiang, Xiaolong
Lin, Shaohui
Zhang, Baochang
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT XII, 2024, 14436 : 158 - 171
[47] Generative AI-aided Joint Training-free Secure Semantic Communications via Multi-modal Prompts
Du, Hongyang
Liu, Guangyuan
Niyato, Dusit
Zhang, Jiayi
Kang, Jiawen
Xiong, Zehui
Ai, Bo
Kim, Dong In
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024, : 12896 - 12900
[48] OSAM-Fundus: A training-free, one-shot segmentation framework for optic disc and cup in fundus images
Wang, Rui
Yang, Zhouwang
Song, Yanzhi
BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2024, 100
[49] LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition
Qu, Haoxuan
Hui, Xiaofei
Cai, Yujun
Liu, Jun
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[50] Expanding Open-Vocabulary Understanding for UAV Aerial Imagery: A Vision-Language Framework to Semantic Segmentation
Huang, Bangju
Li, Junhui
Luan, Wuyang
Tan, Jintao
Li, Chenglong
Huang, Longyang
DRONES, 2025, 9 (02)