Leveraging vision-language prompts for real-world image restoration and enhancement

Cited by: 0

Authors:

Wei, Yanyan [1,2]

Zhang, Yilin [1]

Li, Kun [1]

Wang, Fei [1]

Tang, Shengeng [1]

Zhang, Zhao [1,3]

Affiliations:

[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China

[2] Hefei Univ Technol, Anhui Prov Key Lab Ind Safety & Emergency Technol, Hefei, Peoples R China

[3] Yunnan Key Lab Software Engn, Kunming, Yunnan, Peoples R China

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 2025 / Vol. 250

Funding:

National Natural Science Foundation of China;

Keywords:

Vision-language fusion; Textual prompts; Multimodal interaction; Image restoration; Synthetic data augmentation; Adverse weather removal; NETWORK;

DOI:

10.1016/j.cviu.2024.104222

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Significant advancements have been made in image restoration methods aimed at removing adverse weather effects. However, due to natural constraints, it is challenging to collect real-world datasets for adverse weather removal tasks. Consequently, existing methods predominantly rely on synthetic datasets, which struggle to generalize to real-world data, thereby limiting their practical utility. While some real-world adverse weather removal datasets have emerged, their design, which involves capturing ground truths at a different moment, inevitably introduces interfering discrepancies between the degraded images and the ground truths. These discrepancies include variations in brightness, color, contrast, and minor misalignments. Meanwhile, real-world datasets typically involve complex rather than singular degradation types. In many samples, degradation features are not overt, which poses immense challenges to real-world adverse weather removal methodologies. To tackle these issues, we introduce the recently prominent vision-language model, CLIP, to aid in the image restoration process. An expanded and fine-tuned CLIP model acts as an 'expert', leveraging the image priors acquired through large-scale pre-training to guide the operation of the image restoration model. Additionally, we generate a set of pseudo-ground-truths on sequences of degraded images to further alleviate the difficulty for the model in fitting the data. To imbue the model with more prior knowledge about degradation characteristics, we also incorporate additional synthetic training data. Lastly, the progressive learning and fine-tuning strategies employed during training enhance the model's final performance, enabling our method to surpass existing approaches in both visual quality and objective image quality assessment metrics.


Pages: 11
