Leveraging vision-language prompts for real-world image restoration and enhancement

Yanyan Wei, Yilin Zhang, Kun Li, Fei Wang, Shengeng Tang, Zhao Zhang

Computer Vision and Image Understanding, Vol. 250, Article 104222 (published 2024-11-16). DOI: 10.1016/j.cviu.2024.104222. Available at https://www.sciencedirect.com/science/article/pii/S1077314224003035
Abstract
Significant advances have been made in image restoration methods aimed at removing adverse weather effects. However, natural constraints make it challenging to collect real-world datasets for adverse weather removal tasks. Consequently, existing methods rely predominantly on synthetic datasets, which generalize poorly to real-world data, limiting their practical utility. While some real-world adverse weather removal datasets have emerged, their design, which captures the ground truth at a different moment from the degraded image, inevitably introduces discrepancies between the two, including variations in brightness, color, and contrast, as well as minor misalignments. Moreover, real-world datasets typically involve compound rather than single degradation types, and in many samples the degradation is not overt, posing immense challenges for real-world adverse weather removal methods. To tackle these issues, we introduce the prominent vision-language model CLIP to aid the image restoration process. An expanded and fine-tuned CLIP model acts as an ‘expert’, leveraging the image priors acquired through large-scale pre-training to guide the image restoration model. Additionally, we generate a set of pseudo-ground-truths from sequences of degraded images to further ease the model’s difficulty in fitting the data, and we incorporate additional synthetic training data to imbue the model with more prior knowledge of degradation characteristics. Lastly, the progressive learning and fine-tuning strategies employed during training enhance the model’s final performance, enabling our method to surpass existing approaches in both visual quality and objective image quality assessment metrics.
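The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea, assuming OpenAI's open-source `clip` package: a frozen CLIP model scores how "clean" a restored image looks relative to a hand-written prompt pair (the paper instead fine-tunes an expanded CLIP as the 'expert'), and a per-pixel median over an aligned degraded sequence stands in for the paper's pseudo-ground-truth generation. The prompt wording and the median heuristic are illustrative assumptions, not details from the paper.

```python
# A minimal sketch (not the authors' implementation), assuming OpenAI's
# open-source `clip` package is installed (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float().eval()           # fp32 avoids fp16 dtype mismatches on GPU
for p in clip_model.parameters():   # CLIP acts as a frozen 'expert' prior
    p.requires_grad_(False)

# Hypothetical prompt pair; the paper fine-tunes CLIP rather than relying
# on fixed hand-written prompts like these.
tokens = clip.tokenize(["a clear, clean photo",
                        "a photo degraded by rain, haze, or snow"]).to(device)
with torch.no_grad():
    text_feats = clip_model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def clip_prompt_loss(restored: torch.Tensor) -> torch.Tensor:
    """Penalize restored images whose CLIP embedding sits closer to the
    'degraded' prompt than to the 'clean' one. `restored` must already be
    resized/normalized to CLIP's input format, shape (B, 3, 224, 224)."""
    img_feats = clip_model.encode_image(restored)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feats @ text_feats.T).softmax(dim=-1)  # (B, 2)
    return probs[:, 1].mean()  # probability mass on the 'degraded' prompt

def pseudo_ground_truth(frames: torch.Tensor) -> torch.Tensor:
    """Crude stand-in for the paper's pseudo-GT generation: the per-pixel
    median over an aligned degraded sequence (T, 3, H, W), assuming a
    static scene with transient degradations (rain streaks, snowflakes)."""
    return frames.median(dim=0).values
```

In training, a loss like `clip_prompt_loss` would typically be added to a pixel-wise term against the (pseudo-)ground truth; the prompt design and any loss weighting here are guesses, not values reported in the paper.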
Journal Description:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. The journal covers a wide range of topics in image understanding, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems