Guest Editorial: Advanced image restoration and enhancement in the wild

IET Computer Vision | IF 1.5 | JCR Q4 (Computer Science, Artificial Intelligence) | Publication date: 2024-04-19 | DOI: 10.1049/cvi2.12283
Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo
{"title":"Guest Editorial: Advanced image restoration and enhancement in the wild","authors":"Longguang Wang,&nbsp;Juncheng Li,&nbsp;Naoto Yokoya,&nbsp;Radu Timofte,&nbsp;Yulan Guo","doi":"10.1049/cvi2.12283","DOIUrl":null,"url":null,"abstract":"<p>Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.</p><p>In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.</p><p>The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.</p><p>Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world datasets show that their method produces state-of-the-art accuracy with high efficiency.</p><p>Gu et al. develop a temporal shift reconstruction network for compressive video sensing. 
To exploit the temporal cues between adjacent frames during the reconstruction of videos, most previous approaches commonly preform alignment between initial reconstructions. However, the estimated motions are usually too coarse to provide accurate temporal information. To remedy this, the proposed network employs stacked temporal shift reconstruction blocks to enhance the initial reconstruction progressively. Within each block, an efficient temporal shift operation is used to capture temporal structures in addition to computational overheads. Then, a bidirectional alignment module is adopted to capture the temporal dependencies in a video sequence. Different from previous methods that only extract supplementary information from the key frames, the proposed alignment module can receive temporal information from the whole video sequence via bidirectional propagations. Experiments demonstrate the superior performance of the proposed method.</p><p>Qu et al. propose a lightweight video frame interpolation network with a three-scale encoding-decoding structure. Specifically, multi-scale motion information is first extracted from the input video. Then, recurrent convolutional layers are adopted to refine the resultant features. Afterwards, the resultant features are aggregated to generate high-quality interpolated frames. Experimental results on the CelebA and Helen datasets show that the proposed method outperforms state-of-the-art methods while using fewer parameters.</p><p>Dou et al. introduce a decoder structure-guided CNN-Transformer network for face super-resolution. Most previous approaches follow a multi-task learning paradigm to perform landmark detection while super-resolving the low-resolution images. However, these methods require additional annotation cost, and the extracted facial prior structures are usually of low quality. To address these issues, the proposed network employs a global-local feature extraction unit to extract the global structure while capturing local texture details. In addition, a multi-state fusion module is incorporated to aggregate embeddings from different stages. Experiments show that the proposed method surpasses previous approaches by notable margins.</p><p>Yang et al. study the problem of blind super-resolution and propose a method to exploit degradation information through degradation representation learning. Specifically, a generative adversarial network is employed to model the degradation process from HR images to LR images and constrain the data distribution of the synthetic LR images. Then, the learnt representation is adopted to super-resolve the input low-resolution images using a transformer-based SR network. 
Experiments on both synthetic and real-world datasets demonstrate the effectiveness and superiority of the proposed method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"435-438"},"PeriodicalIF":1.5000,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cvi2.12283","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Image restoration and enhancement have long been fundamental tasks in computer vision and are widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, deep learning techniques have driven remarkable progress. Despite the promising performance achieved on synthetic data, compelling research challenges remain in the wild: (i) degradation models for real-world low-quality images are complicated and unknown; (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data is only available in unpaired form; (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g., RGB-D cameras) into image restoration; (iv) real-time inference on edge devices is important for image restoration and enhancement methods; and (v) it is difficult to provide confidence or performance bounds of a learning-based method on different images/regions. This Special Issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.

For this Special Issue, we received 17 papers, of which 8 underwent peer review while the rest were desk-rejected. Among the reviewed papers, 5 were accepted and 3 were rejected because they did not meet the criteria of IET Computer Vision. The accepted submissions are of high quality, which marks the success of this Special Issue.

The five accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category aims at reconstructing high-quality videos and comprises the papers by Zhang et al., Gu et al., and Qu et al. The second category studies the task of image super-resolution and comprises the papers by Dou et al. and Yang et al. A brief presentation of each paper in this Special Issue follows.

Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task, as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data into structured formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level, without relying on any voxelisation operation. A fusion module then aggregates complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world datasets show that the method achieves state-of-the-art accuracy with high efficiency.
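
As a rough illustration of this point-level idea, the sketch below (PyTorch; all module names and layer sizes are our own illustrative assumptions, not the authors' implementation) encodes raw events with a shared MLP and order-invariant pooling, then fuses the resulting descriptor with image features.

```python
import torch
import torch.nn as nn

class PointEventEncoder(nn.Module):
    """Shared MLP applied to raw events (x, y, t, polarity) without voxelisation."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(),
            nn.Linear(32, out_dim), nn.ReLU(),
        )

    def forward(self, events):              # events: (B, N, 4)
        feats = self.mlp(events)            # per-event features: (B, N, out_dim)
        return feats.max(dim=1).values      # order-invariant pooling: (B, out_dim)

class PointImageFusion(nn.Module):
    """Fuses a global event descriptor with per-pixel image features."""
    def __init__(self, img_dim=64, evt_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(img_dim + evt_dim, img_dim, kernel_size=1)

    def forward(self, img_feat, evt_feat):  # img_feat: (B, C, H, W), evt_feat: (B, E)
        evt_map = evt_feat[:, :, None, None].expand(-1, -1, *img_feat.shape[-2:])
        return self.proj(torch.cat([img_feat, evt_map], dim=1))

# Usage with random tensors
enc, fuse = PointEventEncoder(), PointImageFusion()
events = torch.rand(2, 1024, 4)
img_feat = torch.rand(2, 64, 32, 32)
out = fuse(img_feat, enc(events))           # fused features: (2, 64, 32, 32)
```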

Gu et al. develop a temporal shift reconstruction network for compressive video sensing. To exploit the temporal cues between adjacent frames during video reconstruction, most previous approaches perform alignment between initial reconstructions. However, the estimated motions are usually too coarse to provide accurate temporal information. To remedy this, the proposed network employs stacked temporal shift reconstruction blocks to progressively enhance the initial reconstruction. Within each block, an efficient temporal shift operation captures temporal structures with negligible additional computational overhead. A bidirectional alignment module is then adopted to capture the temporal dependencies in a video sequence. Different from previous methods that only extract supplementary information from the key frames, the proposed alignment module can receive temporal information from the whole video sequence via bidirectional propagation. Experiments demonstrate the superior performance of the proposed method.
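
For readers unfamiliar with the operation, the following is a minimal sketch of a generic temporal shift (in the style of the Temporal Shift Module), which exchanges a fraction of feature channels between neighbouring frames at essentially zero cost; it illustrates the general mechanism rather than the authors' exact block.

```python
import torch

def temporal_shift(x, fold_div=8):
    """x: (B, T, C, H, W). Shift 1/fold_div of channels forward and backward in time."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
    return out

frames = torch.rand(2, 5, 16, 8, 8)
shifted = temporal_shift(frames)   # same shape, channels now mixed across time
```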

Qu et al. propose a lightweight video frame interpolation network with a three-scale encoding-decoding structure. Specifically, multi-scale motion information is first extracted from the input video. Recurrent convolutional layers then refine the resulting features, which are finally aggregated to generate high-quality interpolated frames. Experimental results on the CelebA and Helen datasets show that the proposed method outperforms state-of-the-art methods while using fewer parameters.
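
A minimal sketch of such a three-scale encoding-decoding skeleton is given below (PyTorch); the channel widths are our own assumptions and the recurrent refinement stage is omitted for brevity, so this only conveys the coarse structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class ThreeScaleInterp(nn.Module):
    """Three-scale encoder-decoder that predicts the middle frame from two inputs."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = conv(6, 32), conv(32, 64), conv(64, 128)
        self.dec3, self.dec2 = conv(128, 64), conv(128, 32)
        self.out = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, f0, f1):                      # two input frames: (B, 3, H, W)
        x1 = self.enc1(torch.cat([f0, f1], dim=1))  # full resolution
        x2 = self.enc2(F.avg_pool2d(x1, 2))         # 1/2 resolution
        x3 = self.enc3(F.avg_pool2d(x2, 2))         # 1/4 resolution
        d3 = F.interpolate(self.dec3(x3), scale_factor=2)
        d2 = F.interpolate(self.dec2(torch.cat([d3, x2], dim=1)), scale_factor=2)
        return self.out(torch.cat([d2, x1], dim=1)) # interpolated middle frame

net = ThreeScaleInterp()
mid = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))  # (1, 3, 64, 64)
```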

Dou et al. introduce a decoder structure-guided CNN-Transformer network for face super-resolution. Most previous approaches follow a multi-task learning paradigm that performs landmark detection while super-resolving the low-resolution images. However, these methods incur additional annotation cost, and the extracted facial prior structures are usually of low quality. To address these issues, the proposed network employs a global-local feature extraction unit to extract the global structure while capturing local texture details. In addition, a multi-state fusion module is incorporated to aggregate embeddings from different stages. Experiments show that the proposed method surpasses previous approaches by notable margins.
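
The sketch below shows one plausible form of a global-local feature extraction unit, pairing a convolutional branch (local texture) with a self-attention branch (global structure) and fusing the two; it is an illustrative stand-in with assumed layer sizes, not the module of Dou et al.

```python
import torch
import torch.nn as nn

class GlobalLocalUnit(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)                   # local texture details
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) tokens for attention
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)  # global facial structure
        return self.fuse(torch.cat([local, glob], dim=1))

unit = GlobalLocalUnit()
y = unit(torch.rand(1, 64, 16, 16))             # (1, 64, 16, 16)
```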

Yang et al. study the problem of blind super-resolution and propose a method that exploits degradation information through degradation representation learning. Specifically, a generative adversarial network is employed to model the degradation process from HR images to LR images and to constrain the data distribution of the synthetic LR images. The learnt degradation representation is then used to super-resolve the input low-resolution images with a transformer-based SR network. Experiments on both synthetic and real-world datasets demonstrate the effectiveness and superiority of the proposed method.
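
As a simplified illustration of how a learnt degradation representation can condition an SR network, the sketch below encodes the LR input into an embedding and uses it to modulate features channel-wise; the GAN-based degradation modelling and the transformer backbone of Yang et al. are omitted, and all names and sizes here are our own assumptions.

```python
import torch
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Maps an LR image to a compact degradation embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, lr):                       # lr: (B, 3, h, w)
        return self.net(lr)                      # degradation embedding: (B, dim)

class ConditionedSRBlock(nn.Module):
    """Modulates image features with the degradation embedding (scale and shift)."""
    def __init__(self, feat_dim=64, emb_dim=64):
        super().__init__()
        self.to_affine = nn.Linear(emb_dim, 2 * feat_dim)
        self.conv = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)

    def forward(self, feat, emb):                # feat: (B, C, H, W), emb: (B, E)
        gamma, beta = self.to_affine(emb).chunk(2, dim=1)
        feat = feat * gamma[:, :, None, None] + beta[:, :, None, None]
        return torch.relu(self.conv(feat))

enc, block = DegradationEncoder(), ConditionedSRBlock()
lr = torch.rand(2, 3, 24, 24)
feat = torch.rand(2, 64, 24, 24)
out = block(feat, enc(lr))                       # degradation-aware features
```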
