{"title":"Transformer-based image and video inpainting: current challenges and future directions","authors":"Omar Elharrouss, Rafat Damseh, Abdelkader Nasreddine Belkacem, Elarbi Badidi, Abderrahmane Lakas","doi":"10.1007/s10462-024-11075-9","DOIUrl":null,"url":null,"abstract":"<div><p>Image inpainting is currently a hot topic within the field of computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Deep learning advancements, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly enhanced the inpainting task with an improved capability to fill missing or damaged regions in an image or a video through the incorporation of contextually appropriate details. These advancements have improved other aspects, including efficiency, information preservation, and achieving both realistic textures and structures. Recently, Vision Transformers (ViTs) have been exploited and offer some improvements to image or video inpainting. The advent of transformer-based architectures, which were initially designed for natural language processing, has also been integrated into computer vision tasks. These methods utilize self-attention mechanisms that excel in capturing long-range dependencies within data; therefore, they are particularly effective for tasks requiring a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of the current image/video inpainting approaches, with a specific focus on Vision Transformer (ViT) techniques, with the goal to highlight the significant improvements and provide a guideline for new researchers in the field of image/video inpainting using vision transformers. We categorized the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges, and suggest directions for future research in the field of image or video inpainting.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 4","pages":""},"PeriodicalIF":10.7000,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-024-11075-9.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-024-11075-9","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Image inpainting is currently a hot topic within the field of computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Deep learning advancements, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly enhanced the inpainting task, improving the capability to fill missing or damaged regions in an image or a video with contextually appropriate details. These advancements have also improved efficiency, information preservation, and the realism of the generated textures and structures. Recently, Vision Transformers (ViTs) have been applied to image and video inpainting, yielding further improvements. Transformer-based architectures, initially designed for natural language processing, have since been adopted for computer vision tasks. These methods rely on self-attention mechanisms, which excel at capturing long-range dependencies within data; they are therefore particularly effective for tasks requiring a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of current image/video inpainting approaches, with a specific focus on Vision Transformer (ViT) techniques, with the goal of highlighting the significant improvements and providing a guideline for new researchers in the field of image/video inpainting using vision transformers. We categorize the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges and suggest directions for future research in the field of image or video inpainting.
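To make the self-attention mechanism mentioned above concrete, here is a minimal, framework-agnostic NumPy sketch of scaled dot-product self-attention over image-patch embeddings. It is an illustrative toy, not the method of any surveyed paper: the function name, the patch count, the embedding dimension, and the random projection matrices are all assumptions chosen for the example.

```python
# A minimal sketch of scaled dot-product self-attention over image patches.
# In ViT-style inpainting, each patch (including masked/missing ones) can
# aggregate information from every other patch, which is what gives the
# model its global receptive field.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (num_patches, dim) patch embeddings; w_q/w_k/w_v: (dim, dim) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise patch affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all patches
    return weights @ v                               # context-mixed patch features

rng = np.random.default_rng(0)
n, dim = 64, 16                       # e.g. an 8x8 grid of patches, 16-dim embeddings
x = rng.normal(size=(n, dim))         # toy patch embeddings (masked patches included)
w = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)                      # (64, 16): every patch now carries global context
```

Unlike a CNN, whose receptive field grows only with depth, a single such attention layer already relates every patch to every other patch, which is why these models handle large missing regions well.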
Journal information:
Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.