Wav2Lip-HR: Synthesising clear high-resolution talking head in the wild

IF 1.7 4区计算机科学 Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING Computer Animation and Virtual Worlds Pub Date : 2023-12-15 DOI:10.1002/cav.2226

Chao Liang, Qinghua Wang, Yunlin Chen, Minjie Tang

{"title":"Wav2Lip-HR: Synthesising clear high-resolution talking head in the wild","authors":"Chao Liang, Qinghua Wang, Yunlin Chen, Minjie Tang","doi":"10.1002/cav.2226","DOIUrl":null,"url":null,"abstract":"<p>Talking head generation aims to synthesize a photo-realistic speaking video with accurate lip motion. While this field has attracted more attention in recent audio-visual researches, most existing methods do not achieve the simultaneous improvement of lip synchronization and visual quality. In this paper, we propose Wav2Lip-HR, a neural-based audio-driven high-resolution talking head generation method. With our technique, all required to generate a clear high-resolution lip sync talking video is an image/video of the target face and an audio clip of any speech. The primary benefit of our method is that it generates clear high-resolution videos with sufficient facial details, rather than the ones just be large-sized with less clarity. We first analyze key factors that limit the clarity of generated videos and then put forth several important solutions to address the problem, including data augmentation, model structure improvement and a more effective loss function. Finally, we employ several efficient metrics to evaluate the clarity of images generated by our proposed approach as well as several widely used metrics to evaluate lip-sync performance. Numerous experiments demonstrate that our method has superior performance on visual quality and lip synchronization when compared to other existing schemes.</p>","PeriodicalId":50645,"journal":{"name":"Computer Animation and Virtual Worlds","volume":"35 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Animation and Virtual Worlds","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cav.2226","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Talking head generation aims to synthesize a photo-realistic speaking video with accurate lip motion. While this field has attracted more attention in recent audio-visual researches, most existing methods do not achieve the simultaneous improvement of lip synchronization and visual quality. In this paper, we propose Wav2Lip-HR, a neural-based audio-driven high-resolution talking head generation method. With our technique, all required to generate a clear high-resolution lip sync talking video is an image/video of the target face and an audio clip of any speech. The primary benefit of our method is that it generates clear high-resolution videos with sufficient facial details, rather than the ones just be large-sized with less clarity. We first analyze key factors that limit the clarity of generated videos and then put forth several important solutions to address the problem, including data augmentation, model structure improvement and a more effective loss function. Finally, we employ several efficient metrics to evaluate the clarity of images generated by our proposed approach as well as several widely used metrics to evaluate lip-sync performance. Numerous experiments demonstrate that our method has superior performance on visual quality and lip synchronization when compared to other existing schemes.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Wav2Lip-HR：在野外合成清晰的高分辨率话头

话头生成的目的是合成具有准确唇部动作的逼真说话视频。虽然这一领域在近年来的视听研究中受到越来越多的关注，但现有的大多数方法并不能同时实现唇部同步和视觉质量的提高。在本文中，我们提出了 Wav2Lip-HR，一种基于神经的音频驱动高分辨率说话头生成方法。利用我们的技术，生成清晰的高分辨率唇语同步视频所需的只是目标面部的图像/视频和任何语音的音频片段。我们的方法的主要优点是，它能生成清晰的高分辨率视频，并能提供足够的面部细节，而不是只生成大尺寸而不太清晰的视频。我们首先分析了限制生成视频清晰度的关键因素，然后提出了几个重要的解决方案来解决这个问题，包括数据增强、模型结构改进和更有效的损失函数。最后，我们采用了几种有效的指标来评估我们提出的方法生成的图像的清晰度，以及几种广泛使用的指标来评估唇语同步性能。大量实验证明，与其他现有方案相比，我们的方法在视觉质量和唇部同步方面表现出色。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Computer Animation and Virtual Worlds 工程技术-计算机：软件工程

CiteScore

2.20

自引率

0.00%

发文量

审稿时长

6-12 weeks

期刊介绍： With the advent of very powerful PCs and high-end graphics cards, there has been an incredible development in Virtual Worlds, real-time computer animation and simulation, games. But at the same time, new and cheaper Virtual Reality devices have appeared allowing an interaction with these real-time Virtual Worlds and even with real worlds through Augmented Reality. Three-dimensional characters, especially Virtual Humans are now of an exceptional quality, which allows to use them in the movie industry. But this is only a beginning, as with the development of Artificial Intelligence and Agent technology, these characters will become more and more autonomous and even intelligent. They will inhabit the Virtual Worlds in a Virtual Life together with animals and plants.