Spatially and Temporally Optimized Audio-Driven Talking Face Generation

IF 2.7 4区 计算机科学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING Computer Graphics Forum Pub Date : 2024-10-24 DOI:10.1111/cgf.15228
Biao Dong, Bo-Yao Ma, Lei Zhang
{"title":"Spatially and Temporally Optimized Audio-Driven Talking Face Generation","authors":"Biao Dong,&nbsp;Bo-Yao Ma,&nbsp;Lei Zhang","doi":"10.1111/cgf.15228","DOIUrl":null,"url":null,"abstract":"<p>Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. And the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, the temporal consistency is achieved through consecutive frames of the each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows for adaptive adjustment in the lip movement for accurate sync. The spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method can generate more natural talking faces than previous state-of-the-art methods in both accurate lip sync and realistic facial details.</p>","PeriodicalId":10687,"journal":{"name":"Computer Graphics Forum","volume":"43 7","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Graphics Forum","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/cgf.15228","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. And the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, the temporal consistency is achieved through consecutive frames of the each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows for adaptive adjustment in the lip movement for accurate sync. The spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method can generate more natural talking faces than previous state-of-the-art methods in both accurate lip sync and realistic facial details.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
时空优化的音频驱动人脸生成技术
音频驱动的人脸识别本质上是从音频到视频帧的跨模态映射。主要挑战在于复杂的一对多映射,这会影响唇语同步的准确性。而图像重建过程中面部细节的丢失往往会导致生成的视频出现视觉假象。为了克服这些挑战,本文提出用一种新的时空一致性来提高生成的会说话人脸的质量。具体来说,时空一致性是通过每个音素的连续帧来实现的,这些连续帧形成的时空模块表现出相似的唇部外观变化。这样就可以自适应地调整嘴唇的运动,实现准确的同步。空间一致性与局部区域内纹理的均匀分布有关,这些纹理形成空间模块,并调节生成器中的纹理分布。这就产生了重建面部图像中的精细细节。广泛的实验表明,我们的方法在准确的唇部同步和逼真的面部细节方面都能生成比以往最先进的方法更自然的说话人脸。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Computer Graphics Forum
Computer Graphics Forum 工程技术-计算机:软件工程
CiteScore
5.80
自引率
12.00%
发文量
175
审稿时长
3-6 weeks
期刊介绍: Computer Graphics Forum is the official journal of Eurographics, published in cooperation with Wiley-Blackwell, and is a unique, international source of information for computer graphics professionals interested in graphics developments worldwide. It is now one of the leading journals for researchers, developers and users of computer graphics in both commercial and academic environments. The journal reports on the latest developments in the field throughout the world and covers all aspects of the theory, practice and application of computer graphics.
期刊最新文献
DiffPop: Plausibility-Guided Object Placement Diffusion for Image Composition Front Matter LGSur-Net: A Local Gaussian Surface Representation Network for Upsampling Highly Sparse Point Cloud 𝒢-Style: Stylized Gaussian Splatting iShapEditing: Intelligent Shape Editing with Diffusion Models
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1