StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

IEEE Journal of Selected Topics in Signal Processing · Impact Factor 8.7 · CAS Region 1 (Engineering) · JCR Q1, Engineering, Electrical & Electronic · Published: 2023-11-16 · DOI: 10.1109/JSTSP.2023.3333552
Jun Ling, Xu Tan, Liyang Chen, Runnan Li, Yuchao Zhang, Sheng Zhao, Li Song
Volume 17, Issue 6, Pages 1232-1247 · Citations: 0 · Full text: https://ieeexplore.ieee.org/document/10319685/

Abstract

While previous methods for speech-driven talking face generation have made significant advances in the visual and lip-sync quality of synthesized videos, they have paid less attention to lip motion jitters, which can substantially undermine the perceived quality of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this article, we conduct systematic analyses of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and we implement several effective designs to improve motion stability. The study finds that several factors can cause jitters in the synthesized talking face video: jitters in the input face representations, training-inference mismatch, and a lack of dependency modeling in the generation network. Accordingly, we propose three solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the input data of the neural renderer during training to simulate inference distortion and reduce the mismatch; and 3) an audio-fused transformer generator to model inter-frame dependency. In addition, since no off-the-shelf metric measures the motion jitters of talking face video, we devise an objective metric (Motion Stability Index, MSI) to quantify them. Extensive experimental results show the superiority of the proposed method for motion-stable talking video generation, with quality superior to previous systems.
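The first solution, Gaussian-based smoothing of the 3D face representations, amounts to temporal low-pass filtering of the per-frame coefficient sequence. The sketch below is illustrative only, not the authors' implementation: the function name, the fixed kernel width, and the replicate padding at sequence edges are assumptions, and the paper's module additionally adapts the smoothing strength, which the abstract does not detail.

```python
import numpy as np

def gaussian_smooth_sequence(coeffs: np.ndarray, sigma: float = 1.0,
                             radius: int = 3) -> np.ndarray:
    """Smooth a (T, D) sequence of per-frame 3D face coefficients along
    time with a truncated, normalized Gaussian kernel.

    Illustrative sketch: fixed sigma (the paper's module is adaptive)
    and replicate padding at the sequence boundaries.
    """
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so constant inputs pass through unchanged

    T = coeffs.shape[0]
    out = np.empty_like(coeffs, dtype=float)
    for t in range(T):
        idx = np.clip(t + offsets, 0, T - 1)  # replicate-pad at the edges
        out[t] = kernel @ coeffs[idx]         # weighted average over the window
    return out
```

Smoothing like this suppresses high-frequency jitter in the input representations at the cost of slightly attenuating fast, legitimate lip motion, which is presumably why the paper makes the smoothing strength adaptive rather than fixed.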
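The abstract does not give the formula for the Motion Stability Index, so the following is a hypothetical jitter proxy in the same spirit: it scores a facial-landmark trajectory by the inverse of its mean frame-to-frame acceleration, so steadier motion yields a higher score. The function name, the acceleration-based formula, and the epsilon floor are all assumptions, not the paper's definition.

```python
import numpy as np

def motion_stability_index(landmarks: np.ndarray, eps: float = 1e-8) -> float:
    """Hypothetical jitter proxy over a (T, N, 2) landmark track.

    Second-order temporal differences approximate per-point acceleration;
    jitter shows up as large acceleration, so the inverse of its mean
    magnitude rises as motion gets steadier. NOT the paper's exact MSI.
    """
    accel = np.diff(landmarks, n=2, axis=0)          # (T-2, N, 2) accelerations
    jitter = np.linalg.norm(accel, axis=-1).mean()   # mean acceleration magnitude
    return 1.0 / (jitter + eps)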
Source journal: IEEE Journal of Selected Topics in Signal Processing (Engineering: Electrical & Electronic)
CiteScore: 19.00
Self-citation rate: 1.30%
Annual publications: 135
Review time: 3 months
About the journal: The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.