StableFace: Analyzing and Improving Motion Stability for Talking Face Generation
Authors: Jun Ling; Xu Tan; Liyang Chen; Runnan Li; Yuchao Zhang; Sheng Zhao; Li Song
Journal: IEEE Journal of Selected Topics in Signal Processing (impact factor 8.7; JCR Q1, Engineering, Electrical & Electronic)
Published: 2023-11-16
DOI: 10.1109/JSTSP.2023.3333552
URL: https://ieeexplore.ieee.org/document/10319685/
Citation count: 0
Abstract
While previous methods for speech-driven talking face generation have significantly improved the visual and lip-sync quality of synthesized videos, they have paid less attention to lip motion jitters, which can substantially undermine the perceived quality of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this article, we systematically analyze the motion jittering problem using a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and we implement several designs to improve motion stability. We find that several factors can lead to jitters in the synthesized talking face video, including jitters in the input face representations, training-inference mismatch, and a lack of dependency modeling in the generation network. Accordingly, we propose three solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the input data of the neural renderer during training, which simulate inference-time distortion and reduce the mismatch; 3) an audio-fused transformer generator that models inter-frame dependency. In addition, because no off-the-shelf metric can measure the motion jitters of talking face videos, we devise an objective metric, the Motion Stability Index (MSI), to quantify them. Extensive experimental results show that the proposed method generates more motion-stable talking face videos, with higher quality than previous systems.
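To make the first of the three solutions concrete, the sketch below applies a plain Gaussian filter along the time axis of a per-frame feature sequence. It is only an illustration under stated assumptions: the abstract does not describe how the adaptive module chooses its smoothing strength, so a fixed `sigma` is used here, and the functions `gaussian_kernel`, `smooth_sequence`, and `jitter_proxy` are hypothetical names introduced for this example. In particular, `jitter_proxy` is a simple second-difference measure and is not the paper's Motion Stability Index, whose definition is not given in the abstract.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = max(1, int(math.ceil(3 * sigma)))  # truncate at about 3 sigma
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_sequence(frames, sigma=1.5):
    """Smooth each coordinate of a sequence of frames along time.

    frames: list of T frames, each a list of D floats (for example,
    flattened 3D face coefficients; this layout is an assumption).
    Sequence edges are handled by clamping the frame index.
    """
    if not frames:
        return []
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    num_frames, dim = len(frames), len(frames[0])
    smoothed = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            src = min(max(t + k - radius, 0), num_frames - 1)
            for d in range(dim):
                acc[d] += w * frames[src][d]
        smoothed.append(acc)
    return smoothed


def jitter_proxy(frames):
    """Mean squared second-order temporal difference.

    A hypothetical stand-in for a jitter measure, NOT the paper's MSI.
    """
    if len(frames) < 3:
        return 0.0
    total, count = 0.0, 0
    for t in range(1, len(frames) - 1):
        for d in range(len(frames[t])):
            accel = frames[t + 1][d] - 2.0 * frames[t][d] + frames[t - 1][d]
            total += accel * accel
            count += 1
    return total / count


# Synthetic example: a slow sinusoid with frame-to-frame alternating noise.
noisy = [[math.sin(0.2 * t) + 0.1 * (-1) ** t] for t in range(50)]
clean = smooth_sequence(noisy, sigma=1.5)
print("jitter before:", jitter_proxy(noisy))
print("jitter after: ", jitter_proxy(clean))
```

A fixed `sigma` trades stability against responsiveness: larger values suppress more jitter but also blur fast, legitimate lip movements, which is presumably why the paper makes its smoothing adaptive.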
About the journal:
The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others.
The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.