JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
arXiv - CS - Computer Vision and Pattern Recognition, 2024-09-18. DOI: arxiv-2409.12156
Abstract
We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
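The key technical ingredient described above is contrastive alignment between learned audio features and lip motion. The paper does not publish its loss in this abstract, so the following is only a minimal, generic InfoNCE-style sketch of such an audio-lip contrastive objective; the function name, embedding shapes, and temperature value are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a generic InfoNCE-style contrastive loss that
# aligns per-frame audio embeddings with lip-motion embeddings, in the
# spirit of the self-supervised audio-feature learning the abstract
# describes. Names and shapes are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def audio_lip_contrastive_loss(audio_emb: torch.Tensor,
                               lip_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, lip_emb: (batch, dim) embeddings of temporally aligned
    audio windows and lip-motion crops. Matching indices are positives;
    every other pair in the batch serves as a negative."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)
    logits = audio_emb @ lip_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy: audio -> lip and lip -> audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under such an objective, the audio embedding is rewarded only for information that predicts the matching lip motion, which is one standard way to encourage the audio/expression disentanglement the abstract emphasizes.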