{"title":"ProbTalk3D:使用 VQ-VAE 进行非确定性情感可控语音驱动三维面部动画合成","authors":"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak","doi":"arxiv-2409.07966","DOIUrl":null,"url":null,"abstract":"Audio-driven 3D facial animation synthesis has been an active field of\nresearch with attention from both academia and industry. While there are\npromising results in this area, recent approaches largely focus on lip-sync and\nidentity control, neglecting the role of emotions and emotion control in the\ngenerative process. That is mainly due to the lack of emotionally rich facial\nanimation data and algorithms that can synthesize speech animations with\nemotional expressions at the same time. In addition, majority of the models are\ndeterministic, meaning given the same audio input, they produce the same output\nmotion. We argue that emotions and non-determinism are crucial to generate\ndiverse and emotionally-rich facial animations. In this paper, we propose\nProbTalk3D a non-deterministic neural network approach for emotion controllable\nspeech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\nan emotionally rich facial animation dataset 3DMEAD. We provide an extensive\ncomparative analysis of our model against the recent 3D facial animation\nsynthesis approaches, by evaluating the results objectively, qualitatively, and\nwith a perceptual user study. We highlight several objective metrics that are\nmore suitable for evaluating stochastic outputs and use both in-the-wild and\nground truth data for subjective evaluation. To our knowledge, that is the\nfirst non-deterministic 3D facial animation synthesis method incorporating a\nrich emotion dataset and emotion control with emotion labels and intensity\nlevels. Our evaluation demonstrates that the proposed model achieves superior\nperformance compared to state-of-the-art emotion-controlled, deterministic and\nnon-deterministic models. We recommend watching the supplementary video for\nquality judgement. The entire codebase is publicly available\n(https://github.com/uuembodiedsocialai/ProbTalk3D/).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE\",\"authors\":\"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak\",\"doi\":\"arxiv-2409.07966\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio-driven 3D facial animation synthesis has been an active field of\\nresearch with attention from both academia and industry. While there are\\npromising results in this area, recent approaches largely focus on lip-sync and\\nidentity control, neglecting the role of emotions and emotion control in the\\ngenerative process. That is mainly due to the lack of emotionally rich facial\\nanimation data and algorithms that can synthesize speech animations with\\nemotional expressions at the same time. In addition, majority of the models are\\ndeterministic, meaning given the same audio input, they produce the same output\\nmotion. We argue that emotions and non-determinism are crucial to generate\\ndiverse and emotionally-rich facial animations. 
In this paper, we propose\\nProbTalk3D a non-deterministic neural network approach for emotion controllable\\nspeech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\\nan emotionally rich facial animation dataset 3DMEAD. We provide an extensive\\ncomparative analysis of our model against the recent 3D facial animation\\nsynthesis approaches, by evaluating the results objectively, qualitatively, and\\nwith a perceptual user study. We highlight several objective metrics that are\\nmore suitable for evaluating stochastic outputs and use both in-the-wild and\\nground truth data for subjective evaluation. To our knowledge, that is the\\nfirst non-deterministic 3D facial animation synthesis method incorporating a\\nrich emotion dataset and emotion control with emotion labels and intensity\\nlevels. Our evaluation demonstrates that the proposed model achieves superior\\nperformance compared to state-of-the-art emotion-controlled, deterministic and\\nnon-deterministic models. We recommend watching the supplementary video for\\nquality judgement. The entire codebase is publicly available\\n(https://github.com/uuembodiedsocialai/ProbTalk3D/).\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"60 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07966\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07966","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
Audio-driven 3D facial animation synthesis has been an active field of
research with attention from both academia and industry. While there are
promising results in this area, recent approaches largely focus on lip-sync and
identity control, neglecting the role of emotions and emotion control in the
generative process. That is mainly due to the lack of emotionally rich facial
animation data and algorithms that can synthesize speech animations with
emotional expressions at the same time. In addition, the majority of models are
deterministic, meaning that given the same audio input, they produce the same output
motion. We argue that emotions and non-determinism are crucial for generating
diverse and emotionally rich facial animations. In this paper, we propose
ProbTalk3D, a non-deterministic neural network approach for emotion controllable
speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and
an emotionally rich facial animation dataset, 3DMEAD. We provide an extensive
comparative analysis of our model against recent 3D facial animation
synthesis approaches, evaluating the results objectively, qualitatively, and
with a perceptual user study. We highlight several objective metrics that are
more suitable for evaluating stochastic outputs and use both in-the-wild and
ground truth data for subjective evaluation. To our knowledge, this is the
first non-deterministic 3D facial animation synthesis method incorporating a
rich emotion dataset and emotion control with emotion labels and intensity
levels. Our evaluation demonstrates that the proposed model achieves superior
performance compared to state-of-the-art emotion-controlled, deterministic, and
non-deterministic models. We recommend watching the supplementary video for
quality judgement. The entire codebase is publicly available
(https://github.com/uuembodiedsocialai/ProbTalk3D/).
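
For readers unfamiliar with VQ-VAEs, the short PyTorch sketch below illustrates the vector-quantization step that such a model relies on: a continuous latent motion sequence is snapped to its nearest codebook entries, with a straight-through estimator carrying gradients back to the encoder. All class names, tensor shapes, and hyperparameters here are illustrative assumptions rather than the authors' implementation; refer to the linked ProbTalk3D repository for the actual two-stage model.

# Minimal sketch of the vector-quantization step at the heart of a VQ-VAE.
# Names, shapes, and hyperparameters are hypothetical, not the ProbTalk3D code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Maps continuous latent frames to their nearest codebook entries."""

    def __init__(self, num_codes: int = 256, code_dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, code_dim) continuous encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])                      # (B*T, D)
        # Squared L2 distance from every frame latent to every codebook vector
        dist = (
            flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )
        indices = dist.argmin(dim=1)                               # nearest code per frame
        z_q = self.codebook(indices).view_as(z_e)                  # quantized latents
        # Codebook loss + commitment loss (standard VQ-VAE objective)
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss


if __name__ == "__main__":
    vq = VectorQuantizer()
    motion_latents = torch.randn(2, 30, 128)   # e.g. 2 clips, 30 frames of facial-motion latents
    quantized, codes, vq_loss = vq(motion_latents)
    print(quantized.shape, codes.shape, vq_loss.item())

In a two-stage setup of the kind the abstract describes, a quantizer like this would first be trained as part of a motion autoencoder (stage one), after which a separate audio- and emotion-conditioned network learns to predict the discrete code indices (stage two), which is where the non-determinism and emotion control enter.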