FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

Kazi Injamamul Haque, Zerrin Yumak
DOI: 10.1145/3577190.3614157 (https://doi.org/10.1145/3577190.3614157)
Published: 2023-10-09
Source: Companion Publication of the 2020 International Conference on Multimodal Interaction
Citations: 4

Abstract

This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that generates facial cues driven by an emotional expressiveness condition. It can also handle audio recorded in a variety of situations (e.g. background noise, multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate 3D facial animation. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still have issues with accurate lip-syncing, emotional expressivity, person-specific facial cues and generalizability. In this work, we first achieve better results than the state of the art on the speech-driven 3D facial animation generation task by employing the self-supervised pretrained HuBERT speech model, which makes it possible to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Second, we incorporate an emotional expressiveness modality by guiding the network with a binary emotion condition. We carried out extensive objective and subjective evaluations in comparison to ground truth and the state of the art. A perceptual user study demonstrates that expressively generated facial animations using our approach are indeed perceived as more realistic and are preferred over the non-expressive ones. In addition, we show that a strong audio encoder alone eliminates the need for a complex decoder in the network architecture, significantly reducing network complexity and training time. We provide the code publicly and recommend watching the video.
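
The idea of guiding a network with a binary emotion condition on top of per-frame audio features can be illustrated with a toy example. The sketch below is not the authors' architecture: the feature size, the vertex count, the additive conditioning, the single linear decoder, and every name in it (`init_params`, `animate`, `emotion_table`) are assumptions made only for illustration, and random numbers stand in for real speech-encoder output.

```python
import numpy as np

# All names and sizes below are hypothetical, chosen only for illustration.
FEATURE_DIM = 16   # stand-in for the audio encoder's feature size
NUM_VERTICES = 5   # stand-in for the number of face mesh vertices


def init_params(seed=0):
    """Randomly initialise a toy conditioning table and a linear decoder."""
    rng = np.random.default_rng(seed)
    return {
        # One vector per emotion state: row 0 = neutral, row 1 = expressive.
        "emotion_table": rng.normal(size=(2, FEATURE_DIM)),
        # A single linear layer as a deliberately simple decoder.
        "w_out": rng.normal(size=(FEATURE_DIM, NUM_VERTICES * 3)) * 0.1,
        "b_out": np.zeros(NUM_VERTICES * 3),
    }


def animate(audio_features, emotion, params):
    """Map per-frame audio features to per-frame 3D vertex offsets.

    audio_features: array of shape (frames, FEATURE_DIM)
    emotion: 0 or 1, the binary expressiveness condition
    returns: array of shape (frames, NUM_VERTICES, 3)
    """
    if emotion not in (0, 1):
        raise ValueError("emotion must be 0 or 1")
    # Broadcast the selected emotion vector over all frames (assumed additive conditioning).
    conditioned = audio_features + params["emotion_table"][emotion]
    flat = conditioned @ params["w_out"] + params["b_out"]
    return flat.reshape(audio_features.shape[0], NUM_VERTICES, 3)


params = init_params()
# Random features standing in for the output of a pretrained speech encoder.
features = np.random.default_rng(1).normal(size=(10, FEATURE_DIM))
neutral = animate(features, 0, params)
expressive = animate(features, 1, params)
print(neutral.shape)
```

With the same audio features, switching the condition from 0 to 1 changes the predicted offsets for every frame, which is the behaviour a binary expressiveness switch is meant to provide. A real system would learn these parameters from paired audio and animation data rather than sample them at random.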