PersonaTalk: Bring Attention to Your Persona in Visual Dubbing

Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu
{"title":"PersonaTalk:在视觉配音中关注您的角色","authors":"Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu","doi":"arxiv-2409.05379","DOIUrl":null,"url":null,"abstract":"For audio-driven visual dubbing, it remains a considerable challenge to\nuphold and highlight speaker's persona while synthesizing accurate lip\nsynchronization. Existing methods fall short of capturing speaker's unique\nspeaking style or preserving facial details. In this paper, we present\nPersonaTalk, an attention-based two-stage framework, including geometry\nconstruction and face rendering, for high-fidelity and personalized visual\ndubbing. In the first stage, we propose a style-aware audio encoding module\nthat injects speaking style into audio features through a cross-attention\nlayer. The stylized audio features are then used to drive speaker's template\ngeometry to obtain lip-synced geometries. In the second stage, a dual-attention\nface renderer is introduced to render textures for the target geometries. It\nconsists of two parallel cross-attention layers, namely Lip-Attention and\nFace-Attention, which respectively sample textures from different reference\nframes to render the entire face. With our innovative design, intricate facial\ndetails can be well preserved. Comprehensive experiments and user studies\ndemonstrate our advantages over other state-of-the-art methods in terms of\nvisual quality, lip-sync accuracy and persona preservation. Furthermore, as a\nperson-generic framework, PersonaTalk can achieve competitive performance as\nstate-of-the-art person-specific methods. Project Page:\nhttps://grisoon.github.io/PersonaTalk/.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PersonaTalk: Bring Attention to Your Persona in Visual Dubbing\",\"authors\":\"Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu\",\"doi\":\"arxiv-2409.05379\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For audio-driven visual dubbing, it remains a considerable challenge to\\nuphold and highlight speaker's persona while synthesizing accurate lip\\nsynchronization. Existing methods fall short of capturing speaker's unique\\nspeaking style or preserving facial details. In this paper, we present\\nPersonaTalk, an attention-based two-stage framework, including geometry\\nconstruction and face rendering, for high-fidelity and personalized visual\\ndubbing. In the first stage, we propose a style-aware audio encoding module\\nthat injects speaking style into audio features through a cross-attention\\nlayer. The stylized audio features are then used to drive speaker's template\\ngeometry to obtain lip-synced geometries. In the second stage, a dual-attention\\nface renderer is introduced to render textures for the target geometries. It\\nconsists of two parallel cross-attention layers, namely Lip-Attention and\\nFace-Attention, which respectively sample textures from different reference\\nframes to render the entire face. With our innovative design, intricate facial\\ndetails can be well preserved. Comprehensive experiments and user studies\\ndemonstrate our advantages over other state-of-the-art methods in terms of\\nvisual quality, lip-sync accuracy and persona preservation. 
Furthermore, as a\\nperson-generic framework, PersonaTalk can achieve competitive performance as\\nstate-of-the-art person-specific methods. Project Page:\\nhttps://grisoon.github.io/PersonaTalk/.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05379\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05379","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight the speaker's persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing the speaker's unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, comprising geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive the speaker's template geometry to obtain lip-synced geometries. In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which sample textures from different sets of reference frames to render the entire face. With our innovative design, intricate facial details are well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy, and persona preservation. Furthermore, as a person-generic framework, PersonaTalk achieves performance competitive with state-of-the-art person-specific methods. Project Page: https://grisoon.github.io/PersonaTalk/.
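To make the two cross-attention mechanisms in the abstract concrete, the sketch below illustrates them in PyTorch. This is a minimal, hypothetical rendering of the ideas only: the class names, feature dimensions, token shapes, and the residual and fusion wiring are all our assumptions, not the authors' released implementation (consult the project page for the actual architecture).

```python
# Minimal sketch of the two cross-attention ideas described in the abstract.
# All module names, dimensions, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class StyleAwareAudioEncoding(nn.Module):
    """Inject a speaking-style representation into audio features via
    cross-attention: audio frames are queries, style tokens are keys/values."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, style_tokens: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, dim); style_tokens: (B, T_style, dim)
        stylized, _ = self.attn(query=audio_feats, key=style_tokens, value=style_tokens)
        # Residual connection keeps the original phonetic content intact.
        return self.norm(audio_feats + stylized)


class DualAttentionRenderer(nn.Module):
    """Two parallel cross-attention branches: Lip-Attention samples from
    lip-focused reference features, Face-Attention from full-face reference
    features; outputs are fused to texture the target geometry."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.lip_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, geom_feats, lip_refs, face_refs):
        # geom_feats: (B, N, dim) features of the lip-synced target geometry
        lip_tex, _ = self.lip_attn(geom_feats, lip_refs, lip_refs)
        face_tex, _ = self.face_attn(geom_feats, face_refs, face_refs)
        return self.fuse(torch.cat([lip_tex, face_tex], dim=-1))


if __name__ == "__main__":
    B, dim = 2, 256
    audio = torch.randn(B, 100, dim)      # per-frame audio features
    style = torch.randn(B, 16, dim)       # speaking-style tokens
    geom = torch.randn(B, 512, dim)       # target-geometry features
    lip_refs = torch.randn(B, 512, dim)   # lip-region reference textures
    face_refs = torch.randn(B, 512, dim)  # full-face reference textures

    stylized_audio = StyleAwareAudioEncoding(dim)(audio, style)
    textures = DualAttentionRenderer(dim)(geom, lip_refs, face_refs)
    print(stylized_audio.shape, textures.shape)  # (2, 100, 256) (2, 512, 256)
```

The point of the parallel branches, as the abstract describes them, is that lip texture and overall facial identity can be sampled from different reference frames, so accurate mouth shapes need not come at the cost of facial detail.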