ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
{"title":"ReSyncer:基于风格的重新布线生成器,用于统一的视听同步面部表演者","authors":"Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu","doi":"arxiv-2408.03284","DOIUrl":null,"url":null,"abstract":"Lip-syncing videos with given audio is the foundation for various\napplications including the creation of virtual presenters or performers. While\nrecent studies explore high-fidelity lip-sync with different techniques, their\ntask-orientated models either require long-term videos for clip-specific\ntraining or retain visible artifacts. In this paper, we propose a unified and\neffective framework ReSyncer, that synchronizes generalized audio-visual facial\ninformation. The key design is revisiting and rewiring the Style-based\ngenerator to efficiently adopt 3D facial dynamics predicted by a principled\nstyle-injected Transformer. By simply re-configuring the information insertion\nmechanisms within the noise and style space, our framework fuses motion and\nappearance with unified training. Extensive experiments demonstrate that\nReSyncer not only produces high-fidelity lip-synced videos according to audio,\nbut also supports multiple appealing properties that are suitable for creating\nvirtual presenters and performers, including fast personalized fine-tuning,\nvideo-driven lip-syncing, the transfer of speaking styles, and even face\nswapping. Resources can be found at\nhttps://guanjz20.github.io/projects/ReSyncer.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer\",\"authors\":\"Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu\",\"doi\":\"arxiv-2408.03284\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lip-syncing videos with given audio is the foundation for various\\napplications including the creation of virtual presenters or performers. While\\nrecent studies explore high-fidelity lip-sync with different techniques, their\\ntask-orientated models either require long-term videos for clip-specific\\ntraining or retain visible artifacts. In this paper, we propose a unified and\\neffective framework ReSyncer, that synchronizes generalized audio-visual facial\\ninformation. The key design is revisiting and rewiring the Style-based\\ngenerator to efficiently adopt 3D facial dynamics predicted by a principled\\nstyle-injected Transformer. By simply re-configuring the information insertion\\nmechanisms within the noise and style space, our framework fuses motion and\\nappearance with unified training. Extensive experiments demonstrate that\\nReSyncer not only produces high-fidelity lip-synced videos according to audio,\\nbut also supports multiple appealing properties that are suitable for creating\\nvirtual presenters and performers, including fast personalized fine-tuning,\\nvideo-driven lip-syncing, the transfer of speaking styles, and even face\\nswapping. 
Resources can be found at\\nhttps://guanjz20.github.io/projects/ReSyncer.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"59 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.03284\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03284","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, speaking-style transfer, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.
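To make the abstract's two key ideas concrete, below is a minimal PyTorch sketch, not the authors' implementation, of (1) a Transformer layer whose normalization is modulated by an injected style embedding, predicting 3D facial dynamics from audio, and (2) a StyleGAN2-like synthesis block "rewired" so the per-layer noise input carries a motion feature map while the style vector modulates appearance. All module names, dimensions, the AdaLN-style injection, and the use of 3DMM expression coefficients are assumptions for illustration.

```python
# Hedged sketch of the two components named in the abstract; module names,
# sizes, and injection mechanisms are illustrative assumptions, not the
# paper's actual architecture.
import torch
import torch.nn as nn

class StyleInjectedTransformerLayer(nn.Module):
    """Transformer encoder layer whose LayerNorms are modulated by a
    speaking-style embedding (AdaLN-style injection, assumed)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Style embedding -> per-layer scale/shift for both norms.
        self.to_mod = nn.Linear(dim, 4 * dim)

    def forward(self, x, style):
        s1, b1, s2, b2 = self.to_mod(style).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

class Audio2Dynamics(nn.Module):
    """Audio features + style embedding -> per-frame expression coefficients
    (a generic stand-in for the predicted 3D facial dynamics)."""
    def __init__(self, audio_dim=80, dim=256, n_exp=64, depth=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, dim)
        self.layers = nn.ModuleList(
            [StyleInjectedTransformerLayer(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, n_exp)

    def forward(self, audio, style):
        x = self.proj(audio)
        for layer in self.layers:
            x = layer(x, style)
        return self.head(x)

class RewiredSynthesisBlock(nn.Module):
    """StyleGAN2-like block where the usual random-noise input is replaced
    by a motion feature map (e.g., rendered from the predicted 3D face),
    while the style vector keeps modulating appearance."""
    def __init__(self, in_ch, out_ch, style_dim=256, motion_ch=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.to_scale = nn.Linear(style_dim, in_ch)       # style modulation
        self.motion_in = nn.Conv2d(motion_ch, out_ch, 1)  # replaces noise
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, style, motion_map):
        x = x * (1 + self.to_scale(style)[:, :, None, None])
        x = self.conv(x)
        x = x + self.motion_in(motion_map)  # motion enters the noise path
        return self.act(x)

# Shape check under the assumed sizes:
audio = torch.randn(2, 25, 80)        # batch, frames, mel features
style = torch.randn(2, 256)           # speaking-style / appearance embedding
coeffs = Audio2Dynamics()(audio, style)           # -> (2, 25, 64)
feat = torch.randn(2, 128, 32, 32)
motion = torch.randn(2, 3, 32, 32)    # rendered motion map for one frame
out = RewiredSynthesisBlock(128, 128)(feat, style, motion)
print(coeffs.shape, out.shape)
```

The point of the sketch is the rewiring itself: because motion enters through the spatial (noise) path and appearance through the style path, a single generator can be trained once on both signals, which is consistent with the abstract's claim of fusing motion and appearance under unified training.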