{"title":"ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer","authors":"Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu","doi":"arxiv-2408.03284","DOIUrl":null,"url":null,"abstract":"Lip-syncing videos with given audio is the foundation for various\napplications including the creation of virtual presenters or performers. While\nrecent studies explore high-fidelity lip-sync with different techniques, their\ntask-orientated models either require long-term videos for clip-specific\ntraining or retain visible artifacts. In this paper, we propose a unified and\neffective framework ReSyncer, that synchronizes generalized audio-visual facial\ninformation. The key design is revisiting and rewiring the Style-based\ngenerator to efficiently adopt 3D facial dynamics predicted by a principled\nstyle-injected Transformer. By simply re-configuring the information insertion\nmechanisms within the noise and style space, our framework fuses motion and\nappearance with unified training. Extensive experiments demonstrate that\nReSyncer not only produces high-fidelity lip-synced videos according to audio,\nbut also supports multiple appealing properties that are suitable for creating\nvirtual presenters and performers, including fast personalized fine-tuning,\nvideo-driven lip-syncing, the transfer of speaking styles, and even face\nswapping. Resources can be found at\nhttps://guanjz20.github.io/projects/ReSyncer.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03284","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Lip-syncing videos to given audio is the foundation of various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long videos for clip-specific training or retain visible artifacts. In this paper, we propose ReSyncer, a unified and effective framework that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply reconfiguring the information insertion mechanisms within the noise and style spaces, our framework fuses motion and appearance under unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos driven by audio, but also supports multiple appealing properties suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.
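
To make the "rewiring" idea concrete, below is a minimal, hypothetical PyTorch sketch of how appearance can be injected through the style (W) space of a StyleGAN2-like synthesis block while spatial motion features, e.g. rendered from predicted 3D facial dynamics, take the place of the per-layer noise inputs. All module names, dimensions, and the simplified modulation are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: appearance enters via the style (W) space through
# modulated convolutions; motion enters via the former "noise" branch as
# spatial feature maps. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModulatedConv(nn.Module):
    """StyleGAN2-like weight modulation/demodulation (simplified)."""
    def __init__(self, in_ch, out_ch, style_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.affine = nn.Linear(style_dim, in_ch)  # style -> per-channel scale

    def forward(self, x, w):
        b, in_ch, h, wd = x.shape
        scale = self.affine(w).view(b, 1, in_ch, 1, 1) + 1.0
        weight = self.weight.unsqueeze(0) * scale               # modulate
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + 1e-8)
        weight = weight * demod.view(b, -1, 1, 1, 1)            # demodulate
        # Grouped conv trick: one group per sample in the batch.
        x = x.reshape(1, b * in_ch, h, wd)
        weight = weight.reshape(-1, in_ch, *self.weight.shape[2:])
        out = F.conv2d(x, weight, padding=1, groups=b)
        return out.reshape(b, -1, h, wd)


class RewiredSynthesisBlock(nn.Module):
    """One synthesis block: the style space carries appearance, while the
    former noise branch carries motion features instead of random noise."""
    def __init__(self, in_ch, out_ch, style_dim, motion_ch):
        super().__init__()
        self.conv = ModulatedConv(in_ch, out_ch, style_dim)
        self.to_feat = nn.Conv2d(motion_ch, out_ch, 1)  # project motion maps
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, w, motion_map):
        x = self.conv(x, w)
        # "Rewired" insertion: add motion-conditioned features where a
        # StyleGAN block would normally add per-pixel Gaussian noise.
        x = x + self.to_feat(F.interpolate(motion_map, size=x.shape[-2:]))
        return self.act(x)


if __name__ == "__main__":
    block = RewiredSynthesisBlock(in_ch=64, out_ch=64, style_dim=512, motion_ch=3)
    x = torch.randn(2, 64, 32, 32)       # incoming feature maps
    w = torch.randn(2, 512)              # appearance code in W space
    motion = torch.randn(2, 3, 64, 64)   # e.g. rendered 3D facial dynamics
    print(block(x, w, motion).shape)     # torch.Size([2, 64, 32, 32])
```

Under this assumed split, appearance and motion remain disentangled by construction: swapping the W-space code changes identity (enabling face swapping), while swapping the motion maps changes lip and facial dynamics (enabling audio- or video-driven lip-syncing), which is consistent with the unified-training property the abstract describes.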