SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu
{"title":"SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing","authors":"Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu","doi":"arxiv-2409.03605","DOIUrl":null,"url":null,"abstract":"Audio-driven talking face generation aims to synthesize video with lip\nmovements synchronized to input audio. However, current generative techniques\nface challenges in preserving intricate regional textures (skin, teeth). To\naddress the aforementioned challenges, we propose a novel framework called\nSegTalker to decouple lip movements and image textures by introducing\nsegmentation as intermediate representation. Specifically, given the mask of\nimage employed by a parsing network, we first leverage the speech to drive the\nmask and generate talking segmentation. Then we disentangle semantic regions of\nimage into style codes using a mask-guided encoder. Ultimately, we inject the\npreviously generated talking segmentation and style codes into a mask-guided\nStyleGAN to synthesize video frame. In this way, most of textures are fully\npreserved. Moreover, our approach can inherently achieve background separation\nand facilitate mask-guided facial local editing. In particular, by editing the\nmask and swapping the region textures from a given reference image (e.g. hair,\nlip, eyebrows), our approach enables facial editing seamlessly when generating\ntalking face video. Experiments demonstrate that our proposed approach can\neffectively preserve texture details and generate temporally consistent video\nwhile remaining competitive in lip synchronization. Quantitative and\nqualitative results on the HDTF and MEAD datasets illustrate the superior\nperformance of our method over existing methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03605","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address these challenges, we propose a novel framework called SegTalker that decouples lip movements from image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of an image produced by a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. We then disentangle the semantic regions of the image into style codes using a mask-guided encoder. Finally, we inject the previously generated talking segmentation and the style codes into a mask-guided StyleGAN to synthesize the video frames. In this way, most of the textures are fully preserved. Moreover, our approach inherently achieves background separation and facilitates mask-guided local facial editing. In particular, by editing the mask and swapping region textures from a given reference image (e.g., hair, lips, eyebrows), our approach enables seamless facial editing while generating the talking face video. Experiments demonstrate that our proposed approach effectively preserves texture details and generates temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
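The abstract describes a staged inference pipeline: a parsing network yields a semantic mask, speech drives that mask into a talking segmentation, a mask-guided encoder extracts per-region style codes, and a mask-guided StyleGAN paints the talking segmentation with those codes. The sketch below illustrates only that data flow and is not the authors' implementation: the module definitions, the number of parsing classes, the style-code and audio feature dimensions, and the region-pooling and scatter operations are all illustrative assumptions standing in for the paper's parsing network, audio-driven mask generator, encoder, and generator.

```python
# Minimal sketch of the SegTalker-style inference flow described in the abstract.
# All module names, tensor shapes, and the number of parsing classes are
# illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn

N_REGIONS = 19   # assumed number of face-parsing classes (skin, lips, hair, ...)
STYLE_DIM = 512  # assumed per-region style-code dimension
AUDIO_DIM = 80   # assumed per-frame audio feature dimension

class TalkingSegmentation(nn.Module):
    """Speech-driven mask generator: audio features + reference mask -> talking mask."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, N_REGIONS)
        self.fuse = nn.Conv2d(2 * N_REGIONS, N_REGIONS, kernel_size=3, padding=1)

    def forward(self, audio_feat, ref_mask):
        # Broadcast the audio embedding over the spatial grid and fuse it with the mask.
        a = self.audio_proj(audio_feat)              # (B, N_REGIONS)
        a = a[:, :, None, None].expand_as(ref_mask)  # (B, N_REGIONS, H, W)
        return self.fuse(torch.cat([ref_mask, a], dim=1)).softmax(dim=1)

class MaskGuidedEncoder(nn.Module):
    """Pools image features inside each semantic region into one style code per region."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, STYLE_DIM, kernel_size=3, padding=1)

    def forward(self, image, mask):
        feat = self.backbone(image)                  # (B, STYLE_DIM, H, W)
        # Region-wise average pooling: one style code per semantic class.
        weights = mask / (mask.sum(dim=(2, 3), keepdim=True) + 1e-6)
        return torch.einsum('bchw,bkhw->bkc', feat, weights)  # (B, N_REGIONS, STYLE_DIM)

class MaskGuidedGenerator(nn.Module):
    """Paints each region of the talking mask with its corresponding style code."""
    def __init__(self):
        super().__init__()
        self.to_rgb = nn.Conv2d(STYLE_DIM, 3, kernel_size=1)

    def forward(self, talking_mask, style_codes):
        # Scatter the style codes back onto the spatial layout given by the mask.
        feat = torch.einsum('bkhw,bkc->bchw', talking_mask, style_codes)
        return torch.tanh(self.to_rgb(feat))

if __name__ == "__main__":
    B, H, W = 1, 64, 64
    image = torch.randn(B, 3, H, W)                            # reference frame
    ref_mask = torch.rand(B, N_REGIONS, H, W).softmax(dim=1)   # from a parsing network
    audio = torch.randn(B, AUDIO_DIM)                          # per-frame audio feature

    talking_mask = TalkingSegmentation()(audio, ref_mask)
    styles = MaskGuidedEncoder()(image, ref_mask)
    # Local editing (illustrative): swap one region's style code with a code taken
    # from a reference image before decoding, e.g. styles[:, hair_id] = ref_styles[:, hair_id].
    frame = MaskGuidedGenerator()(talking_mask, styles)
    print(frame.shape)                                         # torch.Size([1, 3, 64, 64])
```

In this sketch the style codes carry region textures while the talking mask alone carries the speech-driven structure, so replacing one region's code changes only that region's appearance, which is consistent with the abstract's claim that mask-guided local editing can be performed while generating the talking video.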