Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu
{"title":"PersonaTalk:在视觉配音中关注您的角色","authors":"Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu","doi":"arxiv-2409.05379","DOIUrl":null,"url":null,"abstract":"For audio-driven visual dubbing, it remains a considerable challenge to\nuphold and highlight speaker's persona while synthesizing accurate lip\nsynchronization. Existing methods fall short of capturing speaker's unique\nspeaking style or preserving facial details. In this paper, we present\nPersonaTalk, an attention-based two-stage framework, including geometry\nconstruction and face rendering, for high-fidelity and personalized visual\ndubbing. In the first stage, we propose a style-aware audio encoding module\nthat injects speaking style into audio features through a cross-attention\nlayer. The stylized audio features are then used to drive speaker's template\ngeometry to obtain lip-synced geometries. In the second stage, a dual-attention\nface renderer is introduced to render textures for the target geometries. It\nconsists of two parallel cross-attention layers, namely Lip-Attention and\nFace-Attention, which respectively sample textures from different reference\nframes to render the entire face. With our innovative design, intricate facial\ndetails can be well preserved. Comprehensive experiments and user studies\ndemonstrate our advantages over other state-of-the-art methods in terms of\nvisual quality, lip-sync accuracy and persona preservation. Furthermore, as a\nperson-generic framework, PersonaTalk can achieve competitive performance as\nstate-of-the-art person-specific methods. Project Page:\nhttps://grisoon.github.io/PersonaTalk/.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PersonaTalk: Bring Attention to Your Persona in Visual Dubbing\",\"authors\":\"Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu\",\"doi\":\"arxiv-2409.05379\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For audio-driven visual dubbing, it remains a considerable challenge to\\nuphold and highlight speaker's persona while synthesizing accurate lip\\nsynchronization. Existing methods fall short of capturing speaker's unique\\nspeaking style or preserving facial details. In this paper, we present\\nPersonaTalk, an attention-based two-stage framework, including geometry\\nconstruction and face rendering, for high-fidelity and personalized visual\\ndubbing. In the first stage, we propose a style-aware audio encoding module\\nthat injects speaking style into audio features through a cross-attention\\nlayer. The stylized audio features are then used to drive speaker's template\\ngeometry to obtain lip-synced geometries. In the second stage, a dual-attention\\nface renderer is introduced to render textures for the target geometries. It\\nconsists of two parallel cross-attention layers, namely Lip-Attention and\\nFace-Attention, which respectively sample textures from different reference\\nframes to render the entire face. With our innovative design, intricate facial\\ndetails can be well preserved. Comprehensive experiments and user studies\\ndemonstrate our advantages over other state-of-the-art methods in terms of\\nvisual quality, lip-sync accuracy and persona preservation. 
Furthermore, as a\\nperson-generic framework, PersonaTalk can achieve competitive performance as\\nstate-of-the-art person-specific methods. Project Page:\\nhttps://grisoon.github.io/PersonaTalk/.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05379\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05379","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
PersonaTalk: Bring Attention to Your Persona in Visual Dubbing
For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight the speaker's persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing the speaker's unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, comprising geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive the speaker's template geometry to obtain lip-synced geometries.
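To make the style-injection idea concrete, here is a minimal, hypothetical sketch of cross-attention where audio features act as queries over per-speaker style tokens. The module name, tensor shapes, and the residual fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class StyleAwareAudioEncoder(nn.Module):
    """Illustrative sketch: inject speaking style into audio features via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio features are queries; style tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, style_tokens: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, dim); style_tokens: (B, T_style, dim)
        style_ctx, _ = self.cross_attn(audio_feats, style_tokens, style_tokens)
        # Residual add (an assumption) keeps the audio content while adding style cues.
        return self.norm(audio_feats + style_ctx)


# Toy usage with random tensors.
enc = StyleAwareAudioEncoder()
audio = torch.randn(2, 50, 256)   # 50 audio frames
style = torch.randn(2, 16, 256)   # 16 style tokens from the speaker's reference speech
stylized = enc(audio, style)
print(stylized.shape)  # torch.Size([2, 50, 256])
```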
In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which respectively sample textures from different reference frames to render the entire face. With our innovative design, intricate facial details can be well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy, and persona preservation. Furthermore, as a person-generic framework, PersonaTalk achieves performance competitive with state-of-the-art person-specific methods. Project page: https://grisoon.github.io/PersonaTalk/.
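The dual-attention rendering can likewise be sketched as two parallel cross-attention branches whose outputs are fused. In this hypothetical example, target-geometry features are queries, the two branches attend to lip-region and whole-face reference features, and a learned sigmoid weight blends them; the fusion scheme and all names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class DualAttentionRenderer(nn.Module):
    """Illustrative sketch: parallel Lip-Attention and Face-Attention texture sampling."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.lip_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Predicts a per-token blending weight between the two branches (an assumption).
        self.blend = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, geom_feats, lip_ref_feats, face_ref_feats):
        # geom_feats: (B, N, dim) queries from the target geometry (e.g., flattened feature map)
        # lip_ref_feats / face_ref_feats: (B, M, dim) texture features from reference frames
        lip_tex, _ = self.lip_attn(geom_feats, lip_ref_feats, lip_ref_feats)
        face_tex, _ = self.face_attn(geom_feats, face_ref_feats, face_ref_feats)
        w = self.blend(torch.cat([lip_tex, face_tex], dim=-1))
        return w * lip_tex + (1.0 - w) * face_tex


# Toy usage with random tensors.
renderer = DualAttentionRenderer()
out = renderer(torch.randn(2, 64, 256), torch.randn(2, 32, 256), torch.randn(2, 32, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```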