{"title":"PoseTalk:基于文字和音频的姿态控制和动作细化,用于一次性生成对话头像","authors":"Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song","doi":"arxiv-2409.02657","DOIUrl":null,"url":null,"abstract":"While previous audio-driven talking head generation (THG) methods generate\nhead poses from driving audio, the generated poses or lips cannot match the\naudio well or are not editable. In this study, we propose \\textbf{PoseTalk}, a\nTHG system that can freely generate lip-synchronized talking head videos with\nfree head poses conditioned on text prompts and audio. The core insight of our\nmethod is using head pose to connect visual, linguistic, and audio signals.\nFirst, we propose to generate poses from both audio and text prompts, where the\naudio offers short-term variations and rhythm correspondence of the head\nmovements and the text prompts describe the long-term semantics of head\nmotions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to\ngenerate motion latent from text prompts and audio cues in a pose latent space.\nSecond, we observe a loss-imbalance problem: the loss for the lip region\ncontributes less than 4\\% of the total reconstruction loss caused by both pose\nand lip, making optimization lean towards head movements rather than lip\nshapes. To address this issue, we propose a refinement-based learning strategy\nto synthesize natural talking videos using two cascaded networks, i.e.,\nCoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce\nanimated images in novel poses and the RefineNet focuses on learning finer lip\nmotions by progressively estimating lip motions from low-to-high resolutions,\nyielding improved lip-synchronization performance. Experiments demonstrate our\npose prediction strategy achieves better pose diversity and realness compared\nto text-only or audio-only, and our video generator model outperforms\nstate-of-the-art methods in synthesizing talking videos with natural head\nmotions. Project: https://junleen.github.io/projects/posetalk.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation\",\"authors\":\"Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song\",\"doi\":\"arxiv-2409.02657\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While previous audio-driven talking head generation (THG) methods generate\\nhead poses from driving audio, the generated poses or lips cannot match the\\naudio well or are not editable. In this study, we propose \\\\textbf{PoseTalk}, a\\nTHG system that can freely generate lip-synchronized talking head videos with\\nfree head poses conditioned on text prompts and audio. The core insight of our\\nmethod is using head pose to connect visual, linguistic, and audio signals.\\nFirst, we propose to generate poses from both audio and text prompts, where the\\naudio offers short-term variations and rhythm correspondence of the head\\nmovements and the text prompts describe the long-term semantics of head\\nmotions. 
To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to\\ngenerate motion latent from text prompts and audio cues in a pose latent space.\\nSecond, we observe a loss-imbalance problem: the loss for the lip region\\ncontributes less than 4\\\\% of the total reconstruction loss caused by both pose\\nand lip, making optimization lean towards head movements rather than lip\\nshapes. To address this issue, we propose a refinement-based learning strategy\\nto synthesize natural talking videos using two cascaded networks, i.e.,\\nCoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce\\nanimated images in novel poses and the RefineNet focuses on learning finer lip\\nmotions by progressively estimating lip motions from low-to-high resolutions,\\nyielding improved lip-synchronization performance. Experiments demonstrate our\\npose prediction strategy achieves better pose diversity and realness compared\\nto text-only or audio-only, and our video generator model outperforms\\nstate-of-the-art methods in synthesizing talking videos with natural head\\nmotions. Project: https://junleen.github.io/projects/posetalk.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"5 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.02657\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.02657","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
While previous audio-driven talking head generation (THG) methods generate head poses from the driving audio, the generated poses or lips either fail to match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that generates lip-synchronized talking head videos with freely controllable head poses conditioned on text prompts and audio. The core insight of our method is to use head pose to connect visual, linguistic, and audio signals.
First, we propose to generate poses from both audio and text prompts, where the audio provides short-term variations and rhythmic correspondence of the head movements, while the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model that generates motion latents from text prompts and audio cues in a pose latent space.
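The abstract does not give architectural details, so the sketch below is only a rough illustration of what text-and-audio-conditioned diffusion in a pose latent space could look like; the module names, dimensions, noise schedule, and the DDPM-style sampling loop are all assumptions rather than the authors' PLD implementation.

```python
# Hypothetical sketch: sampling a pose latent with a diffusion model
# conditioned on text and audio embeddings. Everything here (names,
# sizes, schedule) is illustrative, not the paper's actual PLD model.
import torch
import torch.nn as nn


class PoseDenoiser(nn.Module):
    """Predicts the noise added to a pose latent, given timestep and conditions."""

    def __init__(self, latent_dim=64, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1 + 2 * cond_dim, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, t, text_emb, audio_emb):
        # Concatenate noisy latent, normalized timestep, and both conditions.
        t = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([z_t, t, text_emb, audio_emb], dim=-1))


@torch.no_grad()
def sample_pose_latent(denoiser, text_emb, audio_emb, steps=1000, latent_dim=64):
    """Ancestral DDPM-style sampling in the pose latent space (simplified schedule)."""
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(text_emb.size(0), latent_dim)  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((z.size(0),), i, dtype=torch.long)
        eps = denoiser(z, t, text_emb, audio_emb)  # predicted noise
        mean = (z - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        z = mean + betas[i].sqrt() * torch.randn_like(z) if i > 0 else mean
    return z  # a separate pose decoder would map this latent to a head-pose sequence
```

In such a setup, the audio embedding would carry the short-term rhythm cues and the text embedding the long-term motion semantics described above; how the paper actually fuses the two conditions is not specified in the abstract.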
Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss induced by both pose and lip motions, biasing optimization towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy that synthesizes natural talking videos using two cascaded networks, CoarseNet and RefineNet. CoarseNet estimates coarse motions to produce animated images in novel poses, while RefineNet focuses on learning finer lip motions by progressively estimating them from low to high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realism than text-only or audio-only conditioning, and that our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
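To make the loss-imbalance point concrete, the toy sketch below shows how a plain full-frame reconstruction loss lets the small lip region contribute only a tiny share of the gradient, and how a separate coarse-to-fine lip term can supervise the lips from low to high resolution. The crop coordinates, weights, resolutions, and function names are assumptions for illustration only, not the paper's training recipe.

```python
# Toy illustration of the loss-imbalance issue and multi-resolution lip
# supervision. All constants below are illustrative assumptions.
import torch
import torch.nn.functional as F


def full_frame_loss(pred, target):
    # Plain L1 over the whole frame: the small lip region accounts for only a
    # few percent of this term, so gradients favor global (pose) alignment.
    return F.l1_loss(pred, target)


def lip_refinement_loss(pred, target, lip_box, resolutions=(64, 128, 256)):
    """Supervise the lip crop separately, progressively from low to high resolution."""
    y0, y1, x0, x1 = lip_box  # assumed lip bounding box in pixel coordinates
    pred_lip = pred[..., y0:y1, x0:x1]
    target_lip = target[..., y0:y1, x0:x1]
    loss = 0.0
    for res in resolutions:
        p = F.interpolate(pred_lip, size=(res, res), mode="bilinear", align_corners=False)
        t = F.interpolate(target_lip, size=(res, res), mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(p, t)
    return loss / len(resolutions)


# Example: combine a coarse full-frame objective with the refined lip objective.
pred = torch.rand(2, 3, 256, 256)
target = torch.rand(2, 3, 256, 256)
total = full_frame_loss(pred, target) + lip_refinement_loss(pred, target, (160, 224, 80, 176))
```

In the paper's cascaded design, CoarseNet would be driven mainly by the global reconstruction term while RefineNet handles the lip-focused, low-to-high-resolution refinement; the exact weighting between the two objectives is not stated in the abstract.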